Scene Understanding

Publish Date Title Authors PDF Code
2025-12-18 MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning Yuanchen Ju et.al. 2512.16909 null
2025-12-18 SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning Tin Stribor Sohn et.al. 2512.16461 null
2025-12-18 Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception Bangya Liu et.al. 2512.16265 null
2025-12-16 Unified Semantic Transformer for 3D Scene Understanding Sebastian Koch et.al. 2512.14364 null
2025-12-16 Consistent Instance Field for Dynamic Scene Understanding Junyi Wu et.al. 2512.14126 null
2025-12-16 Deep Learning Perspective of Scene Understanding in Autonomous Robots Afia Maham et.al. 2512.14020 null
2025-12-15 I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners Lu Ling et.al. 2512.13683 null
2025-12-15 MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion Minghui Hou et.al. 2512.13177 null
2025-12-15 DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass Vivek Alumootil et.al. 2512.13122 null
2025-12-15 SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework Anja Sheppard et.al. 2512.12945 null
2025-12-13 INDOOR-LiDAR: Bridging Simulation and Reality for Robot-Centric 360 degree Indoor LiDAR Perception – A Robot-Centric Hybrid Dataset Haichuan Li et.al. 2512.12377 null
2025-12-13 MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding Benjamin Beilharz et.al. 2512.12307 null
2025-12-13 A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection Peizheng Li et.al. 2512.12205 null
2025-12-13 Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video Daniel Adebi et.al. 2512.12165 null
2025-12-12 Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis Valentina Lilova et.al. 2512.11574 null
2025-12-12 Reconstruction as a Bridge for Event-Based Visual Question Answering Hanyue Lou et.al. 2512.11510 null
2025-12-12 VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing Emanuel Sánchez Aimar et.al. 2512.11490 null
2025-12-10 LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating Junting Chen et.al. 2512.09920 null
2025-12-09 SIP: Site in Pieces - A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding Seongyong Kim et.al. 2512.09062 null
2025-12-09 LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training Qing Xu et.al. 2512.08439 null
2025-12-09 CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning Zeyuan Chen et.al. 2512.08135 null
2025-12-08 SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery Meng Cao et.al. 2512.07733 null
2025-12-08 STRinGS: Selective Text Refinement in Gaussian Splatting Abhinav Raundhal et.al. 2512.07230 null
2025-12-08 A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning Siyang Jiang et.al. 2512.07136 null
2025-12-05 Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction Shilin Hu et.al. 2512.06179 null
2025-12-05 BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving Karthik Mohan et.al. 2512.06096 null
2025-12-05 Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision Lennart Maack et.al. 2512.05740 null
2025-12-05 Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction Ruihong Yin et.al. 2512.05597 null
2025-12-05 VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation Chinthani Sugandhika et.al. 2512.05524 null
2025-12-04 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer Xianfeng Wu et.al. 2512.05060 null
2025-12-03 C3G: Learning Compact 3D Representations with 2K Gaussians Honggyu An et.al. 2512.04021 null
2025-12-03 Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding Haoran Zhou et.al. 2512.03601 null
2025-12-03 What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models Tianchen Deng et.al. 2512.03422 null
2025-12-03 ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding Lingjun Zhao et.al. 2512.03370 null
2025-12-02 SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding Hongpei Zheng et.al. 2512.03284 null
2025-11-29 When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI Yanhui Li et.al. 2512.03087 null
2025-12-02 Layout Anything: One Transformer for Universal Room Layout Estimation Md Sohag Mia et.al. 2512.02952 null
2025-12-02 Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding Yerim Jeon et.al. 2512.02487 null
2025-12-02 HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild Valentin Bieri et.al. 2512.02450 null
2025-12-01 ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation Chenyang Gu et.al. 2512.02013 null
2025-12-01 OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic Songyan Zhang et.al. 2512.01830 null
2025-12-01 IGen: Scalable Data Generation for Robot Learning from Open-World Images Chenghao Gu et.al. 2512.01773 null
2025-12-01 SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge Yumeng He et.al. 2512.01629 null
2025-12-01 MDiff4STR: Mask Diffusion Model for Scene Text Recognition Yongkun Du et.al. 2512.01422 null
2025-12-01 VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering Zihua Liu et.al. 2512.01178 null
2025-11-30 FOM-Nav: Frontier-Object Maps for Object Goal Navigation Thomas Chabal et.al. 2512.01009 null
2025-11-30 Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting Haishan Wang et.al. 2512.00850 null
2025-11-29 Describe Anything Anywhere At Any Moment Nicolas Gorlo et.al. 2512.00565 null
2025-11-29 Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR Lixing Guo et.al. 2512.00294 null
2025-11-28 DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation Zirui Wang et.al. 2512.00226 null
2025-10-28 A Comprehensive Survey on Surgical Digital Twin Afsah Sharaf Khan et.al. 2512.00019 null
2025-11-28 DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation Hongfei Zhang et.al. 2511.23127 null
2025-11-28 Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding Anik De et.al. 2511.23071 null
2025-11-28 HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model Chen Li et.al. 2511.22961 null
2025-11-28 See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection YuEun Lee et.al. 2511.22906 null
2025-11-27 GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes Di Wang et.al. 2511.22645 null
2025-11-27 CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving Zhaohui Wang et.al. 2511.22532 null
2025-11-27 RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding Xiyan Liu et.al. 2511.22466 null
2025-11-26 SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding Tae-Min Choi et.al. 2511.21339 null
2025-11-26 Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding Yutao Tang et.al. 2511.21191 null
2025-11-26 Scaling Foundation Models for Radar Scene Understanding Pushkal Mishra et.al. 2511.21105 null
2025-11-25 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding Xiaoye Wang et.al. 2511.20646 null
2025-11-25 CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception Miguel Carvalho et.al. 2511.19820 null
2025-11-24 Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models Jonathan Lee et.al. 2511.19526 null
2025-11-24 Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving Jianhua Han et.al. 2511.19221 null
2025-11-24 AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation Omar Garib et.al. 2511.18718 null
2025-11-24 Autonomous Surface Selection For Manipulator-Based UV Disinfection In Hospitals Using Foundation Models Xueyan Oh et.al. 2511.18709 null
2025-11-23 Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span Heeseung Yun et.al. 2511.18470 null
2025-11-22 Plan-X: Instruct Video Generation via Semantic Planning Lun Huang et.al. 2511.17986 null
2025-11-21 CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation Prantik Howlader et.al. 2511.17755 null
2025-11-18 Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression Siddiqua Namrah et.al. 2511.17612 null
2025-11-21 SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation Seamie Hayes et.al. 2511.17361 null
2025-11-21 Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM Chiori Hori et.al. 2511.17335 null
2025-11-20 POMA-3D: The Point Map Way to 3D Scene Understanding Ye Mao et.al. 2511.16567 null
2025-11-20 LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs Doriand Petit et.al. 2511.16454 null
2025-11-20 Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM Gergely Dinya et.al. 2511.16282 null
2025-11-20 How Robot Dogs See the Unseeable Oliver Bimber et.al. 2511.16262 null
2025-11-20 Real-Time 3D Object Detection with Inference-Aligned Learning Chenyu Zhao et.al. 2511.16140 null
2025-11-20 Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click Raphael Ruschel et.al. 2511.15948 null
2025-11-19 WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion Sajjad Pakdamansavoji et.al. 2511.15874 null
2025-11-19 ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation Simon Boeder et.al. 2511.15396 null
2025-11-19 Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception Jiashu Yang et.al. 2511.15279 null
2025-11-18 RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems Jaro Meyer et.al. 2511.14948 null
2025-11-18 Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models Hao Zhen et.al. 2511.14120 null
2025-11-18 Error-Driven Scene Editing for 3D Grounding in Large Language Models Yue Zhang et.al. 2511.14086 null
2025-11-18 RISE: Single Static Radar-based Indoor Scene Understanding Kaichen Zhou et.al. 2511.14019 null
2025-11-17 VLMs Guided Interpretable Decision Making for Autonomous Driving Xin Hu et.al. 2511.13881 null
2025-11-17 Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation Lingfeng Zhang et.al. 2511.13269 null
2025-11-17 Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving Jiacheng Tang et.al. 2511.13079 null
2025-11-17 Visual Room 2.0: Seeing is Not Understanding for MLLMs Haokun Li et.al. 2511.12928 null
2025-11-16 RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation Xiaoshuai Hao et.al. 2511.12436 null
2025-11-14 Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy Vinit Mehta et.al. 2511.11777 null
2025-11-13 ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts Haowen Jiang et.al. 2511.11740 null
2025-11-14 AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning Jirong Zha et.al. 2511.11025 null
2025-11-13 DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation Xuexun Liu et.al. 2511.10003 null
2025-11-12 Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding Jingtian Ma et.al. 2511.08978 null
2025-11-11 RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation Hae-Won Jo et.al. 2511.08651 null
2025-11-05 Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants I. Bailo et.al. 2511.08609 null
2025-11-11 OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition Lixu Sun et.al. 2511.08133 null
2025-11-11 HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving Zhiwen Yang et.al. 2511.07925 null
2025-11-11 Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views Haida Feng et.al. 2511.07813 null
2025-11-10 Inference-Time Scaling of Diffusion Models for Infrared Data Generation Kai A. Horstmann et.al. 2511.07362 null
2025-11-10 PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving Simon Gerstenecker et.al. 2511.07292 null
2025-11-10 Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images JiaKui Hu et.al. 2511.07222 null
2025-11-10 TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding Duc Nguyen et.al. 2511.07007 null
2025-11-10 PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory Qunchao Jin et.al. 2511.06840 null
2025-11-09 Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR) Tobias Rueckert et.al. 2511.06549 null
2025-11-08 Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation Lin Li et.al. 2511.05935 null
2025-11-08 Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning Fei Yu et.al. 2511.05894 null
2025-11-07 Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots Justin Williams et.al. 2511.05642 null
2025-11-06 Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition Nicholas Babey et.al. 2511.05622 null
2025-10-30 Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution Shiyao Sang et.al. 2511.05540 null
2025-11-06 GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies Maëlic Neau et.al. 2511.04357 null
2025-11-06 CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation Yuwen Tao et.al. 2511.03992 null
2025-11-06 Simple 3D Pose Features Support Human and Machine Social Scene Understanding Wenshuo Qin et.al. 2511.03988 null
2025-11-06 Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images Sam Bahrami et.al. 2511.03970 null
2025-11-05 SILVI: Simple Interface for Labeling Video Interactions Ozan Kanbertay et.al. 2511.03819 null
2025-11-05 SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding Mauro Orazio Drago et.al. 2511.03325 null
2025-11-04 LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation Gyeom Hwangbo et.al. 2511.03001 null
2025-11-04 DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding Zixuan Liu et.al. 2511.02495 null
2025-11-04 Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization Tao Liu et.al. 2511.02489 link
2025-11-04 From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics Nicolas Schuler et.al. 2511.02427 null
2025-11-03 Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis Soham Joshi et.al. 2511.02046 null
2025-10-31 The Eigenvalues Entropy as a Classifier Evaluation Measure Doulaye Dembélé et.al. 2511.01904 null
2025-11-03 A Compact Model for Polar Multiple-Channel Field Effect Transistors: A Case Study in III-V Nitride Semiconductors Aias Asteris et.al. 2511.01699 null
2025-11-03 Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models Xiaoyu Zhan et.al. 2511.01618 null
2025-11-03 PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model Wenqi Liang et.al. 2511.01571 null
2025-11-03 Fast and Robust Remote Two-Qubit Gates on Distributed Qubits Yunan Li et.al. 2511.01418 null
2025-11-03 A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model Sampriti Soor et.al. 2511.01317 null
2025-11-03 LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping Lijie Wang et.al. 2511.01186 null
2025-11-02 GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies Ziye Wang et.al. 2511.00998 null
2025-11-01 Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach Oluwatosin Alabi et.al. 2511.00643 null
2025-11-01 CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World Yating Yu et.al. 2511.00613 null
2025-11-01 Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models Panwang Pan et.al. 2511.00503 link
2025-10-30 AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency Piyushkumar Patel et.al. 2511.00107 null
2025-10-31 Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs Sushil Samuel Dinesh et.al. 2510.27558 null
2025-10-31 NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding Wei Xu et.al. 2510.27481 null
2025-10-31 Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing Yijia Wang et.al. 2510.27335 null
2025-10-31 Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis Weiming Chen et.al. 2510.27324 null
2025-10-31 HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition Jiacheng Hong et.al. 2510.27148 null
2025-10-30 A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics Simindokht Jahangard et.al. 2510.27033 null
2025-10-30 The ANUBIS detector and its sensitivity to neutral long-lived particles ANUBIS Collaboration et.al. 2510.26932 null
2025-10-30 HEIR: Learning Graph-Based Motion Hierarchies Cheng Zheng et.al. 2510.26786 null
2025-10-30 Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios Manjunath Prasad Holenarasipura Rajiv et.al. 2510.26580 null
2025-10-30 AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM Mirko Usuelli et.al. 2510.26358 null
2025-10-30 GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? Mingyu Sung et.al. 2510.26339 null
2025-10-30 Letter of Intent: The Forward Physics Facility Luis A. Anchordoqui et.al. 2510.26260 null
2025-10-30 Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM Ali Caglayan et.al. 2510.26131 null
2025-10-29 Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks Xu Zheng et.al. 2510.25760 link
2025-10-29 More than a Moment: Towards Coherent Sequences of Audio Descriptions Eshika Khandelwal et.al. 2510.25440 null
2025-10-29 U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching Junsheng Zhou et.al. 2510.25210 null
2025-10-29 EA3D: Online Open-World 3D Object Extraction from Streaming Videos Xiaoyu Zhou et.al. 2510.25146 null
2025-10-29 Learning Spatial-Aware Manipulation Ordering Yuxiang Yan et.al. 2510.25138 null
2025-10-29 Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments Manjunath Prasad Holenarasipura Rajiv et.al. 2510.25070 null
2025-10-28 VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos Qiucheng Wu et.al. 2510.24904 null
2025-10-28 Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation Inclusion AI et.al. 2510.24821 link
2025-10-28 Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes Jonas Hein et.al. 2510.24332 null
2025-10-28 Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning Aodi Wu et.al. 2510.24152 null
2025-10-27 Optimized Loudspeaker Panning for Adaptive Sound-Field Correction and Non-stationary Listening Areas Yuancheng Luo et.al. 2510.23937 null
2025-10-27 DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning Eddison Pham et.al. 2510.23907 null
2025-10-27 Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations Yujia Zhang et.al. 2510.23607 link
2025-10-27 PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity Yuqian Yuan et.al. 2510.23603 link
2025-10-27 InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras Erich Liang et.al. 2510.23589 null
2025-10-27 Localising under the drape: proprioception in the era of distributed surgical robotic system Martin Huber et.al. 2510.23512 null
2025-10-27 UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception Karthikeyan Chandra Sekaran et.al. 2510.23478 null
2025-10-27 Evaluation of Spherical Wavelet Framework in Comparison with Ambisonics Ş. Ekmen et.al. 2510.23403 null
2025-10-27 Evaluation of Vision-LLMs in Surveillance Video Pascal Benschop et.al. 2510.23190 null
2025-10-27 Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI Aryan Mathur et.al. 2510.23148 null
2025-10-27 SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency Quanjian Song et.al. 2510.22994 null
2025-10-27 Charting the Design Space of Neural Graph Representations for Subgraph Matching Vaibhav Raj et.al. 2510.22897 null
2025-10-26 IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction Hao Li et.al. 2510.22706 link
2025-10-26 Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views Anna Deichler et.al. 2510.22672 null
2025-10-25 BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles Seyed Ahmad Hosseini Miangoleh et.al. 2510.22370 null
2025-10-25 Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments Weixian Qian et.al. 2510.22204 null
2025-10-25 MOGRAS: Human Motion with Grasping in 3D Scenes Kunal Bhosikar et.al. 2510.22199 null
2025-10-25 LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction Yuhang Gao et.al. 2510.22141 null
2025-10-25 CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding Lihuang Fang et.al. 2510.22119 null
2025-10-07 Avi: Action from Volumetric Inference Harris Song et.al. 2510.21746 null
2025-10-24 OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields Lisa Weijler et.al. 2510.21441 null
2025-10-24 ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models Pranav Saxena et.al. 2510.21069 null
2025-10-22 Uncertainty evaluation of segmentation models for Earth observation Melanie Rey et.al. 2510.19586 null
2025-10-22 Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization Juncheng Wang et.al. 2510.19330 null
2025-10-21 Event-Grounding Graph: Unified Spatio-Temporal Scene Graph from Robotic Observations Phuoc Nguyen et.al. 2510.18697 null
2025-10-21 MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning Wenhui Huang et.al. 2510.18337 null
2025-10-21 UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding Da Zhang et.al. 2510.18262 null
2025-10-21 OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion Tianyu Huang et.al. 2510.18253 null
2025-10-20 Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models Katie Luo et.al. 2510.17274 null
2025-10-19 SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes Xiongkun Linghu et.al. 2510.16714 null
2025-10-18 Structured Interfaces for Automated Reasoning with 3D Scene Graphs Aaron Ray et.al. 2510.16643 null
2025-10-11 ESCA: Contextualizing Embodied Agents via Scene-Graph Generation Jiani Huang et.al. 2510.15963 null
2025-10-07 GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments Leela Krishna et.al. 2510.14992 null
2025-10-16 QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps Matti Pekkanen et.al. 2510.14546 null
2025-10-15 Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models Jia Yun Chua et.al. 2510.13993 null
2025-10-15 SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB Muhammad Ishfaq Hussain et.al. 2510.13404 null
2025-10-15 FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding Francesco Barbato et.al. 2510.13243 null
2025-10-14 VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages Jesse Atuhurra et.al. 2510.12845 null
2025-10-14 SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding Zhiliu Yang et.al. 2510.12749 null
2025-10-13 PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation Hatem Ibrahem et.al. 2510.11992 null
2025-10-13 PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image Pradyumna Yalandur Muralidhar et.al. 2510.11649 null
2025-10-13 A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation Denis Zavadski et.al. 2510.11567 null
2025-10-13 mmWalk: Towards Multi-modal Multi-view Walking Assistance Kedi Ying et.al. 2510.11520 null
2025-10-13 REACT3D: Recovering Articulations for Interactive Physical 3D Scenes Zhao Huang et.al. 2510.11340 null
2025-10-12 Real2USD: Scene Representations in Universal Scene Description Language Christopher D. Hsu et.al. 2510.10778 null
2025-10-11 B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding Feng Xiao et.al. 2510.10194 null
2025-10-10 CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation Kaiwen Wei et.al. 2510.09266 null
2025-10-08 Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs Hanieh Shojaei Miandashti et.al. 2510.08631 null
2025-10-03 Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes Nirmal Elamon et.al. 2510.08589 null
2025-10-09 The impact of abstract and object tags on image privacy classification Darya Baranouskaya et.al. 2510.07976 null
2025-10-09 CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving Tianrui Zhang et.al. 2510.07944 link
2025-10-09 An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images Kanglin Ning et.al. 2510.07817 null
2025-10-07 Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model Danush Kumar Venkatesh et.al. 2510.07345 null
2025-10-08 Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion Jie Luo et.al. 2510.06687 null
2025-10-07 When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach Daniel Gonzálbez-Biosca et.al. 2510.05661 null
2025-10-07 HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video Hongchi Xia et.al. 2510.05560 link
2025-10-06 Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction Chi Yan et.al. 2510.04759 link
2025-10-02 LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition Rixin Zhou et.al. 2510.01651 null
2025-10-01 VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs Mohamad Al Mdfaa et.al. 2510.01483 null
2025-09-30 Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification Artur Barros et.al. 2509.26457 null
2025-09-30 Neighbor-aware informal settlement mapping with graph convolutional networks Thomas Hallopeau et.al. 2509.26171 null
2025-09-30 Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models Yuansen Liu et.al. 2509.26165 link
2025-09-30 EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models Seamie Hayes et.al. 2509.26087 null
2025-09-30 VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs Peng Liu et.al. 2509.25916 null
2025-09-29 PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos Ting-Hsuan Liao et.al. 2509.25183 null
2025-09-29 Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs Yue Zhang et.al. 2509.25139 null
2025-09-29 Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots Ermanno Bartoli et.al. 2509.24966 null
2025-09-29 CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D Mohamad Amin Mirzaei et.al. 2509.24528 null
2025-09-29 PhysiAgent: An Embodied Agent Framework in Physical World Zhihao Wang et.al. 2509.24524 null
2025-09-29 Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy Haijier Chen et.al. 2509.24385 null
2025-09-29 Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context Yongqiang Wang et.al. 2509.24275 null
2025-09-28 FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing Yi Yang et.al. 2509.23927 null
2025-09-28 Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation Hanyu Zhou et.al. 2509.23828 null
2025-09-28 From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving Yixiao Chen et.al. 2509.23641 null
2025-09-28 From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations Javed Ahmad et.al. 2509.23555 null
2025-09-26 Good Weights: Proactive, Adaptive Dead Reckoning Fusion for Continuous and Robust Visual SLAM Yanwei Du et.al. 2509.22910 null
2025-09-20 Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment Abhiroop Chatterjee et.al. 2509.22697 null
2025-09-26 UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective Jun He et.al. 2509.22228 null
2025-09-26 Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics Saurav Jha et.al. 2509.22014 null
2025-09-26 Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding Vahid Mirjalili et.al. 2509.21922 null
2025-09-25 Real-Time Indoor Object SLAM with LLM-Enhanced Priors Yang Jiao et.al. 2509.21602 null
2025-09-25 Residual Vector Quantization For Communication-Efficient Multi-Agent Perception Dereje Shenkut et.al. 2509.21464 null
2025-09-23 TUN3D: Towards Real-World Scene Understanding from Unposed Images Anton Konushin et.al. 2509.21388 link
2025-09-25 DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection Jiayi Zuo et.al. 2509.20701 null
2025-09-23 SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment Binod Singh et.al. 2509.20401 null
2025-09-24 Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning Xun Li et.al. 2509.20077 null
2025-09-24 OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving Pei Liu et.al. 2509.19973 null
2025-09-23 Category-Level Object Shape and Pose Estimation in Less Than a Millisecond Lorenzo Shaikewitz et.al. 2509.18979 null
2025-09-23 Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations Hanqing Liu et.al. 2509.18953 null
2025-09-23 Surgical Video Understanding with Label Interpolation Garam Kim et.al. 2509.18802 null
2025-09-23 MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning Omar Rayyan et.al. 2509.18757 null
2025-09-23 PIE: Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving Chengran Yuan et.al. 2509.18609 null
2025-09-22 Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration Zhitao Zeng et.al. 2509.17429 null
2025-09-20 Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding Haoyuan Li et.al. 2509.16721 null
2025-09-20 ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting Xiaoyang Yan et.al. 2509.16552 null
2025-09-19 Towards Sharper Object Boundaries in Self-Supervised Depth Estimation Aurélien Cecille et.al. 2509.15987 null
2025-09-19 RangeSAM: Leveraging Visual Foundation Models for Range-View represented LiDAR segmentation Paul Julius Kühn et.al. 2509.15886 null
2025-09-19 SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models Sen Wang et.al. 2509.15536 null
2025-09-18 Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems Yicheng Zhang et.al. 2509.15213 null
2025-09-18 SPATIALGEN: Layout-guided 3D Indoor Scene Generation Chuan Fang et.al. 2509.14981 link
2025-09-16 Semantic 3D Reconstructions with SLAM for Central Airway Obstruction Ayberk Acar et.al. 2509.13541 null
2025-09-16 ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors Romain Hardy et.al. 2509.13525 null
2025-09-16 3D Aware Region Prompted Vision Language Model An-Chieh Cheng et.al. 2509.13317 null
2025-09-16 Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving Ruibo Li et.al. 2509.13116 null
2025-09-16 Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings Abdalla Arafa et.al. 2509.12938 null
2025-09-16 MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization Yiyi Zhang et.al. 2509.12893 null
2025-09-15 RailSafeNet: Visual Scene Understanding for Tram Safety Ondřej Valach et.al. 2509.12125 link
2025-09-15 Microsurgical Instrument Segmentation for Robot-Assisted Surgery Tae Kyeong Jeong et.al. 2509.11727 null
2025-09-15 See What I Mean? Mobile Eye-Perspective Rendering for Optical See-through Head-mounted Displays Gerlinde Emsenhuber et.al. 2509.11653 null
2025-09-14 Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision Tianyao Sun et.al. 2509.11476 null
2025-09-14 DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation Yunheng Wang et.al. 2509.11197 null
2025-09-14 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment Nhut Le et.al. 2509.11097 null
2025-09-13 OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds Chongyu Wang et.al. 2509.10842 null
2025-09-12 Multimodal SAM-adapter for Semantic Segmentation Iacopo Curti et.al. 2509.10408 null
2025-09-10 SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation Michael J. Munje et.al. 2509.08757 null
2025-09-09 OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics Yinan Deng et.al. 2509.07500 null
2025-09-09 DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis Sven Kirchner et.al. 2509.07463 null
2025-09-08 Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception Ensong Liu et.al. 2509.06506 null
2025-09-07 UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning Huy Le et.al. 2509.06165 null
2025-09-06 Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation Tianhao Guo et.al. 2509.05746 null
2025-09-05 SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing Chaolei Wang et.al. 2509.05144 null
2025-09-03 Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding Hongpei Zheng et.al. 2509.03635 null
2025-09-03 Rashomon in the Streets: Explanation Ambiguity in Scene Understanding Helge Spieker et.al. 2509.03169 null
2025-09-02 Generalizable Skill Learning for Construction Robots with Crowdsourced Natural Language Instructions, Composable Skills Standardization, and Large Language Model Hongrui Yu et.al. 2509.02876 null
2025-09-02 SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images Pushpendra Dhakara et.al. 2509.02287 null
2025-09-02 Omnidirectional Spatial Modeling from Correlated Panoramas Xinshen Zhang et.al. 2509.02164 null
2025-09-02 AI-Driven Marine Robotics: Emerging Trends in Underwater Perception and Ecosystem Monitoring Scarlett Raine et.al. 2509.01878 null
2025-09-01 Articulated Object Estimation in the Wild Abdelrhman Werby et.al. 2509.01708 null
2025-09-01 Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation Maëlic Neau et.al. 2509.01209 null
2025-08-31 SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting Zhuodong Jiang et.al. 2509.00800 null
2025-08-31 OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving Pei Liu et.al. 2509.00789 null
2025-08-30 ConceptBot: Enhancing Robot’s Autonomy through Task Decomposition with Large Language Models and Knowledge Graph Alessandro Leanza et.al. 2509.00570 null
2025-08-29 Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment Jinzhou Tang et.al. 2509.00210 null
2025-08-18 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving Ali K. AlShami et.al. 2508.21080 null
2025-08-27 Hyperspectral Sensors and Autonomous Driving: Technologies, Limitations, and Opportunities Imad Ali Shah et.al. 2508.19905 null
2025-08-27 Context-Aware Risk Estimation in Home Environments: A Probabilistic Framework for Service Robots Sena Ishii et.al. 2508.19788 null
2025-08-27 LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation Yupeng Zhang et.al. 2508.19699 link
2025-08-27 Scalable Object Detection in the Car Interior With Vision Foundation Models Bálint Mészáros et.al. 2508.19651 null
2025-08-25 ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation Jianwen Tan et.al. 2508.18050 null
2025-08-25 HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation Xiping Wang et.al. 2508.17832 null
2025-08-24 Investigating Domain Gaps for Indoor 3D Object Detection Zijing Zhao et.al. 2508.17439 null
2025-08-24 An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing Zihan Liang et.al. 2508.17435 null
2025-08-24 SEER-VAR: Semantic Egocentric Environment Reasoner for Vehicle Augmented Reality Yuzhi Lai et.al. 2508.17255 null
2025-08-24 Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding Yunxiang Yang et.al. 2508.17205 null
2025-08-23 PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models Xianjing Cheng et.al. 2508.17050 null
2025-08-22 HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction Sara Rojas et.al. 2508.16433 null
2025-08-21 ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification Bochao Sun et.al. 2508.15632 null
2025-08-19 Hybrelighter: Combining Deep Anisotropic Diffusion and Scene Reconstruction for On-device Real-time Relighting in Mixed Reality Hanwen Zhao et.al. 2508.14930 null
2025-08-20 MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation Guile Wu et.al. 2508.14327 null
2025-08-19 GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting Elena Alegret et.al. 2508.14278 null
2025-08-19 ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving Xianda Guo et.al. 2508.13977 null
2025-08-19 Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference Yunxiang Yang et.al. 2508.13439 null
2025-08-17 PreSem-Surf: RGB-D Surface Reconstruction with Progressive Semantic Modeling and SG-MLP Pre-Rendering Mechanism Yuyan Ye et.al. 2508.13228 null
2025-08-17 LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving Nan Song et.al. 2508.12404 null
2025-08-17 Splat Feature Solver Butian Xiong et.al. 2508.12216 null
2025-08-16 InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes Hongyuan Liu et.al. 2508.12015 null
2025-08-14 Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset Wentao Mo et.al. 2508.11058 null
2025-08-13 Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation Xu Tang et.al. 2508.09626 null
2025-08-12 Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment Shi-Chen Zhang et.al. 2508.08811 null
2025-08-11 SAGOnline: Segment Any Gaussians Online Wentao Sun et.al. 2508.08219 null
2025-08-11 TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking Tony Danjun Wang et.al. 2508.07968 null
2025-08-11 DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models Licheng Zhang et.al. 2508.07714 null
2025-08-10 Understanding Dynamic Scenes in Ego Centric 4D Point Clouds Junsheng Huang et.al. 2508.07251 null
2025-08-05 Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images Qi Xun Yeo et.al. 2508.06546 null
2025-08-07 VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments Kaiser Hamid et.al. 2508.05852 null
2025-08-07 Point cloud segmentation for 3D Clothed Human Layering Davide Garavaso et.al. 2508.05531 null
2025-08-07 EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery Bingyu Yang et.al. 2508.05205 null
2025-08-07 A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding Mahmoud Chick Zaouali et.al. 2508.05064 null
2025-08-07 TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring Zhu Xu et.al. 2508.04943 null
2025-08-06 PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment Gustav Hanning et.al. 2508.04659 null
2025-08-05 SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision Zhaoxu Li et.al. 2508.03177 null
2025-08-05 CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation Lekang Wen et.al. 2508.03060 null
2025-08-04 FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation Cui Miao et.al. 2508.02190 null
2025-08-04 GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting Lei Yao et.al. 2508.02172 null
2025-08-03 DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion Zhigang Sun et.al. 2508.01778 null
2025-08-03 AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing Zhaonan Wang et.al. 2508.01740 null
2025-08-03 Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation Julia Hindel et.al. 2508.01713 null
2025-08-02 TEACH: Text Encoding as Curriculum Hints for Scene Text Recognition Xiahan Yang et.al. 2508.01153 null
2025-08-02 OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding Dianyi Yang et.al. 2508.01150 null
2025-08-01 3D Reconstruction via Incremental Structure From Motion Muhammad Zeeshan et.al. 2508.01019 null
2025-08-01 Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF Massoud Pourmandi et.al. 2508.00967 null
2025-07-31 Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs Bhavya Goyal et.al. 2508.00169 null
2025-07-31 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding Ting Huang et.al. 2507.23478 null
2025-07-31 FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models Yiming Yang et.al. 2507.23325 null
2025-07-31 FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning Jiajun Cao et.al. 2507.23318 null
2025-07-30 DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion Qingcheng Zhao et.al. 2507.22825 null
2025-07-30 UAVScenes: A Multi-Modal Dataset for UAVs Sijie Wang et.al. 2507.22412 null
2025-07-29 EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation Zhijiang Li et.al. 2507.21971 null
2025-07-28 GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction Tianhao Li et.al. 2507.20963 null
2025-07-28 Compositional Video Synthesis by Temporal Object-Centric Learning Adil Kaan Akan et.al. 2507.20855 null
2025-07-27 VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving Levente Tempfli et.al. 2507.20397 null
2025-07-27 Solving Scene Understanding for Autonomous Navigation in Unstructured Environments Naveen Mathews Renji et.al. 2507.20389 null
2025-07-26 FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images Hao-Yu Hou et.al. 2507.19993 null
2025-07-26 UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block Luoxi Jing et.al. 2507.19948 null
2025-07-26 RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection Xiaokai Bai et.al. 2507.19856 null
2025-07-26 Taking Language Embedded 3D Gaussian Splatting into the Wild Yuze Wang et.al. 2507.19830 null
2025-07-25 Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing Haichuan Li et.al. 2507.19691 null
2025-07-25 VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions Haoang Lu et.al. 2507.19188 null
2025-07-24 Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting Xingyu Miao et.al. 2507.18678 null
2025-07-23 From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding Anna-Maria Halacheva et.al. 2507.17585 null
2025-07-23 IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird’s-Eye View Perception Haichuan Li et.al. 2507.17445 null
2025-07-22 ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension Yizhi Hu et.al. 2507.16877 null
2025-07-22 Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge Tobias Rueckert et.al. 2507.16559 null
2025-07-22 Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach Jon Gutiérrez-Zaballa et.al. 2507.16556 null
2025-07-22 DenseSR: Image Shadow Removal as Dense Prediction Yu-Fan Lin et.al. 2507.16472 link
2025-07-21 Label tree semantic losses for rich multi-class medical image segmentation Junwen Wang et.al. 2507.15777 null
2025-07-21 Towards Holistic Surgical Scene Graph Jongmin Shin et.al. 2507.15541 null
2025-07-21 ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting Ruijie Zhu et.al. 2507.15454 link
2025-07-21 VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving Haichao Liu et.al. 2507.15266 null
2025-07-19 DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF Doriand Petit et.al. 2507.14596 null
2025-07-19 Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions Jintang Xue et.al. 2507.14555 null
2025-07-19 Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025 Sujata Gaihre et.al. 2507.14544 null
2025-07-19 CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding Zhou Chen et.al. 2507.14426 null
2025-07-18 Semantic Segmentation based Scene Understanding in Autonomous Vehicles Ehsan Rassekh et.al. 2507.14303 null
2025-07-18 Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation Masahiro Ogawa et.al. 2507.13628 null
2025-07-17 Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection Jingyao Wang et.al. 2507.13061 null
2025-07-17 Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models Yifan Xu et.al. 2507.12916 null
2025-07-17 City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning Penglei Sun et.al. 2507.12795 null
2025-07-16 Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection Sandipan Sarma et.al. 2507.12628 null
2025-07-15 Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis Maciej Szankin et.al. 2507.11730 null
2025-07-15 Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander Li Wang et.al. 2507.11079 null
2025-07-15 Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation Yanbo Wang et.al. 2507.11001 null
2025-07-14 Static or Temporal? Semantic Scene Simplification to Aid Wayfinding in Immersive Simulations of Bionic Vision Justin M. Kasowski et.al. 2507.10813 null
2025-07-14 EmbRACE-3K: Embodied Reasoning and Action in Complex Environments Mingxian Lin et.al. 2507.10548 link
2025-07-13 VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding Younggun Kim et.al. 2507.09815 null
2025-07-13 Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles Yangang Ren et.al. 2507.09537 null
2025-07-12 Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding Wencan Huang et.al. 2507.09334 null
2025-07-12 THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage Trong-Thuan Nguyen et.al. 2507.09200 null
2025-07-12 Towards Spatial Audio Understanding via Question Answering Parthasaarathy Sudarsanam et.al. 2507.09195 null
2025-07-12 On the Fragility of Multimodal Perception to Temporal Misalignment in Autonomous Driving Md Hasan Shahriar et.al. 2507.09095 null
2025-07-10 OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding JingLi Lin et.al. 2507.07984 link
2025-07-10 MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation Bangning Wei et.al. 2507.07519 null
2025-07-09 SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds Matthias Zeller et.al. 2507.06906 null
2025-07-09 Token Bottleneck: One Token to Remember Dynamics Taekyung Kim et.al. 2507.06543 link
2025-07-09 What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies Yaoqi Huang et.al. 2507.06513 null
2025-07-08 Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion Aleksandar Jevtić et.al. 2507.06230 link
2025-07-08 SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning Xin Hu et.al. 2507.05798 null
2025-07-07 All in One: Visual-Description-Guided Unified Point Cloud Segmentation Zongyan Han et.al. 2507.05211 null
2025-07-07 MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding Jing Liang et.al. 2507.04686 null
2025-07-05 Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation Ziyu Zhu et.al. 2507.04047 null
2025-07-05 Habitat Classification from Ground-Level Imagery Using Deep Neural Networks Hongrui Shi et.al. 2507.04017 null
2025-07-04 Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds Matthias Zeller et.al. 2507.03463 null
2025-07-03 LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans Zhening Huang et.al. 2507.02861 link
2025-07-03 LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion Fangfu Liu et.al. 2507.02813 link
2025-07-03 SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment Qi Xu et.al. 2507.02705 link
2025-07-04 Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach Elena Ryumina et.al. 2507.02205 link
2025-07-02 ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning Xiao Wang et.al. 2507.02200 null
2025-07-02 ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving Kai Chen et.al. 2507.01735 null
2025-07-01 GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond Anna-Maria Halacheva et.al. 2507.00886 null
2025-07-01 BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving Zeming Chen et.al. 2507.00707 null
2025-06-29 IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering Parker Liu et.al. 2506.23329 link
2025-07-01 SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting Yiming Huang et.al. 2506.23309 null
2025-06-29 Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation Zhenhua Ning et.al. 2506.23120 null
2025-06-28 Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding Xingyilang Yin et.al. 2506.22817 null
2025-06-28 VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding Minchao Jiang et.al. 2506.22799 null
2025-06-26 CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery Felix Holm et.al. 2506.21813 null
2025-06-24 FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models Shiyi Wang et.al. 2506.21627 null
2025-06-26 CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations Julian Lorenz et.al. 2506.21357 null
2025-06-27 ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation Xiwei Xuan et.al. 2506.21233 null
2025-06-25 IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals Markus Gross et.al. 2506.20671 null
2025-06-25 Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios Wenbin Gan et.al. 2506.20531 null
2025-06-25 DreamAnywhere: Object-Centric Panoramic 3D Scene Generation Edoardo Alberto Dominici et.al. 2506.20367 null
2025-06-24 HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions Mrunmai Vivek Phatak et.al. 2506.19639 null
2025-06-24 Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects Federico Tavella et.al. 2506.19579 null
2025-06-24 Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning Pengfei Hao et.al. 2506.19469 null
2025-06-24 Segment Any 3D-Part in a Scene from a Sentence Hongyu Wu et.al. 2506.19331 null
2025-06-24 Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding Runwei Guan et.al. 2506.19288 null
2025-06-24 Object-aware Sound Source Localization via Audio-Visual Scene Understanding Sung Jin Um et.al. 2506.18557 null
2025-06-23 DIP: Unsupervised Dense In-Context Post-training of Visual Representations Sophia Sirko-Galouchenko et.al. 2506.18463 link
2025-06-22 TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving Wenzhuo Liu et.al. 2506.18084 null
2025-06-22 Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis Mohamed Benkedadra et.al. 2506.17910 null
2025-06-21 Optimization-Free Patch Attack on Stereo Depth Estimation Hangcheng Liu et.al. 2506.17632 null
2025-06-21 Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations Zhihao Yuan et.al. 2506.17545 null
2025-06-17 Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment Weiming Zhang et.al. 2506.14271 null
2025-06-17 Unified Representation Space for 3D Visual Grounding Yinuo Zheng et.al. 2506.14238 null
2025-06-17 SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability Juho Bai et.al. 2506.14144 null
2025-06-17 Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems Sanjeda Akter et.al. 2506.14096 null
2025-06-16 FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding Chenlu Zhan et.al. 2506.13629 null
2025-06-16 A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects Guohuan Xie et.al. 2506.13552 null
2025-06-14 A Spatial Relationship Aware Dataset for Robotics Peng Wang et.al. 2506.12525 link
2025-06-14 Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding Youze Wang et.al. 2506.12336 null
2025-06-12 GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset Sahar Nasirihaghighi et.al. 2506.11356 null
2025-06-12 SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis Weiliang Chen et.al. 2506.10981 null
2025-06-13 SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields Qijing Li et.al. 2506.09565 null
2025-06-11 ODG: Occupancy Prediction Using Dual Gaussians Yunxiao Shi et.al. 2506.09417 null
2025-06-10 SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting Mengjiao Ma et.al. 2506.08710 link
2025-06-10 PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly Liang Ma et.al. 2506.08708 null
2025-06-10 From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge Agnese Taluzzi et.al. 2506.08553 null
2025-06-10 Robust Visual Localization via Semantic-Guided Multi-Scale Transformer Zhongtao Tian et.al. 2506.08526 null
2025-06-09 Open World Scene Graph Generation using Vision Language Models Amartya Dutta et.al. 2506.08189 link
2025-06-09 Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods Beining Xu et.al. 2506.07779 null
2025-06-09 OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting Jens Piekenbrinck et.al. 2506.07697 null
2025-06-09 Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent Shoon Kit Lim et.al. 2506.07509 link
2025-06-09 SpatialLM: Training Large Language Models for Structured Indoor Modeling Yongsen Mao et.al. 2506.07491 link
2025-06-08 BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction Yunxiao Shi et.al. 2506.07002 null
2025-06-07 IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion Hongming Chen et.al. 2506.06804 null
2025-06-07 PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments Minghao Zou et.al. 2506.06631 null
2025-06-06 Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments Chad R Samuelson et.al. 2506.06562 null
2025-06-06 Enhancing Situational Awareness in Underwater Robotics with Multi-modal Spatial Perception Pushyami Kaveti et.al. 2506.06476 null
2025-06-06 Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study Leon Mayer et.al. 2506.06232 null
2025-06-06 STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving Christian Fruhwirth-Reisinger et.al. 2506.06218 null
2025-06-06 Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness Steven Landgraf et.al. 2506.05917 null
2025-06-06 HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios Daming Wang et.al. 2506.05883 null
2025-06-06 Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models Hugues Thomas et.al. 2506.05689 null
2025-06-06 Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection Shanmukha Vellamcheti et.al. 2506.05651 null
2025-06-05 SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning Fanqi Kong et.al. 2506.05425 null
2025-06-06 Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs Haoyuan Li et.al. 2506.05318 null
2025-06-06 ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation Daniel Rho et.al. 2506.05317 null
2025-06-04 OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis Junting Chen et.al. 2506.04217 link
2025-06-04 BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation Jialei Chen et.al. 2506.03675 null
2025-06-04 Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI Wing Man Casca Kwok et.al. 2506.03607 null
2025-06-03 Trajectory Prediction Meets Large Language Models: A Survey Yi Xu et.al. 2506.03408 link
2025-06-04 Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments Di Wen et.al. 2506.02845 link
2025-06-03 PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis Mijeong Kim et.al. 2506.02794 null
2025-06-03 Large-scale Self-supervised Video Foundation Model for Intelligent Surgery Shu Yang et.al. 2506.02692 null
2025-06-03 Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024 Patrick Pfreundschuh et.al. 2506.02676 null
2025-06-03 Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models Safaa Abdullahi Moallim Mohamud et.al. 2506.02615 null
2025-06-03 Sign Language: Towards Sign Understanding for Robot Autonomy Ayush Agrawal et.al. 2506.02556 null
2025-06-02 MLLMs Need 3D-Aware Representation Supervision for Scene Understanding Xiaohu Huang et.al. 2506.01946 null
2025-06-02 SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes Yuji Wang et.al. 2506.01558 null
2025-06-02 FDSG: Forecasting Dynamic Scene Graphs Yi Yang et.al. 2506.01487 null
2025-06-02 Learning Sparsity for Effective and Efficient Music Performance Question Answering Xingjian Diao et.al. 2506.01319 null
2025-05-30 Tackling View-Dependent Semantics in 3D Language Gaussian Splatting Jiazhong Cen et.al. 2505.24746 null
2025-05-30 Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors Duo Zheng et.al. 2505.24625 link
2025-05-30 EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding Ege Özsoy et.al. 2505.24287 null
2025-05-29 ConversAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners Jad Bendarkawi et.al. 2505.24000 null
2025-05-29 A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation Shuzhou Sun et.al. 2505.23451 null
2025-05-29 SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model Bowen Chen et.al. 2505.23010 null
2025-05-28 On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation Liyao Tang et.al. 2505.22444 null
2025-05-28 LiDAR Based Semantic Perception for Forklifts in Outdoor Environments Benjamin Serfling et.al. 2505.22258 null
2025-05-28 3D Question Answering via only 2D Vision-Language Models Fengyun Wang et.al. 2505.22143 null
2025-05-29 DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation Tianjun Gu et.al. 2505.21969 null
2025-05-28 Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs Insu Lee et.al. 2505.21955 null
2025-05-27 A Graph Completion Method that Jointly Predicts Geometry and Topology Enables Effective Molecule Assembly Rohan V. Koodli et.al. 2505.21833 null
2025-05-29 Compositional Scene Understanding through Inverse Generative Modeling Yanbo Wang et.al. 2505.21780 null
2025-05-30 Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks Keanu Nichols et.al. 2505.21649 null
2025-05-27 Assured Autonomy with Neuro-Symbolic Perception R. Spencer Hallyburton et.al. 2505.21322 null
2025-05-27 Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning Lintao Xu et.al. 2505.21231 null
2025-05-27 Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts Yue Zhang et.al. 2505.21079 null
2025-05-27 OccLE: Label-Efficient 3D Semantic Occupancy Prediction Naiyu Fang et.al. 2505.20617 null
2025-05-27 OmniIndoor3D: Comprehensive Indoor 3D Reconstruction Xiaobao Wei et.al. 2505.20610 null
2025-05-26 From Data to Modeling: Fully Open-vocabulary Scene Graph Generation Zuyao Chen et.al. 2505.20106 null
2025-05-26 DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization Jianxin Huang et.al. 2505.20041 null
2025-05-26 Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement Afrah Shaahid et.al. 2505.19895 null
2025-05-26 LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study Dongil Yang et.al. 2505.19510 link
2025-05-25 FHGS: Feature-Homogenized Gaussian Splatting Q. G. Duan et.al. 2505.19154 null
2025-05-25 Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection Md. Mithun Hossain et.al. 2505.19010 null
2025-05-24 Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding Guofeng Mei et.al. 2505.18819 null
2025-05-24 Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps Sicheng Feng et.al. 2505.18675 link
2025-05-23 SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain Jiawei Zhou et.al. 2505.17727 null
2025-05-23 From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation Mahmoud Chick Zaouali et.al. 2505.17402 null
2025-05-22 Assessing the generalization performance of SAM for ureteroscopy scene understanding Martin Villagrana et.al. 2505.17210 null
2025-05-22 CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation Haihong Hao et.al. 2505.16663 link
2025-05-21 SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval Nikolaos Chaidos et.al. 2505.15867 link
2025-05-21 HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning Xiaodong Mei et.al. 2505.15703 null
2025-05-21 Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Kaiyuan Chen et.al. 2505.15517 link
2025-05-21 RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation Naman Patel et.al. 2505.15373 null
2025-05-21 DC-Scene: Data-Centric Learning for 3D Scene Understanding Ting Huang et.al. 2505.15232 link
2025-05-19 ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling Ege Özsoy et.al. 2505.12890 null
2025-05-19 AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning Kai Zhang et.al. 2505.12782 null
2025-05-19 Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps Ziqi Wen et.al. 2505.12660 null
2025-05-18 LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding Hanyu Zhou et.al. 2505.12253 null
2025-05-18 SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving Muleilan Pei et.al. 2505.12246 null
2025-05-18 Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind Qingmei Li et.al. 2505.12207 link
2025-05-18 Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding Xuefei Sun et.al. 2505.12194 null
2025-05-17 TinyRS-R1: Compact Multimodal Language Model for Remote Sensing Aybora Koksal et.al. 2505.12099 null
2025-05-15 StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation Daniel A. P. Oliveira et.al. 2505.10292 link
2025-05-15 APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds Yuan Gao et.al. 2505.09971 link
2025-05-14 DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection Jianlin Sun et.al. 2505.09168 link
2025-05-14 Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning Dayong Liang et.al. 2505.09118 null
2025-05-13 Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving Zongchuang Zhao et.al. 2505.08725 link
2025-05-12 Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions Yi Zhang et.al. 2505.07611 null
2025-05-11 Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding Chih-Chung Hsu et.al. 2505.06991 null
2025-05-11 Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation Seokjun Kwon et.al. 2505.06951 null
2025-05-09 Camera Control at the Edge with Language Models for Scene Understanding Alexiy Buynitsky et.al. 2505.06402 null
2025-05-09 Camera-Only Bird’s Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles Anupkumar Bochare et.al. 2505.06113 null
2025-05-08 Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization Sooyoung Park et.al. 2505.05343 link
2025-05-08 PADriver: Towards Personalized Autonomous Driving Genghua Kou et.al. 2505.05240 null
2025-05-08 Does CLIP perceive art the same way we do? Andrea Asperti et.al. 2505.05229 null
2025-05-07 GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes Feng Xiao et.al. 2505.04659 link
2025-05-07 RAFT: Robust Augmentation of FeaTures for Image Segmentation Edward Humes et.al. 2505.04529 null
2025-05-03 Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models Gracjan Góral et.al. 2505.03821 null
2025-05-06 MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation Mingcheng Li et.al. 2505.02648 null
2025-05-04 Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation Volodymyr Havrylov et.al. 2505.02075 link
2025-05-04 Segment Any RGB-Thermal Model with Language-aided Distillation Dong Xing et.al. 2505.01950 null
2025-05-02 Embracing Diffraction: A Paradigm Shift in Wireless Sensing and Communication Anurag Pallaprolu et.al. 2505.01625 null
2025-04-30 V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving Jannik Lübberstedt et.al. 2505.00156 null
2025-04-30 LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics Marc Glocker et.al. 2504.21716 link
2025-04-30 ImaginateAR: AI-Assisted In-Situ Authoring in Augmented Reality Jaewook Lee et.al. 2504.21360 null
2025-04-28 Category-Level and Open-Set Object Pose Estimation for Robotics Peter Hönig et.al. 2504.19572 null
2025-04-28 Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding Yan Wang et.al. 2504.19500 null
2025-04-27 Beyond Physical Reach: Comparing Head- and Cane-Mounted Cameras for Last-Mile Navigation by Blind Users Apurv Varshney et.al. 2504.19345 null
2025-04-27 OpenFusion++: An Open-vocabulary Real-time Scene Understanding System Xiaofeng Jin et.al. 2504.19266 null
2025-04-27 CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis Alexander Baumann et.al. 2504.19223 null
2025-04-27 Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving Mi Zheng et.al. 2504.19183 null
2025-04-23 TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance Meng Chu et.al. 2504.16505 null
2025-04-21 Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends Mohammad Abu Tami et.al. 2504.16134 null
2025-04-22 Vision language models are unreliable at trivial spatial cognition Sangeet Khemlani et.al. 2504.16061 null
2025-04-20 Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension Lin Li et.al. 2504.14642 null
2025-04-20 RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots Zhang Zhang et.al. 2504.14604 null
2025-04-20 Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding Tong Zeng et.al. 2504.14526 link
2025-04-20 Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation Guoyi Zhang et.al. 2504.14481 null
2025-04-18 HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering Alexander Rusnak et.al. 2504.13590 null
2025-04-18 Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding Yuchen Rao et.al. 2504.13580 link
2025-04-18 Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation Cheng Yuan et.al. 2504.13440 null
2025-04-17 Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs Shaohui Dai et.al. 2504.13153 link
2025-04-17 Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks Nassim Belmecheri et.al. 2504.12817 null
2025-04-17 Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation Changsheng Lv et.al. 2504.12606 null
2025-04-16 Generalized Visual Relation Detection with Diffusion Models Kaifeng Gao et.al. 2504.12100 null
2025-04-17 DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency Mengshi Qi et.al. 2504.12080 link
2025-04-16 CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting Wei Sun et.al. 2504.11893 null
2025-04-15 Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning Juan Garcia Giraldo et.al. 2504.11268 null
2025-04-14 Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization Darryl Hannan et.al. 2504.10727 null
2025-04-14 SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding Marc Gutiérrez-Pérez et.al. 2504.10106 link
2025-04-12 Text To 3D Object Generation For Scalable Room Assembly Sonia Laguna et.al. 2504.09328 null
2025-04-11 FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment Sebastián Barbas Laina et.al. 2504.08603 null
2025-04-11 FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents Xin Tan et.al. 2504.08581 null
2025-04-11 DSM: Building A Diverse Semantic Map for 3D Visual Grounding Qinghongbing Xie et.al. 2504.08307 null
2025-04-10 SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos Joshua Li et.al. 2504.07867 null
2025-04-10 DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction Xu Zhao et.al. 2504.07524 null
2025-04-09 RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration Omar Alama et.al. 2504.06994 null
2025-04-09 Audio-visual Event Localization on Portrait Mode Short Videos Wuyang Liu et.al. 2504.06884 null
2025-04-09 MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking Chang Nie et.al. 2504.06863 null
2025-04-09 Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding Pedro Hermosilla et.al. 2504.06719 link
2025-04-09 Domain-Conditioned Scene Graphs for State-Grounded Task Planning Jonas Herzog et.al. 2504.06661 null
2025-04-09 Attributes-aware Visual Emotion Representation Learning Rahul Singh Maharjan et.al. 2504.06578 null
2025-04-08 CamContextI2V: Context-aware Controllable Video Generation Luis Denninger et.al. 2504.06022 link
2025-04-08 AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems Zhuoli Zhuang et.al. 2504.05950 null
2025-04-08 PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario Sriram Mandalika et.al. 2504.05908 null
2025-04-08 InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation Jiangsan Zhao et.al. 2504.05751 null
2025-04-07 RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model Congcong Wen et.al. 2504.04988 null
2025-04-07 Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding Zahir Alsulaimawi et.al. 2504.04772 null
2025-04-07 DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation Bo-Wen Yin et.al. 2504.04701 link
2025-04-06 Planning Safety Trajectories with Dual-Phase, Physics-Informed, and Transportation Knowledge-Driven Large Language Models Rui Gan et.al. 2504.04562 null
2025-04-04 3D Scene Understanding Through Local Random Access Sequence Modeling Wanhee Lee et.al. 2504.03875 link
2025-04-07 NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving Kexin Tian et.al. 2504.03164 null
2025-04-03 F-ViTA: Foundation Model Guided Visible to Thermal Translation Jay N. Paranjape et.al. 2504.02801 link
2025-04-03 Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision Xiaofeng Han et.al. 2504.02477 link
2025-04-02 Scene-Centric Unsupervised Panoptic Segmentation Oliver Hahn et.al. 2504.01955 link
2025-04-02 Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness Haochen Wang et.al. 2504.01901 null
2025-04-02 CoMatcher: Multi-View Collaborative Feature Matching Jintao Zhang et.al. 2504.01872 null
2025-04-02 TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication Petr Vanc et.al. 2504.01708 null
2025-04-02 Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation Junjie Chen et.al. 2504.01668 null
2025-04-01 WikiVideo: Article Generation from Multiple Videos Alexander Martin et.al. 2504.00939 link
2025-04-01 Zero-Shot 4D Lidar Panoptic Segmentation Yushan Zhang et.al. 2504.00848 null
2025-04-01 PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks Abdelrahman Elskhawy et.al. 2504.00844 null
2025-04-01 Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights Yuchen Liu et.al. 2504.00839 null
2025-03-30 PhysPose: Refining 6D Object Poses with Physical Constraints Martin Malenický et.al. 2503.23587 null
2025-03-30 Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model Jannik Endres et.al. 2503.23502 link
2025-03-29 Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery Boyi Ma et.al. 2503.23130 null
2025-03-29 Evaluating Compositional Scene Understanding in Multimodal Generative Models Shuhao Fu et.al. 2503.23125 link
2025-03-29 Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments Yifan Xu et.al. 2503.23105 null
2025-03-29 Empowering Large Language Models with 3D Situation Awareness Zhihao Yuan et.al. 2503.23024 null
2025-03-28 Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users Antonia Karamolegkou et.al. 2503.22610 null
2025-03-28 Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration Heiko Renz et.al. 2503.22588 null
2025-03-28 NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving Fuhao Li et.al. 2503.22436 null
2025-03-28 Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision Rulin Zhou et.al. 2503.22394 null
2025-03-28 A Dataset for Semantic Segmentation in the Presence of Unknowns Zakaria Laskar et.al. 2503.22309 null
2025-03-28 Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction Seokha Moon et.al. 2503.22087 null
2025-03-27 Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting Anand Bhattad et.al. 2503.21770 null
2025-03-27 uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images Jonathan Lee et.al. 2503.21562 link
2025-03-27 Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving Lucas Nunes et.al. 2503.21449 link
2025-03-26 DINeMo: Learning Neural Mesh Models with no 3D Annotations Weijie Guo et.al. 2503.20220 null
2025-03-25 The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs Jonathan Sauder et.al. 2503.20000 null
2025-03-25 SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining Xiang Xu et.al. 2503.19912 link
2025-03-25 OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations Christina Kassab et.al. 2503.19764 null
2025-03-26 COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting Jiaxin Zhang et.al. 2503.19443 link
2025-03-25 Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting Zhiying Yan et.al. 2503.19332 null
2025-03-25 BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation Hanshuo Qiu et.al. 2503.19303 null
2025-03-24 Efficient and Accurate Scene Text Recognition with Cascaded-Transformers Savas Ozkan et.al. 2503.18883 null
2025-03-24 Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition Yifei Zhang et.al. 2503.18746 null
2025-03-24 Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving Hongkuan Zhou et.al. 2503.18730 null
2025-03-23 MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation Jiaxin Huang et.al. 2503.18135 null
2025-03-23 PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding Hongjia Zhai et.al. 2503.18107 null
2025-03-23 PanopticSplatting: End-to-End Panoptic Gaussian Splatting Yuxuan Xie et.al. 2503.18073 null
2025-03-23 PolarFree: Polarization-based Reflection-free Imaging Mingde Yao et.al. 2503.18055 null
2025-03-23 SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining Yue Li et.al. 2503.18052 null
2025-03-23 Geometric Constrained Non-Line-of-Sight Imaging Xueying Liu et.al. 2503.17992 null
2025-03-22 A Causal Adjustment Module for Debiasing Scene Graph Generation Li Liu et.al. 2503.17862 null
2025-03-21 Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation Giacomo Savazzi et.al. 2503.17224 null
2025-03-21 ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail Chandan Yeshwanth et.al. 2503.17044 null
2025-03-21 Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision Maoji Zheng et.al. 2503.16811 null
2025-03-21 OpenCity3D: What do Vision-Language Models know about Urban Environments? Valentin Bieri et.al. 2503.16776 null
2025-03-20 Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding Jinlong Li et.al. 2503.16707 null
2025-03-20 ContactFusion: Stochastic Poisson Surface Maps from Visual and Contact Sensing Aditya Kamireddypalli et.al. 2503.16592 null
2025-03-20 From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction Ayberk Acar et.al. 2503.16263 null
2025-03-20 Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation Andrea Maracani et.al. 2503.16184 null
2025-03-20 What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? Xuanming Cui et.al. 2503.15846 null
2025-03-19 A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition Ritabrata Chakraborty et.al. 2503.15639 null
2025-03-19 Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene Shengqiong Wu et.al. 2503.15019 null
2025-03-19 Universal Scene Graph Generation Shengqiong Wu et.al. 2503.15005 null
2025-03-19 SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments Yinqi Chen et.al. 2503.14837 null
2025-03-20 These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models Parker Ewen et.al. 2503.14665 null
2025-03-17 Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey Liewen Liao et.al. 2503.14537 null
2025-03-18 DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation Mu Chen et.al. 2503.13957 link
2025-03-18 Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation Sayak Nag et.al. 2503.13947 null
2025-03-18 ChatBEV: A Visual Language Model that Understands BEV Maps Qingyao Xu et.al. 2503.13938 null
2025-03-18 PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds Barza Nisar et.al. 2503.13914 null
2025-03-17 Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training Corentin Sautier et.al. 2503.13203 null
2025-03-17 Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation Henghui Du et.al. 2503.13068 null
2025-03-17 InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving Ruiqi Song et.al. 2503.13047 null
2025-03-17 HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding Jiahe Zhao et.al. 2503.12955 null
2025-03-17 NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models Sung-Yeon Park et.al. 2503.12772 null
2025-03-16 Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding Imran Kabir et.al. 2503.12663 null
2025-03-16 Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset Yutao Hu et.al. 2503.12385 null
2025-03-15 TACO: Taming Diffusion for in-the-wild Video Amodal Completion Ruijie Lu et.al. 2503.12049 null
2025-03-14 Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling Christopher Xie et.al. 2503.11806 null
2025-03-14 EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting Di Li et.al. 2503.11345 null
2025-03-14 Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset Yibing Weng et.al. 2503.11342 null
2025-03-13 Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations Piyush Gupta et.al. 2503.10941 null
2025-03-11 MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation Anzhe Cheng et.al. 2503.10686 null
2025-03-13 TARS: Traffic-Aware Radar Scene Flow Estimation Jialong Wu et.al. 2503.10210 null
2025-03-13 TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness Mu Chen et.al. 2503.09941 null
2025-03-12 Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval Stefan Sylvius Wagner et.al. 2503.09867 null
2025-03-11 Language-Depth Navigated Thermal and Visible Image Fusion Jinchang Zhang et.al. 2503.08676 null
2025-03-11 Generating Robot Constitutions & Benchmarks for Semantic Safety Pierre Sermanet et.al. 2503.08663 null
2025-03-11 Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding Tim Steinke et.al. 2503.08474 null
2025-03-11 TrackOcc: Camera-based 4D Panoptic Occupancy Tracking Zhuoguang Chen et.al. 2503.08471 null
2025-03-11 Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking Xucheng Guo et.al. 2503.08370 null
2025-03-11 DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos Lorenzo Mur-Labadia et.al. 2503.08344 null
2025-03-11 Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving Runwei Guan et.al. 2503.08336 null
2025-03-11 General-Purpose Aerial Intelligent Agents Empowered by Large Language Models Ji Zhao et.al. 2503.08302 null
2025-03-10 FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction Dennis Rotondi et.al. 2503.07909 null
2025-03-10 Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction Zongzheng Zhang et.al. 2503.07485 null
2025-03-10 CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting Haicheng Liao et.al. 2503.07234 null
2025-03-10 A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning Xin Wen et.al. 2503.06960 null
2025-03-10 LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs Hanyu Zhou et.al. 2503.06934 null
2025-03-08 SplatTalk: 3D VQA with Gaussian Splatting Anh Thai et.al. 2503.06271 null
2025-03-08 Segment Anything, Even Occluded Wei-En Tai et.al. 2503.06261 null
2025-03-08 VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion Meng Wang et.al. 2503.06219 null
2025-03-08 Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images YingLiang Ma et.al. 2503.06190 null
2025-03-08 Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction Kai Li et.al. 2503.06161 null
2025-03-08 Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity Xiaohao Xu et.al. 2503.06014 null
2025-03-07 HexPlane Representation for 3D Semantic Scene Understanding Zeren Chen et.al. 2503.05127 null
2025-03-06 Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning Victor Sebastian Martinez Pozos et.al. 2503.04900 null
2025-03-06 EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images Rohit Menon et.al. 2503.04441 null
2025-03-06 An Egocentric Vision-Language Model based Portable Real-time Smart Assistant Yifei Huang et.al. 2503.04250 null
2025-03-06 H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision Yunxiao Shi et.al. 2503.04059 null
2025-03-06 GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding Xihan Wang et.al. 2503.04034 null
2025-03-05 SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection Devanish N. Kamtam et.al. 2503.03942 null
2025-03-05 Vision-Language Models Struggle to Align Entities across Modalities Iñigo Alonso et.al. 2503.03854 null
2025-03-05 Improving 6D Object Pose Estimation of metallic Household and Industry Objects Thomas Pöllabauer et.al. 2503.03655 null
2025-03-04 MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments Ege Özsoy et.al. 2503.02579 link
2025-03-04 Label-Efficient LiDAR Panoptic Segmentation Ahmet Selim Çanakçı et.al. 2503.02372 null
2025-03-04 SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images Gargi Panda et.al. 2503.02270 null
2025-03-03 vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding Ali Tourani et.al. 2503.01783 link
2025-03-03 OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding Dianyi Yang et.al. 2503.01646 null
2025-03-03 Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond Guanyao Wu et.al. 2503.01210 link
2025-03-03 Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations Junsong Zhang et.al. 2503.01114 null
2025-03-01 Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing Yanjun Li et.al. 2503.00548 null
2025-03-01 Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning Hanxun Yu et.al. 2503.00513 link
2025-03-04 Floorplan-SLAM: A Real-Time, High-Accuracy, and Long-Term Multi-Session Point-Plane SLAM for Efficient Floorplan Reconstruction Haolin Wang et.al. 2503.00397 null
2025-02-28 Vibrotactile information coding strategies for a body-worn vest to aid robot-human collaboration Adrian Vecina Tercero et.al. 2502.21056 null
2025-02-27 Towards Statistical Factuality Guarantee for Large Vision-Language Models Zhuohang Li et.al. 2502.20560 null
2025-02-26 Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator Xiankang He et.al. 2502.19204 link
2025-02-25 VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion Pei Liu et.al. 2502.18042 null
2025-02-24 AAD-LLM: Neural Attention-Driven Auditory Scene Understanding Xilin Jiang et.al. 2502.16794 link
2025-02-28 Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model Yaxuan Huang et.al. 2502.16779 link
2025-02-23 Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration Kim Jun-Seong et.al. 2502.16652 null
2025-02-21 Weakly Supervised Video Scene Graph Generation via Natural Language Supervision Kibum Kim et.al. 2502.15370 link
2025-02-21 DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation Luzhou Ge et.al. 2502.15309 link
2025-02-21 Hierarchical Context Transformer for Multi-level Semantic Scene Understanding Luoying Hao et.al. 2502.15184 link
2025-02-20 CrossOver: 3D Scene Cross-Modal Alignment Sayan Deb Sarkar et.al. 2502.15011 link
2025-02-20 Hier-SLAM++: Neuro-Symbolic Semantic SLAM with a Hierarchically Categorical Gaussian Splatting Boying Li et.al. 2502.14931 null
2025-02-19 Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning Rui Zhao et.al. 2502.14917 null
2025-02-16 Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review Ufaq Khan et.al. 2502.14886 null
2025-02-21 AVD2: Accident Video Diffusion for Accident Video Description Cheng Li et.al. 2502.14801 null
2025-02-18 Spiking Vision Transformer with Saccadic Attention Shuai Wang et.al. 2502.12677 null
2025-02-16 NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM Zihan Wang et.al. 2502.11142 link
2025-02-15 Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy Mingyang Zhao et.al. 2502.10704 link
2025-02-14 Leveraging V2X for Collaborative HD Maps Construction Using Scene Graph Generation Gamal Elghazaly et.al. 2502.10127 null
2025-02-13 FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation Bin Yang et.al. 2502.09274 null
2025-02-13 Billet Number Recognition Based on Test-Time Adaptation Yuan Wei et.al. 2502.09026 null
2025-02-13 EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition Xiao Wang et.al. 2502.09020 link
2025-02-13 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning Guoqin Tang et.al. 2502.08903 null
2025-02-10 Fully Exploiting Vision Foundation Model’s Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing Sicen Guo et.al. 2502.06219 null
2025-02-08 Content-based Video Retrieval in Traffic Videos using Latent Dirichlet Allocation Topic Model Mohammad Kianpisheh et.al. 2502.05457 null
2025-02-06 sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views Eyvaz Najafli et.al. 2502.04318 null
2025-02-06 Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation Lin Li et.al. 2502.03856 null
2025-02-05 EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality Junlong Chen et.al. 2502.03564 null
2025-02-04 Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation Junha Lee et.al. 2502.02548 null
2025-02-04 Event-aided Semantic Scene Completion Shangwei Guo et.al. 2502.02334 link
2025-02-03 AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis Basit Alawode et.al. 2502.01785 null
2025-01-30 Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation Yuelei Li et.al. 2501.18733 null
2025-01-30 Efficient Interactive 3D Multi-Object Removal Jingcheng Ni et.al. 2501.17636 null
2025-02-04 Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding Akash Kumar et.al. 2501.17053 null
2025-01-29 PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding Wei Chow et.al. 2501.16411 link
2025-01-26 Ocean-OCR: Towards General OCR Application via a Vision-Language Model Song Chen et.al. 2501.15558 link
2025-01-26 Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics Ali Tourani et.al. 2501.15505 link
2025-01-24 HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation Xin Zhou et.al. 2501.14729 link
2025-01-24 Scene Understanding Enabled Semantic Communication with Open Channel Coding Zhe Xiang et.al. 2501.14520 null
2025-01-23 GeomGS: LiDAR-Guided Geometry-Aware Gaussian Splatting for Robot Localization Jaewon Lee et.al. 2501.13417 null
2025-01-22 Neural Radiance Fields for the Real World: A Survey Wenhui Xiao et.al. 2501.13104 null
2025-01-22 PSGSL: A Probabilistic Framework Integrating Semantic Scene Understanding and Gas Sensing for Gas Source Localization Pepe Ojeda et.al. 2501.12812 null
2025-01-20 Dynamic Scene Understanding from Vision-Language Representations Shahaf Pruss et.al. 2501.11653 null
2025-01-20 EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery Guankun Wang et.al. 2501.11347 link
2025-01-20 A Survey of World Models for Autonomous Driving Tuo Feng et.al. 2501.11260 null
2025-01-17 A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features Enes Karanfil et.al. 2501.10144 null
2025-01-16 CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation Alex Berian et.al. 2501.09838 link
2025-01-16 YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks Saptarashmi Bandyopadhyay et.al. 2501.09355 null
2025-01-15 Embodied Scene Understanding for Vision Language Models via MetaVQA Weizhen Wang et.al. 2501.09167 null
2025-01-15 GOTLoc: General Outdoor Text-based Localization Using Scene Graph Retrieval with OpenStreetMap Donghwi Jung et.al. 2501.08575 link
2025-01-14 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding Haomiao Xiong et.al. 2501.07819 link
2025-01-13 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models Yasiru Ranasinghe et.al. 2501.07396 null
2025-01-13 Hierarchical Superpixel Segmentation via Structural Information Theory Minhui Xie et.al. 2501.07069 link
2025-01-12 Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving Haoxiang Gao et.al. 2501.06680 null
2025-01-08 NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data Nirit Alkalay et.al. 2501.06235 null
2025-01-10 Self-Supervised Partial Cycle-Consistency for Multi-View Matching Fedor Taggenbrock et.al. 2501.06000 link
2025-01-10 UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation Xinyao Liao et.al. 2501.05687 null
2025-01-09 Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding Mohammed Elhenawy et.al. 2501.05566 null
2025-01-09 A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision Ali Rohan et.al. 2501.05147 null
2025-01-08 TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning Seungmin Baek et.al. 2501.04293 null
2025-01-07 A Bayesian Modeling Framework for Estimation and Ground Segmentation of Cluttered Staircases Prasanna Sriganesh et.al. 2501.04170 null
2025-01-07 LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving Lingdong Kong et.al. 2501.04005 null
2025-01-07 CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds Keonwoo Kim et.al. 2501.03879 null
2025-01-07 Advancing the Understanding of Fine-Grained 3D Forest Structures using Digital Cousins and Simulation-to-Reality: Methods and Datasets Jing Liu et.al. 2501.03637 null
2025-01-03 VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment Wenyan Cong et.al. 2501.01949 null
2025-01-03 IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks Aecheon Jung et.al. 2501.01685 link
2025-01-09 GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models Zhangyang Qi et.al. 2501.01428 null
2025-01-02 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer Jiajun Deng et.al. 2501.01163 null
2025-01-02 Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction Xuan Yu et.al. 2501.01119 null
2024-12-31 STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes Jiawei Yang et.al. 2501.00602 null
2024-12-31 Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding Yue Fan et.al. 2501.00358 null
2024-12-31 OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies Runnan Chen et.al. 2501.00326 link
2024-12-30 Text-to-Image GAN with Pretrained Representations Xiaozhou You et.al. 2501.00116 null
2024-12-30 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives Zeyu Yang et.al. 2412.20720 null
2024-12-27 An Actionable Hierarchical Scene Representation Enhancing Autonomous Inspection Missions in Unknown Environments Vignesh Kottayam Viswanathan et.al. 2412.19582 null
2024-12-27 xFLIE: Leveraging Actionable Hierarchical Scene Representations for Autonomous Semantic-Aware Inspection Missions Vignesh Kottayam Viswanathan et.al. 2412.19571 link
2024-12-27 MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios Jiaqi Fan et.al. 2412.19406 null
2024-12-26 Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation Tao Liu et.al. 2412.19021 null
2024-12-25 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding Tatiana Zemskova et.al. 2412.18450 link
2024-12-24 MR-COGraphs: Communication-efficient Multi-Robot Open-vocabulary Mapping System via 3D Scene Graphs Qiuyi Gu et.al. 2412.18381 null
2024-12-24 Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing Suwesh Prasad Sah et.al. 2412.18165 link
2024-12-24 UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision Yuru Wang et.al. 2412.18131 null
2024-12-24 LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding Hao Li et.al. 2412.17635 null
2024-12-21 Application of Multimodal Large Language Models in Autonomous Driving Md Robiul Islam et.al. 2412.16410 null
2024-12-20 Improving Object Detection for Time-Lapse Imagery Using Temporal Features in Wildlife Monitoring Marcus Jenkins et.al. 2412.16329 link
2024-12-19 AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving Shuo Xing et.al. 2412.15206 link
2024-12-19 ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects Qihang Cao et.al. 2412.14837 null
2024-12-19 PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation Shoumeng Qiu et.al. 2412.14821 link
2024-12-18 GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting Yuning Peng et.al. 2412.13654 link
2024-12-18 RelationField: Relate Anything in Radiance Fields Sebastian Koch et.al. 2412.13652 null
2024-12-18 Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset Sithu Aung et.al. 2412.13569 null
2024-12-17 RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning Kanghoon Yoon et.al. 2412.12788 link
2024-12-18 Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration Ziheng Zhou et.al. 2412.12628 null
2024-12-17 Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning Qi Sun et.al. 2412.11974 link
2024-12-16 DINO-Foresight: Looking into the Future with DINO Efstathios Karypidis et.al. 2412.11673 link
2024-12-16 An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds TianZhu Liu et.al. 2412.11407 null
2024-12-15 SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation Hang Zhang et.al. 2412.11026 null
2024-12-13 SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians Siyun Liang et.al. 2412.10231 null
2024-12-13 Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance Jiahao Lyu et.al. 2412.10159 null
2024-12-17 WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model Songyan Zhang et.al. 2412.09951 link
2024-12-12 LIVE-GS: LLM Powers Interactive VR by Enhancing Gaussian Splatting Haotian Mao et.al. 2412.09176 null
2024-12-11 SLGaussian: Fast Language Gaussian Splatting in Sparse Views Kangjie Chen et.al. 2412.08331 null
2024-12-11 TGOSPA Metric Parameters Selection and Evaluation for Visual Multi-object Tracking Jan Krejčí et.al. 2412.08321 null
2024-12-11 THUD++: Large-Scale Dynamic Indoor Scene Dataset and Benchmark for Mobile Robots Zeshun Li et.al. 2412.08096 null
2024-12-11 MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents Yun Xing et.al. 2412.08014 null
2024-12-10 Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation Thong Thanh Nguyen et.al. 2412.07160 null
2024-12-11 ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models Jieyu Zhang et.al. 2412.07012 link
2024-12-07 Timely reliable Bayesian decision-making enabled using memristors Lekai Song et.al. 2412.06838 null
2024-12-09 Visual Lexicon: Rich Image Features in Language Space XuDong Wang et.al. 2412.06774 null
2024-12-09 LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations Mingjie Xu et.al. 2412.06322 link
2024-12-09 Event fields: Capturing light fields at high speed, resolution, and dynamic range Ziyuan Qu et.al. 2412.06191 null
2024-12-07 TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances Wenting Xu et.al. 2412.05596 null
2024-12-06 Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model Lening Wang et.al. 2412.05280 link
2024-12-06 EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding Yuqi Wu et.al. 2412.04380 link
2024-12-04 Designing DNNs for a trade-off between robustness and processing performance in embedded devices Jon Gutiérrez-Zaballa et.al. 2412.03682 null
2024-12-04 Assessing the performance of CT image denoisers using Laguerre-Gauss Channelized Hotelling Observer for lesion detection Prabhat Kc et.al. 2412.02920 null
2024-12-03 BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding Chenguang Huang et.al. 2412.02449 null
2024-12-04 SparseLGS: Sparse View Language Embedded Gaussian Splatting Jun Hu et.al. 2412.02245 null
2024-12-02 Occam’s LGS: A Simple Approach for Language Gaussian Splatting Jiahuan Cheng et.al. 2412.01807 null
2024-12-02 Holistic Understanding of 3D Scenes as Universal Scene Description Anna-Maria Halacheva et.al. 2412.01398 null
2024-12-02 LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences Hongyan Zhi et.al. 2412.01292 null
2024-12-02 A Semantic Communication System for Real-time 3D Reconstruction Tasks Jiaxing Zhang et.al. 2412.01191 null
2024-12-02 TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition Xingsong Ye et.al. 2412.01137 link
2024-12-01 ChatSplat: 3D Conversational Gaussian Splatting Hanlin Chen et.al. 2412.00734 null
2024-11-30 Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding Duo Zheng et.al. 2412.00493 null
2024-11-29 SIMS: Simulating Human-Scene Interactions with Real World Script Planning Wenjia Wang et.al. 2411.19921 null
2024-11-29 Quantifying the synthetic and real domain gap in aerial scene understanding Alina Marcu et.al. 2411.19913 null
2024-11-29 Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding Wenbo Zhang et.al. 2411.19551 null
2024-11-28 GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks Muhammad Sohail Danish et.al. 2411.19325 link
2024-11-28 On-chip Hyperspectral Image Segmentation with Fully Convolutional Networks for Scene Understanding in Autonomous Driving Jon Gutiérrez-Zaballa et.al. 2411.19274 null
2024-11-28 InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception Haijie Li et.al. 2411.19235 null
2024-11-27 Reconstructing Animals and the Wild Peter Kulits et.al. 2411.18807 null
2024-11-27 Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents Joongwon Chae et.al. 2411.18270 null
2024-11-27 HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation Trong-Thuan Nguyen et.al. 2411.18042 null
2024-11-26 Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning Hoàng-Ân Lê et.al. 2411.17536 link
2024-11-26 HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving Jon Gutiérrez-Zaballa et.al. 2411.17530 null
2024-11-25 RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics Chan Hee Song et.al. 2411.16537 null
2024-11-27 An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models Wentao Qu et.al. 2411.16308 link
2024-11-25 Open-Vocabulary Octree-Graph for 3D Scene Understanding Zhigang Wang et.al. 2411.16253 null
2024-11-24 SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition Yongkun Du et.al. 2411.15858 link
2024-11-24 ROOT: VLM based System for Indoor Scene Understanding and Beyond Yonghui Wang et.al. 2411.15714 link
2024-11-23 Comparative Analysis of Resource-Efficient CNN Architectures for Brain Tumor Classification Md Ashik Khan et.al. 2411.15596 null
2024-11-23 Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing Yadong Qu et.al. 2411.15585 null
2024-11-22 UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations Yuan Ren et.al. 2411.15355 null
2024-11-21 Multimodal 3D Reasoning Segmentation with Complex Scenes Xueying Jiang et.al. 2411.13927 null
2024-11-20 Unbiased Scene Graph Generation by Type-Aware Message Passing on Heterogeneous and Dual Graphs Guanglu Sun et.al. 2411.13287 null
2024-11-20 Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation Rohith Peddi et.al. 2411.13059 null
2024-11-19 GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving Shaoqing Xu et.al. 2411.12452 link
2024-11-19 Classification of Geographical Land Structure Using Convolution Neural Network and Transfer Learning Mustafa M. Abd Zaid et.al. 2411.12415 null
2024-11-18 Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation Hanieh Shojaei Miandashti et.al. 2411.11935 null
2024-11-18 MGNiceNet: Unified Monocular Geometric Scene Understanding Markus Schön et.al. 2411.11466 null
2024-11-18 The ADUULM-360 Dataset – A Multi-Modal Dataset for Depth Estimation in Adverse Weather Markus Schön et.al. 2411.11455 null
2024-11-18 Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications Scarlett Raine et.al. 2411.11287 null
2024-11-19 Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition Tiancheng Lin et.al. 2411.11219 link
2024-11-17 Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry Wenjun Hou et.al. 2411.10937 null
2024-11-16 MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation Ansh Shah et.al. 2411.10886 link
2024-11-16 Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm Sari Masri et.al. 2411.10869 null
2024-11-15 TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding Quang P. M. Pham et.al. 2411.10509 null
2024-11-15 Content-Aware Preserving Image Generation Giang H. Le et.al. 2411.09871 null
2024-11-13 Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification Jose-Luis Matez-Bandera et.al. 2411.08727 link
2024-11-11 $SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation Yinshuang Xu et.al. 2411.07326 null
2024-11-06 Graph-Based Multi-Modal Sensor Fusion for Autonomous Driving Depanshu Sani et.al. 2411.03702 null
2024-11-05 VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation Haochen Zhang et.al. 2411.03540 link
2024-11-05 OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing Pranav Gupta et.al. 2411.02858 null
2024-11-04 Modeling Uncertainty in 3D Gaussian Splatting through Continuous Semantic Splatting Joey Wilson et.al. 2411.02547 null
2024-11-04 Multi-task Geometric Estimation of Depth and Surface Normal from Monocular 360° Images Kun Huang et.al. 2411.01749 link
2024-11-03 VQ-Map: Bird’s-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization Yiwei Zhang et.al. 2411.01618 link
2024-11-01 On Deep Learning for Geometric and Semantic Scene Understanding Using On-Vehicle 3D LiDAR Li Li et.al. 2411.00600 link
2024-11-01 Federated Voxel Scene Graph for Intracranial Hemorrhage Antoine P. Sanner et.al. 2411.00578 null
2024-10-30 UniRiT: Towards Few-Shot Non-Rigid Point Cloud Registration Geng Li et.al. 2410.22909 null
2024-10-30 Situational Scene Graph for Structured Human-centric Situation Understanding Chinthani Sugandhika et.al. 2410.22829 null
2024-10-30 Symbolic Graph Inference for Compound Scene Understanding FNU Aryan et.al. 2410.22626 null
2024-10-29 Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving Bo Jiang et.al. 2410.22313 link
2024-10-26 Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation Hao Ding et.al. 2410.20026 null
2024-10-23 Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement Cheng Yuan et.al. 2410.17642 link
2024-10-22 PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding Vinh Nguyen et.al. 2410.16824 null
2024-10-20 Scene Graph Generation with Role-Playing Large Language Models Guikun Chen et.al. 2410.15364 null
2024-10-20 Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment Can Cui et.al. 2410.15281 null
2024-10-19 Semantically Safe Robot Manipulation: From Semantic Scene Understanding to Motion Safeguards Lukas Brunke et.al. 2410.15185 null
2024-10-19 Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding Yi Liu et.al. 2410.14944 link
2024-10-17 ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding Guangda Ji et.al. 2410.13924 link
2024-10-17 VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding Runsen Xu et.al. 2410.13860 link
2024-10-16 3D Gaussian Splatting in Robotics: A Survey Siting Zhu et.al. 2410.12262 null
2024-10-17 SAM-Guided Masked Token Prediction for 3D Scene Understanding Zhimin Chen et.al. 2410.12158 null
2024-10-16 Leveraging Large Vision Language Model For Better Automatic Web GUI Testing Siyi Wang et.al. 2410.12157 null
2024-10-15 MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark Bin Shan et.al. 2410.11538 link
2024-10-14 3DArticCyclists: Generating Simulated Dynamic 3D Cyclists for Human-Object Interaction (HOI) and Autonomous Driving Applications Eduardo R. Corral-Soto et.al. 2410.10782 null
2024-10-17 Stratified Domain Adaptation: A Progressive Self-Training Approach for Scene Text Recognition Kha Nhat Le et.al. 2410.09913 null
2024-10-13 LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond Md Tanvir Islam et.al. 2410.09831 link
2024-10-12 Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors Hritam Basak et.al. 2410.09467 null
2024-10-11 Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking Wei Zhang et.al. 2410.08616 null
2024-10-10 A transition towards virtual representations of visual scenes Américo Pereira et.al. 2410.07987 null
2024-10-10 RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation Songming Liu et.al. 2410.07864 null
2024-10-11 Test-Time Intensity Consistency Adaptation for Shadow Detection Leyi Zhu et.al. 2410.07695 null
2024-10-10 3D Vision-Language Gaussian Splatting Qucheng Peng et.al. 2410.07577 null
2024-10-09 Evaluating the Impact of Point Cloud Colorization on Semantic Segmentation Accuracy Qinfeng Zhu et.al. 2410.06725 null
2024-10-09 Open-RGBT: Open-vocabulary RGB-T Zero-shot Semantic Segmentation in Open-world Environments Meng Yu et.al. 2410.06626 null
2024-10-08 BoxMap: Efficient Structural Mapping and Navigation Zili Wang et.al. 2410.06263 null
2024-10-08 OrionNav: Online Planning for Robot Autonomy with Context-Aware LLM and Open-Vocabulary Semantic Scene Graphs Venkata Naren Devarakonda et.al. 2410.06239 null
2024-10-07 Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders Kosta Dakic et.al. 2410.04817 null
2024-10-07 Diffusion Models in 3D Vision: A Survey Zhen Wang et.al. 2410.04738 null
2024-10-06 In-Place Panoptic Radiance Field Segmentation with Perceptual Prior for 3D Scene Understanding Shenghao Li et.al. 2410.04529 null
2024-10-05 ETHcavation: A Dataset and Pipeline for Panoptic Scene Understanding and Object Tracking in Dynamic Construction Environments Lorenzo Terenzi et.al. 2410.04250 null
2024-10-05 Fast Object Detection with a Machine Learning Edge Device Richard C. Rodriguez et.al. 2410.04173 null
2024-10-04 SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models Yue Zhang et.al. 2410.03878 null
2024-10-03 RESSCAL3D++: Joint Acquisition and Semantic Segmentation of 3D Point Clouds Remco Royen et.al. 2410.02323 link
2024-10-01 A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio Xavier Juanola et.al. 2410.01020 link
2024-09-30 Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation Aleyna Kütük et.al. 2410.00266 null
2024-09-30 Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation Kun Yuan et.al. 2410.00263 link
2024-09-30 You Only Speak Once to See Wenhao Yang et.al. 2409.18372 null
2024-09-26 LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Chenming Zhu et.al. 2409.18125 null
2024-09-26 Text Image Generation for Low-Resource Languages with Dual Translation Learning Chihiro Noguchi et.al. 2409.17747 null
2024-09-26 Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes Seraj Ghasemi et.al. 2409.17720 null
2024-10-02 BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes Kasun Weerakoon et.al. 2409.16484 null
2024-09-24 Open-World Object Detection with Instance Representation Learning Sunoh Lee et.al. 2409.16073 null
2024-09-24 Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving Lingyu Xiao et.al. 2409.15730 link
2024-09-27 Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer Minh Bui et.al. 2409.15117 null
2024-09-23 An Adverse Weather-Immune Scheme with Unfolded Regularization and Foundation Model Knowledge Distillation for Street Scene Understanding Wei-Bin Kou et.al. 2409.14737 null
2024-09-22 One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance Minyi Zhao et.al. 2409.14483 null
2024-09-22 Scene-Text Grounding for Text-Based Video Question Answering Sheng Zhou et.al. 2409.14319 null
2024-09-21 MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors Zhenhua Du et.al. 2409.14019 null
2024-09-21 Relevance-driven Decision Making for Safer and More Efficient Human Robot Collaboration Xiaotong Zhang et.al. 2409.13998 null
2024-09-21 Enhanced Semantic Segmentation for Large-Scale and Imbalanced Point Clouds Haoran Gong et.al. 2409.13983 null
2024-09-19 CLAIR-A: Leveraging Large Language Models to Judge Audio Captions Tsung-Han Wu et.al. 2409.12962 link
2024-09-18 Towards Global Localization using Multi-Modal Object-Instance Re-Identification Aneesh Chavan et.al. 2409.12002 null
2024-09-18 SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection Tim Engelbracht et.al. 2409.11870 null
2024-09-18 VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer Humen Zhong et.al. 2409.11656 null
2024-09-18 DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion Jian Xu et.al. 2409.11642 link
2024-09-16 Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving Yunsheng Ma et.al. 2409.11182 null
2024-09-16 Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation Yifan Xu et.al. 2409.10350 null
2024-09-16 Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation Minghan Chen et.al. 2409.10262 null
2024-09-15 Semantic2D: A Semantic Dataset for 2D Lidar Semantic Segmentation Zhanteng Xie et.al. 2409.09899 null
2024-09-12 LED: Light Enhanced Depth Estimation at Night Simon de Moreau et.al. 2409.08031 link
2024-09-12 Relevance for Human Robot Collaboration Xiaotong Zhang et.al. 2409.07753 null
2024-09-10 Towards Localizing Structural Elements: Merging Geometrical Detection with Semantic Verification in RGB-D Data Ali Tourani et.al. 2409.06625 null
2024-09-10 Loss Distillation via Gradient Matching for Point Cloud Completion with Weighted Chamfer Distance Fangzhou Lin et.al. 2409.06171 link
2024-09-09 Online 3D reconstruction and dense tracking in endoscopic videos Michel Hayoz et.al. 2409.06037 link
2024-09-08 TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs Horatiu Florea et.al. 2409.05142 null
2024-09-06 Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences Rui Yu et.al. 2409.04390 null
2024-09-06 RCNet: Deep Recurrent Collaborative Network for Multi-View Low-Light Image Enhancement Hao Luo et.al. 2409.04363 link
2024-09-05 Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding Yunze Man et.al. 2409.03757 link
2024-09-05 Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction Shen Chen et.al. 2409.03213 null
2024-09-04 Can LVLMs Obtain a Driver’s License? A Benchmark Towards Reliable AGI for Autonomous Driving Yuhang Lu et.al. 2409.02914 null
2024-09-03 Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning Xiaowei Hu et.al. 2409.02108 link
2024-09-03 EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video Zhen Zhou et.al. 2409.01807 link
2024-09-03 GaussianPU: A Hybrid 2D-3D Upsampling Framework for Enhancing Color Point Clouds via 3D Gaussian Splatting Zixuan Guo et.al. 2409.01581 null
2024-08-31 Leaky Wave Antenna-Equipped RF Chipless Tags for Orientation Estimation Onel L. A. López et.al. 2409.00501 null
2024-08-30 UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios Baichuan Zhou et.al. 2408.17267 link
2024-08-30 AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding Yonghui Wang et.al. 2408.16986 link
2024-08-29 DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving Yongjie Fu et.al. 2408.16647 null
2024-08-28 Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph Zherong Zhang et.al. 2408.15750 null
2024-08-28 RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving Haisheng Su et.al. 2408.15503 link
2024-08-27 Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images Silvia Seidlitz et.al. 2408.15373 link
2024-08-27 MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders Baijiong Lin et.al. 2408.15101 link
2024-08-27 Interactive Occlusion Boundary Estimation through Exploitation of Synthetic Data Lintao Xu et.al. 2408.15038 null
2024-08-27 BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization Mario A. V. Saucedo et.al. 2408.14941 null
2024-08-27 Platypus: A Generalized Specialist Model for Reading Text in Various Forms Peng Wang et.al. 2408.14805 link
2024-08-27 RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models Junyao Ge et.al. 2408.14744 link
2024-08-26 Ensemble Predicate Decoding for Unbiased Scene Graph Generation Jiasong Feng et.al. 2408.14187 null
2024-08-26 FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation Daixun Li et.al. 2408.13980 null
2024-08-25 Making Large Language Models Better Planners with Reasoning-Decision Alignment Zhijian Huang et.al. 2408.13890 null
2024-08-25 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing Shichao Dong et.al. 2408.13788 null
2024-08-25 Extremely Fine-Grained Visual Classification over Resembling Glyphs in the Wild Fares Bougourzi et.al. 2408.13774 link
2024-08-25 SeeBelow: Sub-dermal 3D Reconstruction of Tumors with Surgical Robotic Palpation and Tactile Exploration Raghava Uppuluri et.al. 2408.13699 null
2024-08-21 Exploring Scene Coherence for Semi-Supervised 3D Semantic Segmentation Chuandong Liu et.al. 2408.11280 null
2024-08-20 OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding Youjun Zhao et.al. 2408.11030 link
2024-08-19 3D-Aware Instance Segmentation and Tracking in Egocentric Videos Yash Bhalgat et.al. 2408.09860 null
2024-08-16 Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation Tri Ton et.al. 2408.08591 null
2024-08-15 Towards Flexible Visual Relationship Segmentation Fangrui Zhu et.al. 2408.08305 null
2024-08-13 SpectralGaussians: Semantic, spectral 3D Gaussian splatting for multi-spectral scene representation, visualization and analysis Saptarshi Neil Sinha et.al. 2408.06975 null
2024-08-13 SceneGPT: A Language Model for 3D Scene Understanding Shivam Chandhok et.al. 2408.06926 null
2024-08-12 HeLiMOS: A Dataset for Moving Object Segmentation in 3D Point Clouds From Heterogeneous LiDAR Sensors Hyungtae Lim et.al. 2408.06328 null
2024-08-11 Decoder Pre-Training with only Text for Scene Text Recognition Shuai Zhao et.al. 2408.05706 link
2024-08-09 Spherical World-Locking for Audio-Visual Localization in Egocentric Videos Heeseung Yun et.al. 2408.05364 null
2024-08-15 DeepInteraction++: Multi-Modality Interaction for Autonomous Driving Zeyu Yang et.al. 2408.05075 link
2024-08-09 Mesh-based Object Tracking for Dynamic Semantic 3D Scene Graphs via Ray Tracing Lennart Niecksch et.al. 2408.04979 null
2024-08-09 Manipulable Semantic Components: a Computational Representation of Data Visualization Scenes Zhicheng Liu et.al. 2408.04798 null
2024-08-07 Leveraging LLMs for Enhanced Open-Vocabulary 3D Scene Understanding in Autonomous Driving Amirhosein Chahe et.al. 2408.03516 null
2024-08-04 LEGO: Self-Supervised Representation Learning for Scene Text Images Yujin Ren et.al. 2408.02036 null
2024-07-31 RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion Jianxin Huang et.al. 2407.21631 null
2024-07-31 Voxel Scene Graph for Intracranial Hemorrhage Antoine P. Sanner et.al. 2407.21580 null
2024-07-31 A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap Lijun Zhang et.al. 2407.21438 link
2024-07-31 DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations Dongwon Son et.al. 2407.21267 null
2024-07-30 From Feature Importance to Natural Language Explanations Using LLMs with RAG Sule Tekkesinoglu et.al. 2407.20990 null
2024-07-30 Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering Yanpeng Zhao et.al. 2407.20908 link
2024-07-30 NIS-SLAM: Neural Implicit Semantic RGB-D SLAM for 3D Consistent Scene Understanding Hongjia Zhai et.al. 2407.20853 null
2024-07-29 SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction Çağhan Köksal et.al. 2407.20214 null
2024-07-29 Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets Muhammad Abdullah Jamal et.al. 2407.19714 null
2024-07-28 ASI-Seg: Audio-Driven Surgical Instrument Segmentation with Surgeon Intention Understanding Zhen Chen et.al. 2407.19435 link
2024-07-27 GP-VLS: A general-purpose vision language model for surgery Samuel Schmidgall et.al. 2407.19305 null
2024-07-27 Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction Yansheng Li et.al. 2407.19259 null
2024-07-26 BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation Peng Hao et.al. 2407.18715 null
2024-07-26 MOoSE: Multi-Orientation Sharing Experts for Open-set Scene Text Recognition Chang Liu et.al. 2407.18616 link
2024-07-26 Answerability Fields: Answerable Location Estimation via Diffusion Models Daichi Azuma et.al. 2407.18497 null
2024-07-24 3D Question Answering for City Scene Understanding Penglei Sun et.al. 2407.17398 null
2024-07-23 Augmented Efficiency: Reducing Memory Footprint and Accelerating Inference for 3D Semantic Segmentation through Hybrid Vision Aditya Krishnan et.al. 2407.16102 null
2024-07-25 Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation Jaehyeong Jeon et.al. 2407.15396 link
2024-07-21 VideoGameBunny: Towards vision assistants for video games Mohammad Reza Taesiri et.al. 2407.15295 null
2024-07-21 Self-training Room Layout Estimation via Geometry-aware Ray-casting Bolivar Solarte et.al. 2407.15041 null
2024-07-19 A New Lightweight Hybrid Graph Convolutional Neural Network – CNN Scheme for Scene Classification using Object Detection Inference Ayman Beghdadi et.al. 2407.14658 null
2024-07-19 OpenSU3D: Open World 3D Scene Understanding using Foundation Models Rafay Mohiuddin et.al. 2407.14279 null
2024-07-19 MC-PanDA: Mask Confidence for Panoptic Domain Adaptation Ivan Martinović et.al. 2407.14110 link
2024-07-19 GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation Florian Chabot et.al. 2407.14108 null
2024-07-18 Training-Free Model Merging for Multi-target Domain Adaptation Wenyi Li et.al. 2407.13771 null
2024-07-18 General Geometry-aware Weakly Supervised 3D Object Detection Guowen Zhang et.al. 2407.13748 link
2024-07-18 Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation Pengfei Wang et.al. 2407.13362 null
2024-07-17 InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction Xulong Wang et.al. 2407.12661 link
2024-07-17 Out of Length Text Recognition with Sub-String Matching Yongkun Du et.al. 2407.12317 link
2024-07-17 Dual-Hybrid Attention Network for Specular Highlight Removal Xiaojiao Guo et.al. 2407.12255 null
2024-07-16 Disentangled Acoustic Fields For Multimodal Physical Scene Understanding Jie Yin et.al. 2407.11333 null
2024-07-15 OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models Zijian Zhou et.al. 2407.11213 link
2024-07-15 No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations Walter Simoncini et.al. 2407.10964 link
2024-07-18 Benchmarking Vision Language Models for Cultural Understanding Shravan Nayak et.al. 2407.10920 null
2024-07-14 Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data Tuo Feng et.al. 2407.10200 link
2024-07-13 Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding Ruihuang Li et.al. 2407.09781 null
2024-07-12 A Fair Ranking and New Model for Panoptic Scene Graph Generation Julian Lorenz et.al. 2407.09216 link
2024-07-12 From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation Hanrong Shi et.al. 2407.09191 null
2024-07-11 BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight Hang Wu et.al. 2407.08526 null
2024-07-10 Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences Nikolaos Dimitriadis et.al. 2407.08056 null
2024-07-10 Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search Kirill Paramonov et.al. 2407.07541 null
2024-07-09 Joint prototype and coefficient prediction for 3D instance segmentation Remco Royen et.al. 2407.06958 null
2024-07-09 LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition Teng Wang et.al. 2407.06730 null
2024-07-08 Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition Bangbang Zhou et.al. 2407.05562 link
2024-07-07 Self-supervised Learning via Cluster Distance Prediction for Operating Room Context Awareness Idris Hamoud et.al. 2407.05448 null
2024-07-05 Hybrid Primal Sketch: Combining Analogy, Qualitative Representations, and Computer Vision for Scene Understanding Kenneth D. Forbus et.al. 2407.04859 null
2024-07-03 A Unified Framework for 3D Scene Understanding Wei Xu et.al. 2407.03263 null
2024-07-11 Open Panoramic Segmentation Junwei Zheng et.al. 2407.02685 link
2024-07-02 MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders Baijiong Lin et.al. 2407.02228 link
2024-07-02 Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning Chengchao Shen et.al. 2407.02014 link
2024-07-01 PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction Xuan Yu et.al. 2407.01349 null
2024-06-30 ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding Quang P. M. Pham et.al. 2407.00609 null
2024-06-28 EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting Daiwei Zhang et.al. 2406.19811 null
2024-07-01 Mobile Robot Oriented Large-Scale Indoor Dataset for Dynamic Scene Understanding Yifan Tang et.al. 2406.19791 null
2024-06-28 PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation Deyi Ji et.al. 2406.19632 null
2024-06-27 Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation KuanChao Chu et.al. 2406.19316 null
2024-06-26 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation Shengyi Qian et.al. 2406.18158 null
2024-06-24 GPT-4V Explorations: Mining Autonomous Driving Zixuan Li et.al. 2406.16817 null
2024-06-25 AudioBench: A Universal Benchmark for Audio Large Language Models Bin Wang et.al. 2406.16020 link
2024-06-20 EvSegSNN: Neuromorphic Semantic Segmentation for Event Data Dalia Hareb et.al. 2406.14178 null
2024-06-19 StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images Rushikesh Zawar et.al. 2406.13735 null
2024-06-17 DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features Letian Wang et.al. 2406.12095 null
2024-06-17 Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding Yunsong Wang et.al. 2406.11283 null
2024-06-15 PIG: Prompt Images Guidance for Night-Time Scene Parsing Zhifeng Xie et.al. 2406.10531 link
2024-06-14 MapVision: CVPR 2024 Autonomous Grand Challenge Mapless Driving Tech Report Zhongyu Yang et.al. 2406.10125 null
2024-06-14 SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding Junwei Luo et.al. 2406.10100 link
2024-06-14 A Two-Stage Masked Autoencoder Based Network for Indoor Depth Completion Kailai Sun et.al. 2406.09792 link
2024-06-13 MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding Fei Wang et.al. 2406.09411 link
2024-06-13 Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach Yansheng Li et.al. 2406.09410 link
2024-06-12 Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment Taekbeom Lee et.al. 2406.08176 link
2024-06-13 A3VLM: Actionable Articulation-Aware Vision Language Model Siyuan Huang et.al. 2406.07549 link
2024-06-10 ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery Xian Sun et.al. 2406.06028 null
2024-06-11 LOP-Field: Brain-inspired Layout-Object-Position Fields for Robotic Scene Understanding Jiawei Hou et.al. 2406.05985 null
2024-06-08 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR’24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation Qingfeng Liu et.al. 2406.05352 null
2024-06-06 Semantic Similarity Score for Measuring Visual Similarity at Semantic Level Senran Fan et.al. 2406.03865 null
2024-06-04 Radar Spectra-Language Model for Automotive Scene Parsing Mariia Pushkareva et.al. 2406.02158 null
2024-06-04 Leveraging Predicate and Triplet Learning for Scene Graph Generation Jiankai Li et.al. 2406.02038 link
2024-06-04 FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping Yuzhou Ji et.al. 2406.01916 null
2024-06-04 PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning Yupeng Zheng et.al. 2406.01587 null
2024-06-03 EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding Thanh-Dat Truong et.al. 2406.01429 null
2024-06-03 Object Aware Egocentric Online Action Detection Joungbin An et.al. 2406.01079 null
2024-06-03 CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos Trong-Thuan Nguyen et.al. 2406.01029 null
2024-06-02 Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering Xingrui Wang et.al. 2406.00622 link
2024-06-02 Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024 Biao Wu et.al. 2406.00587 null
2024-05-30 Learning 3D Robotics Perception using Inductive Priors Muhammad Zubair Irshad et.al. 2405.20364 null
2024-05-30 SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation Junjie Zhang et.al. 2405.19586 null
2024-05-29 Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding Junjie Fei et.al. 2405.18937 null
2024-05-27 GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane Yansong Qu et.al. 2405.17596 null
2024-05-27 OED: Towards One-stage End-to-End Dynamic Scene Graph Generation Guan Wang et.al. 2405.16925 link
2024-05-25 Real-Time Scene Graph Generation Maëlic Neau et.al. 2405.16116 link
2024-05-24 Open-Vocabulary SAM3D: Understand Any 3D Scene Hanchen Tai et.al. 2405.15580 null
2024-05-23 Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis Basile Van Hoorick et.al. 2405.14868 null
2024-05-23 CoPeD-Advancing Multi-Robot Collaborative Perception: A Comprehensive Dataset in Real-World Environments Yang Zhou et.al. 2405.14731 link
2024-05-23 Efficient Robot Learning for Perception and Mapping Niclas Vödisch et.al. 2405.14688 null
2024-05-24 Transformers for Image-Goal Navigation Nikhilanj Pelluri et.al. 2405.14128 null
2024-05-22 TS40K: a 3D Point Cloud Dataset of Rural Terrain and Electrical Transmission System Diogo Lavado et.al. 2405.13989 null
2024-05-22 A General Framework for Jersey Number Recognition in Sports Video Maria Koshkina et.al. 2405.13896 link
2024-05-22 GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games Aoran Mei et.al. 2405.13751 null
2024-05-21 Anticipating Object State Changes Victoria Manousaki et.al. 2405.12789 null
2024-05-21 Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency Hyeongjin Kim et.al. 2405.12648 null
2024-05-20 MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering Jingqun Tang et.al. 2405.11985 link
2024-05-19 The First Swahili Language Scene Text Detection and Recognition Dataset Fadila Wendigoundi Douamba et.al. 2405.11437 link
2024-05-16 Grounded 3D-LLM with Referent Tokens Yilun Chen et.al. 2405.10370 link
2024-05-16 4D Panoptic Scene Graph Generation Jingkang Yang et.al. 2405.10305 link
2024-05-16 When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models Xianzheng Ma et.al. 2405.10255 link
2024-05-16 A Preprocessing and Postprocessing Voxel-based Method for LiDAR Semantic Segmentation Improvement in Long Distance Andrea Matteazzi et.al. 2405.10046 null
2024-05-15 BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation Yunhao Ge et.al. 2405.09546 null
2024-05-15 HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition Honghui Chen et.al. 2405.09125 null
2024-05-15 3D Shape Augmentation with Content-Aware Shape Resizing Mingxiang Chen et.al. 2405.09050 null
2024-05-09 Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control Gunshi Gupta et.al. 2405.05852 link
2024-05-11 Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition Zuan Gao et.al. 2405.05841 null
2024-05-09 Benchmarking Neural Radiance Fields for Autonomous Robots: An Overview Yuhang Ming et.al. 2405.05526 null
2024-05-09 DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction Siyu Li et.al. 2405.05518 null
2024-05-08 OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies Lingdong Kong et.al. 2405.05259 link
2024-05-08 Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving Lingdong Kong et.al. 2405.05258 link
2024-05-07 DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving Chen Min et.al. 2405.04390 null
2024-05-07 Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing Boqiang Zhang et.al. 2405.04377 null
2024-05-06 An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas Mira Slavcheva et.al. 2405.03682 null
2024-05-04 Few-Shot Fruit Segmentation via Transfer Learning Jordan A. James et.al. 2405.02556 link
2024-04-29 Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM Navid Rajabi et.al. 2404.19128 null
2024-04-29 Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks Christopher J. Kymn et.al. 2404.19126 null
2024-04-24 Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer Jiaming Lei et.al. 2404.15785 null
2024-04-22 CloudFort: Enhancing Robustness of 3D Point Cloud Classification Against Backdoor Attacks via Spatial Partitioning and Ensemble Prediction Wenhao Lan et.al. 2404.14042 null
2024-04-22 On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments Gang Ma et.al. 2404.13842 null
2024-04-29 Clio: Real-time Task-Driven Open-Set 3D Scene Graphs Dominic Maggio et.al. 2404.13696 link
2024-04-19 BACS: Background Aware Continual Semantic Segmentation Mostafa ElAraby et.al. 2404.13148 link
2024-04-19 Unified Scene Representation and Reconstruction for 3D Large Language Models Tao Chu et.al. 2404.13044 null
2024-04-18 SPIdepth: Strengthened Pose Information for Self-supervised Monocular Depth Estimation Mykola Lavreniuk et.al. 2404.12501 link
2024-04-19 AccidentBlip2: Accident Detection With Multi-View MotionBlip2 Yihua Shao et.al. 2404.12149 link
2024-04-17 Multimodal 3D Object Detection on Unseen Domains Deepti Hegde et.al. 2404.11764 null
2024-04-16 ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation Iaroslav Melekhov et.al. 2404.10699 link
2024-04-16 PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction Sinisa Stekovic et.al. 2404.10620 link
2024-04-16 PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network Yuning Wang et.al. 2404.10263 null
2024-04-15 No More Ambiguity in 360° Room Layout via Bi-Layout Estimation Yu-Ju Tsai et.al. 2404.09993 null
2024-04-15 A Review and Efficient Implementation of Scene Graph Generation Metrics Julian Lorenz et.al. 2404.09616 link
2024-04-14 Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms Diandian Guo et.al. 2404.09231 null
2024-04-11 Gaga: Group Any Gaussians via 3D-aware Memory Bank Weijie Lyu et.al. 2404.07977 null
2024-04-11 AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation Yansheng Li et.al. 2404.07788 null
2024-04-11 Depth Estimation using Weighted-loss and Transfer Learning Muhammad Adeel Hafeez et.al. 2404.07686 null
2024-04-11 Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange Yanhao Wu et.al. 2404.07504 null
2024-04-10 Incorporating Explanations into Human-Machine Interfaces for Trust and Situation Awareness in Autonomous Vehicles Shahin Atakishiyev et.al. 2404.07383 null
2024-04-10 ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling Ege Özsoy et.al. 2404.07031 link
2024-04-10 O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation Muer Tie et.al. 2404.06836 null
2024-04-09 QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding Yash Mehan et.al. 2404.06442 null
2024-04-09 DaF-BEVSeg: Distortion-aware Fisheye Camera based Bird’s Eye View Segmentation with Occlusion Reasoning Senthil Yogamani et.al. 2404.06352 null
2024-04-09 JSTR: Judgment Improves Scene Text Recognition Masato Fujitake et.al. 2404.05967 null
2024-04-06 Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation Danpei Zhao et.al. 2404.04608 null
2024-04-06 SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos Tao Wu et.al. 2404.04565 link
2024-04-05 Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation Zifu Wan et.al. 2404.04256 link
2024-04-06 HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion Jiahang Li et.al. 2404.03527 link
2024-04-04 You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects Lei Zhou et.al. 2404.03462 null
2024-04-03 Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling Xu Wang et.al. 2404.02527 null
2024-04-05 EGTR: Extracting Graph from Transformer for Scene Graph Generation Jinbae Im et.al. 2404.02072 link
2024-04-01 NeRF-MAE: Masked AutoEncoders for Self Supervised 3D representation Learning for Neural Radiance Fields Muhammad Zubair Irshad et.al. 2404.01300 null
2024-04-08 360+x: A Panoptic Multi-modal Scene Understanding Dataset Hao Chen et.al. 2404.00989 null
2024-04-01 Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping Hyeongjun Kwon et.al. 2404.00974 link
2024-04-01 GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields Yunsong Wang et.al. 2404.00931 link
2024-04-01 MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements Lisong C. Sun et.al. 2404.00923 link
2024-04-01 From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models Rongjie Li et.al. 2404.00906 null
2024-03-31 Adapting to Length Shift: FlexiLength Network for Trajectory Prediction Yi Xu et.al. 2404.00742 null
2024-03-31 Neural Radiance Field-based Visual Rendering: A Comprehensive Review Mingyuan Yao et.al. 2404.00714 null
2024-03-29 VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection Zihua Liu et.al. 2404.00149 null
2024-03-29 HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes Ke Wu et.al. 2403.20159 null
2024-04-01 Efficient 3D Instance Mapping and Localization with Neural Fields George Tang et.al. 2403.19797 null
2024-03-27 Object Pose Estimation via the Aggregation of Diffusion Features Tianfu Wang et.al. 2403.18791 link
2024-03-25 Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding Lingdong Kong et.al. 2403.17010 link
2024-03-25 Towards Trustworthy Automated Driving through Qualitative Scene Understanding and Explanations Nassim Belmecheri et.al. 2403.16908 null
2024-03-25 DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding Xiaoxuan Yu et.al. 2403.16431 link
2024-03-24 AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans Cedric Perauer et.al. 2403.16318 null
2024-03-24 Improving Scene Graph Generation with Relation Words’ Debiasing in Vision-Language Models Yuxuan Wang et.al. 2403.16184 null
2024-03-24 Multi-Task Learning with Multi-Task Optimization Lu Bai et.al. 2403.16162 null
2024-03-24 Semantic Is Enough: Only Semantic Information For NeRF Reconstruction Ruibo Wang et.al. 2403.16043 null
2024-03-22 Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting Jun Guo et.al. 2403.15624 null
2024-03-22 DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data Hanrong Ye et.al. 2403.15389 null
2024-03-21 DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation Zeeshan Hayder et.al. 2403.14886 null
2024-03-21 Evaluating Panoramic 3D Estimation in Indoor Lighting Analysis Zining Cheng et.al. 2403.14836 null
2024-03-21 SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field Lizhe Liu et.al. 2403.14366 null
2024-03-21 Exosense: A Vision-Centric Scene Understanding System For Safe Exoskeleton Navigation Jianeng Wang et.al. 2403.14320 null
2024-03-21 Volumetric Environment Representation for Vision-Language Navigation Rui Liu et.al. 2403.14158 null
2024-03-21 3D Object Detection from Point Cloud via Voting Step Diffusion Haoran Hou et.al. 2403.14133 null
2024-03-20 Efficient scene text image super-resolution with semantic guidance LeoWu TomyEnrique et.al. 2403.13330 link
2024-03-19 SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model Armen Avetisyan et.al. 2403.13064 null
2024-03-19 HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting Hongyu Zhou et.al. 2403.12722 null
2024-03-19 M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving Dongyang Xu et.al. 2403.12552 null
2024-03-19 Multi-Object RANSAC: Efficient Plane Clustering Method in a Clutter Seunghyeon Lim et.al. 2403.12449 null
2024-03-19 Geometric Constraints in Deep Learning Frameworks: A Survey Vibhas K Vats et.al. 2403.12431 null
2024-03-18 R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding Qirui Wu et.al. 2403.12301 null
2024-03-18 HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation Ce Zhang et.al. 2403.12033 link
2024-03-18 Agent3D-Zero: An Agent for Zero-shot 3D Understanding Sha Zhang et.al. 2403.11835 null
2024-03-18 OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation Haochen Jiang et.al. 2403.11796 null
2024-03-19 Urban Scene Diffusion through Semantic Occupancy Map Junge Zhang et.al. 2403.11697 null
2024-03-18 Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation Ming Xu et.al. 2403.11541 link
2024-03-18 Beyond Uncertainty: Risk-Aware Active View Acquisition for Safe Robot Navigation and 3D Scene Understanding with FisherRF Guangyi Liu et.al. 2403.11396 null
2024-03-17 Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications Yonggan Fu et.al. 2403.11131 link
2024-03-16 N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields Yash Bhalgat et.al. 2403.10997 null
2024-03-16 Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance Segmentation Mariia Khan et.al. 2403.10780 null
2024-03-15 Robust Shape Fitting for 3D Scene Abstraction Florian Kluger et.al. 2403.10452 link
2024-03-15 Do Visual-Language Maps Capture Latent Semantics? Matti Pekkanen et.al. 2403.10117 null
2024-03-15 Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning Hang Zhang et.al. 2403.10107 null
2024-03-14 GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding Chengyao Wang et.al. 2403.09639 link
2024-03-12 IndicSTR12: A Dataset for Indic Scene Text Recognition Harsh Lunia et.al. 2403.08007 null
2024-03-12 Efficient Global Navigational Planning in 3D Structures based on Point Cloud Tomography Bowen Yang et.al. 2403.07631 link
2024-03-12 Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss Xuhua Ren et.al. 2403.07518 null
2024-03-12 MoAI: Mixture of All Intelligence for Large Language and Vision Models Byung-Kwan Lee et.al. 2403.07508 link
2024-03-11 Mapping High-level Semantic Regions in Indoor Environments without Object Recognition Roberto Bigazzi et.al. 2403.07076 null
2024-03-11 Optimizing Latent Graph Representations of Surgical Scenes for Zero-Shot Domain Transfer Siddhant Satyanaik et.al. 2403.06953 null
2024-03-08 Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation Yifan Mao et.al. 2403.05056 link
2024-03-07 Towards Scene Graph Anticipation Rohith Peddi et.al. 2403.04899 null
2024-03-07 Embodied Understanding of Driving Scenarios Yunsong Zhou et.al. 2403.04593 link
2024-03-07 Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes Stamatios Georgoulis et.al. 2403.04562 null
2024-03-06 GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding Zi-Ting Chou et.al. 2403.03608 null
2024-03-05 OORD: The Oxford Offroad Radar Dataset Matthew Gadd et.al. 2403.02845 link
2024-03-05 HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes Yichen Yao et.al. 2403.02769 null
2024-02-29 FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything Safouane El Ghazouali et.al. 2403.00175 link
2024-02-29 One model to use them all: Training a segmentation model with complementary datasets Alexander C. Jenke et.al. 2402.19340 link
2024-02-29 Feature boosting with efficient attention for scene parsing Vivek Singh et.al. 2402.19250 null
2024-02-29 PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds Haotian Liu et.al. 2402.18925 null
2024-02-28 Windowed-FourierMixer: Enhancing Clutter-Free Room Modeling with Fourier Transform Bruno Henriques et.al. 2402.18287 null
2024-02-27 LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment Yiming Ren et.al. 2402.17171 null
2024-02-27 Efficiently Leveraging Linguistic Priors for Scene Text Spotting Nguyen Nguyen et.al. 2402.17134 null
2024-02-26 DreamUp3D: Object-Centric Generative Models for Single-View 3D Scene Understanding and Real-to-Sim Transfer Yizhe Wu et.al. 2402.16308 null
2024-02-24 Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition Mingkun Yang et.al. 2402.15806 null
2024-02-23 OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding Francis Engelmann et.al. 2402.15321 null
2024-02-22 S^2Former-OR: Single-Stage Bimodal Transformer for Scene Graph Generation in OR Jialun Pei et.al. 2402.14461 null
2024-02-22 Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding Yu-Qi Yang et.al. 2402.14215 link
2024-02-21 Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition Mingkun Yang et.al. 2402.13643 link
2024-02-25 DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models Xiaoyu Tian et.al. 2402.12289 null
