Scene Understanding - 2025-04

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 2025-04-30 | V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving | Jannik Lübberstedt et al. | 2505.00156 | translate | read | null |
| 2025-04-30 | LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | Marc Glocker et al. | 2504.21716 | translate | read | link |
| 2025-04-30 | ImaginateAR: AI-Assisted In-Situ Authoring in Augmented Reality | Jaewook Lee et al. | 2504.21360 | translate | read | null |
| 2025-04-28 | Category-Level and Open-Set Object Pose Estimation for Robotics | Peter Hönig et al. | 2504.19572 | translate | read | null |
| 2025-04-28 | Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding | Yan Wang et al. | 2504.19500 | translate | read | null |
| 2025-04-27 | Beyond Physical Reach: Comparing Head- and Cane-Mounted Cameras for Last-Mile Navigation by Blind Users | Apurv Varshney et al. | 2504.19345 | translate | read | null |
| 2025-04-27 | OpenFusion++: An Open-vocabulary Real-time Scene Understanding System | Xiaofeng Jin et al. | 2504.19266 | translate | read | null |
| 2025-04-27 | CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis | Alexander Baumann et al. | 2504.19223 | translate | read | null |
| 2025-04-27 | Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving | Mi Zheng et al. | 2504.19183 | translate | read | null |
| 2025-04-23 | TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Meng Chu et al. | 2504.16505 | translate | read | null |
| 2025-04-21 | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Mohammad Abu Tami et al. | 2504.16134 | translate | read | null |
| 2025-04-22 | Vision language models are unreliable at trivial spatial cognition | Sangeet Khemlani et al. | 2504.16061 | translate | read | null |
| 2025-04-20 | Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension | Lin Li et al. | 2504.14642 | translate | read | null |
| 2025-04-20 | RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots | Zhang Zhang et al. | 2504.14604 | translate | read | null |
| 2025-04-20 | Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Tong Zeng et al. | 2504.14526 | translate | read | link |
| 2025-04-20 | Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation | Guoyi Zhang et al. | 2504.14481 | translate | read | null |
| 2025-04-18 | HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering | Alexander Rusnak et al. | 2504.13590 | translate | read | null |
| 2025-04-18 | Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding | Yuchen Rao et al. | 2504.13580 | translate | read | link |
| 2025-04-18 | Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation | Cheng Yuan et al. | 2504.13440 | translate | read | null |
| 2025-04-17 | Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs | Shaohui Dai et al. | 2504.13153 | translate | read | link |
| 2025-04-17 | Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks | Nassim Belmecheri et al. | 2504.12817 | translate | read | null |
| 2025-04-17 | Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation | Changsheng Lv et al. | 2504.12606 | translate | read | null |
| 2025-04-16 | Generalized Visual Relation Detection with Diffusion Models | Kaifeng Gao et al. | 2504.12100 | translate | read | null |
| 2025-04-17 | DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency | Mengshi Qi et al. | 2504.12080 | translate | read | link |
| 2025-04-16 | CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting | Wei Sun et al. | 2504.11893 | translate | read | null |
| 2025-04-15 | Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning | Juan Garcia Giraldo et al. | 2504.11268 | translate | read | null |
| 2025-04-14 | Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | Darryl Hannan et al. | 2504.10727 | translate | read | null |
| 2025-04-14 | SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding | Marc Gutiérrez-Pérez et al. | 2504.10106 | translate | read | link |
| 2025-04-12 | Text To 3D Object Generation For Scalable Room Assembly | Sonia Laguna et al. | 2504.09328 | translate | read | null |
| 2025-04-11 | FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment | Sebastián Barbas Laina et al. | 2504.08603 | translate | read | null |
| 2025-04-11 | FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents | Xin Tan et al. | 2504.08581 | translate | read | null |
| 2025-04-11 | DSM: Building A Diverse Semantic Map for 3D Visual Grounding | Qinghongbing Xie et al. | 2504.08307 | translate | read | null |
| 2025-04-10 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | Joshua Li et al. | 2504.07867 | translate | read | null |
| 2025-04-10 | DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction | Xu Zhao et al. | 2504.07524 | translate | read | null |
| 2025-04-09 | RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | Omar Alama et al. | 2504.06994 | translate | read | null |
| 2025-04-09 | Audio-visual Event Localization on Portrait Mode Short Videos | Wuyang Liu et al. | 2504.06884 | translate | read | null |
| 2025-04-09 | MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Chang Nie et al. | 2504.06863 | translate | read | null |
| 2025-04-09 | Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding | Pedro Hermosilla et al. | 2504.06719 | translate | read | link |
| 2025-04-09 | Domain-Conditioned Scene Graphs for State-Grounded Task Planning | Jonas Herzog et al. | 2504.06661 | translate | read | null |
| 2025-04-09 | Attributes-aware Visual Emotion Representation Learning | Rahul Singh Maharjan et al. | 2504.06578 | translate | read | null |
| 2025-04-08 | CamContextI2V: Context-aware Controllable Video Generation | Luis Denninger et al. | 2504.06022 | translate | read | link |
| 2025-04-08 | AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems | Zhuoli Zhuang et al. | 2504.05950 | translate | read | null |
| 2025-04-08 | PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario | Sriram Mandalika et al. | 2504.05908 | translate | read | null |
| 2025-04-08 | InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation | Jiangsan Zhao et al. | 2504.05751 | translate | read | null |
| 2025-04-07 | RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Congcong Wen et al. | 2504.04988 | translate | read | null |
| 2025-04-07 | Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding | Zahir Alsulaimawi et al. | 2504.04772 | translate | read | null |
| 2025-04-07 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | Bo-Wen Yin et al. | 2504.04701 | translate | read | link |
| 2025-04-06 | Planning Safety Trajectories with Dual-Phase, Physics-Informed, and Transportation Knowledge-Driven Large Language Models | Rui Gan et al. | 2504.04562 | translate | read | null |
| 2025-04-04 | 3D Scene Understanding Through Local Random Access Sequence Modeling | Wanhee Lee et al. | 2504.03875 | translate | read | link |
| 2025-04-07 | NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | Kexin Tian et al. | 2504.03164 | translate | read | null |
| 2025-04-03 | F-ViTA: Foundation Model Guided Visible to Thermal Translation | Jay N. Paranjape et al. | 2504.02801 | translate | read | link |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Xiaofeng Han et al. | 2504.02477 | translate | read | link |
| 2025-04-02 | Scene-Centric Unsupervised Panoptic Segmentation | Oliver Hahn et al. | 2504.01955 | translate | read | link |
| 2025-04-02 | Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness | Haochen Wang et al. | 2504.01901 | translate | read | null |
| 2025-04-02 | CoMatcher: Multi-View Collaborative Feature Matching | Jintao Zhang et al. | 2504.01872 | translate | read | null |
| 2025-04-02 | TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication | Petr Vanc et al. | 2504.01708 | translate | read | null |
| 2025-04-02 | Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation | Junjie Chen et al. | 2504.01668 | translate | read | null |
| 2025-04-01 | WikiVideo: Article Generation from Multiple Videos | Alexander Martin et al. | 2504.00939 | translate | read | link |
| 2025-04-01 | Zero-Shot 4D Lidar Panoptic Segmentation | Yushan Zhang et al. | 2504.00848 | translate | read | null |
| 2025-04-01 | PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks | Abdelrahman Elskhawy et al. | 2504.00844 | translate | read | null |
| 2025-04-01 | Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights | Yuchen Liu et al. | 2504.00839 | translate | read | null |
