Scene Understanding - 2025-04
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-04-30 | V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving | Jannik Lübberstedt et al. | 2505.00156 | translate | read | null |
| 2025-04-30 | LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | Marc Glocker et al. | 2504.21716 | translate | read | link |
| 2025-04-30 | ImaginateAR: AI-Assisted In-Situ Authoring in Augmented Reality | Jaewook Lee et al. | 2504.21360 | translate | read | null |
| 2025-04-28 | Category-Level and Open-Set Object Pose Estimation for Robotics | Peter Hönig et al. | 2504.19572 | translate | read | null |
| 2025-04-28 | Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding | Yan Wang et al. | 2504.19500 | translate | read | null |
| 2025-04-27 | Beyond Physical Reach: Comparing Head- and Cane-Mounted Cameras for Last-Mile Navigation by Blind Users | Apurv Varshney et al. | 2504.19345 | translate | read | null |
| 2025-04-27 | OpenFusion++: An Open-vocabulary Real-time Scene Understanding System | Xiaofeng Jin et al. | 2504.19266 | translate | read | null |
| 2025-04-27 | CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis | Alexander Baumann et al. | 2504.19223 | translate | read | null |
| 2025-04-27 | Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving | Mi Zheng et al. | 2504.19183 | translate | read | null |
| 2025-04-23 | TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Meng Chu et al. | 2504.16505 | translate | read | null |
| 2025-04-21 | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Mohammad Abu Tami et al. | 2504.16134 | translate | read | null |
| 2025-04-22 | Vision language models are unreliable at trivial spatial cognition | Sangeet Khemlani et al. | 2504.16061 | translate | read | null |
| 2025-04-20 | Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension | Lin Li et al. | 2504.14642 | translate | read | null |
| 2025-04-20 | RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots | Zhang Zhang et al. | 2504.14604 | translate | read | null |
| 2025-04-20 | Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Tong Zeng et al. | 2504.14526 | translate | read | link |
| 2025-04-20 | Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation | Guoyi Zhang et al. | 2504.14481 | translate | read | null |
| 2025-04-18 | HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering | Alexander Rusnak et al. | 2504.13590 | translate | read | null |
| 2025-04-18 | Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding | Yuchen Rao et al. | 2504.13580 | translate | read | link |
| 2025-04-18 | Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation | Cheng Yuan et al. | 2504.13440 | translate | read | null |
| 2025-04-17 | Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs | Shaohui Dai et al. | 2504.13153 | translate | read | link |
| 2025-04-17 | Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks | Nassim Belmecheri et al. | 2504.12817 | translate | read | null |
| 2025-04-17 | Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation | Changsheng Lv et al. | 2504.12606 | translate | read | null |
| 2025-04-16 | Generalized Visual Relation Detection with Diffusion Models | Kaifeng Gao et al. | 2504.12100 | translate | read | null |
| 2025-04-17 | DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency | Mengshi Qi et al. | 2504.12080 | translate | read | link |
| 2025-04-16 | CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting | Wei Sun et al. | 2504.11893 | translate | read | null |
| 2025-04-15 | Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning | Juan Garcia Giraldo et al. | 2504.11268 | translate | read | null |
| 2025-04-14 | Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | Darryl Hannan et al. | 2504.10727 | translate | read | null |
| 2025-04-14 | SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding | Marc Gutiérrez-Pérez et al. | 2504.10106 | translate | read | link |
| 2025-04-12 | Text To 3D Object Generation For Scalable Room Assembly | Sonia Laguna et al. | 2504.09328 | translate | read | null |
| 2025-04-11 | FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment | Sebastián Barbas Laina et al. | 2504.08603 | translate | read | null |
| 2025-04-11 | FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents | Xin Tan et al. | 2504.08581 | translate | read | null |
| 2025-04-11 | DSM: Building A Diverse Semantic Map for 3D Visual Grounding | Qinghongbing Xie et al. | 2504.08307 | translate | read | null |
| 2025-04-10 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | Joshua Li et al. | 2504.07867 | translate | read | null |
| 2025-04-10 | DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction | Xu Zhao et al. | 2504.07524 | translate | read | null |
| 2025-04-09 | RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | Omar Alama et al. | 2504.06994 | translate | read | null |
| 2025-04-09 | Audio-visual Event Localization on Portrait Mode Short Videos | Wuyang Liu et al. | 2504.06884 | translate | read | null |
| 2025-04-09 | MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Chang Nie et al. | 2504.06863 | translate | read | null |
| 2025-04-09 | Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding | Pedro Hermosilla et al. | 2504.06719 | translate | read | link |
| 2025-04-09 | Domain-Conditioned Scene Graphs for State-Grounded Task Planning | Jonas Herzog et al. | 2504.06661 | translate | read | null |
| 2025-04-09 | Attributes-aware Visual Emotion Representation Learning | Rahul Singh Maharjan et al. | 2504.06578 | translate | read | null |
| 2025-04-08 | CamContextI2V: Context-aware Controllable Video Generation | Luis Denninger et al. | 2504.06022 | translate | read | link |
| 2025-04-08 | AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems | Zhuoli Zhuang et al. | 2504.05950 | translate | read | null |
| 2025-04-08 | PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario | Sriram Mandalika et al. | 2504.05908 | translate | read | null |
| 2025-04-08 | InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation | Jiangsan Zhao et al. | 2504.05751 | translate | read | null |
| 2025-04-07 | RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Congcong Wen et al. | 2504.04988 | translate | read | null |
| 2025-04-07 | Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding | Zahir Alsulaimawi et al. | 2504.04772 | translate | read | null |
| 2025-04-07 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | Bo-Wen Yin et al. | 2504.04701 | translate | read | link |
| 2025-04-06 | Planning Safety Trajectories with Dual-Phase, Physics-Informed, and Transportation Knowledge-Driven Large Language Models | Rui Gan et al. | 2504.04562 | translate | read | null |
| 2025-04-04 | 3D Scene Understanding Through Local Random Access Sequence Modeling | Wanhee Lee et al. | 2504.03875 | translate | read | link |
| 2025-04-07 | NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | Kexin Tian et al. | 2504.03164 | translate | read | null |
| 2025-04-03 | F-ViTA: Foundation Model Guided Visible to Thermal Translation | Jay N. Paranjape et al. | 2504.02801 | translate | read | link |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Xiaofeng Han et al. | 2504.02477 | translate | read | link |
| 2025-04-02 | Scene-Centric Unsupervised Panoptic Segmentation | Oliver Hahn et al. | 2504.01955 | translate | read | link |
| 2025-04-02 | Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness | Haochen Wang et al. | 2504.01901 | translate | read | null |
| 2025-04-02 | CoMatcher: Multi-View Collaborative Feature Matching | Jintao Zhang et al. | 2504.01872 | translate | read | null |
| 2025-04-02 | TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication | Petr Vanc et al. | 2504.01708 | translate | read | null |
| 2025-04-02 | Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation | Junjie Chen et al. | 2504.01668 | translate | read | null |
| 2025-04-01 | WikiVideo: Article Generation from Multiple Videos | Alexander Martin et al. | 2504.00939 | translate | read | link |
| 2025-04-01 | Zero-Shot 4D Lidar Panoptic Segmentation | Yushan Zhang et al. | 2504.00848 | translate | read | null |
| 2025-04-01 | PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks | Abdelrahman Elskhawy et al. | 2504.00844 | translate | read | null |
| 2025-04-01 | Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights | Yuchen Liu et al. | 2504.00839 | translate | read | null |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)