Scene Understanding - 2025-06

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-06-29 | IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | Parker Liu et al. | 2506.23329 | translate | read | link |
| 2025-06-29 | Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation | Zhenhua Ning et al. | 2506.23120 | translate | read | null |
| 2025-06-28 | Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding | Xingyilang Yin et al. | 2506.22817 | translate | read | null |
| 2025-06-28 | VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding | Minchao Jiang et al. | 2506.22799 | translate | read | null |
| 2025-06-26 | CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery | Felix Holm et al. | 2506.21813 | translate | read | null |
| 2025-06-24 | FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models | Shiyi Wang et al. | 2506.21627 | translate | read | null |
| 2025-06-26 | CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations | Julian Lorenz et al. | 2506.21357 | translate | read | null |
| 2025-06-27 | ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | Xiwei Xuan et al. | 2506.21233 | translate | read | null |
| 2025-06-25 | IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals | Markus Gross et al. | 2506.20671 | translate | read | null |
| 2025-06-25 | Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios | Wenbin Gan et al. | 2506.20531 | translate | read | null |
| 2025-06-25 | DreamAnywhere: Object-Centric Panoramic 3D Scene Generation | Edoardo Alberto Dominici et al. | 2506.20367 | translate | read | null |
| 2025-06-24 | HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions | Mrunmai Vivek Phatak et al. | 2506.19639 | translate | read | null |
| 2025-06-24 | Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Federico Tavella et al. | 2506.19579 | translate | read | null |
| 2025-06-24 | Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning | Pengfei Hao et al. | 2506.19469 | translate | read | null |
| 2025-06-24 | Segment Any 3D-Part in a Scene from a Sentence | Hongyu Wu et al. | 2506.19331 | translate | read | null |
| 2025-06-24 | Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding | Runwei Guan et al. | 2506.19288 | translate | read | null |
| 2025-06-24 | Object-aware Sound Source Localization via Audio-Visual Scene Understanding | Sung Jin Um et al. | 2506.18557 | translate | read | null |
| 2025-06-23 | DIP: Unsupervised Dense In-Context Post-training of Visual Representations | Sophia Sirko-Galouchenko et al. | 2506.18463 | translate | read | link |
| 2025-06-22 | TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving | Wenzhuo Liu et al. | 2506.18084 | translate | read | null |
| 2025-06-22 | Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis | Mohamed Benkedadra et al. | 2506.17910 | translate | read | null |
| 2025-06-21 | Optimization-Free Patch Attack on Stereo Depth Estimation | Hangcheng Liu et al. | 2506.17632 | translate | read | null |
| 2025-06-21 | Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Zhihao Yuan et al. | 2506.17545 | translate | read | null |
| 2025-06-17 | Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment | Weiming Zhang et al. | 2506.14271 | translate | read | null |
| 2025-06-17 | Unified Representation Space for 3D Visual Grounding | Yinuo Zheng et al. | 2506.14238 | translate | read | null |
| 2025-06-17 | SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability | Juho Bai et al. | 2506.14144 | translate | read | null |
| 2025-06-17 | Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems | Sanjeda Akter et al. | 2506.14096 | translate | read | null |
| 2025-06-16 | FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding | Chenlu Zhan et al. | 2506.13629 | translate | read | null |
| 2025-06-16 | A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | Guohuan Xie et al. | 2506.13552 | translate | read | null |
| 2025-06-14 | A Spatial Relationship Aware Dataset for Robotics | Peng Wang et al. | 2506.12525 | translate | read | link |
| 2025-06-14 | Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding | Youze Wang et al. | 2506.12336 | translate | read | null |
| 2025-06-12 | GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset | Sahar Nasirihaghighi et al. | 2506.11356 | translate | read | null |
| 2025-06-12 | SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis | Weiliang Chen et al. | 2506.10981 | translate | read | null |
| 2025-06-13 | SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields | Qijing Li et al. | 2506.09565 | translate | read | null |
| 2025-06-11 | ODG: Occupancy Prediction Using Dual Gaussians | Yunxiao Shi et al. | 2506.09417 | translate | read | null |
| 2025-06-10 | SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting | Mengjiao Ma et al. | 2506.08710 | translate | read | link |
| 2025-06-10 | PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Liang Ma et al. | 2506.08708 | translate | read | null |
| 2025-06-10 | From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge | Agnese Taluzzi et al. | 2506.08553 | translate | read | null |
| 2025-06-10 | Robust Visual Localization via Semantic-Guided Multi-Scale Transformer | Zhongtao Tian et al. | 2506.08526 | translate | read | null |
| 2025-06-09 | Open World Scene Graph Generation using Vision Language Models | Amartya Dutta et al. | 2506.08189 | translate | read | link |
| 2025-06-09 | Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods | Beining Xu et al. | 2506.07779 | translate | read | null |
| 2025-06-09 | OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting | Jens Piekenbrinck et al. | 2506.07697 | translate | read | null |
| 2025-06-09 | Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent | Shoon Kit Lim et al. | 2506.07509 | translate | read | link |
| 2025-06-09 | SpatialLM: Training Large Language Models for Structured Indoor Modeling | Yongsen Mao et al. | 2506.07491 | translate | read | link |
| 2025-06-08 | BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction | Yunxiao Shi et al. | 2506.07002 | translate | read | null |
| 2025-06-07 | IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion | Hongming Chen et al. | 2506.06804 | translate | read | null |
| 2025-06-07 | PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments | Minghao Zou et al. | 2506.06631 | translate | read | null |
| 2025-06-06 | Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments | Chad R Samuelson et al. | 2506.06562 | translate | read | null |
| 2025-06-06 | Enhancing Situational Awareness in Underwater Robotics with Multi-modal Spatial Perception | Pushyami Kaveti et al. | 2506.06476 | translate | read | null |
| 2025-06-06 | Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study | Leon Mayer et al. | 2506.06232 | translate | read | null |
| 2025-06-06 | STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Christian Fruhwirth-Reisinger et al. | 2506.06218 | translate | read | null |
| 2025-06-06 | Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness | Steven Landgraf et al. | 2506.05917 | translate | read | null |
| 2025-06-06 | HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios | Daming Wang et al. | 2506.05883 | translate | read | null |
| 2025-06-06 | Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models | Hugues Thomas et al. | 2506.05689 | translate | read | null |
| 2025-06-06 | Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection | Shanmukha Vellamcheti et al. | 2506.05651 | translate | read | null |
| 2025-06-05 | SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning | Fanqi Kong et al. | 2506.05425 | translate | read | null |
| 2025-06-06 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Haoyuan Li et al. | 2506.05318 | translate | read | null |
| 2025-06-06 | ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation | Daniel Rho et al. | 2506.05317 | translate | read | null |
| 2025-06-04 | OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | Junting Chen et al. | 2506.04217 | translate | read | link |
| 2025-06-04 | BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation | Jialei Chen et al. | 2506.03675 | translate | read | null |
| 2025-06-04 | Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI | Wing Man Casca Kwok et al. | 2506.03607 | translate | read | null |
| 2025-06-03 | Trajectory Prediction Meets Large Language Models: A Survey | Yi Xu et al. | 2506.03408 | translate | read | link |
| 2025-06-04 | Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments | Di Wen et al. | 2506.02845 | translate | read | link |
| 2025-06-03 | PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis | Mijeong Kim et al. | 2506.02794 | translate | read | null |
| 2025-06-03 | Large-scale Self-supervised Video Foundation Model for Intelligent Surgery | Shu Yang et al. | 2506.02692 | translate | read | null |
| 2025-06-03 | Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024 | Patrick Pfreundschuh et al. | 2506.02676 | translate | read | null |
| 2025-06-03 | Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models | Safaa Abdullahi Moallim Mohamud et al. | 2506.02615 | translate | read | null |
| 2025-06-03 | Sign Language: Towards Sign Understanding for Robot Autonomy | Ayush Agrawal et al. | 2506.02556 | translate | read | null |
| 2025-06-02 | MLLMs Need 3D-Aware Representation Supervision for Scene Understanding | Xiaohu Huang et al. | 2506.01946 | translate | read | null |
| 2025-06-02 | SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes | Yuji Wang et al. | 2506.01558 | translate | read | null |
| 2025-06-02 | FDSG: Forecasting Dynamic Scene Graphs | Yi Yang et al. | 2506.01487 | translate | read | null |
| 2025-06-02 | Learning Sparsity for Effective and Efficient Music Performance Question Answering | Xingjian Diao et al. | 2506.01319 | translate | read | null |

(<a href="../Scene_Understanding.md">back to Scene Understanding</a>)