Scene Understanding - 2025-06
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-06-29 | IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | Parker Liu et.al. | 2506.23329 | translate | read | link |
| 2025-06-29 | Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation | Zhenhua Ning et.al. | 2506.23120 | translate | read | null |
| 2025-06-28 | Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding | Xingyilang Yin et.al. | 2506.22817 | translate | read | null |
| 2025-06-28 | VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding | Minchao Jiang et.al. | 2506.22799 | translate | read | null |
| 2025-06-26 | CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery | Felix Holm et.al. | 2506.21813 | translate | read | null |
| 2025-06-24 | FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models | Shiyi Wang et.al. | 2506.21627 | translate | read | null |
| 2025-06-26 | CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations | Julian Lorenz et.al. | 2506.21357 | translate | read | null |
| 2025-06-27 | ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | Xiwei Xuan et.al. | 2506.21233 | translate | read | null |
| 2025-06-25 | IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals | Markus Gross et.al. | 2506.20671 | translate | read | null |
| 2025-06-25 | Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios | Wenbin Gan et.al. | 2506.20531 | translate | read | null |
| 2025-06-25 | DreamAnywhere: Object-Centric Panoramic 3D Scene Generation | Edoardo Alberto Dominici et.al. | 2506.20367 | translate | read | null |
| 2025-06-24 | HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions | Mrunmai Vivek Phatak et.al. | 2506.19639 | translate | read | null |
| 2025-06-24 | Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Federico Tavella et.al. | 2506.19579 | translate | read | null |
| 2025-06-24 | Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning | Pengfei Hao et.al. | 2506.19469 | translate | read | null |
| 2025-06-24 | Segment Any 3D-Part in a Scene from a Sentence | Hongyu Wu et.al. | 2506.19331 | translate | read | null |
| 2025-06-24 | Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding | Runwei Guan et.al. | 2506.19288 | translate | read | null |
| 2025-06-24 | Object-aware Sound Source Localization via Audio-Visual Scene Understanding | Sung Jin Um et.al. | 2506.18557 | translate | read | null |
| 2025-06-23 | DIP: Unsupervised Dense In-Context Post-training of Visual Representations | Sophia Sirko-Galouchenko et.al. | 2506.18463 | translate | read | link |
| 2025-06-22 | TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving | Wenzhuo Liu et.al. | 2506.18084 | translate | read | null |
| 2025-06-22 | Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis | Mohamed Benkedadra et.al. | 2506.17910 | translate | read | null |
| 2025-06-21 | Optimization-Free Patch Attack on Stereo Depth Estimation | Hangcheng Liu et.al. | 2506.17632 | translate | read | null |
| 2025-06-21 | Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Zhihao Yuan et.al. | 2506.17545 | translate | read | null |
| 2025-06-17 | Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment | Weiming Zhang et.al. | 2506.14271 | translate | read | null |
| 2025-06-17 | Unified Representation Space for 3D Visual Grounding | Yinuo Zheng et.al. | 2506.14238 | translate | read | null |
| 2025-06-17 | SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability | Juho Bai et.al. | 2506.14144 | translate | read | null |
| 2025-06-17 | Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems | Sanjeda Akter et.al. | 2506.14096 | translate | read | null |
| 2025-06-16 | FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding | Chenlu Zhan et.al. | 2506.13629 | translate | read | null |
| 2025-06-16 | A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | Guohuan Xie et.al. | 2506.13552 | translate | read | null |
| 2025-06-14 | A Spatial Relationship Aware Dataset for Robotics | Peng Wang et.al. | 2506.12525 | translate | read | link |
| 2025-06-14 | Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding | Youze Wang et.al. | 2506.12336 | translate | read | null |
| 2025-06-12 | GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset | Sahar Nasirihaghighi et.al. | 2506.11356 | translate | read | null |
| 2025-06-12 | SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis | Weiliang Chen et.al. | 2506.10981 | translate | read | null |
| 2025-06-13 | SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields | Qijing Li et.al. | 2506.09565 | translate | read | null |
| 2025-06-11 | ODG: Occupancy Prediction Using Dual Gaussians | Yunxiao Shi et.al. | 2506.09417 | translate | read | null |
| 2025-06-10 | SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting | Mengjiao Ma et.al. | 2506.08710 | translate | read | link |
| 2025-06-10 | PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Liang Ma et.al. | 2506.08708 | translate | read | null |
| 2025-06-10 | From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge | Agnese Taluzzi et.al. | 2506.08553 | translate | read | null |
| 2025-06-10 | Robust Visual Localization via Semantic-Guided Multi-Scale Transformer | Zhongtao Tian et.al. | 2506.08526 | translate | read | null |
| 2025-06-09 | Open World Scene Graph Generation using Vision Language Models | Amartya Dutta et.al. | 2506.08189 | translate | read | link |
| 2025-06-09 | Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods | Beining Xu et.al. | 2506.07779 | translate | read | null |
| 2025-06-09 | OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting | Jens Piekenbrinck et.al. | 2506.07697 | translate | read | null |
| 2025-06-09 | Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent | Shoon Kit Lim et.al. | 2506.07509 | translate | read | link |
| 2025-06-09 | SpatialLM: Training Large Language Models for Structured Indoor Modeling | Yongsen Mao et.al. | 2506.07491 | translate | read | link |
| 2025-06-08 | BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction | Yunxiao Shi et.al. | 2506.07002 | translate | read | null |
| 2025-06-07 | IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion | Hongming Chen et.al. | 2506.06804 | translate | read | null |
| 2025-06-07 | PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments | Minghao Zou et.al. | 2506.06631 | translate | read | null |
| 2025-06-06 | Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments | Chad R Samuelson et.al. | 2506.06562 | translate | read | null |
| 2025-06-06 | Enhancing Situational Awareness in Underwater Robotics with Multi-modal Spatial Perception | Pushyami Kaveti et.al. | 2506.06476 | translate | read | null |
| 2025-06-06 | Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study | Leon Mayer et.al. | 2506.06232 | translate | read | null |
| 2025-06-06 | STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Christian Fruhwirth-Reisinger et.al. | 2506.06218 | translate | read | null |
| 2025-06-06 | Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness | Steven Landgraf et.al. | 2506.05917 | translate | read | null |
| 2025-06-06 | HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios | Daming Wang et.al. | 2506.05883 | translate | read | null |
| 2025-06-06 | Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models | Hugues Thomas et.al. | 2506.05689 | translate | read | null |
| 2025-06-06 | Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection | Shanmukha Vellamcheti et.al. | 2506.05651 | translate | read | null |
| 2025-06-05 | SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning | Fanqi Kong et.al. | 2506.05425 | translate | read | null |
| 2025-06-06 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Haoyuan Li et.al. | 2506.05318 | translate | read | null |
| 2025-06-06 | ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation | Daniel Rho et.al. | 2506.05317 | translate | read | null |
| 2025-06-04 | OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | Junting Chen et.al. | 2506.04217 | translate | read | link |
| 2025-06-04 | BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation | Jialei Chen et.al. | 2506.03675 | translate | read | null |
| 2025-06-04 | Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI | Wing Man Casca Kwok et.al. | 2506.03607 | translate | read | null |
| 2025-06-03 | Trajectory Prediction Meets Large Language Models: A Survey | Yi Xu et.al. | 2506.03408 | translate | read | link |
| 2025-06-04 | Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments | Di Wen et.al. | 2506.02845 | translate | read | link |
| 2025-06-03 | PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis | Mijeong Kim et.al. | 2506.02794 | translate | read | null |
| 2025-06-03 | Large-scale Self-supervised Video Foundation Model for Intelligent Surgery | Shu Yang et.al. | 2506.02692 | translate | read | null |
| 2025-06-03 | Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024 | Patrick Pfreundschuh et.al. | 2506.02676 | translate | read | null |
| 2025-06-03 | Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models | Safaa Abdullahi Moallim Mohamud et.al. | 2506.02615 | translate | read | null |
| 2025-06-03 | Sign Language: Towards Sign Understanding for Robot Autonomy | Ayush Agrawal et.al. | 2506.02556 | translate | read | null |
| 2025-06-02 | MLLMs Need 3D-Aware Representation Supervision for Scene Understanding | Xiaohu Huang et.al. | 2506.01946 | translate | read | null |
| 2025-06-02 | SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes | Yuji Wang et.al. | 2506.01558 | translate | read | null |
| 2025-06-02 | FDSG: Forecasting Dynamic Scene Graphs | Yi Yang et.al. | 2506.01487 | translate | read | null |
| 2025-06-02 | Learning Sparsity for Effective and Efficient Music Performance Question Answering | Xingjian Diao et.al. | 2506.01319 | translate | read | null |