Scene Understanding - 2025-06 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-06-29	IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering	Parker Liu et.al.	2506.23329	translate	read	link
2025-06-29	Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation	Zhenhua Ning et.al.	2506.23120	translate	read	null
2025-06-28	Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding	Xingyilang Yin et.al.	2506.22817	translate	read	null
2025-06-28	VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding	Minchao Jiang et.al.	2506.22799	translate	read	null
2025-06-26	CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery	Felix Holm et.al.	2506.21813	translate	read	null
2025-06-24	FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models	Shiyi Wang et.al.	2506.21627	translate	read	null
2025-06-26	CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations	Julian Lorenz et.al.	2506.21357	translate	read	null
2025-06-27	ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation	Xiwei Xuan et.al.	2506.21233	translate	read	null
2025-06-25	IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals	Markus Gross et.al.	2506.20671	translate	read	null
2025-06-25	Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios	Wenbin Gan et.al.	2506.20531	translate	read	null
2025-06-25	DreamAnywhere: Object-Centric Panoramic 3D Scene Generation	Edoardo Alberto Dominici et.al.	2506.20367	translate	read	null
2025-06-24	HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions	Mrunmai Vivek Phatak et.al.	2506.19639	translate	read	null
2025-06-24	Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects	Federico Tavella et.al.	2506.19579	translate	read	null
2025-06-24	Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning	Pengfei Hao et.al.	2506.19469	translate	read	null
2025-06-24	Segment Any 3D-Part in a Scene from a Sentence	Hongyu Wu et.al.	2506.19331	translate	read	null
2025-06-24	Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding	Runwei Guan et.al.	2506.19288	translate	read	null
2025-06-24	Object-aware Sound Source Localization via Audio-Visual Scene Understanding	Sung Jin Um et.al.	2506.18557	translate	read	null
2025-06-23	DIP: Unsupervised Dense In-Context Post-training of Visual Representations	Sophia Sirko-Galouchenko et.al.	2506.18463	translate	read	link
2025-06-22	TEM^3-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving	Wenzhuo Liu et.al.	2506.18084	translate	read	null
2025-06-22	Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis	Mohamed Benkedadra et.al.	2506.17910	translate	read	null
2025-06-21	Optimization-Free Patch Attack on Stereo Depth Estimation	Hangcheng Liu et.al.	2506.17632	translate	read	null
2025-06-21	Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations	Zhihao Yuan et.al.	2506.17545	translate	read	null
2025-06-17	Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment	Weiming Zhang et.al.	2506.14271	translate	read	null
2025-06-17	Unified Representation Space for 3D Visual Grounding	Yinuo Zheng et.al.	2506.14238	translate	read	null
2025-06-17	SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability	Juho Bai et.al.	2506.14144	translate	read	null
2025-06-17	Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems	Sanjeda Akter et.al.	2506.14096	translate	read	null
2025-06-16	FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding	Chenlu Zhan et.al.	2506.13629	translate	read	null
2025-06-16	A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects	Guohuan Xie et.al.	2506.13552	translate	read	null
2025-06-14	A Spatial Relationship Aware Dataset for Robotics	Peng Wang et.al.	2506.12525	translate	read	link
2025-06-14	Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding	Youze Wang et.al.	2506.12336	translate	read	null
2025-06-12	GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset	Sahar Nasirihaghighi et.al.	2506.11356	translate	read	null
2025-06-12	SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis	Weiliang Chen et.al.	2506.10981	translate	read	null
2025-06-13	SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields	Qijing Li et.al.	2506.09565	translate	read	null
2025-06-11	ODG: Occupancy Prediction Using Dual Gaussians	Yunxiao Shi et.al.	2506.09417	translate	read	null
2025-06-10	SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting	Mengjiao Ma et.al.	2506.08710	translate	read	link
2025-06-10	PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly	Liang Ma et.al.	2506.08708	translate	read	null
2025-06-10	From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge	Agnese Taluzzi et.al.	2506.08553	translate	read	null
2025-06-10	Robust Visual Localization via Semantic-Guided Multi-Scale Transformer	Zhongtao Tian et.al.	2506.08526	translate	read	null
2025-06-09	Open World Scene Graph Generation using Vision Language Models	Amartya Dutta et.al.	2506.08189	translate	read	link
2025-06-09	Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods	Beining Xu et.al.	2506.07779	translate	read	null
2025-06-09	OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting	Jens Piekenbrinck et.al.	2506.07697	translate	read	null
2025-06-09	Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent	Shoon Kit Lim et.al.	2506.07509	translate	read	link
2025-06-09	SpatialLM: Training Large Language Models for Structured Indoor Modeling	Yongsen Mao et.al.	2506.07491	translate	read	link
2025-06-08	BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction	Yunxiao Shi et.al.	2506.07002	translate	read	null
2025-06-07	IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion	Hongming Chen et.al.	2506.06804	translate	read	null
2025-06-07	PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments	Minghao Zou et.al.	2506.06631	translate	read	null
2025-06-06	Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments	Chad R Samuelson et.al.	2506.06562	translate	read	null
2025-06-06	Enhancing Situational Awareness in Underwater Robotics with Multi-modal Spatial Perception	Pushyami Kaveti et.al.	2506.06476	translate	read	null
2025-06-06	Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study	Leon Mayer et.al.	2506.06232	translate	read	null
2025-06-06	STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving	Christian Fruhwirth-Reisinger et.al.	2506.06218	translate	read	null
2025-06-06	Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness	Steven Landgraf et.al.	2506.05917	translate	read	null
2025-06-06	HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios	Daming Wang et.al.	2506.05883	translate	read	null
2025-06-06	Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models	Hugues Thomas et.al.	2506.05689	translate	read	null
2025-06-06	Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection	Shanmukha Vellamcheti et.al.	2506.05651	translate	read	null
2025-06-05	SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning	Fanqi Kong et.al.	2506.05425	translate	read	null
2025-06-06	Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Haoyuan Li et.al.	2506.05318	translate	read	null
2025-06-06	ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation	Daniel Rho et.al.	2506.05317	translate	read	null
2025-06-04	OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis	Junting Chen et.al.	2506.04217	translate	read	link
2025-06-04	BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation	Jialei Chen et.al.	2506.03675	translate	read	null
2025-06-04	Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI	Wing Man Casca Kwok et.al.	2506.03607	translate	read	null
2025-06-03	Trajectory Prediction Meets Large Language Models: A Survey	Yi Xu et.al.	2506.03408	translate	read	link
2025-06-04	Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments	Di Wen et.al.	2506.02845	translate	read	link
2025-06-03	PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis	Mijeong Kim et.al.	2506.02794	translate	read	null
2025-06-03	Large-scale Self-supervised Video Foundation Model for Intelligent Surgery	Shu Yang et.al.	2506.02692	translate	read	null
2025-06-03	Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024	Patrick Pfreundschuh et.al.	2506.02676	translate	read	null
2025-06-03	Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models	Safaa Abdullahi Moallim Mohamud et.al.	2506.02615	translate	read	null
2025-06-03	Sign Language: Towards Sign Understanding for Robot Autonomy	Ayush Agrawal et.al.	2506.02556	translate	read	null
2025-06-02	MLLMs Need 3D-Aware Representation Supervision for Scene Understanding	Xiaohu Huang et.al.	2506.01946	translate	read	null
2025-06-02	SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes	Yuji Wang et.al.	2506.01558	translate	read	null
2025-06-02	FDSG: Forecasting Dynamic Scene Graphs	Yi Yang et.al.	2506.01487	translate	read	null
2025-06-02	Learning Sparsity for Effective and Efficient Music Performance Question Answering	Xingjian Diao et.al.	2506.01319	translate	read	null
