Scene Understanding - 2025-05 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-05-30	Tackling View-Dependent Semantics in 3D Language Gaussian Splatting	Jiazhong Cen et.al.	2505.24746	translate	read	null
2025-05-30	Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors	Duo Zheng et.al.	2505.24625	translate	read	link
2025-05-30	EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding	Ege Özsoy et.al.	2505.24287	translate	read	null
2025-05-29	ConversAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners	Jad Bendarkawi et.al.	2505.24000	translate	read	null
2025-05-29	A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation	Shuzhou Sun et.al.	2505.23451	translate	read	null
2025-05-29	SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model	Bowen Chen et.al.	2505.23010	translate	read	null
2025-05-28	On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation	Liyao Tang et.al.	2505.22444	translate	read	null
2025-05-28	LiDAR Based Semantic Perception for Forklifts in Outdoor Environments	Benjamin Serfling et.al.	2505.22258	translate	read	null
2025-05-28	3D Question Answering via only 2D Vision-Language Models	Fengyun Wang et.al.	2505.22143	translate	read	null
2025-05-29	DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation	Tianjun Gu et.al.	2505.21969	translate	read	null
2025-05-28	Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs	Insu Lee et.al.	2505.21955	translate	read	null
2025-05-27	A Graph Completion Method that Jointly Predicts Geometry and Topology Enables Effective Molecule Assembly	Rohan V. Koodli et.al.	2505.21833	translate	read	null
2025-05-29	Compositional Scene Understanding through Inverse Generative Modeling	Yanbo Wang et.al.	2505.21780	translate	read	null
2025-05-30	Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks	Keanu Nichols et.al.	2505.21649	translate	read	null
2025-05-27	Assured Autonomy with Neuro-Symbolic Perception	R. Spencer Hallyburton et.al.	2505.21322	translate	read	null
2025-05-27	Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning	Lintao Xu et.al.	2505.21231	translate	read	null
2025-05-27	Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts	Yue Zhang et.al.	2505.21079	translate	read	null
2025-05-27	OccLE: Label-Efficient 3D Semantic Occupancy Prediction	Naiyu Fang et.al.	2505.20617	translate	read	null
2025-05-27	OmniIndoor3D: Comprehensive Indoor 3D Reconstruction	Xiaobao Wei et.al.	2505.20610	translate	read	null
2025-05-26	From Data to Modeling: Fully Open-vocabulary Scene Graph Generation	Zuyao Chen et.al.	2505.20106	translate	read	null
2025-05-26	DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization	Jianxin Huang et.al.	2505.20041	translate	read	null
2025-05-26	Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement	Afrah Shaahid et.al.	2505.19895	translate	read	null
2025-05-26	LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study	Dongil Yang et.al.	2505.19510	translate	read	link
2025-05-25	FHGS: Feature-Homogenized Gaussian Splatting	Q. G. Duan et.al.	2505.19154	translate	read	null
2025-05-25	Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection	Md. Mithun Hossain et.al.	2505.19010	translate	read	null
2025-05-24	Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding	Guofeng Mei et.al.	2505.18819	translate	read	null
2025-05-24	Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps	Sicheng Feng et.al.	2505.18675	translate	read	link
2025-05-23	SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain	Jiawei Zhou et.al.	2505.17727	translate	read	null
2025-05-23	From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation	Mahmoud Chick Zaouali et.al.	2505.17402	translate	read	null
2025-05-22	Assessing the generalization performance of SAM for ureteroscopy scene understanding	Martin Villagrana et.al.	2505.17210	translate	read	null
2025-05-22	CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation	Haihong Hao et.al.	2505.16663	translate	read	link
2025-05-21	SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval	Nikolaos Chaidos et.al.	2505.15867	translate	read	link
2025-05-21	HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning	Xiaodong Mei et.al.	2505.15703	translate	read	null
2025-05-21	Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets	Kaiyuan Chen et.al.	2505.15517	translate	read	link
2025-05-21	RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation	Naman Patel et.al.	2505.15373	translate	read	null
2025-05-21	DC-Scene: Data-Centric Learning for 3D Scene Understanding	Ting Huang et.al.	2505.15232	translate	read	link
2025-05-19	ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling	Ege Özsoy et.al.	2505.12890	translate	read	null
2025-05-19	AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning	Kai Zhang et.al.	2505.12782	translate	read	null
2025-05-19	Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps	Ziqi Wen et.al.	2505.12660	translate	read	null
2025-05-18	LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding	Hanyu Zhou et.al.	2505.12253	translate	read	null
2025-05-18	SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving	Muleilan Pei et.al.	2505.12246	translate	read	null
2025-05-18	Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind	Qingmei Li et.al.	2505.12207	translate	read	link
2025-05-18	Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding	Xuefei Sun et.al.	2505.12194	translate	read	null
2025-05-17	TinyRS-R1: Compact Multimodal Language Model for Remote Sensing	Aybora Koksal et.al.	2505.12099	translate	read	null
2025-05-15	StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation	Daniel A. P. Oliveira et.al.	2505.10292	translate	read	link
2025-05-15	APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds	Yuan Gao et.al.	2505.09971	translate	read	link
2025-05-14	DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection	Jianlin Sun et.al.	2505.09168	translate	read	link
2025-05-14	Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning	Dayong Liang et.al.	2505.09118	translate	read	null
2025-05-13	Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	Zongchuang Zhao et.al.	2505.08725	translate	read	link
2025-05-12	Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions	Yi Zhang et.al.	2505.07611	translate	read	null
2025-05-11	Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding	Chih-Chung Hsu et.al.	2505.06991	translate	read	null
2025-05-11	Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation	Seokjun Kwon et.al.	2505.06951	translate	read	null
2025-05-09	Camera Control at the Edge with Language Models for Scene Understanding	Alexiy Buynitsky et.al.	2505.06402	translate	read	null
2025-05-09	Camera-Only Bird’s Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles	Anupkumar Bochare et.al.	2505.06113	translate	read	null
2025-05-08	Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization	Sooyoung Park et.al.	2505.05343	translate	read	link
2025-05-08	PADriver: Towards Personalized Autonomous Driving	Genghua Kou et.al.	2505.05240	translate	read	null
2025-05-08	Does CLIP perceive art the same way we do?	Andrea Asperti et.al.	2505.05229	translate	read	null
2025-05-07	GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes	Feng Xiao et.al.	2505.04659	translate	read	link
2025-05-07	RAFT: Robust Augmentation of FeaTures for Image Segmentation	Edward Humes et.al.	2505.04529	translate	read	null
2025-05-03	Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models	Gracjan Góral et.al.	2505.03821	translate	read	null
2025-05-06	MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation	Mingcheng Li et.al.	2505.02648	translate	read	null
2025-05-04	Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation	Volodymyr Havrylov et.al.	2505.02075	translate	read	link
2025-05-04	Segment Any RGB-Thermal Model with Language-aided Distillation	Dong Xing et.al.	2505.01950	translate	read	null
2025-05-02	Embracing Diffraction: A Paradigm Shift in Wireless Sensing and Communication	Anurag Pallaprolu et.al.	2505.01625	translate	read	null
