Scene Understanding - 2026-03 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID
2026-03-31	SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes	Léopold Maillard et al.	2603.29798
2026-03-31	Hallucination-aware intermediate representation edit in large vision-language models	Wei Suo et al.	2603.29405
2026-03-31	VueBuds: Visual Intelligence with Wireless Earbuds	Maruchi Kim et al.	2603.29095
2026-03-30	Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure	Chao Yin et al.	2603.28660
2026-03-30	Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation	Weichao Cai et al.	2603.28414
2026-03-30	DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning	Weimin Liu et al.	2603.28251
2026-03-30	To View Transform or Not to View Transform: NeRF-based Pre-training Perspective	Hyeonjun Jeong et al.	2603.28090
2026-03-30	SegRGB-X: General RGB-X Semantic Segmentation Model	Jiong Liu et al.	2603.28023
2026-03-30	ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments	Pragat Wagle et al.	2603.27923
2026-03-25	LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds	Jaehun Bang et al.	2603.24146
2026-03-25	MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation	Gengluo Li et al.	2603.23896
2026-03-24	SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes	Zhicheng Qiu et al.	2603.22893
2026-03-23	Generalized multi-object classification and tracking with sparse feature resonator networks	Lazar Supic et al.	2603.22539
2026-03-23	Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning	Minseok Kang et al.	2603.21559
2026-03-22	OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields	Aizierjiang Aiersilan et al.	2603.20999
2026-03-20	End-to-End Optimization of Polarimetric Measurement and Material Classifier	Ryota Maeda et al.	2603.20519
2026-03-20	IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning	Fan Yang et al.	2603.20182
2026-03-20	Structured Latent Dynamics in Wireless CSI via Homomorphic World Models	Salmane Naoumi et al.	2603.20048
2026-03-19	Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding	Xianjin Wu et al.	2603.19235
2026-03-19	REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation	Shuqi Xiao et al.	2603.18624
2026-03-19	OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting	Hongjia Zhai et al.	2603.18510
2026-03-18	Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting	Guillem Casadesus Vila et al.	2603.18218
2026-03-18	GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes	Huajian Zeng et al.	2603.17993
2026-03-18	Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding	Shuyao Shi et al.	2603.17980
2026-03-18	SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale	Markus Gross et al.	2603.17920
2026-03-18	From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving	A. Humnabadkar et al.	2603.17714
2026-03-18	ReLaGS: Relational Language Gaussian Splatting	Yaxu Xie et al.	2603.17605
2026-03-18	P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation	Tianfu Li et al.	2603.17459
2026-03-17	$x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space	Ruishan Guo et al.	2603.16671
2026-03-17	BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection	Melissa Schween et al.	2603.16645
2026-03-17	OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding	Siting Zhu et al.	2603.16301
2026-03-17	Structured prototype regularization for synthetic-to-real driving scene parsing	Jiahe Fan et al.	2603.16083
2026-03-16	Safety Case Patterns for VLA-based driving systems: Insights from SimLingo	Gerhard Yu et al.	2603.16013
2026-03-16	Panoramic Affordance Prediction	Zixin Zhang et al.	2603.15558
2026-03-16	Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation	Yuanfan Zheng et al.	2603.15475
2026-03-16	Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context	Mohamed Aziz Younes et al.	2603.15404
2026-03-16	RieMind: Geometry-Grounded Spatial Agent for Scene Understanding	Fernando Ropero et al.	2603.15386
2026-03-16	NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation	Jiahang Liu et al.	2603.15186
2026-03-16	AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving	Wenhui Huang et al.	2603.14851
2026-03-16	Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning	Heng Zhou et al.	2603.14811
2026-03-16	AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild	Yiting Wang et al.	2603.14701
2026-03-15	WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning	Stefan Englmeier et al.	2603.14497
2026-03-15	V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning	Lorenzo Mur-Labadia et al.	2603.14482
2026-03-15	VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion	Aditya Shirwatkar et al.	2603.14345
2026-03-15	4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding	Mohamed Rayan Barhdadi et al.	2603.14301
2026-03-15	S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction	Renhe Zhang et al.	2603.14232
2026-03-12	Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary	Nazia Tasnim et al.	2603.11410
2026-03-11	DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding	Mingzhe Tao et al.	2603.11380
2026-03-11	UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark	Yu Zhang et al.	2603.10722
2026-03-11	DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime	Julian Lorenz et al.	2603.10538
2026-03-10	RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding	Muyi Sun et al.	2603.09809
2026-03-10	More than the Sum: Panorama-Language Models for Adverse Omni-Scenes	Weijia Fan et al.	2603.09573
2026-03-10	Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning	Chun-Peng Chang et al.	2603.09512
2026-03-09	APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model	Yuanjie Lu et al.	2603.08862
2026-03-09	Rethinking the semantic classification of indoor places by mobile robots	Oscar Martinez Mozos et al.	2603.08512
2026-03-09	UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing	Jiaxi Zhang et al.	2603.08131
2026-03-09	SGG-R$^{3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation	Jiaye Feng et al.	2603.07961
2026-03-09	Toward Unified Multimodal Representation Learning for Autonomous Driving	Ximeng Tao et al.	2603.07874
2026-03-08	Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance	Guodong Sun et al.	2603.07570
2026-03-06	AV-Unified: A Unified Framework for Audio-visual Scene Understanding	Guangyao Li et al.	2603.06530
2026-03-06	REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation	Maëlic Neau et al.	2603.06386
2026-03-06	VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction	Xiaoyang Yan et al.	2603.06210
2026-03-06	JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas	Sandeep Inuganti et al.	2603.06168
2026-03-06	FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models	Andrew Caunes et al.	2603.06166
2026-03-06	Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion	Bohai Gu et al.	2603.06140
2026-03-06	DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model	Hao Yang et al.	2603.06090
2026-03-06	Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning	Cristiano Battistini et al.	2603.06084
2026-03-06	Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image	Zidian Qiu et al.	2603.05908
2026-03-05	Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields	Scout Jarman et al.	2603.05473
2026-03-05	CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception	Gong Chen et al.	2603.05255
2026-03-05	3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding	Xiongkun Linghu et al.	2603.04976
2026-03-05	Roomify: Spatially-Grounded Style Transformation for Immersive Virtual Environments	Xueyang Wang et al.	2603.04917
2026-03-04	SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D	Zirui Wang et al.	2603.04614
2026-03-04	EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding	Seungjun Lee et al.	2603.04254
2026-03-04	Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation	Dongnuan Cai et al.	2603.04128
2026-03-04	Glass Segmentation with Fusion of Learned and General Visual Features	Risto Ojala et al.	2603.03718
2026-03-03	Hazard-Aware Traffic Scene Graph Generation	Yaoqi Huang et al.	2603.03584
2026-03-03	An Effective Data Augmentation Method by Asking Questions about Scene Text Images	Xu Yao et al.	2603.03580
2026-03-03	Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery	Muhammad Asad et al.	2603.03571
2026-03-03	Any Resolution Any Geometry: From Multi-View To Multi-Patch	Wenqing Cui et al.	2603.03026
2026-03-03	SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding	Sheng Ye et al.	2603.02548
2026-03-02	Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation	Jan Finke et al.	2603.01999
2026-03-02	WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration	Gong Chen et al.	2603.01708
2026-03-02	CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions	Gong Chen et al.	2603.01688
2026-03-02	WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments	Joshua Knights et al.	2603.01475
2026-03-01	Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding	Anna Michailidou et al.	2603.01324
2026-03-01	Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving	Xubo Zhu et al.	2603.01007
