Scene Understanding - 2026-03

| Publish Date | Title | Authors | PDF | Code |
|:---|:---|:---|:---|:---|
| 2026-03-31 | SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes | Léopold Maillard et al. | 2603.29798 | null |
| 2026-03-31 | Hallucination-aware intermediate representation edit in large vision-language models | Wei Suo et al. | 2603.29405 | null |
| 2026-03-31 | VueBuds: Visual Intelligence with Wireless Earbuds | Maruchi Kim et al. | 2603.29095 | null |
| 2026-03-30 | Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure | Chao Yin et al. | 2603.28660 | null |
| 2026-03-30 | Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation | Weichao Cai et al. | 2603.28414 | null |
| 2026-03-30 | DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning | Weimin Liu et al. | 2603.28251 | null |
| 2026-03-30 | To View Transform or Not to View Transform: NeRF-based Pre-training Perspective | Hyeonjun Jeong et al. | 2603.28090 | null |
| 2026-03-30 | SegRGB-X: General RGB-X Semantic Segmentation Model | Jiong Liu et al. | 2603.28023 | null |
| 2026-03-30 | ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments | Pragat Wagle et al. | 2603.27923 | null |
| 2026-03-25 | LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds | Jaehun Bang et al. | 2603.24146 | null |
| 2026-03-25 | MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation | Gengluo Li et al. | 2603.23896 | null |
| 2026-03-24 | SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes | Zhicheng Qiu et al. | 2603.22893 | null |
| 2026-03-23 | Generalized multi-object classification and tracking with sparse feature resonator networks | Lazar Supic et al. | 2603.22539 | null |
| 2026-03-23 | Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning | Minseok Kang et al. | 2603.21559 | null |
| 2026-03-22 | OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields | Aizierjiang Aiersilan et al. | 2603.20999 | null |
| 2026-03-20 | End-to-End Optimization of Polarimetric Measurement and Material Classifier | Ryota Maeda et al. | 2603.20519 | null |
| 2026-03-20 | IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning | Fan Yang et al. | 2603.20182 | null |
| 2026-03-20 | Structured Latent Dynamics in Wireless CSI via Homomorphic World Models | Salmane Naoumi et al. | 2603.20048 | null |
| 2026-03-19 | Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding | Xianjin Wu et al. | 2603.19235 | null |
| 2026-03-19 | REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation | Shuqi Xiao et al. | 2603.18624 | null |
| 2026-03-19 | OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting | Hongjia Zhai et al. | 2603.18510 | null |
| 2026-03-18 | Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting | Guillem Casadesus Vila et al. | 2603.18218 | null |
| 2026-03-18 | GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes | Huajian Zeng et al. | 2603.17993 | null |
| 2026-03-18 | Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding | Shuyao Shi et al. | 2603.17980 | null |
| 2026-03-18 | SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale | Markus Gross et al. | 2603.17920 | null |
| 2026-03-18 | From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving | A. Humnabadkar et al. | 2603.17714 | null |
| 2026-03-18 | ReLaGS: Relational Language Gaussian Splatting | Yaxu Xie et al. | 2603.17605 | null |
| 2026-03-18 | P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation | Tianfu Li et al. | 2603.17459 | null |
| 2026-03-17 | $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space | Ruishan Guo et al. | 2603.16671 | null |
| 2026-03-17 | BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection | Melissa Schween et al. | 2603.16645 | null |
| 2026-03-17 | OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding | Siting Zhu et al. | 2603.16301 | null |
| 2026-03-17 | Structured prototype regularization for synthetic-to-real driving scene parsing | Jiahe Fan et al. | 2603.16083 | null |
| 2026-03-16 | Safety Case Patterns for VLA-based driving systems: Insights from SimLingo | Gerhard Yu et al. | 2603.16013 | null |
| 2026-03-16 | Panoramic Affordance Prediction | Zixin Zhang et al. | 2603.15558 | null |
| 2026-03-16 | Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation | Yuanfan Zheng et al. | 2603.15475 | null |
| 2026-03-16 | Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context | Mohamed Aziz Younes et al. | 2603.15404 | null |
| 2026-03-16 | RieMind: Geometry-Grounded Spatial Agent for Scene Understanding | Fernando Ropero et al. | 2603.15386 | null |
| 2026-03-16 | NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation | Jiahang Liu et al. | 2603.15186 | null |
| 2026-03-16 | AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving | Wenhui Huang et al. | 2603.14851 | null |
| 2026-03-16 | Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning | Heng Zhou et al. | 2603.14811 | null |
| 2026-03-16 | AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild | Yiting Wang et al. | 2603.14701 | null |
| 2026-03-15 | WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning | Stefan Englmeier et al. | 2603.14497 | null |
| 2026-03-15 | V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning | Lorenzo Mur-Labadia et al. | 2603.14482 | null |
| 2026-03-15 | VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion | Aditya Shirwatkar et al. | 2603.14345 | null |
| 2026-03-15 | 4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding | Mohamed Rayan Barhdadi et al. | 2603.14301 | null |
| 2026-03-15 | S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction | Renhe Zhang et al. | 2603.14232 | null |
| 2026-03-12 | Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary | Nazia Tasnim et al. | 2603.11410 | null |
| 2026-03-11 | DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding | Mingzhe Tao et al. | 2603.11380 | null |
| 2026-03-11 | UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark | Yu Zhang et al. | 2603.10722 | null |
| 2026-03-11 | DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime | Julian Lorenz et al. | 2603.10538 | null |
| 2026-03-10 | RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding | Muyi Sun et al. | 2603.09809 | null |
| 2026-03-10 | More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | Weijia Fan et al. | 2603.09573 | null |
| 2026-03-10 | Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning | Chun-Peng Chang et al. | 2603.09512 | null |
| 2026-03-09 | APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model | Yuanjie Lu et al. | 2603.08862 | null |
| 2026-03-09 | Rethinking the semantic classification of indoor places by mobile robots | Oscar Martinez Mozos et al. | 2603.08512 | null |
| 2026-03-09 | UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing | Jiaxi Zhang et al. | 2603.08131 | null |
| 2026-03-09 | SGG-R$^{\rm 3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation | Jiaye Feng et al. | 2603.07961 | null |
| 2026-03-09 | Toward Unified Multimodal Representation Learning for Autonomous Driving | Ximeng Tao et al. | 2603.07874 | null |
| 2026-03-08 | Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance | Guodong Sun et al. | 2603.07570 | null |
| 2026-03-06 | AV-Unified: A Unified Framework for Audio-visual Scene Understanding | Guangyao Li et al. | 2603.06530 | null |
| 2026-03-06 | REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation | Maëlic Neau et al. | 2603.06386 | null |
| 2026-03-06 | VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction | Xiaoyang Yan et al. | 2603.06210 | null |
| 2026-03-06 | JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas | Sandeep Inuganti et al. | 2603.06168 | null |
| 2026-03-06 | FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models | Andrew Caunes et al. | 2603.06166 | null |
| 2026-03-06 | Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion | Bohai Gu et al. | 2603.06140 | null |
| 2026-03-06 | DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model | Hao Yang et al. | 2603.06090 | null |
| 2026-03-06 | Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning | Cristiano Battistini et al. | 2603.06084 | null |
| 2026-03-06 | Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image | Zidian Qiu et al. | 2603.05908 | null |
| 2026-03-05 | Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields | Scout Jarman et al. | 2603.05473 | null |
| 2026-03-05 | CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception | Gong Chen et al. | 2603.05255 | null |
| 2026-03-05 | 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding | Xiongkun Linghu et al. | 2603.04976 | null |
| 2026-03-05 | Roomify: Spatially-Grounded Style Transformation for Immersive Virtual Environments | Xueyang Wang et al. | 2603.04917 | null |
| 2026-03-04 | SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D | Zirui Wang et al. | 2603.04614 | null |
| 2026-03-04 | EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding | Seungjun Lee et al. | 2603.04254 | null |
| 2026-03-04 | Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Dongnuan Cai et al. | 2603.04128 | null |
| 2026-03-04 | Glass Segmentation with Fusion of Learned and General Visual Features | Risto Ojala et al. | 2603.03718 | null |
| 2026-03-03 | Hazard-Aware Traffic Scene Graph Generation | Yaoqi Huang et al. | 2603.03584 | null |
| 2026-03-03 | An Effective Data Augmentation Method by Asking Questions about Scene Text Images | Xu Yao et al. | 2603.03580 | null |
| 2026-03-03 | Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery | Muhammad Asad et al. | 2603.03571 | null |
| 2026-03-03 | Any Resolution Any Geometry: From Multi-View To Multi-Patch | Wenqing Cui et al. | 2603.03026 | null |
| 2026-03-03 | SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding | Sheng Ye et al. | 2603.02548 | null |
| 2026-03-02 | Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation | Jan Finke et al. | 2603.01999 | null |
| 2026-03-02 | WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration | Gong Chen et al. | 2603.01708 | null |
| 2026-03-02 | CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions | Gong Chen et al. | 2603.01688 | null |
| 2026-03-02 | WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments | Joshua Knights et al. | 2603.01475 | null |
| 2026-03-01 | Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding | Anna Michailidou et al. | 2603.01324 | null |
| 2026-03-01 | Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving | Xubo Zhu et al. | 2603.01007 | null |

(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)