Scene Understanding - 2026-03
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-03-31 | SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes | Léopold Maillard et.al. | 2603.29798 | translate | read | null |
| 2026-03-31 | Hallucination-aware intermediate representation edit in large vision-language models | Wei Suo et.al. | 2603.29405 | translate | read | null |
| 2026-03-31 | VueBuds: Visual Intelligence with Wireless Earbuds | Maruchi Kim et.al. | 2603.29095 | translate | read | null |
| 2026-03-30 | Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial Infrastructure | Chao Yin et.al. | 2603.28660 | translate | read | null |
| 2026-03-30 | Unified Restoration-Perception Learning: Maritime Infrared-Visible Image Fusion and Segmentation | Weichao Cai et.al. | 2603.28414 | translate | read | null |
| 2026-03-30 | DiffAttn: Diffusion-Based Drivers’ Visual Attention Prediction with LLM-Enhanced Semantic Reasoning | Weimin Liu et.al. | 2603.28251 | translate | read | null |
| 2026-03-30 | To View Transform or Not to View Transform: NeRF-based Pre-training Perspective | Hyeonjun Jeong et.al. | 2603.28090 | translate | read | null |
| 2026-03-30 | SegRGB-X: General RGB-X Semantic Segmentation Model | Jiong Liu et.al. | 2603.28023 | translate | read | null |
| 2026-03-30 | ForestSim: A Synthetic Benchmark for Intelligent Vehicle Perception in Unstructured Forest Environments | Pragat Wagle et.al. | 2603.27923 | translate | read | null |
| 2026-03-25 | LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds | Jaehun Bang et.al. | 2603.24146 | translate | read | null |
| 2026-03-25 | MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation | Gengluo Li et.al. | 2603.23896 | translate | read | null |
| 2026-03-24 | SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes | Zhicheng Qiu et.al. | 2603.22893 | translate | read | null |
| 2026-03-23 | Generalized multi-object classification and tracking with sparse feature resonator networks | Lazar Supic et.al. | 2603.22539 | translate | read | null |
| 2026-03-23 | Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning | Minseok Kang et.al. | 2603.21559 | translate | read | null |
| 2026-03-22 | OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields | Aizierjiang Aiersilan et.al. | 2603.20999 | translate | read | null |
| 2026-03-20 | End-to-End Optimization of Polarimetric Measurement and Material Classifier | Ryota Maeda et.al. | 2603.20519 | translate | read | null |
| 2026-03-20 | IndoorR2X: Indoor Robot-to-Everything Coordination with LLM-Driven Planning | Fan Yang et.al. | 2603.20182 | translate | read | null |
| 2026-03-20 | Structured Latent Dynamics in Wireless CSI via Homomorphic World Models | Salmane Naoumi et.al. | 2603.20048 | translate | read | null |
| 2026-03-19 | Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding | Xianjin Wu et.al. | 2603.19235 | translate | read | null |
| 2026-03-19 | REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation | Shuqi Xiao et.al. | 2603.18624 | translate | read | null |
| 2026-03-19 | OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting | Hongjia Zhai et.al. | 2603.18510 | translate | read | null |
| 2026-03-18 | Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting | Guillem Casadesus Vila et.al. | 2603.18218 | translate | read | null |
| 2026-03-18 | GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes | Huajian Zeng et.al. | 2603.17993 | translate | read | null |
| 2026-03-18 | Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding | Shuyao Shi et.al. | 2603.17980 | translate | read | null |
| 2026-03-18 | SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale | Markus Gross et.al. | 2603.17920 | translate | read | null |
| 2026-03-18 | From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving | A. Humnabadkar et.al. | 2603.17714 | translate | read | null |
| 2026-03-18 | ReLaGS: Relational Language Gaussian Splatting | Yaxu Xie et.al. | 2603.17605 | translate | read | null |
| 2026-03-18 | P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation | Tianfu Li et.al. | 2603.17459 | translate | read | null |
| 2026-03-17 | $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space | Ruishan Guo et.al. | 2603.16671 | translate | read | null |
| 2026-03-17 | BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection | Melissa Schween et.al. | 2603.16645 | translate | read | null |
| 2026-03-17 | OGScene3D: Incremental Open-Vocabulary 3D Gaussian Scene Graph Mapping for Scene Understanding | Siting Zhu et.al. | 2603.16301 | translate | read | null |
| 2026-03-17 | Structured prototype regularization for synthetic-to-real driving scene parsing | Jiahe Fan et.al. | 2603.16083 | translate | read | null |
| 2026-03-16 | Safety Case Patterns for VLA-based driving systems: Insights from SimLingo | Gerhard Yu et.al. | 2603.16013 | translate | read | null |
| 2026-03-16 | Panoramic Affordance Prediction | Zixin Zhang et.al. | 2603.15558 | translate | read | null |
| 2026-03-16 | Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation | Yuanfan Zheng et.al. | 2603.15475 | translate | read | null |
| 2026-03-16 | Detection of Autonomous Shuttles in Urban Traffic Images Using Adaptive Residual Context | Mohamed Aziz Younes et.al. | 2603.15404 | translate | read | null |
| 2026-03-16 | RieMind: Geometry-Grounded Spatial Agent for Scene Understanding | Fernando Ropero et.al. | 2603.15386 | translate | read | null |
| 2026-03-16 | NavGSim: High-Fidelity Gaussian Splatting Simulator for Large-Scale Navigation | Jiahang Liu et.al. | 2603.15186 | translate | read | null |
| 2026-03-16 | AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving | Wenhui Huang et.al. | 2603.14851 | translate | read | null |
| 2026-03-16 | Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning | Heng Zhou et.al. | 2603.14811 | translate | read | null |
| 2026-03-16 | AURORA-KITTI: Any-Weather Depth Completion and Denoising in the Wild | Yiting Wang et.al. | 2603.14701 | translate | read | null |
| 2026-03-15 | WorldVLM: Combining World Model Forecasting and Vision-Language Reasoning | Stefan Englmeier et.al. | 2603.14497 | translate | read | null |
| 2026-03-15 | V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning | Lorenzo Mur-Labadia et.al. | 2603.14482 | translate | read | null |
| 2026-03-15 | VIP-Loco: A Visually Guided Infinite Horizon Planning Framework for Legged Locomotion | Aditya Shirwatkar et.al. | 2603.14345 | translate | read | null |
| 2026-03-15 | 4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding | Mohamed Rayan Barhdadi et.al. | 2603.14301 | translate | read | null |
| 2026-03-15 | S2GS: Streaming Semantic Gaussian Splatting for Online Scene Understanding and Reconstruction | Renhe Zhang et.al. | 2603.14232 | translate | read | null |
| 2026-03-12 | Seeing Isn’t Orienting: A Cognitively Grounded Benchmark Reveals Systematic Orientation Failures in MLLMs Supplementary | Nazia Tasnim et.al. | 2603.11410 | translate | read | null |
| 2026-03-11 | DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding | Mingzhe Tao et.al. | 2603.11380 | translate | read | null |
| 2026-03-11 | UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark | Yu Zhang et.al. | 2603.10722 | translate | read | null |
| 2026-03-11 | DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime | Julian Lorenz et.al. | 2603.10538 | translate | read | null |
| 2026-03-10 | RA-SSU: Towards Fine-Grained Audio-Visual Learning with Region-Aware Sound Source Understanding | Muyi Sun et.al. | 2603.09809 | translate | read | null |
| 2026-03-10 | More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | Weijia Fan et.al. | 2603.09573 | translate | read | null |
| 2026-03-10 | Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning | Chun-Peng Chang et.al. | 2603.09512 | translate | read | null |
| 2026-03-09 | APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model | Yuanjie Lu et.al. | 2603.08862 | translate | read | null |
| 2026-03-09 | Rethinking the semantic classification of indoor places by mobile robots | Oscar Martinez Mozos et.al. | 2603.08512 | translate | read | null |
| 2026-03-09 | UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing | Jiaxi Zhang et.al. | 2603.08131 | translate | read | null |
| 2026-03-09 | SGG-R$^{3}$: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation | Jiaye Feng et.al. | 2603.07961 | translate | read | null |
| 2026-03-09 | Toward Unified Multimodal Representation Learning for Autonomous Driving | Ximeng Tao et.al. | 2603.07874 | translate | read | null |
| 2026-03-08 | Efficient RGB-D Scene Understanding via Multi-task Adaptive Learning and Cross-dimensional Feature Guidance | Guodong Sun et.al. | 2603.07570 | translate | read | null |
| 2026-03-06 | AV-Unified: A Unified Framework for Audio-visual Scene Understanding | Guangyao Li et.al. | 2603.06530 | translate | read | null |
| 2026-03-06 | REACT++: Efficient Cross-Attention for Real-Time Scene Graph Generation | Maëlic Neau et.al. | 2603.06386 | translate | read | null |
| 2026-03-06 | VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction | Xiaoyang Yan et.al. | 2603.06210 | translate | read | null |
| 2026-03-06 | JOPP-3D: Joint Open Vocabulary Semantic Segmentation on Point Clouds and Panoramas | Sandeep Inuganti et.al. | 2603.06168 | translate | read | null |
| 2026-03-06 | FreeOcc: Training-free Panoptic Occupancy Prediction via Foundation Models | Andrew Caunes et.al. | 2603.06166 | translate | read | null |
| 2026-03-06 | Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion | Bohai Gu et.al. | 2603.06140 | translate | read | null |
| 2026-03-06 | DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model | Hao Yang et.al. | 2603.06090 | translate | read | null |
| 2026-03-06 | Multimodal Behavior Tree Generation: A Small Vision-Language Model for Robot Task Planning | Cristiano Battistini et.al. | 2603.06084 | translate | read | null |
| 2026-03-06 | Pano3DComposer: Feed-Forward Compositional 3D Scene Generation from Single Panoramic Image | Zidian Qiu et.al. | 2603.05908 | translate | read | null |
| 2026-03-05 | Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields | Scout Jarman et.al. | 2603.05473 | translate | read | null |
| 2026-03-05 | CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception | Gong Chen et.al. | 2603.05255 | translate | read | null |
| 2026-03-05 | 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding | Xiongkun Linghu et.al. | 2603.04976 | translate | read | null |
| 2026-03-05 | Roomify: Spatially-Grounded Style Transformation for Immersive Virtual Environments | Xueyang Wang et.al. | 2603.04917 | translate | read | null |
| 2026-03-04 | SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D | Zirui Wang et.al. | 2603.04614 | translate | read | null |
| 2026-03-04 | EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding | Seungjun Lee et.al. | 2603.04254 | translate | read | null |
| 2026-03-04 | Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Dongnuan Cai et.al. | 2603.04128 | translate | read | null |
| 2026-03-04 | Glass Segmentation with Fusion of Learned and General Visual Features | Risto Ojala et.al. | 2603.03718 | translate | read | null |
| 2026-03-03 | Hazard-Aware Traffic Scene Graph Generation | Yaoqi Huang et.al. | 2603.03584 | translate | read | null |
| 2026-03-03 | An Effective Data Augmentation Method by Asking Questions about Scene Text Images | Xu Yao et.al. | 2603.03580 | translate | read | null |
| 2026-03-03 | Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery | Muhammad Asad et.al. | 2603.03571 | translate | read | null |
| 2026-03-03 | Any Resolution Any Geometry: From Multi-View To Multi-Patch | Wenqing Cui et.al. | 2603.03026 | translate | read | null |
| 2026-03-03 | SemGS: Feed-Forward Semantic 3D Gaussian Splatting from Sparse Views for Generalizable Scene Understanding | Sheng Ye et.al. | 2603.02548 | translate | read | null |
| 2026-03-02 | Learning Vision-Based Omnidirectional Navigation: A Teacher-Student Approach Using Monocular Depth Estimation | Jan Finke et.al. | 2603.01999 | translate | read | null |
| 2026-03-02 | WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration | Gong Chen et.al. | 2603.01708 | translate | read | null |
| 2026-03-02 | CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions | Gong Chen et.al. | 2603.01688 | translate | read | null |
| 2026-03-02 | WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments | Joshua Knights et.al. | 2603.01475 | translate | read | null |
| 2026-03-01 | Open-Vocabulary vs Supervised Learning Methods for Post-Disaster Visual Scene Understanding | Anna Michailidou et.al. | 2603.01324 | translate | read | null |
| 2026-03-01 | Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving | Xubo Zhu et.al. | 2603.01007 | translate | read | null |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)