Scene Understanding - 2025-11

| Publish Date | Title | Authors | PDF | Code |
|---|---|---|---|---|
| 2025-11-29 | When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI | Yanhui Li et al. | 2512.03087 | null |
| 2025-11-30 | FOM-Nav: Frontier-Object Maps for Object Goal Navigation | Thomas Chabal et al. | 2512.01009 | null |
| 2025-11-30 | Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting | Haishan Wang et al. | 2512.00850 | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et al. | 2512.00565 | null |
| 2025-11-29 | Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR | Lixing Guo et al. | 2512.00294 | null |
| 2025-11-28 | DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation | Zirui Wang et al. | 2512.00226 | null |
| 2025-11-28 | DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | Hongfei Zhang et al. | 2511.23127 | null |
| 2025-11-28 | Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding | Anik De et al. | 2511.23071 | null |
| 2025-11-28 | HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model | Chen Li et al. | 2511.22961 | null |
| 2025-11-28 | See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection | YuEun Lee et al. | 2511.22906 | null |
| 2025-11-27 | GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes | Di Wang et al. | 2511.22645 | null |
| 2025-11-27 | CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving | Zhaohui Wang et al. | 2511.22532 | null |
| 2025-11-27 | RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding | Xiyan Liu et al. | 2511.22466 | null |
| 2025-11-26 | SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding | Tae-Min Choi et al. | 2511.21339 | null |
| 2025-11-26 | Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding | Yutao Tang et al. | 2511.21191 | null |
| 2025-11-26 | Scaling Foundation Models for Radar Scene Understanding | Pushkal Mishra et al. | 2511.21105 | null |
| 2025-11-25 | 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding | Xiaoye Wang et al. | 2511.20646 | null |
| 2025-11-25 | CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception | Miguel Carvalho et al. | 2511.19820 | null |
| 2025-11-24 | Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models | Jonathan Lee et al. | 2511.19526 | null |
| 2025-11-24 | Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving | Jianhua Han et al. | 2511.19221 | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et al. | 2511.18718 | null |
| 2025-11-24 | Autonomous Surface Selection For Manipulator-Based UV Disinfection In Hospitals Using Foundation Models | Xueyan Oh et al. | 2511.18709 | null |
| 2025-11-23 | Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span | Heeseung Yun et al. | 2511.18470 | null |
| 2025-11-22 | Plan-X: Instruct Video Generation via Semantic Planning | Lun Huang et al. | 2511.17986 | null |
| 2025-11-21 | CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation | Prantik Howlader et al. | 2511.17755 | null |
| 2025-11-18 | Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression | Siddiqua Namrah et al. | 2511.17612 | null |
| 2025-11-21 | SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation | Seamie Hayes et al. | 2511.17361 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et al. | 2511.17335 | null |
| 2025-11-20 | POMA-3D: The Point Map Way to 3D Scene Understanding | Ye Mao et al. | 2511.16567 | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et al. | 2511.16454 | null |
| 2025-11-20 | Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM | Gergely Dinya et al. | 2511.16282 | null |
| 2025-11-20 | How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots | Oliver Bimber et al. | 2511.16262 | null |
| 2025-11-20 | Real-Time 3D Object Detection with Inference-Aligned Learning | Chenyu Zhao et al. | 2511.16140 | null |
| 2025-11-20 | Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click | Raphael Ruschel et al. | 2511.15948 | null |
| 2025-11-19 | WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion | Sajjad Pakdamansavoji et al. | 2511.15874 | null |
| 2025-11-19 | ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation | Simon Boeder et al. | 2511.15396 | null |
| 2025-11-19 | Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception | Jiashu Yang et al. | 2511.15279 | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et al. | 2511.14948 | null |
| 2025-11-18 | Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models | Hao Zhen et al. | 2511.14120 | null |
| 2025-11-18 | Error-Driven Scene Editing for 3D Grounding in Large Language Models | Yue Zhang et al. | 2511.14086 | null |
| 2025-11-18 | RISE: Single Static Radar-based Indoor Scene Understanding | Kaichen Zhou et al. | 2511.14019 | null |
| 2025-11-17 | VLMs Guided Interpretable Decision Making for Autonomous Driving | Xin Hu et al. | 2511.13881 | null |
| 2025-11-17 | Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation | Lingfeng Zhang et al. | 2511.13269 | null |
| 2025-11-17 | Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving | Jiacheng Tang et al. | 2511.13079 | null |
| 2025-11-17 | Visual Room 2.0: Seeing is Not Understanding for MLLMs | Haokun Li et al. | 2511.12928 | null |
| 2025-11-16 | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation | Xiaoshuai Hao et al. | 2511.12436 | null |
| 2025-11-14 | Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy | Vinit Mehta et al. | 2511.11777 | null |
| 2025-11-13 | ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts | Haowen Jiang et al. | 2511.11740 | null |
| 2025-11-14 | AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning | Jirong Zha et al. | 2511.11025 | null |
| 2025-11-13 | DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation | Xuexun Liu et al. | 2511.10003 | null |
| 2025-11-12 | Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding | Jingtian Ma et al. | 2511.08978 | null |
| 2025-11-11 | RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation | Hae-Won Jo et al. | 2511.08651 | null |
| 2025-11-05 | Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants | I. Bailo et al. | 2511.08609 | null |
| 2025-11-11 | OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition | Lixu Sun et al. | 2511.08133 | null |
| 2025-11-11 | HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving | Zhiwen Yang et al. | 2511.07925 | null |
| 2025-11-11 | Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views | Haida Feng et al. | 2511.07813 | null |
| 2025-11-10 | Inference-Time Scaling of Diffusion Models for Infrared Data Generation | Kai A. Horstmann et al. | 2511.07362 | null |
| 2025-11-10 | PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving | Simon Gerstenecker et al. | 2511.07292 | null |
| 2025-11-10 | Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images | JiaKui Hu et al. | 2511.07222 | null |
| 2025-11-10 | TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding | Duc Nguyen et al. | 2511.07007 | null |
| 2025-11-10 | PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory | Qunchao Jin et al. | 2511.06840 | null |
| 2025-11-09 | Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR) | Tobias Rueckert et al. | 2511.06549 | null |
| 2025-11-08 | Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation | Lin Li et al. | 2511.05935 | null |
| 2025-11-08 | Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning | Fei Yu et al. | 2511.05894 | null |
| 2025-11-07 | Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots | Justin Williams et al. | 2511.05642 | null |
| 2025-11-06 | Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition | Nicholas Babey et al. | 2511.05622 | null |
| 2025-11-06 | GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies | Maëlic Neau et al. | 2511.04357 | null |
| 2025-11-06 | CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation | Yuwen Tao et al. | 2511.03992 | null |
| 2025-11-06 | Simple 3D Pose Features Support Human and Machine Social Scene Understanding | Wenshuo Qin et al. | 2511.03988 | null |
| 2025-11-06 | Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images | Sam Bahrami et al. | 2511.03970 | null |
| 2025-11-05 | SILVI: Simple Interface for Labeling Video Interactions | Ozan Kanbertay et al. | 2511.03819 | null |
| 2025-11-05 | SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding | Mauro Orazio Drago et al. | 2511.03325 | null |
| 2025-11-04 | LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation | Gyeom Hwangbo et al. | 2511.03001 | null |
| 2025-11-04 | DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding | Zixuan Liu et al. | 2511.02495 | null |
| 2025-11-04 | Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization | Tao Liu et al. | 2511.02489 | link |
| 2025-11-04 | From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics | Nicolas Schuler et al. | 2511.02427 | null |
| 2025-11-03 | Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis | Soham Joshi et al. | 2511.02046 | null |
| 2025-11-03 | A Compact Model for Polar Multiple-Channel Field Effect Transistors: A Case Study in III-V Nitride Semiconductors | Aias Asteris et al. | 2511.01699 | null |
| 2025-11-03 | Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models | Xiaoyu Zhan et al. | 2511.01618 | null |
| 2025-11-03 | PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model | Wenqi Liang et al. | 2511.01571 | null |
| 2025-11-03 | Fast and Robust Remote Two-Qubit Gates on Distributed Qubits | Yunan Li et al. | 2511.01418 | null |
| 2025-11-03 | A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model | Sampriti Soor et al. | 2511.01317 | null |
| 2025-11-03 | LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping | Lijie Wang et al. | 2511.01186 | null |
| 2025-11-02 | GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies | Ziye Wang et al. | 2511.00998 | null |
| 2025-11-01 | Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach | Oluwatosin Alabi et al. | 2511.00643 | null |
| 2025-11-01 | CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World | Yating Yu et al. | 2511.00613 | null |
| 2025-11-01 | Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models | Panwang Pan et al. | 2511.00503 | link |
