Scene Understanding - 2025-11
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-11-29 | When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI | Yanhui Li et.al. | 2512.03087 | translate | read | null |
| 2025-11-30 | FOM-Nav: Frontier-Object Maps for Object Goal Navigation | Thomas Chabal et.al. | 2512.01009 | translate | read | null |
| 2025-11-30 | Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting | Haishan Wang et.al. | 2512.00850 | translate | read | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et.al. | 2512.00565 | translate | read | null |
| 2025-11-29 | Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR | Lixing Guo et.al. | 2512.00294 | translate | read | null |
| 2025-11-28 | DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation | Zirui Wang et.al. | 2512.00226 | translate | read | null |
| 2025-11-28 | DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | Hongfei Zhang et.al. | 2511.23127 | translate | read | null |
| 2025-11-28 | Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding | Anik De et.al. | 2511.23071 | translate | read | null |
| 2025-11-28 | HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model | Chen Li et.al. | 2511.22961 | translate | read | null |
| 2025-11-28 | See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection | YuEun Lee et.al. | 2511.22906 | translate | read | null |
| 2025-11-27 | GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes | Di Wang et.al. | 2511.22645 | translate | read | null |
| 2025-11-27 | CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving | Zhaohui Wang et.al. | 2511.22532 | translate | read | null |
| 2025-11-27 | RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding | Xiyan Liu et.al. | 2511.22466 | translate | read | null |
| 2025-11-26 | SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding | Tae-Min Choi et.al. | 2511.21339 | translate | read | null |
| 2025-11-26 | Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding | Yutao Tang et.al. | 2511.21191 | translate | read | null |
| 2025-11-26 | Scaling Foundation Models for Radar Scene Understanding | Pushkal Mishra et.al. | 2511.21105 | translate | read | null |
| 2025-11-25 | 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding | Xiaoye Wang et.al. | 2511.20646 | translate | read | null |
| 2025-11-25 | CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception | Miguel Carvalho et.al. | 2511.19820 | translate | read | null |
| 2025-11-24 | Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models | Jonathan Lee et.al. | 2511.19526 | translate | read | null |
| 2025-11-24 | Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving | Jianhua Han et.al. | 2511.19221 | translate | read | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | translate | read | null |
| 2025-11-24 | Autonomous Surface Selection For Manipulator-Based UV Disinfection In Hospitals Using Foundation Models | Xueyan Oh et.al. | 2511.18709 | translate | read | null |
| 2025-11-23 | Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span | Heeseung Yun et.al. | 2511.18470 | translate | read | null |
| 2025-11-22 | Plan-X: Instruct Video Generation via Semantic Planning | Lun Huang et.al. | 2511.17986 | translate | read | null |
| 2025-11-21 | CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation | Prantik Howlader et.al. | 2511.17755 | translate | read | null |
| 2025-11-18 | Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression | Siddiqua Namrah et.al. | 2511.17612 | translate | read | null |
| 2025-11-21 | SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation | Seamie Hayes et.al. | 2511.17361 | translate | read | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | translate | read | null |
| 2025-11-20 | POMA-3D: The Point Map Way to 3D Scene Understanding | Ye Mao et.al. | 2511.16567 | translate | read | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et.al. | 2511.16454 | translate | read | null |
| 2025-11-20 | Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM | Gergely Dinya et.al. | 2511.16282 | translate | read | null |
| 2025-11-20 | How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots | Oliver Bimber et.al. | 2511.16262 | translate | read | null |
| 2025-11-20 | Real-Time 3D Object Detection with Inference-Aligned Learning | Chenyu Zhao et.al. | 2511.16140 | translate | read | null |
| 2025-11-20 | Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click | Raphael Ruschel et.al. | 2511.15948 | translate | read | null |
| 2025-11-19 | WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion | Sajjad Pakdamansavoji et.al. | 2511.15874 | translate | read | null |
| 2025-11-19 | ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation | Simon Boeder et.al. | 2511.15396 | translate | read | null |
| 2025-11-19 | Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception | Jiashu Yang et.al. | 2511.15279 | translate | read | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et.al. | 2511.14948 | translate | read | null |
| 2025-11-18 | Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models | Hao Zhen et.al. | 2511.14120 | translate | read | null |
| 2025-11-18 | Error-Driven Scene Editing for 3D Grounding in Large Language Models | Yue Zhang et.al. | 2511.14086 | translate | read | null |
| 2025-11-18 | RISE: Single Static Radar-based Indoor Scene Understanding | Kaichen Zhou et.al. | 2511.14019 | translate | read | null |
| 2025-11-17 | VLMs Guided Interpretable Decision Making for Autonomous Driving | Xin Hu et.al. | 2511.13881 | translate | read | null |
| 2025-11-17 | Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation | Lingfeng Zhang et.al. | 2511.13269 | translate | read | null |
| 2025-11-17 | Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving | Jiacheng Tang et.al. | 2511.13079 | translate | read | null |
| 2025-11-17 | Visual Room 2.0: Seeing is Not Understanding for MLLMs | Haokun Li et.al. | 2511.12928 | translate | read | null |
| 2025-11-16 | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation | Xiaoshuai Hao et.al. | 2511.12436 | translate | read | null |
| 2025-11-14 | Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy | Vinit Mehta et.al. | 2511.11777 | translate | read | null |
| 2025-11-13 | ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts | Haowen Jiang et.al. | 2511.11740 | translate | read | null |
| 2025-11-14 | AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning | Jirong Zha et.al. | 2511.11025 | translate | read | null |
| 2025-11-13 | DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation | Xuexun Liu et.al. | 2511.10003 | translate | read | null |
| 2025-11-12 | Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding | Jingtian Ma et.al. | 2511.08978 | translate | read | null |
| 2025-11-11 | RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation | Hae-Won Jo et.al. | 2511.08651 | translate | read | null |
| 2025-11-05 | Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants | I. Bailo et.al. | 2511.08609 | translate | read | null |
| 2025-11-11 | OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition | Lixu Sun et.al. | 2511.08133 | translate | read | null |
| 2025-11-11 | HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving | Zhiwen Yang et.al. | 2511.07925 | translate | read | null |
| 2025-11-11 | Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views | Haida Feng et.al. | 2511.07813 | translate | read | null |
| 2025-11-10 | Inference-Time Scaling of Diffusion Models for Infrared Data Generation | Kai A. Horstmann et.al. | 2511.07362 | translate | read | null |
| 2025-11-10 | PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving | Simon Gerstenecker et.al. | 2511.07292 | translate | read | null |
| 2025-11-10 | Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images | JiaKui Hu et.al. | 2511.07222 | translate | read | null |
| 2025-11-10 | TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding | Duc Nguyen et.al. | 2511.07007 | translate | read | null |
| 2025-11-10 | PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory | Qunchao Jin et.al. | 2511.06840 | translate | read | null |
| 2025-11-09 | Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR) | Tobias Rueckert et.al. | 2511.06549 | translate | read | null |
| 2025-11-08 | Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation | Lin Li et.al. | 2511.05935 | translate | read | null |
| 2025-11-08 | Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning | Fei Yu et.al. | 2511.05894 | translate | read | null |
| 2025-11-07 | Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots | Justin Williams et.al. | 2511.05642 | translate | read | null |
| 2025-11-06 | Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition | Nicholas Babey et.al. | 2511.05622 | translate | read | null |
| 2025-11-06 | GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies | Maëlic Neau et.al. | 2511.04357 | translate | read | null |
| 2025-11-06 | CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation | Yuwen Tao et.al. | 2511.03992 | translate | read | null |
| 2025-11-06 | Simple 3D Pose Features Support Human and Machine Social Scene Understanding | Wenshuo Qin et.al. | 2511.03988 | translate | read | null |
| 2025-11-06 | Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images | Sam Bahrami et.al. | 2511.03970 | translate | read | null |
| 2025-11-05 | SILVI: Simple Interface for Labeling Video Interactions | Ozan Kanbertay et.al. | 2511.03819 | translate | read | null |
| 2025-11-05 | SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding | Mauro Orazio Drago et.al. | 2511.03325 | translate | read | null |
| 2025-11-04 | LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation | Gyeom Hwangbo et.al. | 2511.03001 | translate | read | null |
| 2025-11-04 | DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding | Zixuan Liu et.al. | 2511.02495 | translate | read | null |
| 2025-11-04 | Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization | Tao Liu et.al. | 2511.02489 | translate | read | link |
| 2025-11-04 | From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics | Nicolas Schuler et.al. | 2511.02427 | translate | read | null |
| 2025-11-03 | Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis | Soham Joshi et.al. | 2511.02046 | translate | read | null |
| 2025-11-03 | A Compact Model for Polar Multiple-Channel Field Effect Transistors: A Case Study in III-V Nitride Semiconductors | Aias Asteris et.al. | 2511.01699 | translate | read | null |
| 2025-11-03 | Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models | Xiaoyu Zhan et.al. | 2511.01618 | translate | read | null |
| 2025-11-03 | PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model | Wenqi Liang et.al. | 2511.01571 | translate | read | null |
| 2025-11-03 | Fast and Robust Remote Two-Qubit Gates on Distributed Qubits | Yunan Li et.al. | 2511.01418 | translate | read | null |
| 2025-11-03 | A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model | Sampriti Soor et.al. | 2511.01317 | translate | read | null |
| 2025-11-03 | LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping | Lijie Wang et.al. | 2511.01186 | translate | read | null |
| 2025-11-02 | GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies | Ziye Wang et.al. | 2511.00998 | translate | read | null |
| 2025-11-01 | Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach | Oluwatosin Alabi et.al. | 2511.00643 | translate | read | null |
| 2025-11-01 | CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World | Yating Yu et.al. | 2511.00613 | translate | read | null |
| 2025-11-01 | Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models | Panwang Pan et.al. | 2511.00503 | translate | read | link |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)