Scene Understanding - 2025-07
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-07-31 | Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs | Bhavya Goyal et.al. | 2508.00169 | translate | read | null |
| 2025-07-31 | 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding | Ting Huang et.al. | 2507.23478 | translate | read | null |
| 2025-07-31 | FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models | Yiming Yang et.al. | 2507.23325 | translate | read | null |
| 2025-07-31 | FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning | Jiajun Cao et.al. | 2507.23318 | translate | read | null |
| 2025-07-30 | DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion | Qingcheng Zhao et.al. | 2507.22825 | translate | read | null |
| 2025-07-30 | UAVScenes: A Multi-Modal Dataset for UAVs | Sijie Wang et.al. | 2507.22412 | translate | read | null |
| 2025-07-29 | EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation | Zhijiang Li et.al. | 2507.21971 | translate | read | null |
| 2025-07-28 | GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction | Tianhao Li et.al. | 2507.20963 | translate | read | null |
| 2025-07-28 | Compositional Video Synthesis by Temporal Object-Centric Learning | Adil Kaan Akan et.al. | 2507.20855 | translate | read | null |
| 2025-07-27 | VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving | Levente Tempfli et.al. | 2507.20397 | translate | read | null |
| 2025-07-27 | Solving Scene Understanding for Autonomous Navigation in Unstructured Environments | Naveen Mathews Renji et.al. | 2507.20389 | translate | read | null |
| 2025-07-26 | FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images | Hao-Yu Hou et.al. | 2507.19993 | translate | read | null |
| 2025-07-26 | UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block | Luoxi Jing et.al. | 2507.19948 | translate | read | null |
| 2025-07-26 | RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection | Xiaokai Bai et.al. | 2507.19856 | translate | read | null |
| 2025-07-26 | Taking Language Embedded 3D Gaussian Splatting into the Wild | Yuze Wang et.al. | 2507.19830 | translate | read | null |
| 2025-07-25 | Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing | Haichuan Li et.al. | 2507.19691 | translate | read | null |
| 2025-07-25 | VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions | Haoang Lu et.al. | 2507.19188 | translate | read | null |
| 2025-07-24 | Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting | Xingyu Miao et.al. | 2507.18678 | translate | read | null |
| 2025-07-23 | From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding | Anna-Maria Halacheva et.al. | 2507.17585 | translate | read | null |
| 2025-07-23 | IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird’s-Eye View Perception | Haichuan Li et.al. | 2507.17445 | translate | read | null |
| 2025-07-22 | ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension | Yizhi Hu et.al. | 2507.16877 | translate | read | null |
| 2025-07-22 | Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge | Tobias Rueckert et.al. | 2507.16559 | translate | read | null |
| 2025-07-22 | Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach | Jon Gutiérrez-Zaballa et.al. | 2507.16556 | translate | read | null |
| 2025-07-22 | DenseSR: Image Shadow Removal as Dense Prediction | Yu-Fan Lin et.al. | 2507.16472 | translate | read | link |
| 2025-07-21 | Label tree semantic losses for rich multi-class medical image segmentation | Junwen Wang et.al. | 2507.15777 | translate | read | null |
| 2025-07-21 | Towards Holistic Surgical Scene Graph | Jongmin Shin et.al. | 2507.15541 | translate | read | null |
| 2025-07-21 | ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting | Ruijie Zhu et.al. | 2507.15454 | translate | read | link |
| 2025-07-21 | VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving | Haichao Liu et.al. | 2507.15266 | translate | read | null |
| 2025-07-19 | DiSCO-3D : Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF | Doriand Petit et.al. | 2507.14596 | translate | read | null |
| 2025-07-19 | Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions | Jintang Xue et.al. | 2507.14555 | translate | read | null |
| 2025-07-19 | Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025 | Sujata Gaihre et.al. | 2507.14544 | translate | read | null |
| 2025-07-19 | CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding | Zhou Chen et.al. | 2507.14426 | translate | read | null |
| 2025-07-18 | Semantic Segmentation based Scene Understanding in Autonomous Vehicles | Ehsan Rassekh et.al. | 2507.14303 | translate | read | null |
| 2025-07-18 | Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation | Masahiro Ogawa et.al. | 2507.13628 | translate | read | null |
| 2025-07-17 | Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection | Jingyao Wang et.al. | 2507.13061 | translate | read | null |
| 2025-07-17 | Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models | Yifan Xu et.al. | 2507.12916 | translate | read | null |
| 2025-07-17 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | Penglei Sun et.al. | 2507.12795 | translate | read | null |
| 2025-07-16 | Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection | Sandipan Sarma et.al. | 2507.12628 | translate | read | null |
| 2025-07-15 | Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis | Maciej Szankin et.al. | 2507.11730 | translate | read | null |
| 2025-07-15 | Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander | Li Wang et.al. | 2507.11079 | translate | read | null |
| 2025-07-15 | Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation | Yanbo Wang et.al. | 2507.11001 | translate | read | null |
| 2025-07-14 | Static or Temporal? Semantic Scene Simplification to Aid Wayfinding in Immersive Simulations of Bionic Vision | Justin M. Kasowski et.al. | 2507.10813 | translate | read | null |
| 2025-07-14 | EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Mingxian Lin et.al. | 2507.10548 | translate | read | link |
| 2025-07-13 | VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding | Younggun Kim et.al. | 2507.09815 | translate | read | null |
| 2025-07-13 | Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles | Yangang Ren et.al. | 2507.09537 | translate | read | null |
| 2025-07-12 | Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding | Wencan Huang et.al. | 2507.09334 | translate | read | null |
| 2025-07-12 | THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage | Trong-Thuan Nguyen et.al. | 2507.09200 | translate | read | null |
| 2025-07-12 | Towards Spatial Audio Understanding via Question Answering | Parthasaarathy Sudarsanam et.al. | 2507.09195 | translate | read | null |
| 2025-07-12 | On the Fragility of Multimodal Perception to Temporal Misalignment in Autonomous Driving | Md Hasan Shahriar et.al. | 2507.09095 | translate | read | null |
| 2025-07-10 | OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | JingLi Lin et.al. | 2507.07984 | translate | read | link |
| 2025-07-10 | MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation | Bangning Wei et.al. | 2507.07519 | translate | read | null |
| 2025-07-09 | SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds | Matthias Zeller et.al. | 2507.06906 | translate | read | null |
| 2025-07-09 | Token Bottleneck: One Token to Remember Dynamics | Taekyung Kim et.al. | 2507.06543 | translate | read | link |
| 2025-07-09 | What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies | Yaoqi Huang et.al. | 2507.06513 | translate | read | null |
| 2025-07-08 | Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion | Aleksandar Jevtić et.al. | 2507.06230 | translate | read | link |
| 2025-07-08 | SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning | Xin Hu et.al. | 2507.05798 | translate | read | null |
| 2025-07-07 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation | Zongyan Han et.al. | 2507.05211 | translate | read | null |
| 2025-07-07 | MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding | Jing Liang et.al. | 2507.04686 | translate | read | null |
| 2025-07-05 | Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation | Ziyu Zhu et.al. | 2507.04047 | translate | read | null |
| 2025-07-05 | Habitat Classification from Ground-Level Imagery Using Deep Neural Networks | Hongrui Shi et.al. | 2507.04017 | translate | read | null |
| 2025-07-04 | Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds | Matthias Zeller et.al. | 2507.03463 | translate | read | null |
| 2025-07-03 | LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans | Zhening Huang et.al. | 2507.02861 | translate | read | link |
| 2025-07-03 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion | Fangfu Liu et.al. | 2507.02813 | translate | read | link |
| 2025-07-03 | SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment | Qi Xu et.al. | 2507.02705 | translate | read | link |
| 2025-07-04 | Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach | Elena Ryumina et.al. | 2507.02205 | translate | read | link |
| 2025-07-02 | ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning | Xiao Wang et.al. | 2507.02200 | translate | read | null |
| 2025-07-02 | ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving | Kai Chen et.al. | 2507.01735 | translate | read | null |
| 2025-07-01 | GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond | Anna-Maria Halacheva et.al. | 2507.00886 | translate | read | null |
| 2025-07-01 | BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving | Zeming Chen et.al. | 2507.00707 | translate | read | null |
| 2025-07-01 | SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Yiming Huang et.al. | 2506.23309 | translate | read | null |