Scene Understanding - 2025-09
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-09-30 | Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification | Artur Barros et.al. | 2509.26457 | translate | read | null |
| 2025-09-30 | Neighbor-aware informal settlement mapping with graph convolutional networks | Thomas Hallopeau et.al. | 2509.26171 | translate | read | null |
| 2025-09-30 | Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models | Yuansen Liu et.al. | 2509.26165 | translate | read | null |
| 2025-09-30 | EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models | Seamie Hayes et.al. | 2509.26087 | translate | read | null |
| 2025-09-30 | VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs | Peng Liu et.al. | 2509.25916 | translate | read | null |
| 2025-09-29 | PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos | Ting-Hsuan Liao et.al. | 2509.25183 | translate | read | null |
| 2025-09-29 | Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs | Yue Zhang et.al. | 2509.25139 | translate | read | null |
| 2025-09-29 | Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots | Ermanno Bartoli et.al. | 2509.24966 | translate | read | null |
| 2025-09-29 | CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D | Mohamad Amin Mirzaei et.al. | 2509.24528 | translate | read | null |
| 2025-09-29 | PhysiAgent: An Embodied Agent Framework in Physical World | Zhihao Wang et.al. | 2509.24524 | translate | read | null |
| 2025-09-29 | Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy | Haijier Chen et.al. | 2509.24385 | translate | read | null |
| 2025-09-29 | Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context | Yongqiang Wang et.al. | 2509.24275 | translate | read | null |
| 2025-09-28 | FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing | Yi Yang et.al. | 2509.23927 | translate | read | null |
| 2025-09-28 | Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation | Hanyu Zhou et.al. | 2509.23828 | translate | read | null |
| 2025-09-28 | From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving | Yixiao Chen et.al. | 2509.23641 | translate | read | null |
| 2025-09-28 | From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations | Javed Ahmad et.al. | 2509.23555 | translate | read | null |
| 2025-09-26 | Good Weights: Proactive, Adaptive Dead Reckoning Fusion for Continuous and Robust Visual SLAM | Yanwei Du et.al. | 2509.22910 | translate | read | null |
| 2025-09-20 | Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment | Abhiroop Chatterjee et.al. | 2509.22697 | translate | read | null |
| 2025-09-26 | UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective | Jun He et.al. | 2509.22228 | translate | read | null |
| 2025-09-26 | Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics | Saurav Jha et.al. | 2509.22014 | translate | read | null |
| 2025-09-26 | Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding | Vahid Mirjalili et.al. | 2509.21922 | translate | read | null |
| 2025-09-25 | Real-Time Indoor Object SLAM with LLM-Enhanced Priors | Yang Jiao et.al. | 2509.21602 | translate | read | null |
| 2025-09-25 | Residual Vector Quantization For Communication-Efficient Multi-Agent Perception | Dereje Shenkut et.al. | 2509.21464 | translate | read | null |
| 2025-09-23 | TUN3D: Towards Real-World Scene Understanding from Unposed Images | Anton Konushin et.al. | 2509.21388 | translate | read | link |
| 2025-09-25 | DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection | Jiayi Zuo et.al. | 2509.20701 | translate | read | null |
| 2025-09-23 | SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment | Binod Singh et.al. | 2509.20401 | translate | read | null |
| 2025-09-24 | Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning | Xun Li et.al. | 2509.20077 | translate | read | null |
| 2025-09-24 | OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving | Pei Liu et.al. | 2509.19973 | translate | read | null |
| 2025-09-23 | Category-Level Object Shape and Pose Estimation in Less Than a Millisecond | Lorenzo Shaikewitz et.al. | 2509.18979 | translate | read | null |
| 2025-09-23 | Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations | Hanqing Liu et.al. | 2509.18953 | translate | read | null |
| 2025-09-23 | Surgical Video Understanding with Label Interpolation | Garam Kim et.al. | 2509.18802 | translate | read | null |
| 2025-09-23 | MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning | Omar Rayyan et.al. | 2509.18757 | translate | read | null |
| 2025-09-23 | PIE: Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving | Chengran Yuan et.al. | 2509.18609 | translate | read | null |
| 2025-09-22 | Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration | Zhitao Zeng et.al. | 2509.17429 | translate | read | null |
| 2025-09-20 | Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding | Haoyuan Li et.al. | 2509.16721 | translate | read | null |
| 2025-09-20 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Xiaoyang Yan et.al. | 2509.16552 | translate | read | null |
| 2025-09-19 | Towards Sharper Object Boundaries in Self-Supervised Depth Estimation | Aurélien Cecille et.al. | 2509.15987 | translate | read | null |
| 2025-09-19 | RangeSAM: Leveraging Visual Foundation Models for Range-View represented LiDAR segmentation | Paul Julius Kühn et.al. | 2509.15886 | translate | read | null |
| 2025-09-19 | SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models | Sen Wang et.al. | 2509.15536 | translate | read | null |
| 2025-09-18 | Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems | Yicheng Zhang et.al. | 2509.15213 | translate | read | null |
| 2025-09-18 | SPATIALGEN: Layout-guided 3D Indoor Scene Generation | Chuan Fang et.al. | 2509.14981 | translate | read | link |
| 2025-09-16 | Semantic 3D Reconstructions with SLAM for Central Airway Obstruction | Ayberk Acar et.al. | 2509.13541 | translate | read | null |
| 2025-09-16 | ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors | Romain Hardy et.al. | 2509.13525 | translate | read | null |
| 2025-09-16 | 3D Aware Region Prompted Vision Language Model | An-Chieh Cheng et.al. | 2509.13317 | translate | read | null |
| 2025-09-16 | Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving | Ruibo Li et.al. | 2509.13116 | translate | read | null |
| 2025-09-16 | Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings | Abdalla Arafa et.al. | 2509.12938 | translate | read | null |
| 2025-09-16 | MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization | Yiyi Zhang et.al. | 2509.12893 | translate | read | null |
| 2025-09-15 | RailSafeNet: Visual Scene Understanding for Tram Safety | Ondřej Valach et.al. | 2509.12125 | translate | read | link |
| 2025-09-15 | Microsurgical Instrument Segmentation for Robot-Assisted Surgery | Tae Kyeong Jeong et.al. | 2509.11727 | translate | read | null |
| 2025-09-15 | See What I Mean? Mobile Eye-Perspective Rendering for Optical See-through Head-mounted Displays | Gerlinde Emsenhuber et.al. | 2509.11653 | translate | read | null |
| 2025-09-14 | Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision | Tianyao Sun et.al. | 2509.11476 | translate | read | null |
| 2025-09-14 | DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation | Yunheng Wang et.al. | 2509.11197 | translate | read | null |
| 2025-09-14 | 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment | Nhut Le et.al. | 2509.11097 | translate | read | null |
| 2025-09-13 | OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds | Chongyu Wang et.al. | 2509.10842 | translate | read | null |
| 2025-09-12 | Multimodal SAM-adapter for Semantic Segmentation | Iacopo Curti et.al. | 2509.10408 | translate | read | null |
| 2025-09-10 | SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation | Michael J. Munje et.al. | 2509.08757 | translate | read | null |
| 2025-09-09 | OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics | Yinan Deng et.al. | 2509.07500 | translate | read | null |
| 2025-09-09 | DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis | Sven Kirchner et.al. | 2509.07463 | translate | read | null |
| 2025-09-08 | Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception | Ensong Liu et.al. | 2509.06506 | translate | read | null |
| 2025-09-07 | UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning | Huy Le et.al. | 2509.06165 | translate | read | null |
| 2025-09-06 | Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation | Tianhao Guo et.al. | 2509.05746 | translate | read | null |
| 2025-09-05 | SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing | Chaolei Wang et.al. | 2509.05144 | translate | read | null |
| 2025-09-03 | Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding | Hongpei Zheng et.al. | 2509.03635 | translate | read | null |
| 2025-09-03 | Rashomon in the Streets: Explanation Ambiguity in Scene Understanding | Helge Spieker et.al. | 2509.03169 | translate | read | null |
| 2025-09-02 | Generalizable Skill Learning for Construction Robots with Crowdsourced Natural Language Instructions, Composable Skills Standardization, and Large Language Model | Hongrui Yu et.al. | 2509.02876 | translate | read | null |
| 2025-09-02 | SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images | Pushpendra Dhakara et.al. | 2509.02287 | translate | read | null |
| 2025-09-02 | Omnidirectional Spatial Modeling from Correlated Panoramas | Xinshen Zhang et.al. | 2509.02164 | translate | read | null |
| 2025-09-02 | AI-Driven Marine Robotics: Emerging Trends in Underwater Perception and Ecosystem Monitoring | Scarlett Raine et.al. | 2509.01878 | translate | read | null |
| 2025-09-01 | Articulated Object Estimation in the Wild | Abdelrhman Werby et.al. | 2509.01708 | translate | read | null |
| 2025-09-01 | Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation | Maëlic Neau et.al. | 2509.01209 | translate | read | null |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)