Scene Understanding - 2025-12
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-12-31 | Spatial4D-Bench: A Versatile 4D Spatial Intelligence Benchmark | Pan Wang et.al. | 2601.00092 | translate | read | null |
| 2025-12-31 | UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning | Ankit Dhiman et.al. | 2512.24763 | translate | read | null |
| 2025-12-31 | 3D Semantic Segmentation for Post-Disaster Assessment | Nhut Le et.al. | 2512.24593 | translate | read | null |
| 2025-12-30 | Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models | Kim Alexander Christensen et.al. | 2512.24470 | translate | read | null |
| 2025-12-30 | Spatial-aware Vision Language Model for Autonomous Driving | Weijie Wei et.al. | 2512.24331 | translate | read | null |
| 2025-12-25 | Break Out the Silverware – Semantic Understanding of Stored Household Items | Michaela Levi-Richter et.al. | 2512.23739 | translate | read | null |
| 2025-12-29 | Multi-label Classification with Panoptic Context Aggregation Networks | Mingyuan Jiu et.al. | 2512.23486 | translate | read | null |
| 2025-12-29 | SpatialMosaic: A Multiview VLM Dataset for Partial Visibility | Kanghee Lee et.al. | 2512.23365 | translate | read | null |
| 2025-12-29 | AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding | Jongoh Jeong et.al. | 2512.23215 | translate | read | null |
| 2025-12-29 | GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation | Tianchen Deng et.al. | 2512.23180 | translate | read | null |
| 2025-12-28 | ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving | Qihang Peng et.al. | 2512.22939 | translate | read | null |
| 2025-12-28 | Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting | Yiqian Li et.al. | 2512.22771 | translate | read | null |
| 2025-12-27 | Instance Communication System for Intelligent Connected Vehicles: Bridging the Gap from Semantic to Instance-Level Transmission | Daiqi Zhang et.al. | 2512.22693 | translate | read | null |
| 2025-12-26 | VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement | Zhengfei Kuang et.al. | 2512.22351 | translate | read | null |
| 2025-12-24 | Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential | Shihao Zou et.al. | 2512.21284 | translate | read | null |
| 2025-12-23 | OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective | Markus Gross et.al. | 2512.20770 | translate | read | null |
| 2025-12-22 | CoDrone: Autonomous Drone Navigation Assisted by Edge and Cloud Foundation Models | Pengyu Chen et.al. | 2512.19083 | translate | read | null |
| 2025-12-22 | VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion | Zaidao Han et.al. | 2512.18954 | translate | read | null |
| 2025-12-21 | Multimodal Classification Network Guided Trajectory Planning for Four-Wheel Independent Steering Autonomous Parking Considering Obstacle Attributes | Jingjia Teng et.al. | 2512.18836 | translate | read | null |
| 2025-12-20 | LLaViDA: A Large Language Vision Driving Assistant for Explicit Reasoning and Enhanced Trajectory Planning | Yudong Liu et.al. | 2512.18211 | translate | read | null |
| 2025-12-19 | InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion | Hoiyeong Jin et.al. | 2512.17504 | translate | read | null |
| 2025-12-18 | MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning | Yuanchen Ju et.al. | 2512.16909 | translate | read | null |
| 2025-12-18 | SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning | Tin Stribor Sohn et.al. | 2512.16461 | translate | read | null |
| 2025-12-18 | Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception | Bangya Liu et.al. | 2512.16265 | translate | read | null |
| 2025-12-16 | Unified Semantic Transformer for 3D Scene Understanding | Sebastian Koch et.al. | 2512.14364 | translate | read | null |
| 2025-12-16 | Consistent Instance Field for Dynamic Scene Understanding | Junyi Wu et.al. | 2512.14126 | translate | read | null |
| 2025-12-16 | Deep Learning Perspective of Scene Understanding in Autonomous Robots | Afia Maham et.al. | 2512.14020 | translate | read | null |
| 2025-12-15 | I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners | Lu Ling et.al. | 2512.13683 | translate | read | null |
| 2025-12-15 | MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion | Minghui Hou et.al. | 2512.13177 | translate | read | null |
| 2025-12-15 | DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass | Vivek Alumootil et.al. | 2512.13122 | translate | read | null |
| 2025-12-15 | SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework | Anja Sheppard et.al. | 2512.12945 | translate | read | null |
| 2025-12-13 | INDOOR-LiDAR: Bridging Simulation and Reality for Robot-Centric 360 degree Indoor LiDAR Perception – A Robot-Centric Hybrid Dataset | Haichuan Li et.al. | 2512.12377 | translate | read | null |
| 2025-12-13 | MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding | Benjamin Beilharz et.al. | 2512.12307 | translate | read | null |
| 2025-12-13 | A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection | Peizheng Li et.al. | 2512.12205 | translate | read | null |
| 2025-12-13 | Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video | Daniel Adebi et.al. | 2512.12165 | translate | read | null |
| 2025-12-12 | Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis | Valentina Lilova et.al. | 2512.11574 | translate | read | null |
| 2025-12-12 | Reconstruction as a Bridge for Event-Based Visual Question Answering | Hanyue Lou et.al. | 2512.11510 | translate | read | null |
| 2025-12-12 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Emanuel Sánchez Aimar et.al. | 2512.11490 | translate | read | null |
| 2025-12-10 | LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating | Junting Chen et.al. | 2512.09920 | translate | read | null |
| 2025-12-09 | SIP: Site in Pieces- A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding | Seongyong Kim et.al. | 2512.09062 | translate | read | null |
| 2025-12-09 | LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training | Qing Xu et.al. | 2512.08439 | translate | read | null |
| 2025-12-09 | CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning | Zeyuan Chen et.al. | 2512.08135 | translate | read | null |
| 2025-12-08 | SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery | Meng Cao et.al. | 2512.07733 | translate | read | null |
| 2025-12-08 | STRinGS: Selective Text Refinement in Gaussian Splatting | Abhinav Raundhal et.al. | 2512.07230 | translate | read | null |
| 2025-12-08 | A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning | Siyang Jiang et.al. | 2512.07136 | translate | read | null |
| 2025-12-05 | Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction | Shilin Hu et.al. | 2512.06179 | translate | read | null |
| 2025-12-05 | BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving | Karthik Mohan et.al. | 2512.06096 | translate | read | null |
| 2025-12-05 | Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision | Lennart Maack et.al. | 2512.05740 | translate | read | null |
| 2025-12-05 | Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction | Ruihong Yin et.al. | 2512.05597 | translate | read | null |
| 2025-12-05 | VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation | Chinthani Sugandhika et.al. | 2512.05524 | translate | read | null |
| 2025-12-04 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Xianfeng Wu et.al. | 2512.05060 | translate | read | null |
| 2025-12-03 | C3G: Learning Compact 3D Representations with 2K Gaussians | Honggyu An et.al. | 2512.04021 | translate | read | null |
| 2025-12-03 | Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding | Haoran Zhou et.al. | 2512.03601 | translate | read | null |
| 2025-12-03 | What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models | Tianchen Deng et.al. | 2512.03422 | translate | read | null |
| 2025-12-03 | ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding | Lingjun Zhao et.al. | 2512.03370 | translate | read | null |
| 2025-12-02 | SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding | Hongpei Zheng et.al. | 2512.03284 | translate | read | null |
| 2025-12-02 | Layout Anything: One Transformer for Universal Room Layout Estimation | Md Sohag Mia et.al. | 2512.02952 | translate | read | null |
| 2025-12-02 | Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding | Yerim Jeon et.al. | 2512.02487 | translate | read | null |
| 2025-12-02 | HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild | Valentin Bieri et.al. | 2512.02450 | translate | read | null |
| 2025-12-01 | ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation | Chenyang Gu et.al. | 2512.02013 | translate | read | null |
| 2025-12-01 | OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic | Songyan Zhang et.al. | 2512.01830 | translate | read | null |
| 2025-12-01 | IGen: Scalable Data Generation for Robot Learning from Open-World Images | Chenghao Gu et.al. | 2512.01773 | translate | read | null |
| 2025-12-01 | SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge | Yumeng He et.al. | 2512.01629 | translate | read | null |
| 2025-12-01 | MDiff4STR: Mask Diffusion Model for Scene Text Recognition | Yongkun Du et.al. | 2512.01422 | translate | read | null |
| 2025-12-01 | VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering | Zihua Liu et.al. | 2512.01178 | translate | read | null |