Scene Understanding - 2025-10
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-10-28 | A Comprehensive Survey on Surgical Digital Twin | Afsah Sharaf Khan et.al. | 2512.00019 | translate | read | null |
| 2025-10-30 | Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution | Shiyao Sang et.al. | 2511.05540 | translate | read | null |
| 2025-10-31 | The Eigenvalues Entropy as a Classifier Evaluation Measure | Doulaye Dembélé et.al. | 2511.01904 | translate | read | null |
| 2025-10-30 | AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency | Piyushkumar Patel et.al. | 2511.00107 | translate | read | null |
| 2025-10-31 | Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs | Sushil Samuel Dinesh et.al. | 2510.27558 | translate | read | null |
| 2025-10-31 | NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding | Wei Xu et.al. | 2510.27481 | translate | read | null |
| 2025-10-31 | Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing | Yijia Wang et.al. | 2510.27335 | translate | read | null |
| 2025-10-31 | Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis | Weiming Chen et.al. | 2510.27324 | translate | read | null |
| 2025-10-31 | HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition | Jiacheng Hong et.al. | 2510.27148 | translate | read | null |
| 2025-10-30 | A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics | Simindokht Jahangard et.al. | 2510.27033 | translate | read | null |
| 2025-10-30 | The ANUBIS detector and its sensitivity to neutral long-lived particles | ANUBIS Collaboration et.al. | 2510.26932 | translate | read | null |
| 2025-10-30 | HEIR: Learning Graph-Based Motion Hierarchies | Cheng Zheng et.al. | 2510.26786 | translate | read | null |
| 2025-10-30 | Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios | Manjunath Prasad Holenarasipura Rajiv et.al. | 2510.26580 | translate | read | null |
| 2025-10-30 | AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM | Mirko Usuelli et.al. | 2510.26358 | translate | read | null |
| 2025-10-30 | GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? | Mingyu Sung et.al. | 2510.26339 | translate | read | null |
| 2025-10-30 | Letter of Intent: The Forward Physics Facility | Luis A. Anchordoqui et.al. | 2510.26260 | translate | read | null |
| 2025-10-30 | Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM | Ali Caglayan et.al. | 2510.26131 | translate | read | null |
| 2025-10-29 | Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks | Xu Zheng et.al. | 2510.25760 | translate | read | link |
| 2025-10-29 | More than a Moment: Towards Coherent Sequences of Audio Descriptions | Eshika Khandelwal et.al. | 2510.25440 | translate | read | null |
| 2025-10-29 | U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching | Junsheng Zhou et.al. | 2510.25210 | translate | read | null |
| 2025-10-29 | EA3D: Online Open-World 3D Object Extraction from Streaming Videos | Xiaoyu Zhou et.al. | 2510.25146 | translate | read | null |
| 2025-10-29 | Learning Spatial-Aware Manipulation Ordering | Yuxiang Yan et.al. | 2510.25138 | translate | read | null |
| 2025-10-29 | Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments | Manjunath Prasad Holenarasipura Rajiv et.al. | 2510.25070 | translate | read | null |
| 2025-10-28 | VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos | Qiucheng Wu et.al. | 2510.24904 | translate | read | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | translate | read | link |
| 2025-10-28 | Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes | Jonas Hein et.al. | 2510.24332 | translate | read | null |
| 2025-10-28 | Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning | Aodi Wu et.al. | 2510.24152 | translate | read | null |
| 2025-10-27 | Optimized Loudspeaker Panning for Adaptive Sound-Field Correction and Non-stationary Listening Areas | Yuancheng Luo et.al. | 2510.23937 | translate | read | null |
| 2025-10-27 | DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning | Eddison Pham et.al. | 2510.23907 | translate | read | null |
| 2025-10-27 | Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations | Yujia Zhang et.al. | 2510.23607 | translate | read | null |
| 2025-10-27 | PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity | Yuqian Yuan et.al. | 2510.23603 | translate | read | link |
| 2025-10-27 | InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras | Erich Liang et.al. | 2510.23589 | translate | read | null |
| 2025-10-27 | Localising under the drape: proprioception in the era of distributed surgical robotic system | Martin Huber et.al. | 2510.23512 | translate | read | null |
| 2025-10-27 | UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception | Karthikeyan Chandra Sekaran et.al. | 2510.23478 | translate | read | null |
| 2025-10-27 | Evaluation of Spherical Wavelet Framework in Comparison with Ambisonics | Ş. Ekmen et.al. | 2510.23403 | translate | read | null |
| 2025-10-27 | Evaluation of Vision-LLMs in Surveillance Video | Pascal Benschop et.al. | 2510.23190 | translate | read | null |
| 2025-10-27 | Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI | Aryan Mathur et.al. | 2510.23148 | translate | read | null |
| 2025-10-27 | SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency | Quanjian Song et.al. | 2510.22994 | translate | read | null |
| 2025-10-27 | Charting the Design Space of Neural Graph Representations for Subgraph Matching | Vaibhav Raj et.al. | 2510.22897 | translate | read | null |
| 2025-10-26 | IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction | Hao Li et.al. | 2510.22706 | translate | read | link |
| 2025-10-26 | Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views | Anna Deichler et.al. | 2510.22672 | translate | read | null |
| 2025-10-25 | BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles | Seyed Ahmad Hosseini Miangoleh et.al. | 2510.22370 | translate | read | null |
| 2025-10-25 | Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments | Weixian Qian et.al. | 2510.22204 | translate | read | null |
| 2025-10-25 | MOGRAS: Human Motion with Grasping in 3D Scenes | Kunal Bhosikar et.al. | 2510.22199 | translate | read | null |
| 2025-10-25 | LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction | Yuhang Gao et.al. | 2510.22141 | translate | read | null |
| 2025-10-25 | CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding | Lihuang Fang et.al. | 2510.22119 | translate | read | null |
| 2025-10-07 | Avi: Action from Volumetric Inference | Harris Song et.al. | 2510.21746 | translate | read | null |
| 2025-10-24 | OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields | Lisa Weijler et.al. | 2510.21441 | translate | read | null |
| 2025-10-24 | ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models | Pranav Saxena et.al. | 2510.21069 | translate | read | null |
| 2025-10-22 | Uncertainty evaluation of segmentation models for Earth observation | Melanie Rey et.al. | 2510.19586 | translate | read | null |
| 2025-10-22 | Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization | Juncheng Wang et.al. | 2510.19330 | translate | read | null |
| 2025-10-21 | Event-Grounding Graph: Unified Spatio-Temporal Scene Graph from Robotic Observations | Phuoc Nguyen et.al. | 2510.18697 | translate | read | null |
| 2025-10-21 | MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning | Wenhui Huang et.al. | 2510.18337 | translate | read | null |
| 2025-10-21 | UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding | Da Zhang et.al. | 2510.18262 | translate | read | null |
| 2025-10-21 | OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion | Tianyu Huang et.al. | 2510.18253 | translate | read | null |
| 2025-10-20 | Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models | Katie Luo et.al. | 2510.17274 | translate | read | null |
| 2025-10-19 | SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes | Xiongkun Linghu et.al. | 2510.16714 | translate | read | null |
| 2025-10-18 | Structured Interfaces for Automated Reasoning with 3D Scene Graphs | Aaron Ray et.al. | 2510.16643 | translate | read | null |
| 2025-10-11 | ESCA: Contextualizing Embodied Agents via Scene-Graph Generation | Jiani Huang et.al. | 2510.15963 | translate | read | null |
| 2025-10-07 | GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments | Leela Krishna et.al. | 2510.14992 | translate | read | null |
| 2025-10-16 | QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps | Matti Pekkanen et.al. | 2510.14546 | translate | read | null |
| 2025-10-15 | Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models | Jia Yun Chua et.al. | 2510.13993 | translate | read | null |
| 2025-10-15 | SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB | Muhammad Ishfaq Hussain et.al. | 2510.13404 | translate | read | null |
| 2025-10-15 | FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding | Francesco Barbato et.al. | 2510.13243 | translate | read | null |
| 2025-10-14 | VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages | Jesse Atuhurra et.al. | 2510.12845 | translate | read | null |
| 2025-10-14 | SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding | Zhiliu Yang et.al. | 2510.12749 | translate | read | null |
| 2025-10-13 | PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation | Hatem Ibrahem et.al. | 2510.11992 | translate | read | null |
| 2025-10-13 | PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image | Pradyumna Yalandur Muralidhar et.al. | 2510.11649 | translate | read | null |
| 2025-10-13 | A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation | Denis Zavadski et.al. | 2510.11567 | translate | read | null |
| 2025-10-13 | mmWalk: Towards Multi-modal Multi-view Walking Assistance | Kedi Ying et.al. | 2510.11520 | translate | read | null |
| 2025-10-13 | REACT3D: Recovering Articulations for Interactive Physical 3D Scenes | Zhao Huang et.al. | 2510.11340 | translate | read | null |
| 2025-10-12 | Real2USD: Scene Representations in Universal Scene Description Language | Christopher D. Hsu et.al. | 2510.10778 | translate | read | null |
| 2025-10-11 | B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding | Feng Xiao et.al. | 2510.10194 | translate | read | null |
| 2025-10-10 | CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation | Kaiwen Wei et.al. | 2510.09266 | translate | read | null |
| 2025-10-08 | Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs | Hanieh Shojaei Miandashti et.al. | 2510.08631 | translate | read | null |
| 2025-10-03 | Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes | Nirmal Elamon et.al. | 2510.08589 | translate | read | null |
| 2025-10-09 | The impact of abstract and object tags on image privacy classification | Darya Baranouskaya et.al. | 2510.07976 | translate | read | null |
| 2025-10-09 | CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving | Tianrui Zhang et.al. | 2510.07944 | translate | read | null |
| 2025-10-09 | An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images | Kanglin Ning et.al. | 2510.07817 | translate | read | null |
| 2025-10-07 | Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model | Danush Kumar Venkatesh et.al. | 2510.07345 | translate | read | null |
| 2025-10-08 | Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion | Jie Luo et.al. | 2510.06687 | translate | read | null |
| 2025-10-07 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Daniel Gonzálbez-Biosca et.al. | 2510.05661 | translate | read | null |
| 2025-10-07 | HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video | Hongchi Xia et.al. | 2510.05560 | translate | read | null |
| 2025-10-06 | Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction | Chi Yan et.al. | 2510.04759 | translate | read | null |
| 2025-10-02 | LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition | Rixin Zhou et.al. | 2510.01651 | translate | read | null |
| 2025-10-01 | VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs | Mohamad Al Mdfaa et.al. | 2510.01483 | translate | read | null |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)