Scene Understanding - 2025-10 | Paper Arxiv Daily


Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-10-28	A Comprehensive Survey on Surgical Digital Twin	Afsah Sharaf Khan et.al.	2512.00019	translate	read	null
2025-10-30	Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution	Shiyao Sang et.al.	2511.05540	translate	read	null
2025-10-31	The Eigenvalues Entropy as a Classifier Evaluation Measure	Doulaye Dembélé et.al.	2511.01904	translate	read	null
2025-10-30	AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency	Piyushkumar Patel et.al.	2511.00107	translate	read	null
2025-10-31	Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs	Sushil Samuel Dinesh et.al.	2510.27558	translate	read	null
2025-10-31	NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding	Wei Xu et.al.	2510.27481	translate	read	null
2025-10-31	Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing	Yijia Wang et.al.	2510.27335	translate	read	null
2025-10-31	Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis	Weiming Chen et.al.	2510.27324	translate	read	null
2025-10-31	HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition	Jiacheng Hong et.al.	2510.27148	translate	read	null
2025-10-30	A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics	Simindokht Jahangard et.al.	2510.27033	translate	read	null
2025-10-30	The ANUBIS detector and its sensitivity to neutral long-lived particles	ANUBIS Collaboration et.al.	2510.26932	translate	read	null
2025-10-30	HEIR: Learning Graph-Based Motion Hierarchies	Cheng Zheng et.al.	2510.26786	translate	read	null
2025-10-30	Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios	Manjunath Prasad Holenarasipura Rajiv et.al.	2510.26580	translate	read	null
2025-10-30	AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM	Mirko Usuelli et.al.	2510.26358	translate	read	null
2025-10-30	GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?	Mingyu Sung et.al.	2510.26339	translate	read	null
2025-10-30	Letter of Intent: The Forward Physics Facility	Luis A. Anchordoqui et.al.	2510.26260	translate	read	null
2025-10-30	Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM	Ali Caglayan et.al.	2510.26131	translate	read	null
2025-10-29	Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks	Xu Zheng et.al.	2510.25760	translate	read	link
2025-10-29	More than a Moment: Towards Coherent Sequences of Audio Descriptions	Eshika Khandelwal et.al.	2510.25440	translate	read	null
2025-10-29	U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching	Junsheng Zhou et.al.	2510.25210	translate	read	null
2025-10-29	EA3D: Online Open-World 3D Object Extraction from Streaming Videos	Xiaoyu Zhou et.al.	2510.25146	translate	read	null
2025-10-29	Learning Spatial-Aware Manipulation Ordering	Yuxiang Yan et.al.	2510.25138	translate	read	null
2025-10-29	Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments	Manjunath Prasad Holenarasipura Rajiv et.al.	2510.25070	translate	read	null
2025-10-28	VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos	Qiucheng Wu et.al.	2510.24904	translate	read	null
2025-10-28	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	Inclusion AI et.al.	2510.24821	translate	read	link
2025-10-28	Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes	Jonas Hein et.al.	2510.24332	translate	read	null
2025-10-28	Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning	Aodi Wu et.al.	2510.24152	translate	read	null
2025-10-27	Optimized Loudspeaker Panning for Adaptive Sound-Field Correction and Non-stationary Listening Areas	Yuancheng Luo et.al.	2510.23937	translate	read	null
2025-10-27	DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning	Eddison Pham et.al.	2510.23907	translate	read	null
2025-10-27	Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations	Yujia Zhang et.al.	2510.23607	translate	read	null
2025-10-27	PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity	Yuqian Yuan et.al.	2510.23603	translate	read	link
2025-10-27	InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras	Erich Liang et.al.	2510.23589	translate	read	null
2025-10-27	Localising under the drape: proprioception in the era of distributed surgical robotic system	Martin Huber et.al.	2510.23512	translate	read	null
2025-10-27	UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception	Karthikeyan Chandra Sekaran et.al.	2510.23478	translate	read	null
2025-10-27	Evaluation of Spherical Wavelet Framework in Comparison with Ambisonics	Ş. Ekmen et.al.	2510.23403	translate	read	null
2025-10-27	Evaluation of Vision-LLMs in Surveillance Video	Pascal Benschop et.al.	2510.23190	translate	read	null
2025-10-27	Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI	Aryan Mathur et.al.	2510.23148	translate	read	null
2025-10-27	SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency	Quanjian Song et.al.	2510.22994	translate	read	null
2025-10-27	Charting the Design Space of Neural Graph Representations for Subgraph Matching	Vaibhav Raj et.al.	2510.22897	translate	read	null
2025-10-26	IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction	Hao Li et.al.	2510.22706	translate	read	link
2025-10-26	Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views	Anna Deichler et.al.	2510.22672	translate	read	null
2025-10-25	BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles	Seyed Ahmad Hosseini Miangoleh et.al.	2510.22370	translate	read	null
2025-10-25	Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments	Weixian Qian et.al.	2510.22204	translate	read	null
2025-10-25	MOGRAS: Human Motion with Grasping in 3D Scenes	Kunal Bhosikar et.al.	2510.22199	translate	read	null
2025-10-25	LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction	Yuhang Gao et.al.	2510.22141	translate	read	null
2025-10-25	CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding	Lihuang Fang et.al.	2510.22119	translate	read	null
2025-10-07	Avi: Action from Volumetric Inference	Harris Song et.al.	2510.21746	translate	read	null
2025-10-24	OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields	Lisa Weijler et.al.	2510.21441	translate	read	null
2025-10-24	ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models	Pranav Saxena et.al.	2510.21069	translate	read	null
2025-10-22	Uncertainty evaluation of segmentation models for Earth observation	Melanie Rey et.al.	2510.19586	translate	read	null
2025-10-22	Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization	Juncheng Wang et.al.	2510.19330	translate	read	null
2025-10-21	Event-Grounding Graph: Unified Spatio-Temporal Scene Graph from Robotic Observations	Phuoc Nguyen et.al.	2510.18697	translate	read	null
2025-10-21	MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning	Wenhui Huang et.al.	2510.18337	translate	read	null
2025-10-21	UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding	Da Zhang et.al.	2510.18262	translate	read	null
2025-10-21	OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion	Tianyu Huang et.al.	2510.18253	translate	read	null
2025-10-20	Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models	Katie Luo et.al.	2510.17274	translate	read	null
2025-10-19	SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes	Xiongkun Linghu et.al.	2510.16714	translate	read	null
2025-10-18	Structured Interfaces for Automated Reasoning with 3D Scene Graphs	Aaron Ray et.al.	2510.16643	translate	read	null
2025-10-11	ESCA: Contextualizing Embodied Agents via Scene-Graph Generation	Jiani Huang et.al.	2510.15963	translate	read	null
2025-10-07	GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments	Leela Krishna et.al.	2510.14992	translate	read	null
2025-10-16	QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps	Matti Pekkanen et.al.	2510.14546	translate	read	null
2025-10-15	Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models	Jia Yun Chua et.al.	2510.13993	translate	read	null
2025-10-15	SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB	Muhammad Ishfaq Hussain et.al.	2510.13404	translate	read	null
2025-10-15	FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding	Francesco Barbato et.al.	2510.13243	translate	read	null
2025-10-14	VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages	Jesse Atuhurra et.al.	2510.12845	translate	read	null
2025-10-14	SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding	Zhiliu Yang et.al.	2510.12749	translate	read	null
2025-10-13	PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation	Hatem Ibrahem et.al.	2510.11992	translate	read	null
2025-10-13	PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image	Pradyumna Yalandur Muralidhar et.al.	2510.11649	translate	read	null
2025-10-13	A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation	Denis Zavadski et.al.	2510.11567	translate	read	null
2025-10-13	mmWalk: Towards Multi-modal Multi-view Walking Assistance	Kedi Ying et.al.	2510.11520	translate	read	null
2025-10-13	REACT3D: Recovering Articulations for Interactive Physical 3D Scenes	Zhao Huang et.al.	2510.11340	translate	read	null
2025-10-12	Real2USD: Scene Representations in Universal Scene Description Language	Christopher D. Hsu et.al.	2510.10778	translate	read	null
2025-10-11	B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding	Feng Xiao et.al.	2510.10194	translate	read	null
2025-10-10	CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation	Kaiwen Wei et.al.	2510.09266	translate	read	null
2025-10-08	Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs	Hanieh Shojaei Miandashti et.al.	2510.08631	translate	read	null
2025-10-03	Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes	Nirmal Elamon et.al.	2510.08589	translate	read	null
2025-10-09	The impact of abstract and object tags on image privacy classification	Darya Baranouskaya et.al.	2510.07976	translate	read	null
2025-10-09	CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving	Tianrui Zhang et.al.	2510.07944	translate	read	null
2025-10-09	An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images	Kanglin Ning et.al.	2510.07817	translate	read	null
2025-10-07	Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model	Danush Kumar Venkatesh et.al.	2510.07345	translate	read	null
2025-10-08	Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion	Jie Luo et.al.	2510.06687	translate	read	null
2025-10-07	When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach	Daniel Gonzálbez-Biosca et.al.	2510.05661	translate	read	null
2025-10-07	HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video	Hongchi Xia et.al.	2510.05560	translate	read	null
2025-10-06	Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction	Chi Yan et.al.	2510.04759	translate	read	null
2025-10-02	LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition	Rixin Zhou et.al.	2510.01651	translate	read	null
2025-10-01	VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs	Mohamad Al Mdfaa et.al.	2510.01483	translate	read	null
