Scene Understanding - 2025-07 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Code
2025-07-31	Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs	Bhavya Goyal et al.	2508.00169	null
2025-07-31	3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding	Ting Huang et al.	2507.23478	null
2025-07-31	FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models	Yiming Yang et al.	2507.23325	null
2025-07-31	FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning	Jiajun Cao et al.	2507.23318	null
2025-07-30	DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion	Qingcheng Zhao et al.	2507.22825	null
2025-07-30	UAVScenes: A Multi-Modal Dataset for UAVs	Sijie Wang et al.	2507.22412	null
2025-07-29	EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation	Zhijiang Li et al.	2507.21971	null
2025-07-28	GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction	Tianhao Li et al.	2507.20963	null
2025-07-28	Compositional Video Synthesis by Temporal Object-Centric Learning	Adil Kaan Akan et al.	2507.20855	null
2025-07-27	VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving	Levente Tempfli et al.	2507.20397	null
2025-07-27	Solving Scene Understanding for Autonomous Navigation in Unstructured Environments	Naveen Mathews Renji et al.	2507.20389	null
2025-07-26	FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images	Hao-Yu Hou et al.	2507.19993	null
2025-07-26	UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block	Luoxi Jing et al.	2507.19948	null
2025-07-26	RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection	Xiaokai Bai et al.	2507.19856	null
2025-07-26	Taking Language Embedded 3D Gaussian Splatting into the Wild	Yuze Wang et al.	2507.19830	null
2025-07-25	Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing	Haichuan Li et al.	2507.19691	null
2025-07-25	VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions	Haoang Lu et al.	2507.19188	null
2025-07-24	Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting	Xingyu Miao et al.	2507.18678	null
2025-07-23	From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding	Anna-Maria Halacheva et al.	2507.17585	null
2025-07-23	IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird’s-Eye View Perception	Haichuan Li et al.	2507.17445	null
2025-07-22	ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension	Yizhi Hu et al.	2507.16877	null
2025-07-22	Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge	Tobias Rueckert et al.	2507.16559	null
2025-07-22	Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach	Jon Gutiérrez-Zaballa et al.	2507.16556	null
2025-07-22	DenseSR: Image Shadow Removal as Dense Prediction	Yu-Fan Lin et al.	2507.16472	link
2025-07-21	Label tree semantic losses for rich multi-class medical image segmentation	Junwen Wang et al.	2507.15777	null
2025-07-21	Towards Holistic Surgical Scene Graph	Jongmin Shin et al.	2507.15541	null
2025-07-21	ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting	Ruijie Zhu et al.	2507.15454	link
2025-07-21	VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving	Haichao Liu et al.	2507.15266	null
2025-07-19	DiSCO-3D : Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF	Doriand Petit et al.	2507.14596	null
2025-07-19	Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions	Jintang Xue et al.	2507.14555	null
2025-07-19	Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025	Sujata Gaihre et al.	2507.14544	null
2025-07-19	CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding	Zhou Chen et al.	2507.14426	null
2025-07-18	Semantic Segmentation based Scene Understanding in Autonomous Vehicles	Ehsan Rassekh et al.	2507.14303	null
2025-07-18	Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation	Masahiro Ogawa et al.	2507.13628	null
2025-07-17	Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection	Jingyao Wang et al.	2507.13061	null
2025-07-17	Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models	Yifan Xu et al.	2507.12916	null
2025-07-17	City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning	Penglei Sun et al.	2507.12795	null
2025-07-16	Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection	Sandipan Sarma et al.	2507.12628	null
2025-07-15	Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis	Maciej Szankin et al.	2507.11730	null
2025-07-15	Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander	Li Wang et al.	2507.11079	null
2025-07-15	Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation	Yanbo Wang et al.	2507.11001	null
2025-07-14	Static or Temporal? Semantic Scene Simplification to Aid Wayfinding in Immersive Simulations of Bionic Vision	Justin M. Kasowski et al.	2507.10813	null
2025-07-14	EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	Mingxian Lin et al.	2507.10548	link
2025-07-13	VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding	Younggun Kim et al.	2507.09815	null
2025-07-13	Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles	Yangang Ren et al.	2507.09537	null
2025-07-12	Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding	Wencan Huang et al.	2507.09334	null
2025-07-12	THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage	Trong-Thuan Nguyen et al.	2507.09200	null
2025-07-12	Towards Spatial Audio Understanding via Question Answering	Parthasaarathy Sudarsanam et al.	2507.09195	null
2025-07-12	On the Fragility of Multimodal Perception to Temporal Misalignment in Autonomous Driving	Md Hasan Shahriar et al.	2507.09095	null
2025-07-10	OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding	JingLi Lin et al.	2507.07984	link
2025-07-10	MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation	Bangning Wei et al.	2507.07519	null
2025-07-09	SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds	Matthias Zeller et al.	2507.06906	null
2025-07-09	Token Bottleneck: One Token to Remember Dynamics	Taekyung Kim et al.	2507.06543	link
2025-07-09	What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies	Yaoqi Huang et al.	2507.06513	null
2025-07-08	Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion	Aleksandar Jevtić et al.	2507.06230	link
2025-07-08	SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning	Xin Hu et al.	2507.05798	null
2025-07-07	All in One: Visual-Description-Guided Unified Point Cloud Segmentation	Zongyan Han et al.	2507.05211	null
2025-07-07	MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding	Jing Liang et al.	2507.04686	null
2025-07-05	Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation	Ziyu Zhu et al.	2507.04047	null
2025-07-05	Habitat Classification from Ground-Level Imagery Using Deep Neural Networks	Hongrui Shi et al.	2507.04017	null
2025-07-04	Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds	Matthias Zeller et al.	2507.03463	null
2025-07-03	LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans	Zhening Huang et al.	2507.02861	link
2025-07-03	LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion	Fangfu Liu et al.	2507.02813	link
2025-07-03	SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment	Qi Xu et al.	2507.02705	link
2025-07-04	Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach	Elena Ryumina et al.	2507.02205	link
2025-07-02	ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning	Xiao Wang et al.	2507.02200	null
2025-07-02	ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving	Kai Chen et al.	2507.01735	null
2025-07-01	GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond	Anna-Maria Halacheva et al.	2507.00886	null
2025-07-01	BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving	Zeming Chen et al.	2507.00707	null
2025-07-01	SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting	Yiming Huang et al.	2506.23309	null
