Scene Understanding - 2025-11 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-11-29	When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI	Yanhui Li et.al.	2512.03087	translate	read	null
2025-11-30	FOM-Nav: Frontier-Object Maps for Object Goal Navigation	Thomas Chabal et.al.	2512.01009	translate	read	null
2025-11-30	Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting	Haishan Wang et.al.	2512.00850	translate	read	null
2025-11-29	Describe Anything Anywhere At Any Moment	Nicolas Gorlo et.al.	2512.00565	translate	read	null
2025-11-29	Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR	Lixing Guo et.al.	2512.00294	translate	read	null
2025-11-28	DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation	Zirui Wang et.al.	2512.00226	translate	read	null
2025-11-28	DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation	Hongfei Zhang et.al.	2511.23127	translate	read	null
2025-11-28	Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding	Anik De et.al.	2511.23071	translate	read	null
2025-11-28	HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model	Chen Li et.al.	2511.22961	translate	read	null
2025-11-28	See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection	YuEun Lee et.al.	2511.22906	translate	read	null
2025-11-27	GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes	Di Wang et.al.	2511.22645	translate	read	null
2025-11-27	CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving	Zhaohui Wang et.al.	2511.22532	translate	read	null
2025-11-27	RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding	Xiyan Liu et.al.	2511.22466	translate	read	null
2025-11-26	SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding	Tae-Min Choi et.al.	2511.21339	translate	read	null
2025-11-26	Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding	Yutao Tang et.al.	2511.21191	translate	read	null
2025-11-26	Scaling Foundation Models for Radar Scene Understanding	Pushkal Mishra et.al.	2511.21105	translate	read	null
2025-11-25	3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding	Xiaoye Wang et.al.	2511.20646	translate	read	null
2025-11-25	CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception	Miguel Carvalho et.al.	2511.19820	translate	read	null
2025-11-24	Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models	Jonathan Lee et.al.	2511.19526	translate	read	null
2025-11-24	Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving	Jianhua Han et.al.	2511.19221	translate	read	null
2025-11-24	AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation	Omar Garib et.al.	2511.18718	translate	read	null
2025-11-24	Autonomous Surface Selection For Manipulator-Based UV Disinfection In Hospitals Using Foundation Models	Xueyan Oh et.al.	2511.18709	translate	read	null
2025-11-23	Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span	Heeseung Yun et.al.	2511.18470	translate	read	null
2025-11-22	Plan-X: Instruct Video Generation via Semantic Planning	Lun Huang et.al.	2511.17986	translate	read	null
2025-11-21	CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation	Prantik Howlader et.al.	2511.17755	translate	read	null
2025-11-18	Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression	Siddiqua Namrah et.al.	2511.17612	translate	read	null
2025-11-21	SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation	Seamie Hayes et.al.	2511.17361	translate	read	null
2025-11-21	Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM	Chiori Hori et.al.	2511.17335	translate	read	null
2025-11-20	POMA-3D: The Point Map Way to 3D Scene Understanding	Ye Mao et.al.	2511.16567	translate	read	null
2025-11-20	LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs	Doriand Petit et.al.	2511.16454	translate	read	null
2025-11-20	Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM	Gergely Dinya et.al.	2511.16282	translate	read	null
2025-11-20	How Robot Dogs See the Unseeable: Improving Visual Interpretability via Peering for Exploratory Robots	Oliver Bimber et.al.	2511.16262	translate	read	null
2025-11-20	Real-Time 3D Object Detection with Inference-Aligned Learning	Chenyu Zhao et.al.	2511.16140	translate	read	null
2025-11-20	Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click	Raphael Ruschel et.al.	2511.15948	translate	read	null
2025-11-19	WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion	Sajjad Pakdamansavoji et.al.	2511.15874	translate	read	null
2025-11-19	ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation	Simon Boeder et.al.	2511.15396	translate	read	null
2025-11-19	Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception	Jiashu Yang et.al.	2511.15279	translate	read	null
2025-11-18	RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems	Jaro Meyer et.al.	2511.14948	translate	read	null
2025-11-18	Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models	Hao Zhen et.al.	2511.14120	translate	read	null
2025-11-18	Error-Driven Scene Editing for 3D Grounding in Large Language Models	Yue Zhang et.al.	2511.14086	translate	read	null
2025-11-18	RISE: Single Static Radar-based Indoor Scene Understanding	Kaichen Zhou et.al.	2511.14019	translate	read	null
2025-11-17	VLMs Guided Interpretable Decision Making for Autonomous Driving	Xin Hu et.al.	2511.13881	translate	read	null
2025-11-17	Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation	Lingfeng Zhang et.al.	2511.13269	translate	read	null
2025-11-17	Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving	Jiacheng Tang et.al.	2511.13079	translate	read	null
2025-11-17	Visual Room 2.0: Seeing is Not Understanding for MLLMs	Haokun Li et.al.	2511.12928	translate	read	null
2025-11-16	RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation	Xiaoshuai Hao et.al.	2511.12436	translate	read	null
2025-11-14	Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy	Vinit Mehta et.al.	2511.11777	translate	read	null
2025-11-13	ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts	Haowen Jiang et.al.	2511.11740	translate	read	null
2025-11-14	AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning	Jirong Zha et.al.	2511.11025	translate	read	null
2025-11-13	DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation	Xuexun Liu et.al.	2511.10003	translate	read	null
2025-11-12	Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding	Jingtian Ma et.al.	2511.08978	translate	read	null
2025-11-11	RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation	Hae-Won Jo et.al.	2511.08651	translate	read	null
2025-11-05	Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants	I. Bailo et.al.	2511.08609	translate	read	null
2025-11-11	OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition	Lixu Sun et.al.	2511.08133	translate	read	null
2025-11-11	HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving	Zhiwen Yang et.al.	2511.07925	translate	read	null
2025-11-11	Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views	Haida Feng et.al.	2511.07813	translate	read	null
2025-11-10	Inference-Time Scaling of Diffusion Models for Infrared Data Generation	Kai A. Horstmann et.al.	2511.07362	translate	read	null
2025-11-10	PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving	Simon Gerstenecker et.al.	2511.07292	translate	read	null
2025-11-10	Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images	JiaKui Hu et.al.	2511.07222	translate	read	null
2025-11-10	TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding	Duc Nguyen et.al.	2511.07007	translate	read	null
2025-11-10	PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory	Qunchao Jin et.al.	2511.06840	translate	read	null
2025-11-09	Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR)	Tobias Rueckert et.al.	2511.06549	translate	read	null
2025-11-08	Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation	Lin Li et.al.	2511.05935	translate	read	null
2025-11-08	Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning	Fei Yu et.al.	2511.05894	translate	read	null
2025-11-07	Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots	Justin Williams et.al.	2511.05642	translate	read	null
2025-11-06	Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition	Nicholas Babey et.al.	2511.05622	translate	read	null
2025-11-06	GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies	Maëlic Neau et.al.	2511.04357	translate	read	null
2025-11-06	CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation	Yuwen Tao et.al.	2511.03992	translate	read	null
2025-11-06	Simple 3D Pose Features Support Human and Machine Social Scene Understanding	Wenshuo Qin et.al.	2511.03988	translate	read	null
2025-11-06	Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images	Sam Bahrami et.al.	2511.03970	translate	read	null
2025-11-05	SILVI: Simple Interface for Labeling Video Interactions	Ozan Kanbertay et.al.	2511.03819	translate	read	null
2025-11-05	SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding	Mauro Orazio Drago et.al.	2511.03325	translate	read	null
2025-11-04	LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation	Gyeom Hwangbo et.al.	2511.03001	translate	read	null
2025-11-04	DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding	Zixuan Liu et.al.	2511.02495	translate	read	null
2025-11-04	Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization	Tao Liu et.al.	2511.02489	translate	read	link
2025-11-04	From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics	Nicolas Schuler et.al.	2511.02427	translate	read	null
2025-11-03	Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis	Soham Joshi et.al.	2511.02046	translate	read	null
2025-11-03	A Compact Model for Polar Multiple-Channel Field Effect Transistors: A Case Study in III-V Nitride Semiconductors	Aias Asteris et.al.	2511.01699	translate	read	null
2025-11-03	Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models	Xiaoyu Zhan et.al.	2511.01618	translate	read	null
2025-11-03	PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model	Wenqi Liang et.al.	2511.01571	translate	read	null
2025-11-03	Fast and Robust Remote Two-Qubit Gates on Distributed Qubits	Yunan Li et.al.	2511.01418	translate	read	null
2025-11-03	A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model	Sampriti Soor et.al.	2511.01317	translate	read	null
2025-11-03	LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping	Lijie Wang et.al.	2511.01186	translate	read	null
2025-11-02	GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies	Ziye Wang et.al.	2511.00998	translate	read	null
2025-11-01	Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach	Oluwatosin Alabi et.al.	2511.00643	translate	read	null
2025-11-01	CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World	Yating Yu et.al.	2511.00613	translate	read	null
2025-11-01	Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models	Panwang Pan et.al.	2511.00503	translate	read	link