Scene Understanding - 2025-03 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Code
2025-03-30	PhysPose: Refining 6D Object Poses with Physical Constraints	Martin Malenický et al.	2503.23587	null
2025-03-30	Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model	Jannik Endres et al.	2503.23502	link
2025-03-29	Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery	Boyi Ma et al.	2503.23130	null
2025-03-29	Evaluating Compositional Scene Understanding in Multimodal Generative Models	Shuhao Fu et al.	2503.23125	link
2025-03-29	Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments	Yifan Xu et al.	2503.23105	null
2025-03-29	Empowering Large Language Models with 3D Situation Awareness	Zhihao Yuan et al.	2503.23024	null
2025-03-28	Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users	Antonia Karamolegkou et al.	2503.22610	null
2025-03-28	Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration	Heiko Renz et al.	2503.22588	null
2025-03-28	NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving	Fuhao Li et al.	2503.22436	null
2025-03-28	Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision	Rulin Zhou et al.	2503.22394	null
2025-03-28	A Dataset for Semantic Segmentation in the Presence of Unknowns	Zakaria Laskar et al.	2503.22309	null
2025-03-28	Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction	Seokha Moon et al.	2503.22087	null
2025-03-27	Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting	Anand Bhattad et al.	2503.21770	null
2025-03-27	uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images	Jonathan Lee et al.	2503.21562	link
2025-03-27	Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving	Lucas Nunes et al.	2503.21449	link
2025-03-26	DINeMo: Learning Neural Mesh Models with no 3D Annotations	Weijie Guo et al.	2503.20220	null
2025-03-25	The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs	Jonathan Sauder et al.	2503.20000	null
2025-03-25	SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining	Xiang Xu et al.	2503.19912	link
2025-03-25	OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations	Christina Kassab et al.	2503.19764	null
2025-03-26	COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting	Jiaxin Zhang et al.	2503.19443	link
2025-03-25	Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting	Zhiying Yan et al.	2503.19332	null
2025-03-25	BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation	Hanshuo Qiu et al.	2503.19303	null
2025-03-24	Efficient and Accurate Scene Text Recognition with Cascaded-Transformers	Savas Ozkan et al.	2503.18883	null
2025-03-24	Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition	Yifei Zhang et al.	2503.18746	null
2025-03-24	Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving	Hongkuan Zhou et al.	2503.18730	null
2025-03-23	MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation	Jiaxin Huang et al.	2503.18135	null
2025-03-23	PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding	Hongjia Zhai et al.	2503.18107	null
2025-03-23	PanopticSplatting: End-to-End Panoptic Gaussian Splatting	Yuxuan Xie et al.	2503.18073	null
2025-03-23	PolarFree: Polarization-based Reflection-free Imaging	Mingde Yao et al.	2503.18055	null
2025-03-23	SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining	Yue Li et al.	2503.18052	null
2025-03-23	Geometric Constrained Non-Line-of-Sight Imaging	Xueying Liu et al.	2503.17992	null
2025-03-22	A Causal Adjustment Module for Debiasing Scene Graph Generation	Li Liu et al.	2503.17862	null
2025-03-21	Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation	Giacomo Savazzi et al.	2503.17224	null
2025-03-21	ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail	Chandan Yeshwanth et al.	2503.17044	null
2025-03-21	Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision	Maoji Zheng et al.	2503.16811	null
2025-03-21	OpenCity3D: What do Vision-Language Models know about Urban Environments?	Valentin Bieri et al.	2503.16776	null
2025-03-20	Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding	Jinlong Li et al.	2503.16707	null
2025-03-20	ContactFusion: Stochastic Poisson Surface Maps from Visual and Contact Sensing	Aditya Kamireddypalli et al.	2503.16592	null
2025-03-20	From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction	Ayberk Acar et al.	2503.16263	null
2025-03-20	Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation	Andrea Maracani et al.	2503.16184	null
2025-03-20	What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation?	Xuanming Cui et al.	2503.15846	null
2025-03-19	A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition	Ritabrata Chakraborty et al.	2503.15639	null
2025-03-19	Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene	Shengqiong Wu et al.	2503.15019	null
2025-03-19	Universal Scene Graph Generation	Shengqiong Wu et al.	2503.15005	null
2025-03-19	SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments	Yinqi Chen et al.	2503.14837	null
2025-03-20	These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models	Parker Ewen et al.	2503.14665	null
2025-03-17	Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey	Liewen Liao et al.	2503.14537	null
2025-03-18	DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation	Mu Chen et al.	2503.13957	link
2025-03-18	Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation	Sayak Nag et al.	2503.13947	null
2025-03-18	ChatBEV: A Visual Language Model that Understands BEV Maps	Qingyao Xu et al.	2503.13938	null
2025-03-18	PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds	Barza Nisar et al.	2503.13914	null
2025-03-17	Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training	Corentin Sautier et al.	2503.13203	null
2025-03-17	Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation	Henghui Du et al.	2503.13068	null
2025-03-17	InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving	Ruiqi Song et al.	2503.13047	null
2025-03-17	HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding	Jiahe Zhao et al.	2503.12955	null
2025-03-17	NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models	Sung-Yeon Park et al.	2503.12772	null
2025-03-16	Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding	Imran Kabir et al.	2503.12663	null
2025-03-16	Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset	Yutao Hu et al.	2503.12385	null
2025-03-15	TACO: Taming Diffusion for in-the-wild Video Amodal Completion	Ruijie Lu et al.	2503.12049	null
2025-03-14	Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling	Christopher Xie et al.	2503.11806	null
2025-03-14	EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting	Di Li et al.	2503.11345	null
2025-03-14	Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset	Yibing Weng et al.	2503.11342	null
2025-03-13	Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations	Piyush Gupta et al.	2503.10941	null
2025-03-11	MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation	Anzhe Cheng et al.	2503.10686	null
2025-03-13	TARS: Traffic-Aware Radar Scene Flow Estimation	Jialong Wu et al.	2503.10210	null
2025-03-13	TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness	Mu Chen et al.	2503.09941	null
2025-03-12	Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval	Stefan Sylvius Wagner et al.	2503.09867	null
2025-03-11	Language-Depth Navigated Thermal and Visible Image Fusion	Jinchang Zhang et al.	2503.08676	null
2025-03-11	Generating Robot Constitutions & Benchmarks for Semantic Safety	Pierre Sermanet et al.	2503.08663	null
2025-03-11	Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding	Tim Steinke et al.	2503.08474	null
2025-03-11	TrackOcc: Camera-based 4D Panoptic Occupancy Tracking	Zhuoguang Chen et al.	2503.08471	null
2025-03-11	Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking	Xucheng Guo et al.	2503.08370	null
2025-03-11	DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos	Lorenzo Mur-Labadia et al.	2503.08344	null
2025-03-11	Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving	Runwei Guan et al.	2503.08336	null
2025-03-11	General-Purpose Aerial Intelligent Agents Empowered by Large Language Models	Ji Zhao et al.	2503.08302	null
2025-03-10	FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction	Dennis Rotondi et al.	2503.07909	null
2025-03-10	Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction	Zongzheng Zhang et al.	2503.07485	null
2025-03-10	CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting	Haicheng Liao et al.	2503.07234	null
2025-03-10	A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning	Xin Wen et al.	2503.06960	null
2025-03-10	LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs	Hanyu Zhou et al.	2503.06934	null
2025-03-08	SplatTalk: 3D VQA with Gaussian Splatting	Anh Thai et al.	2503.06271	null
2025-03-08	Segment Anything, Even Occluded	Wei-En Tai et al.	2503.06261	null
2025-03-08	VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion	Meng Wang et al.	2503.06219	null
2025-03-08	Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images	YingLiang Ma et al.	2503.06190	null
2025-03-08	Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction	Kai Li et al.	2503.06161	null
2025-03-08	Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity	Xiaohao Xu et al.	2503.06014	null
2025-03-07	HexPlane Representation for 3D Semantic Scene Understanding	Zeren Chen et al.	2503.05127	null
2025-03-06	Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning	Victor Sebastian Martinez Pozos et al.	2503.04900	null
2025-03-06	EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images	Rohit Menon et al.	2503.04441	null
2025-03-06	An Egocentric Vision-Language Model based Portable Real-time Smart Assistant	Yifei Huang et al.	2503.04250	null
2025-03-06	H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision	Yunxiao Shi et al.	2503.04059	null
2025-03-06	GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding	Xihan Wang et al.	2503.04034	null
2025-03-05	SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection	Devanish N. Kamtam et al.	2503.03942	null
2025-03-05	Vision-Language Models Struggle to Align Entities across Modalities	Iñigo Alonso et al.	2503.03854	null
2025-03-05	Improving 6D Object Pose Estimation of metallic Household and Industry Objects	Thomas Pöllabauer et al.	2503.03655	null
2025-03-04	MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments	Ege Özsoy et al.	2503.02579	link
2025-03-04	Label-Efficient LiDAR Panoptic Segmentation	Ahmet Selim Çanakçı et al.	2503.02372	null
2025-03-04	SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images	Gargi Panda et al.	2503.02270	null
2025-03-03	vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding	Ali Tourani et al.	2503.01783	link
2025-03-03	OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding	Dianyi Yang et al.	2503.01646	null
2025-03-03	Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond	Guanyao Wu et al.	2503.01210	link
2025-03-03	Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations	Junsong Zhang et al.	2503.01114	null
2025-03-01	Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing	Yanjun Li et al.	2503.00548	null
2025-03-01	Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning	Hanxun Yu et al.	2503.00513	link
2025-03-04	Floorplan-SLAM: A Real-Time, High-Accuracy, and Long-Term Multi-Session Point-Plane SLAM for Efficient Floorplan Reconstruction	Haolin Wang et al.	2503.00397	null
