Scene Understanding - 2025-03

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 2025-03-30 | PhysPose: Refining 6D Object Poses with Physical Constraints | Martin Malenický et al. | 2503.23587 | translate | read | null |
| 2025-03-30 | Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | Jannik Endres et al. | 2503.23502 | translate | read | link |
| 2025-03-29 | Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery | Boyi Ma et al. | 2503.23130 | translate | read | null |
| 2025-03-29 | Evaluating Compositional Scene Understanding in Multimodal Generative Models | Shuhao Fu et al. | 2503.23125 | translate | read | link |
| 2025-03-29 | Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments | Yifan Xu et al. | 2503.23105 | translate | read | null |
| 2025-03-29 | Empowering Large Language Models with 3D Situation Awareness | Zhihao Yuan et al. | 2503.23024 | translate | read | null |
| 2025-03-28 | Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users | Antonia Karamolegkou et al. | 2503.22610 | translate | read | null |
| 2025-03-28 | Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration | Heiko Renz et al. | 2503.22588 | translate | read | null |
| 2025-03-28 | NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving | Fuhao Li et al. | 2503.22436 | translate | read | null |
| 2025-03-28 | Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision | Rulin Zhou et al. | 2503.22394 | translate | read | null |
| 2025-03-28 | A Dataset for Semantic Segmentation in the Presence of Unknowns | Zakaria Laskar et al. | 2503.22309 | translate | read | null |
| 2025-03-28 | Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction | Seokha Moon et al. | 2503.22087 | translate | read | null |
| 2025-03-27 | Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting | Anand Bhattad et al. | 2503.21770 | translate | read | null |
| 2025-03-27 | uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images | Jonathan Lee et al. | 2503.21562 | translate | read | link |
| 2025-03-27 | Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving | Lucas Nunes et al. | 2503.21449 | translate | read | link |
| 2025-03-26 | DINeMo: Learning Neural Mesh Models with no 3D Annotations | Weijie Guo et al. | 2503.20220 | translate | read | null |
| 2025-03-25 | The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Jonathan Sauder et al. | 2503.20000 | translate | read | null |
| 2025-03-25 | SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining | Xiang Xu et al. | 2503.19912 | translate | read | link |
| 2025-03-25 | OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations | Christina Kassab et al. | 2503.19764 | translate | read | null |
| 2025-03-26 | COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting | Jiaxin Zhang et al. | 2503.19443 | translate | read | link |
| 2025-03-25 | Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting | Zhiying Yan et al. | 2503.19332 | translate | read | null |
| 2025-03-25 | BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation | Hanshuo Qiu et al. | 2503.19303 | translate | read | null |
| 2025-03-24 | Efficient and Accurate Scene Text Recognition with Cascaded-Transformers | Savas Ozkan et al. | 2503.18883 | translate | read | null |
| 2025-03-24 | Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition | Yifei Zhang et al. | 2503.18746 | translate | read | null |
| 2025-03-24 | Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving | Hongkuan Zhou et al. | 2503.18730 | translate | read | null |
| 2025-03-23 | MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | Jiaxin Huang et al. | 2503.18135 | translate | read | null |
| 2025-03-23 | PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding | Hongjia Zhai et al. | 2503.18107 | translate | read | null |
| 2025-03-23 | PanopticSplatting: End-to-End Panoptic Gaussian Splatting | Yuxuan Xie et al. | 2503.18073 | translate | read | null |
| 2025-03-23 | PolarFree: Polarization-based Reflection-free Imaging | Mingde Yao et al. | 2503.18055 | translate | read | null |
| 2025-03-23 | SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Yue Li et al. | 2503.18052 | translate | read | null |
| 2025-03-23 | Geometric Constrained Non-Line-of-Sight Imaging | Xueying Liu et al. | 2503.17992 | translate | read | null |
| 2025-03-22 | A Causal Adjustment Module for Debiasing Scene Graph Generation | Li Liu et al. | 2503.17862 | translate | read | null |
| 2025-03-21 | Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | Giacomo Savazzi et al. | 2503.17224 | translate | read | null |
| 2025-03-21 | ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail | Chandan Yeshwanth et al. | 2503.17044 | translate | read | null |
| 2025-03-21 | Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision | Maoji Zheng et al. | 2503.16811 | translate | read | null |
| 2025-03-21 | OpenCity3D: What do Vision-Language Models know about Urban Environments? | Valentin Bieri et al. | 2503.16776 | translate | read | null |
| 2025-03-20 | Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Jinlong Li et al. | 2503.16707 | translate | read | null |
| 2025-03-20 | ContactFusion: Stochastic Poisson Surface Maps from Visual and Contact Sensing | Aditya Kamireddypalli et al. | 2503.16592 | translate | read | null |
| 2025-03-20 | From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction | Ayberk Acar et al. | 2503.16263 | translate | read | null |
| 2025-03-20 | Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation | Andrea Maracani et al. | 2503.16184 | translate | read | null |
| 2025-03-20 | What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Xuanming Cui et al. | 2503.15846 | translate | read | null |
| 2025-03-19 | A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition | Ritabrata Chakraborty et al. | 2503.15639 | translate | read | null |
| 2025-03-19 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene | Shengqiong Wu et al. | 2503.15019 | translate | read | null |
| 2025-03-19 | Universal Scene Graph Generation | Shengqiong Wu et al. | 2503.15005 | translate | read | null |
| 2025-03-19 | SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments | Yinqi Chen et al. | 2503.14837 | translate | read | null |
| 2025-03-20 | These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models | Parker Ewen et al. | 2503.14665 | translate | read | null |
| 2025-03-17 | Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey | Liewen Liao et al. | 2503.14537 | translate | read | null |
| 2025-03-18 | DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation | Mu Chen et al. | 2503.13957 | translate | read | link |
| 2025-03-18 | Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation | Sayak Nag et al. | 2503.13947 | translate | read | null |
| 2025-03-18 | ChatBEV: A Visual Language Model that Understands BEV Maps | Qingyao Xu et al. | 2503.13938 | translate | read | null |
| 2025-03-18 | PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds | Barza Nisar et al. | 2503.13914 | translate | read | null |
| 2025-03-17 | Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training | Corentin Sautier et al. | 2503.13203 | translate | read | null |
| 2025-03-17 | Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Henghui Du et al. | 2503.13068 | translate | read | null |
| 2025-03-17 | InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving | Ruiqi Song et al. | 2503.13047 | translate | read | null |
| 2025-03-17 | HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding | Jiahe Zhao et al. | 2503.12955 | translate | read | null |
| 2025-03-17 | NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Sung-Yeon Park et al. | 2503.12772 | translate | read | null |
| 2025-03-16 | Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Imran Kabir et al. | 2503.12663 | translate | read | null |
| 2025-03-16 | Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset | Yutao Hu et al. | 2503.12385 | translate | read | null |
| 2025-03-15 | TACO: Taming Diffusion for in-the-wild Video Amodal Completion | Ruijie Lu et al. | 2503.12049 | translate | read | null |
| 2025-03-14 | Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling | Christopher Xie et al. | 2503.11806 | translate | read | null |
| 2025-03-14 | EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting | Di Li et al. | 2503.11345 | translate | read | null |
| 2025-03-14 | Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset | Yibing Weng et al. | 2503.11342 | translate | read | null |
| 2025-03-13 | Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations | Piyush Gupta et al. | 2503.10941 | translate | read | null |
| 2025-03-11 | MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation | Anzhe Cheng et al. | 2503.10686 | translate | read | null |
| 2025-03-13 | TARS: Traffic-Aware Radar Scene Flow Estimation | Jialong Wu et al. | 2503.10210 | translate | read | null |
| 2025-03-13 | TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness | Mu Chen et al. | 2503.09941 | translate | read | null |
| 2025-03-12 | Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval | Stefan Sylvius Wagner et al. | 2503.09867 | translate | read | null |
| 2025-03-11 | Language-Depth Navigated Thermal and Visible Image Fusion | Jinchang Zhang et al. | 2503.08676 | translate | read | null |
| 2025-03-11 | Generating Robot Constitutions & Benchmarks for Semantic Safety | Pierre Sermanet et al. | 2503.08663 | translate | read | null |
| 2025-03-11 | Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding | Tim Steinke et al. | 2503.08474 | translate | read | null |
| 2025-03-11 | TrackOcc: Camera-based 4D Panoptic Occupancy Tracking | Zhuoguang Chen et al. | 2503.08471 | translate | read | null |
| 2025-03-11 | Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking | Xucheng Guo et al. | 2503.08370 | translate | read | null |
| 2025-03-11 | DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos | Lorenzo Mur-Labadia et al. | 2503.08344 | translate | read | null |
| 2025-03-11 | Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving | Runwei Guan et al. | 2503.08336 | translate | read | null |
| 2025-03-11 | General-Purpose Aerial Intelligent Agents Empowered by Large Language Models | Ji Zhao et al. | 2503.08302 | translate | read | null |
| 2025-03-10 | FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction | Dennis Rotondi et al. | 2503.07909 | translate | read | null |
| 2025-03-10 | Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction | Zongzheng Zhang et al. | 2503.07485 | translate | read | null |
| 2025-03-10 | CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting | Haicheng Liao et al. | 2503.07234 | translate | read | null |
| 2025-03-10 | A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Xin Wen et al. | 2503.06960 | translate | read | null |
| 2025-03-10 | LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs | Hanyu Zhou et al. | 2503.06934 | translate | read | null |
| 2025-03-08 | SplatTalk: 3D VQA with Gaussian Splatting | Anh Thai et al. | 2503.06271 | translate | read | null |
| 2025-03-08 | Segment Anything, Even Occluded | Wei-En Tai et al. | 2503.06261 | translate | read | null |
| 2025-03-08 | VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion | Meng Wang et al. | 2503.06219 | translate | read | null |
| 2025-03-08 | Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images | YingLiang Ma et al. | 2503.06190 | translate | read | null |
| 2025-03-08 | Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction | Kai Li et al. | 2503.06161 | translate | read | null |
| 2025-03-08 | Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity | Xiaohao Xu et al. | 2503.06014 | translate | read | null |
| 2025-03-07 | HexPlane Representation for 3D Semantic Scene Understanding | Zeren Chen et al. | 2503.05127 | translate | read | null |
| 2025-03-06 | Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning | Victor Sebastian Martinez Pozos et al. | 2503.04900 | translate | read | null |
| 2025-03-06 | EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images | Rohit Menon et al. | 2503.04441 | translate | read | null |
| 2025-03-06 | An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Yifei Huang et al. | 2503.04250 | translate | read | null |
| 2025-03-06 | H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision | Yunxiao Shi et al. | 2503.04059 | translate | read | null |
| 2025-03-06 | GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding | Xihan Wang et al. | 2503.04034 | translate | read | null |
| 2025-03-05 | SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection | Devanish N. Kamtam et al. | 2503.03942 | translate | read | null |
| 2025-03-05 | Vision-Language Models Struggle to Align Entities across Modalities | Iñigo Alonso et al. | 2503.03854 | translate | read | null |
| 2025-03-05 | Improving 6D Object Pose Estimation of metallic Household and Industry Objects | Thomas Pöllabauer et al. | 2503.03655 | translate | read | null |
| 2025-03-04 | MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments | Ege Özsoy et al. | 2503.02579 | translate | read | link |
| 2025-03-04 | Label-Efficient LiDAR Panoptic Segmentation | Ahmet Selim Çanakçı et al. | 2503.02372 | translate | read | null |
| 2025-03-04 | SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images | Gargi Panda et al. | 2503.02270 | translate | read | null |
| 2025-03-03 | vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding | Ali Tourani et al. | 2503.01783 | translate | read | link |
| 2025-03-03 | OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding | Dianyi Yang et al. | 2503.01646 | translate | read | null |
| 2025-03-03 | Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond | Guanyao Wu et al. | 2503.01210 | translate | read | link |
| 2025-03-03 | Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations | Junsong Zhang et al. | 2503.01114 | translate | read | null |
| 2025-03-01 | Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing | Yanjun Li et al. | 2503.00548 | translate | read | null |
| 2025-03-01 | Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning | Hanxun Yu et al. | 2503.00513 | translate | read | link |
| 2025-03-04 | Floorplan-SLAM: A Real-Time, High-Accuracy, and Long-Term Multi-Session Point-Plane SLAM for Efficient Floorplan Reconstruction | Haolin Wang et al. | 2503.00397 | translate | read | null |
