Scene Understanding - 2025-03
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-03-30 | PhysPose: Refining 6D Object Poses with Physical Constraints | Martin Malenický et.al. | 2503.23587 | translate | read | null |
| 2025-03-30 | Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | Jannik Endres et.al. | 2503.23502 | translate | read | link |
| 2025-03-29 | Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery | Boyi Ma et.al. | 2503.23130 | translate | read | null |
| 2025-03-29 | Evaluating Compositional Scene Understanding in Multimodal Generative Models | Shuhao Fu et.al. | 2503.23125 | translate | read | link |
| 2025-03-29 | Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments | Yifan Xu et.al. | 2503.23105 | translate | read | null |
| 2025-03-29 | Empowering Large Language Models with 3D Situation Awareness | Zhihao Yuan et.al. | 2503.23024 | translate | read | null |
| 2025-03-28 | Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users | Antonia Karamolegkou et.al. | 2503.22610 | translate | read | null |
| 2025-03-28 | Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration | Heiko Renz et.al. | 2503.22588 | translate | read | null |
| 2025-03-28 | NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving | Fuhao Li et.al. | 2503.22436 | translate | read | null |
| 2025-03-28 | Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision | Rulin Zhou et.al. | 2503.22394 | translate | read | null |
| 2025-03-28 | A Dataset for Semantic Segmentation in the Presence of Unknowns | Zakaria Laskar et.al. | 2503.22309 | translate | read | null |
| 2025-03-28 | Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction | Seokha Moon et.al. | 2503.22087 | translate | read | null |
| 2025-03-27 | Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting | Anand Bhattad et.al. | 2503.21770 | translate | read | null |
| 2025-03-27 | uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images | Jonathan Lee et.al. | 2503.21562 | translate | read | link |
| 2025-03-27 | Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving | Lucas Nunes et.al. | 2503.21449 | translate | read | link |
| 2025-03-26 | DINeMo: Learning Neural Mesh Models with no 3D Annotations | Weijie Guo et.al. | 2503.20220 | translate | read | null |
| 2025-03-25 | The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Jonathan Sauder et.al. | 2503.20000 | translate | read | null |
| 2025-03-25 | SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining | Xiang Xu et.al. | 2503.19912 | translate | read | link |
| 2025-03-25 | OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations | Christina Kassab et.al. | 2503.19764 | translate | read | null |
| 2025-03-26 | COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting | Jiaxin Zhang et.al. | 2503.19443 | translate | read | link |
| 2025-03-25 | Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Spatting | Zhiying Yan et.al. | 2503.19332 | translate | read | null |
| 2025-03-25 | BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation | Hanshuo Qiu et.al. | 2503.19303 | translate | read | null |
| 2025-03-24 | Efficient and Accurate Scene Text Recognition with Cascaded-Transformers | Savas Ozkan et.al. | 2503.18883 | translate | read | null |
| 2025-03-24 | Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition | Yifei Zhang et.al. | 2503.18746 | translate | read | null |
| 2025-03-24 | Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving | Hongkuan Zhou et.al. | 2503.18730 | translate | read | null |
| 2025-03-23 | MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | Jiaxin Huang et.al. | 2503.18135 | translate | read | null |
| 2025-03-23 | PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding | Hongjia Zhai et.al. | 2503.18107 | translate | read | null |
| 2025-03-23 | PanopticSplatting: End-to-End Panoptic Gaussian Splatting | Yuxuan Xie et.al. | 2503.18073 | translate | read | null |
| 2025-03-23 | PolarFree: Polarization-based Reflection-free Imaging | Mingde Yao et.al. | 2503.18055 | translate | read | null |
| 2025-03-23 | SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Yue Li et.al. | 2503.18052 | translate | read | null |
| 2025-03-23 | Geometric Constrained Non-Line-of-Sight Imaging | Xueying Liu et.al. | 2503.17992 | translate | read | null |
| 2025-03-22 | A Causal Adjustment Module for Debiasing Scene Graph Generation | Li Liu et.al. | 2503.17862 | translate | read | null |
| 2025-03-21 | Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | Giacomo Savazzi et.al. | 2503.17224 | translate | read | null |
| 2025-03-21 | ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail | Chandan Yeshwanth et.al. | 2503.17044 | translate | read | null |
| 2025-03-21 | Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision | Maoji Zheng et.al. | 2503.16811 | translate | read | null |
| 2025-03-21 | OpenCity3D: What do Vision-Language Models know about Urban Environments? | Valentin Bieri et.al. | 2503.16776 | translate | read | null |
| 2025-03-20 | Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Jinlong Li et.al. | 2503.16707 | translate | read | null |
| 2025-03-20 | ContactFusion: Stochastic Poisson Surface Maps from Visual and Contact Sensing | Aditya Kamireddypalli et.al. | 2503.16592 | translate | read | null |
| 2025-03-20 | From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction | Ayberk Acar et.al. | 2503.16263 | translate | read | null |
| 2025-03-20 | Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation | Andrea Maracani et.al. | 2503.16184 | translate | read | null |
| 2025-03-20 | What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Xuanming Cui et.al. | 2503.15846 | translate | read | null |
| 2025-03-19 | A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition | Ritabrata Chakraborty et.al. | 2503.15639 | translate | read | null |
| 2025-03-19 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene | Shengqiong Wu et.al. | 2503.15019 | translate | read | null |
| 2025-03-19 | Universal Scene Graph Generation | Shengqiong Wu et.al. | 2503.15005 | translate | read | null |
| 2025-03-19 | SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments | Yinqi Chen et.al. | 2503.14837 | translate | read | null |
| 2025-03-20 | These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models | Parker Ewen et.al. | 2503.14665 | translate | read | null |
| 2025-03-17 | Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey | Liewen Liao et.al. | 2503.14537 | translate | read | null |
| 2025-03-18 | DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation | Mu Chen et.al. | 2503.13957 | translate | read | link |
| 2025-03-18 | Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation | Sayak Nag et.al. | 2503.13947 | translate | read | null |
| 2025-03-18 | ChatBEV: A Visual Language Model that Understands BEV Maps | Qingyao Xu et.al. | 2503.13938 | translate | read | null |
| 2025-03-18 | PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds | Barza Nisar et.al. | 2503.13914 | translate | read | null |
| 2025-03-17 | Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training | Corentin Sautier et.al. | 2503.13203 | translate | read | null |
| 2025-03-17 | Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Henghui Du et.al. | 2503.13068 | translate | read | null |
| 2025-03-17 | InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving | Ruiqi Song et.al. | 2503.13047 | translate | read | null |
| 2025-03-17 | HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding | Jiahe Zhao et.al. | 2503.12955 | translate | read | null |
| 2025-03-17 | NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Sung-Yeon Park et.al. | 2503.12772 | translate | read | null |
| 2025-03-16 | Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Imran Kabir et.al. | 2503.12663 | translate | read | null |
| 2025-03-16 | Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset | Yutao Hu et.al. | 2503.12385 | translate | read | null |
| 2025-03-15 | TACO: Taming Diffusion for in-the-wild Video Amodal Completion | Ruijie Lu et.al. | 2503.12049 | translate | read | null |
| 2025-03-14 | Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling | Christopher Xie et.al. | 2503.11806 | translate | read | null |
| 2025-03-14 | EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting | Di Li et.al. | 2503.11345 | translate | read | null |
| 2025-03-14 | Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset | Yibing Weng et.al. | 2503.11342 | translate | read | null |
| 2025-03-13 | Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations | Piyush Gupta et.al. | 2503.10941 | translate | read | null |
| 2025-03-11 | MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation | Anzhe Cheng et.al. | 2503.10686 | translate | read | null |
| 2025-03-13 | TARS: Traffic-Aware Radar Scene Flow Estimation | Jialong Wu et.al. | 2503.10210 | translate | read | null |
| 2025-03-13 | TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness | Mu Chen et.al. | 2503.09941 | translate | read | null |
| 2025-03-12 | Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval | Stefan Sylvius Wagner et.al. | 2503.09867 | translate | read | null |
| 2025-03-11 | Language-Depth Navigated Thermal and Visible Image Fusion | Jinchang Zhang et.al. | 2503.08676 | translate | read | null |
| 2025-03-11 | Generating Robot Constitutions & Benchmarks for Semantic Safety | Pierre Sermanet et.al. | 2503.08663 | translate | read | null |
| 2025-03-11 | Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding | Tim Steinke et.al. | 2503.08474 | translate | read | null |
| 2025-03-11 | TrackOcc: Camera-based 4D Panoptic Occupancy Tracking | Zhuoguang Chen et.al. | 2503.08471 | translate | read | null |
| 2025-03-11 | Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking | Xucheng Guo et.al. | 2503.08370 | translate | read | null |
| 2025-03-11 | DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos | Lorenzo Mur-Labadia et.al. | 2503.08344 | translate | read | null |
| 2025-03-11 | Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving | Runwei Guan et.al. | 2503.08336 | translate | read | null |
| 2025-03-11 | General-Purpose Aerial Intelligent Agents Empowered by Large Language Models | Ji Zhao et.al. | 2503.08302 | translate | read | null |
| 2025-03-10 | FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction | Dennis Rotondi et.al. | 2503.07909 | translate | read | null |
| 2025-03-10 | Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction | Zongzheng Zhang et.al. | 2503.07485 | translate | read | null |
| 2025-03-10 | CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting | Haicheng Liao et.al. | 2503.07234 | translate | read | null |
| 2025-03-10 | A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Xin Wen et.al. | 2503.06960 | translate | read | null |
| 2025-03-10 | LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs | Hanyu Zhou et.al. | 2503.06934 | translate | read | null |
| 2025-03-08 | SplatTalk: 3D VQA with Gaussian Splatting | Anh Thai et.al. | 2503.06271 | translate | read | null |
| 2025-03-08 | Segment Anything, Even Occluded | Wei-En Tai et.al. | 2503.06261 | translate | read | null |
| 2025-03-08 | VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion | Meng Wang et.al. | 2503.06219 | translate | read | null |
| 2025-03-08 | Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images | YingLiang Ma et.al. | 2503.06190 | translate | read | null |
| 2025-03-08 | Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction | Kai Li et.al. | 2503.06161 | translate | read | null |
| 2025-03-08 | Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity | Xiaohao Xu et.al. | 2503.06014 | translate | read | null |
| 2025-03-07 | HexPlane Representation for 3D Semantic Scene Understanding | Zeren Chen et.al. | 2503.05127 | translate | read | null |
| 2025-03-06 | Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning | Victor Sebastian Martinez Pozos et.al. | 2503.04900 | translate | read | null |
| 2025-03-06 | EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images | Rohit Menon et.al. | 2503.04441 | translate | read | null |
| 2025-03-06 | An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Yifei Huang et.al. | 2503.04250 | translate | read | null |
| 2025-03-06 | H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision | Yunxiao Shi et.al. | 2503.04059 | translate | read | null |
| 2025-03-06 | GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding | Xihan Wang et.al. | 2503.04034 | translate | read | null |
| 2025-03-05 | SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection | Devanish N. Kamtam et.al. | 2503.03942 | translate | read | null |
| 2025-03-05 | Vision-Language Models Struggle to Align Entities across Modalities | Iñigo Alonso et.al. | 2503.03854 | translate | read | null |
| 2025-03-05 | Improving 6D Object Pose Estimation of metallic Household and Industry Objects | Thomas Pöllabauer et.al. | 2503.03655 | translate | read | null |
| 2025-03-04 | MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments | Ege Özsoy et.al. | 2503.02579 | translate | read | link |
| 2025-03-04 | Label-Efficient LiDAR Panoptic Segmentation | Ahmet Selim Çanakçı et.al. | 2503.02372 | translate | read | null |
| 2025-03-04 | SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images | Gargi Panda et.al. | 2503.02270 | translate | read | null |
| 2025-03-03 | vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding | Ali Tourani et.al. | 2503.01783 | translate | read | link |
| 2025-03-03 | OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding | Dianyi Yang et.al. | 2503.01646 | translate | read | null |
| 2025-03-03 | Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond | Guanyao Wu et.al. | 2503.01210 | translate | read | link |
| 2025-03-03 | Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations | Junsong Zhang et.al. | 2503.01114 | translate | read | null |
| 2025-03-01 | Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing | Yanjun Li et.al. | 2503.00548 | translate | read | null |
| 2025-03-01 | Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning | Hanxun Yu et.al. | 2503.00513 | translate | read | link |
| 2025-03-04 | Floorplan-SLAM: A Real-Time, High-Accuracy, and Long-Term Multi-Session Point-Plane SLAM for Efficient Floorplan Reconstruction | Haolin Wang et.al. | 2503.00397 | translate | read | null |
(<a href=../Scene_Understanding.md>back to Scene Understanding</a>)