Scene Understanding - 2025-05
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-05-30 | Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | Jiazhong Cen et.al. | 2505.24746 | translate | read | null |
| 2025-05-30 | Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | Duo Zheng et.al. | 2505.24625 | translate | read | link |
| 2025-05-30 | EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding | Ege Özsoy et.al. | 2505.24287 | translate | read | null |
| 2025-05-29 | ConversAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners | Jad Bendarkawi et.al. | 2505.24000 | translate | read | null |
| 2025-05-29 | A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation | Shuzhou Sun et.al. | 2505.23451 | translate | read | null |
| 2025-05-29 | SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model | Bowen Chen et.al. | 2505.23010 | translate | read | null |
| 2025-05-28 | On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation | Liyao Tang et.al. | 2505.22444 | translate | read | null |
| 2025-05-28 | LiDAR Based Semantic Perception for Forklifts in Outdoor Environments | Benjamin Serfling et.al. | 2505.22258 | translate | read | null |
| 2025-05-28 | 3D Question Answering via only 2D Vision-Language Models | Fengyun Wang et.al. | 2505.22143 | translate | read | null |
| 2025-05-29 | DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation | Tianjun Gu et.al. | 2505.21969 | translate | read | null |
| 2025-05-28 | Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs | Insu Lee et.al. | 2505.21955 | translate | read | null |
| 2025-05-27 | A Graph Completion Method that Jointly Predicts Geometry and Topology Enables Effective Molecule Assembly | Rohan V. Koodli et.al. | 2505.21833 | translate | read | null |
| 2025-05-29 | Compositional Scene Understanding through Inverse Generative Modeling | Yanbo Wang et.al. | 2505.21780 | translate | read | null |
| 2025-05-30 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | Keanu Nichols et.al. | 2505.21649 | translate | read | null |
| 2025-05-27 | Assured Autonomy with Neuro-Symbolic Perception | R. Spencer Hallyburton et.al. | 2505.21322 | translate | read | null |
| 2025-05-27 | Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning | Lintao Xu et.al. | 2505.21231 | translate | read | null |
| 2025-05-27 | Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts | Yue Zhang et.al. | 2505.21079 | translate | read | null |
| 2025-05-27 | OccLE: Label-Efficient 3D Semantic Occupancy Prediction | Naiyu Fang et.al. | 2505.20617 | translate | read | null |
| 2025-05-27 | OmniIndoor3D: Comprehensive Indoor 3D Reconstruction | Xiaobao Wei et.al. | 2505.20610 | translate | read | null |
| 2025-05-26 | From Data to Modeling: Fully Open-vocabulary Scene Graph Generation | Zuyao Chen et.al. | 2505.20106 | translate | read | null |
| 2025-05-26 | DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization | Jianxin Huang et.al. | 2505.20041 | translate | read | null |
| 2025-05-26 | Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement | Afrah Shaahid et.al. | 2505.19895 | translate | read | null |
| 2025-05-26 | LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study | Dongil Yang et.al. | 2505.19510 | translate | read | link |
| 2025-05-25 | FHGS: Feature-Homogenized Gaussian Splatting | Q. G. Duan et.al. | 2505.19154 | translate | read | null |
| 2025-05-25 | Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | Md. Mithun Hossain et.al. | 2505.19010 | translate | read | null |
| 2025-05-24 | Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding | Guofeng Mei et.al. | 2505.18819 | translate | read | null |
| 2025-05-24 | Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | Sicheng Feng et.al. | 2505.18675 | translate | read | link |
| 2025-05-23 | SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain | Jiawei Zhou et.al. | 2505.17727 | translate | read | null |
| 2025-05-23 | From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation | Mahmoud Chick Zaouali et.al. | 2505.17402 | translate | read | null |
| 2025-05-22 | Assessing the generalization performance of SAM for ureteroscopy scene understanding | Martin Villagrana et.al. | 2505.17210 | translate | read | null |
| 2025-05-22 | CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation | Haihong Hao et.al. | 2505.16663 | translate | read | link |
| 2025-05-21 | SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval | Nikolaos Chaidos et.al. | 2505.15867 | translate | read | link |
| 2025-05-21 | HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning | Xiaodong Mei et.al. | 2505.15703 | translate | read | null |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Kaiyuan Chen et.al. | 2505.15517 | translate | read | link |
| 2025-05-21 | RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation | Naman Patel et.al. | 2505.15373 | translate | read | null |
| 2025-05-21 | DC-Scene: Data-Centric Learning for 3D Scene Understanding | Ting Huang et.al. | 2505.15232 | translate | read | link |
| 2025-05-19 | ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | Ege Özsoy et.al. | 2505.12890 | translate | read | null |
| 2025-05-19 | AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning | Kai Zhang et.al. | 2505.12782 | translate | read | null |
| 2025-05-19 | Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps | Ziqi Wen et.al. | 2505.12660 | translate | read | null |
| 2025-05-18 | LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding | Hanyu Zhou et.al. | 2505.12253 | translate | read | null |
| 2025-05-18 | SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving | Muleilan Pei et.al. | 2505.12246 | translate | read | null |
| 2025-05-18 | Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | Qingmei Li et.al. | 2505.12207 | translate | read | link |
| 2025-05-18 | Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding | Xuefei Sun et.al. | 2505.12194 | translate | read | null |
| 2025-05-17 | TinyRS-R1: Compact Multimodal Language Model for Remote Sensing | Aybora Koksal et.al. | 2505.12099 | translate | read | null |
| 2025-05-15 | StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation | Daniel A. P. Oliveira et.al. | 2505.10292 | translate | read | link |
| 2025-05-15 | APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds | Yuan Gao et.al. | 2505.09971 | translate | read | link |
| 2025-05-14 | DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection | Jianlin Sun et.al. | 2505.09168 | translate | read | link |
| 2025-05-14 | Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning | Dayong Liang et.al. | 2505.09118 | translate | read | null |
| 2025-05-13 | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Zongchuang Zhao et.al. | 2505.08725 | translate | read | link |
| 2025-05-12 | Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions | Yi Zhang et.al. | 2505.07611 | translate | read | null |
| 2025-05-11 | Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding | Chih-Chung Hsu et.al. | 2505.06991 | translate | read | null |
| 2025-05-11 | Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation | Seokjun Kwon et.al. | 2505.06951 | translate | read | null |
| 2025-05-09 | Camera Control at the Edge with Language Models for Scene Understanding | Alexiy Buynitsky et.al. | 2505.06402 | translate | read | null |
| 2025-05-09 | Camera-Only Bird’s Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles | Anupkumar Bochare et.al. | 2505.06113 | translate | read | null |
| 2025-05-08 | Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization | Sooyoung Park et.al. | 2505.05343 | translate | read | link |
| 2025-05-08 | PADriver: Towards Personalized Autonomous Driving | Genghua Kou et.al. | 2505.05240 | translate | read | null |
| 2025-05-08 | Does CLIP perceive art the same way we do? | Andrea Asperti et.al. | 2505.05229 | translate | read | null |
| 2025-05-07 | GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes | Feng Xiao et.al. | 2505.04659 | translate | read | link |
| 2025-05-07 | RAFT: Robust Augmentation of FeaTures for Image Segmentation | Edward Humes et.al. | 2505.04529 | translate | read | null |
| 2025-05-03 | Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models | Gracjan Góral et.al. | 2505.03821 | translate | read | null |
| 2025-05-06 | MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation | Mingcheng Li et.al. | 2505.02648 | translate | read | null |
| 2025-05-04 | Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Volodymyr Havrylov et.al. | 2505.02075 | translate | read | link |
| 2025-05-04 | Segment Any RGB-Thermal Model with Language-aided Distillation | Dong Xing et.al. | 2505.01950 | translate | read | null |
| 2025-05-02 | Embracing Diffraction: A Paradigm Shift in Wireless Sensing and Communication | Anurag Pallaprolu et.al. | 2505.01625 | translate | read | null |