Scene Understanding
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-18 | MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning | Yuanchen Ju et.al. | 2512.16909 | null |
| 2025-12-18 | SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning | Tin Stribor Sohn et.al. | 2512.16461 | null |
| 2025-12-18 | Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception | Bangya Liu et.al. | 2512.16265 | null |
| 2025-12-16 | Unified Semantic Transformer for 3D Scene Understanding | Sebastian Koch et.al. | 2512.14364 | null |
| 2025-12-16 | Consistent Instance Field for Dynamic Scene Understanding | Junyi Wu et.al. | 2512.14126 | null |
| 2025-12-16 | Deep Learning Perspective of Scene Understanding in Autonomous Robots | Afia Maham et.al. | 2512.14020 | null |
| 2025-12-15 | I-Scene: 3D Instance Models are Implicit Generalizable Spatial Learners | Lu Ling et.al. | 2512.13683 | null |
| 2025-12-15 | MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion | Minghui Hou et.al. | 2512.13177 | null |
| 2025-12-15 | DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass | Vivek Alumootil et.al. | 2512.13122 | null |
| 2025-12-15 | SLIM-VDB: A Real-Time 3D Probabilistic Semantic Mapping Framework | Anja Sheppard et.al. | 2512.12945 | null |
| 2025-12-13 | INDOOR-LiDAR: Bridging Simulation and Reality for Robot-Centric 360 degree Indoor LiDAR Perception – A Robot-Centric Hybrid Dataset | Haichuan Li et.al. | 2512.12377 | null |
| 2025-12-13 | MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding | Benjamin Beilharz et.al. | 2512.12307 | null |
| 2025-12-13 | A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection | Peizheng Li et.al. | 2512.12205 | null |
| 2025-12-13 | Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video | Daniel Adebi et.al. | 2512.12165 | null |
| 2025-12-12 | Evaluating Foundation Models’ 3D Understanding Through Multi-View Correspondence Analysis | Valentina Lilova et.al. | 2512.11574 | null |
| 2025-12-12 | Reconstruction as a Bridge for Event-Based Visual Question Answering | Hanyue Lou et.al. | 2512.11510 | null |
| 2025-12-12 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Emanuel Sánchez Aimar et.al. | 2512.11490 | null |
| 2025-12-10 | LISN: Language-Instructed Social Navigation with VLM-based Controller Modulating | Junting Chen et.al. | 2512.09920 | null |
| 2025-12-09 | SIP: Site in Pieces- A Dataset of Disaggregated Construction-Phase 3D Scans for Semantic Segmentation and Scene Understanding | Seongyong Kim et.al. | 2512.09062 | null |
| 2025-12-09 | LapFM: A Laparoscopic Segmentation Foundation Model via Hierarchical Concept Evolving Pre-training | Qing Xu et.al. | 2512.08439 | null |
| 2025-12-09 | CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning | Zeyuan Chen et.al. | 2512.08135 | null |
| 2025-12-08 | SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery | Meng Cao et.al. | 2512.07733 | null |
| 2025-12-08 | STRinGS: Selective Text Refinement in Gaussian Splatting | Abhinav Raundhal et.al. | 2512.07230 | null |
| 2025-12-08 | A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning | Siyang Jiang et.al. | 2512.07136 | null |
| 2025-12-05 | Physics-Grounded Attached Shadow Detection Using Approximate 3D Geometry and Light Direction | Shilin Hu et.al. | 2512.06179 | null |
| 2025-12-05 | BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving | Karthik Mohan et.al. | 2512.06096 | null |
| 2025-12-05 | Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision | Lennart Maack et.al. | 2512.05740 | null |
| 2025-12-05 | Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction | Ruihong Yin et.al. | 2512.05597 | null |
| 2025-12-05 | VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation | Chinthani Sugandhika et.al. | 2512.05524 | null |
| 2025-12-04 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Xianfeng Wu et.al. | 2512.05060 | null |
| 2025-12-03 | C3G: Learning Compact 3D Representations with 2K Gaussians | Honggyu An et.al. | 2512.04021 | null |
| 2025-12-03 | Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding | Haoran Zhou et.al. | 2512.03601 | null |
| 2025-12-03 | What Is The Best 3D Scene Representation for Robotics? From Geometric to Foundation Models | Tianchen Deng et.al. | 2512.03422 | null |
| 2025-12-03 | ShelfGaussian: Shelf-Supervised Open-Vocabulary Gaussian-based 3D Scene Understanding | Lingjun Zhao et.al. | 2512.03370 | null |
| 2025-12-02 | SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding | Hongpei Zheng et.al. | 2512.03284 | null |
| 2025-11-29 | When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI | Yanhui Li et.al. | 2512.03087 | null |
| 2025-12-02 | Layout Anything: One Transformer for Universal Room Layout Estimation | Md Sohag Mia et.al. | 2512.02952 | null |
| 2025-12-02 | Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding | Yerim Jeon et.al. | 2512.02487 | null |
| 2025-12-02 | HouseLayout3D: A Benchmark and Training-Free Baseline for 3D Layout Estimation in the Wild | Valentin Bieri et.al. | 2512.02450 | null |
| 2025-12-01 | ManualVLA: A Unified VLA Model for Chain-of-Thought Manual Generation and Robotic Manipulation | Chenyang Gu et.al. | 2512.02013 | null |
| 2025-12-01 | OpenREAD: Reinforced Open-Ended Reasoning for End-to-End Autonomous Driving with LLM-as-Critic | Songyan Zhang et.al. | 2512.01830 | null |
| 2025-12-01 | IGen: Scalable Data Generation for Robot Learning from Open-World Images | Chenghao Gu et.al. | 2512.01773 | null |
| 2025-12-01 | SPARK: Sim-ready Part-level Articulated Reconstruction with VLM Knowledge | Yumeng He et.al. | 2512.01629 | null |
| 2025-12-01 | MDiff4STR: Mask Diffusion Model for Scene Text Recognition | Yongkun Du et.al. | 2512.01422 | null |
| 2025-12-01 | VSRD++: Autolabeling for 3D Object Detection via Instance-Aware Volumetric Silhouette Rendering | Zihua Liu et.al. | 2512.01178 | null |
| 2025-11-30 | FOM-Nav: Frontier-Object Maps for Object Goal Navigation | Thomas Chabal et.al. | 2512.01009 | null |
| 2025-11-30 | Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting | Haishan Wang et.al. | 2512.00850 | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et.al. | 2512.00565 | null |
| 2025-11-29 | Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR | Lixing Guo et.al. | 2512.00294 | null |
| 2025-11-28 | DenseScan: Advancing 3D Scene Understanding with 2D Dense Annotation | Zirui Wang et.al. | 2512.00226 | null |
| 2025-10-28 | A Comprehensive Survey on Surgical Digital Twin | Afsah Sharaf Khan et.al. | 2512.00019 | null |
| 2025-11-28 | DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation | Hongfei Zhang et.al. | 2511.23127 | null |
| 2025-11-28 | Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding | Anik De et.al. | 2511.23071 | null |
| 2025-11-28 | HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model | Chen Li et.al. | 2511.22961 | null |
| 2025-11-28 | See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection | YuEun Lee et.al. | 2511.22906 | null |
| 2025-11-27 | GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes | Di Wang et.al. | 2511.22645 | null |
| 2025-11-27 | CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving | Zhaohui Wang et.al. | 2511.22532 | null |
| 2025-11-27 | RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding | Xiyan Liu et.al. | 2511.22466 | null |
| 2025-11-26 | SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding | Tae-Min Choi et.al. | 2511.21339 | null |
| 2025-11-26 | Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding | Yutao Tang et.al. | 2511.21191 | null |
| 2025-11-26 | Scaling Foundation Models for Radar Scene Understanding | Pushkal Mishra et.al. | 2511.21105 | null |
| 2025-11-25 | 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding | Xiaoye Wang et.al. | 2511.20646 | null |
| 2025-11-25 | CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception | Miguel Carvalho et.al. | 2511.19820 | null |
| 2025-11-24 | Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models | Jonathan Lee et.al. | 2511.19526 | null |
| 2025-11-24 | Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving | Jianhua Han et.al. | 2511.19221 | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | null |
| 2025-11-24 | Autonomous Surface Selection For Manipulator-Based UV Disinfection In Hospitals Using Foundation Models | Xueyan Oh et.al. | 2511.18709 | null |
| 2025-11-23 | Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span | Heeseung Yun et.al. | 2511.18470 | null |
| 2025-11-22 | Plan-X: Instruct Video Generation via Semantic Planning | Lun Huang et.al. | 2511.17986 | null |
| 2025-11-21 | CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation | Prantik Howlader et.al. | 2511.17755 | null |
| 2025-11-18 | Unified Low-Light Traffic Image Enhancement via Multi-Stage Illumination Recovery and Adaptive Noise Suppression | Siddiqua Namrah et.al. | 2511.17612 | null |
| 2025-11-21 | SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation | Seamie Hayes et.al. | 2511.17361 | null |
| 2025-11-21 | Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM | Chiori Hori et.al. | 2511.17335 | null |
| 2025-11-20 | POMA-3D: The Point Map Way to 3D Scene Understanding | Ye Mao et.al. | 2511.16567 | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et.al. | 2511.16454 | null |
| 2025-11-20 | Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM | Gergely Dinya et.al. | 2511.16282 | null |
| 2025-11-20 | How Robot Dogs See the Unseeable | Oliver Bimber et.al. | 2511.16262 | null |
| 2025-11-20 | Real-Time 3D Object Detection with Inference-Aligned Learning | Chenyu Zhao et.al. | 2511.16140 | null |
| 2025-11-20 | Click2Graph: Interactive Panoptic Video Scene Graphs from a Single Click | Raphael Ruschel et.al. | 2511.15948 | null |
| 2025-11-19 | WALDO: Where Unseen Model-based 6D Pose Estimation Meets Occlusion | Sajjad Pakdamansavoji et.al. | 2511.15874 | null |
| 2025-11-19 | ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation | Simon Boeder et.al. | 2511.15396 | null |
| 2025-11-19 | Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception | Jiashu Yang et.al. | 2511.15279 | null |
| 2025-11-18 | RocSync: Millisecond-Accurate Temporal Synchronization for Heterogeneous Camera Systems | Jaro Meyer et.al. | 2511.14948 | null |
| 2025-11-18 | Multi-view Phase-aware Pedestrian-Vehicle Incident Reasoning Framework with Vision-Language Models | Hao Zhen et.al. | 2511.14120 | null |
| 2025-11-18 | Error-Driven Scene Editing for 3D Grounding in Large Language Models | Yue Zhang et.al. | 2511.14086 | null |
| 2025-11-18 | RISE: Single Static Radar-based Indoor Scene Understanding | Kaichen Zhou et.al. | 2511.14019 | null |
| 2025-11-17 | VLMs Guided Interpretable Decision Making for Autonomous Driving | Xin Hu et.al. | 2511.13881 | null |
| 2025-11-17 | Is your VLM Sky-Ready? A Comprehensive Spatial Intelligence Benchmark for UAV Navigation | Lingfeng Zhang et.al. | 2511.13269 | null |
| 2025-11-17 | Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving | Jiacheng Tang et.al. | 2511.13079 | null |
| 2025-11-17 | Visual Room 2.0: Seeing is Not Understanding for MLLMs | Haokun Li et.al. | 2511.12928 | null |
| 2025-11-16 | RoboAfford++: A Generative AI-Enhanced Dataset for Multimodal Affordance Learning in Robotic Manipulation and Navigation | Xiaoshuai Hao et.al. | 2511.12436 | null |
| 2025-11-14 | Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy | Vinit Mehta et.al. | 2511.11777 | null |
| 2025-11-13 | ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts | Haowen Jiang et.al. | 2511.11740 | null |
| 2025-11-14 | AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning | Jirong Zha et.al. | 2511.11025 | null |
| 2025-11-13 | DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation | Xuexun Liu et.al. | 2511.10003 | null |
| 2025-11-12 | Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding | Jingtian Ma et.al. | 2511.08978 | null |
| 2025-11-11 | RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation | Hae-Won Jo et.al. | 2511.08651 | null |
| 2025-11-05 | Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants | I. Bailo et.al. | 2511.08609 | null |
| 2025-11-11 | OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition | Lixu Sun et.al. | 2511.08133 | null |
| 2025-11-11 | HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving | Zhiwen Yang et.al. | 2511.07925 | null |
| 2025-11-11 | Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views | Haida Feng et.al. | 2511.07813 | null |
| 2025-11-10 | Inference-Time Scaling of Diffusion Models for Infrared Data Generation | Kai A. Horstmann et.al. | 2511.07362 | null |
| 2025-11-10 | PlanT 2.0: Exposing Biases and Structural Flaws in Closed-Loop Driving | Simon Gerstenecker et.al. | 2511.07292 | null |
| 2025-11-10 | Omni-View: Unlocking How Generation Facilitates Understanding in Unified 3D Model based on Multiview images | JiaKui Hu et.al. | 2511.07222 | null |
| 2025-11-10 | TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding | Duc Nguyen et.al. | 2511.07007 | null |
| 2025-11-10 | PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory | Qunchao Jin et.al. | 2511.06840 | null |
| 2025-11-09 | Video Dataset for Surgical Phase, Keypoint, and Instrument Recognition in Laparoscopic Surgery (PhaKIR) | Tobias Rueckert et.al. | 2511.06549 | null |
| 2025-11-08 | Interaction-Centric Knowledge Infusion and Transfer for Open-Vocabulary Scene Graph Generation | Lin Li et.al. | 2511.05935 | null |
| 2025-11-08 | Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning | Fei Yu et.al. | 2511.05894 | null |
| 2025-11-07 | Lite VLA: Efficient Vision-Language-Action Control on CPU-Bound Edge Robots | Justin Williams et.al. | 2511.05642 | null |
| 2025-11-06 | Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition | Nicholas Babey et.al. | 2511.05622 | null |
| 2025-10-30 | Token Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution | Shiyao Sang et.al. | 2511.05540 | null |
| 2025-11-06 | GraSP-VLA: Graph-based Symbolic Action Representation for Long-Horizon Planning with VLA Policies | Maëlic Neau et.al. | 2511.04357 | null |
| 2025-11-06 | CaRF: Enhancing Multi-View Consistency in Referring 3D Gaussian Splatting Segmentation | Yuwen Tao et.al. | 2511.03992 | null |
| 2025-11-06 | Simple 3D Pose Features Support Human and Machine Social Scene Understanding | Wenshuo Qin et.al. | 2511.03988 | null |
| 2025-11-06 | Room Envelopes: A Synthetic Dataset for Indoor Layout Reconstruction from Images | Sam Bahrami et.al. | 2511.03970 | null |
| 2025-11-05 | SILVI: Simple Interface for Labeling Video Interactions | Ozan Kanbertay et.al. | 2511.03819 | null |
| 2025-11-05 | SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding | Mauro Orazio Drago et.al. | 2511.03325 | null |
| 2025-11-04 | LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation | Gyeom Hwangbo et.al. | 2511.03001 | null |
| 2025-11-04 | DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding | Zixuan Liu et.al. | 2511.02495 | null |
| 2025-11-04 | Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization | Tao Liu et.al. | 2511.02489 | link |
| 2025-11-04 | From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics | Nicolas Schuler et.al. | 2511.02427 | null |
| 2025-11-03 | Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis | Soham Joshi et.al. | 2511.02046 | null |
| 2025-10-31 | The Eigenvalues Entropy as a Classifier Evaluation Measure | Doulaye Dembélé et.al. | 2511.01904 | null |
| 2025-11-03 | A Compact Model for Polar Multiple-Channel Field Effect Transistors: A Case Study in III-V Nitride Semiconductors | Aias Asteris et.al. | 2511.01699 | null |
| 2025-11-03 | Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models | Xiaoyu Zhan et.al. | 2511.01618 | null |
| 2025-11-03 | PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model | Wenqi Liang et.al. | 2511.01571 | null |
| 2025-11-03 | Fast and Robust Remote Two-Qubit Gates on Distributed Qubits | Yunan Li et.al. | 2511.01418 | null |
| 2025-11-03 | A Generative Adversarial Approach to Adversarial Attacks Guided by Contrastive Language-Image Pre-trained Model | Sampriti Soor et.al. | 2511.01317 | null |
| 2025-11-03 | LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping | Lijie Wang et.al. | 2511.01186 | null |
| 2025-11-02 | GauDP: Reinventing Multi-Agent Collaboration through Gaussian-Image Synergy in Diffusion Policies | Ziye Wang et.al. | 2511.00998 | null |
| 2025-11-01 | Grounding Surgical Action Triplets with Instrument Instance Segmentation: A Dataset and Target-Aware Fusion Approach | Oluwatosin Alabi et.al. | 2511.00643 | null |
| 2025-11-01 | CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World | Yating Yu et.al. | 2511.00613 | null |
| 2025-11-01 | Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models | Panwang Pan et.al. | 2511.00503 | link |
| 2025-10-30 | AI Powered High Quality Text to Video Generation with Enhanced Temporal Consistency | Piyushkumar Patel et.al. | 2511.00107 | null |
| 2025-10-31 | Toward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs | Sushil Samuel Dinesh et.al. | 2510.27558 | null |
| 2025-10-31 | NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding | Wei Xu et.al. | 2510.27481 | null |
| 2025-10-31 | Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing | Yijia Wang et.al. | 2510.27335 | null |
| 2025-10-31 | Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis | Weiming Chen et.al. | 2510.27324 | null |
| 2025-10-31 | HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition | Jiacheng Hong et.al. | 2510.27148 | null |
| 2025-10-30 | A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics | Simindokht Jahangard et.al. | 2510.27033 | null |
| 2025-10-30 | The ANUBIS detector and its sensitivity to neutral long-lived particles | ANUBIS Collaboration et.al. | 2510.26932 | null |
| 2025-10-30 | HEIR: Learning Graph-Based Motion Hierarchies | Cheng Zheng et.al. | 2510.26786 | null |
| 2025-10-30 | Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios | Manjunath Prasad Holenarasipura Rajiv et.al. | 2510.26580 | null |
| 2025-10-30 | AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM | Mirko Usuelli et.al. | 2510.26358 | null |
| 2025-10-30 | GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model? | Mingyu Sung et.al. | 2510.26339 | null |
| 2025-10-30 | Letter of Intent: The Forward Physics Facility | Luis A. Anchordoqui et.al. | 2510.26260 | null |
| 2025-10-30 | Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM | Ali Caglayan et.al. | 2510.26131 | null |
| 2025-10-29 | Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks | Xu Zheng et.al. | 2510.25760 | link |
| 2025-10-29 | More than a Moment: Towards Coherent Sequences of Audio Descriptions | Eshika Khandelwal et.al. | 2510.25440 | null |
| 2025-10-29 | U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching | Junsheng Zhou et.al. | 2510.25210 | null |
| 2025-10-29 | EA3D: Online Open-World 3D Object Extraction from Streaming Videos | Xiaoyu Zhou et.al. | 2510.25146 | null |
| 2025-10-29 | Learning Spatial-Aware Manipulation Ordering | Yuxiang Yan et.al. | 2510.25138 | null |
| 2025-10-29 | Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments | Manjunath Prasad Holenarasipura Rajiv et.al. | 2510.25070 | null |
| 2025-10-28 | VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos | Qiucheng Wu et.al. | 2510.24904 | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | link |
| 2025-10-28 | Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes | Jonas Hein et.al. | 2510.24332 | null |
| 2025-10-28 | Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning | Aodi Wu et.al. | 2510.24152 | null |
| 2025-10-27 | Optimized Loudspeaker Panning for Adaptive Sound-Field Correction and Non-stationary Listening Areas | Yuancheng Luo et.al. | 2510.23937 | null |
| 2025-10-27 | DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning | Eddison Pham et.al. | 2510.23907 | null |
| 2025-10-27 | Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations | Yujia Zhang et.al. | 2510.23607 | link |
| 2025-10-27 | PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity | Yuqian Yuan et.al. | 2510.23603 | link |
| 2025-10-27 | InFlux: A Benchmark for Self-Calibration of Dynamic Intrinsics of Video Cameras | Erich Liang et.al. | 2510.23589 | null |
| 2025-10-27 | Localising under the drape: proprioception in the era of distributed surgical robotic system | Martin Huber et.al. | 2510.23512 | null |
| 2025-10-27 | UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception | Karthikeyan Chandra Sekaran et.al. | 2510.23478 | null |
| 2025-10-27 | Evaluation of Spherical Wavelet Framework in Comparsion with Ambisonics | Ş. Ekmen et.al. | 2510.23403 | null |
| 2025-10-27 | Evaluation of Vision-LLMs in Surveillance Video | Pascal Benschop et.al. | 2510.23190 | null |
| 2025-10-27 | Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI | Aryan Mathur et.al. | 2510.23148 | null |
| 2025-10-27 | SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency | Quanjian Song et.al. | 2510.22994 | null |
| 2025-10-27 | Charting the Design Space of Neural Graph Representations for Subgraph Matching | Vaibhav Raj et.al. | 2510.22897 | null |
| 2025-10-26 | IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction | Hao Li et.al. | 2510.22706 | link |
| 2025-10-26 | Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views | Anna Deichler et.al. | 2510.22672 | null |
| 2025-10-25 | BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles | Seyed Ahmad Hosseini Miangoleh et.al. | 2510.22370 | null |
| 2025-10-25 | Bridging Perception and Reasoning: Dual-Pipeline Neuro-Symbolic Landing for UAVs in Cluttered Environments | Weixian Qian et.al. | 2510.22204 | null |
| 2025-10-25 | MOGRAS: Human Motion with Grasping in 3D Scenes | Kunal Bhosikar et.al. | 2510.22199 | null |
| 2025-10-25 | LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction | Yuhang Gao et.al. | 2510.22141 | null |
| 2025-10-25 | CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding | Lihuang Fang et.al. | 2510.22119 | null |
| 2025-10-07 | Avi: Action from Volumetric Inference | Harris Song et.al. | 2510.21746 | null |
| 2025-10-24 | OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields | Lisa Weijler et.al. | 2510.21441 | null |
| 2025-10-24 | ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models | Pranav Saxena et.al. | 2510.21069 | null |
| 2025-10-22 | Uncertainty evaluation of segmentation models for Earth observation | Melanie Rey et.al. | 2510.19586 | null |
| 2025-10-22 | Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization | Juncheng Wang et.al. | 2510.19330 | null |
| 2025-10-21 | Event-Grounding Graph: Unified Spatio-Temporal Scene Graph from Robotic Observations | Phuoc Nguyen et.al. | 2510.18697 | null |
| 2025-10-21 | MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning | Wenhui Huang et.al. | 2510.18337 | null |
| 2025-10-21 | UWBench: A Comprehensive Vision-Language Benchmark for Underwater Understanding | Da Zhang et.al. | 2510.18262 | null |
| 2025-10-21 | OpenInsGaussian: Open-vocabulary Instance Gaussian Segmentation with Context-aware Cross-view Fusion | Tianyu Huang et.al. | 2510.18253 | null |
| 2025-10-20 | Enhanced Motion Forecasting with Plug-and-Play Multimodal Large Language Models | Katie Luo et.al. | 2510.17274 | null |
| 2025-10-19 | SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes | Xiongkun Linghu et.al. | 2510.16714 | null |
| 2025-10-18 | Structured Interfaces for Automated Reasoning with 3D Scene Graphs | Aaron Ray et.al. | 2510.16643 | null |
| 2025-10-11 | ESCA: Contextualizing Embodied Agents via Scene-Graph Generation | Jiani Huang et.al. | 2510.15963 | null |
| 2025-10-07 | GAZE: Governance-Aware pre-annotation for Zero-shot World Model Environments | Leela Krishna et.al. | 2510.14992 | null |
| 2025-10-16 | QuASH: Using Natural-Language Heuristics to Query Visual-Language Robotic Maps | Matti Pekkanen et.al. | 2510.14546 | null |
| 2025-10-15 | Efficient Few-Shot Learning in Remote Sensing: Fusing Vision and Vision-Language Models | Jia Yun Chua et.al. | 2510.13993 | null |
| 2025-10-15 | SWIR-LightFusion: Multi-spectral Semantic Fusion of Synthetic SWIR with Thermal IR (LWIR/MWIR) and RGB | Muhammad Ishfaq Hussain et.al. | 2510.13404 | null |
| 2025-10-15 | FlyAwareV2: A Multimodal Cross-Domain UAV Dataset for Urban Scene Understanding | Francesco Barbato et.al. | 2510.13243 | null |
| 2025-10-14 | VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages | Jesse Atuhurra et.al. | 2510.12845 | null |
| 2025-10-14 | SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding | Zhiliu Yang et.al. | 2510.12749 | null |
| 2025-10-13 | PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation | Hatem Ibrahem et.al. | 2510.11992 | null |
| 2025-10-13 | PhySIC: Physically Plausible 3D Human-Scene Interaction and Contact from a Single Image | Pradyumna Yalandur Muralidhar et.al. | 2510.11649 | null |
| 2025-10-13 | A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation | Denis Zavadski et.al. | 2510.11567 | null |
| 2025-10-13 | mmWalk: Towards Multi-modal Multi-view Walking Assistance | Kedi Ying et.al. | 2510.11520 | null |
| 2025-10-13 | REACT3D: Recovering Articulations for Interactive Physical 3D Scenes | Zhao Huang et.al. | 2510.11340 | null |
| 2025-10-12 | Real2USD: Scene Representations in Universal Scene Description Language | Christopher D. Hsu et.al. | 2510.10778 | null |
| 2025-10-11 | B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding | Feng Xiao et.al. | 2510.10194 | null |
| 2025-10-10 | CFVBench: A Comprehensive Video Benchmark for Fine-grained Multimodal Retrieval-Augmented Generation | Kaiwen Wei et.al. | 2510.09266 | null |
| 2025-10-08 | Out-of-Distribution Detection in LiDAR Semantic Segmentation Using Epistemic Uncertainty from Hierarchical GMMs | Hanieh Shojaei Miandashti et.al. | 2510.08631 | null |
| 2025-10-03 | Beyond CNNs: Efficient Fine-Tuning of Multi-Modal LLMs for Object Detection on Low-Data Regimes | Nirmal Elamon et.al. | 2510.08589 | null |
| 2025-10-09 | The impact of abstract and object tags on image privacy classification | Darya Baranouskaya et.al. | 2510.07976 | null |
| 2025-10-09 | CVD-STORM: Cross-View Video Diffusion with Spatial-Temporal Reconstruction Model for Autonomous Driving | Tianrui Zhang et.al. | 2510.07944 | link |
| 2025-10-09 | An End-to-End Room Geometry Constrained Depth Estimation Framework for Indoor Panorama Images | Kanglin Ning et.al. | 2510.07817 | null |
| 2025-10-07 | Mitigating Surgical Data Imbalance with Dual-Prediction Video Diffusion Model | Danush Kumar Venkatesh et.al. | 2510.07345 | null |
| 2025-10-08 | Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion | Jie Luo et.al. | 2510.06687 | null |
| 2025-10-07 | When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach | Daniel Gonzálbez-Biosca et.al. | 2510.05661 | null |
| 2025-10-07 | HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video | Hongchi Xia et.al. | 2510.05560 | link |
| 2025-10-06 | Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction | Chi Yan et.al. | 2510.04759 | link |
| 2025-10-02 | LadderMoE: Ladder-Side Mixture of Experts Adapters for Bronze Inscription Recognition | Rixin Zhou et.al. | 2510.01651 | null |
| 2025-10-01 | VL-KnG: Visual Scene Understanding for Navigation Goal Identification using Spatiotemporal Knowledge Graphs | Mohamad Al Mdfaa et.al. | 2510.01483 | null |
| 2025-09-30 | Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification | Artur Barros et.al. | 2509.26457 | null |
| 2025-09-30 | Neighbor-aware informal settlement mapping with graph convolutional networks | Thomas Hallopeau et.al. | 2509.26171 | null |
| 2025-09-30 | Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models | Yuansen Liu et.al. | 2509.26165 | link |
| 2025-09-30 | EasyOcc: 3D Pseudo-Label Supervision for Fully Self-Supervised Semantic Occupancy Prediction Models | Seamie Hayes et.al. | 2509.26087 | null |
| 2025-09-30 | VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs | Peng Liu et.al. | 2509.25916 | null |
| 2025-09-29 | PAD3R: Pose-Aware Dynamic 3D Reconstruction from Casual Videos | Ting-Hsuan Liao et.al. | 2509.25183 | null |
| 2025-09-29 | Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs | Yue Zhang et.al. | 2509.25139 | null |
| 2025-09-29 | Social 3D Scene Graphs: Modeling Human Actions and Relations for Interactive Service Robots | Ermanno Bartoli et.al. | 2509.24966 | null |
| 2025-09-29 | CORE-3D: Context-aware Open-vocabulary Retrieval by Embeddings in 3D | Mohamad Amin Mirzaei et.al. | 2509.24528 | null |
| 2025-09-29 | PhysiAgent: An Embodied Agent Framework in Physical World | Zhihao Wang et.al. | 2509.24524 | null |
| 2025-09-29 | Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy | Haijier Chen et.al. | 2509.24385 | null |
| 2025-09-29 | Robust Partial 3D Point Cloud Registration via Confidence Estimation under Global Context | Yongqiang Wang et.al. | 2509.24275 | null |
| 2025-09-28 | FUSAR-KLIP: Towards Multimodal Foundation Models for Remote Sensing | Yi Yang et.al. | 2509.23927 | null |
| 2025-09-28 | Uni4D-LLM: A Unified SpatioTemporal-Aware VLM for 4D Understanding and Generation | Hanyu Zhou et.al. | 2509.23828 | null |
| 2025-09-28 | From Static to Dynamic: a Survey of Topology-Aware Perception in Autonomous Driving | Yixiao Chen et.al. | 2509.23641 | null |
| 2025-09-28 | From Fields to Splats: A Cross-Domain Survey of Real-Time Neural Scene Representations | Javed Ahmad et.al. | 2509.23555 | null |
| 2025-09-26 | Good Weights: Proactive, Adaptive Dead Reckoning Fusion for Continuous and Robust Visual SLAM | Yanwei Du et.al. | 2509.22910 | null |
| 2025-09-20 | Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment | Abhiroop Chatterjee et.al. | 2509.22697 | null |
| 2025-09-26 | UrbanFeel: A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective | Jun He et.al. | 2509.22228 | null |
| 2025-09-26 | Lightweight Structured Multimodal Reasoning for Clinical Scene Understanding in Robotics | Saurav Jha et.al. | 2509.22014 | null |
| 2025-09-26 | Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding | Vahid Mirjalili et.al. | 2509.21922 | null |
| 2025-09-25 | Real-Time Indoor Object SLAM with LLM-Enhanced Priors | Yang Jiao et.al. | 2509.21602 | null |
| 2025-09-25 | Residual Vector Quantization For Communication-Efficient Multi-Agent Perception | Dereje Shenkut et.al. | 2509.21464 | null |
| 2025-09-23 | TUN3D: Towards Real-World Scene Understanding from Unposed Images | Anton Konushin et.al. | 2509.21388 | link |
| 2025-09-25 | DENet: Dual-Path Edge Network with Global-Local Attention for Infrared Small Target Detection | Jiayi Zuo et.al. | 2509.20701 | null |
| 2025-09-23 | SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment | Binod Singh et.al. | 2509.20401 | null |
| 2025-09-24 | Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning | Xun Li et.al. | 2509.20077 | null |
| 2025-09-24 | OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving | Pei Liu et.al. | 2509.19973 | null |
| 2025-09-23 | Category-Level Object Shape and Pose Estimation in Less Than a Millisecond | Lorenzo Shaikewitz et.al. | 2509.18979 | null |
| 2025-09-23 | Eva-VLA: Evaluating Vision-Language-Action Models’ Robustness Under Real-World Physical Variations | Hanqing Liu et.al. | 2509.18953 | null |
| 2025-09-23 | Surgical Video Understanding with Label Interpolation | Garam Kim et.al. | 2509.18802 | null |
| 2025-09-23 | MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning | Omar Rayyan et.al. | 2509.18757 | null |
| 2025-09-23 | PIE: Perception and Interaction Enhanced End-to-End Motion Planning for Autonomous Driving | Chengran Yuan et.al. | 2509.18609 | null |
| 2025-09-22 | Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration | Zhitao Zeng et.al. | 2509.17429 | null |
| 2025-09-20 | Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding | Haoyuan Li et.al. | 2509.16721 | null |
| 2025-09-20 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Xiaoyang Yan et.al. | 2509.16552 | null |
| 2025-09-19 | Towards Sharper Object Boundaries in Self-Supervised Depth Estimation | Aurélien Cecille et.al. | 2509.15987 | null |
| 2025-09-19 | RangeSAM: Leveraging Visual Foundation Models for Range-View represented LiDAR segmentation | Paul Julius Kühn et.al. | 2509.15886 | null |
| 2025-09-19 | SAMPO: Scale-wise Autoregression with Motion PrOmpt for generative world models | Sen Wang et.al. | 2509.15536 | null |
| 2025-09-18 | Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems | Yicheng Zhang et.al. | 2509.15213 | null |
| 2025-09-18 | SPATIALGEN: Layout-guided 3D Indoor Scene Generation | Chuan Fang et.al. | 2509.14981 | link |
| 2025-09-16 | Semantic 3D Reconstructions with SLAM for Central Airway Obstruction | Ayberk Acar et.al. | 2509.13541 | null |
| 2025-09-16 | ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors | Romain Hardy et.al. | 2509.13525 | null |
| 2025-09-16 | 3D Aware Region Prompted Vision Language Model | An-Chieh Cheng et.al. | 2509.13317 | null |
| 2025-09-16 | Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving | Ruibo Li et.al. | 2509.13116 | null |
| 2025-09-16 | Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings | Abdalla Arafa et.al. | 2509.12938 | null |
| 2025-09-16 | MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization | Yiyi Zhang et.al. | 2509.12893 | null |
| 2025-09-15 | RailSafeNet: Visual Scene Understanding for Tram Safety | Ondřej Valach et.al. | 2509.12125 | link |
| 2025-09-15 | Microsurgical Instrument Segmentation for Robot-Assisted Surgery | Tae Kyeong Jeong et.al. | 2509.11727 | null |
| 2025-09-15 | See What I Mean? Mobile Eye-Perspective Rendering for Optical See-through Head-mounted Displays | Gerlinde Emsenhuber et.al. | 2509.11653 | null |
| 2025-09-14 | Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision | Tianyao Sun et.al. | 2509.11476 | null |
| 2025-09-14 | DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation | Yunheng Wang et.al. | 2509.11197 | null |
| 2025-09-14 | 3DAeroRelief: The first 3D Benchmark UAV Dataset for Post-Disaster Assessment | Nhut Le et.al. | 2509.11097 | null |
| 2025-09-13 | OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds | Chongyu Wang et.al. | 2509.10842 | null |
| 2025-09-12 | Multimodal SAM-adapter for Semantic Segmentation | Iacopo Curti et.al. | 2509.10408 | null |
| 2025-09-10 | SocialNav-SUB: Benchmarking VLMs for Scene Understanding in Social Robot Navigation | Michael J. Munje et.al. | 2509.08757 | null |
| 2025-09-09 | OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics | Yinan Deng et.al. | 2509.07500 | null |
| 2025-09-09 | DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis | Sven Kirchner et.al. | 2509.07463 | null |
| 2025-09-08 | Synesthesia of Machines (SoM)-Aided LiDAR Point Cloud Transmission for Collaborative Perception | Ensong Liu et.al. | 2509.06506 | null |
| 2025-09-07 | UNO: Unifying One-stage Video Scene Graph Generation via Object-Centric Visual Representation Learning | Huy Le et.al. | 2509.06165 | null |
| 2025-09-06 | Depth-Aware Super-Resolution via Distance-Adaptive Variational Formulation | Tianhao Guo et.al. | 2509.05746 | null |
| 2025-09-05 | SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing | Chaolei Wang et.al. | 2509.05144 | null |
| 2025-09-03 | Reg3D: Reconstructive Geometry Instruction Tuning for 3D Scene Understanding | Hongpei Zheng et.al. | 2509.03635 | null |
| 2025-09-03 | Rashomon in the Streets: Explanation Ambiguity in Scene Understanding | Helge Spieker et.al. | 2509.03169 | null |
| 2025-09-02 | Generalizable Skill Learning for Construction Robots with Crowdsourced Natural Language Instructions, Composable Skills Standardization, and Large Language Model | Hongrui Yu et.al. | 2509.02876 | null |
| 2025-09-02 | SynthGenNet: a self-supervised approach for test-time generalization using synthetic multi-source domain mixing of street view images | Pushpendra Dhakara et.al. | 2509.02287 | null |
| 2025-09-02 | Omnidirectional Spatial Modeling from Correlated Panoramas | Xinshen Zhang et.al. | 2509.02164 | null |
| 2025-09-02 | AI-Driven Marine Robotics: Emerging Trends in Underwater Perception and Ecosystem Monitoring | Scarlett Raine et.al. | 2509.01878 | null |
| 2025-09-01 | Articulated Object Estimation in the Wild | Abdelrhman Werby et.al. | 2509.01708 | null |
| 2025-09-01 | Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation | Maëlic Neau et.al. | 2509.01209 | null |
| 2025-08-31 | SWAGSplatting: Semantic-guided Water-scene Augmented Gaussian Splatting | Zhuodong Jiang et.al. | 2509.00800 | null |
| 2025-08-31 | OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving | Pei Liu et.al. | 2509.00789 | null |
| 2025-08-30 | ConceptBot: Enhancing Robot’s Autonomy through Task Decomposition with Large Language Models and Knowledge Graph | Alessandro Leanza et.al. | 2509.00570 | null |
| 2025-08-29 | Beyond Pixels: Introducing Geometric-Semantic World Priors for Video-based Embodied Models via Spatio-temporal Alignment | Jinzhou Tang et.al. | 2509.00210 | null |
| 2025-08-18 | 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving | Ali K. AlShami et.al. | 2508.21080 | null |
| 2025-08-27 | Hyperspectral Sensors and Autonomous Driving: Technologies, Limitations, and Opportunities | Imad Ali Shah et.al. | 2508.19905 | null |
| 2025-08-27 | Context-Aware Risk Estimation in Home Environments: A Probabilistic Framework for Service Robots | Sena Ishii et.al. | 2508.19788 | null |
| 2025-08-27 | LabelGS: Label-Aware 3D Gaussian Splatting for 3D Scene Segmentation | Yupeng Zhang et.al. | 2508.19699 | link |
| 2025-08-27 | Scalable Object Detection in the Car Interior With Vision Foundation Models | Bálint Mészáros et.al. | 2508.19651 | null |
| 2025-08-25 | ArgusCogito: Chain-of-Thought for Cross-Modal Synergy and Omnidirectional Reasoning in Camouflaged Object Segmentation | Jianwen Tan et.al. | 2508.18050 | null |
| 2025-08-25 | HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation | Xiping Wang et.al. | 2508.17832 | null |
| 2025-08-24 | Investigating Domain Gaps for Indoor 3D Object Detection | Zijing Zhao et.al. | 2508.17439 | null |
| 2025-08-24 | An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing | Zihan Liang et.al. | 2508.17435 | null |
| 2025-08-24 | SEER-VAR: Semantic Egocentric Environment Reasoner for Vehicle Augmented Reality | Yuzhi Lai et.al. | 2508.17255 | null |
| 2025-08-24 | Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding | Yunxiang Yang et.al. | 2508.17205 | null |
| 2025-08-23 | PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models | Xianjing Cheng et.al. | 2508.17050 | null |
| 2025-08-22 | HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction | Sara Rojas et.al. | 2508.16433 | null |
| 2025-08-21 | ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification | Bochao Sun et.al. | 2508.15632 | null |
| 2025-08-19 | Hybrelighter: Combining Deep Anisotropic Diffusion and Scene Reconstruction for On-device Real-time Relighting in Mixed Reality | Hanwen Zhao et.al. | 2508.14930 | null |
| 2025-08-20 | MoVieDrive: Multi-Modal Multi-View Urban Scene Video Generation | Guile Wu et.al. | 2508.14327 | null |
| 2025-08-19 | GALA: Guided Attention with Language Alignment for Open Vocabulary Gaussian Splatting | Elena Alegret et.al. | 2508.14278 | null |
| 2025-08-19 | ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving | Xianda Guo et.al. | 2508.13977 | null |
| 2025-08-19 | Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference | Yunxiang Yang et.al. | 2508.13439 | null |
| 2025-08-17 | PreSem-Surf: RGB-D Surface Reconstruction with Progressive Semantic Modeling and SG-MLP Pre-Rendering Mechanism | Yuyan Ye et.al. | 2508.13228 | null |
| 2025-08-17 | LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving | Nan Song et.al. | 2508.12404 | null |
| 2025-08-17 | Splat Feature Solver | Butian Xiong et.al. | 2508.12216 | null |
| 2025-08-16 | InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes | Hongyuan Liu et.al. | 2508.12015 | null |
| 2025-08-14 | Advancing 3D Scene Understanding with MV-ScanQA Multi-View Reasoning Evaluation and TripAlign Pre-training Dataset | Wentao Mo et.al. | 2508.11058 | null |
| 2025-08-13 | Semantic-aware DropSplat: Adaptive Pruning of Redundant Gaussians for 3D Aerial-View Segmentation | Xu Tang et.al. | 2508.09626 | null |
| 2025-08-12 | Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment | Shi-Chen Zhang et.al. | 2508.08811 | null |
| 2025-08-11 | SAGOnline: Segment Any Gaussians Online | Wentao Sun et.al. | 2508.08219 | null |
| 2025-08-11 | TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking | Tony Danjun Wang et.al. | 2508.07968 | null |
| 2025-08-11 | DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models | Licheng Zhang et.al. | 2508.07714 | null |
| 2025-08-10 | Understanding Dynamic Scenes in Ego Centric 4D Point Clouds | Junsheng Huang et.al. | 2508.07251 | null |
| 2025-08-05 | Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images | Qi Xun Yeo et.al. | 2508.06546 | null |
| 2025-08-07 | VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments | Kaiser Hamid et.al. | 2508.05852 | null |
| 2025-08-07 | Point cloud segmentation for 3D Clothed Human Layering | Davide Garavaso et.al. | 2508.05531 | null |
| 2025-08-07 | EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery | Bingyu Yang et.al. | 2508.05205 | null |
| 2025-08-07 | A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding | Mahmoud Chick Zaouali et.al. | 2508.05064 | null |
| 2025-08-07 | TRKT: Weakly Supervised Dynamic Scene Graph Generation with Temporal-enhanced Relation-aware Knowledge Transferring | Zhu Xu et.al. | 2508.04943 | null |
| 2025-08-06 | PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment | Gustav Hanning et.al. | 2508.04659 | null |
| 2025-08-05 | SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision | Zhaoxu Li et.al. | 2508.03177 | null |
| 2025-08-05 | CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation | Lekang Wen et.al. | 2508.03060 | null |
| 2025-08-04 | FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation | Cui Miao et.al. | 2508.02190 | null |
| 2025-08-04 | GaussianCross: Cross-modal Self-supervised 3D Representation Learning via Gaussian Splatting | Lei Yao et.al. | 2508.02172 | null |
| 2025-08-03 | DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion | Zhigang Sun et.al. | 2508.01778 | null |
| 2025-08-03 | AG$^2$aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing | Zhaonan Wang et.al. | 2508.01740 | null |
| 2025-08-03 | Dynamic Robot-Assisted Surgery with Hierarchical Class-Incremental Semantic Segmentation | Julia Hindel et.al. | 2508.01713 | null |
| 2025-08-02 | TEACH: Text Encoding as Curriculum Hints for Scene Text Recognition | Xiahan Yang et.al. | 2508.01153 | null |
| 2025-08-02 | OpenGS-Fusion: Open-Vocabulary Dense Mapping with Hybrid 3D Gaussian Splatting for Refined Object-Level Understanding | Dianyi Yang et.al. | 2508.01150 | null |
| 2025-08-01 | 3D Reconstruction via Incremental Structure From Motion | Muhammad Zeeshan et.al. | 2508.01019 | null |
| 2025-08-01 | Cooperative Perception: A Resource-Efficient Framework for Multi-Drone 3D Scene Reconstruction Using Federated Diffusion and NeRF | Massoud Pourmandi et.al. | 2508.00967 | null |
| 2025-07-31 | Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs | Bhavya Goyal et.al. | 2508.00169 | null |
| 2025-07-31 | 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding | Ting Huang et.al. | 2507.23478 | null |
| 2025-07-31 | FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models | Yiming Yang et.al. | 2507.23325 | null |
| 2025-07-31 | FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning | Jiajun Cao et.al. | 2507.23318 | null |
| 2025-07-30 | DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion | Qingcheng Zhao et.al. | 2507.22825 | null |
| 2025-07-30 | UAVScenes: A Multi-Modal Dataset for UAVs | Sijie Wang et.al. | 2507.22412 | null |
| 2025-07-29 | EIFNet: Leveraging Event-Image Fusion for Robust Semantic Segmentation | Zhijiang Li et.al. | 2507.21971 | null |
| 2025-07-28 | GTAD: Global Temporal Aggregation Denoising Learning for 3D Semantic Occupancy Prediction | Tianhao Li et.al. | 2507.20963 | null |
| 2025-07-28 | Compositional Video Synthesis by Temporal Object-Centric Learning | Adil Kaan Akan et.al. | 2507.20855 | null |
| 2025-07-27 | VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving | Levente Tempfli et.al. | 2507.20397 | null |
| 2025-07-27 | Solving Scene Understanding for Autonomous Navigation in Unstructured Environments | Naveen Mathews Renji et.al. | 2507.20389 | null |
| 2025-07-26 | FROSS: Faster-than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images | Hao-Yu Hou et.al. | 2507.19993 | null |
| 2025-07-26 | UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block | Luoxi Jing et.al. | 2507.19948 | null |
| 2025-07-26 | RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection | Xiaokai Bai et.al. | 2507.19856 | null |
| 2025-07-26 | Taking Language Embedded 3D Gaussian Splatting into the Wild | Yuze Wang et.al. | 2507.19830 | null |
| 2025-07-25 | Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing | Haichuan Li et.al. | 2507.19691 | null |
| 2025-07-25 | VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions | Haoang Lu et.al. | 2507.19188 | null |
| 2025-07-24 | Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting | Xingyu Miao et.al. | 2507.18678 | null |
| 2025-07-23 | From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding | Anna-Maria Halacheva et.al. | 2507.17585 | null |
| 2025-07-23 | IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Bird’s-Eye View Perception | Haichuan Li et.al. | 2507.17445 | null |
| 2025-07-22 | ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension | Yizhi Hu et.al. | 2507.16877 | null |
| 2025-07-22 | Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge | Tobias Rueckert et.al. | 2507.16559 | null |
| 2025-07-22 | Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach | Jon Gutiérrez-Zaballa et.al. | 2507.16556 | null |
| 2025-07-22 | DenseSR: Image Shadow Removal as Dense Prediction | Yu-Fan Lin et.al. | 2507.16472 | link |
| 2025-07-21 | Label tree semantic losses for rich multi-class medical image segmentation | Junwen Wang et.al. | 2507.15777 | null |
| 2025-07-21 | Towards Holistic Surgical Scene Graph | Jongmin Shin et.al. | 2507.15541 | null |
| 2025-07-21 | ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting | Ruijie Zhu et.al. | 2507.15454 | link |
| 2025-07-21 | VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving | Haichao Liu et.al. | 2507.15266 | null |
| 2025-07-19 | DiSCO-3D: Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF | Doriand Petit et.al. | 2507.14596 | null |
| 2025-07-19 | Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions | Jintang Xue et.al. | 2507.14555 | null |
| 2025-07-19 | Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025 | Sujata Gaihre et.al. | 2507.14544 | null |
| 2025-07-19 | CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding | Zhou Chen et.al. | 2507.14426 | null |
| 2025-07-18 | Semantic Segmentation based Scene Understanding in Autonomous Vehicles | Ehsan Rassekh et.al. | 2507.14303 | null |
| 2025-07-18 | Moving Object Detection from Moving Camera Using Focus of Expansion Likelihood and Segmentation | Masahiro Ogawa et.al. | 2507.13628 | null |
| 2025-07-17 | Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection | Jingyao Wang et.al. | 2507.13061 | null |
| 2025-07-17 | Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models | Yifan Xu et.al. | 2507.12916 | null |
| 2025-07-17 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | Penglei Sun et.al. | 2507.12795 | null |
| 2025-07-16 | Funnel-HOI: Top-Down Perception for Zero-Shot HOI Detection | Sandipan Sarma et.al. | 2507.12628 | null |
| 2025-07-15 | Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis | Maciej Szankin et.al. | 2507.11730 | null |
| 2025-07-15 | Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander | Li Wang et.al. | 2507.11079 | null |
| 2025-07-15 | Learning to Tune Like an Expert: Interpretable and Scene-Aware Navigation via MLLM Reasoning and CVAE-Based Adaptation | Yanbo Wang et.al. | 2507.11001 | null |
| 2025-07-14 | Static or Temporal? Semantic Scene Simplification to Aid Wayfinding in Immersive Simulations of Bionic Vision | Justin M. Kasowski et.al. | 2507.10813 | null |
| 2025-07-14 | EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Mingxian Lin et.al. | 2507.10548 | link |
| 2025-07-13 | VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding | Younggun Kim et.al. | 2507.09815 | null |
| 2025-07-13 | Self-supervised Pretraining for Integrated Prediction and Planning of Automated Vehicles | Yangang Ren et.al. | 2507.09537 | null |
| 2025-07-12 | Fast3D: Accelerating 3D Multi-modal Large Language Models for Efficient 3D Scene Understanding | Wencan Huang et.al. | 2507.09334 | null |
| 2025-07-12 | THYME: Temporal Hierarchical-Cyclic Interactivity Modeling for Video Scene Graphs in Aerial Footage | Trong-Thuan Nguyen et.al. | 2507.09200 | null |
| 2025-07-12 | Towards Spatial Audio Understanding via Question Answering | Parthasaarathy Sudarsanam et.al. | 2507.09195 | null |
| 2025-07-12 | On the Fragility of Multimodal Perception to Temporal Misalignment in Autonomous Driving | Md Hasan Shahriar et.al. | 2507.09095 | null |
| 2025-07-10 | OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | JingLi Lin et.al. | 2507.07984 | link |
| 2025-07-10 | MUVOD: A Novel Multi-view Video Object Segmentation Dataset and A Benchmark for 3D Segmentation | Bangning Wei et.al. | 2507.07519 | null |
| 2025-07-09 | SemRaFiner: Panoptic Segmentation in Sparse and Noisy Radar Point Clouds | Matthias Zeller et.al. | 2507.06906 | null |
| 2025-07-09 | Token Bottleneck: One Token to Remember Dynamics | Taekyung Kim et.al. | 2507.06543 | link |
| 2025-07-09 | What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies | Yaoqi Huang et.al. | 2507.06513 | null |
| 2025-07-08 | Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion | Aleksandar Jevtić et.al. | 2507.06230 | link |
| 2025-07-08 | SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning | Xin Hu et.al. | 2507.05798 | null |
| 2025-07-07 | All in One: Visual-Description-Guided Unified Point Cloud Segmentation | Zongyan Han et.al. | 2507.05211 | null |
| 2025-07-07 | MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding | Jing Liang et.al. | 2507.04686 | null |
| 2025-07-05 | Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation | Ziyu Zhu et.al. | 2507.04047 | null |
| 2025-07-05 | Habitat Classification from Ground-Level Imagery Using Deep Neural Networks | Hongrui Shi et.al. | 2507.04017 | null |
| 2025-07-04 | Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds | Matthias Zeller et.al. | 2507.03463 | null |
| 2025-07-03 | LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans | Zhening Huang et.al. | 2507.02861 | link |
| 2025-07-03 | LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion | Fangfu Liu et.al. | 2507.02813 | link |
| 2025-07-03 | SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment | Qi Xu et.al. | 2507.02705 | link |
| 2025-07-04 | Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach | Elena Ryumina et.al. | 2507.02205 | link |
| 2025-07-02 | ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning | Xiao Wang et.al. | 2507.02200 | null |
| 2025-07-02 | ECCV 2024 W-CODA: 1st Workshop on Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving | Kai Chen et.al. | 2507.01735 | null |
| 2025-07-01 | GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond | Anna-Maria Halacheva et.al. | 2507.00886 | null |
| 2025-07-01 | BEV-VAE: Multi-view Image Generation with Spatial Consistency for Autonomous Driving | Zeming Chen et.al. | 2507.00707 | null |
| 2025-06-29 | IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | Parker Liu et.al. | 2506.23329 | link |
| 2025-07-01 | SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting | Yiming Huang et.al. | 2506.23309 | null |
| 2025-06-29 | Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation | Zhenhua Ning et.al. | 2506.23120 | null |
| 2025-06-28 | Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding | Xingyilang Yin et.al. | 2506.22817 | null |
| 2025-06-28 | VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding | Minchao Jiang et.al. | 2506.22799 | null |
| 2025-06-26 | CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery | Felix Holm et.al. | 2506.21813 | null |
| 2025-06-24 | FrankenBot: Brain-Morphic Modular Orchestration for Robotic Manipulation with Vision-Language Models | Shiyi Wang et.al. | 2506.21627 | null |
| 2025-06-26 | CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations | Julian Lorenz et.al. | 2506.21357 | null |
| 2025-06-27 | ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | Xiwei Xuan et.al. | 2506.21233 | null |
| 2025-06-25 | IPFormer: Visual 3D Panoptic Scene Completion with Context-Adaptive Instance Proposals | Markus Gross et.al. | 2506.20671 | null |
| 2025-06-25 | Case-based Reasoning Augmented Large Language Model Framework for Decision Making in Realistic Safety-Critical Driving Scenarios | Wenbin Gan et.al. | 2506.20531 | null |
| 2025-06-25 | DreamAnywhere: Object-Centric Panoramic 3D Scene Generation | Edoardo Alberto Dominici et.al. | 2506.20367 | null |
| 2025-06-24 | HOIverse: A Synthetic Scene Graph Dataset With Human Object Interactions | Mrunmai Vivek Phatak et.al. | 2506.19639 | null |
| 2025-06-24 | Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects | Federico Tavella et.al. | 2506.19579 | null |
| 2025-06-24 | Surgery-R1: Advancing Surgical-VQLA with Reasoning Multimodal Large Language Model via Reinforcement Learning | Pengfei Hao et.al. | 2506.19469 | null |
| 2025-06-24 | Segment Any 3D-Part in a Scene from a Sentence | Hongyu Wu et.al. | 2506.19331 | null |
| 2025-06-24 | Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding | Runwei Guan et.al. | 2506.19288 | null |
| 2025-06-24 | Object-aware Sound Source Localization via Audio-Visual Scene Understanding | Sung Jin Um et.al. | 2506.18557 | null |
| 2025-06-23 | DIP: Unsupervised Dense In-Context Post-training of Visual Representations | Sophia Sirko-Galouchenko et.al. | 2506.18463 | link |
| 2025-06-22 | TEM$^3$-Learning: Time-Efficient Multimodal Multi-Task Learning for Advanced Assistive Driving | Wenzhuo Liu et.al. | 2506.18084 | null |
| 2025-06-22 | Feedback Driven Multi Stereo Vision System for Real-Time Event Analysis | Mohamed Benkedadra et.al. | 2506.17910 | null |
| 2025-06-21 | Optimization-Free Patch Attack on Stereo Depth Estimation | Hangcheng Liu et.al. | 2506.17632 | null |
| 2025-06-21 | Scene-R1: Video-Grounded Large Language Models for 3D Scene Reasoning without 3D Annotations | Zhihao Yuan et.al. | 2506.17545 | null |
| 2025-06-17 | Leader360V: The Large-scale, Real-world 360 Video Dataset for Multi-task Learning in Diverse Environment | Weiming Zhang et.al. | 2506.14271 | null |
| 2025-06-17 | Unified Representation Space for 3D Visual Grounding | Yinuo Zheng et.al. | 2506.14238 | null |
| 2025-06-17 | SceneAware: Scene-Constrained Pedestrian Trajectory Prediction with LLM-Guided Walkability | Juho Bai et.al. | 2506.14144 | null |
| 2025-06-17 | Image Segmentation with Large Language Models: A Survey with Perspectives for Intelligent Transportation Systems | Sanjeda Akter et.al. | 2506.14096 | null |
| 2025-06-16 | FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding | Chenlu Zhan et.al. | 2506.13629 | null |
| 2025-06-16 | A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects | Guohuan Xie et.al. | 2506.13552 | null |
| 2025-06-14 | A Spatial Relationship Aware Dataset for Robotics | Peng Wang et.al. | 2506.12525 | link |
| 2025-06-14 | Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding | Youze Wang et.al. | 2506.12336 | null |
| 2025-06-12 | GynSurg: A Comprehensive Gynecology Laparoscopic Surgery Dataset | Sahar Nasirihaghighi et.al. | 2506.11356 | null |
| 2025-06-12 | SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis | Weiliang Chen et.al. | 2506.10981 | null |
| 2025-06-13 | SemanticSplat: Feed-Forward 3D Scene Understanding with Language-Aware Gaussian Fields | Qijing Li et.al. | 2506.09565 | null |
| 2025-06-11 | ODG: Occupancy Prediction Using Dual Gaussians | Yunxiao Shi et.al. | 2506.09417 | null |
| 2025-06-10 | SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting | Mengjiao Ma et.al. | 2506.08710 | link |
| 2025-06-10 | PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly | Liang Ma et.al. | 2506.08708 | null |
| 2025-06-10 | From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge | Agnese Taluzzi et.al. | 2506.08553 | null |
| 2025-06-10 | Robust Visual Localization via Semantic-Guided Multi-Scale Transformer | Zhongtao Tian et.al. | 2506.08526 | null |
| 2025-06-09 | Open World Scene Graph Generation using Vision Language Models | Amartya Dutta et.al. | 2506.08189 | link |
| 2025-06-09 | Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods | Beining Xu et.al. | 2506.07779 | null |
| 2025-06-09 | OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting | Jens Piekenbrinck et.al. | 2506.07697 | null |
| 2025-06-09 | Taking Flight with Dialogue: Enabling Natural Language Control for PX4-based Drone Agent | Shoon Kit Lim et.al. | 2506.07509 | link |
| 2025-06-09 | SpatialLM: Training Large Language Models for Structured Indoor Modeling | Yongsen Mao et.al. | 2506.07491 | link |
| 2025-06-08 | BePo: Leveraging Birds Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction | Yunxiao Shi et.al. | 2506.07002 | null |
| 2025-06-07 | IRS: Instance-Level 3D Scene Graphs via Room Prior Guided LiDAR-Camera Fusion | Hongming Chen et.al. | 2506.06804 | null |
| 2025-06-07 | PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments | Minghao Zou et.al. | 2506.06631 | null |
| 2025-06-06 | Towards Terrain-Aware Task-Driven 3D Scene Graph Generation in Outdoor Environments | Chad R Samuelson et.al. | 2506.06562 | null |
| 2025-06-06 | Enhancing Situational Awareness in Underwater Robotics with Multi-modal Spatial Perception | Pushyami Kaveti et.al. | 2506.06476 | null |
| 2025-06-06 | Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study | Leon Mayer et.al. | 2506.06232 | null |
| 2025-06-06 | STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving | Christian Fruhwirth-Reisinger et.al. | 2506.06218 | null |
| 2025-06-06 | Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness | Steven Landgraf et.al. | 2506.05917 | null |
| 2025-06-06 | HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios | Daming Wang et.al. | 2506.05883 | null |
| 2025-06-06 | Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models | Hugues Thomas et.al. | 2506.05689 | null |
| 2025-06-06 | Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection | Shanmukha Vellamcheti et.al. | 2506.05651 | null |
| 2025-06-05 | SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning | Fanqi Kong et.al. | 2506.05425 | null |
| 2025-06-06 | Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs | Haoyuan Li et.al. | 2506.05318 | null |
| 2025-06-06 | ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation | Daniel Rho et.al. | 2506.05317 | null |
| 2025-06-04 | OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | Junting Chen et.al. | 2506.04217 | link |
| 2025-06-04 | BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation | Jialei Chen et.al. | 2506.03675 | null |
| 2025-06-04 | Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI | Wing Man Casca Kwok et.al. | 2506.03607 | null |
| 2025-06-03 | Trajectory Prediction Meets Large Language Models: A Survey | Yi Xu et.al. | 2506.03408 | link |
| 2025-06-04 | Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments | Di Wen et.al. | 2506.02845 | link |
| 2025-06-03 | PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis | Mijeong Kim et.al. | 2506.02794 | null |
| 2025-06-03 | Large-scale Self-supervised Video Foundation Model for Intelligent Surgery | Shu Yang et.al. | 2506.02692 | null |
| 2025-06-03 | Sight Guide: A Wearable Assistive Perception and Navigation System for the Vision Assistance Race in the Cybathlon 2024 | Patrick Pfreundschuh et.al. | 2506.02676 | null |
| 2025-06-03 | Hierarchical Question-Answering for Driving Scene Understanding Using Vision-Language Models | Safaa Abdullahi Moallim Mohamud et.al. | 2506.02615 | null |
| 2025-06-03 | Sign Language: Towards Sign Understanding for Robot Autonomy | Ayush Agrawal et.al. | 2506.02556 | null |
| 2025-06-02 | MLLMs Need 3D-Aware Representation Supervision for Scene Understanding | Xiaohu Huang et.al. | 2506.01946 | null |
| 2025-06-02 | SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes | Yuji Wang et.al. | 2506.01558 | null |
| 2025-06-02 | FDSG: Forecasting Dynamic Scene Graphs | Yi Yang et.al. | 2506.01487 | null |
| 2025-06-02 | Learning Sparsity for Effective and Efficient Music Performance Question Answering | Xingjian Diao et.al. | 2506.01319 | null |
| 2025-05-30 | Tackling View-Dependent Semantics in 3D Language Gaussian Splatting | Jiazhong Cen et.al. | 2505.24746 | null |
| 2025-05-30 | Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors | Duo Zheng et.al. | 2505.24625 | link |
| 2025-05-30 | EgoExOR: An Ego-Exo-Centric Operating Room Dataset for Surgical Activity Understanding | Ege Özsoy et.al. | 2505.24287 | null |
| 2025-05-29 | ConversAR: Exploring Embodied LLM-Powered Group Conversations in Augmented Reality for Second Language Learners | Jad Bendarkawi et.al. | 2505.24000 | null |
| 2025-05-29 | A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation | Shuzhou Sun et.al. | 2505.23451 | null |
| 2025-05-29 | SeG-SR: Integrating Semantic Knowledge into Remote Sensing Image Super-Resolution via Vision-Language Model | Bowen Chen et.al. | 2505.23010 | null |
| 2025-05-28 | On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation | Liyao Tang et.al. | 2505.22444 | null |
| 2025-05-28 | LiDAR Based Semantic Perception for Forklifts in Outdoor Environments | Benjamin Serfling et.al. | 2505.22258 | null |
| 2025-05-28 | 3D Question Answering via only 2D Vision-Language Models | Fengyun Wang et.al. | 2505.22143 | null |
| 2025-05-29 | DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation | Tianjun Gu et.al. | 2505.21969 | null |
| 2025-05-28 | Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs | Insu Lee et.al. | 2505.21955 | null |
| 2025-05-27 | A Graph Completion Method that Jointly Predicts Geometry and Topology Enables Effective Molecule Assembly | Rohan V. Koodli et.al. | 2505.21833 | null |
| 2025-05-29 | Compositional Scene Understanding through Inverse Generative Modeling | Yanbo Wang et.al. | 2505.21780 | null |
| 2025-05-30 | Right Side Up? Disentangling Orientation Understanding in MLLMs with Fine-grained Multi-axis Perception Tasks | Keanu Nichols et.al. | 2505.21649 | null |
| 2025-05-27 | Assured Autonomy with Neuro-Symbolic Perception | R. Spencer Hallyburton et.al. | 2505.21322 | null |
| 2025-05-27 | Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning | Lintao Xu et.al. | 2505.21231 | null |
| 2025-05-27 | Uni3D-MoE: Scalable Multimodal 3D Scene Understanding via Mixture of Experts | Yue Zhang et.al. | 2505.21079 | null |
| 2025-05-27 | OccLE: Label-Efficient 3D Semantic Occupancy Prediction | Naiyu Fang et.al. | 2505.20617 | null |
| 2025-05-27 | OmniIndoor3D: Comprehensive Indoor 3D Reconstruction | Xiaobao Wei et.al. | 2505.20610 | null |
| 2025-05-26 | From Data to Modeling: Fully Open-vocabulary Scene Graph Generation | Zuyao Chen et.al. | 2505.20106 | null |
| 2025-05-26 | DepthMatch: Semi-Supervised RGB-D Scene Parsing through Depth-Guided Regularization | Jianxin Huang et.al. | 2505.20041 | null |
| 2025-05-26 | Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement | Afrah Shaahid et.al. | 2505.19895 | null |
| 2025-05-26 | LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study | Dongil Yang et.al. | 2505.19510 | link |
| 2025-05-25 | FHGS: Feature-Homogenized Gaussian Splatting | Q. G. Duan et.al. | 2505.19154 | null |
| 2025-05-25 | Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection | Md. Mithun Hossain et.al. | 2505.19010 | null |
| 2025-05-24 | Self-Supervised and Generalizable Tokenization for CLIP-Based 3D Understanding | Guofeng Mei et.al. | 2505.18819 | null |
| 2025-05-24 | Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps | Sicheng Feng et.al. | 2505.18675 | link |
| 2025-05-23 | SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain | Jiawei Zhou et.al. | 2505.17727 | null |
| 2025-05-23 | From Flight to Insight: Semantic 3D Reconstruction for Aerial Inspection via Gaussian Splatting and Language-Guided Segmentation | Mahmoud Chick Zaouali et.al. | 2505.17402 | null |
| 2025-05-22 | Assessing the generalization performance of SAM for ureteroscopy scene understanding | Martin Villagrana et.al. | 2505.17210 | null |
| 2025-05-22 | CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation | Haihong Hao et.al. | 2505.16663 | link |
| 2025-05-21 | SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval | Nikolaos Chaidos et.al. | 2505.15867 | link |
| 2025-05-21 | HAMF: A Hybrid Attention-Mamba Framework for Joint Scene Context Understanding and Future Motion Representation Learning | Xiaodong Mei et.al. | 2505.15703 | null |
| 2025-05-21 | Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | Kaiyuan Chen et.al. | 2505.15517 | link |
| 2025-05-21 | RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation | Naman Patel et.al. | 2505.15373 | null |
| 2025-05-21 | DC-Scene: Data-Centric Learning for 3D Scene Understanding | Ting Huang et.al. | 2505.15232 | link |
| 2025-05-19 | ORQA: A Benchmark and Foundation Model for Holistic Operating Room Modeling | Ege Özsoy et.al. | 2505.12890 | null |
| 2025-05-19 | AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning | Kai Zhang et.al. | 2505.12782 | null |
| 2025-05-19 | Predicting Reaction Time to Comprehend Scenes with Foveated Scene Understanding Maps | Ziqi Wen et.al. | 2505.12660 | null |
| 2025-05-18 | LLaVA-4D: Embedding SpatioTemporal Prompt into LMMs for 4D Scene Understanding | Hanyu Zhou et.al. | 2505.12253 | null |
| 2025-05-18 | SEPT: Standard-Definition Map Enhanced Scene Perception and Topology Reasoning for Autonomous Driving | Muleilan Pei et.al. | 2505.12246 | null |
| 2025-05-18 | Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind | Qingmei Li et.al. | 2505.12207 | link |
| 2025-05-18 | Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding | Xuefei Sun et.al. | 2505.12194 | null |
| 2025-05-17 | TinyRS-R1: Compact Multimodal Language Model for Remote Sensing | Aybora Koksal et.al. | 2505.12099 | null |
| 2025-05-15 | StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation | Daniel A. P. Oliveira et.al. | 2505.10292 | link |
| 2025-05-15 | APCoTTA: Continual Test-Time Adaptation for Semantic Segmentation of Airborne LiDAR Point Clouds | Yuan Gao et.al. | 2505.09971 | link |
| 2025-05-14 | DRRNet: Macro-Micro Feature Fusion and Dual Reverse Refinement for Camouflaged Object Detection | Jianlin Sun et.al. | 2505.09168 | link |
| 2025-05-14 | Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning | Dayong Liang et.al. | 2505.09118 | null |
| 2025-05-13 | Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving | Zongchuang Zhao et.al. | 2505.08725 | link |
| 2025-05-12 | Deep Learning Advances in Vision-Based Traffic Accident Anticipation: A Comprehensive Review of Methods, Datasets, and Future Directions | Yi Zhang et.al. | 2505.07611 | null |
| 2025-05-11 | Technical Report for ICRA 2025 GOOSE 2D Semantic Segmentation Challenge: Leveraging Color Shift Correction, RoPE-Swin Backbone, and Quantile-based Label Denoising Strategy for Robust Outdoor Scene Understanding | Chih-Chung Hsu et.al. | 2505.06991 | null |
| 2025-05-11 | Boosting Cross-spectral Unsupervised Domain Adaptation for Thermal Semantic Segmentation | Seokjun Kwon et.al. | 2505.06951 | null |
| 2025-05-09 | Camera Control at the Edge with Language Models for Scene Understanding | Alexiy Buynitsky et.al. | 2505.06402 | null |
| 2025-05-09 | Camera-Only Bird’s Eye View Perception: A Neural Approach to LiDAR-Free Environmental Mapping for Autonomous Vehicles | Anupkumar Bochare et.al. | 2505.06113 | null |
| 2025-05-08 | Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization | Sooyoung Park et.al. | 2505.05343 | link |
| 2025-05-08 | PADriver: Towards Personalized Autonomous Driving | Genghua Kou et.al. | 2505.05240 | null |
| 2025-05-08 | Does CLIP perceive art the same way we do? | Andrea Asperti et.al. | 2505.05229 | null |
| 2025-05-07 | GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes | Feng Xiao et.al. | 2505.04659 | link |
| 2025-05-07 | RAFT: Robust Augmentation of FeaTures for Image Segmentation | Edward Humes et.al. | 2505.04529 | null |
| 2025-05-03 | Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models | Gracjan Góral et.al. | 2505.03821 | null |
| 2025-05-06 | MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation | Mingcheng Li et.al. | 2505.02648 | null |
| 2025-05-04 | Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation | Volodymyr Havrylov et.al. | 2505.02075 | link |
| 2025-05-04 | Segment Any RGB-Thermal Model with Language-aided Distillation | Dong Xing et.al. | 2505.01950 | null |
| 2025-05-02 | Embracing Diffraction: A Paradigm Shift in Wireless Sensing and Communication | Anurag Pallaprolu et.al. | 2505.01625 | null |
| 2025-04-30 | V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving | Jannik Lübberstedt et.al. | 2505.00156 | null |
| 2025-04-30 | LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics | Marc Glocker et.al. | 2504.21716 | link |
| 2025-04-30 | ImaginateAR: AI-Assisted In-Situ Authoring in Augmented Reality | Jaewook Lee et.al. | 2504.21360 | null |
| 2025-04-28 | Category-Level and Open-Set Object Pose Estimation for Robotics | Peter Hönig et.al. | 2504.19572 | null |
| 2025-04-28 | Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding | Yan Wang et.al. | 2504.19500 | null |
| 2025-04-27 | Beyond Physical Reach: Comparing Head- and Cane-Mounted Cameras for Last-Mile Navigation by Blind Users | Apurv Varshney et.al. | 2504.19345 | null |
| 2025-04-27 | OpenFusion++: An Open-vocabulary Real-time Scene Understanding System | Xiaofeng Jin et.al. | 2504.19266 | null |
| 2025-04-27 | CARL: Camera-Agnostic Representation Learning for Spectral Image Analysis | Alexander Baumann et.al. | 2504.19223 | null |
| 2025-04-27 | Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving | Mi Zheng et.al. | 2504.19183 | null |
| 2025-04-23 | TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance | Meng Chu et.al. | 2504.16505 | null |
| 2025-04-21 | Multimodal Large Language Models for Enhanced Traffic Safety: A Comprehensive Review and Future Trends | Mohammad Abu Tami et.al. | 2504.16134 | null |
| 2025-04-22 | Vision language models are unreliable at trivial spatial cognition | Sangeet Khemlani et.al. | 2504.16061 | null |
| 2025-04-20 | Relation-R1: Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relational Comprehension | Lin Li et.al. | 2504.14642 | null |
| 2025-04-20 | RoboOcc: Enhancing the Geometric and Semantic Scene Understanding for Robots | Zhang Zhang et.al. | 2504.14604 | null |
| 2025-04-20 | Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Tong Zeng et.al. | 2504.14526 | link |
| 2025-04-20 | Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation | Guoyi Zhang et.al. | 2504.14481 | null |
| 2025-04-18 | HAECcity: Open-Vocabulary Scene Understanding of City-Scale Point Clouds with Superpoint Graph Clustering | Alexander Rusnak et.al. | 2504.13590 | null |
| 2025-04-18 | Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding | Yuchen Rao et.al. | 2504.13580 | link |
| 2025-04-18 | Temporal Propagation of Asymmetric Feature Pyramid for Surgical Scene Segmentation | Cheng Yuan et.al. | 2504.13440 | null |
| 2025-04-17 | Training-Free Hierarchical Scene Understanding for Gaussian Splatting with Superpoint Graphs | Shaohui Dai et.al. | 2504.13153 | link |
| 2025-04-17 | Explainable Scene Understanding with Qualitative Representations and Graph Neural Networks | Nassim Belmecheri et.al. | 2504.12817 | null |
| 2025-04-17 | Robo-SGG: Exploiting Layout-Oriented Normalization and Restitution for Robust Scene Graph Generation | Changsheng Lv et.al. | 2504.12606 | null |
| 2025-04-16 | Generalized Visual Relation Detection with Diffusion Models | Kaifeng Gao et.al. | 2504.12100 | null |
| 2025-04-17 | DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency | Mengshi Qi et.al. | 2504.12080 | link |
| 2025-04-16 | CAGS: Open-Vocabulary 3D Scene Understanding with Context-Aware Gaussian Splatting | Wei Sun et.al. | 2504.11893 | null |
| 2025-04-15 | Single-Input Multi-Output Model Merging: Leveraging Foundation Models for Dense Multi-Task Learning | Juan Garcia Giraldo et.al. | 2504.11268 | null |
| 2025-04-14 | Foundation Models for Remote Sensing: An Analysis of MLLMs for Object Localization | Darryl Hannan et.al. | 2504.10727 | null |
| 2025-04-14 | SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding | Marc Gutiérrez-Pérez et.al. | 2504.10106 | link |
| 2025-04-12 | Text To 3D Object Generation For Scalable Room Assembly | Sonia Laguna et.al. | 2504.09328 | null |
| 2025-04-11 | FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment | Sebastián Barbas Laina et.al. | 2504.08603 | null |
| 2025-04-11 | FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents | Xin Tan et.al. | 2504.08581 | null |
| 2025-04-11 | DSM: Building A Diverse Semantic Map for 3D Visual Grounding | Qinghongbing Xie et.al. | 2504.08307 | null |
| 2025-04-10 | SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos | Joshua Li et.al. | 2504.07867 | null |
| 2025-04-10 | DGOcc: Depth-aware Global Query-based Network for Monocular 3D Occupancy Prediction | Xu Zhao et.al. | 2504.07524 | null |
| 2025-04-09 | RayFronts: Open-Set Semantic Ray Frontiers for Online Scene Understanding and Exploration | Omar Alama et.al. | 2504.06994 | null |
| 2025-04-09 | Audio-visual Event Localization on Portrait Mode Short Videos | Wuyang Liu et.al. | 2504.06884 | null |
| 2025-04-09 | MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking | Chang Nie et.al. | 2504.06863 | null |
| 2025-04-09 | Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding | Pedro Hermosilla et.al. | 2504.06719 | link |
| 2025-04-09 | Domain-Conditioned Scene Graphs for State-Grounded Task Planning | Jonas Herzog et.al. | 2504.06661 | null |
| 2025-04-09 | Attributes-aware Visual Emotion Representation Learning | Rahul Singh Maharjan et.al. | 2504.06578 | null |
| 2025-04-08 | CamContextI2V: Context-aware Controllable Video Generation | Luis Denninger et.al. | 2504.06022 | link |
| 2025-04-08 | AEGIS: Human Attention-based Explainable Guidance for Intelligent Vehicle Systems | Zhuoli Zhuang et.al. | 2504.05950 | null |
| 2025-04-08 | PRIMEDrive-CoT: A Precognitive Chain-of-Thought Framework for Uncertainty-Aware Object Interaction in Driving Scene Scenario | Sriram Mandalika et.al. | 2504.05908 | null |
| 2025-04-08 | InvNeRF-Seg: Fine-Tuning a Pre-Trained NeRF for 3D Object Segmentation | Jiangsan Zhao et.al. | 2504.05751 | null |
| 2025-04-07 | RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model | Congcong Wen et.al. | 2504.04988 | null |
| 2025-04-07 | Feedback-Enhanced Hallucination-Resistant Vision-Language Model for Real-Time Scene Understanding | Zahir Alsulaimawi et.al. | 2504.04772 | null |
| 2025-04-07 | DFormerv2: Geometry Self-Attention for RGBD Semantic Segmentation | Bo-Wen Yin et.al. | 2504.04701 | link |
| 2025-04-06 | Planning Safety Trajectories with Dual-Phase, Physics-Informed, and Transportation Knowledge-Driven Large Language Models | Rui Gan et.al. | 2504.04562 | null |
| 2025-04-04 | 3D Scene Understanding Through Local Random Access Sequence Modeling | Wanhee Lee et.al. | 2504.03875 | link |
| 2025-04-07 | NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | Kexin Tian et.al. | 2504.03164 | null |
| 2025-04-03 | F-ViTA: Foundation Model Guided Visible to Thermal Translation | Jay N. Paranjape et.al. | 2504.02801 | link |
| 2025-04-03 | Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision | Xiaofeng Han et.al. | 2504.02477 | link |
| 2025-04-02 | Scene-Centric Unsupervised Panoptic Segmentation | Oliver Hahn et.al. | 2504.01955 | link |
| 2025-04-02 | Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness | Haochen Wang et.al. | 2504.01901 | null |
| 2025-04-02 | CoMatcher: Multi-View Collaborative Feature Matching | Jintao Zhang et.al. | 2504.01872 | null |
| 2025-04-02 | TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication | Petr Vanc et.al. | 2504.01708 | null |
| 2025-04-02 | Overlap-Aware Feature Learning for Robust Unsupervised Domain Adaptation for 3D Semantic Segmentation | Junjie Chen et.al. | 2504.01668 | null |
| 2025-04-01 | WikiVideo: Article Generation from Multiple Videos | Alexander Martin et.al. | 2504.00939 | link |
| 2025-04-01 | Zero-Shot 4D Lidar Panoptic Segmentation | Yushan Zhang et.al. | 2504.00848 | null |
| 2025-04-01 | PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks | Abdelrahman Elskhawy et.al. | 2504.00844 | null |
| 2025-04-01 | Context-Aware Human Behavior Prediction Using Multimodal Large Language Models: Challenges and Insights | Yuchen Liu et.al. | 2504.00839 | null |
| 2025-03-30 | PhysPose: Refining 6D Object Poses with Physical Constraints | Martin Malenický et.al. | 2503.23587 | null |
| 2025-03-30 | Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | Jannik Endres et.al. | 2503.23502 | link |
| 2025-03-29 | Can DeepSeek-V3 Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery | Boyi Ma et.al. | 2503.23130 | null |
| 2025-03-29 | Evaluating Compositional Scene Understanding in Multimodal Generative Models | Shuhao Fu et.al. | 2503.23125 | link |
| 2025-03-29 | Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments | Yifan Xu et.al. | 2503.23105 | null |
| 2025-03-29 | Empowering Large Language Models with 3D Situation Awareness | Zhihao Yuan et.al. | 2503.23024 | null |
| 2025-03-28 | Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users | Antonia Karamolegkou et.al. | 2503.22610 | null |
| 2025-03-28 | Next-Best-Trajectory Planning of Robot Manipulators for Effective Observation and Exploration | Heiko Renz et.al. | 2503.22588 | null |
| 2025-03-28 | NuGrounding: A Multi-View 3D Visual Grounding Framework in Autonomous Driving | Fuhao Li et.al. | 2503.22436 | null |
| 2025-03-28 | Endo-TTAP: Robust Endoscopic Tissue Tracking via Multi-Facet Guided Attention and Hybrid Flow-point Supervision | Rulin Zhou et.al. | 2503.22394 | null |
| 2025-03-28 | A Dataset for Semantic Segmentation in the Presence of Unknowns | Zakaria Laskar et.al. | 2503.22309 | null |
| 2025-03-28 | Mitigating Trade-off: Stream and Query-guided Aggregation for Efficient and Effective 3D Occupancy Prediction | Seokha Moon et.al. | 2503.22087 | null |
| 2025-03-27 | Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting | Anand Bhattad et.al. | 2503.21770 | null |
| 2025-03-27 | uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images | Jonathan Lee et.al. | 2503.21562 | link |
| 2025-03-27 | Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving | Lucas Nunes et.al. | 2503.21449 | link |
| 2025-03-26 | DINeMo: Learning Neural Mesh Models with no 3D Annotations | Weijie Guo et.al. | 2503.20220 | null |
| 2025-03-25 | The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Jonathan Sauder et.al. | 2503.20000 | null |
| 2025-03-25 | SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining | Xiang Xu et.al. | 2503.19912 | link |
| 2025-03-25 | OpenLex3D: A New Evaluation Benchmark for Open-Vocabulary 3D Scene Representations | Christina Kassab et.al. | 2503.19764 | null |
| 2025-03-26 | COB-GS: Clear Object Boundaries in 3DGS Segmentation Based on Boundary-Adaptive Gaussian Splitting | Jiaxin Zhang et.al. | 2503.19443 | link |
| 2025-03-25 | Divide-and-Conquer: Dual-Hierarchical Optimization for Semantic 4D Gaussian Splatting | Zhiying Yan et.al. | 2503.19332 | null |
| 2025-03-25 | BIMII-Net: Brain-Inspired Multi-Iterative Interactive Network for RGB-T Road Scene Semantic Segmentation | Hanshuo Qiu et.al. | 2503.19303 | null |
| 2025-03-24 | Efficient and Accurate Scene Text Recognition with Cascaded-Transformers | Savas Ozkan et.al. | 2503.18883 | null |
| 2025-03-24 | Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition | Yifei Zhang et.al. | 2503.18746 | null |
| 2025-03-24 | Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving | Hongkuan Zhou et.al. | 2503.18730 | null |
| 2025-03-23 | MLLM-For3D: Adapting Multimodal Large Language Model for 3D Reasoning Segmentation | Jiaxin Huang et.al. | 2503.18135 | null |
| 2025-03-23 | PanoGS: Gaussian-based Panoptic Segmentation for 3D Open Vocabulary Scene Understanding | Hongjia Zhai et.al. | 2503.18107 | null |
| 2025-03-23 | PanopticSplatting: End-to-End Panoptic Gaussian Splatting | Yuxuan Xie et.al. | 2503.18073 | null |
| 2025-03-23 | PolarFree: Polarization-based Reflection-free Imaging | Mingde Yao et.al. | 2503.18055 | null |
| 2025-03-23 | SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining | Yue Li et.al. | 2503.18052 | null |
| 2025-03-23 | Geometric Constrained Non-Line-of-Sight Imaging | Xueying Liu et.al. | 2503.17992 | null |
| 2025-03-22 | A Causal Adjustment Module for Debiasing Scene Graph Generation | Li Liu et.al. | 2503.17862 | null |
| 2025-03-21 | Neuro-Symbolic Scene Graph Conditioning for Synthetic Image Dataset Generation | Giacomo Savazzi et.al. | 2503.17224 | null |
| 2025-03-21 | ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail | Chandan Yeshwanth et.al. | 2503.17044 | null |
| 2025-03-21 | Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision | Maoji Zheng et.al. | 2503.16811 | null |
| 2025-03-21 | OpenCity3D: What do Vision-Language Models know about Urban Environments? | Valentin Bieri et.al. | 2503.16776 | null |
| 2025-03-20 | Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding | Jinlong Li et.al. | 2503.16707 | null |
| 2025-03-20 | ContactFusion: Stochastic Poisson Surface Maps from Visual and Contact Sensing | Aditya Kamireddypalli et.al. | 2503.16592 | null |
| 2025-03-20 | From Monocular Vision to Autonomous Action: Guiding Tumor Resection via 3D Reconstruction | Ayberk Acar et.al. | 2503.16263 | null |
| 2025-03-20 | Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation | Andrea Maracani et.al. | 2503.16184 | null |
| 2025-03-20 | What can Off-the-Shelves Large Multi-Modal Models do for Dynamic Scene Graph Generation? | Xuanming Cui et.al. | 2503.15846 | null |
| 2025-03-19 | A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition | Ritabrata Chakraborty et.al. | 2503.15639 | null |
| 2025-03-19 | Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene | Shengqiong Wu et.al. | 2503.15019 | null |
| 2025-03-19 | Universal Scene Graph Generation | Shengqiong Wu et.al. | 2503.15005 | null |
| 2025-03-19 | SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments | Yinqi Chen et.al. | 2503.14837 | null |
| 2025-03-20 | These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models | Parker Ewen et.al. | 2503.14665 | null |
| 2025-03-17 | Learning-based 3D Reconstruction in Autonomous Driving: A Comprehensive Survey | Liewen Liao et.al. | 2503.14537 | null |
| 2025-03-18 | DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation | Mu Chen et.al. | 2503.13957 | link |
| 2025-03-18 | Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation | Sayak Nag et.al. | 2503.13947 | null |
| 2025-03-18 | ChatBEV: A Visual Language Model that Understands BEV Maps | Qingyao Xu et.al. | 2503.13938 | null |
| 2025-03-18 | PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds | Barza Nisar et.al. | 2503.13914 | null |
| 2025-03-17 | Clustering is back: Reaching state-of-the-art LiDAR instance segmentation without training | Corentin Sautier et.al. | 2503.13203 | null |
| 2025-03-17 | Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation | Henghui Du et.al. | 2503.13068 | null |
| 2025-03-17 | InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving | Ruiqi Song et.al. | 2503.13047 | null |
| 2025-03-17 | HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding | Jiahe Zhao et.al. | 2503.12955 | null |
| 2025-03-17 | NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models | Sung-Yeon Park et.al. | 2503.12772 | null |
| 2025-03-16 | Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding | Imran Kabir et.al. | 2503.12663 | null |
| 2025-03-16 | Car-1000: A New Large Scale Fine-Grained Visual Categorization Dataset | Yutao Hu et.al. | 2503.12385 | null |
| 2025-03-15 | TACO: Taming Diffusion for in-the-wild Video Amodal Completion | Ruijie Lu et.al. | 2503.12049 | null |
| 2025-03-14 | Human-in-the-Loop Local Corrections of 3D Scene Layouts via Infilling | Christopher Xie et.al. | 2503.11806 | null |
| 2025-03-14 | EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting | Di Li et.al. | 2503.11345 | null |
| 2025-03-14 | Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset | Yibing Weng et.al. | 2503.11342 | null |
| 2025-03-13 | Graph-Grounded LLMs: Leveraging Graphical Function Calling to Minimize LLM Hallucinations | Piyush Gupta et.al. | 2503.10941 | null |
| 2025-03-11 | MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation | Anzhe Cheng et.al. | 2503.10686 | null |
| 2025-03-13 | TARS: Traffic-Aware Radar Scene Flow Estimation | Jialong Wu et.al. | 2503.10210 | null |
| 2025-03-13 | TGP: Two-modal occupancy prediction with 3D Gaussian and sparse points for 3D Environment Awareness | Mu Chen et.al. | 2503.09941 | null |
| 2025-03-12 | Object-Aware DINO (Oh-A-Dino): Enhancing Self-Supervised Representations for Multi-Object Instance Retrieval | Stefan Sylvius Wagner et.al. | 2503.09867 | null |
| 2025-03-11 | Language-Depth Navigated Thermal and Visible Image Fusion | Jinchang Zhang et.al. | 2503.08676 | null |
| 2025-03-11 | Generating Robot Constitutions & Benchmarks for Semantic Safety | Pierre Sermanet et.al. | 2503.08663 | null |
| 2025-03-11 | Collaborative Dynamic 3D Scene Graphs for Open-Vocabulary Urban Scene Understanding | Tim Steinke et.al. | 2503.08474 | null |
| 2025-03-11 | TrackOcc: Camera-based 4D Panoptic Occupancy Tracking | Zhuoguang Chen et.al. | 2503.08471 | null |
| 2025-03-11 | Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking | Xucheng Guo et.al. | 2503.08370 | null |
| 2025-03-11 | DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos | Lorenzo Mur-Labadia et.al. | 2503.08344 | null |
| 2025-03-11 | Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving | Runwei Guan et.al. | 2503.08336 | null |
| 2025-03-11 | General-Purpose Aerial Intelligent Agents Empowered by Large Language Models | Ji Zhao et.al. | 2503.08302 | null |
| 2025-03-10 | FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction | Dennis Rotondi et.al. | 2503.07909 | null |
| 2025-03-10 | Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction | Zongzheng Zhang et.al. | 2503.07485 | null |
| 2025-03-10 | CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting | Haicheng Liao et.al. | 2503.07234 | null |
| 2025-03-10 | A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning | Xin Wen et.al. | 2503.06960 | null |
| 2025-03-10 | LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs | Hanyu Zhou et.al. | 2503.06934 | null |
| 2025-03-08 | SplatTalk: 3D VQA with Gaussian Splatting | Anh Thai et.al. | 2503.06271 | null |
| 2025-03-08 | Segment Anything, Even Occluded | Wei-En Tai et.al. | 2503.06261 | null |
| 2025-03-08 | VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion | Meng Wang et.al. | 2503.06219 | null |
| 2025-03-08 | Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images | YingLiang Ma et.al. | 2503.06190 | null |
| 2025-03-08 | Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction | Kai Li et.al. | 2503.06161 | null |
| 2025-03-08 | Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity | Xiaohao Xu et.al. | 2503.06014 | null |
| 2025-03-07 | HexPlane Representation for 3D Semantic Scene Understanding | Zeren Chen et.al. | 2503.05127 | null |
| 2025-03-06 | Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning | Victor Sebastian Martinez Pozos et.al. | 2503.04900 | null |
| 2025-03-06 | EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images | Rohit Menon et.al. | 2503.04441 | null |
| 2025-03-06 | An Egocentric Vision-Language Model based Portable Real-time Smart Assistant | Yifei Huang et.al. | 2503.04250 | null |
| 2025-03-06 | H3O: Hyper-Efficient 3D Occupancy Prediction with Heterogeneous Supervision | Yunxiao Shi et.al. | 2503.04059 | null |
| 2025-03-06 | GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding | Xihan Wang et.al. | 2503.04034 | null |
| 2025-03-05 | SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection | Devanish N. Kamtam et.al. | 2503.03942 | null |
| 2025-03-05 | Vision-Language Models Struggle to Align Entities across Modalities | Iñigo Alonso et.al. | 2503.03854 | null |
| 2025-03-05 | Improving 6D Object Pose Estimation of metallic Household and Industry Objects | Thomas Pöllabauer et.al. | 2503.03655 | null |
| 2025-03-04 | MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments | Ege Özsoy et.al. | 2503.02579 | link |
| 2025-03-04 | Label-Efficient LiDAR Panoptic Segmentation | Ahmet Selim Çanakçı et.al. | 2503.02372 | null |
| 2025-03-04 | SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images | Gargi Panda et.al. | 2503.02270 | null |
| 2025-03-03 | vS-Graphs: Integrating Visual SLAM and Situational Graphs through Multi-level Scene Understanding | Ali Tourani et.al. | 2503.01783 | link |
| 2025-03-03 | OpenGS-SLAM: Open-Set Dense Semantic SLAM with 3D Gaussian Splatting for Object-Level Scene Understanding | Dianyi Yang et.al. | 2503.01646 | null |
| 2025-03-03 | Every SAM Drop Counts: Embracing Semantic Priors for Multi-Modality Image Fusion and Beyond | Guanyao Wu et.al. | 2503.01210 | link |
| 2025-03-03 | Semi-Supervised 360 Layout Estimation with Panoramic Collaborative Perturbations | Junsong Zhang et.al. | 2503.01114 | null |
| 2025-03-01 | Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing | Yanjun Li et.al. | 2503.00548 | null |
| 2025-03-01 | Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning | Hanxun Yu et.al. | 2503.00513 | link |
| 2025-03-04 | Floorplan-SLAM: A Real-Time, High-Accuracy, and Long-Term Multi-Session Point-Plane SLAM for Efficient Floorplan Reconstruction | Haolin Wang et.al. | 2503.00397 | null |
| 2025-02-28 | Vibrotactile information coding strategies for a body-worn vest to aid robot-human collaboration | Adrian Vecina Tercero et.al. | 2502.21056 | null |
| 2025-02-27 | Towards Statistical Factuality Guarantee for Large Vision-Language Models | Zhuohang Li et.al. | 2502.20560 | null |
| 2025-02-26 | Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator | Xiankang He et.al. | 2502.19204 | link |
| 2025-02-25 | VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion | Pei Liu et.al. | 2502.18042 | null |
| 2025-02-24 | AAD-LLM: Neural Attention-Driven Auditory Scene Understanding | Xilin Jiang et.al. | 2502.16794 | link |
| 2025-02-28 | Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model | Yaxuan Huang et.al. | 2502.16779 | link |
| 2025-02-23 | Dr. Splat: Directly Referring 3D Gaussian Splatting via Direct Language Embedding Registration | Kim Jun-Seong et.al. | 2502.16652 | null |
| 2025-02-21 | Weakly Supervised Video Scene Graph Generation via Natural Language Supervision | Kibum Kim et.al. | 2502.15370 | link |
| 2025-02-21 | DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation | Luzhou Ge et.al. | 2502.15309 | link |
| 2025-02-21 | Hierarchical Context Transformer for Multi-level Semantic Scene Understanding | Luoying Hao et.al. | 2502.15184 | link |
| 2025-02-20 | CrossOver: 3D Scene Cross-Modal Alignment | Sayan Deb Sarkar et.al. | 2502.15011 | link |
| 2025-02-20 | Hier-SLAM++: Neuro-Symbolic Semantic SLAM with a Hierarchically Categorical Gaussian Splatting | Boying Li et.al. | 2502.14931 | null |
| 2025-02-19 | Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning | Rui Zhao et.al. | 2502.14917 | null |
| 2025-02-16 | Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review | Ufaq Khan et.al. | 2502.14886 | null |
| 2025-02-21 | AVD2: Accident Video Diffusion for Accident Video Description | Cheng Li et.al. | 2502.14801 | null |
| 2025-02-18 | Spiking Vision Transformer with Saccadic Attention | Shuai Wang et.al. | 2502.12677 | null |
| 2025-02-16 | NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM | Zihan Wang et.al. | 2502.11142 | link |
| 2025-02-15 | Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy | Mingyang Zhao et.al. | 2502.10704 | link |
| 2025-02-14 | Leveraging V2X for Collaborative HD Maps Construction Using Scene Graph Generation | Gamal Elghazaly et.al. | 2502.10127 | null |
| 2025-02-13 | FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation | Bin Yang et.al. | 2502.09274 | null |
| 2025-02-13 | Billet Number Recognition Based on Test-Time Adaptation | Yuan Wei et.al. | 2502.09026 | null |
| 2025-02-13 | EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition | Xiao Wang et.al. | 2502.09020 | link |
| 2025-02-13 | 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning | Guoqin Tang et.al. | 2502.08903 | null |
| 2025-02-10 | Fully Exploiting Vision Foundation Model’s Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing | Sicen Guo et.al. | 2502.06219 | null |
| 2025-02-08 | Content-based Video Retrieval in Traffic Videos using Latent Dirichlet Allocation Topic Model | Mohammad Kianpisheh et.al. | 2502.05457 | null |
| 2025-02-06 | sshELF: Single-Shot Hierarchical Extrapolation of Latent Features for 3D Reconstruction from Sparse-Views | Eyvaz Najafli et.al. | 2502.04318 | null |
| 2025-02-06 | Taking A Closer Look at Interacting Objects: Interaction-Aware Open Vocabulary Scene Graph Generation | Lin Li et.al. | 2502.03856 | null |
| 2025-02-05 | EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality | Junlong Chen et.al. | 2502.03564 | null |
| 2025-02-04 | Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation | Junha Lee et.al. | 2502.02548 | null |
| 2025-02-04 | Event-aided Semantic Scene Completion | Shangwei Guo et.al. | 2502.02334 | link |
| 2025-02-03 | AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis | Basit Alawode et.al. | 2502.01785 | null |
| 2025-01-30 | Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation | Yuelei Li et.al. | 2501.18733 | null |
| 2025-01-30 | Efficient Interactive 3D Multi-Object Removal | Jingcheng Ni et.al. | 2501.17636 | null |
| 2025-02-04 | Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding | Akash Kumar et.al. | 2501.17053 | null |
| 2025-01-29 | PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding | Wei Chow et.al. | 2501.16411 | link |
| 2025-01-26 | Ocean-OCR: Towards General OCR Application via a Vision-Language Model | Song Chen et.al. | 2501.15558 | link |
| 2025-01-26 | Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics | Ali Tourani et.al. | 2501.15505 | link |
| 2025-01-24 | HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation | Xin Zhou et.al. | 2501.14729 | link |
| 2025-01-24 | Scene Understanding Enabled Semantic Communication with Open Channel Coding | Zhe Xiang et.al. | 2501.14520 | null |
| 2025-01-23 | GeomGS: LiDAR-Guided Geometry-Aware Gaussian Splatting for Robot Localization | Jaewon Lee et.al. | 2501.13417 | null |
| 2025-01-22 | Neural Radiance Fields for the Real World: A Survey | Wenhui Xiao et.al. | 2501.13104 | null |
| 2025-01-22 | PSGSL: A Probabilistic Framework Integrating Semantic Scene Understanding and Gas Sensing for Gas Source Localization | Pepe Ojeda et.al. | 2501.12812 | null |
| 2025-01-20 | Dynamic Scene Understanding from Vision-Language Representations | Shahaf Pruss et.al. | 2501.11653 | null |
| 2025-01-20 | EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery | Guankun Wang et.al. | 2501.11347 | link |
| 2025-01-20 | A Survey of World Models for Autonomous Driving | Tuo Feng et.al. | 2501.11260 | null |
| 2025-01-17 | A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features | Enes Karanfil et.al. | 2501.10144 | null |
| 2025-01-16 | CrossModalityDiffusion: Multi-Modal Novel View Synthesis with Unified Intermediate Representation | Alex Berian et.al. | 2501.09838 | link |
| 2025-01-16 | YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks | Saptarashmi Bandyopadhyay et.al. | 2501.09355 | null |
| 2025-01-15 | Embodied Scene Understanding for Vision Language Models via MetaVQA | Weizhen Wang et.al. | 2501.09167 | null |
| 2025-01-15 | GOTLoc: General Outdoor Text-based Localization Using Scene Graph Retrieval with OpenStreetMap | Donghwi Jung et.al. | 2501.08575 | link |
| 2025-01-14 | 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Haomiao Xiong et.al. | 2501.07819 | link |
| 2025-01-13 | Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models | Yasiru Ranasinghe et.al. | 2501.07396 | null |
| 2025-01-13 | Hierarchical Superpixel Segmentation via Structural Information Theory | Minhui Xie et.al. | 2501.07069 | link |
| 2025-01-12 | Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving | Haoxiang Gao et.al. | 2501.06680 | null |
| 2025-01-08 | NextStop: An Improved Tracker For Panoptic LIDAR Segmentation Data | Nirit Alkalay et.al. | 2501.06235 | null |
| 2025-01-10 | Self-Supervised Partial Cycle-Consistency for Multi-View Matching | Fedor Taggenbrock et.al. | 2501.06000 | link |
| 2025-01-10 | UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation | Xinyao Liao et.al. | 2501.05687 | null |
| 2025-01-09 | Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding | Mohammed Elhenawy et.al. | 2501.05566 | null |
| 2025-01-09 | A Systematic Literature Review on Deep Learning-based Depth Estimation in Computer Vision | Ali Rohan et.al. | 2501.05147 | null |
| 2025-01-08 | TADFormer: Task-Adaptive Dynamic Transformer for Efficient Multi-Task Learning | Seungmin Baek et.al. | 2501.04293 | null |
| 2025-01-07 | A Bayesian Modeling Framework for Estimation and Ground Segmentation of Cluttered Staircases | Prasanna Sriganesh et.al. | 2501.04170 | null |
| 2025-01-07 | LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving | Lingdong Kong et.al. | 2501.04005 | null |
| 2025-01-07 | CL3DOR: Contrastive Learning for 3D Large Multimodal Models via Odds Ratio on High-Resolution Point Clouds | Keonwoo Kim et.al. | 2501.03879 | null |
| 2025-01-07 | Advancing the Understanding of Fine-Grained 3D Forest Structures using Digital Cousins and Simulation-to-Reality: Methods and Datasets | Jing Liu et.al. | 2501.03637 | null |
| 2025-01-03 | VideoLifter: Lifting Videos to 3D with Fast Hierarchical Stereo Alignment | Wenyan Cong et.al. | 2501.01949 | null |
| 2025-01-03 | IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks | Aecheon Jung et.al. | 2501.01685 | link |
| 2025-01-09 | GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models | Zhangyang Qi et.al. | 2501.01428 | null |
| 2025-01-02 | 3D-LLaVA: Towards Generalist 3D LMMs with Omni Superpoint Transformer | Jiajun Deng et.al. | 2501.01163 | null |
| 2025-01-02 | Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction | Xuan Yu et.al. | 2501.01119 | null |
| 2024-12-31 | STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes | Jiawei Yang et.al. | 2501.00602 | null |
| 2024-12-31 | Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding | Yue Fan et.al. | 2501.00358 | null |
| 2024-12-31 | OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies | Runnan Chen et.al. | 2501.00326 | link |
| 2024-12-30 | Text-to-Image GAN with Pretrained Representations | Xiaozhou You et.al. | 2501.00116 | null |
| 2024-12-30 | 4D Gaussian Splatting: Modeling Dynamic Scenes with Native 4D Primitives | Zeyu Yang et.al. | 2412.20720 | null |
| 2024-12-27 | An Actionable Hierarchical Scene Representation Enhancing Autonomous Inspection Missions in Unknown Environments | Vignesh Kottayam Viswanathan et.al. | 2412.19582 | null |
| 2024-12-27 | xFLIE: Leveraging Actionable Hierarchical Scene Representations for Autonomous Semantic-Aware Inspection Missions | Vignesh Kottayam Viswanathan et.al. | 2412.19571 | link |
| 2024-12-27 | MLLM-SUL: Multimodal Large Language Model for Semantic Scene Understanding and Localization in Traffic Scenarios | Jiaqi Fan et.al. | 2412.19406 | null |
| 2024-12-26 | Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation | Tao Liu et.al. | 2412.19021 | null |
| 2024-12-25 | 3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding | Tatiana Zemskova et.al. | 2412.18450 | link |
| 2024-12-24 | MR-COGraphs: Communication-efficient Multi-Robot Open-vocabulary Mapping System via 3D Scene Graphs | Qiuyi Gu et.al. | 2412.18381 | null |
| 2024-12-24 | Parallel Neural Computing for Scene Understanding from LiDAR Perception in Autonomous Racing | Suwesh Prasad Sah et.al. | 2412.18165 | link |
| 2024-12-24 | UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision | Yuru Wang et.al. | 2412.18131 | null |
| 2024-12-24 | LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding | Hao Li et.al. | 2412.17635 | null |
| 2024-12-21 | Application of Multimodal Large Language Models in Autonomous Driving | Md Robiul Islam et.al. | 2412.16410 | null |
| 2024-12-20 | Improving Object Detection for Time-Lapse Imagery Using Temporal Features in Wildlife Monitoring | Marcus Jenkins et.al. | 2412.16329 | link |
| 2024-12-19 | AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving | Shuo Xing et.al. | 2412.15206 | link |
| 2024-12-19 | ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects | Qihang Cao et.al. | 2412.14837 | null |
| 2024-12-19 | PC-BEV: An Efficient Polar-Cartesian BEV Fusion Framework for LiDAR Semantic Segmentation | Shoumeng Qiu et.al. | 2412.14821 | link |
| 2024-12-18 | GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting | Yuning Peng et.al. | 2412.13654 | link |
| 2024-12-18 | RelationField: Relate Anything in Radiance Fields | Sebastian Koch et.al. | 2412.13652 | null |
| 2024-12-18 | Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset | Sithu Aung et.al. | 2412.13569 | null |
| 2024-12-17 | RA-SGG: Retrieval-Augmented Scene Graph Generation Framework via Multi-Prototype Learning | Kanghoon Yoon et.al. | 2412.12788 | link |
| 2024-12-18 | Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration | Ziheng Zhou et.al. | 2412.12628 | null |
| 2024-12-17 | Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | Qi Sun et.al. | 2412.11974 | link |
| 2024-12-16 | DINO-Foresight: Looking into the Future with DINO | Efstathios Karypidis et.al. | 2412.11673 | link |
| 2024-12-16 | An Enhanced Classification Method Based on Adaptive Multi-Scale Fusion for Long-tailed Multispectral Point Clouds | TianZhu Liu et.al. | 2412.11407 | null |
| 2024-12-15 | SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation | Hang Zhang et.al. | 2412.11026 | null |
| 2024-12-13 | SuperGSeg: Open-Vocabulary 3D Segmentation with Structured Super-Gaussians | Siyun Liang et.al. | 2412.10231 | null |
| 2024-12-13 | Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance | Jiahao Lyu et.al. | 2412.10159 | null |
| 2024-12-17 | WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model | Songyan Zhang et.al. | 2412.09951 | link |
| 2024-12-12 | LIVE-GS: LLM Powers Interactive VR by Enhancing Gaussian Splatting | Haotian Mao et.al. | 2412.09176 | null |
| 2024-12-11 | SLGaussian: Fast Language Gaussian Splatting in Sparse Views | Kangjie Chen et.al. | 2412.08331 | null |
| 2024-12-11 | TGOSPA Metric Parameters Selection and Evaluation for Visual Multi-object Tracking | Jan Krejčí et.al. | 2412.08321 | null |
| 2024-12-11 | THUD++: Large-Scale Dynamic Indoor Scene Dataset and Benchmark for Mobile Robots | Zeshun Li et.al. | 2412.08096 | null |
| 2024-12-11 | MAGIC: Mastering Physical Adversarial Generation in Context through Collaborative LLM Agents | Yun Xing et.al. | 2412.08014 | null |
| 2024-12-10 | Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation | Thong Thanh Nguyen et.al. | 2412.07160 | null |
| 2024-12-11 | ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models | Jieyu Zhang et.al. | 2412.07012 | link |
| 2024-12-07 | Timely reliable Bayesian decision-making enabled using memristors | Lekai Song et.al. | 2412.06838 | null |
| 2024-12-09 | Visual Lexicon: Rich Image Features in Language Space | XuDong Wang et.al. | 2412.06774 | null |
| 2024-12-09 | LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | Mingjie Xu et.al. | 2412.06322 | link |
| 2024-12-09 | Event fields: Capturing light fields at high speed, resolution, and dynamic range | Ziyuan Qu et.al. | 2412.06191 | null |
| 2024-12-07 | TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances | Wenting Xu et.al. | 2412.05596 | null |
| 2024-12-06 | Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model | Lening Wang et.al. | 2412.05280 | link |
| 2024-12-06 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding | Yuqi Wu et.al. | 2412.04380 | link |
| 2024-12-04 | Designing DNNs for a trade-off between robustness and processing performance in embedded devices | Jon Gutiérrez-Zaballa et.al. | 2412.03682 | null |
| 2024-12-04 | Assessing the performance of CT image denoisers using Laguerre-Gauss Channelized Hotelling Observer for lesion detection | Prabhat Kc et.al. | 2412.02920 | null |
| 2024-12-03 | BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding | Chenguang Huang et.al. | 2412.02449 | null |
| 2024-12-04 | SparseLGS: Sparse View Language Embedded Gaussian Splatting | Jun Hu et.al. | 2412.02245 | null |
| 2024-12-02 | Occam’s LGS: A Simple Approach for Language Gaussian Splatting | Jiahuan Cheng et.al. | 2412.01807 | null |
| 2024-12-02 | Holistic Understanding of 3D Scenes as Universal Scene Description | Anna-Maria Halacheva et.al. | 2412.01398 | null |
| 2024-12-02 | LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences | Hongyan Zhi et.al. | 2412.01292 | null |
| 2024-12-02 | A Semantic Communication System for Real-time 3D Reconstruction Tasks | Jiaxing Zhang et.al. | 2412.01191 | null |
| 2024-12-02 | TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition | Xingsong Ye et.al. | 2412.01137 | link |
| 2024-12-01 | ChatSplat: 3D Conversational Gaussian Splatting | Hanlin Chen et.al. | 2412.00734 | null |
| 2024-11-30 | Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding | Duo Zheng et.al. | 2412.00493 | null |
| 2024-11-29 | SIMS: Simulating Human-Scene Interactions with Real World Script Planning | Wenjia Wang et.al. | 2411.19921 | null |
| 2024-11-29 | Quantifying the synthetic and real domain gap in aerial scene understanding | Alina Marcu et.al. | 2411.19913 | null |
| 2024-11-29 | Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding | Wenbo Zhang et.al. | 2411.19551 | null |
| 2024-11-28 | GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | Muhammad Sohail Danish et.al. | 2411.19325 | link |
| 2024-11-28 | On-chip Hyperspectral Image Segmentation with Fully Convolutional Networks for Scene Understanding in Autonomous Driving | Jon Gutiérrez-Zaballa et.al. | 2411.19274 | null |
| 2024-11-28 | InstanceGaussian: Appearance-Semantic Joint Gaussian Representation for 3D Instance-Level Perception | Haijie Li et.al. | 2411.19235 | null |
| 2024-11-27 | Reconstructing Animals and the Wild | Peter Kulits et.al. | 2411.18807 | null |
| 2024-11-27 | Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents | Joongwon Chae et.al. | 2411.18270 | null |
| 2024-11-27 | HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation | Trong-Thuan Nguyen et.al. | 2411.18042 | null |
| 2024-11-26 | Box for Mask and Mask for Box: weak losses for multi-task partially supervised learning | Hoàng-Ân Lê et.al. | 2411.17536 | link |
| 2024-11-26 | HSI-Drive v2.0: More Data for New Challenges in Scene Understanding for Autonomous Driving | Jon Gutiérrez-Zaballa et.al. | 2411.17530 | null |
| 2024-11-25 | RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics | Chan Hee Song et.al. | 2411.16537 | null |
| 2024-11-27 | An End-to-End Robust Point Cloud Semantic Segmentation Network with Single-Step Conditional Diffusion Models | Wentao Qu et.al. | 2411.16308 | link |
| 2024-11-25 | Open-Vocabulary Octree-Graph for 3D Scene Understanding | Zhigang Wang et.al. | 2411.16253 | null |
| 2024-11-24 | SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition | Yongkun Du et.al. | 2411.15858 | link |
| 2024-11-24 | ROOT: VLM based System for Indoor Scene Understanding and Beyond | Yonghui Wang et.al. | 2411.15714 | link |
| 2024-11-23 | Comparative Analysis of Resource-Efficient CNN Architectures for Brain Tumor Classification | Md Ashik Khan et.al. | 2411.15596 | null |
| 2024-11-23 | Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing | Yadong Qu et.al. | 2411.15585 | null |
| 2024-11-22 | UniGaussian: Driving Scene Reconstruction from Multiple Camera Models via Unified Gaussian Representations | Yuan Ren et.al. | 2411.15355 | null |
| 2024-11-21 | Multimodal 3D Reasoning Segmentation with Complex Scenes | Xueying Jiang et.al. | 2411.13927 | null |
| 2024-11-20 | Unbiased Scene Graph Generation by Type-Aware Message Passing on Heterogeneous and Dual Graphs | Guanglu Sun et.al. | 2411.13287 | null |
| 2024-11-20 | Towards Unbiased and Robust Spatio-Temporal Scene Graph Generation and Anticipation | Rohith Peddi et.al. | 2411.13059 | null |
| 2024-11-19 | GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving | Shaoqing Xu et.al. | 2411.12452 | link |
| 2024-11-19 | Classification of Geographical Land Structure Using Convolution Neural Network and Transfer Learning | Mustafa M. Abd Zaid et.al. | 2411.12415 | null |
| 2024-11-18 | Calibrated and Efficient Sampling-Free Confidence Estimation for LiDAR Scene Semantic Segmentation | Hanieh Shojaei Miandashti et.al. | 2411.11935 | null |
| 2024-11-18 | MGNiceNet: Unified Monocular Geometric Scene Understanding | Markus Schön et.al. | 2411.11466 | null |
| 2024-11-18 | The ADUULM-360 Dataset – A Multi-Modal Dataset for Depth Estimation in Adverse Weather | Markus Schön et.al. | 2411.11455 | null |
| 2024-11-18 | Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications | Scarlett Raine et.al. | 2411.11287 | null |
| 2024-11-19 | Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition | Tiancheng Lin et.al. | 2411.11219 | link |
| 2024-11-17 | Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry | Wenjun Hou et.al. | 2411.10937 | null |
| 2024-11-16 | MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation | Ansh Shah et.al. | 2411.10886 | link |
| 2024-11-16 | Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm | Sari Masri et.al. | 2411.10869 | null |
| 2024-11-15 | TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding | Quang P. M. Pham et.al. | 2411.10509 | null |
| 2024-11-15 | Content-Aware Preserving Image Generation | Giang H. Le et.al. | 2411.09871 | null |
| 2024-11-13 | Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification | Jose-Luis Matez-Bandera et.al. | 2411.08727 | link |
| 2024-11-11 | $SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation | Yinshuang Xu et.al. | 2411.07326 | null |
| 2024-11-06 | Graph-Based Multi-Modal Sensor Fusion for Autonomous Driving | Depanshu Sani et.al. | 2411.03702 | null |
| 2024-11-05 | VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation | Haochen Zhang et.al. | 2411.03540 | link |
| 2024-11-05 | OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing | Pranav Gupta et.al. | 2411.02858 | null |
| 2024-11-04 | Modeling Uncertainty in 3D Gaussian Splatting through Continuous Semantic Splatting | Joey Wilson et.al. | 2411.02547 | null |
| 2024-11-04 | Multi-task Geometric Estimation of Depth and Surface Normal from Monocular 360° Images | Kun Huang et.al. | 2411.01749 | link |
| 2024-11-03 | VQ-Map: Bird’s-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization | Yiwei Zhang et.al. | 2411.01618 | link |
| 2024-11-01 | On Deep Learning for Geometric and Semantic Scene Understanding Using On-Vehicle 3D LiDAR | Li Li et.al. | 2411.00600 | link |
| 2024-11-01 | Federated Voxel Scene Graph for Intracranial Hemorrhage | Antoine P. Sanner et.al. | 2411.00578 | null |
| 2024-10-30 | UniRiT: Towards Few-Shot Non-Rigid Point Cloud Registration | Geng Li et.al. | 2410.22909 | null |
| 2024-10-30 | Situational Scene Graph for Structured Human-centric Situation Understanding | Chinthani Sugandhika et.al. | 2410.22829 | null |
| 2024-10-30 | Symbolic Graph Inference for Compound Scene Understanding | FNU Aryan et.al. | 2410.22626 | null |
| 2024-10-29 | Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving | Bo Jiang et.al. | 2410.22313 | link |
| 2024-10-26 | Towards Robust Algorithms for Surgical Phase Recognition via Digital Twin-based Scene Representation | Hao Ding et.al. | 2410.20026 | null |
| 2024-10-23 | Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement | Cheng Yuan et.al. | 2410.17642 | link |
| 2024-10-22 | PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding | Vinh Nguyen et.al. | 2410.16824 | null |
| 2024-10-20 | Scene Graph Generation with Role-Playing Large Language Models | Guikun Chen et.al. | 2410.15364 | null |
| 2024-10-20 | Large Language Models for Autonomous Driving (LLM4AD): Concept, Benchmark, Simulation, and Real-Vehicle Experiment | Can Cui et.al. | 2410.15281 | null |
| 2024-10-19 | Semantically Safe Robot Manipulation: From Semantic Scene Understanding to Motion Safeguards | Lukas Brunke et.al. | 2410.15185 | null |
| 2024-10-19 | Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding | Yi Liu et.al. | 2410.14944 | link |
| 2024-10-17 | ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding | Guangda Ji et.al. | 2410.13924 | link |
| 2024-10-17 | VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | Runsen Xu et.al. | 2410.13860 | link |
| 2024-10-16 | 3D Gaussian Splatting in Robotics: A Survey | Siting Zhu et.al. | 2410.12262 | null |
| 2024-10-17 | SAM-Guided Masked Token Prediction for 3D Scene Understanding | Zhimin Chen et.al. | 2410.12158 | null |
| 2024-10-16 | Leveraging Large Vision Language Model For Better Automatic Web GUI Testing | Siyi Wang et.al. | 2410.12157 | null |
| 2024-10-15 | MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark | Bin Shan et.al. | 2410.11538 | link |
| 2024-10-14 | 3DArticCyclists: Generating Simulated Dynamic 3D Cyclists for Human-Object Interaction (HOI) and Autonomous Driving Applications | Eduardo R. Corral-Soto et.al. | 2410.10782 | null |
| 2024-10-17 | Stratified Domain Adaptation: A Progressive Self-Training Approach for Scene Text Recognition | Kha Nhat Le et.al. | 2410.09913 | null |
| 2024-10-13 | LoLI-Street: Benchmarking Low-Light Image Enhancement and Beyond | Md Tanvir Islam et.al. | 2410.09831 | link |
| 2024-10-12 | Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors | Hritam Basak et.al. | 2410.09467 | null |
| 2024-10-11 | Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking | Wei Zhang et.al. | 2410.08616 | null |
| 2024-10-10 | A transition towards virtual representations of visual scenes | Américo Pereira et.al. | 2410.07987 | null |
| 2024-10-10 | RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation | Songming Liu et.al. | 2410.07864 | null |
| 2024-10-11 | Test-Time Intensity Consistency Adaptation for Shadow Detection | Leyi Zhu et.al. | 2410.07695 | null |
| 2024-10-10 | 3D Vision-Language Gaussian Splatting | Qucheng Peng et.al. | 2410.07577 | null |
| 2024-10-09 | Evaluating the Impact of Point Cloud Colorization on Semantic Segmentation Accuracy | Qinfeng Zhu et.al. | 2410.06725 | null |
| 2024-10-09 | Open-RGBT: Open-vocabulary RGB-T Zero-shot Semantic Segmentation in Open-world Environments | Meng Yu et.al. | 2410.06626 | null |
| 2024-10-08 | BoxMap: Efficient Structural Mapping and Navigation | Zili Wang et.al. | 2410.06263 | null |
| 2024-10-08 | OrionNav: Online Planning for Robot Autonomy with Context-Aware LLM and Open-Vocabulary Semantic Scene Graphs | Venkata Naren Devarakonda et.al. | 2410.06239 | null |
| 2024-10-07 | Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders | Kosta Dakic et.al. | 2410.04817 | null |
| 2024-10-07 | Diffusion Models in 3D Vision: A Survey | Zhen Wang et.al. | 2410.04738 | null |
| 2024-10-06 | In-Place Panoptic Radiance Field Segmentation with Perceptual Prior for 3D Scene Understanding | Shenghao Li et.al. | 2410.04529 | null |
| 2024-10-05 | ETHcavation: A Dataset and Pipeline for Panoptic Scene Understanding and Object Tracking in Dynamic Construction Environments | Lorenzo Terenzi et.al. | 2410.04250 | null |
| 2024-10-05 | Fast Object Detection with a Machine Learning Edge Device | Richard C. Rodriguez et.al. | 2410.04173 | null |
| 2024-10-04 | SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Models | Yue Zhang et.al. | 2410.03878 | null |
| 2024-10-03 | RESSCAL3D++: Joint Acquisition and Semantic Segmentation of 3D Point Clouds | Remco Royen et.al. | 2410.02323 | link |
| 2024-10-01 | A Critical Assessment of Visual Sound Source Localization Models Including Negative Audio | Xavier Juanola et.al. | 2410.01020 | link |
| 2024-09-30 | Class-Agnostic Visio-Temporal Scene Sketch Semantic Segmentation | Aleyna Kütük et.al. | 2410.00266 | null |
| 2024-09-30 | Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation | Kun Yuan et.al. | 2410.00263 | link |
| 2024-09-30 | You Only Speak Once to See | Wenhao Yang et.al. | 2409.18372 | null |
| 2024-09-26 | LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness | Chenming Zhu et.al. | 2409.18125 | null |
| 2024-09-26 | Text Image Generation for Low-Resource Languages with Dual Translation Learning | Chihiro Noguchi et.al. | 2409.17747 | null |
| 2024-09-26 | Scene Understanding in Pick-and-Place Tasks: Analyzing Transformations Between Initial and Final Scenes | Seraj Ghasemi et.al. | 2409.17720 | null |
| 2024-10-02 | BehAV: Behavioral Rule Guided Autonomy Using VLMs for Robot Navigation in Outdoor Scenes | Kasun Weerakoon et.al. | 2409.16484 | null |
| 2024-09-24 | Open-World Object Detection with Instance Representation Learning | Sunoh Lee et.al. | 2409.16073 | null |
| 2024-09-24 | Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving | Lingyu Xiao et.al. | 2409.15730 | link |
| 2024-09-27 | Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer | Minh Bui et.al. | 2409.15117 | null |
| 2024-09-23 | An Adverse Weather-Immune Scheme with Unfolded Regularization and Foundation Model Knowledge Distillation for Street Scene Understanding | Wei-Bin Kou et.al. | 2409.14737 | null |
| 2024-09-22 | One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance | Minyi Zhao et.al. | 2409.14483 | null |
| 2024-09-22 | Scene-Text Grounding for Text-Based Video Question Answering | Sheng Zhou et.al. | 2409.14319 | null |
| 2024-09-21 | MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors | Zhenhua Du et.al. | 2409.14019 | null |
| 2024-09-21 | Relevance-driven Decision Making for Safer and More Efficient Human Robot Collaboration | Xiaotong Zhang et.al. | 2409.13998 | null |
| 2024-09-21 | Enhanced Semantic Segmentation for Large-Scale and Imbalanced Point Clouds | Haoran Gong et.al. | 2409.13983 | null |
| 2024-09-19 | CLAIR-A: Leveraging Large Language Models to Judge Audio Captions | Tsung-Han Wu et.al. | 2409.12962 | link |
| 2024-09-18 | Towards Global Localization using Multi-Modal Object-Instance Re-Identification | Aneesh Chavan et.al. | 2409.12002 | null |
| 2024-09-18 | SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection | Tim Engelbracht et.al. | 2409.11870 | null |
| 2024-09-18 | VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer | Humen Zhong et.al. | 2409.11656 | null |
| 2024-09-18 | DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion | Jian Xu et.al. | 2409.11642 | link |
| 2024-09-16 | Video Token Sparsification for Efficient Multimodal LLMs in Autonomous Driving | Yunsheng Ma et.al. | 2409.11182 | null |
| 2024-09-16 | Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation | Yifan Xu et.al. | 2409.10350 | null |
| 2024-09-16 | Hydra-SGG: Hybrid Relation Assignment for One-stage Scene Graph Generation | Minghan Chen et.al. | 2409.10262 | null |
| 2024-09-15 | Semantic2D: A Semantic Dataset for 2D Lidar Semantic Segmentation | Zhanteng Xie et.al. | 2409.09899 | null |
| 2024-09-12 | LED: Light Enhanced Depth Estimation at Night | Simon de Moreau et.al. | 2409.08031 | link |
| 2024-09-12 | Relevance for Human Robot Collaboration | Xiaotong Zhang et.al. | 2409.07753 | null |
| 2024-09-10 | Towards Localizing Structural Elements: Merging Geometrical Detection with Semantic Verification in RGB-D Data | Ali Tourani et.al. | 2409.06625 | null |
| 2024-09-10 | Loss Distillation via Gradient Matching for Point Cloud Completion with Weighted Chamfer Distance | Fangzhou Lin et.al. | 2409.06171 | link |
| 2024-09-09 | Online 3D reconstruction and dense tracking in endoscopic videos | Michel Hayoz et.al. | 2409.06037 | link |
| 2024-09-08 | TanDepth: Leveraging Global DEMs for Metric Monocular Depth Estimation in UAVs | Horatiu Florea et.al. | 2409.05142 | null |
| 2024-09-06 | Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences | Rui Yu et.al. | 2409.04390 | null |
| 2024-09-06 | RCNet: Deep Recurrent Collaborative Network for Multi-View Low-Light Image Enhancement | Hao Luo et.al. | 2409.04363 | link |
| 2024-09-05 | Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | Yunze Man et.al. | 2409.03757 | link |
| 2024-09-05 | Optimizing 3D Gaussian Splatting for Sparse Viewpoint Scene Reconstruction | Shen Chen et.al. | 2409.03213 | null |
| 2024-09-04 | Can LVLMs Obtain a Driver’s License? A Benchmark Towards Reliable AGI for Autonomous Driving | Yuhang Lu et.al. | 2409.02914 | null |
| 2024-09-03 | Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning | Xiaowei Hu et.al. | 2409.02108 | link |
| 2024-09-03 | EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video | Zhen Zhou et.al. | 2409.01807 | link |
| 2024-09-03 | GaussianPU: A Hybrid 2D-3D Upsampling Framework for Enhancing Color Point Clouds via 3D Gaussian Splatting | Zixuan Guo et.al. | 2409.01581 | null |
| 2024-08-31 | Leaky Wave Antenna-Equipped RF Chipless Tags for Orientation Estimation | Onel L. A. López et.al. | 2409.00501 | null |
| 2024-08-30 | UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios | Baichuan Zhou et.al. | 2408.17267 | link |
| 2024-08-30 | AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding | Yonghui Wang et.al. | 2408.16986 | link |
| 2024-08-29 | DriveGenVLM: Real-world Video Generation for Vision Language Model based Autonomous Driving | Yongjie Fu et.al. | 2408.16647 | null |
| 2024-08-28 | Str-L Pose: Integrating Point and Structured Line for Relative Pose Estimation in Dual-Graph | Zherong Zhang et.al. | 2408.15750 | null |
| 2024-08-28 | RoboSense: Large-scale Dataset and Benchmark for Multi-sensor Low-speed Autonomous Driving | Haisheng Su et.al. | 2408.15503 | link |
| 2024-08-27 | Handling Geometric Domain Shifts in Semantic Segmentation of Surgical RGB and Hyperspectral Images | Silvia Seidlitz et.al. | 2408.15373 | link |
| 2024-08-27 | MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders | Baijiong Lin et.al. | 2408.15101 | link |
| 2024-08-27 | Interactive Occlusion Boundary Estimation through Exploitation of Synthetic Data | Lintao Xu et.al. | 2408.15038 | null |
| 2024-08-27 | BOX3D: Lightweight Camera-LiDAR Fusion for 3D Object Detection and Localization | Mario A. V. Saucedo et.al. | 2408.14941 | null |
| 2024-08-27 | Platypus: A Generalized Specialist Model for Reading Text in Various Forms | Peng Wang et.al. | 2408.14805 | link |
| 2024-08-27 | RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models | Junyao Ge et.al. | 2408.14744 | link |
| 2024-08-26 | Ensemble Predicate Decoding for Unbiased Scene Graph Generation | Jiasong Feng et.al. | 2408.14187 | null |
| 2024-08-26 | FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation | Daixun Li et.al. | 2408.13980 | null |
| 2024-08-25 | Making Large Language Models Better Planners with Reasoning-Decision Alignment | Zhijian Huang et.al. | 2408.13890 | null |
| 2024-08-25 | 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing | Shichao Dong et.al. | 2408.13788 | null |
| 2024-08-25 | Extremely Fine-Grained Visual Classification over Resembling Glyphs in the Wild | Fares Bougourzi et.al. | 2408.13774 | link |
| 2024-08-25 | SeeBelow: Sub-dermal 3D Reconstruction of Tumors with Surgical Robotic Palpation and Tactile Exploration | Raghava Uppuluri et.al. | 2408.13699 | null |
| 2024-08-21 | Exploring Scene Coherence for Semi-Supervised 3D Semantic Segmentation | Chuandong Liu et.al. | 2408.11280 | null |
| 2024-08-20 | OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding | Youjun Zhao et.al. | 2408.11030 | link |
| 2024-08-19 | 3D-Aware Instance Segmentation and Tracking in Egocentric Videos | Yash Bhalgat et.al. | 2408.09860 | null |
| 2024-08-16 | Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation | Tri Ton et.al. | 2408.08591 | null |
| 2024-08-15 | Towards Flexible Visual Relationship Segmentation | Fangrui Zhu et.al. | 2408.08305 | null |
| 2024-08-13 | SpectralGaussians: Semantic, spectral 3D Gaussian splatting for multi-spectral scene representation, visualization and analysis | Saptarshi Neil Sinha et.al. | 2408.06975 | null |
| 2024-08-13 | SceneGPT: A Language Model for 3D Scene Understanding | Shivam Chandhok et.al. | 2408.06926 | null |
| 2024-08-12 | HeLiMOS: A Dataset for Moving Object Segmentation in 3D Point Clouds From Heterogeneous LiDAR Sensors | Hyungtae Lim et.al. | 2408.06328 | null |
| 2024-08-11 | Decoder Pre-Training with only Text for Scene Text Recognition | Shuai Zhao et.al. | 2408.05706 | link |
| 2024-08-09 | Spherical World-Locking for Audio-Visual Localization in Egocentric Videos | Heeseung Yun et.al. | 2408.05364 | null |
| 2024-08-15 | DeepInteraction++: Multi-Modality Interaction for Autonomous Driving | Zeyu Yang et.al. | 2408.05075 | link |
| 2024-08-09 | Mesh-based Object Tracking for Dynamic Semantic 3D Scene Graphs via Ray Tracing | Lennart Niecksch et.al. | 2408.04979 | null |
| 2024-08-09 | Manipulable Semantic Components: a Computational Representation of Data Visualization Scenes | Zhicheng Liu et.al. | 2408.04798 | null |
| 2024-08-07 | Leveraging LLMs for Enhanced Open-Vocabulary 3D Scene Understanding in Autonomous Driving | Amirhosein Chahe et.al. | 2408.03516 | null |
| 2024-08-04 | LEGO: Self-Supervised Representation Learning for Scene Text Images | Yujin Ren et.al. | 2408.02036 | null |
| 2024-07-31 | RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion | Jianxin Huang et.al. | 2407.21631 | null |
| 2024-07-31 | Voxel Scene Graph for Intracranial Hemorrhage | Antoine P. Sanner et.al. | 2407.21580 | null |
| 2024-07-31 | A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap | Lijun Zhang et.al. | 2407.21438 | link |
| 2024-07-31 | DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations | Dongwon Son et.al. | 2407.21267 | null |
| 2024-07-30 | From Feature Importance to Natural Language Explanations Using LLMs with RAG | Sule Tekkesinoglu et.al. | 2407.20990 | null |
| 2024-07-30 | Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering | Yanpeng Zhao et.al. | 2407.20908 | link |
| 2024-07-30 | NIS-SLAM: Neural Implicit Semantic RGB-D SLAM for 3D Consistent Scene Understanding | Hongjia Zhai et.al. | 2407.20853 | null |
| 2024-07-29 | SANGRIA: Surgical Video Scene Graph Optimization for Surgical Workflow Prediction | Çağhan Köksal et.al. | 2407.20214 | null |
| 2024-07-29 | Rethinking RGB-D Fusion for Semantic Segmentation in Surgical Datasets | Muhammad Abdullah Jamal et.al. | 2407.19714 | null |
| 2024-07-28 | ASI-Seg: Audio-Driven Surgical Instrument Segmentation with Surgeon Intention Understanding | Zhen Chen et.al. | 2407.19435 | link |
| 2024-07-27 | GP-VLS: A general-purpose vision language model for surgery | Samuel Schmidgall et.al. | 2407.19305 | null |
| 2024-07-27 | Fine-Grained Scene Graph Generation via Sample-Level Bias Prediction | Yansheng Li et.al. | 2407.19259 | null |
| 2024-07-26 | BCTR: Bidirectional Conditioning Transformer for Scene Graph Generation | Peng Hao et.al. | 2407.18715 | null |
| 2024-07-26 | MOoSE: Multi-Orientation Sharing Experts for Open-set Scene Text Recognition | Chang Liu et.al. | 2407.18616 | link |
| 2024-07-26 | Answerability Fields: Answerable Location Estimation via Diffusion Models | Daichi Azuma et.al. | 2407.18497 | null |
| 2024-07-24 | 3D Question Answering for City Scene Understanding | Penglei Sun et.al. | 2407.17398 | null |
| 2024-07-23 | Augmented Efficiency: Reducing Memory Footprint and Accelerating Inference for 3D Semantic Segmentation through Hybrid Vision | Aditya Krishnan et.al. | 2407.16102 | null |
| 2024-07-25 | Semantic Diversity-aware Prototype-based Learning for Unbiased Scene Graph Generation | Jaehyeong Jeon et.al. | 2407.15396 | link |
| 2024-07-21 | VideoGameBunny: Towards vision assistants for video games | Mohammad Reza Taesiri et.al. | 2407.15295 | null |
| 2024-07-21 | Self-training Room Layout Estimation via Geometry-aware Ray-casting | Bolivar Solarte et.al. | 2407.15041 | null |
| 2024-07-19 | A New Lightweight Hybrid Graph Convolutional Neural Network – CNN Scheme for Scene Classification using Object Detection Inference | Ayman Beghdadi et.al. | 2407.14658 | null |
| 2024-07-19 | OpenSU3D: Open World 3D Scene Understanding using Foundation Models | Rafay Mohiuddin et.al. | 2407.14279 | null |
| 2024-07-19 | MC-PanDA: Mask Confidence for Panoptic Domain Adaptation | Ivan Martinović et.al. | 2407.14110 | link |
| 2024-07-19 | GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation | Florian Chabot et.al. | 2407.14108 | null |
| 2024-07-18 | Training-Free Model Merging for Multi-target Domain Adaptation | Wenyi Li et.al. | 2407.13771 | null |
| 2024-07-18 | General Geometry-aware Weakly Supervised 3D Object Detection | Guowen Zhang et.al. | 2407.13748 | link |
| 2024-07-18 | Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation | Pengfei Wang et.al. | 2407.13362 | null |
| 2024-07-17 | InfoNorm: Mutual Information Shaping of Normals for Sparse-View Reconstruction | Xulong Wang et.al. | 2407.12661 | link |
| 2024-07-17 | Out of Length Text Recognition with Sub-String Matching | Yongkun Du et.al. | 2407.12317 | link |
| 2024-07-17 | Dual-Hybrid Attention Network for Specular Highlight Removal | Xiaojiao Guo et.al. | 2407.12255 | null |
| 2024-07-16 | Disentangled Acoustic Fields For Multimodal Physical Scene Understanding | Jie Yin et.al. | 2407.11333 | null |
| 2024-07-15 | OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models | Zijian Zhou et.al. | 2407.11213 | link |
| 2024-07-15 | No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations | Walter Simoncini et.al. | 2407.10964 | link |
| 2024-07-18 | Benchmarking Vision Language Models for Cultural Understanding | Shravan Nayak et.al. | 2407.10920 | null |
| 2024-07-14 | Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data | Tuo Feng et.al. | 2407.10200 | link |
| 2024-07-13 | Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding | Ruihuang Li et.al. | 2407.09781 | null |
| 2024-07-12 | A Fair Ranking and New Model for Panoptic Scene Graph Generation | Julian Lorenz et.al. | 2407.09216 | link |
| 2024-07-12 | From Easy to Hard: Learning Curricular Shape-aware Features for Robust Panoptic Scene Graph Generation | Hanrong Shi et.al. | 2407.09191 | null |
| 2024-07-11 | BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight | Hang Wu et.al. | 2407.08526 | null |
| 2024-07-10 | Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences | Nikolaos Dimitriadis et.al. | 2407.08056 | null |
| 2024-07-10 | Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search | Kirill Paramonov et.al. | 2407.07541 | null |
| 2024-07-09 | Joint prototype and coefficient prediction for 3D instance segmentation | Remco Royen et.al. | 2407.06958 | null |
| 2024-07-09 | LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition | Teng Wang et.al. | 2407.06730 | null |
| 2024-07-08 | Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition | Bangbang Zhou et.al. | 2407.05562 | link |
| 2024-07-07 | Self-supervised Learning via Cluster Distance Prediction for Operating Room Context Awareness | Idris Hamoud et.al. | 2407.05448 | null |
| 2024-07-05 | Hybrid Primal Sketch: Combining Analogy, Qualitative Representations, and Computer Vision for Scene Understanding | Kenneth D. Forbus et.al. | 2407.04859 | null |
| 2024-07-03 | A Unified Framework for 3D Scene Understanding | Wei Xu et.al. | 2407.03263 | null |
| 2024-07-11 | Open Panoramic Segmentation | Junwei Zheng et.al. | 2407.02685 | link |
| 2024-07-02 | MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders | Baijiong Lin et.al. | 2407.02228 | link |
| 2024-07-02 | Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning | Chengchao Shen et.al. | 2407.02014 | link |
| 2024-07-01 | PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction | Xuan Yu et.al. | 2407.01349 | null |
| 2024-06-30 | ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding | Quang P. M. Pham et.al. | 2407.00609 | null |
| 2024-06-28 | EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting | Daiwei Zhang et.al. | 2406.19811 | null |
| 2024-07-01 | Mobile Robot Oriented Large-Scale Indoor Dataset for Dynamic Scene Understanding | Yifan Tang et.al. | 2406.19791 | null |
| 2024-06-28 | PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation | Deyi Ji et.al. | 2406.19632 | null |
| 2024-06-27 | Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation | KuanChao Chu et.al. | 2406.19316 | null |
| 2024-06-26 | 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation | Shengyi Qian et.al. | 2406.18158 | null |
| 2024-06-24 | GPT-4V Explorations: Mining Autonomous Driving | Zixuan Li et.al. | 2406.16817 | null |
| 2024-06-25 | AudioBench: A Universal Benchmark for Audio Large Language Models | Bin Wang et.al. | 2406.16020 | link |
| 2024-06-20 | EvSegSNN: Neuromorphic Semantic Segmentation for Event Data | Dalia Hareb et.al. | 2406.14178 | null |
| 2024-06-19 | StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images | Rushikesh Zawar et.al. | 2406.13735 | null |
| 2024-06-17 | DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features | Letian Wang et.al. | 2406.12095 | null |
| 2024-06-17 | Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding | Yunsong Wang et.al. | 2406.11283 | null |
| 2024-06-15 | PIG: Prompt Images Guidance for Night-Time Scene Parsing | Zhifeng Xie et.al. | 2406.10531 | link |
| 2024-06-14 | MapVision: CVPR 2024 Autonomous Grand Challenge Mapless Driving Tech Report | Zhongyu Yang et.al. | 2406.10125 | null |
| 2024-06-14 | SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding | Junwei Luo et.al. | 2406.10100 | link |
| 2024-06-14 | A Two-Stage Masked Autoencoder Based Network for Indoor Depth Completion | Kailai Sun et.al. | 2406.09792 | link |
| 2024-06-13 | MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | Fei Wang et.al. | 2406.09411 | link |
| 2024-06-13 | Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach | Yansheng Li et.al. | 2406.09410 | link |
| 2024-06-12 | Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment | Taekbeom Lee et.al. | 2406.08176 | link |
| 2024-06-13 | A3VLM: Actionable Articulation-Aware Vision Language Model | Siyuan Huang et.al. | 2406.07549 | link |
| 2024-06-10 | ReCon1M: A Large-scale Benchmark Dataset for Relation Comprehension in Remote Sensing Imagery | Xian Sun et.al. | 2406.06028 | null |
| 2024-06-11 | LOP-Field: Brain-inspired Layout-Object-Position Fields for Robotic Scene Understanding | Jiawei Hou et.al. | 2406.05985 | null |
| 2024-06-08 | 1st Place Winner of the 2024 Pixel-level Video Understanding in the Wild (CVPR’24 PVUW) Challenge in Video Panoptic Segmentation and Best Long Video Consistency of Video Semantic Segmentation | Qingfeng Liu et.al. | 2406.05352 | null |
| 2024-06-06 | Semantic Similarity Score for Measuring Visual Similarity at Semantic Level | Senran Fan et.al. | 2406.03865 | null |
| 2024-06-04 | Radar Spectra-Language Model for Automotive Scene Parsing | Mariia Pushkareva et.al. | 2406.02158 | null |
| 2024-06-04 | Leveraging Predicate and Triplet Learning for Scene Graph Generation | Jiankai Li et.al. | 2406.02038 | link |
| 2024-06-04 | FastLGS: Speeding up Language Embedded Gaussians with Feature Grid Mapping | Yuzhou Ji et.al. | 2406.01916 | null |
| 2024-06-04 | PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning | Yupeng Zheng et.al. | 2406.01587 | null |
| 2024-06-03 | EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding | Thanh-Dat Truong et.al. | 2406.01429 | null |
| 2024-06-03 | Object Aware Egocentric Online Action Detection | Joungbin An et.al. | 2406.01079 | null |
| 2024-06-03 | CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos | Trong-Thuan Nguyen et.al. | 2406.01029 | null |
| 2024-06-02 | Compositional 4D Dynamic Scenes Understanding with Physics Priors for Video Question Answering | Xingrui Wang et.al. | 2406.00622 | link |
| 2024-06-02 | Semi-supervised Video Semantic Segmentation Using Unreliable Pseudo Labels for PVUW2024 | Biao Wu et.al. | 2406.00587 | null |
| 2024-05-30 | Learning 3D Robotics Perception using Inductive Priors | Muhammad Zubair Irshad et.al. | 2405.20364 | null |
| 2024-05-30 | SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation | Junjie Zhang et.al. | 2405.19586 | null |
| 2024-05-29 | Kestrel: Point Grounding Multimodal LLM for Part-Aware 3D Vision-Language Understanding | Junjie Fei et.al. | 2405.18937 | null |
| 2024-05-27 | GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane | Yansong Qu et.al. | 2405.17596 | null |
| 2024-05-27 | OED: Towards One-stage End-to-End Dynamic Scene Graph Generation | Guan Wang et.al. | 2405.16925 | link |
| 2024-05-25 | Real-Time Scene Graph Generation | Maëlic Neau et.al. | 2405.16116 | link |
| 2024-05-24 | Open-Vocabulary SAM3D: Understand Any 3D Scene | Hanchen Tai et.al. | 2405.15580 | null |
| 2024-05-23 | Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis | Basile Van Hoorick et.al. | 2405.14868 | null |
| 2024-05-23 | CoPeD-Advancing Multi-Robot Collaborative Perception: A Comprehensive Dataset in Real-World Environments | Yang Zhou et.al. | 2405.14731 | link |
| 2024-05-23 | Efficient Robot Learning for Perception and Mapping | Niclas Vödisch et.al. | 2405.14688 | null |
| 2024-05-24 | Transformers for Image-Goal Navigation | Nikhilanj Pelluri et.al. | 2405.14128 | null |
| 2024-05-22 | TS40K: a 3D Point Cloud Dataset of Rural Terrain and Electrical Transmission System | Diogo Lavado et.al. | 2405.13989 | null |
| 2024-05-22 | A General Framework for Jersey Number Recognition in Sports Video | Maria Koshkina et.al. | 2405.13896 | link |
| 2024-05-22 | GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games | Aoran Mei et.al. | 2405.13751 | null |
| 2024-05-21 | Anticipating Object State Changes | Victoria Manousaki et.al. | 2405.12789 | null |
| 2024-05-21 | Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency | Hyeongjin Kim et.al. | 2405.12648 | null |
| 2024-05-20 | MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering | Jingqun Tang et.al. | 2405.11985 | link |
| 2024-05-19 | The First Swahili Language Scene Text Detection and Recognition Dataset | Fadila Wendigoundi Douamba et.al. | 2405.11437 | link |
| 2024-05-16 | Grounded 3D-LLM with Referent Tokens | Yilun Chen et.al. | 2405.10370 | link |
| 2024-05-16 | 4D Panoptic Scene Graph Generation | Jingkang Yang et.al. | 2405.10305 | link |
| 2024-05-16 | When LLMs step into the 3D World: A Survey and Meta-Analysis of 3D Tasks via Multi-modal Large Language Models | Xianzheng Ma et.al. | 2405.10255 | link |
| 2024-05-16 | A Preprocessing and Postprocessing Voxel-based Method for LiDAR Semantic Segmentation Improvement in Long Distance | Andrea Matteazzi et.al. | 2405.10046 | null |
| 2024-05-15 | BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation | Yunhao Ge et.al. | 2405.09546 | null |
| 2024-05-15 | HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition | Honghui Chen et.al. | 2405.09125 | null |
| 2024-05-15 | 3D Shape Augmentation with Content-Aware Shape Resizing | Mingxiang Chen et.al. | 2405.09050 | null |
| 2024-05-09 | Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control | Gunshi Gupta et.al. | 2405.05852 | link |
| 2024-05-11 | Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition | Zuan Gao et.al. | 2405.05841 | null |
| 2024-05-09 | Benchmarking Neural Radiance Fields for Autonomous Robots: An Overview | Yuhang Ming et.al. | 2405.05526 | null |
| 2024-05-09 | DTCLMapper: Dual Temporal Consistent Learning for Vectorized HD Map Construction | Siyu Li et.al. | 2405.05518 | null |
| 2024-05-08 | OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies | Lingdong Kong et.al. | 2405.05259 | link |
| 2024-05-08 | Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving | Lingdong Kong et.al. | 2405.05258 | link |
| 2024-05-07 | DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving | Chen Min et.al. | 2405.04390 | null |
| 2024-05-07 | Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing | Boqiang Zhang et.al. | 2405.04377 | null |
| 2024-05-06 | An Empty Room is All We Want: Automatic Defurnishing of Indoor Panoramas | Mira Slavcheva et.al. | 2405.03682 | null |
| 2024-05-04 | Few-Shot Fruit Segmentation via Transfer Learning | Jordan A. James et.al. | 2405.02556 | link |
| 2024-04-29 | Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM | Navid Rajabi et.al. | 2404.19128 | null |
| 2024-04-29 | Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks | Christopher J. Kymn et.al. | 2404.19126 | null |
| 2024-04-24 | Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer | Jiaming Lei et.al. | 2404.15785 | null |
| 2024-04-22 | CloudFort: Enhancing Robustness of 3D Point Cloud Classification Against Backdoor Attacks via Spatial Partitioning and Ensemble Prediction | Wenhao Lan et.al. | 2404.14042 | null |
| 2024-04-22 | On Support Relations Inference and Scene Hierarchy Graph Construction from Point Cloud in Clustered Environments | Gang Ma et.al. | 2404.13842 | null |
| 2024-04-29 | Clio: Real-time Task-Driven Open-Set 3D Scene Graphs | Dominic Maggio et.al. | 2404.13696 | link |
| 2024-04-19 | BACS: Background Aware Continual Semantic Segmentation | Mostafa ElAraby et.al. | 2404.13148 | link |
| 2024-04-19 | Unified Scene Representation and Reconstruction for 3D Large Language Models | Tao Chu et.al. | 2404.13044 | null |
| 2024-04-18 | SPIdepth: Strengthened Pose Information for Self-supervised Monocular Depth Estimation | Mykola Lavreniuk et.al. | 2404.12501 | link |
| 2024-04-19 | AccidentBlip2: Accident Detection With Multi-View MotionBlip2 | Yihua Shao et.al. | 2404.12149 | link |
| 2024-04-17 | Multimodal 3D Object Detection on Unseen Domains | Deepti Hegde et.al. | 2404.11764 | null |
| 2024-04-16 | ECLAIR: A High-Fidelity Aerial LiDAR Dataset for Semantic Segmentation | Iaroslav Melekhov et.al. | 2404.10699 | link |
| 2024-04-16 | PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction | Sinisa Stekovic et.al. | 2404.10620 | link |
| 2024-04-16 | PreGSU-A Generalized Traffic Scene Understanding Model for Autonomous Driving based on Pre-trained Graph Attention Network | Yuning Wang et.al. | 2404.10263 | null |
| 2024-04-15 | No More Ambiguity in 360° Room Layout via Bi-Layout Estimation | Yu-Ju Tsai et.al. | 2404.09993 | null |
| 2024-04-15 | A Review and Efficient Implementation of Scene Graph Generation Metrics | Julian Lorenz et.al. | 2404.09616 | link |
| 2024-04-14 | Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms | Diandian Guo et.al. | 2404.09231 | null |
| 2024-04-11 | Gaga: Group Any Gaussians via 3D-aware Memory Bank | Weijie Lyu et.al. | 2404.07977 | null |
| 2024-04-11 | AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation | Yansheng Li et.al. | 2404.07788 | null |
| 2024-04-11 | Depth Estimation using Weighted-loss and Transfer Learning | Muhammad Adeel Hafeez et.al. | 2404.07686 | null |
| 2024-04-11 | Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange | Yanhao Wu et.al. | 2404.07504 | null |
| 2024-04-10 | Incorporating Explanations into Human-Machine Interfaces for Trust and Situation Awareness in Autonomous Vehicles | Shahin Atakishiyev et.al. | 2404.07383 | null |
| 2024-04-10 | ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling | Ege Özsoy et.al. | 2404.07031 | link |
| 2024-04-10 | O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation | Muer Tie et.al. | 2404.06836 | null |
| 2024-04-09 | QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding | Yash Mehan et.al. | 2404.06442 | null |
| 2024-04-09 | DaF-BEVSeg: Distortion-aware Fisheye Camera based Bird’s Eye View Segmentation with Occlusion Reasoning | Senthil Yogamani et.al. | 2404.06352 | null |
| 2024-04-09 | JSTR: Judgment Improves Scene Text Recognition | Masato Fujitake et.al. | 2404.05967 | null |
| 2024-04-06 | Panoptic Perception: A Novel Task and Fine-grained Dataset for Universal Remote Sensing Image Interpretation | Danpei Zhao et.al. | 2404.04608 | null |
| 2024-04-06 | SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos | Tao Wu et.al. | 2404.04565 | link |
| 2024-04-05 | Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation | Zifu Wan et.al. | 2404.04256 | link |
| 2024-04-06 | HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion | Jiahang Li et.al. | 2404.03527 | link |
| 2024-04-04 | You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects | Lei Zhou et.al. | 2404.03462 | null |
| 2024-04-03 | Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling | Xu Wang et.al. | 2404.02527 | null |
| 2024-04-05 | EGTR: Extracting Graph from Transformer for Scene Graph Generation | Jinbae Im et.al. | 2404.02072 | link |
| 2024-04-01 | NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields | Muhammad Zubair Irshad et.al. | 2404.01300 | null |
| 2024-04-08 | 360+x: A Panoptic Multi-modal Scene Understanding Dataset | Hao Chen et.al. | 2404.00989 | null |
| 2024-04-01 | Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping | Hyeongjun Kwon et.al. | 2404.00974 | link |
| 2024-04-01 | GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields | Yunsong Wang et.al. | 2404.00931 | link |
| 2024-04-01 | MM3DGS SLAM: Multi-modal 3D Gaussian Splatting for SLAM Using Vision, Depth, and Inertial Measurements | Lisong C. Sun et.al. | 2404.00923 | link |
| 2024-04-01 | From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models | Rongjie Li et.al. | 2404.00906 | null |
| 2024-03-31 | Adapting to Length Shift: FlexiLength Network for Trajectory Prediction | Yi Xu et.al. | 2404.00742 | null |
| 2024-03-31 | Neural Radiance Field-based Visual Rendering: A Comprehensive Review | Mingyuan Yao et.al. | 2404.00714 | null |
| 2024-03-29 | VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection | Zihua Liu et.al. | 2404.00149 | null |
| 2024-03-29 | HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes | Ke Wu et.al. | 2403.20159 | null |
| 2024-04-01 | Efficient 3D Instance Mapping and Localization with Neural Fields | George Tang et.al. | 2403.19797 | null |
| 2024-03-27 | Object Pose Estimation via the Aggregation of Diffusion Features | Tianfu Wang et.al. | 2403.18791 | link |
| 2024-03-25 | Calib3D: Calibrating Model Preferences for Reliable 3D Scene Understanding | Lingdong Kong et.al. | 2403.17010 | link |
| 2024-03-25 | Towards Trustworthy Automated Driving through Qualitative Scene Understanding and Explanations | Nassim Belmecheri et.al. | 2403.16908 | null |
| 2024-03-25 | DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding | Xiaoxuan Yu et.al. | 2403.16431 | link |
| 2024-03-24 | AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans | Cedric Perauer et.al. | 2403.16318 | null |
| 2024-03-24 | Improving Scene Graph Generation with Relation Words’ Debiasing in Vision-Language Models | Yuxuan Wang et.al. | 2403.16184 | null |
| 2024-03-24 | Multi-Task Learning with Multi-Task Optimization | Lu Bai et.al. | 2403.16162 | null |
| 2024-03-24 | Semantic Is Enough: Only Semantic Information For NeRF Reconstruction | Ruibo Wang et.al. | 2403.16043 | null |
| 2024-03-22 | Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting | Jun Guo et.al. | 2403.15624 | null |
| 2024-03-22 | DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data | Hanrong Ye et.al. | 2403.15389 | null |
| 2024-03-21 | DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation | Zeeshan Hayder et.al. | 2403.14886 | null |
| 2024-03-21 | Evaluating Panoramic 3D Estimation in Indoor Lighting Analysis | Zining Cheng et.al. | 2403.14836 | null |
| 2024-03-21 | SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field | Lizhe Liu et.al. | 2403.14366 | null |
| 2024-03-21 | Exosense: A Vision-Centric Scene Understanding System For Safe Exoskeleton Navigation | Jianeng Wang et.al. | 2403.14320 | null |
| 2024-03-21 | Volumetric Environment Representation for Vision-Language Navigation | Rui Liu et.al. | 2403.14158 | null |
| 2024-03-21 | 3D Object Detection from Point Cloud via Voting Step Diffusion | Haoran Hou et.al. | 2403.14133 | null |
| 2024-03-20 | Efficient scene text image super-resolution with semantic guidance | LeoWu TomyEnrique et.al. | 2403.13330 | link |
| 2024-03-19 | SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model | Armen Avetisyan et.al. | 2403.13064 | null |
| 2024-03-19 | HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting | Hongyu Zhou et.al. | 2403.12722 | null |
| 2024-03-19 | M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving | Dongyang Xu et.al. | 2403.12552 | null |
| 2024-03-19 | Multi-Object RANSAC: Efficient Plane Clustering Method in a Clutter | Seunghyeon Lim et.al. | 2403.12449 | null |
| 2024-03-19 | Geometric Constraints in Deep Learning Frameworks: A Survey | Vibhas K Vats et.al. | 2403.12431 | null |
| 2024-03-18 | R3DS: Reality-linked 3D Scenes for Panoramic Scene Understanding | Qirui Wu et.al. | 2403.12301 | null |
| 2024-03-18 | HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation | Ce Zhang et.al. | 2403.12033 | link |
| 2024-03-18 | Agent3D-Zero: An Agent for Zero-shot 3D Understanding | Sha Zhang et.al. | 2403.11835 | null |
| 2024-03-18 | OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation | Haochen Jiang et.al. | 2403.11796 | null |
| 2024-03-19 | Urban Scene Diffusion through Semantic Occupancy Map | Junge Zhang et.al. | 2403.11697 | null |
| 2024-03-18 | Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation | Ming Xu et.al. | 2403.11541 | link |
| 2024-03-18 | Beyond Uncertainty: Risk-Aware Active View Acquisition for Safe Robot Navigation and 3D Scene Understanding with FisherRF | Guangyi Liu et.al. | 2403.11396 | null |
| 2024-03-17 | Omni-Recon: Towards General-Purpose Neural Radiance Fields for Versatile 3D Applications | Yonggan Fu et.al. | 2403.11131 | link |
| 2024-03-16 | N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields | Yash Bhalgat et.al. | 2403.10997 | null |
| 2024-03-16 | Segment Any Object Model (SAOM): Real-to-Simulation Fine-Tuning Strategy for Multi-Class Multi-Instance Segmentation | Mariia Khan et.al. | 2403.10780 | null |
| 2024-03-15 | Robust Shape Fitting for 3D Scene Abstraction | Florian Kluger et.al. | 2403.10452 | link |
| 2024-03-15 | Do Visual-Language Maps Capture Latent Semantics? | Matti Pekkanen et.al. | 2403.10117 | null |
| 2024-03-15 | Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning | Hang Zhang et.al. | 2403.10107 | null |
| 2024-03-14 | GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding | Chengyao Wang et.al. | 2403.09639 | link |
| 2024-03-12 | IndicSTR12: A Dataset for Indic Scene Text Recognition | Harsh Lunia et.al. | 2403.08007 | null |
| 2024-03-12 | Efficient Global Navigational Planning in 3D Structures based on Point Cloud Tomography | Bowen Yang et.al. | 2403.07631 | link |
| 2024-03-12 | Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss | Xuhua Ren et.al. | 2403.07518 | null |
| 2024-03-12 | MoAI: Mixture of All Intelligence for Large Language and Vision Models | Byung-Kwan Lee et.al. | 2403.07508 | link |
| 2024-03-11 | Mapping High-level Semantic Regions in Indoor Environments without Object Recognition | Roberto Bigazzi et.al. | 2403.07076 | null |
| 2024-03-11 | Optimizing Latent Graph Representations of Surgical Scenes for Zero-Shot Domain Transfer | Siddhant Satyanaik et.al. | 2403.06953 | null |
| 2024-03-08 | Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation | Yifan Mao et.al. | 2403.05056 | link |
| 2024-03-07 | Towards Scene Graph Anticipation | Rohith Peddi et.al. | 2403.04899 | null |
| 2024-03-07 | Embodied Understanding of Driving Scenarios | Yunsong Zhou et.al. | 2403.04593 | link |
| 2024-03-07 | Out of the Room: Generalizing Event-Based Dynamic Motion Segmentation for Complex Scenes | Stamatios Georgoulis et.al. | 2403.04562 | null |
| 2024-03-06 | GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding | Zi-Ting Chou et.al. | 2403.03608 | null |
| 2024-03-05 | OORD: The Oxford Offroad Radar Dataset | Matthew Gadd et.al. | 2403.02845 | link |
| 2024-03-05 | HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes | Yichen Yao et.al. | 2403.02769 | null |
| 2024-02-29 | FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything | Safouane El Ghazouali et.al. | 2403.00175 | link |
| 2024-02-29 | One model to use them all: Training a segmentation model with complementary datasets | Alexander C. Jenke et.al. | 2402.19340 | link |
| 2024-02-29 | Feature boosting with efficient attention for scene parsing | Vivek Singh et.al. | 2402.19250 | null |
| 2024-02-29 | PCDepth: Pattern-based Complementary Learning for Monocular Depth Estimation by Best of Both Worlds | Haotian Liu et.al. | 2402.18925 | null |
| 2024-02-28 | Windowed-FourierMixer: Enhancing Clutter-Free Room Modeling with Fourier Transform | Bruno Henriques et.al. | 2402.18287 | null |
| 2024-02-27 | LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment | Yiming Ren et.al. | 2402.17171 | null |
| 2024-02-27 | Efficiently Leveraging Linguistic Priors for Scene Text Spotting | Nguyen Nguyen et.al. | 2402.17134 | null |
| 2024-02-26 | DreamUp3D: Object-Centric Generative Models for Single-View 3D Scene Understanding and Real-to-Sim Transfer | Yizhe Wu et.al. | 2402.16308 | null |
| 2024-02-24 | Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition | Mingkun Yang et.al. | 2402.15806 | null |
| 2024-02-23 | OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding | Francis Engelmann et.al. | 2402.15321 | null |
| 2024-02-22 | S^2Former-OR: Single-Stage Bimodal Transformer for Scene Graph Generation in OR | Jialun Pei et.al. | 2402.14461 | null |
| 2024-02-22 | Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding | Yu-Qi Yang et.al. | 2402.14215 | link |
| 2024-02-21 | Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition | Mingkun Yang et.al. | 2402.13643 | link |
| 2024-02-25 | DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | Xiaoyu Tian et.al. | 2402.12289 | null |
(<a href=../README.md>back to main</a>)