Multimodal - 2025-11
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-11-30 | MM-ACT: Learn from Multimodal Parallel Generation to Act | Haotian Liang et.al. | 2512.00975 | translate | read | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et.al. | 2512.00565 | translate | read | null |
| 2025-11-29 | CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA | Vsevolod Kovalev et.al. | 2512.00360 | translate | read | null |
| 2025-11-28 | Buffer replay enhances the robustness of multimodal learning under missing-modality | Hongye Zhu et.al. | 2511.23070 | translate | read | null |
| 2025-11-27 | Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation | Xinyi Che et.al. | 2511.22463 | translate | read | null |
| 2025-11-27 | Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation | Xinyi Che et.al. | 2511.22447 | translate | read | null |
| 2025-11-27 | Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples | Shuhei Yamashita et.al. | 2511.22141 | translate | read | null |
| 2025-11-26 | WalkCLIP: Multimodal Learning for Urban Walkability Prediction | Shilong Xiang et.al. | 2511.21947 | translate | read | null |
| 2025-11-26 | Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI | Niccolo Marini et.al. | 2511.21827 | translate | read | null |
| 2025-11-26 | Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling | Mengran Li et.al. | 2511.21120 | translate | read | null |
| 2025-11-25 | A review on data fusion in multimodal learning analytics and educational data mining | Wilson Chango et.al. | 2511.20871 | translate | read | null |
| 2025-11-25 | VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning | Bo Pang et.al. | 2511.20422 | translate | read | null |
| 2025-11-25 | MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts | Zilong Huang et.al. | 2511.20415 | translate | read | null |
| 2025-11-25 | ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis | Advik Sinha et.al. | 2511.20274 | translate | read | null |
| 2025-11-24 | Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation | Yingjia Shang et.al. | 2511.19257 | translate | read | null |
| 2025-11-24 | IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes | Carl Lindström et.al. | 2511.19235 | translate | read | null |
| 2025-11-24 | Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? | Itay Cohen et.al. | 2511.19200 | translate | read | null |
| 2025-11-23 | Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion | Haidong Kang et.al. | 2511.18516 | translate | read | null |
| 2025-11-22 | Vulnerability-Aware Robust Multimodal Adversarial Training | Junrui Zhang et.al. | 2511.18138 | translate | read | null |
| 2025-11-22 | Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning | Xiaohong Liu et.al. | 2511.18104 | translate | read | null |
| 2025-11-17 | Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding | Yassir Benhammou et.al. | 2511.17596 | translate | read | null |
| 2025-11-21 | MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment | Huangbiao Xu et.al. | 2511.17397 | translate | read | null |
| 2025-11-21 | UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation | Chi Zhang et.al. | 2511.16917 | translate | read | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et.al. | 2511.16454 | translate | read | null |
| 2025-11-20 | Boosting Medical Visual Understanding From Multi-Granular Language Learning | Zihan Li et.al. | 2511.15943 | translate | read | null |
| 2025-11-18 | Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer | Hyo-Jeong Jang et.al. | 2511.15741 | translate | read | null |
| 2025-11-19 | SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome | Dabin Jeong et.al. | 2511.15464 | translate | read | null |
| 2025-11-19 | Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions | Shan Shan et.al. | 2511.15342 | translate | read | null |
| 2025-11-19 | Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval | Qing Wang et.al. | 2511.15201 | translate | read | null |
| 2025-11-19 | TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition | Wen Yin et.al. | 2511.15085 | translate | read | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | translate | read | null |
| 2025-11-18 | Toward Robust and Harmonious Adaptation for Cross-modal Retrieval | Haobin Li et.al. | 2511.14416 | translate | read | null |
| 2025-11-18 | Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation | Weimin Bai et.al. | 2511.14271 | translate | read | null |
| 2025-11-18 | Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision | Zitang Sun et.al. | 2511.14197 | translate | read | null |
| 2025-11-14 | Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement | Zhe Yang et.al. | 2511.13755 | translate | read | null |
| 2025-11-17 | 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale | Yijia Fan et.al. | 2511.13211 | translate | read | null |
| 2025-11-17 | uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data | Dahyun Chung et.al. | 2511.13036 | translate | read | null |
| 2025-11-17 | Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks | Minsoo Jo et.al. | 2511.12985 | translate | read | null |
| 2025-11-15 | To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance | Wanlong Fang et.al. | 2511.12121 | translate | read | null |
| 2025-11-14 | Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification | Qinghao Gao et.al. | 2511.11460 | translate | read | null |
| 2025-11-14 | AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery | Yuqi Yin et.al. | 2511.11257 | translate | read | null |
| 2025-11-14 | LEMUR: Large scale End-to-end MUltimodal Recommendation | Xintian Han et.al. | 2511.10962 | translate | read | null |
| 2025-11-14 | MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition | Feng Li et.al. | 2511.10892 | translate | read | null |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Shruti Singh Baghel et.al. | 2511.10615 | translate | read | null |
| 2025-11-13 | URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding | Yongxin Shi et.al. | 2511.10552 | translate | read | null |
| 2025-11-13 | GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval | Hao Zou et.al. | 2511.10154 | translate | read | null |
| 2025-11-13 | Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction | Mingda Jia et.al. | 2511.10134 | translate | read | null |
| 2025-11-13 | Towards Robust Multimodal Learning in the Open World | Fushuo Huo et.al. | 2511.09989 | translate | read | null |
| 2025-11-12 | Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard | Stelios Zarifis et.al. | 2511.09727 | translate | read | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | translate | read | null |
| 2025-11-11 | Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding | Da Li et.al. | 2511.08480 | translate | read | null |
| 2025-11-11 | Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation | Jun Sun et.al. | 2511.08152 | translate | read | null |
| 2025-11-11 | Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval | Likang Peng et.al. | 2511.07780 | translate | read | null |
| 2025-11-11 | Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling | Jiale Liu et.al. | 2511.07710 | translate | read | null |
| 2025-11-10 | A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation | Kamand Kalashi et.al. | 2511.07573 | translate | read | null |
| 2025-11-10 | Integrating Epigenetic and Phenotypic Features for Biological Age Estimation in Cancer Patients via Multimodal Learning | Shuyue Jiang et.al. | 2511.07219 | translate | read | null |
| 2025-11-10 | Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images | You-Kyoung Na et.al. | 2511.06752 | translate | read | null |
| 2025-11-09 | LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval | Jian Zhang et.al. | 2511.06268 | translate | read | null |
| 2025-11-09 | VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving | Ruifei Zhang et.al. | 2511.06256 | translate | read | null |
| 2025-11-09 | AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving | Ruifei Zhang et.al. | 2511.06253 | translate | read | null |
| 2025-11-08 | Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models | Akshar Tumu et.al. | 2511.06146 | translate | read | null |
| 2025-11-04 | Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction | An Vuong et.al. | 2511.05577 | translate | read | null |
| 2025-11-06 | DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification | Yujie Yang et.al. | 2511.04281 | translate | read | null |
| 2025-11-05 | Cross-Modal Alignment via Variational Copula Modelling | Feng Wu et.al. | 2511.03196 | translate | read | null |
| 2025-11-04 | SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment | Wenbo Lu et.al. | 2511.03019 | translate | read | null |
| 2025-11-04 | ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology | Srikumar Sastry et.al. | 2511.02946 | translate | read | null |
| 2025-11-04 | When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning | Chenyu Zhang et.al. | 2511.02794 | translate | read | null |
| 2025-11-03 | OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance | Ziqi Wang et.al. | 2511.01320 | translate | read | null |
| 2025-11-02 | Balanced Multimodal Learning via Mutual Information | Rongrong Xie et.al. | 2511.00987 | translate | read | null |
| 2025-11-01 | LIR: The First Workshop on Late Interaction and Multi Vector Retrieval @ ECIR 2026 | Benjamin Clavié et.al. | 2511.00444 | translate | read | null |
| 2025-11-01 | Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities | Xihang Qiu et.al. | 2511.00344 | translate | read | null |
(<a href=../Multimodal.md>back to Multimodal</a>)