Multimodal - 2025-11 | Paper Arxiv Daily

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-11-30 | MM-ACT: Learn from Multimodal Parallel Generation to Act | Haotian Liang et al. | 2512.00975 | translate | read | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et al. | 2512.00565 | translate | read | null |
| 2025-11-29 | CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA | Vsevolod Kovalev et al. | 2512.00360 | translate | read | null |
| 2025-11-28 | Buffer replay enhances the robustness of multimodal learning under missing-modality | Hongye Zhu et al. | 2511.23070 | translate | read | null |
| 2025-11-27 | Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation | Xinyi Che et al. | 2511.22463 | translate | read | null |
| 2025-11-27 | Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation | Xinyi Che et al. | 2511.22447 | translate | read | null |
| 2025-11-27 | Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples | Shuhei Yamashita et al. | 2511.22141 | translate | read | null |
| 2025-11-26 | WalkCLIP: Multimodal Learning for Urban Walkability Prediction | Shilong Xiang et al. | 2511.21947 | translate | read | null |
| 2025-11-26 | Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI | Niccolo Marini et al. | 2511.21827 | translate | read | null |
| 2025-11-26 | Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling | Mengran Li et al. | 2511.21120 | translate | read | null |
| 2025-11-25 | A review on data fusion in multimodal learning analytics and educational data mining | Wilson Chango et al. | 2511.20871 | translate | read | null |
| 2025-11-25 | VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning | Bo Pang et al. | 2511.20422 | translate | read | null |
| 2025-11-25 | MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts | Zilong Huang et al. | 2511.20415 | translate | read | null |
| 2025-11-25 | ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis | Advik Sinha et al. | 2511.20274 | translate | read | null |
| 2025-11-24 | Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation | Yingjia Shang et al. | 2511.19257 | translate | read | null |
| 2025-11-24 | IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes | Carl Lindström et al. | 2511.19235 | translate | read | null |
| 2025-11-24 | Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? | Itay Cohen et al. | 2511.19200 | translate | read | null |
| 2025-11-23 | Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion | Haidong Kang et al. | 2511.18516 | translate | read | null |
| 2025-11-22 | Vulnerability-Aware Robust Multimodal Adversarial Training | Junrui Zhang et al. | 2511.18138 | translate | read | null |
| 2025-11-22 | Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning | Xiaohong Liu et al. | 2511.18104 | translate | read | null |
| 2025-11-17 | Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding | Yassir Benhammou et al. | 2511.17596 | translate | read | null |
| 2025-11-21 | MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment | Huangbiao Xu et al. | 2511.17397 | translate | read | null |
| 2025-11-21 | UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation | Chi Zhang et al. | 2511.16917 | translate | read | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et al. | 2511.16454 | translate | read | null |
| 2025-11-20 | Boosting Medical Visual Understanding From Multi-Granular Language Learning | Zihan Li et al. | 2511.15943 | translate | read | null |
| 2025-11-18 | Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer | Hyo-Jeong Jang et al. | 2511.15741 | translate | read | null |
| 2025-11-19 | SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome | Dabin Jeong et al. | 2511.15464 | translate | read | null |
| 2025-11-19 | Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions | Shan Shan et al. | 2511.15342 | translate | read | null |
| 2025-11-19 | Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval | Qing Wang et al. | 2511.15201 | translate | read | null |
| 2025-11-19 | TiCAL:Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition | Wen Yin et al. | 2511.15085 | translate | read | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et al. | 2511.14969 | translate | read | null |
| 2025-11-18 | Toward Robust and Harmonious Adaptation for Cross-modal Retrieval | Haobin Li et al. | 2511.14416 | translate | read | null |
| 2025-11-18 | Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation | Weimin Bai et al. | 2511.14271 | translate | read | null |
| 2025-11-18 | Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision | Zitang Sun et al. | 2511.14197 | translate | read | null |
| 2025-11-14 | Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement | Zhe Yang et al. | 2511.13755 | translate | read | null |
| 2025-11-17 | 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale | Yijia Fan et al. | 2511.13211 | translate | read | null |
| 2025-11-17 | uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data | Dahyun Chung et al. | 2511.13036 | translate | read | null |
| 2025-11-17 | Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks | Minsoo Jo et al. | 2511.12985 | translate | read | null |
| 2025-11-15 | To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance | Wanlong Fang et al. | 2511.12121 | translate | read | null |
| 2025-11-14 | Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification | Qinghao Gao et al. | 2511.11460 | translate | read | null |
| 2025-11-14 | AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery | Yuqi Yin et al. | 2511.11257 | translate | read | null |
| 2025-11-14 | LEMUR: Large scale End-to-end MUltimodal Recommendation | Xintian Han et al. | 2511.10962 | translate | read | null |
| 2025-11-14 | MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition | Feng Li et al. | 2511.10892 | translate | read | null |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Shruti Singh Baghel et al. | 2511.10615 | translate | read | null |
| 2025-11-13 | URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding | Yongxin Shi et al. | 2511.10552 | translate | read | null |
| 2025-11-13 | GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval | Hao Zou et al. | 2511.10154 | translate | read | null |
| 2025-11-13 | Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction | Mingda Jia et al. | 2511.10134 | translate | read | null |
| 2025-11-13 | Towards Robust Multimodal Learning in the Open World | Fushuo Huo et al. | 2511.09989 | translate | read | null |
| 2025-11-12 | Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard | Stelios Zarifis et al. | 2511.09727 | translate | read | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et al. | 2511.09282 | translate | read | null |
| 2025-11-11 | Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding | Da Li et al. | 2511.08480 | translate | read | null |
| 2025-11-11 | Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation | Jun Sun et al. | 2511.08152 | translate | read | null |
| 2025-11-11 | Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval | Likang Peng et al. | 2511.07780 | translate | read | null |
| 2025-11-11 | Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling | Jiale Liu et al. | 2511.07710 | translate | read | null |
| 2025-11-10 | A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation | Kamand Kalashi et al. | 2511.07573 | translate | read | null |
| 2025-11-10 | Integrating Epigenetic and Phenotypic Features for Biological Age Estimation in Cancer Patients via Multimodal Learning | Shuyue Jiang et al. | 2511.07219 | translate | read | null |
| 2025-11-10 | Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images | You-Kyoung Na et al. | 2511.06752 | translate | read | null |
| 2025-11-09 | LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval | Jian Zhang et al. | 2511.06268 | translate | read | null |
| 2025-11-09 | VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving | Ruifei Zhang et al. | 2511.06256 | translate | read | null |
| 2025-11-09 | AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving | Ruifei Zhang et al. | 2511.06253 | translate | read | null |
| 2025-11-08 | Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models | Akshar Tumu et al. | 2511.06146 | translate | read | null |
| 2025-11-04 | Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction | An Vuong et al. | 2511.05577 | translate | read | null |
| 2025-11-06 | DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification | Yujie Yang et al. | 2511.04281 | translate | read | null |
| 2025-11-05 | Cross-Modal Alignment via Variational Copula Modelling | Feng Wu et al. | 2511.03196 | translate | read | null |
| 2025-11-04 | SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment | Wenbo Lu et al. | 2511.03019 | translate | read | null |
| 2025-11-04 | ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology | Srikumar Sastry et al. | 2511.02946 | translate | read | null |
| 2025-11-04 | When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning | Chenyu Zhang et al. | 2511.02794 | translate | read | null |
| 2025-11-03 | OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance | Ziqi Wang et al. | 2511.01320 | translate | read | null |
| 2025-11-02 | Balanced Multimodal Learning via Mutual Information | Rongrong Xie et al. | 2511.00987 | translate | read | null |
| 2025-11-01 | LIR: The First Workshop on Late Interaction and Multi Vector Retrieval @ ECIR 2026 | Benjamin Clavié et al. | 2511.00444 | translate | read | null |
| 2025-11-01 | Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities | Xihang Qiu et al. | 2511.00344 | translate | read | null |

(<a href=../Multimodal.md>back to Multimodal</a>)