Multimodal - 2025-12 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Code
2025-12-30	Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset	TsaiChing Ni et al.	2512.24160	null
2025-12-30	Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval	Yizhi Liu et al.	2512.24064	null
2025-12-29	Wireless Multimodal Foundation Model (WMFM): Integrating Vision and Communication Modalities for 6G ISAC Systems	Mohammad Farzanullah et al.	2512.23897	null
2025-12-29	Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition	Arman Martirosyan et al.	2512.23291	null
2025-12-29	Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism	Siyu Zhang et al.	2512.23243	null
2025-12-28	Fusion or Confusion? Multimodal Complexity Is Not All You Need	Tillmann Rheude et al.	2512.22991	null
2025-12-28	Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives	Shuanghao Bai et al.	2512.22983	null
2025-12-25	TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References	Jiahong Yu et al.	2512.21641	null
2025-12-24	Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies	Jing Han et al.	2512.20938	null
2025-12-23	Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI	Muhammad Usman et al.	2512.20436	null
2025-12-23	Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems	YuChe Hsu et al.	2512.20387	null
2025-12-23	Retrieval-augmented Prompt Learning for Pre-trained Foundation Models	Xiang Chen et al.	2512.20145	null
2025-12-22	Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis	Argha Kamal Samanta et al.	2512.19663	null
2025-12-22	Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis	Xiaoming Zhang et al.	2512.19415	null
2025-12-22	OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation	Xueming Yan et al.	2512.19379	null
2025-12-19	STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting	Yifei Cheng et al.	2512.17667	null
2025-12-19	PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology	Fengchun Liu et al.	2512.17621	null
2025-12-18	Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future	Tianshuai Hu et al.	2512.16760	null
2025-12-18	Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors	Kejun Liu et al.	2512.16485	null
2025-12-17	GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection	Yu Wang et al.	2512.15707	null
2025-12-17	An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain	João Daniel Silva et al.	2512.15531	null
2025-12-16	Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris	Wenshuo Li et al.	2512.14878	null
2025-12-15	STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning	Jie Qin et al.	2512.13752	null
2025-12-15	JoVA: Unified Multimodal Learning for Joint Video-Audio Generation	Xiaohu Huang et al.	2512.13677	null
2025-12-15	A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis	Xianchao Guan et al.	2512.13164	null
2025-12-13	EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography	Yuheng Li et al.	2512.12107	null
2025-12-12	VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing	Emanuel Sánchez Aimar et al.	2512.11490	null
2025-12-12	Exploring MLLM-Diffusion Information Transfer with MetaCanvas	Han Lin et al.	2512.11464	null
2025-12-12	AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities	Chenyiming Wen et al.	2512.11331	null
2025-12-02	Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems	Matvey Nepomnyaschiy et al.	2512.10975	null
2025-12-11	Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval	J. Xiao et al.	2512.10596	null
2025-12-11	Cross-modal Retrieval Models for Stripped Binary Analysis	Guoqiang Chen et al.	2512.10393	null
2025-12-05	What Happens When: Learning Temporal Orders of Events in Videos	Daechul Ahn et al.	2512.08979	null
2025-12-09	Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval	Tao Chen et al.	2512.08410	null
2025-12-08	CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification	Pingchuan Ma et al.	2512.08071	null
2025-12-08	Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation	Shihao Zhao et al.	2512.07747	null
2025-12-08	VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation	Md Selim Sarowar et al.	2512.07215	null
2025-12-07	A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations	Waleed Razzaq et al.	2512.06708	null
2025-12-06	Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion	Jaewon Ahn et al.	2512.06449	null
2025-12-05	Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures	Amirkia Rafiei Oskooei et al.	2512.05908	null
2025-12-04	4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer	Xianfeng Wu et al.	2512.05060	null
2025-12-03	Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation	Xiaosen Lyu et al.	2512.03521	null
2025-12-03	Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation	Xieji Li et al.	2512.03445	null
2025-12-03	Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features	Yuzhen Hu et al.	2512.03430	null
2025-12-02	Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation	Ziniu Zhang et al.	2512.02920	null
2025-12-02	Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education	Alvaro Becerra et al.	2512.02651	null
2025-12-02	Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources	Phuc Pham et al.	2512.02438	null