Multimodal - 2025-12

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-12-30 | Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset | TsaiChing Ni et.al. | 2512.24160 | translate | read | null |
| 2025-12-30 | Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval | Yizhi Liu et.al. | 2512.24064 | translate | read | null |
| 2025-12-29 | Wireless Multimodal Foundation Model (WMFM): Integrating Vision and Communication Modalities for 6G ISAC Systems | Mohammad Farzanullah et.al. | 2512.23897 | translate | read | null |
| 2025-12-29 | Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition | Arman Martirosyan et.al. | 2512.23291 | translate | read | null |
| 2025-12-29 | Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism | Siyu Zhang et.al. | 2512.23243 | translate | read | null |
| 2025-12-28 | Fusion or Confusion? Multimodal Complexity Is Not All You Need | Tillmann Rheude et.al. | 2512.22991 | translate | read | null |
| 2025-12-28 | Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives | Shuanghao Bai et.al. | 2512.22983 | translate | read | null |
| 2025-12-25 | TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References | Jiahong Yu et.al. | 2512.21641 | translate | read | null |
| 2025-12-24 | Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies | Jing Han et.al. | 2512.20938 | translate | read | null |
| 2025-12-23 | Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI | Muhammad Usman et.al. | 2512.20436 | translate | read | null |
| 2025-12-23 | Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems | YuChe Hsu et.al. | 2512.20387 | translate | read | null |
| 2025-12-23 | Retrieval-augmented Prompt Learning for Pre-trained Foundation Models | Xiang Chen et.al. | 2512.20145 | translate | read | null |
| 2025-12-22 | Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis | Argha Kamal Samanta et.al. | 2512.19663 | translate | read | null |
| 2025-12-22 | Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis | Xiaoming Zhang et.al. | 2512.19415 | translate | read | null |
| 2025-12-22 | OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation | Xueming Yan et.al. | 2512.19379 | translate | read | null |
| 2025-12-19 | STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting | Yifei Cheng et.al. | 2512.17667 | translate | read | null |
| 2025-12-19 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | Fengchun Liu et.al. | 2512.17621 | translate | read | null |
| 2025-12-18 | Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future | Tianshuai Hu et.al. | 2512.16760 | translate | read | null |
| 2025-12-18 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Kejun Liu et.al. | 2512.16485 | translate | read | null |
| 2025-12-17 | GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection | Yu Wang et.al. | 2512.15707 | translate | read | null |
| 2025-12-17 | An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain | João Daniel Silva et.al. | 2512.15531 | translate | read | null |
| 2025-12-16 | Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris | Wenshuo Li et.al. | 2512.14878 | translate | read | null |
| 2025-12-15 | STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning | Jie Qin et.al. | 2512.13752 | translate | read | null |
| 2025-12-15 | JoVA: Unified Multimodal Learning for Joint Video-Audio Generation | Xiaohu Huang et.al. | 2512.13677 | translate | read | null |
| 2025-12-15 | A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis | Xianchao Guan et.al. | 2512.13164 | translate | read | null |
| 2025-12-13 | EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography | Yuheng Li et.al. | 2512.12107 | translate | read | null |
| 2025-12-12 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Emanuel Sánchez Aimar et.al. | 2512.11490 | translate | read | null |
| 2025-12-12 | Exploring MLLM-Diffusion Information Transfer with MetaCanvas | Han Lin et.al. | 2512.11464 | translate | read | null |
| 2025-12-12 | AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities | Chenyiming Wen et.al. | 2512.11331 | translate | read | null |
| 2025-12-02 | Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems | Matvey Nepomnyaschiy et.al. | 2512.10975 | translate | read | null |
| 2025-12-11 | Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval | J. Xiao et.al. | 2512.10596 | translate | read | null |
| 2025-12-11 | Cross-modal Retrieval Models for Stripped Binary Analysis | Guoqiang Chen et.al. | 2512.10393 | translate | read | null |
| 2025-12-05 | What Happens When: Learning Temporal Orders of Events in Videos | Daechul Ahn et.al. | 2512.08979 | translate | read | null |
| 2025-12-09 | Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval | Tao Chen et.al. | 2512.08410 | translate | read | null |
| 2025-12-08 | CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification | Pingchuan Ma et.al. | 2512.08071 | translate | read | null |
| 2025-12-08 | Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation | Shihao Zhao et.al. | 2512.07747 | translate | read | null |
| 2025-12-08 | VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation | Md Selim Sarowar et.al. | 2512.07215 | translate | read | null |
| 2025-12-07 | A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations | Waleed Razzaq et.al. | 2512.06708 | translate | read | null |
| 2025-12-06 | Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion | Jaewon Ahn et.al. | 2512.06449 | translate | read | null |
| 2025-12-05 | Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures | Amirkia Rafiei Oskooei et.al. | 2512.05908 | translate | read | null |
| 2025-12-04 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Xianfeng Wu et.al. | 2512.05060 | translate | read | null |
| 2025-12-03 | Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation | Xiaosen Lyu et.al. | 2512.03521 | translate | read | null |
| 2025-12-03 | Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation | Xieji Li et.al. | 2512.03445 | translate | read | null |
| 2025-12-03 | Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features | Yuzhen Hu et.al. | 2512.03430 | translate | read | null |
| 2025-12-02 | Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation | Ziniu Zhang et.al. | 2512.02920 | translate | read | null |
| 2025-12-02 | Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education | Alvaro Becerra et.al. | 2512.02651 | translate | read | null |
| 2025-12-02 | Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources | Phuc Pham et.al. | 2512.02438 | translate | read | null |
