Multimodal - 2025-12
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-12-30 | Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset | TsaiChing Ni et.al. | 2512.24160 | translate | read | null |
| 2025-12-30 | Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval | Yizhi Liu et.al. | 2512.24064 | translate | read | null |
| 2025-12-29 | Wireless Multimodal Foundation Model (WMFM): Integrating Vision and Communication Modalities for 6G ISAC Systems | Mohammad Farzanullah et.al. | 2512.23897 | translate | read | null |
| 2025-12-29 | Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition | Arman Martirosyan et.al. | 2512.23291 | translate | read | null |
| 2025-12-29 | Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism | Siyu Zhang et.al. | 2512.23243 | translate | read | null |
| 2025-12-28 | Fusion or Confusion? Multimodal Complexity Is Not All You Need | Tillmann Rheude et.al. | 2512.22991 | translate | read | null |
| 2025-12-28 | Embodied Robot Manipulation in the Era of Foundation Models: Planning and Learning Perspectives | Shuanghao Bai et.al. | 2512.22983 | translate | read | null |
| 2025-12-25 | TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References | Jiahong Yu et.al. | 2512.21641 | translate | read | null |
| 2025-12-24 | Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies | Jing Han et.al. | 2512.20938 | translate | read | null |
| 2025-12-23 | Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI | Muhammad Usman et.al. | 2512.20436 | translate | read | null |
| 2025-12-23 | Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems | YuChe Hsu et.al. | 2512.20387 | translate | read | null |
| 2025-12-23 | Retrieval-augmented Prompt Learning for Pre-trained Foundation Models | Xiang Chen et.al. | 2512.20145 | translate | read | null |
| 2025-12-22 | Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis | Argha Kamal Samanta et.al. | 2512.19663 | translate | read | null |
| 2025-12-22 | Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis | Xiaoming Zhang et.al. | 2512.19415 | translate | read | null |
| 2025-12-22 | OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation | Xueming Yan et.al. | 2512.19379 | translate | read | null |
| 2025-12-19 | STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting | Yifei Cheng et.al. | 2512.17667 | translate | read | null |
| 2025-12-19 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | Fengchun Liu et.al. | 2512.17621 | translate | read | null |
| 2025-12-18 | Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future | Tianshuai Hu et.al. | 2512.16760 | translate | read | null |
| 2025-12-18 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Kejun Liu et.al. | 2512.16485 | translate | read | null |
| 2025-12-17 | GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection | Yu Wang et.al. | 2512.15707 | translate | read | null |
| 2025-12-17 | An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain | João Daniel Silva et.al. | 2512.15531 | translate | read | null |
| 2025-12-16 | Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris | Wenshuo Li et.al. | 2512.14878 | translate | read | null |
| 2025-12-15 | STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning | Jie Qin et.al. | 2512.13752 | translate | read | null |
| 2025-12-15 | JoVA: Unified Multimodal Learning for Joint Video-Audio Generation | Xiaohu Huang et.al. | 2512.13677 | translate | read | null |
| 2025-12-15 | A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis | Xianchao Guan et.al. | 2512.13164 | translate | read | null |
| 2025-12-13 | EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography | Yuheng Li et.al. | 2512.12107 | translate | read | null |
| 2025-12-12 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Emanuel Sánchez Aimar et.al. | 2512.11490 | translate | read | null |
| 2025-12-12 | Exploring MLLM-Diffusion Information Transfer with MetaCanvas | Han Lin et.al. | 2512.11464 | translate | read | null |
| 2025-12-12 | AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities | Chenyiming Wen et.al. | 2512.11331 | translate | read | null |
| 2025-12-02 | Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems | Matvey Nepomnyaschiy et.al. | 2512.10975 | translate | read | null |
| 2025-12-11 | Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval | J. Xiao et.al. | 2512.10596 | translate | read | null |
| 2025-12-11 | Cross-modal Retrieval Models for Stripped Binary Analysis | Guoqiang Chen et.al. | 2512.10393 | translate | read | null |
| 2025-12-05 | What Happens When: Learning Temporal Orders of Events in Videos | Daechul Ahn et.al. | 2512.08979 | translate | read | null |
| 2025-12-09 | Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval | Tao Chen et.al. | 2512.08410 | translate | read | null |
| 2025-12-08 | CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification | Pingchuan Ma et.al. | 2512.08071 | translate | read | null |
| 2025-12-08 | Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation | Shihao Zhao et.al. | 2512.07747 | translate | read | null |
| 2025-12-08 | VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation | Md Selim Sarowar et.al. | 2512.07215 | translate | read | null |
| 2025-12-07 | A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations | Waleed Razzaq et.al. | 2512.06708 | translate | read | null |
| 2025-12-06 | Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion | Jaewon Ahn et.al. | 2512.06449 | translate | read | null |
| 2025-12-05 | Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures | Amirkia Rafiei Oskooei et.al. | 2512.05908 | translate | read | null |
| 2025-12-04 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Xianfeng Wu et.al. | 2512.05060 | translate | read | null |
| 2025-12-03 | Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation | Xiaosen Lyu et.al. | 2512.03521 | translate | read | null |
| 2025-12-03 | Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation | Xieji Li et.al. | 2512.03445 | translate | read | null |
| 2025-12-03 | Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features | Yuzhen Hu et.al. | 2512.03430 | translate | read | null |
| 2025-12-02 | Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation | Ziniu Zhang et.al. | 2512.02920 | translate | read | null |
| 2025-12-02 | Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education | Alvaro Becerra et.al. | 2512.02651 | translate | read | null |
| 2025-12-02 | Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources | Phuc Pham et.al. | 2512.02438 | translate | read | null |
(<a href=../Multimodal.md>back to Multimodal</a>)