Multimodal
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2025-12-18 | Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future | Tianshuai Hu et.al. | 2512.16760 | null |
| 2025-12-18 | Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors | Kejun Liu et.al. | 2512.16485 | null |
| 2025-12-17 | GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection | Yu Wang et.al. | 2512.15707 | null |
| 2025-12-17 | An Efficient and Effective Encoder Model for Vision and Language Tasks in the Remote Sensing Domain | João Daniel Silva et.al. | 2512.15531 | null |
| 2025-12-16 | Visual-textual Dermatoglyphic Animal Biometrics: A First Case Study on Panthera tigris | Wenshuo Li et.al. | 2512.14878 | null |
| 2025-12-15 | STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning | Jie Qin et.al. | 2512.13752 | null |
| 2025-12-15 | JoVA: Unified Multimodal Learning for Joint Video-Audio Generation | Xiaohu Huang et.al. | 2512.13677 | null |
| 2025-12-15 | A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis | Xianchao Guan et.al. | 2512.13164 | null |
| 2025-12-13 | EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography | Yuheng Li et.al. | 2512.12107 | null |
| 2025-12-12 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Emanuel Sánchez Aimar et.al. | 2512.11490 | null |
| 2025-12-12 | Exploring MLLM-Diffusion Information Transfer with MetaCanvas | Han Lin et.al. | 2512.11464 | null |
| 2025-12-12 | AMBER: An Adaptive Multimodal Mask Transformer for Beam Prediction with Missing Modalities | Chenyiming Wen et.al. | 2512.11331 | null |
| 2025-12-02 | Agent-Based Modular Learning for Multimodal Emotion Recognition in Human-Agent Systems | Matvey Nepomnyaschiy et.al. | 2512.10975 | null |
| 2025-12-11 | Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval | J. Xiao et.al. | 2512.10596 | null |
| 2025-12-11 | Cross-modal Retrieval Models for Stripped Binary Analysis | Guoqiang Chen et.al. | 2512.10393 | null |
| 2025-12-05 | What Happens When: Learning Temporal Orders of Events in Videos | Daechul Ahn et.al. | 2512.08979 | null |
| 2025-12-09 | Towards Effective and Efficient Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval | Tao Chen et.al. | 2512.08410 | null |
| 2025-12-08 | CAMO: Causality-Guided Adversarial Multimodal Domain Generalization for Crisis Classification | Pingchuan Ma et.al. | 2512.08071 | null |
| 2025-12-08 | Unison: A Fully Automatic, Task-Universal, and Low-Cost Framework for Unified Understanding and Generation | Shihao Zhao et.al. | 2512.07747 | null |
| 2025-12-08 | VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation | Md Selim Sarowar et.al. | 2512.07215 | null |
| 2025-12-07 | A Novel Multimodal RUL Framework for Remaining Useful Life Estimation with Layer-wise Explanations | Waleed Razzaq et.al. | 2512.06708 | null |
| 2025-12-06 | Enhancing Medical Cross-Modal Hashing Retrieval using Dropout-Voting Mixture-of-Experts Fusion | Jaewon Ahn et.al. | 2512.06449 | null |
| 2025-12-05 | Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures | Amirkia Rafiei Oskooei et.al. | 2512.05908 | null |
| 2025-12-04 | 4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer | Xianfeng Wu et.al. | 2512.05060 | null |
| 2025-12-03 | Cross-Space Synergy: A Unified Framework for Multimodal Emotion Recognition in Conversation | Xiaosen Lyu et.al. | 2512.03521 | null |
| 2025-12-03 | Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation | Xieji Li et.al. | 2512.03445 | null |
| 2025-12-03 | Label-Efficient Hyperspectral Image Classification via Spectral FiLM Modulation of Low-Level Pretrained Diffusion Features | Yuzhen Hu et.al. | 2512.03430 | null |
| 2025-12-02 | Learning Multimodal Embeddings for Traffic Accident Prediction and Causal Estimation | Ziniu Zhang et.al. | 2512.02920 | null |
| 2025-12-02 | Real-Time Multimodal Data Collection Using Smartwatches and Its Visualization in Education | Alvaro Becerra et.al. | 2512.02651 | null |
| 2025-12-02 | Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources | Phuc Pham et.al. | 2512.02438 | null |
| 2025-11-30 | MM-ACT: Learn from Multimodal Parallel Generation to Act | Haotian Liang et.al. | 2512.00975 | null |
| 2025-11-29 | Describe Anything Anywhere At Any Moment | Nicolas Gorlo et.al. | 2512.00565 | null |
| 2025-11-29 | CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA | Vsevolod Kovalev et.al. | 2512.00360 | null |
| 2025-11-28 | Buffer replay enhances the robustness of multimodal learning under missing-modality | Hongye Zhu et.al. | 2511.23070 | null |
| 2025-11-27 | Orthogonal Disentanglement with Projected Feature Alignment for Multimodal Emotion Recognition in Conversation | Xinyi Che et.al. | 2511.22463 | null |
| 2025-11-27 | Angle-Optimized Partial Disentanglement for Multimodal Emotion Recognition in Conversation | Xinyi Che et.al. | 2511.22447 | null |
| 2025-11-27 | Bridging the Modality Gap by Similarity Standardization with Pseudo-Positive Samples | Shuhei Yamashita et.al. | 2511.22141 | null |
| 2025-11-26 | WalkCLIP: Multimodal Learning for Urban Walkability Prediction | Shilong Xiang et.al. | 2511.21947 | null |
| 2025-11-26 | Evaluating Strategies for Synthesizing Clinical Notes for Medical Multimodal AI | Niccolo Marini et.al. | 2511.21827 | null |
| 2025-11-26 | Learning Cell-Aware Hierarchical Multi-Modal Representations for Robust Molecular Modeling | Mengran Li et.al. | 2511.21120 | null |
| 2025-11-25 | A review on data fusion in multimodal learning analytics and educational data mining | Wilson Chango et.al. | 2511.20871 | null |
| 2025-11-25 | VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning | Bo Pang et.al. | 2511.20422 | null |
| 2025-11-25 | MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts | Zilong Huang et.al. | 2511.20415 | null |
| 2025-11-25 | ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis | Advik Sinha et.al. | 2511.20274 | null |
| 2025-11-24 | Medusa: Cross-Modal Transferable Adversarial Attacks on Multimodal Medical Retrieval-Augmented Generation | Yingjia Shang et.al. | 2511.19257 | null |
| 2025-11-24 | IDSplat: Instance-Decomposed 3D Gaussian Splatting for Driving Scenes | Carl Lindström et.al. | 2511.19235 | null |
| 2025-11-24 | Can Modern Vision Models Understand the Difference Between an Object and a Look-alike? | Itay Cohen et.al. | 2511.19200 | null |
| 2025-11-23 | Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion | Haidong Kang et.al. | 2511.18516 | null |
| 2025-11-22 | Vulnerability-Aware Robust Multimodal Adversarial Training | Junrui Zhang et.al. | 2511.18138 | null |
| 2025-11-22 | Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning | Xiaohong Liu et.al. | 2511.18104 | null |
| 2025-11-17 | Reconstruction-Driven Multimodal Representation Learning for Automated Media Understanding | Yassir Benhammou et.al. | 2511.17596 | null |
| 2025-11-21 | MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment | Huangbiao Xu et.al. | 2511.17397 | null |
| 2025-11-21 | UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation | Chi Zhang et.al. | 2511.16917 | null |
| 2025-11-20 | LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs | Doriand Petit et.al. | 2511.16454 | null |
| 2025-11-20 | Boosting Medical Visual Understanding From Multi-Granular Language Learning | Zihan Li et.al. | 2511.15943 | null |
| 2025-11-18 | Uncertainty-Resilient Multimodal Learning via Consistency-Guided Cross-Modal Transfer | Hyo-Jeong Jang et.al. | 2511.15741 | null |
| 2025-11-19 | SIGMMA: Hierarchical Graph-Based Multi-Scale Multi-modal Contrastive Alignment of Histopathology Image and Spatial Transcriptome | Dabin Jeong et.al. | 2511.15464 | null |
| 2025-11-19 | Reflexive Evidence-Based Multimodal Learning for Clean Energy Transitions: Causal Insights on Cooking Fuel Access, Urbanization, and Carbon Emissions | Shan Shan et.al. | 2511.15342 | null |
| 2025-11-19 | Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval | Qing Wang et.al. | 2511.15201 | null |
| 2025-11-19 | TiCAL: Typicality-Based Consistency-Aware Learning for Multimodal Emotion Recognition | Wen Yin et.al. | 2511.15085 | null |
| 2025-11-18 | Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion | Zanxu Wang et.al. | 2511.14969 | null |
| 2025-11-18 | Toward Robust and Harmonious Adaptation for Cross-modal Retrieval | Haobin Li et.al. | 2511.14416 | null |
| 2025-11-18 | Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation | Weimin Bai et.al. | 2511.14271 | null |
| 2025-11-18 | Online Data Curation for Object Detection via Marginal Contributions to Dataset-level Average Precision | Zitang Sun et.al. | 2511.14197 | null |
| 2025-11-14 | Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement | Zhe Yang et.al. | 2511.13755 | null |
| 2025-11-17 | 3DAlign-DAER: Dynamic Attention Policy and Efficient Retrieval Strategy for Fine-grained 3D-Text Alignment at Scale | Yijia Fan et.al. | 2511.13211 | null |
| 2025-11-17 | uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data | Dahyun Chung et.al. | 2511.13036 | null |
| 2025-11-17 | Angular Gradient Sign Method: Uncovering Vulnerabilities in Hyperbolic Networks | Minsoo Jo et.al. | 2511.12985 | null |
| 2025-11-15 | To Align or Not to Align: Strategic Multimodal Representation Alignment for Optimal Performance | Wanlong Fang et.al. | 2511.12121 | null |
| 2025-11-14 | Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification | Qinghao Gao et.al. | 2511.11460 | null |
| 2025-11-14 | AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery | Yuqi Yin et.al. | 2511.11257 | null |
| 2025-11-14 | LEMUR: Large scale End-to-end MUltimodal Recommendation | Xintian Han et.al. | 2511.10962 | null |
| 2025-11-14 | MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition | Feng Li et.al. | 2511.10892 | null |
| 2025-11-13 | Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals | Shruti Singh Baghel et.al. | 2511.10615 | null |
| 2025-11-13 | URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding | Yongxin Shi et.al. | 2511.10552 | null |
| 2025-11-13 | GEA: Generation-Enhanced Alignment for Text-to-Image Person Retrieval | Hao Zou et.al. | 2511.10154 | null |
| 2025-11-13 | Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction | Mingda Jia et.al. | 2511.10134 | null |
| 2025-11-13 | Towards Robust Multimodal Learning in the Open World | Fushuo Huo et.al. | 2511.09989 | null |
| 2025-11-12 | Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard | Stelios Zarifis et.al. | 2511.09727 | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | null |
| 2025-11-11 | Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding | Da Li et.al. | 2511.08480 | null |
| 2025-11-11 | Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation | Jun Sun et.al. | 2511.08152 | null |
| 2025-11-11 | Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval | Likang Peng et.al. | 2511.07780 | null |
| 2025-11-11 | Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling | Jiale Liu et.al. | 2511.07710 | null |
| 2025-11-10 | A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation | Kamand Kalashi et.al. | 2511.07573 | null |
| 2025-11-10 | Integrating Epigenetic and Phenotypic Features for Biological Age Estimation in Cancer Patients via Multimodal Learning | Shuyue Jiang et.al. | 2511.07219 | null |
| 2025-11-10 | Med-SORA: Symptom to Organ Reasoning in Abdomen CT Images | You-Kyoung Na et.al. | 2511.06752 | null |
| 2025-11-09 | LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval | Jian Zhang et.al. | 2511.06268 | null |
| 2025-11-09 | VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving | Ruifei Zhang et.al. | 2511.06256 | null |
| 2025-11-09 | AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving | Ruifei Zhang et.al. | 2511.06253 | null |
| 2025-11-08 | Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models | Akshar Tumu et.al. | 2511.06146 | null |
| 2025-11-04 | Fine-Tuning Vision-Language Models for Multimodal Polymer Property Prediction | An Vuong et.al. | 2511.05577 | null |
| 2025-11-06 | DINOv2 Driven Gait Representation Learning for Video-Based Visible-Infrared Person Re-identification | Yujie Yang et.al. | 2511.04281 | null |
| 2025-11-05 | Cross-Modal Alignment via Variational Copula Modelling | Feng Wu et.al. | 2511.03196 | null |
| 2025-11-04 | SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment | Wenbo Lu et.al. | 2511.03019 | null |
| 2025-11-04 | ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology | Srikumar Sastry et.al. | 2511.02946 | null |
| 2025-11-04 | When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning | Chenyu Zhang et.al. | 2511.02794 | null |
| 2025-11-03 | OmniFuser: Adaptive Multimodal Fusion for Service-Oriented Predictive Maintenance | Ziqi Wang et.al. | 2511.01320 | null |
| 2025-11-02 | Balanced Multimodal Learning via Mutual Information | Rongrong Xie et.al. | 2511.00987 | null |
| 2025-11-01 | LIR: The First Workshop on Late Interaction and Multi Vector Retrieval @ ECIR 2026 | Benjamin Clavié et.al. | 2511.00444 | null |
| 2025-11-01 | Federated Dialogue-Semantic Diffusion for Emotion Recognition under Incomplete Modalities | Xihang Qiu et.al. | 2511.00344 | null |
| 2025-10-24 | Multimodal Detection of Fake Reviews using BERT and ResNet-50 | Suhasnadh Reddy Veluru et.al. | 2511.00020 | null |
| 2025-10-04 | Multimodal Learning with Augmentation Techniques for Natural Disaster Assessment | Adrian-Dinu Urse et.al. | 2511.00004 | null |
| 2025-10-31 | MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data | Yu-Chen Kuo et.al. | 2510.27321 | null |
| 2025-10-30 | Evaluating Perspectival Biases in Cross-Modal Retrieval | Teerapol Saengsukhiran et.al. | 2510.26861 | null |
| 2025-10-30 | Contribution-Guided Asymmetric Learning for Robust Multimodal Fusion under Imbalance and Noise | Zijing Xu et.al. | 2510.26289 | null |
| 2025-10-29 | Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start | Kun Chen et.al. | 2510.25801 | null |
| 2025-10-29 | LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation | Yang Miao et.al. | 2510.25263 | null |
| 2025-10-29 | H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts | Peilin Tan et.al. | 2510.25091 | null |
| 2025-10-29 | Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments | Manjunath Prasad Holenarasipura Rajiv et.al. | 2510.25070 | null |
| 2025-10-28 | Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning | Hossein R. Nowdeh et.al. | 2510.24919 | null |
| 2025-10-28 | MCIHN: A Hybrid Network Model Based on Multi-path Cross-modal Interaction for Multimodal Emotion Recognition | Haoyang Zhang et.al. | 2510.24827 | null |
| 2025-10-24 | Towards Fine-Grained Human Motion Video Captioning | Guorui Song et.al. | 2510.24767 | null |
| 2025-10-27 | Toward Clinically Grounded Foundation Models in Pathology | Hamid R. Tizhoosh et.al. | 2510.23807 | null |
| 2025-10-27 | Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier | Hyeongseop Rha et.al. | 2510.23506 | null |
| 2025-10-27 | Evaluation of Vision-LLMs in Surveillance Video | Pascal Benschop et.al. | 2510.23190 | null |
| 2025-10-21 | Unifying Inductive, Cross-Domain, and Multimodal Learning for Robust and Generalizable Recommendation | Chanyoung Chung et.al. | 2510.21812 | null |
| 2025-10-07 | Avi: Action from Volumetric Inference | Harris Song et.al. | 2510.21746 | null |
| 2025-10-24 | CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis | Yiming Tang et.al. | 2510.21464 | null |
| 2025-10-24 | Bridging the gap to real-world language-grounded visual concept learning | Whie Jung et.al. | 2510.21412 | null |
| 2025-10-23 | Multimodal Negative Learning | Baoquan Gong et.al. | 2510.20877 | null |
| 2025-10-23 | Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process | Tsai Hor Chan et.al. | 2510.20736 | null |
| 2025-10-23 | Calibrating Multimodal Consensus for Emotion Recognition | Guowei Zhong et.al. | 2510.20256 | null |
| 2025-10-22 | Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment | Yuhang Liu et.al. | 2510.19384 | null |
| 2025-10-22 | FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation | Chirag Padubidri et.al. | 2510.19305 | null |
| 2025-10-21 | Grounding or Guessing? Visual Signals for Detecting Hallucinations in Sign Language Translation | Yasser Hamidullah et.al. | 2510.18439 | null |
| 2025-10-20 | Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware | Stavros Mitsis et.al. | 2510.18036 | null |
| 2025-10-20 | MILES: Modality-Informed Learning Rate Scheduler for Balancing Multimodal Learning | Alejandro Guerra-Manzanares et.al. | 2510.17394 | null |
| 2025-10-19 | Graph4MM: Weaving Multimodal Learning with Structural Information | Xuying Ning et.al. | 2510.16990 | null |
| 2025-10-19 | ProtoMol: Enhancing Molecular Property Prediction via Prototype-Guided Multimodal Learning | Yingxu Wang et.al. | 2510.16824 | null |
| 2025-10-19 | Pursuing Minimal Sufficiency in Spatial Reasoning | Yejie Guo et.al. | 2510.16688 | null |
| 2025-10-18 | Safire: Similarity Framework for Visualization Retrieval | Huyen N. Nguyen et.al. | 2510.16662 | null |
| 2025-10-18 | Structured Interfaces for Automated Reasoning with 3D Scene Graphs | Aaron Ray et.al. | 2510.16643 | null |
| 2025-10-09 | Lyapunov-Stable Adaptive Control for Multimodal Concept Drift | Tianyu Bell Pan et.al. | 2510.15944 | null |
| 2025-10-17 | Towards Relaxed Multimodal Inputs for Gait-based Parkinson’s Disease Assessment | Minlin Zeng et.al. | 2510.15748 | null |
| 2025-10-16 | From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance | Zhe Li et.al. | 2510.14952 | null |
| 2025-10-16 | Revisit Modality Imbalance at the Decision Layer | Xiaoyu Ma et.al. | 2510.14411 | null |
| 2025-10-15 | A Multimodal Approach to Heritage Preservation in the Context of Climate Change | David Roqui et.al. | 2510.14136 | null |
| 2025-10-15 | Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation | Jiamin Chen et.al. | 2510.13191 | null |
| 2025-10-15 | Information-Theoretic Criteria for Knowledge Distillation in Multimodal Learning | Rongrong Xie et.al. | 2510.13182 | null |
| 2025-10-14 | A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation | Shurong Chai et.al. | 2510.12482 | null |
| 2025-10-14 | Ground Stratification for a Logic of Definitions with Induction | Nathan Guermond et.al. | 2510.12297 | null |
| 2025-10-14 | IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation | Wenxu Zhou et.al. | 2510.12095 | null |
| 2025-10-13 | Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis | Blessing Agyei Kyem et.al. | 2510.11907 | null |
| 2025-10-10 | Cattle-CLIP: A Multimodal Framework for Cattle Behaviour Recognition | Huimin Liu et.al. | 2510.09203 | null |
| 2025-10-09 | Provably Robust Adaptation for Language-Empowered Foundation Models | Yuni Lai et.al. | 2510.08659 | null |
| 2025-10-07 | Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations | Yu Liu et.al. | 2510.08606 | null |
| 2025-10-09 | Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling | Bianca-Mihaela Ganescu et.al. | 2510.08470 | link |
| 2025-10-08 | FLEET: Formal Language-Grounded Scheduling for Heterogeneous Robot Teams | Corban Rivera et.al. | 2510.07417 | null |
| 2025-09-30 | MultiFair: Multimodal Balanced Fairness-Aware Medical Classification with Dual-Level Gradient Modulation | Md Zubair et.al. | 2510.07328 | null |
| 2025-10-08 | TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation | Jiaben Chen et.al. | 2510.07249 | null |
| 2025-10-08 | Expressive and Scalable Quantum Fusion for Multimodal Learning | Tuyen Nguyen et.al. | 2510.06938 | null |
| 2025-10-07 | Deforming Videos to Masks: Flow Matching for Referring Video Segmentation | Zanyi Wang et.al. | 2510.06139 | link |
| 2025-10-04 | Towards Unsupervised Speech Recognition at the Syllable-Level | Liming Wang et.al. | 2510.03639 | null |
| 2025-09-25 | Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data | Jiancheng Zhang et.al. | 2510.03247 | null |
| 2025-10-02 | Latency-aware Multimodal Federated Learning over UAV Networks | Shaba Shaon et.al. | 2510.01717 | null |
| 2025-10-01 | PhraseStereo: The First Open-Vocabulary Stereo Image Segmentation Dataset | Thomas Campagnolo et.al. | 2510.00818 | null |
| 2025-09-30 | MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning | Seong-Hyeon Hwang et.al. | 2509.25831 | null |
| 2025-09-29 | FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology | Faizan Farooq Khan et.al. | 2509.25564 | null |
| 2025-09-29 | MAESTRO: Adaptive Sparse Attention and Robust Learning for Multimodal Dynamic Time Series | Payal Mohapatra et.al. | 2509.25278 | null |
| 2025-09-29 | A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity | Giordano Cicchetti et.al. | 2509.24734 | null |
| 2025-09-29 | Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey | Yuntao Shou et.al. | 2509.24322 | link |
| 2025-09-28 | Contrastive Learning Enhances Language Model Based Cell Embeddings for Low-Sample Single Cell Transcriptomics | Luxuan Zhang et.al. | 2509.23543 | null |
| 2025-09-26 | RefAM: Attention Magnets for Zero-Shot Referral Segmentation | Anna Kukleva et.al. | 2509.22650 | null |
| 2025-09-26 | HELIOS: Hierarchical Exploration for Language-grounded Interaction in Open Scenes | Katrina Ashton et.al. | 2509.22498 | null |
| 2025-09-26 | From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment | Ke Ye et.al. | 2509.22205 | null |
| 2025-09-26 | VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation | Huayi Zhou et.al. | 2509.21723 | null |
| 2025-09-14 | LibEMER: A novel benchmark and algorithms library for EEG-based Multimodal Emotion Recognition | Zejun Liu et.al. | 2509.19330 | null |
| 2025-09-10 | Advancing Few-Shot Pediatric Arrhythmia Classification with a Novel Contrastive Loss and Multimodal Learning | Yiqiao Chen et.al. | 2509.19315 | null |
| 2025-09-23 | Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation | Christian Ganhör et.al. | 2509.18807 | null |
| 2025-09-23 | M4SER: Multimodal, Multirepresentation, Multitask, and Multistrategy Learning for Speech Emotion Recognition | Jiajun He et.al. | 2509.18706 | null |
| 2025-09-22 | Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction | Yi Gu et.al. | 2509.18284 | null |
| 2025-09-22 | ClassMind: Scaling Classroom Observation and Instructional Feedback with Multimodal AI | Ao Qu et.al. | 2509.18020 | null |
| 2025-09-22 | M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer | Yanxin Zhang et.al. | 2509.18005 | null |
| 2025-09-22 | Trainee Action Recognition through Interaction Analysis in CCATT Mixed-Reality Training | Divya Mereddy et.al. | 2509.17888 | null |
| 2025-09-20 | Self-organized epithelial reticulum inhibits cell proliferation | Liav Daraf et.al. | 2509.16661 | null |
| 2025-09-19 | Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation | Weimin Bai et.al. | 2509.15772 | null |
| 2025-09-19 | Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion | Shanghong Li et.al. | 2509.15578 | null |
| 2025-09-19 | Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues | Wei Chen et.al. | 2509.15540 | null |
| 2025-09-17 | Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays | Hanbin Ko et.al. | 2509.15234 | null |
| 2025-09-17 | VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI | Daiqi Liu et.al. | 2509.13767 | null |
| 2025-09-15 | Evaluating Robustness of Vision-Language Models Under Noisy Conditions | Purushoth et.al. | 2509.12492 | null |
| 2025-09-15 | OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling | Yang Zhou et.al. | 2509.12201 | link |
| 2025-09-15 | Enriched text-guided variational multimodal knowledge distillation network (VMD) for automated diagnosis of plaque vulnerability in 3D carotid artery MRI | Bo Cao et.al. | 2509.11924 | null |
| 2025-09-14 | GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration | Wan Xu et.al. | 2509.11360 | null |
| 2025-09-14 | DMLDroid: Deep Multimodal Fusion Framework for Android Malware Detection with Resilience to Code Obfuscation and Adversarial Perturbations | Doan Minh Trung et.al. | 2509.11187 | null |
| 2025-09-14 | Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation | Nhi Kieu et.al. | 2509.11102 | null |
| 2025-09-13 | Why Bonds Fail Differently? Explainable Multimodal Learning for Multi-Class Default Prediction | Yi Lu et.al. | 2509.10802 | null |
| 2025-09-11 | Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training | Anthony P. Addison et.al. | 2509.09290 | null |
| 2025-09-09 | Enhancing Online Learning by Integrating Biosensors and Multimodal Learning Analytics for Detecting and Predicting Student Behavior: A Review | Alvaro Becerra et.al. | 2509.07742 | null |
| 2025-09-08 | Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding | Jiangnan Xie et.al. | 2509.06291 | null |
| 2025-09-06 | GraMFedDHAR: Graph Based Multimodal Differentially Private Federated HAR | Labani Halder et.al. | 2509.05671 | null |
| 2025-09-06 | Causal Debiasing Medical Multimodal Representation Learning with Missing Modalities | Xiaoguang Zhu et.al. | 2509.05615 | null |
| 2025-09-04 | Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models | Kimia Ehsani et.al. | 2509.03837 | null |
| 2025-09-03 | Designing Gaze Analytics for ELA Instruction: A User-Centered Dashboard with Conversational AI Support | Eduardo Davalos et.al. | 2509.03741 | null |
| 2025-09-03 | Robult: Leveraging Redundancy and Modality Specific Features for Robust Multimodal Learning | Duy A. Nguyen et.al. | 2509.03477 | null |
| 2025-09-03 | Multimodal learning of melt pool dynamics in laser powder bed fusion | Satyajit Mojumder et.al. | 2509.03029 | null |
| 2025-09-03 | Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability | Shuai Jiang et.al. | 2509.02962 | null |
| 2025-09-02 | Language-Guided Long Horizon Manipulation with LLM-based Planning and Visual Perception | Changshi Zhou et.al. | 2509.02324 | null |
| 2025-09-02 | Balanced Multimodal Learning: An Unidirectional Dynamic Interaction Perspective | Shijie Wang et.al. | 2509.02281 | null |
| 2025-09-02 | Content and Engagement Trends in COVID-19 YouTube Videos: Evidence from the Late Pandemic | Nirmalya Thakur et.al. | 2509.01954 | null |
| 2025-09-01 | OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning | Yanqing Liu et.al. | 2509.01644 | link |
| 2025-09-01 | Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement | Jiayi Gao et.al. | 2509.01362 | null |
| 2025-08-29 | Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer | Daniël Boeke et.al. | 2508.21581 | null |
| 2025-08-27 | Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement | Mohammed Rakibul Hasan et.al. | 2508.19887 | null |
| 2025-08-27 | AIM: Adaptive Intra-Network Modulation for Balanced Multimodal Learning | Shu Shen et.al. | 2508.19769 | null |
| 2025-08-25 | BTW: A Non-Parametric Variance Stabilization Framework for Multimodal Model Integration | Jun Hou et.al. | 2508.18551 | null |
| 2025-08-22 | Can VLMs Recall Factual Associations From Visual References? | Dhananjay Ashok et.al. | 2508.18297 | null |
| 2025-08-20 | Human-like Content Analysis for Generative AI with Language-Grounded Sparse Encoders | Yiming Tang et.al. | 2508.18236 | null |
| 2025-08-24 | Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice | Hugo Bohy et.al. | 2508.17502 | link |
| 2025-08-24 | Multimodal Representation Learning Conditioned on Semantic Relations | Yang Qiao et.al. | 2508.17497 | null |
| 2025-08-24 | SEER-VAR: Semantic Egocentric Environment Reasoner for Vehicle Augmented Reality | Yuzhi Lai et.al. | 2508.17255 | null |
| 2025-08-10 | An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance | Hsuan-Kung Yang et.al. | 2508.16602 | null |
| 2025-08-22 | Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization | Yupei Zhang et.al. | 2508.16479 | null |
| 2025-08-22 | A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension | Mohammad Zia Ur Rehman et.al. | 2508.16300 | null |
| 2025-08-21 | Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation | Huy Hoang Nguyen et.al. | 2508.15427 | null |
| 2025-08-21 | DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding | Zhu Wang et.al. | 2508.15297 | null |
| 2025-08-20 | MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs | Ruyi Ding et.al. | 2508.15036 | null |
| 2025-08-19 | Beyond Simple Edits: Composed Video Retrieval with Dense Modifications | Omkar Thawakar et.al. | 2508.14039 | link |
| 2025-08-19 | CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter | Junyeong Park et.al. | 2508.13530 | null |
| 2025-08-19 | CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models | Catherine Glossop et.al. | 2508.13446 | null |
| 2025-08-18 | SPANER: Shared Prompt Aligner for Multimodal Semantic Representation | Thye Shan Ng et.al. | 2508.13387 | null |
| 2025-08-18 | Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation | Tanjim Islam Riju et.al. | 2508.13068 | null |
| 2025-08-17 | Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping | Xuhui Zhan et.al. | 2508.12466 | link |
| 2025-08-16 | MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization | Haochen You et.al. | 2508.12149 | null |
| 2025-08-16 | ExploreVLM: Closed-Loop Robot Exploration Task Planning with Vision-Language Models | Zhichen Lou et.al. | 2508.11918 | null |
| 2025-08-13 | MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning | Thanh-Dat Truong et.al. | 2508.10133 | null |
| 2025-08-13 | Empowering Morphing Attack Detection using Interpretable Image-Text Foundation Model | Sushrut Patwardhan et.al. | 2508.10110 | null |
| 2025-08-12 | LPGNet: A Lightweight Network with Parallel Attention and Gated Fusion for Multimodal Emotion Recognition | Zhining He et.al. | 2508.08925 | null |
| 2025-08-12 | Multimodal learning enables instant ionizing radiation alerts on unmodified mobile phones for real-world emergency response | Yanfeng Xie et.al. | 2508.08541 | null |
| 2025-08-11 | BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models | Maozhen Zhang et.al. | 2508.08040 | null |
| 2025-08-11 | A Trustworthy Method for Multimodal Emotion Recognition | Junxiao Xue et.al. | 2508.07625 | null |
| 2025-08-10 | Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding | Zhaoyu Chen et.al. | 2508.07388 | null |
| 2025-08-10 | FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning | Van Duc Cuong et.al. | 2508.07264 | null |
| 2025-08-09 | Can Multitask Learning Enhance Model Explainability? | Hiba Najjar et.al. | 2508.06966 | null |
| 2025-08-09 | Intrinsic Explainability of Multimodal Learning for Crop Yield Prediction | Hiba Najjar et.al. | 2508.06939 | null |
| 2025-08-09 | Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities | Rui Liu et.al. | 2508.06800 | null |
| 2025-08-08 | Early Detection of Pancreatic Cancer Using Multimodal Learning on Electronic Health Records | Mosbah Aouad et.al. | 2508.06627 | null |
| 2025-08-07 | Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features | Manish Kansana et.al. | 2508.06566 | null |
| 2025-08-06 | Grounding Emotion Recognition with Visual Prototypes: VEGA – Revisiting CLIP in MERC | Guanyu Hu et.al. | 2508.06564 | null |
| 2025-08-08 | Text as Any-Modality for Zero-Shot Classification by Consistent Prompt Tuning | Xiangyu Wu et.al. | 2508.06382 | null |
| 2025-08-08 | ECMF: Enhanced Cross-Modal Fusion for Multimodal Emotion Recognition in MER-SEMI Challenge | Juewen Hu et.al. | 2508.05991 | null |
| 2025-08-07 | Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning | Luai Abuelsamen et.al. | 2508.05077 | null |
| 2025-08-07 | MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding | Weifan Zhang et.al. | 2508.05021 | null |
| 2025-08-06 | Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models | Md Raisul Kibria et.al. | 2508.04427 | null |
| 2025-08-06 | Length Matters: Length-Aware Transformer for Temporal Sentence Grounding | Yifan Wang et.al. | 2508.04299 | null |
| 2025-08-06 | SVC 2025: the First Multimodal Deception Detection Challenge | Xun Lin et.al. | 2508.04129 | null |
| 2025-07-29 | Multimodal Video Emotion Recognition with Reliable Reasoning Priors | Zhepeng Wang et.al. | 2508.03722 | null |
| 2025-08-05 | T2UE: Generating Unlearnable Examples from Text Descriptions | Xingjun Ma et.al. | 2508.03091 | null |
| 2025-08-04 | MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming | Shuo Wang et.al. | 2508.02549 | null |
| 2025-08-04 | Hierarchical MoE: Continuous Multimodal Emotion Recognition with Incomplete and Asynchronous Inputs | Yitong Zhu et.al. | 2508.02133 | null |
| 2025-08-04 | “Harmless to You, Hurtful to Me!”: Investigating the Detection of Toxic Languages Grounded in the Perspective of Youth | Yaqiong Li et.al. | 2508.02094 | null |
| 2025-08-03 | DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition | Peiyuan Jiang et.al. | 2508.01644 | null |
| 2025-08-02 | A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics | Rushin H. Gindra et.al. | 2508.01490 | null |
| 2025-08-02 | AffectGPT-R1: Leveraging Reinforcement Learning for Open-Vocabulary Emotion Recognition | Zheng Lian et.al. | 2508.01318 | null |
| 2025-07-29 | SmartCLIP: Modular Vision-language Alignment with Identification Guarantees | Shaoan Xie et.al. | 2507.22264 | null |
| 2025-07-29 | MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces | Shaojun E et.al. | 2507.21741 | link |
| 2025-07-29 | Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion | Zeyu Deng et.al. | 2507.21395 | null |
| 2025-07-28 | On the Limits of Hierarchically Embedded Logic in Classical Neural Networks | Bill Cochran et.al. | 2507.20960 | null |
| 2025-07-28 | TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model | Ao Li et.al. | 2507.20630 | null |
| 2025-07-25 | Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization | Hsuan-Yu Wang et.al. | 2507.19356 | null |
| 2025-07-25 | SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality | Sijie Li et.al. | 2507.19264 | null |
| 2025-07-24 | Deep Learning for Blood-Brain Barrier Permeability Prediction | Zihan Yang et.al. | 2507.18557 | null |
| 2025-07-23 | RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding | Xi Xiao et.al. | 2507.17353 | null |
| 2025-07-22 | VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings | Ramin Giahi et.al. | 2507.17080 | null |
| 2025-07-20 | TD-Interpreter: Enhancing the Understanding of Timing Diagrams with Visual-Language Learning | Jie He et.al. | 2507.16844 | null |
| 2025-07-21 | Applying multimodal learning to Classify transient Detections Early (AppleCiDEr) I: Data set, methods, and infrastructure | Alexandra Junell et.al. | 2507.16088 | null |
| 2025-07-21 | MEETI: A Multimodal ECG Dataset from MIMIC-IV-ECG with Signals, Images, Features and Interpretations | Deyun Zhang et.al. | 2507.15255 | null |
| 2025-07-20 | LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering | Xinxin Dong et.al. | 2507.14784 | null |
| 2025-07-18 | MaskHOI: Robust 3D Hand-Object Interaction Estimation via Masked Pre-training | Yuechen Xie et.al. | 2507.13673 | null |
| 2025-07-17 | City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning | Penglei Sun et.al. | 2507.12795 | null |
| 2025-07-17 | A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models | Weijieying Ren et.al. | 2507.12774 | null |
| 2025-07-15 | Partitioner Guided Modal Learning Framework | Guimin Hu et.al. | 2507.11661 | null |
| 2025-07-15 | A Robust Incomplete Multimodal Low-Rank Adaptation Approach for Emotion Recognition | Xinkui Zhao et.al. | 2507.11202 | null |
| 2025-07-14 | Ground-Compose-Reinforce: Tasking Reinforcement Learning Agents through Formal Language | Andrew C. Li et.al. | 2507.10741 | null |
| 2025-07-14 | Boosting Multimodal Learning via Disentangled Gradient Learning | Shicai Wei et.al. | 2507.10213 | null |
| 2025-07-21 | Improving Multimodal Learning via Imbalanced Learning | Shicai Wei et.al. | 2507.10203 | link |
| 2025-07-13 | HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space | Changli Wang et.al. | 2507.09487 | null |
| 2025-07-09 | Robust Multimodal Learning Framework For Intake Gesture Detection Using Contactless Radar and Wearable IMU Sensors | Chunzhuo Wang et.al. | 2507.07261 | null |
| 2025-07-09 | Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey | Getamesay Haile Dagnaw et.al. | 2507.07148 | null |
| 2025-07-08 | Enhancing Synthetic CT from CBCT via Multimodal Fusion and End-To-End Registration | Maximilian Tschuchnig et.al. | 2507.06067 | null |
| 2025-07-08 | Graph Learning | Feng Xia et.al. | 2507.05636 | null |
| 2025-07-07 | Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models | Eunseop Yoon et.al. | 2507.04976 | null |
| 2025-07-07 | From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach | Mihai Masala et.al. | 2507.04815 | null |
| 2025-07-07 | MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding | Zhicheng Zhang et.al. | 2507.04635 | null |
| 2025-07-10 | DMER-Ranker: Learning to Rank Emotion Descriptions in the Absence of Ground Truth | Zheng Lian et.al. | 2507.04278 | null |
| 2025-07-05 | Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation | Fernando Gabriela Garcia et.al. | 2507.04151 | null |
| 2025-07-03 | Intelligent Histology for Tumor Neurosurgery | Xinhai Hou et.al. | 2507.03037 | null |
| 2025-07-01 | Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers | Yusuf Shihata et.al. | 2507.02985 | null |
| 2025-07-02 | TAGF: Time-aware Gated Fusion for Multimodal Valence-Arousal Estimation | Yubeen Lee et.al. | 2507.02080 | null |
| 2025-06-27 | XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science | Jithendaraa Subramanian et.al. | 2507.01054 | null |
| 2025-06-27 | Test-Time Consistency in Vision Language Models | Shih-Han Chou et.al. | 2506.22395 | null |
| 2025-06-27 | Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems | Abdulmomen Ghalkha et.al. | 2506.22374 | null |
| 2025-06-26 | ImplicitQA: Going beyond frames towards Implicit Video Reasoning | Sirnam Swetha et.al. | 2506.21742 | link |
| 2025-06-28 | G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation | Mohammed Rakib et.al. | 2506.21514 | null |
| 2025-06-26 | V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling | Junwei You et.al. | 2506.21041 | null |
| 2025-06-26 | TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence | Feng Jiang et.al. | 2506.21028 | null |
| 2025-06-26 | Where is AIED Headed? Key Topics and Emerging Frontiers (2020-2024) | Shihui Feng et.al. | 2506.20971 | null |
| 2025-06-24 | Emergence of Text Readability in Vision Language Models | Jaeyoo Park et.al. | 2506.19389 | null |
| 2025-06-27 | Haptic-ACT – Pseudo Oocyte Manipulation by a Robot Using Multimodal Information and Action Chunking with Transformers | Pedro Miguel Uriguen Eljuri et.al. | 2506.18212 | null |
| 2025-06-21 | Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning? | Yuesheng Huang et.al. | 2506.17623 | null |
| 2025-06-24 | AI-based Multimodal Biometrics for Detecting Smartphone Distractions: Application to Online Learning | Alvaro Becerra et.al. | 2506.17364 | null |
| 2025-06-20 | With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You | Fabian Gröger et.al. | 2506.16895 | null |
| 2025-06-18 | A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion | Fangzhou Lin et.al. | 2506.15747 | null |
| 2025-06-18 | Foundation of Affective Computing and Interaction | Changzeng Fu et.al. | 2506.15497 | null |
| 2025-06-18 | video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Changli Tang et.al. | 2506.15220 | link |
| 2025-06-17 | Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? | Nitesh Subedi et.al. | 2506.14507 | link |
| 2025-06-16 | Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography | Yusdivia Molina-Román et.al. | 2506.13964 | null |
| 2025-06-16 | A Survey on World Models Grounded in Acoustic Physical Information | Xiaoliang Chen et.al. | 2506.13833 | link |
| 2025-06-16 | A Survey on Imitation Learning for Contact-Rich Tasks in Robotics | Toshiaki Tsuji et.al. | 2506.13498 | null |
| 2025-06-16 | Fatigue-Aware Adaptive Interfaces for Wearable Devices Using Deep Learning | Yikan Wang et.al. | 2506.13203 | null |
| 2025-06-15 | Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models | Liam Bennett et.al. | 2506.12733 | null |
| 2025-06-14 | Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics | Asifullah Khan et.al. | 2506.12365 | null |
| 2025-06-14 | GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition | Yuntao Shou et.al. | 2506.12325 | null |
| 2025-06-16 | Improving Multimodal Learning Balance and Sufficiency through Data Remixing | Xiaoyu Ma et.al. | 2506.11550 | link |
| 2025-06-13 | RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer | Haotian Ni et.al. | 2506.11465 | null |
| 2025-06-12 | Combining Log Data and Collaborative Dialogue Features to Predict Project Quality in Middle School AI Education | Conrad Borchers et.al. | 2506.11326 | null |
| 2025-06-12 | Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction | Thanathai Lertpetchpun et.al. | 2506.10930 | null |
| 2025-06-12 | Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts | Guowei Zhong et.al. | 2506.10452 | link |
| 2025-06-09 | Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance | Peilin Li et.al. | 2506.09071 | null |
| 2025-06-10 | Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment | Maximilian Tschuchnig et.al. | 2506.08716 | null |
| 2025-06-10 | MOSAIC-F: A Framework for Enhancing Students’ Oral Presentation Skills through Personalized Feedback | Alvaro Becerra et.al. | 2506.08634 | null |
| 2025-06-09 | Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs | Jared Strader et.al. | 2506.07454 | null |
| 2025-06-08 | A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning | Jiachen Zhong et.al. | 2506.07236 | null |
| 2025-06-08 | Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning | Tianyi Bai et.al. | 2506.07227 | null |
| 2025-06-08 | A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge | Tarique Dahri et.al. | 2506.07055 | null |
| 2025-06-06 | Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning | Sheng Chen et.al. | 2506.06205 | null |
| 2025-06-06 | Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization | Jonathan Yang et.al. | 2506.06196 | null |
| 2025-06-06 | MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory | Ana Carolina Condez et.al. | 2506.05696 | null |
| 2025-06-03 | Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation | Israa A. Albadarneh et.al. | 2506.05399 | null |
| 2025-06-05 | Towards Language-Augmented Multi-Agent Deep Reinforcement Learning | Maxime Toquebiau et.al. | 2506.05236 | null |
| 2025-06-05 | Quantifying Cross-Modality Memorization in Vision-Language Models | Yuxin Wen et.al. | 2506.05198 | null |
| 2025-06-05 | A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions | Anh Le et.al. | 2506.05061 | null |
| 2025-06-04 | EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation | Cheng Zhang et.al. | 2506.03652 | null |
| 2025-06-03 | Enriching Location Representation with Detailed Semantic Information | Junyuan Liu et.al. | 2506.02744 | null |
| 2025-06-02 | Entity Image and Mixed-Modal Image Retrieval Datasets | Cristian-Ioan Blaga et.al. | 2506.02291 | null |
| 2025-06-02 | Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities | Yanxi Luo et.al. | 2506.01490 | null |
| 2025-06-02 | Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark | Shuyu Yang et.al. | 2506.01466 | null |
| 2025-06-02 | Agentic Episodic Control | Xidong Yang et.al. | 2506.01442 | null |
| 2025-06-01 | Leveraging CLIP Encoder for Multimodal Emotion Recognition | Yehun Song et.al. | 2506.00903 | null |
| 2025-06-01 | GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints | Jiajun He et.al. | 2506.00865 | null |
| 2025-06-01 | TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning | Jiaqi Luo et.al. | 2506.00813 | null |
| 2025-05-30 | Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework | Can Polat et.al. | 2506.00302 | null |
| 2025-05-30 | Mixpert: Mitigating Multimodal Learning Conflicts with Efficient Mixture-of-Vision-Experts | Xin He et.al. | 2505.24541 | null |
| 2025-05-29 | Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition | Sean Foley et.al. | 2505.24059 | null |
| 2025-06-02 | Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles | Zifu Wang et.al. | 2505.23590 | link |
| 2025-05-29 | OmniEarth-Bench: Towards Holistic Evaluation of Earth’s Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data | Fengxiang Wang et.al. | 2505.23522 | null |
| 2025-05-29 | Bidirectional predictive coding | Gaspard Oliviers et.al. | 2505.23415 | null |
| 2025-05-29 | Deep Modeling and Optimization of Medical Image Classification | Yihang Wu et.al. | 2505.23040 | link |
| 2025-05-30 | EmotionTalk: An Interactive Chinese Multimodal Emotion Dataset With Rich Annotations | Haoqin Sun et.al. | 2505.23018 | link |
| 2025-05-27 | A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features | Ihab Bendidi et.al. | 2505.21317 | null |
| 2025-05-26 | Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects | Chengyan Wu et.al. | 2505.20511 | null |
| 2025-05-25 | PDFBench: A Benchmark for De novo Protein Design from Function | Jiahao Kuang et.al. | 2505.20346 | null |
| 2025-05-26 | Learning Optimal Multimodal Information Bottleneck Representations | Qilong Wu et.al. | 2505.19996 | null |
| 2025-05-26 | ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs | Pooneh Mousavi et.al. | 2505.19937 | null |
| 2025-05-26 | Multiplicity is an Inevitable and Inherent Challenge in Multimodal Learning | Sanghyuk Chun et.al. | 2505.19614 | null |
| 2025-05-26 | Rethinking Gating Mechanism in Sparse MoE: Handling Arbitrary Modality Inputs with Confidence-Guided Gate | Liangwei Nathan Zheng et.al. | 2505.19525 | null |
| 2025-05-25 | Where Paths Collide: A Comprehensive Survey of Classic and Learning-Based Multi-Agent Pathfinding | Shiyue Wang et.al. | 2505.19219 | null |
| 2025-05-25 | I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts | Jiayi Xin et.al. | 2505.19190 | link |
| 2025-05-23 | Segment Anyword: Mask Prompt Inversion for Open-Set Grounded Segmentation | Zhihua Liu et.al. | 2505.17994 | null |
| 2025-05-23 | HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning | Chuhao Zhou et.al. | 2505.17645 | null |
| 2025-05-23 | RoHyDR: Robust Hybrid Diffusion Recovery for Incomplete Multimodal Emotion Recognition | Yuehan Jin et.al. | 2505.17501 | null |
| 2025-05-21 | NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation | Weiming Wu et.al. | 2505.17121 | null |
| 2025-05-22 | ICYM2I: The illusion of multimodal informativeness under missingness | Young Sang Choi et.al. | 2505.16953 | link |
| 2025-05-22 | Grounding Chest X-Ray Visual Question Answering with Generated Radiology Reports | Francesco Dalla Serra et.al. | 2505.16624 | null |
| 2025-05-22 | Multimodal Online Federated Learning with Modality Missing in Internet of Things | Heqiang Wang et.al. | 2505.16138 | null |
| 2025-05-21 | Robust Multimodal Learning via Entropy-Gated Contrastive Fusion | Leon Chlon et.al. | 2505.15417 | null |
| 2025-05-21 | EndoVLA: Dual-Phase Vision-Language-Action Model for Autonomous Tracking in Endoscopy | Chi Kit Ng et.al. | 2505.15206 | null |
| 2025-05-21 | Graph Foundation Models: A Comprehensive Survey | Zehong Wang et.al. | 2505.15116 | link |
| 2025-05-19 | HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity | Xuejun Sun et.al. | 2505.14725 | link |
| 2025-05-20 | Spiking Neural Networks with Temporal Attention-Guided Adaptive Fusion for imbalanced Multi-modal Learning | Jiangrong Shen et.al. | 2505.14535 | null |
| 2025-05-20 | Multimodal Mixture of Low-Rank Experts for Sentiment Analysis and Emotion Recognition | Shuo Zhang et.al. | 2505.14143 | null |
| 2025-05-20 | LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts | Qifeng Cai et.al. | 2505.13928 | link |
| 2025-05-17 | Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering | Hessa Alawwad et.al. | 2505.13520 | null |
| 2025-05-19 | AdaToken-3D: Dynamic Spatial Gating for Efficient 3D Large Multimodal-Models Reasoning | Kai Zhang et.al. | 2505.12782 | null |
| 2025-05-19 | PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI | Yingchen He et.al. | 2505.12707 | null |
| 2025-05-17 | Understanding the Capabilities of Molecular Graph Neural Networks in Materials Science Through Multimodal Learning and Physical Context Encoding | Can Polat et.al. | 2505.12137 | null |
| 2025-05-17 | SafeVid: Toward Safety Aligned Video Large Multimodal Models | Yixu Wang et.al. | 2505.11926 | null |
| 2025-05-16 | GeoMM: On Geodesic Perspective for Multi-modal Learning | Shibin Mei et.al. | 2505.11216 | null |
| 2025-05-15 | Incorporating brain-inspired mechanisms for multimodal learning in artificial intelligence | Xiang He et.al. | 2505.10176 | link |
| 2025-05-14 | VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation | Chaofan Zhang et.al. | 2505.09577 | null |
| 2025-05-16 | Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora | Michael Majurski et.al. | 2505.08905 | link |
| 2025-05-13 | Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities | Jueqing Lu et.al. | 2505.08283 | null |
| 2025-05-11 | MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning | Lishan Yang et.al. | 2505.06911 | null |
| 2025-05-10 | Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning | H M Dipu Kabir et.al. | 2505.06592 | link |
| 2025-05-10 | TACFN: Transformer-based Adaptive Cross-modal Fusion Network for Multimodal Emotion Recognition | Feng Liu et.al. | 2505.06536 | link |
| 2025-05-09 | NSF-MAP: Neurosymbolic Multimodal Fusion for Robust and Interpretable Anomaly Prediction in Assembly Pipelines | Chathurangi Shyalika et.al. | 2505.06333 | link |
| 2025-05-09 | Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models | Jugal Gajjar et.al. | 2505.06110 | null |
| 2025-05-09 | Why Are You Wrong? Counterfactual Explanations for Language Grounding with 3D Objects | Tobias Preintner et.al. | 2505.06030 | link |
| 2025-05-08 | The Moon’s Many Faces: A Single Unified Transformer for Multimodal Lunar Reconstruction | Tom Sander et.al. | 2505.05644 | null |
| 2025-05-07 | OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning | Xianhang Li et.al. | 2505.04601 | null |
| 2025-05-02 | Mapping the Climate Change Landscape on TikTok | Alessia Galdeman et.al. | 2505.03813 | null |
| 2025-05-06 | Reinforced Correlation Between Vision and Language for Precise Medical AI Assistant | Haonan Wang et.al. | 2505.03380 | null |
| 2025-05-06 | A Vision-Language Model for Focal Liver Lesion Classification | Song Jian et.al. | 2505.03350 | null |
| 2025-05-06 | SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation | Yu-Ren Guo et.al. | 2505.03244 | null |
| 2025-05-05 | The Multimodal Paradox: How Added and Missing Modalities Shape Bias and Performance in Multimodal AI | Kishore Sampath et.al. | 2505.03020 | null |
| 2025-05-02 | Aggregation of Dependent Expert Distributions in Multimodal Variational Autoencoders | Rogelio A Mancisidor et.al. | 2505.01134 | null |
| 2025-04-30 | Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design | Vasudev Sharma et.al. | 2505.00134 | null |
| 2025-04-28 | DEEMO: De-identity Multimodal Emotion Recognition and Reasoning | Deng Li et.al. | 2504.19549 | null |
| 2025-04-27 | Platonic Grounding for Efficient Multimodal Language Models | Moulik Choraria et.al. | 2504.19327 | null |
| 2025-04-27 | DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning | Jialang Lu et.al. | 2504.19127 | null |
| 2025-04-23 | A multi-scale vision transformer-based multimodal GeoAI model for mapping Arctic permafrost thaw | Wenwen Li et.al. | 2504.17822 | null |
| 2025-04-23 | Monte Carlo Planning with Large Language Model for Text-Based Game Agents | Zijing Shi et.al. | 2504.16855 | null |
| 2025-04-23 | Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation | Lakshita Agarwal et.al. | 2504.16788 | null |
| 2025-04-23 | PsyCounAssist: A Full-Cycle AI-Powered Psychological Counseling Assistant System | Xianghe Liu et.al. | 2504.16573 | null |
| 2025-04-22 | CLIP-IT: CLIP-based Pairing for Histology Images Classification | Banafsheh Karimian et.al. | 2504.16181 | null |
| 2025-04-22 | SAGA: Semantic-Aware Gray color Augmentation for Visible-to-Thermal Domain Adaptation across Multi-View Drone and Ground-Based Vision Systems | Manjunath D et.al. | 2504.15728 | null |
| 2025-04-21 | Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | Guo Chen et.al. | 2504.15271 | null |
| 2025-04-21 | IoT-AMLHP: Aligned Multimodal Learning of Header-Payload Representations for Resource-Efficient Malicious IoT Traffic Classification | Fengyuan Nie et.al. | 2504.14833 | null |
| 2025-04-19 | Text-Audio-Visual-conditioned Diffusion Model for Video Saliency Prediction | Li Yu et.al. | 2504.14267 | null |
| 2025-04-19 | PEFT A2Z: Parameter-Efficient Fine-Tuning Survey for Large Language and Vision Models | Nusrat Jahan Prottasha et.al. | 2504.14117 | null |
| 2025-04-18 | Are you SURE? Enhancing Multimodal Pretraining with Missing Modalities through Uncertainty Estimation | Duy A. Nguyen et.al. | 2504.13465 | null |
| 2025-04-17 | A Survey on Cross-Modal Interaction Between Music and Multimodal Data | Sifei Li et.al. | 2504.12796 | null |
| 2025-04-16 | An Algebraic Extension of Intuitionistic Linear Logic: The $L_!^S$-Calculus and Its Categorical Model | Alejandro Díaz-Caro et.al. | 2504.12128 | null |
| 2025-04-16 | FedEPA: Enhancing Personalization and Modality Alignment in Multimodal Federated Learning | Yu Zhang et.al. | 2504.12025 | null |
| 2025-04-15 | Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset | Elisa Ancarani et.al. | 2504.11232 | null |
| 2025-04-14 | Improving Multimodal Hateful Meme Detection Exploiting LMM-Generated Knowledge | Maria Tzelepi et.al. | 2504.09914 | null |
| 2025-04-13 | Automatic Detection of Intro and Credits in Video using CLIP and Multihead Attention | Vasilii Korolkov et.al. | 2504.09738 | null |
| 2025-04-13 | Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation | Yongchao Feng et.al. | 2504.09480 | link |
| 2025-04-09 | Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging | Siyuan Dai et.al. | 2504.07336 | null |
| 2025-04-07 | Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework | Yu Min Park et.al. | 2504.05187 | null |
| 2025-04-07 | Leveraging Label Potential for Enhanced Multimodal Emotion Recognition | Xuechun Shao et.al. | 2504.05158 | null |
| 2025-04-06 | FluentLip: A Phonemes-Based Two-stage Approach for Audio-Driven Lip Synthesis with Optical Flow Consistency | Shiyan Liu et.al. | 2504.04427 | null |
| 2025-04-04 | Interpretable Multimodal Learning for Tumor Protein-Metal Binding: Progress, Challenges, and Perspectives | Xiaokun Liu et.al. | 2504.03847 | null |
| 2025-04-04 | DML-RAM: Deep Multimodal Learning Framework for Robotic Arm Manipulation using Pre-trained Models | Sathish Kumar et.al. | 2504.03423 | null |
| 2025-04-02 | Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities | Jing Liu et.al. | 2504.01954 | null |
| 2025-04-02 | Deep Learning-Driven Protein Structure Prediction and Design: Key Model Developments by Nobel Laureates and Multi-Domain Applications | Wanqing Yang et.al. | 2504.01490 | null |
| 2025-03-31 | Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition | François Olivier et.al. | 2503.24110 | null |
| 2025-03-31 | DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description | Adrienne Deganutti et.al. | 2503.24096 | null |
| 2025-03-31 | BeMERC: Behavior-Aware MLLM-based Framework for Multimodal Emotion Recognition in Conversation | Yumeng Fu et.al. | 2503.23990 | null |
| 2025-03-31 | Unimodal-driven Distillation in Multimodal Emotion Recognition with Dynamic Fusion | Jiagen Li et.al. | 2503.23721 | null |
| 2025-03-31 | HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation | Kun Liu et.al. | 2503.23715 | null |
| 2025-03-27 | Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models | Ruizhou Li et.al. | 2503.21435 | null |
| 2025-03-27 | UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning | Hongxuan Tang et.al. | 2503.21193 | null |
| 2025-03-27 | AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction | Shuaiyu Zhang et.al. | 2503.21124 | link |
| 2025-03-26 | GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations | Yupei Li et.al. | 2503.20919 | null |
| 2025-03-26 | An Encoding of Interaction Nets in OCaml | Nikolaus Huber et.al. | 2503.20463 | null |
| 2025-03-27 | RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Mehdi Moshtaghi et.al. | 2503.19654 | null |
| 2025-03-25 | VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction | Zizhi Chen et.al. | 2503.19367 | link |
| 2025-03-25 | LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text | Weizhi Chen et.al. | 2503.19311 | link |
| 2025-03-24 | Adaptive Unimodal Regulation for Balanced Multimodal Information Acquisition | Chengxiang Huang et.al. | 2503.18595 | link |
| 2025-03-21 | Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition | Ran Liu et.al. | 2503.17453 | link |
| 2025-03-21 | MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering | Jialin Chen et.al. | 2503.16858 | null |
| 2025-03-20 | EVA-MED: An Enhanced Valence-Arousal Multimodal Emotion Dataset for Emotion Recognition | Xin Huang et.al. | 2503.16584 | null |
| 2025-03-18 | Do Multimodal Large Language Models Understand Welding? | Grigorii Khvatskii et.al. | 2503.16537 | null |
| 2025-03-19 | EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis | Matthew Massey et.al. | 2503.15625 | link |
| 2025-03-19 | Optimal Transport Adapter Tuning for Bridging Modality Gaps in Few-Shot Remote Sensing Scene Classification | Zhong Ji et.al. | 2503.14938 | null |
| 2025-03-18 | HySurvPred: Multimodal Hyperbolic Embedding with Angle-Aware Hierarchical Contrastive Learning and Uncertainty Constraints for Survival Prediction | Jiaqi Yang et.al. | 2503.13862 | null |
| 2025-03-17 | Exploring 3D Activity Reasoning and Planning: From Implicit Human Intentions to Route-Aware Planning | Xueying Jiang et.al. | 2503.12974 | null |
| 2025-03-16 | BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries | Tianle Li et.al. | 2503.12446 | null |
| 2025-03-15 | Handling Weak Complementary Relationships for Audio-Visual Emotion Recognition | R. Gnana Praveen et.al. | 2503.12261 | null |
| 2025-03-14 | Cross-Modal Learning for Music-to-Music-Video Description Generation | Zhuoyuan Mao et.al. | 2503.11190 | null |
| 2025-03-20 | Unifying 2D and 3D Vision-Language Understanding | Ayush Jain et.al. | 2503.10745 | null |
| 2025-03-11 | TLA: Tactile-Language-Action Model for Contact-Rich Manipulation | Peng Hao et.al. | 2503.08548 | null |
| 2025-03-10 | Federated Multimodal Learning with Dual Adapters and Selective Pruning for Communication and Computational Efficiency | Duy Phuong Nguyen et.al. | 2503.07552 | link |
| 2025-03-10 | A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis | Xiang Liu et.al. | 2503.06973 | link |
| 2025-03-10 | HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation | Xingzu Zhan et.al. | 2503.06897 | null |
| 2025-03-10 | Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting | Cagri Gungor et.al. | 2503.06860 | null |
| 2025-03-09 | Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts | Aref Farhadipour et.al. | 2503.06805 | null |
| 2025-03-13 | DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning | Chengxuan Qian et.al. | 2503.06456 | link |
| 2025-03-05 | Beyond H&E: Unlocking Pathological Insights with Polarization via Self-supervised Learning | Yao Du et.al. | 2503.05933 | null |
| 2025-03-10 | R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning | Jiaxing Zhao et.al. | 2503.05379 | null |
| 2025-03-07 | Robust Multimodal Learning for Ophthalmic Disease Grading via Disentangled Representation | Xinkun Wang et.al. | 2503.05319 | null |
| 2025-03-06 | Large Language Models in Bioinformatics: A Survey | Zhenyu Wang et.al. | 2503.04490 | null |
| 2025-03-05 | Rebalanced Multimodal Learning with Data-aware Unimodal Sampling | Qingyuan Jiang et.al. | 2503.03792 | null |
| 2025-03-04 | Multimodal Deep Learning for Subtype Classification in Breast Cancer Using Histopathological Images and Gene Expression Data | Amin Honarmandi Shandiz et.al. | 2503.02849 | null |
| 2025-03-04 | Multimodal AI predicts clinical outcomes of drug combinations from preclinical data | Yepeng Huang et.al. | 2503.02781 | null |
| 2025-03-03 | Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA | Zhusi Zhong et.al. | 2503.02034 | null |
| 2025-03-03 | DeepSuM: Deep Sufficient Modality Learning Framework | Zhe Gao et.al. | 2503.01728 | null |
| 2025-03-03 | Dementia Insights: A Context-Based MultiModal Approach | Sahar Sinene Mehdoui et.al. | 2503.01226 | null |
| 2025-03-03 | HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation | Hongye Cheng et.al. | 2503.01175 | null |
| 2025-02-28 | Foundation-Model-Boosted Multimodal Learning for fMRI-based Neuropathic Pain Drug Response Prediction | Wenrui Fan et.al. | 2503.00210 | null |
| 2025-02-28 | PathVG: A New Benchmark and Dataset for Pathology Visual Grounding | Chunlin Zhong et.al. | 2502.20869 | null |
| 2025-02-28 | Multimodal Learning for Just-In-Time Software Defect Prediction in Autonomous Driving Systems | Faisal Mohammad et.al. | 2502.20806 | null |
| 2025-02-27 | VideoA11y: Method and Dataset for Accessible Video Description | Chaoyu Li et.al. | 2502.20480 | null |
| 2025-02-27 | LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding | Ang Cao et.al. | 2502.20389 | null |
| 2025-02-27 | Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion | Qingyuan Jiang et.al. | 2502.20120 | null |
| 2025-02-27 | MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal Classification | Tong Zhang et.al. | 2502.19674 | null |
| 2025-02-25 | CPVis: Evidence-based Multimodal Learning Analytics for Evaluation in Collaborative Programming | Gefei Zhang et.al. | 2502.17835 | null |
| 2025-02-24 | Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI | Syed Abdul Gaffar Shakhadri et.al. | 2502.17092 | null |
| 2025-02-24 | DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications | Ibrahim Fayad et.al. | 2502.17066 | null |
| 2025-02-23 | Category-Selective Neurons in Deep Networks: Comparing Purely Visual and Visual-Language Models | Zitong Lu et.al. | 2502.16456 | null |
| 2025-02-23 | A Survey on Industrial Anomalies Synthesis | Xichen Xu et.al. | 2502.16412 | link |
| 2025-02-22 | Understanding the Emergence of Multimodal Representation Alignment | Megan Tjandrasuwita et.al. | 2502.16282 | link |
| 2025-02-21 | M2LADS Demo: A System for Generating Multimodal Learning Analytics Dashboards | Alvaro Becerra et.al. | 2502.15363 | null |
| 2025-02-20 | FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Fadillah Maani et.al. | 2502.14807 | link |
| 2025-02-21 | AVD2: Accident Video Diffusion for Accident Video Description | Cheng Li et.al. | 2502.14801 | null |
| 2025-02-19 | Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition | Jingwang Huang et.al. | 2502.13954 | link |
| 2025-02-22 | Grounding LLM Reasoning with Knowledge Graphs | Alfonso Amayuelas et.al. | 2502.13247 | null |
| 2025-02-18 | SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation | Zekun Qi et.al. | 2502.13143 | null |
| 2025-02-18 | Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning | Mengshi Qi et.al. | 2502.12425 | link |
| 2025-02-16 | AudioSpa: Spatializing Sound Events with Text | Linfeng Feng et.al. | 2502.11219 | null |
| 2025-02-18 | BalanceBenchmark: A Survey for Imbalanced Learning | Shaoxuan Xu et.al. | 2502.10816 | link |
| 2025-02-17 | Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation | Mohammad Mahdi Abootorabi et.al. | 2502.08826 | link |
| 2025-02-12 | A Novel Approach to for Multimodal Emotion Recognition: Multimodal semantic information fusion | Wei Dai et.al. | 2502.08573 | null |
| 2025-02-17 | What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations | Dongqi Liu et.al. | 2502.08279 | null |
| 2025-02-11 | Enhancing Video Understanding: Deep Neural Networks for Spatiotemporal Analysis | Amir Hosein Fadaei et.al. | 2502.07277 | null |
| 2025-02-10 | Generative Distribution Prediction: A Unified Approach to Multimodal Learning | Xinyu Tian et.al. | 2502.07090 | null |
| 2025-02-06 | CAST: Cross Attention based multimodal fusion of Structure and Text for materials property prediction | Jaewan Lee et.al. | 2502.06836 | null |
| 2025-02-10 | Learning Musical Representations for Music Performance Question Answering | Xingjian Diao et.al. | 2502.06710 | null |
| 2025-02-04 | Exploring Spatial Language Grounding Through Referring Expressions | Akshar Tumu et.al. | 2502.04359 | null |
| 2025-02-03 | Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective | Xiaorui Ma et.al. | 2502.01524 | null |
| 2025-02-03 | MIND: Modality-Informed Knowledge Distillation Framework for Multimodal Clinical Prediction Tasks | Alejandro Guerra-Manzanares et.al. | 2502.01158 | null |
| 2025-02-01 | Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition | Zaitian Wang et.al. | 2502.00547 | link |
| 2025-01-29 | U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning | Md Kaykobad Reza et.al. | 2501.17823 | null |
| 2025-01-28 | Molecular-driven Foundation Model for Oncologic Pathology | Anurag Vaidya et.al. | 2501.16652 | null |
| 2025-01-27 | AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models | Zheng Lian et.al. | 2501.16566 | null |
| 2025-01-25 | Inductive Biases for Zero-shot Systematic Generalization in Language-informed Reinforcement Learning | Negin Hashemi Dijujin et.al. | 2501.15270 | null |
| 2025-01-25 | Deep Multimodal Learning for Real-Time DDoS Attacks Detection in Internet of Vehicles | Mohamed Ababsa et.al. | 2501.15252 | link |
| 2025-01-25 | Cross-modal Context Fusion and Adaptive Graph Convolutional Network for Multimodal Conversational Emotion Recognition | Junwei Feng et.al. | 2501.15063 | null |
| 2025-01-23 | Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge | Haomiao Xiong et.al. | 2501.13468 | link |
| 2025-01-22 | EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network | Shamin Bin Habib Avro et.al. | 2501.12674 | null |
| 2025-01-21 | Compositional Instruction Following with Language Models and Reinforcement Learning | Vanya Cohen et.al. | 2501.12539 | null |
| 2025-01-21 | Multi-stage intermediate fusion for multimodal learning to classify non-small cell lung cancer subtypes from CT and PET | Fatih Aksu et.al. | 2501.12425 | null |
| 2025-01-20 | LLM supervised Pre-training for Multimodal Emotion Recognition in Conversations | Soumya Dutta et.al. | 2501.11468 | null |
| 2025-01-20 | ITCFN: Incomplete Triple-Modal Co-Attention Fusion Network for Mild Cognitive Impairment Conversion Prediction | Xiangyang Hu et.al. | 2501.11276 | link |
| 2025-01-18 | Fake Advertisements Detection Using Automated Multimodal Learning: A Case Study for Vietnamese Real Estate Data | Duy Nguyen et.al. | 2501.10848 | null |
| 2025-01-17 | A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features | Enes Karanfil et.al. | 2501.10144 | null |
| 2025-01-17 | TeamVision: An AI-powered Learning Analytics System for Supporting Reflection in Team-based Healthcare Simulation | Vanessa Echeverria et.al. | 2501.09930 | null |
| 2025-01-19 | IDEA: Image Description Enhanced CLIP-Adapter | Zhipeng Ye et.al. | 2501.08816 | link |
| 2025-01-14 | Towards Zero-Shot & Explainable Video Description by Reasoning over Graphs of Events in Space and Time | Mihai Masala et.al. | 2501.08460 | null |
| 2025-01-12 | SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval | Bhavin Jawade et.al. | 2501.08347 | null |
| 2025-01-17 | Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding | Liping Yuan et.al. | 2501.07888 | null |
| 2025-01-13 | Exploring the Use of Contrastive Language-Image Pre-Training for Human Posture Classification: Insights from Yoga Pose Analysis | Andrzej D. Dobrzycki et.al. | 2501.07221 | null |
| 2025-01-12 | 3DCoMPaT200: Language-Grounded Compositional Understanding of Parts and Materials of 3D Shapes | Mahmoud Ahmed et.al. | 2501.06785 | link |
| 2025-01-14 | Beyond Sight: Finetuning Generalist Robot Policies with Heterogeneous Sensors via Language Grounding | Joshua Jones et.al. | 2501.04693 | null |
| 2025-01-06 | CM3T: Framework for Efficient Multimodal Learning for Inhomogeneous Interaction Datasets | Tanay Agrawal et.al. | 2501.03332 | null |
| 2025-01-06 | MVP: Multimodal Emotion Recognition based on Video and Physiological Signals | Valeriya Strizhkova et.al. | 2501.03103 | null |
| 2025-01-02 | Asymmetric Reinforcing against Multi-modal Representation Bias | Xiyuan Gao et.al. | 2501.01240 | link |
| 2025-01-02 | Retrieval-Augmented Dynamic Prompt Tuning for Incomplete Multimodal Learning | Jian Lang et.al. | 2501.01120 | link |
| 2024-12-30 | Aviary: training language agents on challenging scientific tasks | Siddharth Narayanan et.al. | 2412.21154 | null |
| 2024-12-30 | Hierarchical Banzhaf Interaction for General Video-Language Representation Learning | Peng Jin et.al. | 2412.20964 | link |
| 2024-12-30 | Enhancing Multimodal Emotion Recognition through Multi-Granularity Cross-Modal Alignment | Xuechen Wang et.al. | 2412.20821 | null |
| 2024-12-29 | Diff4MMLiTS: Advanced Multimodal Liver Tumor Segmentation via Diffusion-Based Image Synthesis and Alignment | Shiyun Chen et.al. | 2412.20418 | null |
| 2024-12-26 | Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching | Wenjing Chen et.al. | 2412.19184 | null |
| 2024-12-26 | CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting | Siyu Jiao et.al. | 2412.19142 | null |
| 2024-12-24 | MixMAS: A Framework for Sampling-Based Mixer Architecture Search for Multimodal Fusion and Learning | Abdelmadjid Chergui et.al. | 2412.18437 | link |
| 2024-12-23 | Multimodal Learning with Uncertainty Quantification based on Discounted Belief Fusion | Grigor Bezirganyan et.al. | 2412.18024 | link |
| 2024-12-23 | A Multimodal Emotion Recognition System: Integrating Facial Expressions, Body Movement, Speech, and Spoken Language | Kris Kraack et.al. | 2412.17907 | null |
| 2024-12-18 | Constraint-Based Model in Multimodal Learning to Improve Ventricular Arrhythmia Prediction | Evariste Njomgue Fotso et.al. | 2412.17840 | null |
| 2024-12-23 | Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy | Priyaranjan Pattnayak et.al. | 2412.17759 | null |
| 2024-12-23 | EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities | Zhe Chen et.al. | 2412.17677 | link |
| 2024-12-23 | V$^2$-SfMLearner: Learning Monocular Depth and Ego-motion for Multimodal Wireless Capsule Endoscopy | Long Bai et.al. | 2412.17595 | null |
| 2024-12-22 | COVID-19 on YouTube: A Data-Driven Analysis of Sentiment, Toxicity, and Content Recommendations | Vanessa Su et.al. | 2412.17180 | null |
| 2024-12-17 | DoPTA: Improving Document Layout Analysis using Patch-Text Alignment | Nikitha SR et.al. | 2412.12902 | null |
| 2024-12-17 | Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning | Shiping Ge et.al. | 2412.12791 | link |
| 2024-12-17 | PBVS 2024 Solution: Self-Supervised Learning and Sampling Strategies for SAR Classification in Extreme Long-Tail Distribution | Yuhyun Kim et.al. | 2412.12565 | null |
| 2024-12-16 | Gramian Multimodal Representation Learning and Alignment | Giordano Cicchetti et.al. | 2412.11959 | null |
| 2024-12-10 | Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning | Can Yaras et.al. | 2412.07909 | null |
| 2024-12-07 | WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition | Feng Li et.al. | 2412.05558 | null |
| 2024-12-05 | Lattice Lingo: Effect of Textual Detail on Multimodal Learning for Property Prediction of Crystals | Mrigi Munjal et.al. | 2412.04670 | null |
| 2024-12-04 | Training-Free Mitigation of Language Reasoning Degradation After Multimodal Instruction Tuning | Neale Ratzlaff et.al. | 2412.03467 | null |
| 2024-12-04 | Grounded Language Design for Lightweight Diagramming for Formal Methods | Siddhartha Prasad et.al. | 2412.03310 | null |
| 2024-12-04 | Dynamic Graph Neural Ordinary Differential Equation Network for Multi-modal Emotion Recognition in Conversation | Yuntao Shou et.al. | 2412.02935 | null |
| 2024-12-03 | Initial Study On Improving Segmentation By Combining Preoperative CT And Intraoperative CBCT Using Synthetic Data | Maximilian E. Tschuchnig et.al. | 2412.02294 | null |
| 2024-12-02 | Occam’s LGS: A Simple Approach for Language Gaussian Splatting | Jiahuan Cheng et.al. | 2412.01807 | null |
| 2024-11-30 | Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment | Dongfang Zhao et.al. | 2412.00373 | null |
| 2024-11-29 | SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition | Fangze Fu et.al. | 2411.19822 | null |
| 2024-11-26 | Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment | Zheng Chen et.al. | 2411.17237 | link |
| 2024-11-26 | Learning Robust Anymodal Segmentor with Unimodal and Cross-modal Distillation | Xu Zheng et.al. | 2411.17141 | link |
| 2024-11-26 | Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models | Colin Conwell et.al. | 2411.17066 | link |
| 2024-11-26 | Multimodal Alignment and Fusion: A Survey | Songtao Li et.al. | 2411.17040 | null |
| 2024-11-25 | Language Driven Occupancy Prediction | Zhu Yu et.al. | 2411.16072 | link |
| 2024-11-23 | From Complexity to Parsimony: Integrating Latent Class Analysis to Uncover Multimodal Learning Patterns in Collaborative Learning | Lixiang Yan et.al. | 2411.15590 | null |
| 2024-11-23 | Botfip-LLM: An Enhanced Multimodal Scientific Computing Framework Leveraging Knowledge Distillation from Large Language Models | Tianhao Chen et.al. | 2411.15525 | null |
| 2024-11-22 | PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision | Arnav M. Das et.al. | 2411.15127 | null |
| 2024-11-21 | Generative AI for Music and Audio | Hao-Wen Dong et.al. | 2411.14627 | null |
| 2024-11-21 | Multimodal 3D Reasoning Segmentation with Complex Scenes | Xueying Jiang et.al. | 2411.13927 | null |
| 2024-11-12 | Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media | Naga VS Raviteja Chappa et.al. | 2411.13572 | null |
| 2024-11-20 | I Can Tell What I am Doing: Toward Real-World Natural Language Grounding of Robot Experiences | Zihan Wang et.al. | 2411.12960 | null |
| 2024-11-18 | MMBind: Unleashing the Potential of Distributed and Heterogeneous Data for Multimodal Learning in IoT | Xiaomin Ouyang et.al. | 2411.12126 | null |
| 2024-11-19 | SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach | Ruoxi Sun et.al. | 2411.11195 | null |
| 2024-11-15 | Everything is a Video: Unifying Modalities through Next-Frame Prediction | G. Thomas Hudson et.al. | 2411.10503 | null |
| 2024-11-15 | Weakly-Supervised Multimodal Learning on MIMIC-CXR | Andrea Agostini et.al. | 2411.10356 | null |
| 2024-11-15 | CMATH: Cross-Modality Augmented Transformer with Hierarchical Variational Distillation for Multimodal Emotion Recognition in Conversation | Xiaofei Zhu et.al. | 2411.10060 | null |
| 2024-11-21 | Instruction-Guided Editing Controls for Images and Multimedia: A Survey in LLM era | Thanh Tam Nguyen et.al. | 2411.09955 | link |
| 2024-11-14 | SmartInv: Multimodal Learning for Smart Contract Invariant Inference | Sally Junsong Wang et.al. | 2411.09217 | null |
| 2024-11-12 | NL-SLAM for OC-VLN: Natural Language Grounded SLAM for Object-Centric VLN | Sonia Raychaudhuri et.al. | 2411.07848 | null |
| 2024-11-11 | Multimodal Fusion Balancing Through Game-Theoretic Regularization | Konstantinos Kontras et.al. | 2411.07335 | null |
| 2024-11-11 | StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification | Yichen He et.al. | 2411.07076 | link |
| 2024-11-08 | Smile upon the Face but Sadness in the Eyes: Emotion Recognition based on Facial Expressions and Eye Behaviors | Yuanyuan Liu et.al. | 2411.05879 | null |
| 2024-11-06 | AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool | Zhongliang Tang et.al. | 2411.03709 | null |
| 2024-11-05 | STEER: Flexible Robotic Manipulation via Dense Language Grounding | Laura Smith et.al. | 2411.03409 | null |
| 2024-11-05 | Grounding Natural Language to SQL Translation with Data-Based Self-Explanations | Yuankai Fan et.al. | 2411.02948 | link |
| 2024-11-04 | Grounding Emotional Descriptions to Electrovibration Haptic Signals | Guimin Hu et.al. | 2411.02118 | null |
| 2024-11-03 | Classifier-guided Gradient Modulation for Enhanced Multimodal Learning | Zirun Guo et.al. | 2411.01409 | link |
| 2024-11-01 | Text2Freq: Learning Series Patterns from Text via Frequency Domain | Ming-Chih Lo et.al. | 2411.00929 | null |
| 2024-10-29 | EEG-based Multimodal Representation Learning for Emotion Recognition | Kang Yin et.al. | 2411.00822 | null |
| 2024-11-01 | Analyzing Multimodal Integration in the Variational Autoencoder from an Information-Theoretic Perspective | Carlotta Langer et.al. | 2411.00522 | null |
| 2024-10-30 | PV-VTT: A Privacy-Centric Dataset for Mission-Specific Anomaly Detection and Natural Language Interpretation | Ryozo Masukawa et.al. | 2410.22623 | null |
| 2024-10-28 | IndraEye: Infrared Electro-Optical UAV-based Perception Dataset for Robust Downstream Tasks | Manjunath D et.al. | 2410.20953 | link |
| 2024-10-25 | TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | Xiangyu Zeng et.al. | 2410.19702 | null |
| 2024-10-24 | UGotMe: An Embodied System for Affective Human-Robot Interaction | Peizhen Li et.al. | 2410.18373 | link |
| 2024-10-22 | EVC-MF: End-to-end Video Captioning Network with Multi-scale Features | Tian-Zi Niu et.al. | 2410.16624 | null |
| 2024-10-22 | MoRE: Multi-Modal Contrastive Pre-training with Transformers on X-Rays, ECGs, and Diagnostic Report | Samrajya Thapa et.al. | 2410.16239 | link |
| 2024-10-21 | Multimodal Learning for Embryo Viability Prediction in Clinical IVF | Junsik Kim et.al. | 2410.15581 | null |
| 2024-10-20 | Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison | Shiyu Hu et.al. | 2410.15270 | null |
| 2024-10-15 | CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning | Qingqing Cao et.al. | 2410.11963 | null |
| 2024-10-15 | Generalizable Spacecraft Trajectory Generation via Multimodal Learning with Transformers | Davide Celestini et.al. | 2410.11723 | null |
| 2024-10-15 | On-the-fly Modulation for Balanced Multimodal Learning | Yake Wei et.al. | 2410.11582 | link |
| 2024-10-14 | MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models | Peng Xia et.al. | 2410.10139 | link |
| 2024-10-10 | Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts | Sukwon Yun et.al. | 2410.08245 | link |
| 2024-10-11 | Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization | Changli Tang et.al. | 2410.06682 | null |
| 2024-10-08 | Multimodal Representation Learning using Adaptive Graph Construction | Weichen Huang et.al. | 2410.06395 | null |
| 2024-10-07 | Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models | Dehong Kong et.al. | 2410.04884 | null |
| 2024-10-07 | MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection | Niki Nezakati et.al. | 2410.03010 | null |
| 2024-10-02 | Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations | Minoh Jeong et.al. | 2410.02086 | null |
| 2024-10-02 | Open-vocabulary Multimodal Emotion Recognition: Dataset, Metric, and Benchmark | Zheng Lian et.al. | 2410.01495 | null |
| 2024-10-04 | VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models | Jiapeng Wang et.al. | 2410.00741 | null |
| 2024-09-30 | Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning | Weitai Kang et.al. | 2410.00255 | link |
| 2024-09-30 | Towards Robust Multimodal Sentiment Analysis with Incomplete Data | Haoyu Zhang et.al. | 2409.20012 | link |
| 2024-10-02 | CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling | Jihai Zhang et.al. | 2409.19291 | link |
| 2024-09-26 | Infer Human’s Intentions Before Following Natural Language Instructions | Yanming Wan et.al. | 2409.18073 | link |
| 2024-09-26 | A Multimodal Single-Branch Embedding Network for Recommendation in Cold-Start and Missing Modality Scenarios | Christian Ganhör et.al. | 2409.17864 | null |
| 2024-09-26 | Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification | Raja Kumar et.al. | 2409.17777 | null |
| 2024-09-25 | Language Grounded Multi-agent Communication for Ad-hoc Teamwork | Huao Li et.al. | 2409.17348 | null |
| 2024-09-24 | CLSP: High-Fidelity Contrastive Language-State Pre-training for Agent State Representation | Fuxian Huang et.al. | 2409.15806 | null |
| 2024-09-18 | All-in-one foundational models learning across quantum chemical levels | Yuxinxin Chen et.al. | 2409.12015 | link |
| 2024-09-13 | Hierarchical Hypercomplex Network for Multimodal Emotion Recognition | Eleonora Lopez et.al. | 2409.09194 | link |
| 2024-09-13 | Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing | Minh-Duc Vu et.al. | 2409.08885 | null |
| 2024-09-13 | A Multimodal Approach for Fluid Overload Prediction: Integrating Lung Ultrasound and Clinical Data | Tianqi Yang et.al. | 2409.08790 | null |
| 2024-09-13 | A Comprehensive Survey on Deep Multimodal Learning with Missing Modality | Renjie Wu et.al. | 2409.07825 | null |
| 2024-09-11 | What to align in multimodal contrastive learning? | Benoit Dufumier et.al. | 2409.07402 | null |
| 2024-09-11 | Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective | Guimin Hu et.al. | 2409.07388 | link |
| 2024-09-11 | Multimodal Emotion Recognition with Vision-language Prompting and Modality Dropout | Anbin Qi et.al. | 2409.07078 | null |
| 2024-09-11 | A Survey of Multimodal Composite Editing and Retrieval | Suyan Li et.al. | 2409.05405 | link |
| 2024-09-09 | Diagnostic Reasoning in Natural Language: Computational Model and Application | Nils Dycke et.al. | 2409.05367 | null |
| 2024-09-10 | Improving Multimodal Emotion Recognition by Leveraging Acoustic Adaptation and Visual Alignment | Zhixian Zhao et.al. | 2409.05015 | null |
| 2024-08-31 | Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification | Aref Farhadipour et.al. | 2409.00562 | null |
| 2024-08-29 | Toward Robust Early Detection of Alzheimer’s Disease via an Integrated Multimodal Learning Approach | Yifei Chen et.al. | 2408.16343 | link |
| 2024-08-28 | Meta-Learn Unimodal Signals with Weak Supervision for Multimodal Sentiment Analysis | Sijie Mai et.al. | 2408.16029 | null |
| 2024-08-28 | ModalityMirror: Improving Audio Classification in Modality Heterogeneity Federated Learning with Multimodal Distillation | Tiantian Feng et.al. | 2408.15803 | null |
| 2024-08-28 | Visual Prompt Engineering for Medical Vision Language Models in Radiology | Stefan Denner et.al. | 2408.15802 | null |
| 2024-08-27 | The Benefits of Balance: From Information Projections to Variance Reduction | Lang Liu et.al. | 2408.15065 | null |
| 2024-08-27 | NeuralOOD: Improving Out-of-Distribution Generalization Performance with Brain-machine Fusion Learning Framework | Shuangchen Zhao et.al. | 2408.14950 | null |
| 2024-09-03 | Foundation Models for Music: A Survey | Yinghao Ma et.al. | 2408.14340 | link |
| 2024-09-06 | Quantum Multimodal Contrastive Learning Framework | Chi-Sheng Chen et.al. | 2408.13919 | null |
| 2024-08-25 | Multimodal Ensemble with Conditional Feature Fusion for Dysgraphia Diagnosis in Children from Handwriting Samples | Jayakanth Kunhoth et.al. | 2408.13754 | null |
| 2024-08-24 | R2G: Reasoning to Ground in 3D Scenes | Yixuan Li et.al. | 2408.13499 | null |
| 2024-08-23 | Ada2I: Enhancing Modality Balance for Multimodal Conversational Emotion Recognition | Cam-Van Thi Nguyen et.al. | 2408.12895 | null |
| 2024-08-23 | Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey | Qika Lin et.al. | 2408.12880 | link |
| 2024-08-23 | Grounding Fallacies Misrepresenting Scientific Publications in Evidence | Max Glockner et.al. | 2408.12812 | null |
| 2024-08-22 | Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models | Jean Park et.al. | 2408.12763 | null |
| 2024-08-22 | Mental-Perceiver: Audio-Textual Multimodal Learning for Mental Health Assessment | Jinghui Qin et.al. | 2408.12088 | null |
| 2024-08-22 | Video Emotion Open-vocabulary Recognition Based on Multimodal Large Language Model | Mengying Ge et.al. | 2408.11286 | null |
| 2024-08-21 | SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition | Zebang Cheng et.al. | 2408.10500 | link |
| 2024-08-19 | Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation | Liu He et.al. | 2408.10453 | null |
| 2024-08-18 | Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition | Qifei Li et.al. | 2408.09438 | link |
| 2024-08-16 | Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition | Muhammad Haseeb Aslam et.al. | 2408.09035 | link |
| 2024-08-14 | Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach | Muhammad Saad Saeed et.al. | 2408.07445 | null |
| 2024-08-14 | Robust Semi-supervised Multimodal Medical Image Segmentation via Cross Modality Collaboration | Xiaogen Zhou et.al. | 2408.07341 | link |
| 2024-08-14 | Enhancing Visual Question Answering through Ranking-Based Hybrid Training and Multimodal Fusion | Peiyuan Chen et.al. | 2408.07303 | null |
| 2024-08-13 | Prioritizing Modalities: Flexible Importance Scheduling in Federated Multimodal Learning | Jieming Bian et.al. | 2408.06549 | null |
| 2024-08-04 | Distribution-Level Memory Recall for Continual Learning: Preserving Knowledge and Avoiding Confusion | Shaoxu Cheng et.al. | 2408.02695 | null |
| 2024-08-06 | Infusing Environmental Captions for Long-Form Video Language Grounding | Hyogun Lee et.al. | 2408.02336 | null |
| 2024-08-05 | REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models | Agneet Chatterjee et.al. | 2408.02231 | null |
| 2024-08-04 | CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization | Xiang He et.al. | 2408.01952 | link |
| 2024-08-02 | Multimodal Fusion via Hypergraph Autoencoder and Contrastive Learning for Emotion Recognition in Conversation | Zijian Yi et.al. | 2408.00970 | link |
| 2024-08-01 | The Monetisation of Toxicity: Analysing YouTube Content Creators and Controversy-Driven Engagement | Thales Bertaglia et.al. | 2408.00534 | null |
| 2024-07-31 | Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition | Jiang Li et.al. | 2407.21536 | null |
| 2024-07-31 | DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations | Dongwon Son et.al. | 2407.21267 | null |
| 2024-07-30 | HyperMM: Robust Multimodal Learning with Varying-sized Inputs | Hava Chaptoukaev et.al. | 2407.20768 | null |
| 2024-07-29 | ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2 | Wenjun Huang et.al. | 2407.19832 | null |
| 2024-08-02 | XLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training | Biao Wu et.al. | 2407.19546 | link |
| 2024-07-28 | Detached and Interactive Multimodal Learning | Yunfeng Fan et.al. | 2407.19514 | link |
| 2024-07-26 | Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment | Yuze Zheng et.al. | 2407.18854 | null |
| 2024-07-26 | Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention | Joe Dhanith P R et.al. | 2407.18552 | null |
| 2024-07-25 | $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs | Vlad Sobal et.al. | 2407.18134 | null |
| 2024-07-25 | Cross-Vendor Reproducibility of Radiomics-based Machine Learning Models for Computer-aided Diagnosis | Jatin Chaudhary et.al. | 2407.18060 | null |
| 2024-07-23 | Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation | Tao Meng et.al. | 2407.16714 | null |
| 2024-07-24 | MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues | Liyun Zhang et.al. | 2407.16552 | null |
| 2024-07-23 | Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities | Muhammad Irzam Liaqat et.al. | 2407.16243 | null |
| 2024-07-22 | Resource-Efficient Federated Multimodal Learning via Layer-wise and Progressive Training | Ye Lin Tun et.al. | 2407.15426 | null |
| 2024-07-17 | Text- and Feature-based Models for Compound Multimodal Emotion Recognition in the Wild | Nicolas Richet et.al. | 2407.12927 | link |
| 2024-07-17 | Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models | Donggeun Kim et.al. | 2407.12616 | null |
| 2024-07-12 | Diagnosing and Re-learning for Balanced Multimodal Learning | Yake Wei et.al. | 2407.09705 | link |
| 2024-07-12 | Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework | Haoqin Sun et.al. | 2407.09029 | null |
| 2024-07-10 | AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition | Zheng Lian et.al. | 2407.07653 | link |
| 2024-07-06 | Completed Feature Disentanglement Learning for Multimodal MRIs Analysis | Tianling Liu et.al. | 2407.04916 | null |
| 2024-07-05 | Multimodal Classification via Modal-Aware Interactive Enhancement | Qing-Yuan Jiang et.al. | 2407.04587 | null |
| 2024-07-05 | Robust Multimodal Learning via Representation Decoupling | Shicai Wei et.al. | 2407.04458 | null |
| 2024-07-05 | Smart Vision-Language Reasoners | Denisa Roberts et.al. | 2407.04212 | link |
| 2024-07-04 | ADAPT: Multimodal Learning for Detecting Physiological Changes under Missing Modalities | Julie Mordacq et.al. | 2407.03836 | link |
| 2024-07-02 | Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties | Srivathsan Badrinarayanan et.al. | 2407.03380 | link |
| 2024-07-05 | Multi-Task Domain Adaptation for Language Grounding with 3D Objects | Penglei Sun et.al. | 2407.02846 | null |
| 2024-07-01 | Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation | Sirui Xia et.al. | 2407.01796 | null |
| 2024-06-30 | Tarsier: Recipes for Training and Evaluating Large Video Description Models | Jiawei Wang et.al. | 2407.00634 | link |
| 2024-06-28 | Multimodal Learning and Cognitive Processes in Radiology: MedGaze for Chest X-ray Scanpath Prediction | Akash Awasthi et.al. | 2407.00129 | null |
| 2024-06-27 | From Efficient Multimodal Models to World Models: A Survey | Xinji Mai et.al. | 2407.00118 | null |
| 2024-06-27 | Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment | Hao Fei et.al. | 2406.19255 | null |
| 2024-06-27 | RAVEN: Multitask Retrieval Augmented Vision-Language Learning | Varun Nagaraj Rao et.al. | 2406.19150 | null |
| 2024-06-26 | Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs | Uttaran Bhattacharya et.al. | 2406.18068 | null |
| 2024-06-25 | Data curation via joint example selection further accelerates multimodal learning | Talfan Evans et.al. | 2406.17711 | null |
| 2024-06-23 | LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control | Delin Qu et.al. | 2406.16038 | null |
| 2024-06-20 | Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning | Yupei Zhang et.al. | 2406.13979 | link |
| 2024-06-19 | VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models | Haowen Hou et.al. | 2406.13362 | link |
| 2024-06-18 | Language and Multimodal Models in Sports: A Survey of Datasets and Applications | Haotian Xia et.al. | 2406.12252 | null |
| 2024-07-01 | Multimodal Learning With Intraoperative CBCT & Variably Aligned Preoperative CT Data To Improve Segmentation | Maximilian E. Tschuchnig et.al. | 2406.11650 | null |
| 2024-06-17 | Relational Learning in Pre-Trained Models: A Theory from Hypergraph Recovery Perspective | Yang Chen et.al. | 2406.11249 | null |
| 2024-06-17 | Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning | Zebang Cheng et.al. | 2406.11161 | link |
| 2024-06-13 | Explore the Limits of Omni-modal Pretraining at Scale | Yiyuan Zhang et.al. | 2406.09412 | link |
| 2024-06-13 | OpenVLA: An Open-Source Vision-Language-Action Model | Moo Jin Kim et.al. | 2406.09246 | link |
| 2024-06-13 | Zoom and Shift are All You Need | Jiahao Qin et.al. | 2406.08866 | null |
| 2024-06-11 | Embedding-based Multimodal Learning on Pan-Squamous Cell Carcinomas for Improved Survival Outcomes | Asim Waqas et.al. | 2406.08521 | null |
| 2024-06-16 | A Labelled Dataset for Sentiment Analysis of Videos on YouTube, TikTok, and Other Sources about the 2024 Outbreak of Measles | Nirmalya Thakur et.al. | 2406.07693 | null |
| 2024-06-11 | Situational Awareness Matters in 3D Vision Language Reasoning | Yunze Man et.al. | 2406.07544 | link |
| 2024-06-11 | Unified Modeling Enhanced Multimodal Learning for Precision Neuro-Oncology | Huahui Yi et.al. | 2406.07078 | link |
| 2024-06-10 | NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative | Asmar Nadeem et.al. | 2406.06499 | null |
| 2024-06-10 | Vript: A Video Is Worth Thousands of Words | Dongjie Yang et.al. | 2406.06040 | link |
| 2024-06-09 | Stealthy Targeted Backdoor Attacks against Image Captioning | Wenshu Fan et.al. | 2406.05874 | null |
| 2024-06-07 | Predictive Dynamic Fusion | Bing Cao et.al. | 2406.04802 | link |
| 2024-06-07 | AICoderEval: Improving AI Domain Code Generation of Large Language Models | Yinghui Xia et.al. | 2406.04712 | null |
| 2024-06-02 | Multimodal Deep Learning for Low-Resource Settings: A Vector Embedding Alignment Approach for Healthcare Applications | David Restrepo et.al. | 2406.02601 | null |
| 2024-06-04 | Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization | Yunpeng Zhao et.al. | 2406.01987 | null |
| 2024-06-03 | Automatic Fused Multimodal Deep Learning for Plant Identification | Alfreds Lapkovskis et.al. | 2406.01455 | link |
| 2024-06-05 | Pulmonary Embolism Mortality Prediction Using Multimodal Learning Based on Computed Tomography Angiography and Clinical Data | Zhusi Zhong et.al. | 2406.01302 | null |
| 2024-06-02 | Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient | Zechu Li et.al. | 2406.00681 | null |
| 2024-05-31 | Ovis: Structural Embedding Alignment for Multimodal Large Language Model | Shiyin Lu et.al. | 2405.20797 | null |
| 2024-05-31 | Visual Attention Analysis in Online Learning | Miriam Navarro et.al. | 2405.20091 | null |
| 2024-05-29 | Thermodynamically Informed Multimodal Learning of High-Dimensional Free Energy Models in Molecular Coarse Graining | Blake R. Duschatko et.al. | 2405.19386 | null |
| 2024-05-29 | LLMs Meet Multimodal Generation and Editing: A Survey | Yingqing He et.al. | 2405.19334 | link |
| 2024-05-29 | Exploring Exotic Decays of the Higgs Boson to Multi-Photons at the LHC via Multimodal Learning Approaches | A. Hammad et.al. | 2405.18834 | null |
| 2024-05-28 | RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives | Jaehong Yoon et.al. | 2405.18406 | link |
| 2024-05-28 | MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance | Yake Wei et.al. | 2405.17730 | link |
| 2024-05-27 | Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning | Zihua Zhao et.al. | 2405.16996 | null |
| 2024-05-27 | Multilingual Diversity Improves Vision-Language Representations | Thao Nguyen et.al. | 2405.16915 | null |
| 2024-05-27 | Hawk: Learning to Understand Open-World Video Anomalies | Jiaqi Tang et.al. | 2405.16886 | link |
| 2024-05-24 | Shopping Queries Image Dataset (SQID): An Image-Enriched ESCI Dataset for Exploring Multimodal Learning in Product Search | Marie Al Ghossein et.al. | 2405.15190 | link |
| 2024-05-23 | TIGER: Text-Instructed 3D Gaussian Retrieval and Coherent Editing | Teng Xu et.al. | 2405.14455 | null |
| 2024-05-22 | Grounding Toxicity in Real-World Events across Languages | Wondimagegnhue Tsegaye Tufa et.al. | 2405.13754 | link |
| 2024-05-21 | A Survey of Robotic Language Grounding: Tradeoffs Between Symbols and Embeddings | Vanya Cohen et.al. | 2405.13245 | null |
| 2024-05-21 | Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition | R. Gnana Praveen et.al. | 2405.12853 | null |
| 2024-05-21 | Scientific discourse on YouTube: Motivations for citing research in comments | Sören Striewski et.al. | 2405.12798 | null |
| 2024-05-21 | Amplifying Academic Research through YouTube: Engagement Metrics as Predictors of Citation Impact | Olga Zagovora et.al. | 2405.12734 | null |
| 2024-05-21 | A Multimodal Learning-based Approach for Autonomous Landing of UAV | Francisco Neves et.al. | 2405.12681 | null |
| 2024-05-21 | Mutual Information Analysis in Multimodal Learning Systems | Hadi Hadizadeh et.al. | 2405.12456 | null |
| 2024-05-16 | Grounded 3D-LLM with Referent Tokens | Yilun Chen et.al. | 2405.10370 | link |
| 2024-05-13 | Improving Multimodal Learning with Multi-Loss Gradient Modulation | Konstantinos Kontras et.al. | 2405.07930 | link |
| 2024-05-13 | Generating Human Motion in 3D Scenes from Text Descriptions | Zhi Cen et.al. | 2405.07784 | null |
| 2024-05-13 | An Efficient Multimodal Learning Framework to Comprehend Consumer Preferences Using BERT and Cross-Attention | Junichiro Niimi et.al. | 2405.07435 | null |
| 2024-05-10 | A First Step in Using Machine Learning Methods to Enhance Interaction Analysis for Embodied Learning Environments | Joyce Fonteles et.al. | 2405.06203 | null |
| 2024-05-09 | Prompt When the Animal is: Temporal Animal Behavior Grounding with Positional Recovery Training | Sheng Yan et.al. | 2405.05523 | null |
| 2024-05-08 | Empathy Through Multimodality in Conversational Interfaces | Mahyar Abbasian et.al. | 2405.04777 | null |
| 2024-05-08 | All in One Framework for Multimodal Re-identification in the Wild | He Li et.al. | 2405.04741 | null |
| 2024-05-07 | Interpretable Tensor Fusion | Saurabh Varshneya et.al. | 2405.04671 | null |
| 2024-04-27 | MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning | Nadia Saeed et.al. | 2405.01583 | null |
| 2024-04-29 | 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset | Xinyu Ma et.al. | 2404.18413 | link |
| 2024-04-28 | LEGENT: Open Platform for Embodied Agents | Zhili Cheng et.al. | 2404.18243 | null |
| 2024-05-03 | Revisiting Multimodal Emotion Recognition in Conversation from the Perspective of Graph Spectrum | Tao Meng et.al. | 2404.17862 | null |
| 2024-04-29 | MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition | Zheng Lian et.al. | 2404.17113 | link |
| 2024-04-30 | AutoGluon-Multimodal (AutoMM): Supercharging Multimodal AutoML with Foundation Models | Zhiqiang Tang et.al. | 2404.16233 | null |
| 2024-04-23 | Hidden in Plain Sight: Exploring the Intersections of Mental Health, Eating Disorders, and Content Moderation on TikTok | Charles Bickham et.al. | 2404.15457 | null |
| 2024-04-14 | A Survey on Multimodal Wearable Sensor-based Human Action Recognition | Jianyuan Ni et.al. | 2404.15349 | null |
| 2024-04-23 | Between Flat-Earthers and Fitness Coaches: Who is Citing Scientific Publications in YouTube Video Descriptions? | Olga Zagovora et.al. | 2404.15083 | null |
| 2024-04-19 | Cooperative Sentiment Agents for Multimodal Sentiment Analysis | Shanmin Wang et.al. | 2404.12642 | link |
| 2024-04-18 | Dynamic Modality and View Selection for Multimodal Emotion Recognition with Missing Modalities | Luciana Trinkaus Menon et.al. | 2404.12251 | null |
| 2024-04-19 | TC-OCR: TableCraft OCR for Efficient Detection & Recognition of Table Structure & Content | Avinash Anand et.al. | 2404.10305 | null |
| 2024-04-15 | AIGeN: An Adversarial Approach for Instruction Generation in VLN | Niyati Rawal et.al. | 2404.10054 | null |
| 2024-04-22 | Neuro-Inspired Information-Theoretic Hierarchical Perception for Multimodal Learning | Xiongye Xiao et.al. | 2404.09403 | link |
| 2024-04-14 | TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning | Quang Minh Dinh et.al. | 2404.09275 | link |
| 2024-04-13 | MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild | Kateryna Chumachenko et.al. | 2404.09010 | link |
| 2024-04-12 | OmniSat: Self-Supervised Modality Fusion for Earth Observation | Guillaume Astruc et.al. | 2404.08351 | link |
| 2024-04-11 | Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios | Yuan Zhang et.al. | 2404.07484 | null |
| 2024-04-07 | X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model | Jan Held et.al. | 2404.06332 | null |
| 2024-04-07 | A Data-to-Product Multimodal Conceptual Framework to Achieve Automated Software Evolution for Context-rich Intelligent Applications | Songhui Yue et.al. | 2404.04821 | null |
| 2024-04-06 | Interpretable Multimodal Learning for Cardiovascular Hemodynamics Assessment | Prasun C Tripathi et.al. | 2404.04718 | link |
| 2024-04-05 | Mitigating Heterogeneity in Federated Multimodal Learning with Biomedical Vision-Language Pre-training | Zitao Shuai et.al. | 2404.03854 | null |
| 2024-04-02 | On Stronger Computational Separations Between Multimodal and Unimodal Machine Learning | Ari Karchmer et.al. | 2404.02254 | null |
| 2024-04-01 | iMD4GC: Incomplete Multimodal Data Integration to Advance Precise Treatment Response Prediction and Survival Analysis for Gastric Cancer | Fengtao Zhou et.al. | 2404.01192 | link |
| 2024-04-11 | MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models | Zebang Cheng et.al. | 2404.00511 | link |
| 2024-03-30 | UniMEEC: Towards Unified Multimodal Emotion Recognition and Emotion Cause | Guimin Hu et.al. | 2404.00403 | null |
| 2024-03-28 | IVLMap: Instance-Aware Visual Language Grounding for Consumer Robot Navigation | Jiacui Huang et.al. | 2403.19336 | null |
| 2024-03-26 | Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation | Abdelrhman Werby et.al. | 2403.17846 | null |
| 2024-03-26 | Project MOSLA: Recording Every Moment of Second Language Acquisition | Masato Hagiwara et.al. | 2403.17314 | null |
| 2024-03-17 | A Survey of IMU Based Cross-Modal Transfer Learning in Human Activity Recognition | Abhi Kamboj et.al. | 2403.15444 | null |
| 2024-03-22 | Contrastive Learning on Multimodal Analysis of Electronic Health Records | Tianxi Cai et.al. | 2403.14926 | null |
| 2024-03-20 | Grounding Spatial Relations in Text-Only Language Models | Gorka Azkune et.al. | 2403.13666 | link |
| 2024-04-02 | Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition | R. Gnana Praveen et.al. | 2403.13659 | null |
| 2024-03-20 | VL-Mamba: Exploring State Space Models for Multimodal Learning | Yanyuan Qiao et.al. | 2403.13600 | null |
| 2024-03-17 | From Pixels to Predictions: Spectrogram and Vision Transformer for Better Time Series Forecasting | Zhen Zeng et.al. | 2403.11047 | null |
| 2024-03-26 | Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity | Zhuo Zhi et.al. | 2403.09428 | link |
| 2024-03-14 | Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation | Daniel Honerkamp et.al. | 2403.08605 | link |
| 2024-03-12 | A Multimodal Intermediate Fusion Network with Manifold Learning for Stress Detection | Morteza Bodaghi et.al. | 2403.08077 | null |
| 2024-03-10 | WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs | Deshun Yang et.al. | 2403.07944 | null |
| 2024-03-25 | FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks | Muhammad Saif Ullah Khan et.al. | 2403.06904 | null |
| 2024-03-11 | DiaLoc: An Iterative Approach to Embodied Dialog Localization | Chao Zhang et.al. | 2403.06846 | null |
| 2024-03-11 | Zero-Shot ECG Classification with Multimodal Learning and Test-time Clinical Knowledge Enhancement | Che Liu et.al. | 2403.06659 | link |
| 2024-03-07 | A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data | Marco D'Alessandro et.al. | 2403.04866 | link |
| 2024-03-05 | JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models | Arefa et.al. | 2403.04798 | link |
| 2024-03-07 | CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? | Ibrahim Alabdulmohsin et.al. | 2403.04547 | null |
| 2024-03-04 | Reactive Programming without Functions | Bjarno Oeyen et.al. | 2403.02296 | null |
| 2024-03-03 | Hyperspectral Image Analysis in Single-Modal and Multimodal setting using Deep Learning Techniques | Shivam Pande et.al. | 2403.01546 | null |
| 2024-03-02 | ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation | Moran Yanuka et.al. | 2403.01306 | link |
| 2024-03-02 | Adversarial Testing for Visual Grounding via Image-Aware Property Reduction | Zhiyuan Chang et.al. | 2403.01118 | null |
| 2024-02-29 | Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers | Tsai-Shien Chen et.al. | 2402.19479 | null |
| 2024-02-29 | FATE in MMLA: A Student-Centred Exploration of Fairness, Accountability, Transparency, and Ethics in Multimodal Learning Analytics | Yueqiao Jin et.al. | 2402.19071 | null |
| 2024-02-28 | Grounding Language Models for Visual Entity Recognition | Zilin Xiao et.al. | 2402.18695 | link |
| 2024-02-28 | Multimodal Learning To Improve Cardiac Late Mechanical Activation Detection From Cine MR Images | Jiarui Xing et.al. | 2402.18507 | null |
| 2024-02-28 | DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning | Jianxiong Li et.al. | 2402.18137 | null |
| 2024-02-27 | Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control | Thong Nguyen et.al. | 2402.17535 | link |
| 2024-02-27 | Curriculum Learning Meets Directed Acyclic Graph for Multimodal Emotion Recognition | Cam-Van Thi Nguyen et.al. | 2402.17269 | null |
| 2024-02-26 | GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | Yichi Zhang et.al. | 2402.16846 | null |
| 2024-02-26 | Gradient-Guided Modality Decoupling for Missing-Modality Robustness | Hao Wang et.al. | 2402.16318 | null |
| 2024-02-24 | FedMM: Federated Multi-Modal Learning with Modality Heterogeneity in Computational Pathology | Yuanzhe Peng et.al. | 2402.15858 | null |
| 2024-02-20 | GRAFFORD: A Benchmark Dataset for Testing the Knowledge of Object Affordances of Language and Vision Models | Sayantan Adak et.al. | 2402.12881 | link |
| 2024-02-19 | Multimodal Emotion Recognition from Raw Audio with Sinc-convolution | Xiaohui Zhang et.al. | 2402.11954 | null |
| 2024-02-18 | Efficient Multimodal Learning from Data-centric Perspective | Muyang He et.al. | 2402.11530 | link |