Multimodal - 2025-06 | Paper Arxiv Daily

Multimodal - 2025-06

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-06-27	XxaCT-NN: Structure Agnostic Multimodal Learning for Materials Science	Jithendaraa Subramanian et.al.	2507.01054	translate	read	null
2025-06-27	Test-Time Consistency in Vision Language Models	Shih-Han Chou et.al.	2506.22395	translate	read	null
2025-06-27	Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems	Abdulmomen Ghalkha et.al.	2506.22374	translate	read	null
2025-06-26	ImplicitQA: Going beyond frames towards Implicit Video Reasoning	Sirnam Swetha et.al.	2506.21742	translate	read	link
2025-06-28	G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation	Mohammed Rakib et.al.	2506.21514	translate	read	null
2025-06-26	V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling	Junwei You et.al.	2506.21041	translate	read	null
2025-06-26	TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence	Feng Jiang et.al.	2506.21028	translate	read	null
2025-06-26	Where is AIED Headed? Key Topics and Emerging Frontiers (2020-2024)	Shihui Feng et.al.	2506.20971	translate	read	null
2025-06-24	Emergence of Text Readability in Vision Language Models	Jaeyoo Park et.al.	2506.19389	translate	read	null
2025-06-27	Haptic-ACT – Pseudo Oocyte Manipulation by a Robot Using Multimodal Information and Action Chunking with Transformers	Pedro Miguel Uriguen Eljuri et.al.	2506.18212	translate	read	null
2025-06-21	Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?	Yuesheng Huang et.al.	2506.17623	translate	read	null
2025-06-24	AI-based Multimodal Biometrics for Detecting Smartphone Distractions: Application to Online Learning	Alvaro Becerra et.al.	2506.17364	translate	read	null
2025-06-20	With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You	Fabian Gröger et.al.	2506.16895	translate	read	null
2025-06-18	A Strong View-Free Baseline Approach for Single-View Image Guided Point Cloud Completion	Fangzhou Lin et.al.	2506.15747	translate	read	null
2025-06-18	Foundation of Affective Computing and Interaction	Changzeng Fu et.al.	2506.15497	translate	read	null
2025-06-18	video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models	Changli Tang et.al.	2506.15220	translate	read	link
2025-06-17	Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?	Nitesh Subedi et.al.	2506.14507	translate	read	link
2025-06-16	Comparison of ConvNeXt and Vision-Language Models for Breast Density Assessment in Screening Mammography	Yusdivia Molina-Román et.al.	2506.13964	translate	read	null
2025-06-16	A Survey on World Models Grounded in Acoustic Physical Information	Xiaoliang Chen et.al.	2506.13833	translate	read	link
2025-06-16	A Survey on Imitation Learning for Contact-Rich Tasks in Robotics	Toshiaki Tsuji et.al.	2506.13498	translate	read	null
2025-06-16	Fatigue-Aware Adaptive Interfaces for Wearable Devices Using Deep Learning	Yikan Wang et.al.	2506.13203	translate	read	null
2025-06-15	Learning to Fuse: Modality-Aware Adaptive Scheduling for Robust Multimodal Foundation Models	Liam Bennett et.al.	2506.12733	translate	read	null
2025-06-14	Advances in LLMs with Focus on Reasoning, Adaptability, Efficiency and Ethics	Asifullah Khan et.al.	2506.12365	translate	read	null
2025-06-14	GSDNet: Revisiting Incomplete Multimodal-Diffusion from Graph Spectrum Perspective for Conversation Emotion Recognition	Yuntao Shou et.al.	2506.12325	translate	read	null
2025-06-16	Improving Multimodal Learning Balance and Sufficiency through Data Remixing	Xiaoyu Ma et.al.	2506.11550	translate	read	link
2025-06-13	RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer	Haotian Ni et.al.	2506.11465	translate	read	null
2025-06-12	Combining Log Data and Collaborative Dialogue Features to Predict Project Quality in Middle School AI Education	Conrad Borchers et.al.	2506.11326	translate	read	null
2025-06-12	Developing a High-performance Framework for Speech Emotion Recognition in Naturalistic Conditions Challenge for Emotional Attribute Prediction	Thanathai Lertpetchpun et.al.	2506.10930	translate	read	null
2025-06-12	Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts	Guowei Zhong et.al.	2506.10452	translate	read	link
2025-06-09	Segment Any Architectural Facades (SAAF): An automatic segmentation model for building facades, walls and windows based on multimodal semantics guidance	Peilin Li et.al.	2506.09071	translate	read	null
2025-06-10	Enhancing Synthetic CT from CBCT via Multimodal Fusion: A Study on the Impact of CBCT Quality and Alignment	Maximilian Tschuchnig et.al.	2506.08716	translate	read	null
2025-06-10	MOSAIC-F: A Framework for Enhancing Students’ Oral Presentation Skills through Personalized Feedback	Alvaro Becerra et.al.	2506.08634	translate	read	null
2025-06-09	Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs	Jared Strader et.al.	2506.07454	translate	read	null
2025-06-08	A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis, and Treatment Planning	Jiachen Zhong et.al.	2506.07236	translate	read	null
2025-06-08	Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning	Tianyi Bai et.al.	2506.07227	translate	read	null
2025-06-08	A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge	Tarique Dahri et.al.	2506.07055	translate	read	null
2025-06-06	Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning	Sheng Chen et.al.	2506.06205	translate	read	null
2025-06-06	Bridging Perception and Action: Spatially-Grounded Mid-Level Representations for Robot Generalization	Jonathan Yang et.al.	2506.06196	translate	read	null
2025-06-06	MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory	Ana Carolina Condez et.al.	2506.05696	translate	read	null
2025-06-03	Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation	Israa A. Albadarneh et.al.	2506.05399	translate	read	null
2025-06-05	Towards Language-Augmented Multi-Agent Deep Reinforcement Learning	Maxime Toquebiau et.al.	2506.05236	translate	read	null
2025-06-05	Quantifying Cross-Modality Memorization in Vision-Language Models	Yuxin Wen et.al.	2506.05198	translate	read	null
2025-06-05	A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions	Anh Le et.al.	2506.05061	translate	read	null
2025-06-04	EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation	Cheng Zhang et.al.	2506.03652	translate	read	null
2025-06-03	Enriching Location Representation with Detailed Semantic Information	Junyuan Liu et.al.	2506.02744	translate	read	null
2025-06-02	Entity Image and Mixed-Modal Image Retrieval Datasets	Cristian-Ioan Blaga et.al.	2506.02291	translate	read	null
2025-06-02	Confidence-Aware Self-Distillation for Multimodal Sentiment Analysis with Incomplete Modalities	Yanxi Luo et.al.	2506.01490	translate	read	null
2025-06-02	Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark	Shuyu Yang et.al.	2506.01466	translate	read	null
2025-06-02	Agentic Episodic Control	Xidong Yang et.al.	2506.01442	translate	read	null
2025-06-01	Leveraging CLIP Encoder for Multimodal Emotion Recognition	Yehun Song et.al.	2506.00903	translate	read	null
2025-06-01	GIA-MIC: Multimodal Emotion Recognition with Gated Interactive Attention and Modality-Invariant Learning Constraints	Jiajun He et.al.	2506.00865	translate	read	null
2025-06-01	TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning	Jiaqi Luo et.al.	2506.00813	translate	read	null
2025-06-02	Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles	Zifu Wang et.al.	2505.23590	translate	read	link
