Audio Processing - 2026-02

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-02-28 | Polynomial Mixing for Efficient Self-supervised Speech Encoders | Eva Feillet et.al. | 2603.00683 | translate | read | null |
| 2026-02-28 | CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction | Yinghao Ma et.al. | 2603.00610 | translate | read | null |
| 2026-02-28 | Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation | Jinhan Xu et.al. | 2603.00576 | translate | read | null |
| 2026-02-28 | Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion | Sen Zhang et.al. | 2603.00563 | translate | read | null |
| 2026-02-26 | Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems | Siyuan Liu et.al. | 2602.23266 | translate | read | null |
| 2026-02-26 | Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment | Sanjid Hasan et.al. | 2602.23070 | translate | read | null |
| 2026-02-26 | A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment | Zarif Ishmam et.al. | 2602.22935 | translate | read | null |
| 2026-02-26 | Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing | An-Ci Peng et.al. | 2602.22522 | translate | read | null |
| 2026-02-25 | TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition | Cheng-Yeh Yang et.al. | 2602.22039 | translate | read | null |
| 2026-02-25 | Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization | MD. Sagor Chowdhury et.al. | 2602.21741 | translate | read | null |
| 2026-02-25 | Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration | Tangsang Chongbang et.al. | 2602.21647 | translate | read | null |
| 2026-02-25 | A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation | Chun-wei Ho et.al. | 2602.21476 | translate | read | null |
| 2026-02-24 | 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio | Ratnajit Dhar et.al. | 2602.21183 | translate | read | null |
| 2026-02-24 | Training-Free Intelligibility-Guided Observation Addition for Noisy ASR | Haoyang Li et.al. | 2602.20967 | translate | read | null |
| 2026-02-23 | An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction | Guanting Shen et.al. | 2602.20219 | translate | read | null |
| 2026-02-23 | Can You Tell It’s AI? Human Perception of Synthetic Voices in Vishing Scenarios | Zoha Hayat Bhatti et.al. | 2602.20061 | translate | read | null |
| 2026-02-23 | Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling | Yungang Yi et.al. | 2602.19816 | translate | read | null |
| 2026-02-22 | Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition | Alexandros Haliassos et.al. | 2602.19316 | translate | read | null |
| 2026-02-21 | Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation | Yonathan Ron et.al. | 2602.18966 | translate | read | null |
| 2026-02-21 | ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models | Zefang Liu et.al. | 2602.18721 | translate | read | null |
| 2026-02-18 | Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models | Joseph Bingham et.al. | 2602.18507 | translate | read | null |
| 2026-02-20 | MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows | Takuhiro Kaneko et.al. | 2602.18104 | translate | read | null |
| 2026-02-19 | MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions | Rebecca Salganik et.al. | 2602.17769 | translate | read | null |
| 2026-02-19 | Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment | Ivan Rinaldi et.al. | 2602.17599 | translate | read | null |
| 2026-02-19 | Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks | Nuno Saavedra et.al. | 2602.17394 | translate | read | null |
| 2026-02-13 | Speech to Speech Synthesis for Voice Impersonation | Bjorn Johnson et.al. | 2602.16721 | translate | read | null |
| 2026-02-18 | Multi-Channel Replay Speech Detection using Acoustic Maps | Michael Neri et.al. | 2602.16399 | translate | read | null |
| 2026-02-18 | How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection | Yixuan Xiao et.al. | 2602.16343 | translate | read | null |
| 2026-02-17 | LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models | Ahmed Khaled Khamis et.al. | 2602.15675 | translate | read | null |
| 2026-02-17 | Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios | Yiming Yang et.al. | 2602.15519 | translate | read | null |
| 2026-02-17 | Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits | Gilad Nurko et.al. | 2602.15405 | translate | read | null |
| 2026-02-16 | Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis | Frederik Rautenberg et.al. | 2602.14686 | translate | read | null |
| 2026-02-16 | Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer’s Disease Detection via Speech | Xiao Wei et.al. | 2602.14655 | translate | read | null |
| 2026-02-16 | CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia | Yacouba Kaloga et.al. | 2602.14584 | translate | read | null |
| 2026-02-15 | From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset | Jandad Jahani et.al. | 2602.14062 | translate | read | null |
| 2026-02-15 | Eureka-Audio: Triggering Audio Intelligence in Compact Language Models | Dan Zhang et.al. | 2602.13954 | translate | read | null |
| 2026-02-14 | voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models | Aju Ani Justus et.al. | 2602.13928 | translate | read | null |
| 2026-02-14 | ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification | Amro Asali et.al. | 2602.13761 | translate | read | null |
| 2026-02-13 | ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark | Tung X. Nguyen et.al. | 2602.12911 | translate | read | null |
| 2026-02-13 | Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting | Jing Xu et.al. | 2602.12746 | translate | read | null |
| 2026-02-13 | PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People | Mahdi Haghighat Joo et.al. | 2602.12597 | translate | read | null |
| 2026-02-13 | Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR | Jaeyoung Lee et.al. | 2602.12546 | translate | read | null |
| 2026-02-12 | “Sorry, I Didn’t Catch That”: How Speech Models Miss What Matters Most | Kaitlyn Zhou et.al. | 2602.12249 | translate | read | null |
| 2026-02-12 | Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications | Manjunath Kudlur et.al. | 2602.12241 | translate | read | null |
| 2026-02-12 | On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy | Luiz Pereira et.al. | 2602.12009 | translate | read | null |
| 2026-02-12 | TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR | Qingshun She et.al. | 2602.11546 | translate | read | null |
| 2026-02-12 | SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis | Yifan Liang et.al. | 2602.11477 | translate | read | null |
| 2026-02-11 | Voxtral Realtime | Alexander H. Liu et.al. | 2602.11298 | translate | read | null |
| 2026-02-11 | Self-Supervised Learning for Speaker Recognition: A study and review | Theo Lepage et.al. | 2602.10829 | translate | read | null |
| 2026-02-05 | Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language | Isaac Wiafe et.al. | 2602.05406 | translate | read | null |
| 2026-02-03 | Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization | Sai Sindhur Malleni et.al. | 2602.04900 | translate | read | null |
| 2026-02-04 | Speaker-Aware Simulation Improves Conversational Speech Recognition | Máté Gedeon et.al. | 2602.04776 | translate | read | null |
| 2026-02-04 | HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing | Xuenan Xu et.al. | 2602.04535 | translate | read | null |
| 2026-02-04 | Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement | Chien-Chun Wang et.al. | 2602.04307 | translate | read | null |
| 2026-02-04 | Frontend Token Enhancement for Token-Based Speech Recognition | Takanori Ashihara et.al. | 2602.04217 | translate | read | null |
| 2026-02-03 | Mići Princ – A Little Boy Teaching Speech Technologies the Chakavian Dialect | Nikola Ljubešić et.al. | 2602.03245 | translate | read | null |
| 2026-02-03 | Rethinking Music Captioning with Music Metadata LLMs | Irmak Bukey et.al. | 2602.03023 | translate | read | null |
| 2026-02-02 | WAXAL: A Large-Scale Multilingual African Language Speech Corpus | Abdoulaye Diack et.al. | 2602.02734 | translate | read | null |
| 2026-02-01 | VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis | Chengyuan Ma et.al. | 2602.02591 | translate | read | null |
| 2026-02-02 | DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild | Arnab Das et.al. | 2602.02286 | translate | read | null |
| 2026-02-02 | Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition | Wonjun Lee et.al. | 2602.01967 | translate | read | null |
| 2026-02-02 | LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency | Jaejun Lee et.al. | 2602.01908 | translate | read | null |
| 2026-02-02 | Joint Optimization of ASV and CM tasks: BTUEF Team’s Submission for WildSpoof Challenge | Oguzhan Kurnaz et.al. | 2602.01722 | translate | read | null |
| 2026-02-02 | BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition | Hyunsik Kim et.al. | 2602.01717 | translate | read | null |
| 2026-02-01 | Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings | Mariëtte Olijslager et.al. | 2602.01363 | translate | read | null |
| 2026-02-01 | EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech | Besher Hassan et.al. | 2602.01170 | translate | read | null |
| 2026-02-01 | HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection | Zhili Nicholas Liang et.al. | 2602.01032 | translate | read | null |
| 2026-02-01 | Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages | Yang Xiao et.al. | 2602.01008 | translate | read | null |
| 2026-02-01 | MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA | Yutong Song et.al. | 2602.00981 | translate | read | null |
