Audio Processing - 2026-02 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2026-02-28	Polynomial Mixing for Efficient Self-supervised Speech Encoders	Eva Feillet et.al.	2603.00683	translate	read	null
2026-02-28	CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction	Yinghao Ma et.al.	2603.00610	translate	read	null
2026-02-28	Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation	Jinhan Xu et.al.	2603.00576	translate	read	null
2026-02-28	Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion	Sen Zhang et.al.	2603.00563	translate	read	null
2026-02-26	Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems	Siyuan Liu et.al.	2602.23266	translate	read	null
2026-02-26	Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment	Sanjid Hasan et.al.	2602.23070	translate	read	null
2026-02-26	A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment	Zarif Ishmam et.al.	2602.22935	translate	read	null
2026-02-26	Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing	An-Ci Peng et.al.	2602.22522	translate	read	null
2026-02-25	TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition	Cheng-Yeh Yang et.al.	2602.22039	translate	read	null
2026-02-25	Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization	MD. Sagor Chowdhury et.al.	2602.21741	translate	read	null
2026-02-25	Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration	Tangsang Chongbang et.al.	2602.21647	translate	read	null
2026-02-25	A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation	Chun-wei Ho et.al.	2602.21476	translate	read	null
2026-02-24	823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio	Ratnajit Dhar et.al.	2602.21183	translate	read	null
2026-02-24	Training-Free Intelligibility-Guided Observation Addition for Noisy ASR	Haoyang Li et.al.	2602.20967	translate	read	null
2026-02-23	An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction	Guanting Shen et.al.	2602.20219	translate	read	null
2026-02-23	Can You Tell It’s AI? Human Perception of Synthetic Voices in Vishing Scenarios	Zoha Hayat Bhatti et.al.	2602.20061	translate	read	null
2026-02-23	Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling	Yungang Yi et.al.	2602.19816	translate	read	null
2026-02-22	Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition	Alexandros Haliassos et.al.	2602.19316	translate	read	null
2026-02-21	Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation	Yonathan Ron et.al.	2602.18966	translate	read	null
2026-02-21	ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models	Zefang Liu et.al.	2602.18721	translate	read	null
2026-02-18	Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models	Joseph Bingham et.al.	2602.18507	translate	read	null
2026-02-20	MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows	Takuhiro Kaneko et.al.	2602.18104	translate	read	null
2026-02-19	MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions	Rebecca Salganik et.al.	2602.17769	translate	read	null
2026-02-19	Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment	Ivan Rinaldi et.al.	2602.17599	translate	read	null
2026-02-19	Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks	Nuno Saavedra et.al.	2602.17394	translate	read	null
2026-02-13	Speech to Speech Synthesis for Voice Impersonation	Bjorn Johnson et.al.	2602.16721	translate	read	null
2026-02-18	Multi-Channel Replay Speech Detection using Acoustic Maps	Michael Neri et.al.	2602.16399	translate	read	null
2026-02-18	How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection	Yixuan Xiao et.al.	2602.16343	translate	read	null
2026-02-17	LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models	Ahmed Khaled Khamis et.al.	2602.15675	translate	read	null
2026-02-17	Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios	Yiming Yang et.al.	2602.15519	translate	read	null
2026-02-17	Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits	Gilad Nurko et.al.	2602.15405	translate	read	null
2026-02-16	Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis	Frederik Rautenberg et.al.	2602.14686	translate	read	null
2026-02-16	Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer’s Disease Detection via Speech	Xiao Wei et.al.	2602.14655	translate	read	null
2026-02-16	CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia	Yacouba Kaloga et.al.	2602.14584	translate	read	null
2026-02-15	From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset	Jandad Jahani et.al.	2602.14062	translate	read	null
2026-02-15	Eureka-Audio: Triggering Audio Intelligence in Compact Language Models	Dan Zhang et.al.	2602.13954	translate	read	null
2026-02-14	voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models	Aju Ani Justus et.al.	2602.13928	translate	read	null
2026-02-14	ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification	Amro Asali et.al.	2602.13761	translate	read	null
2026-02-13	ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark	Tung X. Nguyen et.al.	2602.12911	translate	read	null
2026-02-13	Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting	Jing Xu et.al.	2602.12746	translate	read	null
2026-02-13	PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People	Mahdi Haghighat Joo et.al.	2602.12597	translate	read	null
2026-02-13	Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR	Jaeyoung Lee et.al.	2602.12546	translate	read	null
2026-02-12	“Sorry, I Didn’t Catch That”: How Speech Models Miss What Matters Most	Kaitlyn Zhou et.al.	2602.12249	translate	read	null
2026-02-12	Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications	Manjunath Kudlur et.al.	2602.12241	translate	read	null
2026-02-12	On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy	Luiz Pereira et.al.	2602.12009	translate	read	null
2026-02-12	TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR	Qingshun She et.al.	2602.11546	translate	read	null
2026-02-12	SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis	Yifan Liang et.al.	2602.11477	translate	read	null
2026-02-11	Voxtral Realtime	Alexander H. Liu et.al.	2602.11298	translate	read	null
2026-02-11	Self-Supervised Learning for Speaker Recognition: A study and review	Theo Lepage et.al.	2602.10829	translate	read	null
2026-02-05	Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language	Isaac Wiafe et.al.	2602.05406	translate	read	null
2026-02-03	Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization	Sai Sindhur Malleni et.al.	2602.04900	translate	read	null
2026-02-04	Speaker-Aware Simulation Improves Conversational Speech Recognition	Máté Gedeon et.al.	2602.04776	translate	read	null
2026-02-04	HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing	Xuenan Xu et.al.	2602.04535	translate	read	null
2026-02-04	Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement	Chien-Chun Wang et.al.	2602.04307	translate	read	null
2026-02-04	Frontend Token Enhancement for Token-Based Speech Recognition	Takanori Ashihara et.al.	2602.04217	translate	read	null
2026-02-03	Mići Princ – A Little Boy Teaching Speech Technologies the Chakavian Dialect	Nikola Ljubešić et.al.	2602.03245	translate	read	null
2026-02-03	Rethinking Music Captioning with Music Metadata LLMs	Irmak Bukey et.al.	2602.03023	translate	read	null
2026-02-02	WAXAL: A Large-Scale Multilingual African Language Speech Corpus	Abdoulaye Diack et.al.	2602.02734	translate	read	null
2026-02-01	VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis	Chengyuan Ma et.al.	2602.02591	translate	read	null
2026-02-02	DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild	Arnab Das et.al.	2602.02286	translate	read	null
2026-02-02	Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition	Wonjun Lee et.al.	2602.01967	translate	read	null
2026-02-02	LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency	Jaejun Lee et.al.	2602.01908	translate	read	null
2026-02-02	Joint Optimization of ASV and CM tasks: BTUEF Team’s Submission for WildSpoof Challenge	Oguzhan Kurnaz et.al.	2602.01722	translate	read	null
2026-02-02	BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition	Hyunsik Kim et.al.	2602.01717	translate	read	null
2026-02-01	Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings	Mariëtte Olijslager et.al.	2602.01363	translate	read	null
2026-02-01	EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech	Besher Hassan et.al.	2602.01170	translate	read	null
2026-02-01	HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection	Zhili Nicholas Liang et.al.	2602.01032	translate	read	null
2026-02-01	Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages	Yang Xiao et.al.	2602.01008	translate	read	null
2026-02-01	MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA	Yutong Song et.al.	2602.00981	translate	read	null
