Audio Processing - 2026-03 | Paper Arxiv Daily


Publish Date	Title	Authors	PDF	Translate	Read	Code
2026-03-31	FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish	Daban Q. Jaff et.al.	2603.29892	translate	read	null
2026-03-31	LLM Probe: Evaluating LLMs for Low-Resource Languages	Hailay Kidu Teklehaymanot et.al.	2603.29517	translate	read	null
2026-03-31	Spoken Digit Recognition and Speaker Classification by Nonlinear Interfered Spin Wave-Based Physical Reservoir Computing	Sota Hikasa et.al.	2603.29311	translate	read	null
2026-03-31	Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition	Lukuang Dong et.al.	2603.29217	translate	read	null
2026-03-31	From Natural Alignment to Conditional Controllability in Multimodal Dialogue	Zeyu Jin et.al.	2603.29162	translate	read	null
2026-03-30	EBuddy: a workflow orchestrator for industrial human-machine collaboration	Michele Banfi et.al.	2603.28579	translate	read	null
2026-03-30	Voice-Controlled Scratch for Children with (Motor) Disabilities	Elias Goller et.al.	2603.28246	translate	read	null
2026-03-30	Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models	Luigi Curini et.al.	2603.28103	translate	read	null
2026-03-30	On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR	Ganesh Pavan Kartikeya Bharadwaj Kolluri et.al.	2603.27981	translate	read	null
2026-03-25	POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan	Marta Moscati et.al.	2603.24569	translate	read	null
2026-03-25	A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English	Dana Serditova et.al.	2603.24549	translate	read	null
2026-03-25	What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification	Massa Baali et.al.	2603.24432	translate	read	null
2026-03-25	When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools	Xingming Li et.al.	2603.24389	translate	read	null
2026-03-25	Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing	Rinku Sebastian et.al.	2603.24283	translate	read	null
2026-03-25	How Open is Open TTS? A Practical Evaluation of Open Source TTS Tools for Romanian	Teodora Răgman et.al.	2603.24116	translate	read	null
2026-03-25	From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs	Xiaoyong Guo et.al.	2603.24034	translate	read	null
2026-03-24	Echoes: A semantically-aligned music deepfake detection dataset	Octavian Pascu et.al.	2603.23667	translate	read	null
2026-03-24	Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages	Badr M. Abdullah et.al.	2603.23654	translate	read	null
2026-03-24	Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework	Zeinab Dehghani et.al.	2603.23625	translate	read	null
2026-03-24	MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates	Zikang Huang et.al.	2603.23048	translate	read	null
2026-03-24	When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse	Yihuan Huang et.al.	2603.22915	translate	read	null
2026-03-24	Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics	Naohiro Tawara et.al.	2603.22709	translate	read	null
2026-03-24	MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation	Di Zhu et.al.	2603.22677	translate	read	null
2026-03-23	Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks	Matías Pizarro et.al.	2603.22590	translate	read	null
2026-03-23	SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation	Lucas H. Ueda et.al.	2603.22252	translate	read	null
2026-03-23	SLURP-TN: Resource for Tunisian Dialect Spoken Language Understanding	Haroun Elleuch et.al.	2603.21940	translate	read	null
2026-03-23	Ara-Best-RQ: Multi Dialectal Arabic SSL	Haroun Elleuch et.al.	2603.21900	translate	read	null
2026-03-23	Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment	Lei Yang et.al.	2603.21808	translate	read	null
2026-03-23	RESPOND: Responsive Engagement Strategy for Predictive Orchestration and Dialogue	Meng-Chen Lee et.al.	2603.21682	translate	read	null
2026-03-22	HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit	Khushiyant et.al.	2603.21316	translate	read	null
2026-03-22	Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation	Soudeep Ghoshal et.al.	2603.21282	translate	read	null
2026-03-22	SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing	Jianyi Chen et.al.	2603.21073	translate	read	null
2026-03-20	Audio Avatar Fingerprinting: An Approach for Authorized Use of Voice Cloning in the Era of Synthetic Audio	Candice R. Gerstner et.al.	2603.20165	translate	read	null
2026-03-20	Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech	Niclas Pokel et.al.	2603.20112	translate	read	null
2026-03-20	LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families	Jianan Chen et.al.	2603.20042	translate	read	null
2026-03-20	Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?	Lokesh Kumar et.al.	2603.19831	translate	read	null
2026-03-20	Borderless Long Speech Synthesis	Xingchen Song et.al.	2603.19798	translate	read	null
2026-03-19	Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction	Anh-Tuan Dao et.al.	2603.18657	translate	read	null
2026-03-18	Impact of automatic speech recognition quality on Alzheimer’s disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation	Himadri Samanta et.al.	2603.18239	translate	read	null
2026-03-18	Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition	Yuxiang Mei et.al.	2603.17558	translate	read	null
2026-03-17	Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network	Protopopov Alexey et.al.	2603.16972	translate	read	null
2026-03-17	On the Emotion Understanding of Synthesized Speech	Yuan Ge et.al.	2603.16483	translate	read	null
2026-03-17	RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery	Abhishek Kumar et.al.	2603.16411	translate	read	null
2026-03-17	Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus	Martina Simonotti et.al.	2603.16258	translate	read	null
2026-03-17	Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR	Quy-Anh Dang et.al.	2603.16184	translate	read	null
2026-03-16	Lost in Transcription: Subtitle Errors in Automatic Speech Recognition Reduce Speaker and Content Evaluations	Kowe Kadoma et.al.	2603.15807	translate	read	null
2026-03-16	SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia	Pengfei Yue et.al.	2603.15409	translate	read	null
2026-03-16	Tagarela - A Portuguese speech dataset from podcasts	Frederico Santos de Oliveira et.al.	2603.15326	translate	read	null
2026-03-16	Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization	Shan Jiang et.al.	2603.15261	translate	read	null
2026-03-16	LLMs and Speech: Integration vs. Combination	Robin Schmitt et.al.	2603.15045	translate	read	null
2026-03-16	PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation	Vamshi Nallaguntla et.al.	2603.15037	translate	read	null
2026-03-16	Vietnamese Automatic Speech Recognition: A Revisit	Thi Vu et.al.	2603.14779	translate	read	null
2026-03-16	Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments	Anacin et.al.	2603.14767	translate	read	null
2026-03-15	Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations	Deok-Hyeon Cho et.al.	2603.14432	translate	read	null
2026-03-15	CodecMOS-Accent: A MOS Benchmark of Resynthesized and TTS Speech from Neural Codecs Across English Accents	Wen-Chin Huang et.al.	2603.14328	translate	read	null
2026-03-12	Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition	Umberto Cappellazzo et.al.	2603.12046	translate	read	null
2026-03-12	ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping	Ivan Yakovlev et.al.	2603.11841	translate	read	null
2026-03-12	Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2	Suvendu Sekhar Mohanty et.al.	2603.11683	translate	read	null
2026-03-12	RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis	Yongjoon Lee et.al.	2603.11678	translate	read	null
2026-03-11	Continued Pretraining for Low-Resource Swahili ASR: Achieving State-of-the-Art Performance with Minimal Labeled Data	Hillary Mutisya et.al.	2603.11378	translate	read	null
2026-03-11	Duration Aware Scheduling for ASR Serving Under Workload Drift	Darshan Makwana et.al.	2603.11273	translate	read	null
2026-03-11	Huntington Disease Automatic Speech Recognition with Biomarker Supervision	Charles L. Wang et.al.	2603.11168	translate	read	null
2026-03-11	Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming Automatic Speech Recognition	Yinfeng Xia et.al.	2603.11123	translate	read	null
2026-03-11	V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation	Yan-Bo Lin et.al.	2603.11042	translate	read	null
2026-03-11	Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation	Thomas Thebaud et.al.	2603.10827	translate	read	null
2026-03-11	Probabilistic Verification of Voice Anti-Spoofing Models	Evgeny Kushnir et.al.	2603.10713	translate	read	null
2026-03-11	AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow	Duojia Li et.al.	2603.10701	translate	read	null
2026-03-11	Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing	Hao Shi et.al.	2603.10587	translate	read	null
2026-03-11	FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System	Kaituo Xu et.al.	2603.10420	translate	read	null
2026-03-11	NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction	Jun Rekimoto et.al.	2603.10324	translate	read	null
2026-03-10	SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases	Laya Iyer et.al.	2603.09853	translate	read	null
2026-03-10	A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition	Dimme de Groot et.al.	2603.09725	translate	read	null
2026-03-10	Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models	Haoyuan Yang et.al.	2603.09120	translate	read	null
2026-03-10	Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition	Jordan Prescott et.al.	2603.09034	translate	read	null
2026-03-09	Universal Speech Content Factorization	Henry Li Xinyuan et.al.	2603.08977	translate	read	null
2026-03-09	NLE: Non-autoregressive LLM-based ASR by Transcript Editing	Avihu Dekel et.al.	2603.08397	translate	read	null
2026-03-09	Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data	Pol Buitrago et.al.	2603.08249	translate	read	null
2026-03-09	Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks	Pol Buitrago et.al.	2603.08231	translate	read	null
2026-03-09	Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS	Rania Al-Sabbagh et.al.	2603.08125	translate	read	null
2026-03-09	Language-Invariant Multilingual Speaker Verification for the TidyVoice 2026 Challenge	Ze Li et.al.	2603.08092	translate	read	null
2026-03-09	Designing a Generative AI-Assisted Music Psychotherapy Tool for Deaf and Hard-of-Hearing Individuals	Youjin Choi et.al.	2603.07963	translate	read	null
2026-03-08	Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR	Rishikesh Kumar Sharma et.al.	2603.07554	translate	read	null
2026-03-08	Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech	Tajamul Ashraf et.al.	2603.07513	translate	read	null
2026-03-07	Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning	Wenjie Tian et.al.	2603.07263	translate	read	null
2026-03-07	The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication	Hanlong Li et.al.	2603.07072	translate	read	null
2026-03-06	Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning	Yuchen Zhang et.al.	2603.06505	translate	read	null
2026-03-06	Continual Adaptation for Pacific Indigenous Speech Recognition	Yang Xiao et.al.	2603.06310	translate	read	null
2026-03-06	Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding	Hoseong Ahn et.al.	2603.06193	translate	read	null
2026-03-06	Is it Me? Toward Self-Extension to AI Avatars in Virtual Reality	Jieying Zhang et.al.	2603.06030	translate	read	null
2026-03-06	How Well Do Current Speech Deepfake Detection Methods Generalize to the Real World?	Daixian Li et.al.	2603.05852	translate	read	null
2026-03-06	Which Data Matter? Embedding-Based Data Selection for Speech Recognition	Zakaria Aldeneh et.al.	2603.05819	translate	read	null
2026-03-06	Activation Steering for Accent Adaptation in Speech Foundation Models	Jinuo Sun et.al.	2603.05813	translate	read	null
2026-03-05	Koopman Regularized Deep Speech Disentanglement for Speaker Verification	Nikos Chazaridis et.al.	2603.05577	translate	read	null
2026-03-05	Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection	Junchuan Zhao et.al.	2603.05373	translate	read	null
2026-03-05	PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration	Mohammad Javad Ranjbar Kalahroodi et.al.	2603.05314	translate	read	null
2026-03-05	Visual-Informed Speech Enhancement Using Attention-Based Beamforming	Chihyun Liu et.al.	2603.05270	translate	read	null
2026-03-05	Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography	Ting-Hui Cheng et.al.	2603.05267	translate	read	null
2026-03-05	Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards	Linghan Fang et.al.	2603.05231	translate	read	null
2026-03-05	Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition	Mengze Hong et.al.	2603.04945	translate	read	null
2026-03-05	Spectral dynamics reservoir computing for high-speed hardware-efficient neuromorphic processing	Jiaxuan Chen et.al.	2603.04901	translate	read	null
2026-03-05	WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech	Aurchi Chowdhury et.al.	2603.04809	translate	read	null
2026-03-05	When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper	Akif Islam et.al.	2603.04710	translate	read	null
2026-03-04	ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis	Youngwon Choi et.al.	2603.04219	translate	read	null
2026-03-04	Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement	Fei Su et.al.	2603.03811	translate	read	null
2026-03-03	An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization	Epshita Jahan et.al.	2603.03158	translate	read	null
2026-03-03	Speech recognition assisted by large language models to command software orally – Application to an augmented and virtual reality web app for immersive molecular graphics	Fabio Cortes Rodriguez et.al.	2603.02901	translate	read	null
2026-03-03	SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition	Giusy Spacone et.al.	2603.02847	translate	read	null
2026-03-03	Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge	Dhanya E et.al.	2603.02813	translate	read	null
2026-03-02	ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models	Xiaoyu Yi et.al.	2603.01984	translate	read	null
2026-03-02	VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications	Loan Do et.al.	2603.01894	translate	read	null
2026-03-02	More Data, Fewer Diacritics: Scaling Arabic TTS	Ahmed Musleh et.al.	2603.01622	translate	read	null
2026-03-02	The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge	Ya Jiang et.al.	2603.01415	translate	read	null
2026-03-02	End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation	Minghui Wu et.al.	2603.01382	translate	read	null
2026-03-02	DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement	Minghui Wu et.al.	2603.01369	translate	read	null
2026-03-01	VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling	Yanir Marmor et.al.	2603.01270	translate	read	null
2026-03-01	SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation	Hongrui Wang et.al.	2603.01101	translate	read	null
2026-03-01	Using Songs to Improve Kazakh Automatic Speech Recognition	Rustem Yeshpanov et.al.	2603.00961	translate	read	null
2026-03-01	Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages	Kaushal Santosh Bhogale et.al.	2603.00941	translate	read	null
