Audio Processing - 2026-03
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-03-31 | FLEURS-Kobani: Extending the FLEURS Dataset for Northern Kurdish | Daban Q. Jaff et.al. | 2603.29892 | translate | read | null |
| 2026-03-31 | LLM Probe: Evaluating LLMs for Low-Resource Languages | Hailay Kidu Teklehaymanot et.al. | 2603.29517 | translate | read | null |
| 2026-03-31 | Spoken Digit Recognition and Speaker Classification by Nonlinear Interfered Spin Wave-Based Physical Reservoir Computing | Sota Hikasa et.al. | 2603.29311 | translate | read | null |
| 2026-03-31 | Advancing LLM-based phoneme-to-grapheme for multilingual speech recognition | Lukuang Dong et.al. | 2603.29217 | translate | read | null |
| 2026-03-31 | From Natural Alignment to Conditional Controllability in Multimodal Dialogue | Zeyu Jin et.al. | 2603.29162 | translate | read | null |
| 2026-03-30 | EBuddy: a workflow orchestrator for industrial human-machine collaboration | Michele Banfi et.al. | 2603.28579 | translate | read | null |
| 2026-03-30 | Voice-Controlled Scratch for Children with (Motor) Disabilities | Elias Goller et.al. | 2603.28246 | translate | read | null |
| 2026-03-30 | Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models | Luigi Curini et.al. | 2603.28103 | translate | read | null |
| 2026-03-30 | On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR | Ganesh Pavan Kartikeya Bharadwaj Kolluri et.al. | 2603.27981 | translate | read | null |
| 2026-03-25 | POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan | Marta Moscati et.al. | 2603.24569 | translate | read | null |
| 2026-03-25 | A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English | Dana Serditova et.al. | 2603.24549 | translate | read | null |
| 2026-03-25 | What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification | Massa Baali et.al. | 2603.24432 | translate | read | null |
| 2026-03-25 | When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools | Xingming Li et.al. | 2603.24389 | translate | read | null |
| 2026-03-25 | Bridging Biological Hearing and Neuromorphic Computing: End-to-End Time-Domain Audio Signal Processing with Reservoir Computing | Rinku Sebastian et.al. | 2603.24283 | translate | read | null |
| 2026-03-25 | How Open is Open TTS? A Practical Evaluation of Open Source TTS Tools for Romanian | Teodora Răgman et.al. | 2603.24116 | translate | read | null |
| 2026-03-25 | From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs | Xiaoyong Guo et.al. | 2603.24034 | translate | read | null |
| 2026-03-24 | Echoes: A semantically-aligned music deepfake detection dataset | Octavian Pascu et.al. | 2603.23667 | translate | read | null |
| 2026-03-24 | Ethio-ASR: Joint Multilingual Speech Recognition and Language Identification for Ethiopian Languages | Badr M. Abdullah et.al. | 2603.23654 | translate | read | null |
| 2026-03-24 | Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework | Zeinab Dehghani et.al. | 2603.23625 | translate | read | null |
| 2026-03-24 | MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates | Zikang Huang et.al. | 2603.23048 | translate | read | null |
| 2026-03-24 | When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse | Yihuan Huang et.al. | 2603.22915 | translate | read | null |
| 2026-03-24 | Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics | Naohiro Tawara et.al. | 2603.22709 | translate | read | null |
| 2026-03-24 | MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation | Di Zhu et.al. | 2603.22677 | translate | read | null |
| 2026-03-23 | Precision-Varying Prediction (PVP): Robustifying ASR systems against adversarial attacks | Matías Pizarro et.al. | 2603.22590 | translate | read | null |
| 2026-03-23 | SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation | Lucas H. Ueda et.al. | 2603.22252 | translate | read | null |
| 2026-03-23 | SLURP-TN : Resource for Tunisian Dialect Spoken Language Understanding | Haroun Elleuch et.al. | 2603.21940 | translate | read | null |
| 2026-03-23 | Ara-Best-RQ: Multi Dialectal Arabic SSL | Haroun Elleuch et.al. | 2603.21900 | translate | read | null |
| 2026-03-23 | Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment | Lei Yang et.al. | 2603.21808 | translate | read | null |
| 2026-03-23 | RESPOND: Responsive Engagement Strategy for Predictive Orchestration and Dialogue | Meng-Chen Lee et.al. | 2603.21682 | translate | read | null |
| 2026-03-22 | HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit | Khushiyant et.al. | 2603.21316 | translate | read | null |
| 2026-03-22 | Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation | Soudeep Ghoshal et.al. | 2603.21282 | translate | read | null |
| 2026-03-22 | SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing | Jianyi Chen et.al. | 2603.21073 | translate | read | null |
| 2026-03-20 | Audio Avatar Fingerprinting: An Approach for Authorized Use of Voice Cloning in the Era of Synthetic Audio | Candice R. Gerstner et.al. | 2603.20165 | translate | read | null |
| 2026-03-20 | Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech | Niclas Pokel et.al. | 2603.20112 | translate | read | null |
| 2026-03-20 | LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families | Jianan Chen et.al. | 2603.20042 | translate | read | null |
| 2026-03-20 | Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech? | Lokesh Kumar et.al. | 2603.19831 | translate | read | null |
| 2026-03-20 | Borderless Long Speech Synthesis | Xingchen Song et.al. | 2603.19798 | translate | read | null |
| 2026-03-19 | Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction | Anh-Tuan Dao et.al. | 2603.18657 | translate | read | null |
| 2026-03-18 | Impact of automatic speech recognition quality on Alzheimer’s disease detection from spontaneous speech: a reproducible benchmark study with lexical modeling and statistical validation | Himadri Samanta et.al. | 2603.18239 | translate | read | null |
| 2026-03-18 | Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition | Yuxiang Mei et.al. | 2603.17558 | translate | read | null |
| 2026-03-17 | Over-the-air White-box Attack on the Wav2Vec Speech Recognition Neural Network | Protopopov Alexey et.al. | 2603.16972 | translate | read | null |
| 2026-03-17 | On the Emotion Understanding of Synthesized Speech | Yuan Ge et.al. | 2603.16483 | translate | read | null |
| 2026-03-17 | RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery | Abhishek Kumar et.al. | 2603.16411 | translate | read | null |
| 2026-03-17 | Is Semi-Automatic Transcription Useful in Corpus Creation? Preliminary Considerations on the KIParla Corpus | Martina Simonotti et.al. | 2603.16258 | translate | read | null |
| 2026-03-17 | Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR | Quy-Anh Dang et.al. | 2603.16184 | translate | read | null |
| 2026-03-16 | Lost in Transcription: Subtitle Errors in Automatic Speech Recognition Reduce Speaker and Content Evaluations | Kowe Kadoma et.al. | 2603.15807 | translate | read | null |
| 2026-03-16 | SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia | Pengfei Yue et.al. | 2603.15409 | translate | read | null |
| 2026-03-16 | Tagarela - A Portuguese speech dataset from podcasts | Frederico Santos de Oliveira et.al. | 2603.15326 | translate | read | null |
| 2026-03-16 | Two-Stage Adaptation for Non-Normative Speech Recognition: Revisiting Speaker-Independent Initialization for Personalization | Shan Jiang et.al. | 2603.15261 | translate | read | null |
| 2026-03-16 | LLMs and Speech: Integration vs. Combination | Robin Schmitt et.al. | 2603.15045 | translate | read | null |
| 2026-03-16 | PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation | Vamshi Nallaguntla et.al. | 2603.15037 | translate | read | null |
| 2026-03-16 | Vietnamese Automatic Speech Recognition: A Revisit | Thi Vu et.al. | 2603.14779 | translate | read | null |
| 2026-03-16 | Investigating the Impact of Speech Enhancement on Audio Deepfake Detection in Noisy Environments | Anacin et.al. | 2603.14767 | translate | read | null |
| 2026-03-15 | Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations | Deok-Hyeon Cho et.al. | 2603.14432 | translate | read | null |
| 2026-03-15 | CodecMOS-Accent: A MOS Benchmark of Resynthesized and TTS Speech from Neural Codecs Across English Accents | Wen-Chin Huang et.al. | 2603.14328 | translate | read | null |
| 2026-03-12 | Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition | Umberto Cappellazzo et.al. | 2603.12046 | translate | read | null |
| 2026-03-12 | ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping | Ivan Yakovlev et.al. | 2603.11841 | translate | read | null |
| 2026-03-12 | Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2 | Suvendu Sekhar Mohanty et.al. | 2603.11683 | translate | read | null |
| 2026-03-12 | RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis | Yongjoon Lee et.al. | 2603.11678 | translate | read | null |
| 2026-03-11 | Continued Pretraining for Low-Resource Swahili ASR: Achieving State-of-the-Art Performance with Minimal Labeled Data | Hillary Mutisya et.al. | 2603.11378 | translate | read | null |
| 2026-03-11 | Duration Aware Scheduling for ASR Serving Under Workload Drift | Darshan Makwana et.al. | 2603.11273 | translate | read | null |
| 2026-03-11 | Huntington Disease Automatic Speech Recognition with Biomarker Supervision | Charles L. Wang et.al. | 2603.11168 | translate | read | null |
| 2026-03-11 | Uni-ASR: Unified LLM-Based Architecture for Non-Streaming and Streaming Automatic Speech Recognition | Yinfeng Xia et.al. | 2603.11123 | translate | read | null |
| 2026-03-11 | V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation | Yan-Bo Lin et.al. | 2603.11042 | translate | read | null |
| 2026-03-11 | Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation | Thomas Thebaud et.al. | 2603.10827 | translate | read | null |
| 2026-03-11 | Probabilistic Verification of Voice Anti-Spoofing Models | Evgeny Kushnir et.al. | 2603.10713 | translate | read | null |
| 2026-03-11 | AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow | Duojia Li et.al. | 2603.10701 | translate | read | null |
| 2026-03-11 | Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing | Hao Shi et.al. | 2603.10587 | translate | read | null |
| 2026-03-11 | FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System | Kaituo Xu et.al. | 2603.10420 | translate | read | null |
| 2026-03-11 | NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction | Jun Rekimoto et.al. | 2603.10324 | translate | read | null |
| 2026-03-10 | SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases | Laya Iyer et.al. | 2603.09853 | translate | read | null |
| 2026-03-10 | A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition | Dimme de Groot et.al. | 2603.09725 | translate | read | null |
| 2026-03-10 | Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models | Haoyuan Yang et.al. | 2603.09120 | translate | read | null |
| 2026-03-10 | Trade-offs Between Capacity and Robustness in Neural Audio Codecs for Adversarially Robust Speech Recognition | Jordan Prescott et.al. | 2603.09034 | translate | read | null |
| 2026-03-09 | Universal Speech Content Factorization | Henry Li Xinyuan et.al. | 2603.08977 | translate | read | null |
| 2026-03-09 | NLE: Non-autoregressive LLM-based ASR by Transcript Editing | Avihu Dekel et.al. | 2603.08397 | translate | read | null |
| 2026-03-09 | Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data | Pol Buitrago et.al. | 2603.08249 | translate | read | null |
| 2026-03-09 | Quantifying Cross-Lingual Transfer in Paralinguistic Speech Tasks | Pol Buitrago et.al. | 2603.08231 | translate | read | null |
| 2026-03-09 | Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS | Rania Al-Sabbagh et.al. | 2603.08125 | translate | read | null |
| 2026-03-09 | Language-Invariant Multilingual Speaker Verification for the TidyVoice 2026 Challenge | Ze Li et.al. | 2603.08092 | translate | read | null |
| 2026-03-09 | Designing a Generative AI-Assisted Music Psychotherapy Tool for Deaf and Hard-of-Hearing Individuals | Youjin Choi et.al. | 2603.07963 | translate | read | null |
| 2026-03-08 | Nwāchā Munā: A Devanagari Speech Corpus and Proximal Transfer Benchmark for Nepal Bhasha ASR | Rishikesh Kumar Sharma et.al. | 2603.07554 | translate | read | null |
| 2026-03-08 | Bolbosh: Script-Aware Flow Matching for Kashmiri Text-to-Speech | Tajamul Ashraf et.al. | 2603.07513 | translate | read | null |
| 2026-03-07 | Seeing the Context: Rich Visual Context-Aware Speech Recognition via Multimodal Reasoning | Wenjie Tian et.al. | 2603.07263 | translate | read | null |
| 2026-03-07 | The Talking Robot: Distortion-Robust Acoustic Models for Robot-Robot Communication | Hanlong Li et.al. | 2603.07072 | translate | read | null |
| 2026-03-06 | Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning | Yuchen Zhang et.al. | 2603.06505 | translate | read | null |
| 2026-03-06 | Continual Adaptation for Pacific Indigenous Speech Recognition | Yang Xiao et.al. | 2603.06310 | translate | read | null |
| 2026-03-06 | Whisper-CD: Accurate Long-Form Speech Recognition using Multi-Negative Contrastive Decoding | Hoseong Ahn et.al. | 2603.06193 | translate | read | null |
| 2026-03-06 | Is it Me? Toward Self-Extension to AI Avatars in Virtual Reality | Jieying Zhang et.al. | 2603.06030 | translate | read | null |
| 2026-03-06 | How Well Do Current Speech Deepfake Detection Methods Generalize to the Real World? | Daixian Li et.al. | 2603.05852 | translate | read | null |
| 2026-03-06 | Which Data Matter? Embedding-Based Data Selection for Speech Recognition | Zakaria Aldeneh et.al. | 2603.05819 | translate | read | null |
| 2026-03-06 | Activation Steering for Accent Adaptation in Speech Foundation Models | Jinuo Sun et.al. | 2603.05813 | translate | read | null |
| 2026-03-05 | Koopman Regularized Deep Speech Disentanglement for Speaker Verification | Nikos Chazaridis et.al. | 2603.05577 | translate | read | null |
| 2026-03-05 | Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection | Junchuan Zhao et.al. | 2603.05373 | translate | read | null |
| 2026-03-05 | PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration | Mohammad Javad Ranjbar Kalahroodi et.al. | 2603.05314 | translate | read | null |
| 2026-03-05 | Visual-Informed Speech Enhancement Using Attention-Based Beamforming | Chihyun Liu et.al. | 2603.05270 | translate | read | null |
| 2026-03-05 | Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography | Ting-Hui Cheng et.al. | 2603.05267 | translate | read | null |
| 2026-03-05 | Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards | Linghan Fang et.al. | 2603.05231 | translate | read | null |
| 2026-03-05 | Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition | Mengze Hong et.al. | 2603.04945 | translate | read | null |
| 2026-03-05 | Spectral dynamics reservoir computing for high-speed hardware-efficient neuromorphic processing | Jiaxuan Chen et.al. | 2603.04901 | translate | read | null |
| 2026-03-05 | WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech | Aurchi Chowdhury et.al. | 2603.04809 | translate | read | null |
| 2026-03-05 | When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper | Akif Islam et.al. | 2603.04710 | translate | read | null |
| 2026-03-04 | ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis | Youngwon Choi et.al. | 2603.04219 | translate | read | null |
| 2026-03-04 | Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement | Fei Su et.al. | 2603.03811 | translate | read | null |
| 2026-03-03 | An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization | Epshita Jahan et.al. | 2603.03158 | translate | read | null |
| 2026-03-03 | Speech recognition assisted by large language models to command software orally – Application to an augmented and virtual reality web app for immersive molecular graphics | Fabio Cortes Rodriguez et.al. | 2603.02901 | translate | read | null |
| 2026-03-03 | SilentWear: an Ultra-Low Power Wearable System for EMG-based Silent Speech Recognition | Giusy Spacone et.al. | 2603.02847 | translate | read | null |
| 2026-03-03 | Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge | Dhanya E et.al. | 2603.02813 | translate | read | null |
| 2026-03-02 | ViTex: Visual Texture Control for Multi-Track Symbolic Music Generation via Discrete Diffusion Models | Xiaoyu Yi et.al. | 2603.01984 | translate | read | null |
| 2026-03-02 | VietSuperSpeech: A Large-Scale Vietnamese Conversational Speech Dataset for ASR Fine-Tuning in Chatbot, Customer Support, and Call Center Applications | Loan Do et.al. | 2603.01894 | translate | read | null |
| 2026-03-02 | More Data, Fewer Diacritics: Scaling Arabic TTS | Ahmed Musleh et.al. | 2603.01622 | translate | read | null |
| 2026-03-02 | The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge | Ya Jiang et.al. | 2603.01415 | translate | read | null |
| 2026-03-02 | End-to-End Simultaneous Dysarthric Speech Reconstruction with Frame-Level Adaptor and Multiple Wait-k Knowledge Distillation | Minghui Wu et.al. | 2603.01382 | translate | read | null |
| 2026-03-02 | DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement | Minghui Wu et.al. | 2603.01369 | translate | read | null |
| 2026-03-01 | VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling | Yanir Marmor et.al. | 2603.01270 | translate | read | null |
| 2026-03-01 | SyncTrack: Rhythmic Stability and Synchronization in Multi-Track Music Generation | Hongrui Wang et.al. | 2603.01101 | translate | read | null |
| 2026-03-01 | Using Songs to Improve Kazakh Automatic Speech Recognition | Rustem Yeshpanov et.al. | 2603.00961 | translate | read | null |
| 2026-03-01 | Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages | Kaushal Santosh Bhogale et.al. | 2603.00941 | translate | read | null |