Audio Processing - 2025-05

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-05-30 | Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach | Nick Rossenbach et al. | 2505.24721 | translate | read | null |
| 2025-05-30 | Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification | Badr M. Abdullah et al. | 2505.24713 | translate | read | link |
| 2025-05-30 | Pretraining Multi-Speaker Identification for Neural Speaker Diarization | Shota Horiguchi et al. | 2505.24545 | translate | read | null |
| 2025-05-30 | SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition | Longjie Luo et al. | 2505.24450 | translate | read | null |
| 2025-05-30 | Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge | Longjie Luo et al. | 2505.24446 | translate | read | null |
| 2025-05-30 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction | Yangui Fang et al. | 2505.24347 | translate | read | null |
| 2025-05-30 | When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds | Minsu Kang et al. | 2505.24336 | translate | read | null |
| 2025-05-30 | A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater’s Shadowing and Sequence-to-sequence Voice Conversion | Haopeng Geng et al. | 2505.24304 | translate | read | null |
| 2025-05-30 | Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion | Kaidi Wang et al. | 2505.24291 | translate | read | null |
| 2025-05-29 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection | Griffin Dietz Smith et al. | 2505.23627 | translate | read | null |
| 2025-05-29 | ZeroSep: Separate Anything in Audio with Zero Training | Chao Huang et al. | 2505.23625 | translate | read | link |
| 2025-05-29 | MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction | Yunkee Chae et al. | 2505.23305 | translate | read | null |
| 2025-05-29 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation | Zhennan Lin et al. | 2505.23077 | translate | read | null |
| 2025-05-29 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition | Yuhang Dai et al. | 2505.23036 | translate | read | link |
| 2025-05-28 | BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | Susan Liang et al. | 2505.22865 | translate | read | null |
| 2025-05-28 | NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding | Vladimir Bataev et al. | 2505.22857 | translate | read | null |
| 2025-05-28 | Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition | Yuan Tseng et al. | 2505.22251 | translate | read | null |
| 2025-05-28 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis | Stefan Bleeck et al. | 2505.22231 | translate | read | null |
| 2025-05-28 | On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition | Shujie HU et al. | 2505.22072 | translate | read | null |
| 2025-05-28 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR | Mingchen Shao et al. | 2505.22063 | translate | read | null |
| 2025-05-28 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge | Shangkun Huang et al. | 2505.22013 | translate | read | null |
| 2025-05-28 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection | Shangkun Huang et al. | 2505.22005 | translate | read | null |
| 2025-05-27 | GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task | Chutong Meng et al. | 2505.21781 | translate | read | null |
| 2025-05-27 | VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin | Zhiqi Ai et al. | 2505.21445 | translate | read | null |
| 2025-05-27 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision | Zhaoqing Li et al. | 2505.21245 | translate | read | null |
| 2025-05-27 | PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems | Nima Sedghiyeh et al. | 2505.21230 | translate | read | null |
| 2025-05-27 | Topological Deep Learning for Speech Data | Zhiwang Yu et al. | 2505.21173 | translate | read | null |
| 2025-05-27 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis | Tianyi Xu et al. | 2505.21138 | translate | read | null |
| 2025-05-27 | Text-Queried Audio Source Separation via Hierarchical Modeling | Xinlei Yin et al. | 2505.21025 | translate | read | null |
| 2025-05-27 | VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion | Joon-Seung Choi et al. | 2505.20794 | translate | read | null |
| 2025-05-27 | REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion | Ishan D. Biyani et al. | 2505.20756 | translate | read | null |
| 2025-05-27 | PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts | Tianhua Qi et al. | 2505.20678 | translate | read | null |
| 2025-05-27 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation | Dancheng Liu et al. | 2505.20606 | translate | read | null |
| 2025-05-26 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Chang Liu et al. | 2505.20038 | translate | read | null |
| 2025-05-26 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition | Raphaël Bagat et al. | 2505.20006 | translate | read | null |
| 2025-05-26 | Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy | Elvir Karimov et al. | 2505.19951 | translate | read | null |
| 2025-05-26 | DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | Deok-Hyeon Cho et al. | 2505.19687 | translate | read | null |
| 2025-05-26 | KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Zhaolin Li et al. | 2505.19679 | translate | read | null |
| 2025-05-26 | Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | Haiyang Sun et al. | 2505.19669 | translate | read | null |
| 2025-05-26 | Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically | Ryan Soh-Eun Shim et al. | 2505.19606 | translate | read | null |
| 2025-05-26 | Training-Free Multi-Step Audio Source Separation | Yongyi Zang et al. | 2505.19534 | translate | read | null |
| 2025-05-26 | Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer’s Disease Detection | Yin-Long Liu et al. | 2505.19448 | translate | read | null |
| 2025-05-26 | GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor | Seokgi Lee et al. | 2505.19384 | translate | read | null |
| 2025-05-23 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Ziwei Zhou et al. | 2505.17862 | translate | read | link |
| 2025-05-23 | CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Zhihao Du et al. | 2505.17589 | translate | read | null |
| 2025-05-23 | Private kNN-VC: Interpretable Anonymization of Converted Speech | Carlos Franzreb et al. | 2505.17584 | translate | read | link |
| 2025-05-23 | Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition | Leonora Vesterbacka et al. | 2505.17538 | translate | read | null |
| 2025-05-23 | Speechless: Speech Instruction Training Without Speech for Low Resource Languages | Alan Dao et al. | 2505.17417 | translate | read | link |
| 2025-05-23 | LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context | Natsuo Yamashita et al. | 2505.17410 | translate | read | null |
| 2025-05-23 | An End-to-End Approach for Child Reading Assessment in the Xhosa Language | Sergio Chevtchenko et al. | 2505.17371 | translate | read | null |
| 2025-05-22 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models | Abdul Hannan et al. | 2505.16991 | translate | read | null |
| 2025-05-22 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | Tianduo Wang et al. | 2505.16972 | translate | read | link |
| 2025-05-23 | EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion | Advait Joglekar et al. | 2505.16691 | translate | read | link |
| 2025-05-22 | SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | Sushant Gautam et al. | 2505.16630 | translate | read | link |
| 2025-05-22 | HPP-Voice: A Large-Scale Evaluation of Speech Embeddings for Multi-Phenotypic Classification | David Krongauz et al. | 2505.16490 | translate | read | null |
| 2025-05-22 | X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance | Junbo Zhang et al. | 2505.16369 | translate | read | link |
| 2025-05-22 | Large Language Models based ASR Error Correction for Child Conversations | Anfeng Xu et al. | 2505.16212 | translate | read | null |
| 2025-05-22 | Differentiable K-means for Fully-optimized Discrete Token-based ASR | Kentaro Onda et al. | 2505.16207 | translate | read | null |
| 2025-05-22 | Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora | Kentaro Onda et al. | 2505.16191 | translate | read | null |
| 2025-05-22 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty | Hongfei Xue et al. | 2505.16168 | translate | read | null |
| 2025-05-21 | MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | Cheng Yifan et al. | 2505.15772 | translate | read | null |
| 2025-05-21 | Word Level Timestamp Generation for Automatic Speech Recognition and Translation | Ke Hu et al. | 2505.15646 | translate | read | null |
| 2025-05-21 | Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes | Zixun Guo et al. | 2505.15559 | translate | read | null |
| 2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et al. | 2505.15406 | translate | read | link |
| 2025-05-21 | Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning | Junchuan Zhao et al. | 2505.15402 | translate | read | null |
| 2025-05-21 | Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding | Zijian Lin et al. | 2505.15380 | translate | read | null |
| 2025-05-21 | Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework | Kyungguen Byun et al. | 2505.15254 | translate | read | null |
| 2025-05-20 | In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties | Nathan Roll et al. | 2505.14887 | translate | read | link |
| 2025-05-20 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages | Chin-Jou Li et al. | 2505.14874 | translate | read | null |
| 2025-05-20 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits | Tiantian Feng et al. | 2505.14648 | translate | read | link |
| 2025-05-20 | Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference | Tomer Gafni et al. | 2505.14638 | translate | read | null |
| 2025-05-20 | SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification | Theo Lepage et al. | 2505.14561 | translate | read | null |
| 2025-05-20 | Pairwise Evaluation of Accent Similarity in Speech Synthesis | Jinzuomu Zhong et al. | 2505.14410 | translate | read | null |
| 2025-05-20 | PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs | Sho Inoue et al. | 2505.14356 | translate | read | null |
| 2025-05-20 | FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et al. | 2505.14351 | translate | read | null |
| 2025-05-20 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Umberto Cappellazzo et al. | 2505.14336 | translate | read | null |
| 2025-05-20 | HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing | Shamsuddeen Hassan Muhammad et al. | 2505.14311 | translate | read | null |
| 2025-05-20 | Source Verification for Speech Deepfakes | Viola Negroni et al. | 2505.14188 | translate | read | null |
| 2025-05-20 | The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition | Ming Gao et al. | 2505.13971 | translate | read | null |
| 2025-05-19 | Granary: Speech Recognition and Translation Dataset in 25 European Languages | Nithin Rao Koluguri et al. | 2505.13404 | translate | read | null |
| 2025-05-19 | Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space | Zhengrui Ma et al. | 2505.13181 | translate | read | link |
| 2025-05-19 | Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR | Xugang Lu et al. | 2505.13079 | translate | read | null |
| 2025-05-19 | KIT’s Offline Speech Translation and Instruction Following Submission for IWSLT 2025 | Sai Koneru et al. | 2505.13036 | translate | read | link |
| 2025-05-19 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition | Dominik Wagner et al. | 2505.12991 | translate | read | null |
| 2025-05-19 | Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down | Yingzhi Wang et al. | 2505.12969 | translate | read | null |
| 2025-05-19 | Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio | Jongmin Jung et al. | 2505.12863 | translate | read | null |
| 2025-05-19 | OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | Hieu-Nghia Huynh-Nguyen et al. | 2505.12800 | translate | read | null |
| 2025-05-19 | RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations | Seungmin Kim et al. | 2505.12686 | translate | read | null |
| 2025-05-19 | Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment | Abhinaba Roy et al. | 2505.12669 | translate | read | link |
| 2025-05-16 | LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models | Danilo de Oliveira et al. | 2505.11391 | translate | read | null |
| 2025-05-16 | LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors | Rao Ma et al. | 2505.11352 | translate | read | null |
| 2025-05-16 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio | Xinlu He et al. | 2505.10975 | translate | read | null |
| 2025-05-16 | Multi-Stage Speaker Diarization for Noisy Classrooms | Ali Sartaz Khan et al. | 2505.10879 | translate | read | null |
| 2025-05-15 | UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | Jiaxuan Liu et al. | 2505.10599 | translate | read | null |
| 2025-05-15 | Inclusivity of AI Speech in Healthcare: A Decade Look Back | Retno Larasati et al. | 2505.10596 | translate | read | null |
| 2025-05-15 | Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio | Tu Duyen Nguyen et al. | 2505.10500 | translate | read | null |
| 2025-05-14 | GlobalMood: A cross-cultural benchmark for music emotion recognition | Harin Lee et al. | 2505.09539 | translate | read | null |
| 2025-05-14 | SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset | Yicheng Gu et al. | 2505.09325 | translate | read | null |
| 2025-05-14 | DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis | Zeeshan Ahmad et al. | 2505.09091 | translate | read | null |
| 2025-05-13 | Inference Attacks for X-Vector Speaker Anonymization | Luke Bauer et al. | 2505.08978 | translate | read | null |
| 2025-05-13 | Investigating self-supervised features for expressive, multilingual voice conversion | Álvaro Martín-Cortinas et al. | 2505.08278 | translate | read | null |
| 2025-05-13 | Not that Groove: Zero-Shot Symbolic Music Editing | Li Zhang et al. | 2505.08203 | translate | read | null |
| 2025-05-12 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino et al. | 2505.07701 | translate | read | null |
| 2025-05-12 | Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells | Jaeyun Yoo et al. | 2505.07273 | translate | read | null |
| 2025-05-09 | Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients | Jinsheng Yuan et al. | 2505.06335 | translate | read | null |
| 2025-05-08 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | Linrong Pan et al. | 2505.05056 | translate | read | null |
| 2025-05-08 | A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration | Shaja Arul Selvamani et al. | 2505.04885 | translate | read | null |
| 2025-05-07 | Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond | Jessie Richter-Powell et al. | 2505.04621 | translate | read | null |
| 2025-05-07 | SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer | Young-Hu Park et al. | 2505.04394 | translate | read | null |
| 2025-05-07 | Discrete Optimal Transport and Voice Conversion | Anton Selitskiy et al. | 2505.04382 | translate | read | null |
| 2025-05-07 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement | Rauf Nasretdinov et al. | 2505.04237 | translate | read | null |
| 2025-05-06 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Zuwei Long et al. | 2505.03739 | translate | read | link |
| 2025-05-06 | Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech | Susmita Bhattacharjee et al. | 2505.03697 | translate | read | null |
| 2025-05-06 | Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation | Jincheng Zhang et al. | 2505.03314 | translate | read | link |
| 2025-05-06 | SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation | Zhaoxi Mu et al. | 2505.03273 | translate | read | null |
| 2025-05-06 | SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation | Yu-Ren Guo et al. | 2505.03244 | translate | read | null |
| 2025-05-06 | MGFF-TDNN: A Multi-Granularity Feature Fusion TDNN Model with Depth-Wise Separable Module for Speaker Verification | Ya Li et al. | 2505.03228 | translate | read | link |
| 2025-05-06 | CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Detao Bai et al. | 2505.03186 | translate | read | null |
| 2025-05-05 | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Yemin Shi et al. | 2505.02707 | translate | read | link |
| 2025-05-05 | LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis | Qingkai Fang et al. | 2505.02625 | translate | read | link |
| 2025-05-04 | Transforming faces into video stories – VideoFace2.0 | Branko Brkljač et al. | 2505.02060 | translate | read | null |
| 2025-05-04 | A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction | Xiaoliang Chen et al. | 2505.01998 | translate | read | null |
| 2025-05-02 | Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments | Noussaiba Djeffal et al. | 2505.01632 | translate | read | null |
| 2025-05-01 | Scaling On-Device GPU Inference for Large Generative Models | Jiuqiang Tang et al. | 2505.00232 | translate | read | null |
| 2025-05-02 | Towards Flow-Matching-based TTS without Classifier-Free Guidance | Yuzhe Liang et al. | 2504.20334 | translate | read | null |
