Audio Processing - 2025-05
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-05-30 | Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach | Nick Rossenbach et.al. | 2505.24721 | translate | read | null |
| 2025-05-30 | Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification | Badr M. Abdullah et.al. | 2505.24713 | translate | read | link |
| 2025-05-30 | Pretraining Multi-Speaker Identification for Neural Speaker Diarization | Shota Horiguchi et.al. | 2505.24545 | translate | read | null |
| 2025-05-30 | SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition | Longjie Luo et.al. | 2505.24450 | translate | read | null |
| 2025-05-30 | Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge | Longjie Luo et.al. | 2505.24446 | translate | read | null |
| 2025-05-30 | Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction | Yangui Fang et.al. | 2505.24347 | translate | read | null |
| 2025-05-30 | When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds | Minsu Kang et.al. | 2505.24336 | translate | read | null |
| 2025-05-30 | A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater’s Shadowing and Sequence-to-sequence Voice Conversion | Haopeng Geng et.al. | 2505.24304 | translate | read | null |
| 2025-05-30 | Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion | Kaidi Wang et.al. | 2505.24291 | translate | read | null |
| 2025-05-29 | Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection | Griffin Dietz Smith et.al. | 2505.23627 | translate | read | null |
| 2025-05-29 | ZeroSep: Separate Anything in Audio with Zero Training | Chao Huang et.al. | 2505.23625 | translate | read | link |
| 2025-05-29 | MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction | Yunkee Chae et.al. | 2505.23305 | translate | read | null |
| 2025-05-29 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation | Zhennan Lin et.al. | 2505.23077 | translate | read | null |
| 2025-05-29 | AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition | Yuhang Dai et.al. | 2505.23036 | translate | read | link |
| 2025-05-28 | BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models | Susan Liang et.al. | 2505.22865 | translate | read | null |
| 2025-05-28 | NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding | Vladimir Bataev et.al. | 2505.22857 | translate | read | null |
| 2025-05-28 | Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition | Yuan Tseng et.al. | 2505.22251 | translate | read | null |
| 2025-05-28 | Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis | Stefan Bleeck et.al. | 2505.22231 | translate | read | null |
| 2025-05-28 | On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition | Shujie HU et.al. | 2505.22072 | translate | read | null |
| 2025-05-28 | Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR | Mingchen Shao et.al. | 2505.22063 | translate | read | null |
| 2025-05-28 | Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge | Shangkun Huang et.al. | 2505.22013 | translate | read | null |
| 2025-05-28 | Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection | Shangkun Huang et.al. | 2505.22005 | translate | read | null |
| 2025-05-27 | GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task | Chutong Meng et.al. | 2505.21781 | translate | read | null |
| 2025-05-27 | VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin | Zhiqi Ai et.al. | 2505.21445 | translate | read | null |
| 2025-05-27 | Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision | Zhaoqing Li et.al. | 2505.21245 | translate | read | null |
| 2025-05-27 | PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems | Nima Sedghiyeh et.al. | 2505.21230 | translate | read | null |
| 2025-05-27 | Topological Deep Learning for Speech Data | Zhiwang Yu et.al. | 2505.21173 | translate | read | null |
| 2025-05-27 | Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis | Tianyi Xu et.al. | 2505.21138 | translate | read | null |
| 2025-05-27 | Text-Queried Audio Source Separation via Hierarchical Modeling | Xinlei Yin et.al. | 2505.21025 | translate | read | null |
| 2025-05-27 | VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion | Joon-Seung Choi et.al. | 2505.20794 | translate | read | null |
| 2025-05-27 | REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion | Ishan D. Biyani et.al. | 2505.20756 | translate | read | null |
| 2025-05-27 | PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts | Tianhua Qi et.al. | 2505.20678 | translate | read | null |
| 2025-05-27 | Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation | Dancheng Liu et.al. | 2505.20606 | translate | read | null |
| 2025-05-26 | Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks | Chang Liu et.al. | 2505.20038 | translate | read | null |
| 2025-05-26 | Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition | Raphaël Bagat et.al. | 2505.20006 | translate | read | null |
| 2025-05-26 | Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy | Elvir Karimov et.al. | 2505.19951 | translate | read | null |
| 2025-05-26 | DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech | Deok-Hyeon Cho et.al. | 2505.19687 | translate | read | null |
| 2025-05-26 | KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization | Zhaolin Li et.al. | 2505.19679 | translate | read | null |
| 2025-05-26 | Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling | Haiyang Sun et.al. | 2505.19669 | translate | read | null |
| 2025-05-26 | Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically | Ryan Soh-Eun Shim et.al. | 2505.19606 | translate | read | null |
| 2025-05-26 | Training-Free Multi-Step Audio Source Separation | Yongyi Zang et.al. | 2505.19534 | translate | read | null |
| 2025-05-26 | Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer’s Disease Detection | Yin-Long Liu et.al. | 2505.19448 | translate | read | null |
| 2025-05-26 | GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor | Seokgi Lee et.al. | 2505.19384 | translate | read | null |
| 2025-05-23 | Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities | Ziwei Zhou et.al. | 2505.17862 | translate | read | link |
| 2025-05-23 | CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training | Zhihao Du et.al. | 2505.17589 | translate | read | null |
| 2025-05-23 | Private kNN-VC: Interpretable Anonymization of Converted Speech | Carlos Franzreb et.al. | 2505.17584 | translate | read | link |
| 2025-05-23 | Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition | Leonora Vesterbacka et.al. | 2505.17538 | translate | read | null |
| 2025-05-23 | Speechless: Speech Instruction Training Without Speech for Low Resource Languages | Alan Dao et.al. | 2505.17417 | translate | read | link |
| 2025-05-23 | LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context | Natsuo Yamashita et.al. | 2505.17410 | translate | read | null |
| 2025-05-23 | An End-to-End Approach for Child Reading Assessment in the Xhosa Language | Sergio Chevtchenko et.al. | 2505.17371 | translate | read | null |
| 2025-05-22 | An Effective Training Framework for Light-Weight Automatic Speech Recognition Models | Abdul Hannan et.al. | 2505.16991 | translate | read | null |
| 2025-05-22 | From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition | Tianduo Wang et.al. | 2505.16972 | translate | read | link |
| 2025-05-23 | EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion | Advait Joglekar et.al. | 2505.16691 | translate | read | link |
| 2025-05-22 | SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding | Sushant Gautam et.al. | 2505.16630 | translate | read | link |
| 2025-05-22 | HPP-Voice: A Large-Scale Evaluation of Speech Embeddings for Multi-Phenotypic Classification | David Krongauz et.al. | 2505.16490 | translate | read | null |
| 2025-05-22 | X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance | Junbo Zhang et.al. | 2505.16369 | translate | read | link |
| 2025-05-22 | Large Language Models based ASR Error Correction for Child Conversations | Anfeng Xu et.al. | 2505.16212 | translate | read | null |
| 2025-05-22 | Differentiable K-means for Fully-optimized Discrete Token-based ASR | Kentaro Onda et.al. | 2505.16207 | translate | read | null |
| 2025-05-22 | Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora | Kentaro Onda et.al. | 2505.16191 | translate | read | null |
| 2025-05-22 | Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty | Hongfei Xue et.al. | 2505.16168 | translate | read | null |
| 2025-05-21 | MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling | Cheng Yifan et.al. | 2505.15772 | translate | read | null |
| 2025-05-21 | Word Level Timestamp Generation for Automatic Speech Recognition and Translation | Ke Hu et.al. | 2505.15646 | translate | read | null |
| 2025-05-21 | Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes | Zixun Guo et.al. | 2505.15559 | translate | read | null |
| 2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et.al. | 2505.15406 | translate | read | link |
| 2025-05-21 | Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning | Junchuan Zhao et.al. | 2505.15402 | translate | read | null |
| 2025-05-21 | Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding | Zijian Lin et.al. | 2505.15380 | translate | read | null |
| 2025-05-21 | Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework | Kyungguen Byun et.al. | 2505.15254 | translate | read | null |
| 2025-05-20 | In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties | Nathan Roll et.al. | 2505.14887 | translate | read | link |
| 2025-05-20 | Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages | Chin-Jou Li et.al. | 2505.14874 | translate | read | null |
| 2025-05-20 | Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits | Tiantian Feng et.al. | 2505.14648 | translate | read | link |
| 2025-05-20 | Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference | Tomer Gafni et.al. | 2505.14638 | translate | read | null |
| 2025-05-20 | SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification | Theo Lepage et.al. | 2505.14561 | translate | read | null |
| 2025-05-20 | Pairwise Evaluation of Accent Similarity in Speech Synthesis | Jinzuomu Zhong et.al. | 2505.14410 | translate | read | null |
| 2025-05-20 | PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs | Sho Inoue et.al. | 2505.14356 | translate | read | null |
| 2025-05-20 | FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation | Yutong Liu et.al. | 2505.14351 | translate | read | null |
| 2025-05-20 | Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach | Umberto Cappellazzo et.al. | 2505.14336 | translate | read | null |
| 2025-05-20 | HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing | Shamsuddeen Hassan Muhammad et.al. | 2505.14311 | translate | read | null |
| 2025-05-20 | Source Verification for Speech Deepfakes | Viola Negroni et.al. | 2505.14188 | translate | read | null |
| 2025-05-20 | The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition | Ming Gao et.al. | 2505.13971 | translate | read | null |
| 2025-05-19 | Granary: Speech Recognition and Translation Dataset in 25 European Languages | Nithin Rao Koluguri et.al. | 2505.13404 | translate | read | null |
| 2025-05-19 | Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space | Zhengrui Ma et.al. | 2505.13181 | translate | read | link |
| 2025-05-19 | Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR | Xugang Lu et.al. | 2505.13079 | translate | read | null |
| 2025-05-19 | KIT’s Offline Speech Translation and Instruction Following Submission for IWSLT 2025 | Sai Koneru et.al. | 2505.13036 | translate | read | link |
| 2025-05-19 | Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition | Dominik Wagner et.al. | 2505.12991 | translate | read | null |
| 2025-05-19 | Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down | Yingzhi Wang et.al. | 2505.12969 | translate | read | null |
| 2025-05-19 | Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio | Jongmin Jung et.al. | 2505.12863 | translate | read | null |
| 2025-05-19 | OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching | Hieu-Nghia Huynh-Nguyen et.al. | 2505.12800 | translate | read | null |
| 2025-05-19 | RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations | Seungmin Kim et.al. | 2505.12686 | translate | read | null |
| 2025-05-19 | Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment | Abhinaba Roy et.al. | 2505.12669 | translate | read | link |
| 2025-05-16 | LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models | Danilo de Oliveira et.al. | 2505.11391 | translate | read | null |
| 2025-05-16 | LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors | Rao Ma et.al. | 2505.11352 | translate | read | null |
| 2025-05-16 | Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio | Xinlu He et.al. | 2505.10975 | translate | read | null |
| 2025-05-16 | Multi-Stage Speaker Diarization for Noisy Classrooms | Ali Sartaz Khan et.al. | 2505.10879 | translate | read | null |
| 2025-05-15 | UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech | Jiaxuan Liu et.al. | 2505.10599 | translate | read | null |
| 2025-05-15 | Inclusivity of AI Speech in Healthcare: A Decade Look Back | Retno Larasati et.al. | 2505.10596 | translate | read | null |
| 2025-05-15 | Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio | Tu Duyen Nguyen et.al. | 2505.10500 | translate | read | null |
| 2025-05-14 | GlobalMood: A cross-cultural benchmark for music emotion recognition | Harin Lee et.al. | 2505.09539 | translate | read | null |
| 2025-05-14 | SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset | Yicheng Gu et.al. | 2505.09325 | translate | read | null |
| 2025-05-14 | DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis | Zeeshan Ahmad et.al. | 2505.09091 | translate | read | null |
| 2025-05-13 | Inference Attacks for X-Vector Speaker Anonymization | Luke Bauer et.al. | 2505.08978 | translate | read | null |
| 2025-05-13 | Investigating self-supervised features for expressive, multilingual voice conversion | Álvaro Martín-Cortinas et.al. | 2505.08278 | translate | read | null |
| 2025-05-13 | Not that Groove: Zero-Shot Symbolic Music Editing | Li Zhang et.al. | 2505.08203 | translate | read | null |
| 2025-05-12 | Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications | Biel Tura Vecino et.al. | 2505.07701 | translate | read | null |
| 2025-05-12 | Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells | Jaeyun Yoo et.al. | 2505.07273 | translate | read | null |
| 2025-05-09 | Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients | Jinsheng Yuan et.al. | 2505.06335 | translate | read | null |
| 2025-05-08 | Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations | Linrong Pan et.al. | 2505.05056 | translate | read | null |
| 2025-05-08 | A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration | Shaja Arul Selvamani et.al. | 2505.04885 | translate | read | null |
| 2025-05-07 | Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond | Jessie Richter-Powell et.al. | 2505.04621 | translate | read | null |
| 2025-05-07 | SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer | Young-Hu Park et.al. | 2505.04394 | translate | read | null |
| 2025-05-07 | Discrete Optimal Transport and Voice Conversion | Anton Selitskiy et.al. | 2505.04382 | translate | read | null |
| 2025-05-07 | Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement | Rauf Nasretdinov et.al. | 2505.04237 | translate | read | null |
| 2025-05-06 | VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model | Zuwei Long et.al. | 2505.03739 | translate | read | link |
| 2025-05-06 | Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech | Susmita Bhattacharjee et.al. | 2505.03697 | translate | read | null |
| 2025-05-06 | Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation | Jincheng Zhang et.al. | 2505.03314 | translate | read | link |
| 2025-05-06 | SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation | Zhaoxi Mu et.al. | 2505.03273 | translate | read | null |
| 2025-05-06 | SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation | Yu-Ren Guo et.al. | 2505.03244 | translate | read | null |
| 2025-05-06 | MGFF-TDNN: A Multi-Granularity Feature Fusion TDNN Model with Depth-Wise Separable Module for Speaker Verification | Ya Li et.al. | 2505.03228 | translate | read | link |
| 2025-05-06 | CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization | Detao Bai et.al. | 2505.03186 | translate | read | null |
| 2025-05-05 | Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play | Yemin Shi et.al. | 2505.02707 | translate | read | link |
| 2025-05-05 | LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis | Qingkai Fang et.al. | 2505.02625 | translate | read | link |
| 2025-05-04 | Transforming faces into video stories – VideoFace2.0 | Branko Brkljač et.al. | 2505.02060 | translate | read | null |
| 2025-05-04 | A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction | Xiaoliang Chen et.al. | 2505.01998 | translate | read | null |
| 2025-05-02 | Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments | Noussaiba Djeffal et.al. | 2505.01632 | translate | read | null |
| 2025-05-01 | Scaling On-Device GPU Inference for Large Generative Models | Jiuqiang Tang et.al. | 2505.00232 | translate | read | null |
| 2025-05-02 | Towards Flow-Matching-based TTS without Classifier-Free Guidance | Yuzhe Liang et.al. | 2504.20334 | translate | read | null |