Audio Processing - 2025-05 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-05-30	Running Conventional Automatic Speech Recognition on Memristor Hardware: A Simulated Approach	Nick Rossenbach et.al.	2505.24721	translate	read	null
2025-05-30	Voice Conversion Improves Cross-Domain Robustness for Spoken Arabic Dialect Identification	Badr M. Abdullah et.al.	2505.24713	translate	read	link
2025-05-30	Pretraining Multi-Speaker Identification for Neural Speaker Diarization	Shota Horiguchi et.al.	2505.24545	translate	read	null
2025-05-30	SuPseudo: A Pseudo-supervised Learning Method for Neural Speech Enhancement in Far-field Speech Recognition	Longjie Luo et.al.	2505.24450	translate	read	null
2025-05-30	Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge	Longjie Luo et.al.	2505.24446	translate	read	null
2025-05-30	Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction	Yangui Fang et.al.	2505.24347	translate	read	null
2025-05-30	When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds	Minsu Kang et.al.	2505.24336	translate	read	null
2025-05-30	A Perception-Based L2 Speech Intelligibility Indicator: Leveraging a Rater’s Shadowing and Sequence-to-sequence Voice Conversion	Haopeng Geng et.al.	2505.24304	translate	read	null
2025-05-30	Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion	Kaidi Wang et.al.	2505.24291	translate	read	null
2025-05-29	Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection	Griffin Dietz Smith et.al.	2505.23627	translate	read	null
2025-05-29	ZeroSep: Separate Anything in Audio with Zero Training	Chao Huang et.al.	2505.23625	translate	read	link
2025-05-29	MGE-LDM: Joint Latent Diffusion for Simultaneous Music Generation and Source Extraction	Yunkee Chae et.al.	2505.23305	translate	read	null
2025-05-29	Contextualized Automatic Speech Recognition with Dynamic Vocabulary Prediction and Activation	Zhennan Lin et.al.	2505.23077	translate	read	null
2025-05-29	AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition	Yuhang Dai et.al.	2505.23036	translate	read	link
2025-05-28	BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models	Susan Liang et.al.	2505.22865	translate	read	null
2025-05-28	NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding	Vladimir Bataev et.al.	2505.22857	translate	read	null
2025-05-28	Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition	Yuan Tseng et.al.	2505.22251	translate	read	null
2025-05-28	Advancing Hearing Assessment: An ASR-Based Frequency-Specific Speech Test for Diagnosing Presbycusis	Stefan Bleeck et.al.	2505.22231	translate	read	null
2025-05-28	On-the-fly Routing for Zero-shot MoE Speaker Adaptation of Speech Foundation Models for Dysarthric Speech Recognition	Shujie HU et.al.	2505.22072	translate	read	null
2025-05-28	Weakly Supervised Data Refinement and Flexible Sequence Compression for Efficient Thai LLM-based ASR	Mingchen Shao et.al.	2505.22063	translate	read	null
2025-05-28	Overlap-Adaptive Hybrid Speaker Diarization and ASR-Aware Observation Addition for MISP 2025 Challenge	Shangkun Huang et.al.	2505.22013	translate	read	null
2025-05-28	Leveraging LLM for Stuttering Speech: A Unified Architecture Bridging Recognition and Event Detection	Shangkun Huang et.al.	2505.22005	translate	read	null
2025-05-27	GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task	Chutong Meng et.al.	2505.21781	translate	read	null
2025-05-27	VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin	Zhiqi Ai et.al.	2505.21445	translate	read	null
2025-05-27	Towards One-bit ASR: Extremely Low-bit Conformer Quantization Using Co-training and Stochastic Precision	Zhaoqing Li et.al.	2505.21245	translate	read	null
2025-05-27	PSRB: A Comprehensive Benchmark for Evaluating Persian ASR Systems	Nima Sedghiyeh et.al.	2505.21230	translate	read	null
2025-05-27	Topological Deep Learning for Speech Data	Zhiwang Yu et.al.	2505.21173	translate	read	null
2025-05-27	Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis	Tianyi Xu et.al.	2505.21138	translate	read	null
2025-05-27	Text-Queried Audio Source Separation via Hierarchical Modeling	Xinlei Yin et.al.	2505.21025	translate	read	null
2025-05-27	VibE-SVC: Vibrato Extraction with High-frequency F0 Contour for Singing Voice Conversion	Joon-Seung Choi et.al.	2505.20794	translate	read	null
2025-05-27	REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion	Ishan D. Biyani et.al.	2505.20756	translate	read	null
2025-05-27	PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts	Tianhua Qi et.al.	2505.20678	translate	read	null
2025-05-27	Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation	Dancheng Liu et.al.	2505.20606	translate	read	null
2025-05-26	Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks	Chang Liu et.al.	2505.20038	translate	read	null
2025-05-26	Mixture of LoRA Experts for Low-Resourced Multi-Accent Automatic Speech Recognition	Raphaël Bagat et.al.	2505.20006	translate	read	null
2025-05-26	Novel Loss-Enhanced Universal Adversarial Patches for Sustainable Speaker Privacy	Elvir Karimov et.al.	2505.19951	translate	read	null
2025-05-26	DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distillation for Cross-Speaker Emotion Transfer in Text-to-Speech	Deok-Hyeon Cho et.al.	2505.19687	translate	read	null
2025-05-26	KIT’s Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization	Zhaolin Li et.al.	2505.19679	translate	read	null
2025-05-26	Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling	Haiyang Sun et.al.	2505.19669	translate	read	null
2025-05-26	Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically	Ryan Soh-Eun Shim et.al.	2505.19606	translate	read	null
2025-05-26	Training-Free Multi-Step Audio Source Separation	Yongyi Zang et.al.	2505.19534	translate	read	null
2025-05-26	Beyond Manual Transcripts: The Potential of Automated Speech Recognition Errors in Improving Alzheimer’s Disease Detection	Yin-Long Liu et.al.	2505.19448	translate	read	null
2025-05-26	GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor	Seokgi Lee et.al.	2505.19384	translate	read	null
2025-05-23	Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities	Ziwei Zhou et.al.	2505.17862	translate	read	link
2025-05-23	CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training	Zhihao Du et.al.	2505.17589	translate	read	null
2025-05-23	Private kNN-VC: Interpretable Anonymization of Converted Speech	Carlos Franzreb et.al.	2505.17584	translate	read	link
2025-05-23	Swedish Whispers; Leveraging a Massive Speech Corpus for Swedish Speech Recognition	Leonora Vesterbacka et.al.	2505.17538	translate	read	null
2025-05-23	Speechless: Speech Instruction Training Without Speech for Low Resource Languages	Alan Dao et.al.	2505.17417	translate	read	link
2025-05-23	LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context	Natsuo Yamashita et.al.	2505.17410	translate	read	null
2025-05-23	An End-to-End Approach for Child Reading Assessment in the Xhosa Language	Sergio Chevtchenko et.al.	2505.17371	translate	read	null
2025-05-22	An Effective Training Framework for Light-Weight Automatic Speech Recognition Models	Abdul Hannan et.al.	2505.16991	translate	read	null
2025-05-22	From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition	Tianduo Wang et.al.	2505.16972	translate	read	link
2025-05-23	EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion	Advait Joglekar et.al.	2505.16691	translate	read	link
2025-05-22	SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding	Sushant Gautam et.al.	2505.16630	translate	read	link
2025-05-22	HPP-Voice: A Large-Scale Evaluation of Speech Embeddings for Multi-Phenotypic Classification	David Krongauz et.al.	2505.16490	translate	read	null
2025-05-22	X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance	Junbo Zhang et.al.	2505.16369	translate	read	link
2025-05-22	Large Language Models based ASR Error Correction for Child Conversations	Anfeng Xu et.al.	2505.16212	translate	read	null
2025-05-22	Differentiable K-means for Fully-optimized Discrete Token-based ASR	Kentaro Onda et.al.	2505.16207	translate	read	null
2025-05-22	Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora	Kentaro Onda et.al.	2505.16191	translate	read	null
2025-05-22	Selective Invocation for Multilingual ASR: A Cost-effective Approach Adapting to Speech Recognition Difficulty	Hongfei Xue et.al.	2505.16168	translate	read	null
2025-05-21	MIKU-PAL: An Automated and Standardized Multi-Modal Method for Speech Paralinguistic and Affect Labeling	Cheng Yifan et.al.	2505.15772	translate	read	null
2025-05-21	Word Level Timestamp Generation for Automatic Speech Recognition and Translation	Ke Hu et.al.	2505.15646	translate	read	null
2025-05-21	Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes	Zixun Guo et.al.	2505.15559	translate	read	null
2025-05-21	Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models	Zirui Song et.al.	2505.15406	translate	read	link
2025-05-21	Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning	Junchuan Zhao et.al.	2505.15402	translate	read	null
2025-05-21	Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding	Zijian Lin et.al.	2505.15380	translate	read	null
2025-05-21	Voice-ENHANCE: Speech Restoration using a Diffusion-based Voice Conversion Framework	Kyungguen Byun et.al.	2505.15254	translate	read	null
2025-05-20	In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties	Nathan Roll et.al.	2505.14887	translate	read	link
2025-05-20	Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages	Chin-Jou Li et.al.	2505.14874	translate	read	null
2025-05-20	Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits	Tiantian Feng et.al.	2505.14648	translate	read	link
2025-05-20	Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference	Tomer Gafni et.al.	2505.14638	translate	read	null
2025-05-20	SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification	Theo Lepage et.al.	2505.14561	translate	read	null
2025-05-20	Pairwise Evaluation of Accent Similarity in Speech Synthesis	Jinzuomu Zhong et.al.	2505.14410	translate	read	null
2025-05-20	PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs	Sho Inoue et.al.	2505.14356	translate	read	null
2025-05-20	FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation	Yutong Liu et.al.	2505.14351	translate	read	null
2025-05-20	Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach	Umberto Cappellazzo et.al.	2505.14336	translate	read	null
2025-05-20	HausaNLP: Current Status, Challenges and Future Directions for Hausa Natural Language Processing	Shamsuddeen Hassan Muhammad et.al.	2505.14311	translate	read	null
2025-05-20	Source Verification for Speech Deepfakes	Viola Negroni et.al.	2505.14188	translate	read	null
2025-05-20	The Multimodal Information Based Speech Processing (MISP) 2025 Challenge: Audio-Visual Diarization and Recognition	Ming Gao et.al.	2505.13971	translate	read	null
2025-05-19	Granary: Speech Recognition and Translation Dataset in 25 European Languages	Nithin Rao Koluguri et.al.	2505.13404	translate	read	null
2025-05-19	Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space	Zhengrui Ma et.al.	2505.13181	translate	read	link
2025-05-19	Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR	Xugang Lu et.al.	2505.13079	translate	read	null
2025-05-19	KIT’s Offline Speech Translation and Instruction Following Submission for IWSLT 2025	Sai Koneru et.al.	2505.13036	translate	read	link
2025-05-19	Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition	Dominik Wagner et.al.	2505.12991	translate	read	null
2025-05-19	Calm-Whisper: Reduce Whisper Hallucination On Non-Speech By Calming Crazy Heads Down	Yingzhi Wang et.al.	2505.12969	translate	read	null
2025-05-19	Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio	Jongmin Jung et.al.	2505.12863	translate	read	null
2025-05-19	OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching	Hieu-Nghia Huynh-Nguyen et.al.	2505.12800	translate	read	null
2025-05-19	RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations	Seungmin Kim et.al.	2505.12686	translate	read	null
2025-05-19	Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment	Abhinaba Roy et.al.	2505.12669	translate	read	link
2025-05-16	LipDiffuser: Lip-to-Speech Generation with Conditional Diffusion Models	Danilo de Oliveira et.al.	2505.11391	translate	read	null
2025-05-16	LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors	Rao Ma et.al.	2505.11352	translate	read	null
2025-05-16	Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio	Xinlu He et.al.	2505.10975	translate	read	null
2025-05-16	Multi-Stage Speaker Diarization for Noisy Classrooms	Ali Sartaz Khan et.al.	2505.10879	translate	read	null
2025-05-15	UDDETTS: Unifying Discrete and Dimensional Emotions for Controllable Emotional Text-to-Speech	Jiaxuan Liu et.al.	2505.10599	translate	read	null
2025-05-15	Inclusivity of AI Speech in Healthcare: A Decade Look Back	Retno Larasati et.al.	2505.10596	translate	read	null
2025-05-15	Quantized Approximate Signal Processing (QASP): Towards Homomorphic Encryption for audio	Tu Duyen Nguyen et.al.	2505.10500	translate	read	null
2025-05-14	GlobalMood: A cross-cultural benchmark for music emotion recognition	Harin Lee et.al.	2505.09539	translate	read	null
2025-05-14	SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset	Yicheng Gu et.al.	2505.09325	translate	read	null
2025-05-14	DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis	Zeeshan Ahmad et.al.	2505.09091	translate	read	null
2025-05-13	Inference Attacks for X-Vector Speaker Anonymization	Luke Bauer et.al.	2505.08978	translate	read	null
2025-05-13	Investigating self-supervised features for expressive, multilingual voice conversion	Álvaro Martín-Cortinas et.al.	2505.08278	translate	read	null
2025-05-13	Not that Groove: Zero-Shot Symbolic Music Editing	Li Zhang et.al.	2505.08203	translate	read	null
2025-05-12	Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications	Biel Tura Vecino et.al.	2505.07701	translate	read	null
2025-05-12	Full simulation on the dynamics of auditory synaptic fusion: Strong clustering of calcium channel might be the origin of the coherent release in the auditory hair cells	Jaeyun Yoo et.al.	2505.07273	translate	read	null
2025-05-09	Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients	Jinsheng Yuan et.al.	2505.06335	translate	read	null
2025-05-08	Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations	Linrong Pan et.al.	2505.05056	translate	read	null
2025-05-08	A Multi-Agent AI Framework for Immersive Audiobook Production through Spatial Audio and Neural Narration	Shaja Arul Selvamani et.al.	2505.04885	translate	read	null
2025-05-07	Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond	Jessie Richter-Powell et.al.	2505.04621	translate	read	null
2025-05-07	SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer	Young-Hu Park et.al.	2505.04394	translate	read	null
2025-05-07	Discrete Optimal Transport and Voice Conversion	Anton Selitskiy et.al.	2505.04382	translate	read	null
2025-05-07	Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement	Rauf Nasretdinov et.al.	2505.04237	translate	read	null
2025-05-06	VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model	Zuwei Long et.al.	2505.03739	translate	read	link
2025-05-06	Fairness of Automatic Speech Recognition in Cleft Lip and Palate Speech	Susmita Bhattacharjee et.al.	2505.03697	translate	read	null
2025-05-06	Mamba-Diffusion Model with Learnable Wavelet for Controllable Symbolic Music Generation	Jincheng Zhang et.al.	2505.03314	translate	read	link
2025-05-06	SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation	Zhaoxi Mu et.al.	2505.03273	translate	read	null
2025-05-06	SonicRAG : High Fidelity Sound Effects Synthesis Based on Retrival Augmented Generation	Yu-Ren Guo et.al.	2505.03244	translate	read	null
2025-05-06	MGFF-TDNN: A Multi-Granularity Feature Fusion TDNN Model with Depth-Wise Separable Module for Speaker Verification	Ya Li et.al.	2505.03228	translate	read	link
2025-05-06	CoGenAV: Versatile Audio-Visual Representation Learning via Contrastive-Generative Synchronization	Detao Bai et.al.	2505.03186	translate	read	null
2025-05-05	Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play	Yemin Shi et.al.	2505.02707	translate	read	link
2025-05-05	LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis	Qingkai Fang et.al.	2505.02625	translate	read	link
2025-05-04	Transforming faces into video stories – VideoFace2.0	Branko Brkljač et.al.	2505.02060	translate	read	null
2025-05-04	A Synergistic Framework of Nonlinear Acoustic Computing and Reinforcement Learning for Real-World Human-Robot Interaction	Xiaoliang Chen et.al.	2505.01998	translate	read	null
2025-05-02	Transfer Learning-Based Deep Residual Learning for Speech Recognition in Clean and Noisy Environments	Noussaiba Djeffal et.al.	2505.01632	translate	read	null
2025-05-01	Scaling On-Device GPU Inference for Large Generative Models	Jiuqiang Tang et.al.	2505.00232	translate	read	null
2025-05-02	Towards Flow-Matching-based TTS without Classifier-Free Guidance	Yuzhe Liang et.al.	2504.20334	translate	read	null
