Audio Processing - 2025-04 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Code
2025-04-30	BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition	Paige Tuttösí et al.	2505.00059	link
2025-04-30	From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems	Huan Zhang et al.	2504.21815	null
2025-04-30	Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction	Máté Gedeon et al.	2504.21372	null
2025-04-29	AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation	Jeongsoo Choi et al.	2504.20629	null
2025-04-28	A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks	Shadan Shukr Sabr et al.	2504.19645	null
2025-04-27	Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements	Sandipan Dhar et al.	2504.19197	null
2025-04-25	Kimi-Audio Technical Report	KimiTeam et al.	2504.18425	link
2025-04-28	Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication	Sunday David Ubur et al.	2504.17171	null
2025-04-23	SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward	Nicolas Jonason et al.	2504.16839	null
2025-04-22	TinyML for Speech Recognition	Andrew Barovic et al.	2504.16213	null
2025-04-22	LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Joya Chen et al.	2504.16030	link
2025-04-22	Quantifying Source Speaker Leakage in One-to-One Voice Conversion	Scott Wellington et al.	2504.15822	null
2025-04-22	Development and evaluation of a deep learning algorithm for German word recognition from lip movements	Dinh Nam Pham et al.	2504.15792	null
2025-04-22	FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning	Ju Yeon Kang et al.	2504.15663	null
2025-04-22	A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models	Gengxian Cao et al.	2504.15552	null
2025-04-21	Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides	Jinghua Zhao et al.	2504.15066	null
2025-04-21	SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation	Yue Li et al.	2504.15035	null
2025-04-21	Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues	Rui Ribeiro et al.	2504.14963	null
2025-04-21	StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models	Yeona Hong et al.	2504.14915	null
2025-04-20	DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue	Xiang Li et al.	2504.14482	link
2025-04-19	The First VoicePrivacy Attacker Challenge	Natalia Tomashenko et al.	2504.14183	null
2025-04-18	Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion	Sandipan Dhar et al.	2504.13791	null
2025-04-18	MusFlow: Multimodal Music Generation via Conditional Flow Matching	Jiahao Song et al.	2504.13535	null
2025-04-17	Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope	Leena G Pillai et al.	2504.13308	null
2025-04-16	Dysarthria Normalization via Local Lie Group Transformations for Robust ASR	Mikhail Osipov et al.	2504.12279	null
2025-04-16	Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning	Mahmoud Salhab et al.	2504.12254	null
2025-04-16	Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder	Soobin Suh et al.	2504.12005	null
2025-04-15	Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation	Yan Rong et al.	2504.11002	null
2025-04-15	Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition	Naoto Nishida et al.	2504.10849	null
2025-04-15	Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy	Botao Zhao et al.	2504.10819	null
2025-04-14	Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis	Yifan Yang et al.	2504.10352	null
2025-04-14	AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis	Dan Luo et al.	2504.10309	null
2025-04-14	SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis	Zhisheng Zhang et al.	2504.09839	link
2025-04-12	AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis	Yubing Cao et al.	2504.09225	null
2025-04-11	Spatial Audio Processing with Large Language Model on Wearable Devices	Ayushi Mishra et al.	2504.08907	null
2025-04-11	Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion	Na Li et al.	2504.08524	null
2025-04-10	From Speech to Summary: A Comprehensive Survey of Speech Summarization	Fabian Retkowski et al.	2504.08024	null
2025-04-10	Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis	Yizhong Geng et al.	2504.07858	null
2025-04-10	SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow	Kaidi Wang et al.	2504.07776	null
2025-04-10	Extending Visual Dynamics for Video-to-Music Generation	Xiaohao Liu et al.	2504.07594	null
2025-04-09	Visual-Aware Speech Recognition for Noisy Scenarios	Lakshmipathi Balaji et al.	2504.07229	null
2025-04-09	RNN-Transducer-based Losses for Speech Recognition on Noisy Targets	Vladimir Bataev et al.	2504.06963	null
2025-04-08	AVENet: Disentangling Features by Approximating Average Features for Voice Conversion	Wenyu Wang et al.	2504.05833	null
2025-04-08	kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization	Keren Shao et al.	2504.05686	null
2025-04-07	Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation	Manvi Agarwal et al.	2504.05364	null
2025-04-07	DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation	Xinglin Lyu et al.	2504.05122	null
2025-04-06	Trainable Adaptive Score Normalization for Automatic Speaker Verification	Jeong-Hwan Choi et al.	2504.04512	null
2025-04-06	Public speech recognition transcripts as a configuring parameter	Damien Rudaz et al.	2504.04488	null
2025-04-06	Activation Patching for Interpretable Steering in Music Generation	Simone Facchiano et al.	2504.04479	null
2025-04-08	LoopGen: Training-Free Loopable Music Generation	Davide Marincione et al.	2504.04466	null
2025-04-06	Selective Masking Adversarial Attack on Automatic Speech Recognition Systems	Zheng Fang et al.	2504.04394	null
2025-04-04	An Efficient GPU-based Implementation for Noise Robust Sound Source Localization	Zirui Lin et al.	2504.03373	null
2025-04-04	A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations	Abdul Mannan Mohammed et al.	2504.03147	null
2025-04-03	LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect	Hedi Naouara et al.	2504.02604	null
2025-04-03	Deep learning for music generation. Four approaches and their comparative evaluation	Razvan Paroiu et al.	2504.02586	null
2025-04-03	F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization	Xiaohui Sun et al.	2504.02407	null
2025-04-03	VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models	Kim Sung-Bin et al.	2504.02386	null
2025-04-02	Chain of Correction for Full-text Speech Recognition with Large Language Models	Zhiyuan Tang et al.	2504.01519	null
2025-04-01	Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems	Weifei Jin et al.	2504.00858	link
2025-04-01	A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives: Data, Methods, and Challenges	Shuyu Li et al.	2504.00837	null
2025-04-02	TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection	Zhiming Ma et al.	2503.24115	link
