Audio Processing - 2025-04

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
| --- | --- | --- | --- | --- | --- | --- |
| 2025-04-30 | BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition | Paige Tuttösí et.al. | 2505.00059 | translate | read | link |
| 2025-04-30 | From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems | Huan Zhang et.al. | 2504.21815 | translate | read | null |
| 2025-04-30 | Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction | Máté Gedeon et.al. | 2504.21372 | translate | read | null |
| 2025-04-29 | AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation | Jeongsoo Choi et.al. | 2504.20629 | translate | read | null |
| 2025-04-28 | A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks | Shadan Shukr Sabr et.al. | 2504.19645 | translate | read | null |
| 2025-04-27 | Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements | Sandipan Dhar et.al. | 2504.19197 | translate | read | null |
| 2025-04-25 | Kimi-Audio Technical Report | KimiTeam et.al. | 2504.18425 | translate | read | link |
| 2025-04-28 | Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication | Sunday David Ubur et.al. | 2504.17171 | translate | read | null |
| 2025-04-23 | SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward | Nicolas Jonason et.al. | 2504.16839 | translate | read | null |
| 2025-04-22 | TinyML for Speech Recognition | Andrew Barovic et.al. | 2504.16213 | translate | read | null |
| 2025-04-22 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Joya Chen et.al. | 2504.16030 | translate | read | link |
| 2025-04-22 | Quantifying Source Speaker Leakage in One-to-One Voice Conversion | Scott Wellington et.al. | 2504.15822 | translate | read | null |
| 2025-04-22 | Development and evaluation of a deep learning algorithm for German word recognition from lip movements | Dinh Nam Pham et.al. | 2504.15792 | translate | read | null |
| 2025-04-22 | FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning | Ju Yeon Kang et.al. | 2504.15663 | translate | read | null |
| 2025-04-22 | A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Gengxian Cao et.al. | 2504.15552 | translate | read | null |
| 2025-04-21 | Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides | Jinghua Zhao et.al. | 2504.15066 | translate | read | null |
| 2025-04-21 | SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation | Yue Li et.al. | 2504.15035 | translate | read | null |
| 2025-04-21 | Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | Rui Ribeiro et.al. | 2504.14963 | translate | read | null |
| 2025-04-21 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models | Yeona Hong et.al. | 2504.14915 | translate | read | null |
| 2025-04-20 | DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue | Xiang Li et.al. | 2504.14482 | translate | read | link |
| 2025-04-19 | The First VoicePrivacy Attacker Challenge | Natalia Tomashenko et.al. | 2504.14183 | translate | read | null |
| 2025-04-18 | Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion | Sandipan Dhar et.al. | 2504.13791 | translate | read | null |
| 2025-04-18 | MusFlow: Multimodal Music Generation via Conditional Flow Matching | Jiahao Song et.al. | 2504.13535 | translate | read | null |
| 2025-04-17 | Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope | Leena G Pillai et.al. | 2504.13308 | translate | read | null |
| 2025-04-16 | Dysarthria Normalization via Local Lie Group Transformations for Robust ASR | Mikhail Osipov et.al. | 2504.12279 | translate | read | null |
| 2025-04-16 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning | Mahmoud Salhab et.al. | 2504.12254 | translate | read | null |
| 2025-04-16 | Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder | Soobin Suh et.al. | 2504.12005 | translate | read | null |
| 2025-04-15 | Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation | Yan Rong et.al. | 2504.11002 | translate | read | null |
| 2025-04-15 | Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition | Naoto Nishida et.al. | 2504.10849 | translate | read | null |
| 2025-04-15 | Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy | Botao Zhao et.al. | 2504.10819 | translate | read | null |
| 2025-04-14 | Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Yifan Yang et.al. | 2504.10352 | translate | read | null |
| 2025-04-14 | AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Dan Luo et.al. | 2504.10309 | translate | read | null |
| 2025-04-14 | SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis | Zhisheng Zhang et.al. | 2504.09839 | translate | read | link |
| 2025-04-12 | AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis | Yubing Cao et.al. | 2504.09225 | translate | read | null |
| 2025-04-11 | Spatial Audio Processing with Large Language Model on Wearable Devices | Ayushi Mishra et.al. | 2504.08907 | translate | read | null |
| 2025-04-11 | Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion | Na Li et.al. | 2504.08524 | translate | read | null |
| 2025-04-10 | From Speech to Summary: A Comprehensive Survey of Speech Summarization | Fabian Retkowski et.al. | 2504.08024 | translate | read | null |
| 2025-04-10 | Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Yizhong Geng et.al. | 2504.07858 | translate | read | null |
| 2025-04-10 | SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow | Kaidi Wang et.al. | 2504.07776 | translate | read | null |
| 2025-04-10 | Extending Visual Dynamics for Video-to-Music Generation | Xiaohao Liu et.al. | 2504.07594 | translate | read | null |
| 2025-04-09 | Visual-Aware Speech Recognition for Noisy Scenarios | Lakshmipathi Balaji et.al. | 2504.07229 | translate | read | null |
| 2025-04-09 | RNN-Transducer-based Losses for Speech Recognition on Noisy Targets | Vladimir Bataev et.al. | 2504.06963 | translate | read | null |
| 2025-04-08 | AVENet: Disentangling Features by Approximating Average Features for Voice Conversion | Wenyu Wang et.al. | 2504.05833 | translate | read | null |
| 2025-04-08 | kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization | Keren Shao et.al. | 2504.05686 | translate | read | null |
| 2025-04-07 | Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation | Manvi Agarwal et.al. | 2504.05364 | translate | read | null |
| 2025-04-07 | DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation | Xinglin Lyu et.al. | 2504.05122 | translate | read | null |
| 2025-04-06 | Trainable Adaptive Score Normalization for Automatic Speaker Verification | Jeong-Hwan Choi et.al. | 2504.04512 | translate | read | null |
| 2025-04-06 | Public speech recognition transcripts as a configuring parameter | Damien Rudaz et.al. | 2504.04488 | translate | read | null |
| 2025-04-06 | Activation Patching for Interpretable Steering in Music Generation | Simone Facchiano et.al. | 2504.04479 | translate | read | null |
| 2025-04-08 | LoopGen: Training-Free Loopable Music Generation | Davide Marincione et.al. | 2504.04466 | translate | read | null |
| 2025-04-06 | Selective Masking Adversarial Attack on Automatic Speech Recognition Systems | Zheng Fang et.al. | 2504.04394 | translate | read | null |
| 2025-04-04 | An Efficient GPU-based Implementation for Noise Robust Sound Source Localization | Zirui Lin et.al. | 2504.03373 | translate | read | null |
| 2025-04-04 | A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations | Abdul Mannan Mohammed et.al. | 2504.03147 | translate | read | null |
| 2025-04-03 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect | Hedi Naouara et.al. | 2504.02604 | translate | read | null |
| 2025-04-03 | Deep learning for music generation. Four approaches and their comparative evaluation | Razvan Paroiu et.al. | 2504.02586 | translate | read | null |
| 2025-04-03 | F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization | Xiaohui Sun et.al. | 2504.02407 | translate | read | null |
| 2025-04-03 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models | Kim Sung-Bin et.al. | 2504.02386 | translate | read | null |
| 2025-04-02 | Chain of Correction for Full-text Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2504.01519 | translate | read | null |
| 2025-04-01 | Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems | Weifei Jin et.al. | 2504.00858 | translate | read | link |
| 2025-04-01 | A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives: Data, Methods, and Challenges | Shuyu Li et.al. | 2504.00837 | translate | read | null |
| 2025-04-02 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Zhiming Ma et.al. | 2503.24115 | translate | read | link |
