Audio Processing - 2025-04
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-04-30 | BERSting at the Screams: A Benchmark for Distanced, Emotional and Shouted Speech Recognition | Paige Tuttösí et.al. | 2505.00059 | translate | read | link |
| 2025-04-30 | From Aesthetics to Human Preferences: Comparative Perspectives of Evaluating Text-to-Music Systems | Huan Zhang et.al. | 2504.21815 | translate | read | null |
| 2025-04-30 | Retrieval-Enhanced Few-Shot Prompting for Speech Event Extraction | Máté Gedeon et.al. | 2504.21372 | translate | read | null |
| 2025-04-29 | AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation | Jeongsoo Choi et.al. | 2504.20629 | translate | read | null |
| 2025-04-28 | A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks | Shadan Shukr Sabr et.al. | 2504.19645 | translate | read | null |
| 2025-04-27 | Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements | Sandipan Dhar et.al. | 2504.19197 | translate | read | null |
| 2025-04-25 | Kimi-Audio Technical Report | KimiTeam et.al. | 2504.18425 | translate | read | link |
| 2025-04-28 | Augmenting Captions with Emotional Cues: An AR Interface for Real-Time Accessible Communication | Sunday David Ubur et.al. | 2504.17171 | translate | read | null |
| 2025-04-23 | SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward | Nicolas Jonason et.al. | 2504.16839 | translate | read | null |
| 2025-04-22 | TinyML for Speech Recognition | Andrew Barovic et.al. | 2504.16213 | translate | read | null |
| 2025-04-22 | LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale | Joya Chen et.al. | 2504.16030 | translate | read | link |
| 2025-04-22 | Quantifying Source Speaker Leakage in One-to-One Voice Conversion | Scott Wellington et.al. | 2504.15822 | translate | read | null |
| 2025-04-22 | Development and evaluation of a deep learning algorithm for German word recognition from lip movements | Dinh Nam Pham et.al. | 2504.15792 | translate | read | null |
| 2025-04-22 | FADEL: Uncertainty-aware Fake Audio Detection with Evidential Deep Learning | Ju Yeon Kang et.al. | 2504.15663 | translate | read | null |
| 2025-04-22 | A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models | Gengxian Cao et.al. | 2504.15552 | translate | read | null |
| 2025-04-21 | Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides | Jinghua Zhao et.al. | 2504.15066 | translate | read | null |
| 2025-04-21 | SOLIDO: A Robust Watermarking Method for Speech Synthesis via Low-Rank Adaptation | Yue Li et.al. | 2504.15035 | translate | read | null |
| 2025-04-21 | Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues | Rui Ribeiro et.al. | 2504.14963 | translate | read | null |
| 2025-04-21 | StableQuant: Layer Adaptive Post-Training Quantization for Speech Foundation Models | Yeona Hong et.al. | 2504.14915 | translate | read | null |
| 2025-04-20 | DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue | Xiang Li et.al. | 2504.14482 | translate | read | link |
| 2025-04-19 | The First VoicePrivacy Attacker Challenge | Natalia Tomashenko et.al. | 2504.14183 | translate | read | null |
| 2025-04-18 | Collective Learning Mechanism based Optimal Transport Generative Adversarial Network for Non-parallel Voice Conversion | Sandipan Dhar et.al. | 2504.13791 | translate | read | null |
| 2025-04-18 | MusFlow: Multimodal Music Generation via Conditional Flow Matching | Jiahao Song et.al. | 2504.13535 | translate | read | null |
| 2025-04-17 | Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope | Leena G Pillai et.al. | 2504.13308 | translate | read | null |
| 2025-04-16 | Dysarthria Normalization via Local Lie Group Transformations for Robust ASR | Mikhail Osipov et.al. | 2504.12279 | translate | read | null |
| 2025-04-16 | Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning | Mahmoud Salhab et.al. | 2504.12254 | translate | read | null |
| 2025-04-16 | Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder | Soobin Suh et.al. | 2504.12005 | translate | read | null |
| 2025-04-15 | Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation | Yan Rong et.al. | 2504.11002 | translate | read | null |
| 2025-04-15 | Real-Time Word-Level Temporal Segmentation in Streaming Speech Recognition | Naoto Nishida et.al. | 2504.10849 | translate | read | null |
| 2025-04-15 | Generalized Audio Deepfake Detection Using Frame-level Latent Information Entropy | Botao Zhao et.al. | 2504.10819 | translate | read | null |
| 2025-04-14 | Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis | Yifan Yang et.al. | 2504.10352 | translate | read | null |
| 2025-04-14 | AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis | Dan Luo et.al. | 2504.10309 | translate | read | null |
| 2025-04-14 | SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis | Zhisheng Zhang et.al. | 2504.09839 | translate | read | link |
| 2025-04-12 | AMNet: An Acoustic Model Network for Enhanced Mandarin Speech Synthesis | Yubing Cao et.al. | 2504.09225 | translate | read | null |
| 2025-04-11 | Spatial Audio Processing with Large Language Model on Wearable Devices | Ayushi Mishra et.al. | 2504.08907 | translate | read | null |
| 2025-04-11 | Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion | Na Li et.al. | 2504.08524 | translate | read | null |
| 2025-04-10 | From Speech to Summary: A Comprehensive Survey of Speech Summarization | Fabian Retkowski et.al. | 2504.08024 | translate | read | null |
| 2025-04-10 | Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis | Yizhong Geng et.al. | 2504.07858 | translate | read | null |
| 2025-04-10 | SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow | Kaidi Wang et.al. | 2504.07776 | translate | read | null |
| 2025-04-10 | Extending Visual Dynamics for Video-to-Music Generation | Xiaohao Liu et.al. | 2504.07594 | translate | read | null |
| 2025-04-09 | Visual-Aware Speech Recognition for Noisy Scenarios | Lakshmipathi Balaji et.al. | 2504.07229 | translate | read | null |
| 2025-04-09 | RNN-Transducer-based Losses for Speech Recognition on Noisy Targets | Vladimir Bataev et.al. | 2504.06963 | translate | read | null |
| 2025-04-08 | AVENet: Disentangling Features by Approximating Average Features for Voice Conversion | Wenyu Wang et.al. | 2504.05833 | translate | read | null |
| 2025-04-08 | kNN-SVC: Robust Zero-Shot Singing Voice Conversion with Additive Synthesis and Concatenation Smoothness Optimization | Keren Shao et.al. | 2504.05686 | translate | read | null |
| 2025-04-07 | Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation | Manvi Agarwal et.al. | 2504.05364 | translate | read | null |
| 2025-04-07 | DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation | Xinglin Lyu et.al. | 2504.05122 | translate | read | null |
| 2025-04-06 | Trainable Adaptive Score Normalization for Automatic Speaker Verification | Jeong-Hwan Choi et.al. | 2504.04512 | translate | read | null |
| 2025-04-06 | Public speech recognition transcripts as a configuring parameter | Damien Rudaz et.al. | 2504.04488 | translate | read | null |
| 2025-04-06 | Activation Patching for Interpretable Steering in Music Generation | Simone Facchiano et.al. | 2504.04479 | translate | read | null |
| 2025-04-08 | LoopGen: Training-Free Loopable Music Generation | Davide Marincione et.al. | 2504.04466 | translate | read | null |
| 2025-04-06 | Selective Masking Adversarial Attack on Automatic Speech Recognition Systems | Zheng Fang et.al. | 2504.04394 | translate | read | null |
| 2025-04-04 | An Efficient GPU-based Implementation for Noise Robust Sound Source Localization | Zirui Lin et.al. | 2504.03373 | translate | read | null |
| 2025-04-04 | A Human Digital Twin Architecture for Knowledge-based Interactions and Context-Aware Conversations | Abdul Mannan Mohammed et.al. | 2504.03147 | translate | read | null |
| 2025-04-03 | LinTO Audio and Textual Datasets to Train and Evaluate Automatic Speech Recognition in Tunisian Arabic Dialect | Hedi Naouara et.al. | 2504.02604 | translate | read | null |
| 2025-04-03 | Deep learning for music generation. Four approaches and their comparative evaluation | Razvan Paroiu et.al. | 2504.02586 | translate | read | null |
| 2025-04-03 | F5R-TTS: Improving Flow Matching based Text-to-Speech with Group Relative Policy Optimization | Xiaohui Sun et.al. | 2504.02407 | translate | read | null |
| 2025-04-03 | VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models | Kim Sung-Bin et.al. | 2504.02386 | translate | read | null |
| 2025-04-02 | Chain of Correction for Full-text Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2504.01519 | translate | read | null |
| 2025-04-01 | Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems | Weifei Jin et.al. | 2504.00858 | translate | read | link |
| 2025-04-01 | A Survey on Music Generation from Single-Modal, Cross-Modal, and Multi-Modal Perspectives: Data, Methods, and Challenges | Shuyu Li et.al. | 2504.00837 | translate | read | null |
| 2025-04-02 | TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection | Zhiming Ma et.al. | 2503.24115 | translate | read | link |