Audio Processing - 2025-03

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-03-31 | Can Diffusion Models Disentangle? A Theoretical Perspective | Liming Wang et al. | 2504.00220 | translate | read | null |
| 2025-03-31 | SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation | Ngoc Dung Huynh et al. | 2503.24164 | translate | read | null |
| 2025-03-31 | SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development | Minghan Wang et al. | 2503.23848 | translate | read | link |
| 2025-03-30 | The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR | Injy Hamed et al. | 2503.23576 | translate | read | null |
| 2025-03-30 | Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages | Xabier de Zuazo et al. | 2503.23542 | translate | read | link |
| 2025-03-30 | Scaling Auditory Cognition via Test-Time Compute in Audio Language Models | Ting Dang et al. | 2503.23395 | translate | read | null |
| 2025-03-29 | SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System | Hyeongju Kim et al. | 2503.23108 | translate | read | null |
| 2025-03-28 | Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model | Changchang Sun et al. | 2503.22138 | translate | read | null |
| 2025-03-27 | VALLR: Visual ASR Language Model for Lip Reading | Marshall Thomas et al. | 2503.21408 | translate | read | null |
| 2025-03-27 | A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network | Chih-Chyau Yang et al. | 2503.21337 | translate | read | null |
| 2025-03-27 | Vision-to-Music Generation: A Survey | Zhaokai Wang et al. | 2503.21254 | translate | read | link |
| 2025-03-26 | Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit | Aniket Abhishek Soni et al. | 2503.21025 | translate | read | null |
| 2025-03-26 | Text-Driven Voice Conversion via Latent State-Space Modeling | Wen Li et al. | 2503.20999 | translate | read | null |
| 2025-03-26 | FinAudio: A Benchmark for Audio Large Language Models in Financial Applications | Yupeng Cao et al. | 2503.20990 | translate | read | null |
| 2025-03-26 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages | Yangyang Meng et al. | 2503.20212 | translate | read | link |
| 2025-03-25 | Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | Athiya Deviyani et al. | 2503.19828 | translate | read | null |
| 2025-03-25 | Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation | Max W. Y. Lam et al. | 2503.19611 | translate | read | null |
| 2025-03-25 | Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization | Weifei Jin et al. | 2503.19591 | translate | read | null |
| 2025-03-25 | Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment | Ghazanfar Ali et al. | 2503.19334 | translate | read | null |
| 2025-03-22 | A Survey on Structured State Space Sequence (S4) Models | Shriyank Somvanshi et al. | 2503.18970 | translate | read | link |
| 2025-03-24 | Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems | Jacopo de Berardinis et al. | 2503.18814 | translate | read | null |
| 2025-03-24 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language | Dawit Ketema Gete et al. | 2503.18485 | translate | read | null |
| 2025-03-23 | Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition | Yufeng Yang et al. | 2503.17886 | translate | read | null |
| 2025-03-22 | LZMidi: Compression-Based Symbolic Music Generation | Connor Ding et al. | 2503.17654 | translate | read | null |
| 2025-03-21 | Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Yiwen Xu et al. | 2503.17479 | translate | read | null |
| 2025-03-21 | From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Ji-Hoon Kim et al. | 2503.16956 | translate | read | null |
| 2025-03-20 | CAARMA: Class Augmentation with Adversarial Mixup Regularization | Massa Baali et al. | 2503.16718 | translate | read | null |
| 2025-03-20 | WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching | Tianze Luo et al. | 2503.16689 | translate | read | null |
| 2025-03-20 | SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors | Yang Chen et al. | 2503.16578 | translate | read | null |
| 2025-03-19 | A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions | Saddam Hussain Khan et al. | 2503.16546 | translate | read | null |
| 2025-03-19 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces | Korbinian Kuhn et al. | 2503.15124 | translate | read | null |
| 2025-03-19 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition | Korbinian Kuhn et al. | 2503.15120 | translate | read | null |
| 2025-03-19 | MoonCast: High-Quality Zero-Shot Podcast Generation | Zeqian Ju et al. | 2503.14345 | translate | read | link |
| 2025-03-18 | InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being | Guang Dai et al. | 2503.14257 | translate | read | null |
| 2025-03-17 | Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis | Jakob Sponholz et al. | 2503.13031 | translate | read | null |
| 2025-03-14 | MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Jeong Hun Yeo et al. | 2503.11315 | translate | read | link |
| 2025-03-13 | AudioX: Diffusion Transformer for Anything-to-Audio Generation | Zeyue Tian et al. | 2503.10522 | translate | read | link |
| 2025-03-13 | Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings | Jakaria Islam Emon et al. | 2503.10446 | translate | read | link |
| 2025-03-14 | Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models | Sebastian Möller et al. | 2503.10298 | translate | read | null |
| 2025-03-12 | ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization | Haaris Mehmood et al. | 2503.09906 | translate | read | null |
| 2025-03-12 | Quantization for OpenAI’s Whisper Models: A Comparative Analysis | Allison Andreyev et al. | 2503.09905 | translate | read | link |
| 2025-03-12 | Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Xiaowei Bi et al. | 2503.09081 | translate | read | null |
| 2025-03-11 | An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Sewade Ogun et al. | 2503.08954 | translate | read | null |
| 2025-03-11 | YuE: Scaling Open Foundation Models for Long-Form Music Generation | Ruibin Yuan et al. | 2503.08638 | translate | read | link |
| 2025-03-11 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos | Soumya Shamarao Jahagirdar et al. | 2503.08335 | translate | read | null |
| 2025-03-11 | FilmComposer: LLM-Driven Music Production for Silent Film Clips | Zhifeng Xie et al. | 2503.08147 | translate | read | link |
| 2025-03-11 | Boundary Regression for Leitmotif Detection in Music Audio | Sihun Lee et al. | 2503.07977 | translate | read | null |
| 2025-03-10 | Building English ASR model with regional language support | Purvi Agrawal et al. | 2503.07522 | translate | read | null |
| 2025-03-10 | Impact of Microphone Array Mismatches to Learning-based Replay Speech Detection | Michael Neri et al. | 2503.07357 | translate | read | null |
| 2025-03-10 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling | Michael McGuire et al. | 2503.06924 | translate | read | null |
| 2025-03-09 | Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs | Umberto Cappellazzo et al. | 2503.06362 | translate | read | null |
| 2025-03-08 | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Jeong Hun Yeo et al. | 2503.06273 | translate | read | link |
| 2025-03-08 | A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment | Koji Inoue et al. | 2503.06241 | translate | read | null |
| 2025-03-07 | DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility | Yifan Liu et al. | 2503.05223 | translate | read | null |
| 2025-03-06 | From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment | Yutian Pang et al. | 2503.04974 | translate | read | null |
| 2025-03-04 | Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis | Yiming Wang et al. | 2503.04814 | translate | read | null |
| 2025-03-06 | LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM | Sambal Shikhar et al. | 2503.04724 | translate | read | link |
| 2025-03-06 | Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning | Lucas Block Medin et al. | 2503.04710 | translate | read | null |
| 2025-03-05 | Good practices for evaluation of synthesized speech | Erica Cooper et al. | 2503.03250 | translate | read | null |
| 2025-03-03 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis | Samuel S. Sohn et al. | 2503.02907 | translate | read | null |
| 2025-03-04 | Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization | Aviv Shamsian et al. | 2503.02312 | translate | read | null |
| 2025-03-05 | Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization | Leonid Berlyand et al. | 2503.01922 | translate | read | null |
| 2025-03-03 | Augmenting Online Meetings with Context-Aware Real-time Music Generation | Haruki Suzawa et al. | 2503.01354 | translate | read | null |
| 2025-03-03 | Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology | Birger Moell et al. | 2503.01266 | translate | read | null |
| 2025-03-03 | DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion | Ziqian Ning et al. | 2503.01183 | translate | read | link |
| 2025-03-02 | Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems | Ajinkya Kulkarni et al. | 2503.00907 | translate | read | null |
| 2025-03-02 | UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Alexander H. Liu et al. | 2503.00733 | translate | read | null |
| 2025-03-01 | PodAgent: A Comprehensive Framework for Podcast Generation | Yujia Xiao et al. | 2503.00455 | translate | read | link |
