Audio Processing - 2025-03 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-03-31	Can Diffusion Models Disentangle? A Theoretical Perspective	Liming Wang et.al.	2504.00220	translate	read	null
2025-03-31	SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation	Ngoc Dung Huynh et.al.	2503.24164	translate	read	null
2025-03-31	SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development	Minghan Wang et.al.	2503.23848	translate	read	link
2025-03-30	The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR	Injy Hamed et.al.	2503.23576	translate	read	null
2025-03-30	Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages	Xabier de Zuazo et.al.	2503.23542	translate	read	link
2025-03-30	Scaling Auditory Cognition via Test-Time Compute in Audio Language Models	Ting Dang et.al.	2503.23395	translate	read	null
2025-03-29	SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System	Hyeongju Kim et.al.	2503.23108	translate	read	null
2025-03-28	Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model	Changchang Sun et.al.	2503.22138	translate	read	null
2025-03-27	VALLR: Visual ASR Language Model for Lip Reading	Marshall Thomas et.al.	2503.21408	translate	read	null
2025-03-27	A 71.2-$\mu$W Speech Recognition Accelerator with Recurrent Spiking Neural Network	Chih-Chyau Yang et.al.	2503.21337	translate	read	null
2025-03-27	Vision-to-Music Generation: A Survey	Zhaokai Wang et.al.	2503.21254	translate	read	link
2025-03-26	Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit	Aniket Abhishek Soni et.al.	2503.21025	translate	read	null
2025-03-26	Text-Driven Voice Conversion via Latent State-Space Modeling	Wen Li et.al.	2503.20999	translate	read	null
2025-03-26	FinAudio: A Benchmark for Audio Large Language Models in Financial Applications	Yupeng Cao et.al.	2503.20990	translate	read	null
2025-03-26	Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages	Yangyang Meng et.al.	2503.20212	translate	read	link
2025-03-25	Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy	Athiya Deviyani et.al.	2503.19828	translate	read	null
2025-03-25	Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation	Max W. Y. Lam et.al.	2503.19611	translate	read	null
2025-03-25	Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization	Weifei Jin et.al.	2503.19591	translate	read	null
2025-03-25	Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment	Ghazanfar Ali et.al.	2503.19334	translate	read	null
2025-03-22	A Survey on Structured State Space Sequence (S4) Models	Shriyank Somvanshi et.al.	2503.18970	translate	read	link
2025-03-24	Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems	Jacopo de Berardinis et.al.	2503.18814	translate	read	null
2025-03-24	Whispering in Amharic: Fine-tuning Whisper for Low-resource Language	Dawit Ketema Gete et.al.	2503.18485	translate	read	null
2025-03-23	Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition	Yufeng Yang et.al.	2503.17886	translate	read	null
2025-03-22	LZMidi: Compression-Based Symbolic Music Generation	Connor Ding et.al.	2503.17654	translate	read	null
2025-03-21	Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication	Yiwen Xu et.al.	2503.17479	translate	read	null
2025-03-21	From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech	Ji-Hoon Kim et.al.	2503.16956	translate	read	null
2025-03-20	CAARMA: Class Augmentation with Adversarial Mixup Regularization	Massa Baali et.al.	2503.16718	translate	read	null
2025-03-20	WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching	Tianze Luo et.al.	2503.16689	translate	read	null
2025-03-20	SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors	Yang Chen et.al.	2503.16578	translate	read	null
2025-03-19	A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions	Saddam Hussain Khan et.al.	2503.16546	translate	read	null
2025-03-19	Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces	Korbinian Kuhn et.al.	2503.15124	translate	read	null
2025-03-19	Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition	Korbinian Kuhn et.al.	2503.15120	translate	read	null
2025-03-19	MoonCast: High-Quality Zero-Shot Podcast Generation	Zeqian Ju et.al.	2503.14345	translate	read	link
2025-03-18	InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being	Guang Dai et.al.	2503.14257	translate	read	null
2025-03-17	Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis	Jakob Sponholz et.al.	2503.13031	translate	read	null
2025-03-14	MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens	Jeong Hun Yeo et.al.	2503.11315	translate	read	link
2025-03-13	AudioX: Diffusion Transformer for Anything-to-Audio Generation	Zeyue Tian et.al.	2503.10522	translate	read	link
2025-03-13	Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings	Jakaria Islam Emon et.al.	2503.10446	translate	read	link
2025-03-14	Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models	Sebastian Möller et.al.	2503.10298	translate	read	null
2025-03-12	ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization	Haaris Mehmood et.al.	2503.09906	translate	read	null
2025-03-12	Quantization for OpenAI’s Whisper Models: A Comparative Analysis	Allison Andreyev et.al.	2503.09905	translate	read	link
2025-03-12	Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment	Xiaowei Bi et.al.	2503.09081	translate	read	null
2025-03-11	An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR	Sewade Ogun et.al.	2503.08954	translate	read	null
2025-03-11	YuE: Scaling Open Foundation Models for Long-Form Music Generation	Ruibin Yuan et.al.	2503.08638	translate	read	link
2025-03-11	Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos	Soumya Shamarao Jahagirdar et.al.	2503.08335	translate	read	null
2025-03-11	FilmComposer: LLM-Driven Music Production for Silent Film Clips	Zhifeng Xie et.al.	2503.08147	translate	read	link
2025-03-11	Boundary Regression for Leitmotif Detection in Music Audio	Sihun Lee et.al.	2503.07977	translate	read	null
2025-03-10	Building English ASR model with regional language support	Purvi Agrawal et.al.	2503.07522	translate	read	null
2025-03-10	Impact of Microphone Array Mismatches to Learning-based Replay Speech Detection	Michael Neri et.al.	2503.07357	translate	read	null
2025-03-10	Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling	Michael McGuire et.al.	2503.06924	translate	read	null
2025-03-09	Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs	Umberto Cappellazzo et.al.	2503.06362	translate	read	null
2025-03-08	Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations	Jeong Hun Yeo et.al.	2503.06273	translate	read	link
2025-03-08	A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment	Koji Inoue et.al.	2503.06241	translate	read	null
2025-03-07	DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility	Yifan Liu et.al.	2503.05223	translate	read	null
2025-03-06	From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment	Yutian Pang et.al.	2503.04974	translate	read	null
2025-03-04	Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis	Yiming Wang et.al.	2503.04814	translate	read	null
2025-03-06	LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM	Sambal Shikhar et.al.	2503.04724	translate	read	link
2025-03-06	Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning	Lucas Block Medin et.al.	2503.04710	translate	read	null
2025-03-05	Good practices for evaluation of synthesized speech	Erica Cooper et.al.	2503.03250	translate	read	null
2025-03-03	Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis	Samuel S. Sohn et.al.	2503.02907	translate	read	null
2025-03-04	Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization	Aviv Shamsian et.al.	2503.02312	translate	read	null
2025-03-05	Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization	Leonid Berlyand et.al.	2503.01922	translate	read	null
2025-03-03	Augmenting Online Meetings with Context-Aware Real-time Music Generation	Haruki Suzawa et.al.	2503.01354	translate	read	null
2025-03-03	Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology	Birger Moell et.al.	2503.01266	translate	read	null
2025-03-03	DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion	Ziqian Ning et.al.	2503.01183	translate	read	link
2025-03-02	Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems	Ajinkya Kulkarni et.al.	2503.00907	translate	read	null
2025-03-02	UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation	Alexander H. Liu et.al.	2503.00733	translate	read	null
2025-03-01	PodAgent: A Comprehensive Framework for Podcast Generation	Yujia Xiao et.al.	2503.00455	translate	read	link
