Audio Processing - 2024-12 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Translate	Read	Code
2024-12-31	Fotheidil: an Automatic Transcription System for the Irish Language	Liam Lonergan et.al.	2501.00509	translate	read	null
2024-12-31	Unrolled Creative Adversarial Network For Generating Novel Musical Pieces	Pratik Nag et.al.	2501.00452	translate	read	null
2024-12-31	Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages	Or Haim Anidjar et.al.	2501.00425	translate	read	null
2024-12-30	Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study	Mykola Maslych et.al.	2501.00168	translate	read	null
2024-12-30	DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition	Alexander Polok et.al.	2501.00114	translate	read	null
2024-12-29	EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion	Ashishkumar Gudmalwar et.al.	2412.20359	translate	read	null
2024-12-28	Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting	Wooseok Han et.al.	2412.20155	translate	read	null
2024-12-28	CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation	Ji-Hoon Kim et.al.	2412.20048	translate	read	null
2024-12-27	Enhancing Whisper’s Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization	Kumud Tripathi et.al.	2412.19785	translate	read	null
2024-12-26	Towards a Single ASR Model That Generalizes to Disordered Speech	Jimmy Tobin et.al.	2412.19315	translate	read	null
2024-12-26	VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis	Jaemin Jung et.al.	2412.19259	translate	read	null
2024-12-26	Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference	Yanzhe Zhang et.al.	2412.19068	translate	read	null
2024-12-26	Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization	Yihan Wu et.al.	2412.19005	translate	read	link
2024-12-25	MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI	Neil Shah et.al.	2412.18836	translate	read	null
2024-12-25	Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition	Shujie Hu et.al.	2412.18832	translate	read	null
2024-12-25	Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants	Mequanent Argaw Muluneh et.al.	2412.18784	translate	read	null
2024-12-25	Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis	Zhenqi Jia et.al.	2412.18733	translate	read	null
2024-12-24	Zero-resource Speech Translation and Recognition with LLMs	Karel Mundnich et.al.	2412.18566	translate	read	null
2024-12-23	Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning	Orson Mengara et.al.	2412.17908	translate	read	null
2024-12-23	Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution	Orchid Chetia Phukan et.al.	2412.17796	translate	read	null
2024-12-23	VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music	Jiatong Shi et.al.	2412.17667	translate	read	link
2024-12-23	UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition	Li Fu et.al.	2412.17507	translate	read	null
2024-12-23	Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions	Yindan Luo et.al.	2412.17349	translate	read	null
2024-12-23	Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding	Yueqian Wang et.al.	2412.17295	translate	read	link
2024-12-22	Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization	Natalia Tomashenko et.al.	2412.17164	translate	read	null
2024-12-22	Tandem spoofing-robust automatic speaker verification based on time-domain embeddings	Avishai Weizman et.al.	2412.17133	translate	read	null
2024-12-22	Uncovering the Visual Contribution in Audio-Visual Speech Recognition	Zhaofeng Lin et.al.	2412.17129	translate	read	null
2024-12-22	Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis	Ye-Xin Lu et.al.	2412.16977	translate	read	null
2024-12-22	Autoregressive Speech Synthesis with Next-Distribution Prediction	Xinfa Zhu et.al.	2412.16846	translate	read	null
2024-12-20	MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula	Sieun Hyeon et.al.	2412.15655	translate	read	link
2024-12-20	TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch	Xingchen Song et.al.	2412.15622	translate	read	null
2024-12-19	Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition	Niko Moritz et.al.	2412.15415	translate	read	null
2024-12-19	LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration	Sangmin Lee et.al.	2412.15299	translate	read	null
2024-12-17	Deep Speech Synthesis from Multimodal Articulatory Representations	Peter Wu et.al.	2412.13387	translate	read	null
2024-12-17	CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition	He Wang et.al.	2412.12760	translate	read	null
2024-12-17	Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency	Yu Xi et.al.	2412.12635	translate	read	null
2024-12-17	Hierarchical Control of Emotion Rendering in Speech Synthesis	Sho Inoue et.al.	2412.12498	translate	read	link
2024-12-17	Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback	Kate Knill et.al.	2412.11986	translate	read	null
2024-12-17	Speak & Improve Challenge 2025: Tasks and Baseline Systems	Mengjie Qian et.al.	2412.11985	translate	read	null
2024-12-19	ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis	Xiangheng He et.al.	2412.11795	translate	read	null
2024-12-16	Region-Based Optimization in Continual Learning for Audio Deepfake Detection	Yujie Chen et.al.	2412.11551	translate	read	link
2024-12-16	Towards a Speech Foundation Model for Singapore and Beyond	Muhammad Huzaifah et.al.	2412.11538	translate	read	null
2024-12-15	Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition	Han Zhu et.al.	2412.11185	translate	read	null
2024-12-14	MASV: Speaker Verification with Global and Local Context Mamba	Yang Liu et.al.	2412.10989	translate	read	null
2024-12-14	Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network	Ali Nasr-Esfahani et.al.	2412.10857	translate	read	null
2024-12-14	Efficient Adaptation of Multilingual Models for Japanese ASR	Mark Bajo et.al.	2412.10705	translate	read	null
2024-12-16	Efficient Generative Modeling with Residual Vector Quantization-Based Tokens	Jaehyeon Kim et.al.	2412.10208	translate	read	null
2024-12-13	CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models	Zhihao Du et.al.	2412.10117	translate	read	null
2024-12-13	AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation	Xiyuan Gao et.al.	2412.10103	translate	read	null
2024-12-13	CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls	Li Chai et.al.	2412.09887	translate	read	null
2024-12-13	MERaLiON-AudioLLM: Technical Report	Yingxu He et.al.	2412.09818	translate	read	null
2024-12-12	Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation	Baisen Wang et.al.	2412.09428	translate	read	link
2024-12-12	Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew’s Treatise	Tornike Karchkhadze et.al.	2412.08944	translate	read	null
2024-12-11	Multimodal Latent Language Modeling with Next-Token Diffusion	Yutao Sun et.al.	2412.08635	translate	read	link
2024-12-12	Watermarking Training Data of Music Generation Models	Pascal Epple et.al.	2412.08549	translate	read	null
2024-12-11	Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition	Xiaodong Cui et.al.	2412.08548	translate	read	null
2024-12-11	Zero-Shot Mono-to-Binaural Speech Synthesis	Alon Levkovitch et.al.	2412.08356	translate	read	null
2024-12-11	A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction	Sowmya Cheripally et.al.	2412.08312	translate	read	null
2024-12-10	Frechet Music Distance: A Metric For Generative Symbolic Music Evaluation	Jan Retkowski et.al.	2412.07948	translate	read	null
2024-12-10	Style-agnostic evaluation of ASR using multiple reference transcripts	Quinten McNamara et.al.	2412.07937	translate	read	null
2024-12-09	Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning	Yingyi Ma et.al.	2412.06967	translate	read	null
2024-12-09	MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models	Shansong Liu et.al.	2412.06660	translate	read	link
2024-12-09	Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey	Tianxin Xie et.al.	2412.06602	translate	read	link
2024-12-09	Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer’s Disease Detection	Jiawen Kang et.al.	2412.06332	translate	read	null
2024-12-09	VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features	Sifei Li et.al.	2412.06296	translate	read	null
2024-12-09	Leveraging Prompt Learning and Pause Encoding for Alzheimer’s Disease Detection	Yin-Long Liu et.al.	2412.06259	translate	read	null
2024-12-07	SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR	Pengcheng Guo et.al.	2412.05589	translate	read	null
2024-12-06	Adaptive Dropout for Pruning Conformers	Yotaro Kubo et.al.	2412.04836	translate	read	null
2024-12-10	StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching	Jixun Yao et.al.	2412.04724	translate	read	null
2024-12-05	Missing Melodies: AI Music Generation and its “Nearly” Complete Omission of the Global South	Atharva Mehta et.al.	2412.04100	translate	read	null
2024-12-05	Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding	Vakada Naveen et.al.	2412.03980	translate	read	null
2024-12-05	Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech	Yerin Choi et.al.	2412.03784	translate	read	null
2024-12-04	ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction	Victor Junqiu Wei et.al.	2412.03075	translate	read	null
2024-12-04	Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model	Joonyong Park et.al.	2412.03074	translate	read	null
2024-12-03	GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot	Aohan Zeng et.al.	2412.02612	translate	read	link
2024-12-01	Late fusion ensembles for speech recognition on diverse input audio representations	Marin Jezidžić et.al.	2412.01861	translate	read	null
2024-12-02	Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification	Bei Liu et.al.	2412.01195	translate	read	null
2024-12-01	Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment	Firdavs Nasriddinov et.al.	2412.00760	translate	read	link
2024-12-04	A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario	Zheshu Song et.al.	2412.00721	translate	read	null
2024-12-02	CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion	Yuke Li et.al.	2411.18918	translate	read	null
