Audio Processing - 2024-12

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-12-31 | Fotheidil: an Automatic Transcription System for the Irish Language | Liam Lonergan et.al. | 2501.00509 | translate | read | null |
| 2024-12-31 | Unrolled Creative Adversarial Network For Generating Novel Musical Pieces | Pratik Nag et.al. | 2501.00452 | translate | read | null |
| 2024-12-31 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages | Or Haim Anidjar et.al. | 2501.00425 | translate | read | null |
| 2024-12-30 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | translate | read | null |
| 2024-12-30 | DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition | Alexander Polok et.al. | 2501.00114 | translate | read | null |
| 2024-12-29 | EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion | Ashishkumar Gudmalwar et.al. | 2412.20359 | translate | read | null |
| 2024-12-28 | Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Wooseok Han et.al. | 2412.20155 | translate | read | null |
| 2024-12-28 | CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation | Ji-Hoon Kim et.al. | 2412.20048 | translate | read | null |
| 2024-12-27 | Enhancing Whisper’s Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization | Kumud Tripathi et.al. | 2412.19785 | translate | read | null |
| 2024-12-26 | Towards a Single ASR Model That Generalizes to Disordered Speech | Jimmy Tobin et.al. | 2412.19315 | translate | read | null |
| 2024-12-26 | VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis | Jaemin Jung et.al. | 2412.19259 | translate | read | null |
| 2024-12-26 | Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference | Yanzhe Zhang et.al. | 2412.19068 | translate | read | null |
| 2024-12-26 | Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization | Yihan Wu et.al. | 2412.19005 | translate | read | link |
| 2024-12-25 | MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI | Neil Shah et.al. | 2412.18836 | translate | read | null |
| 2024-12-25 | Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition | Shujie Hu et.al. | 2412.18832 | translate | read | null |
| 2024-12-25 | Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants | Mequanent Argaw Muluneh et.al. | 2412.18784 | translate | read | null |
| 2024-12-25 | Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis | Zhenqi Jia et.al. | 2412.18733 | translate | read | null |
| 2024-12-24 | Zero-resource Speech Translation and Recognition with LLMs | Karel Mundnich et.al. | 2412.18566 | translate | read | null |
| 2024-12-23 | Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning | Orson Mengara et.al. | 2412.17908 | translate | read | null |
| 2024-12-23 | Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution | Orchid Chetia Phukan et.al. | 2412.17796 | translate | read | null |
| 2024-12-23 | VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music | Jiatong Shi et.al. | 2412.17667 | translate | read | link |
| 2024-12-23 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition | Li Fu et.al. | 2412.17507 | translate | read | null |
| 2024-12-23 | Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions | Yindan Luo et.al. | 2412.17349 | translate | read | null |
| 2024-12-23 | Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding | Yueqian Wang et.al. | 2412.17295 | translate | read | link |
| 2024-12-22 | Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization | Natalia Tomashenko et.al. | 2412.17164 | translate | read | null |
| 2024-12-22 | Tandem spoofing-robust automatic speaker verification based on time-domain embeddings | Avishai Weizman et.al. | 2412.17133 | translate | read | null |
| 2024-12-22 | Uncovering the Visual Contribution in Audio-Visual Speech Recognition | Zhaofeng Lin et.al. | 2412.17129 | translate | read | null |
| 2024-12-22 | Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2412.16977 | translate | read | null |
| 2024-12-22 | Autoregressive Speech Synthesis with Next-Distribution Prediction | Xinfa Zhu et.al. | 2412.16846 | translate | read | null |
| 2024-12-20 | MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula | Sieun Hyeon et.al. | 2412.15655 | translate | read | link |
| 2024-12-20 | TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch | Xingchen Song et.al. | 2412.15622 | translate | read | null |
| 2024-12-19 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition | Niko Moritz et.al. | 2412.15415 | translate | read | null |
| 2024-12-19 | LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration | Sangmin Lee et.al. | 2412.15299 | translate | read | null |
| 2024-12-17 | Deep Speech Synthesis from Multimodal Articulatory Representations | Peter Wu et.al. | 2412.13387 | translate | read | null |
| 2024-12-17 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition | He Wang et.al. | 2412.12760 | translate | read | null |
| 2024-12-17 | Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency | Yu Xi et.al. | 2412.12635 | translate | read | null |
| 2024-12-17 | Hierarchical Control of Emotion Rendering in Speech Synthesis | Sho Inoue et.al. | 2412.12498 | translate | read | link |
| 2024-12-17 | Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback | Kate Knill et.al. | 2412.11986 | translate | read | null |
| 2024-12-17 | Speak & Improve Challenge 2025: Tasks and Baseline Systems | Mengjie Qian et.al. | 2412.11985 | translate | read | null |
| 2024-12-19 | ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Xiangheng He et.al. | 2412.11795 | translate | read | null |
| 2024-12-16 | Region-Based Optimization in Continual Learning for Audio Deepfake Detection | Yujie Chen et.al. | 2412.11551 | translate | read | link |
| 2024-12-16 | Towards a Speech Foundation Model for Singapore and Beyond | Muhammad Huzaifah et.al. | 2412.11538 | translate | read | null |
| 2024-12-15 | Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition | Han Zhu et.al. | 2412.11185 | translate | read | null |
| 2024-12-14 | MASV: Speaker Verification with Global and Local Context Mamba | Yang Liu et.al. | 2412.10989 | translate | read | null |
| 2024-12-14 | Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network | Ali Nasr-Esfahani et.al. | 2412.10857 | translate | read | null |
| 2024-12-14 | Efficient Adaptation of Multilingual Models for Japanese ASR | Mark Bajo et.al. | 2412.10705 | translate | read | null |
| 2024-12-16 | Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Jaehyeon Kim et.al. | 2412.10208 | translate | read | null |
| 2024-12-13 | CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models | Zhihao Du et.al. | 2412.10117 | translate | read | null |
| 2024-12-13 | AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Xiyuan Gao et.al. | 2412.10103 | translate | read | null |
| 2024-12-13 | CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls | Li Chai et.al. | 2412.09887 | translate | read | null |
| 2024-12-13 | MERaLiON-AudioLLM: Technical Report | Yingxu He et.al. | 2412.09818 | translate | read | null |
| 2024-12-12 | Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Baisen Wang et.al. | 2412.09428 | translate | read | link |
| 2024-12-12 | Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew’s Treatise | Tornike Karchkhadze et.al. | 2412.08944 | translate | read | null |
| 2024-12-11 | Multimodal Latent Language Modeling with Next-Token Diffusion | Yutao Sun et.al. | 2412.08635 | translate | read | link |
| 2024-12-12 | Watermarking Training Data of Music Generation Models | Pascal Epple et.al. | 2412.08549 | translate | read | null |
| 2024-12-11 | Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition | Xiaodong Cui et.al. | 2412.08548 | translate | read | null |
| 2024-12-11 | Zero-Shot Mono-to-Binaural Speech Synthesis | Alon Levkovitch et.al. | 2412.08356 | translate | read | null |
| 2024-12-11 | A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Sowmya Cheripally et.al. | 2412.08312 | translate | read | null |
| 2024-12-10 | Frechet Music Distance: A Metric For Generative Symbolic Music Evaluation | Jan Retkowski et.al. | 2412.07948 | translate | read | null |
| 2024-12-10 | Style-agnostic evaluation of ASR using multiple reference transcripts | Quinten McNamara et.al. | 2412.07937 | translate | read | null |
| 2024-12-09 | Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning | Yingyi Ma et.al. | 2412.06967 | translate | read | null |
| 2024-12-09 | MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models | Shansong Liu et.al. | 2412.06660 | translate | read | link |
| 2024-12-09 | Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Tianxin Xie et.al. | 2412.06602 | translate | read | link |
| 2024-12-09 | Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer’s Disease Detection | Jiawen Kang et.al. | 2412.06332 | translate | read | null |
| 2024-12-09 | VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features | Sifei Li et.al. | 2412.06296 | translate | read | null |
| 2024-12-09 | Leveraging Prompt Learning and Pause Encoding for Alzheimer’s Disease Detection | Yin-Long Liu et.al. | 2412.06259 | translate | read | null |
| 2024-12-07 | SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR | Pengcheng Guo et.al. | 2412.05589 | translate | read | null |
| 2024-12-06 | Adaptive Dropout for Pruning Conformers | Yotaro Kubo et.al. | 2412.04836 | translate | read | null |
| 2024-12-10 | StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching | Jixun Yao et.al. | 2412.04724 | translate | read | null |
| 2024-12-05 | Missing Melodies: AI Music Generation and its “Nearly” Complete Omission of the Global South | Atharva Mehta et.al. | 2412.04100 | translate | read | null |
| 2024-12-05 | Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding | Vakada Naveen et.al. | 2412.03980 | translate | read | null |
| 2024-12-05 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech | Yerin Choi et.al. | 2412.03784 | translate | read | null |
| 2024-12-04 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction | Victor Junqiu Wei et.al. | 2412.03075 | translate | read | null |
| 2024-12-04 | Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model | Joonyong Park et.al. | 2412.03074 | translate | read | null |
| 2024-12-03 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Aohan Zeng et.al. | 2412.02612 | translate | read | link |
| 2024-12-01 | Late fusion ensembles for speech recognition on diverse input audio representations | Marin Jezidžić et.al. | 2412.01861 | translate | read | null |
| 2024-12-02 | Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification | Bei Liu et.al. | 2412.01195 | translate | read | null |
| 2024-12-01 | Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment | Firdavs Nasriddinov et.al. | 2412.00760 | translate | read | link |
| 2024-12-04 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario | Zheshu Song et.al. | 2412.00721 | translate | read | null |
| 2024-12-02 | CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion | Yuke Li et.al. | 2411.18918 | translate | read | null |
