Audio Processing - 2024-12
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-12-31 | Fotheidil: an Automatic Transcription System for the Irish Language | Liam Lonergan et.al. | 2501.00509 | translate | read | null |
| 2024-12-31 | Unrolled Creative Adversarial Network For Generating Novel Musical Pieces | Pratik Nag et.al. | 2501.00452 | translate | read | null |
| 2024-12-31 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages | Or Haim Anidjar et.al. | 2501.00425 | translate | read | null |
| 2024-12-30 | Takeaways from Applying LLM Capabilities to Multiple Conversational Avatars in a VR Pilot Study | Mykola Maslych et.al. | 2501.00168 | translate | read | null |
| 2024-12-30 | DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition | Alexander Polok et.al. | 2501.00114 | translate | read | null |
| 2024-12-29 | EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion | Ashishkumar Gudmalwar et.al. | 2412.20359 | translate | read | null |
| 2024-12-28 | Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting | Wooseok Han et.al. | 2412.20155 | translate | read | null |
| 2024-12-28 | CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation | Ji-Hoon Kim et.al. | 2412.20048 | translate | read | null |
| 2024-12-27 | Enhancing Whisper’s Accuracy and Speed for Indian Languages through Prompt-Tuning and Tokenization | Kumud Tripathi et.al. | 2412.19785 | translate | read | null |
| 2024-12-26 | Towards a Single ASR Model That Generalizes to Disordered Speech | Jimmy Tobin et.al. | 2412.19315 | translate | read | null |
| 2024-12-26 | VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis | Jaemin Jung et.al. | 2412.19259 | translate | read | null |
| 2024-12-26 | Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference | Yanzhe Zhang et.al. | 2412.19068 | translate | read | null |
| 2024-12-26 | Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization | Yihan Wu et.al. | 2412.19005 | translate | read | link |
| 2024-12-25 | MRI2Speech: Speech Synthesis from Articulatory Movements Recorded by Real-time MRI | Neil Shah et.al. | 2412.18836 | translate | read | null |
| 2024-12-25 | Structured Speaker-Deficiency Adaptation of Foundation Models for Dysarthric and Elderly Speech Recognition | Shujie Hu et.al. | 2412.18832 | translate | read | null |
| 2024-12-25 | Zema Dataset: A Comprehensive Study of Yaredawi Zema with a Focus on Horologium Chants | Mequanent Argaw Muluneh et.al. | 2412.18784 | translate | read | null |
| 2024-12-25 | Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis | Zhenqi Jia et.al. | 2412.18733 | translate | read | null |
| 2024-12-24 | Zero-resource Speech Translation and Recognition with LLMs | Karel Mundnich et.al. | 2412.18566 | translate | read | null |
| 2024-12-23 | Trading Devil RL: Backdoor attack via Stock market, Bayesian Optimization and Reinforcement Learning | Orson Mengara et.al. | 2412.17908 | translate | read | null |
| 2024-12-23 | Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution | Orchid Chetia Phukan et.al. | 2412.17796 | translate | read | null |
| 2024-12-23 | VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music | Jiatong Shi et.al. | 2412.17667 | translate | read | link |
| 2024-12-23 | UME: Upcycling Mixture-of-Experts for Scalable and Efficient Automatic Speech Recognition | Li Fu et.al. | 2412.17507 | translate | read | null |
| 2024-12-23 | Deep Learning in Proteomics Informatics: Applications, Challenges, and Future Directions | Yindan Luo et.al. | 2412.17349 | translate | read | null |
| 2024-12-23 | Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding | Yueqian Wang et.al. | 2412.17295 | translate | read | link |
| 2024-12-22 | Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization | Natalia Tomashenko et.al. | 2412.17164 | translate | read | null |
| 2024-12-22 | Tandem spoofing-robust automatic speaker verification based on time-domain embeddings | Avishai Weizman et.al. | 2412.17133 | translate | read | null |
| 2024-12-22 | Uncovering the Visual Contribution in Audio-Visual Speech Recognition | Zhaofeng Lin et.al. | 2412.17129 | translate | read | null |
| 2024-12-22 | Incremental Disentanglement for Environment-Aware Zero-Shot Text-to-Speech Synthesis | Ye-Xin Lu et.al. | 2412.16977 | translate | read | null |
| 2024-12-22 | Autoregressive Speech Synthesis with Next-Distribution Prediction | Xinfa Zhu et.al. | 2412.16846 | translate | read | null |
| 2024-12-20 | MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula | Sieun Hyeon et.al. | 2412.15655 | translate | read | link |
| 2024-12-20 | TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch | Xingchen Song et.al. | 2412.15622 | translate | read | null |
| 2024-12-19 | Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition | Niko Moritz et.al. | 2412.15415 | translate | read | null |
| 2024-12-19 | LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration | Sangmin Lee et.al. | 2412.15299 | translate | read | null |
| 2024-12-17 | Deep Speech Synthesis from Multimodal Articulatory Representations | Peter Wu et.al. | 2412.13387 | translate | read | null |
| 2024-12-17 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition | He Wang et.al. | 2412.12760 | translate | read | null |
| 2024-12-17 | Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency | Yu Xi et.al. | 2412.12635 | translate | read | null |
| 2024-12-17 | Hierarchical Control of Emotion Rendering in Speech Synthesis | Sho Inoue et.al. | 2412.12498 | translate | read | link |
| 2024-12-17 | Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback | Kate Knill et.al. | 2412.11986 | translate | read | null |
| 2024-12-17 | Speak & Improve Challenge 2025: Tasks and Baseline Systems | Mengjie Qian et.al. | 2412.11985 | translate | read | null |
| 2024-12-19 | ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis | Xiangheng He et.al. | 2412.11795 | translate | read | null |
| 2024-12-16 | Region-Based Optimization in Continual Learning for Audio Deepfake Detection | Yujie Chen et.al. | 2412.11551 | translate | read | link |
| 2024-12-16 | Towards a Speech Foundation Model for Singapore and Beyond | Muhammad Huzaifah et.al. | 2412.11538 | translate | read | null |
| 2024-12-15 | Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition | Han Zhu et.al. | 2412.11185 | translate | read | null |
| 2024-12-14 | MASV: Speaker Verification with Global and Local Context Mamba | Yang Liu et.al. | 2412.10989 | translate | read | null |
| 2024-12-14 | Robust Recognition of Persian Isolated Digits in Speech using Deep Neural Network | Ali Nasr-Esfahani et.al. | 2412.10857 | translate | read | null |
| 2024-12-14 | Efficient Adaptation of Multilingual Models for Japanese ASR | Mark Bajo et.al. | 2412.10705 | translate | read | null |
| 2024-12-16 | Efficient Generative Modeling with Residual Vector Quantization-Based Tokens | Jaehyeon Kim et.al. | 2412.10208 | translate | read | null |
| 2024-12-13 | CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models | Zhihao Du et.al. | 2412.10117 | translate | read | null |
| 2024-12-13 | AMuSeD: An Attentive Deep Neural Network for Multimodal Sarcasm Detection Incorporating Bi-modal Data Augmentation | Xiyuan Gao et.al. | 2412.10103 | translate | read | null |
| 2024-12-13 | CSL-L2M: Controllable Song-Level Lyric-to-Melody Generation Based on Conditional Transformer with Fine-Grained Lyric and Musical Controls | Li Chai et.al. | 2412.09887 | translate | read | null |
| 2024-12-13 | MERaLiON-AudioLLM: Technical Report | Yingxu He et.al. | 2412.09818 | translate | read | null |
| 2024-12-12 | Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation | Baisen Wang et.al. | 2412.09428 | translate | read | link |
| 2024-12-12 | Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew’s Treatise | Tornike Karchkhadze et.al. | 2412.08944 | translate | read | null |
| 2024-12-11 | Multimodal Latent Language Modeling with Next-Token Diffusion | Yutao Sun et.al. | 2412.08635 | translate | read | link |
| 2024-12-12 | Watermarking Training Data of Music Generation Models | Pascal Epple et.al. | 2412.08549 | translate | read | null |
| 2024-12-11 | Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition | Xiaodong Cui et.al. | 2412.08548 | translate | read | null |
| 2024-12-11 | Zero-Shot Mono-to-Binaural Speech Synthesis | Alon Levkovitch et.al. | 2412.08356 | translate | read | null |
| 2024-12-11 | A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction | Sowmya Cheripally et.al. | 2412.08312 | translate | read | null |
| 2024-12-10 | Frechet Music Distance: A Metric For Generative Symbolic Music Evaluation | Jan Retkowski et.al. | 2412.07948 | translate | read | null |
| 2024-12-10 | Style-agnostic evaluation of ASR using multiple reference transcripts | Quinten McNamara et.al. | 2412.07937 | translate | read | null |
| 2024-12-09 | Effective Text Adaptation for LLM-based ASR through Soft Prompt Fine-Tuning | Yingyi Ma et.al. | 2412.06967 | translate | read | null |
| 2024-12-09 | MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large Language Models | Shansong Liu et.al. | 2412.06660 | translate | read | link |
| 2024-12-09 | Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey | Tianxin Xie et.al. | 2412.06602 | translate | read | link |
| 2024-12-09 | Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer’s Disease Detection | Jiawen Kang et.al. | 2412.06332 | translate | read | null |
| 2024-12-09 | VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features | Sifei Li et.al. | 2412.06296 | translate | read | null |
| 2024-12-09 | Leveraging Prompt Learning and Pause Encoding for Alzheimer’s Disease Detection | Yin-Long Liu et.al. | 2412.06259 | translate | read | null |
| 2024-12-07 | SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR | Pengcheng Guo et.al. | 2412.05589 | translate | read | null |
| 2024-12-06 | Adaptive Dropout for Pruning Conformers | Yotaro Kubo et.al. | 2412.04836 | translate | read | null |
| 2024-12-10 | StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching | Jixun Yao et.al. | 2412.04724 | translate | read | null |
| 2024-12-05 | Missing Melodies: AI Music Generation and its “Nearly” Complete Omission of the Global South | Atharva Mehta et.al. | 2412.04100 | translate | read | null |
| 2024-12-05 | Comprehensive Audio Query Handling System with Integrated Expert Models and Contextual Understanding | Vakada Naveen et.al. | 2412.03980 | translate | read | null |
| 2024-12-05 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech | Yerin Choi et.al. | 2412.03784 | translate | read | null |
| 2024-12-04 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction | Victor Junqiu Wei et.al. | 2412.03075 | translate | read | null |
| 2024-12-04 | Analytic Study of Text-Free Speech Synthesis for Raw Audio using a Self-Supervised Learning Model | Joonyong Park et.al. | 2412.03074 | translate | read | null |
| 2024-12-03 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot | Aohan Zeng et.al. | 2412.02612 | translate | read | link |
| 2024-12-01 | Late fusion ensembles for speech recognition on diverse input audio representations | Marin Jezidžić et.al. | 2412.01861 | translate | read | null |
| 2024-12-02 | Memory-Efficient Training for Deep Speaker Embedding Learning in Speaker Verification | Bei Liu et.al. | 2412.01195 | translate | read | null |
| 2024-12-01 | Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment | Firdavs Nasriddinov et.al. | 2412.00760 | translate | read | link |
| 2024-12-04 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario | Zheshu Song et.al. | 2412.00721 | translate | read | null |
| 2024-12-02 | CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion | Yuke Li et.al. | 2411.18918 | translate | read | null |