Audio Processing - 2024-09

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-09-30 | Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding | Takafumi Moriya et al. | 2409.20313 | translate | read | null |
| 2024-09-30 | Alignment-Free Training for Transducer-based Multi-Talker ASR | Takafumi Moriya et al. | 2409.20301 | translate | read | null |
| 2024-09-30 | AfriHuBERT: A self-supervised speech representation model for African languages | Jesujoba O. Alabi et al. | 2409.20201 | translate | read | null |
| 2024-09-30 | Melody Is All You Need For Music Generation | Shaopeng Wei et al. | 2409.20196 | translate | read | link |
| 2024-09-30 | Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems | Oswald Zink et al. | 2409.19990 | translate | read | null |
| 2024-09-30 | HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models | Bingshen Mu et al. | 2409.19878 | translate | read | null |
| 2024-09-29 | Fine-Tuning Automatic Speech Recognition for People with Parkinson’s: An Effective Strategy for Enhancing Speech Technology Accessibility | Xiuwen Zheng et al. | 2409.19818 | translate | read | null |
| 2024-09-29 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning | Hao Yen et al. | 2409.19757 | translate | read | null |
| 2024-09-29 | Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective | Chen Chen et al. | 2409.19575 | translate | read | null |
| 2024-09-29 | CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought | Yexing Du et al. | 2409.19510 | translate | read | link |
| 2024-09-27 | Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models | Xiaoxue Gao et al. | 2409.18654 | translate | read | null |
| 2024-09-27 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 | Jiaming Zhou et al. | 2409.18584 | translate | read | null |
| 2024-09-27 | EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis | Haoyu Wang et al. | 2409.18512 | translate | read | null |
| 2024-09-27 | Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking | Brian Yan et al. | 2409.18428 | translate | read | null |
| 2024-09-26 | Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey et al. | 2409.18044 | translate | read | null |
| 2024-09-26 | Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study | Keyu An et al. | 2409.17750 | translate | read | null |
| 2024-09-26 | Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition | Keyu An et al. | 2409.17746 | translate | read | null |
| 2024-09-26 | Deep CLAS: Deep Contextual Listen, Attend and Spell | Shifu Xiong et al. | 2409.17603 | translate | read | null |
| 2024-09-25 | Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion | Giuseppe Ruggiero et al. | 2409.17387 | translate | read | null |
| 2024-09-25 | Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Lucas H. Ueda et al. | 2409.17364 | translate | read | null |
| 2024-09-25 | How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not | Francesco Verdini et al. | 2409.17044 | translate | read | null |
| 2024-09-25 | MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events | Xiaoyu Yang et al. | 2409.17010 | translate | read | null |
| 2024-09-25 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition | Andrés Piñeiro-Martín et al. | 2409.16954 | translate | read | null |
| 2024-09-25 | Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling | Yuanchao Li et al. | 2409.16937 | translate | read | link |
| 2024-09-25 | Speech Recognition Rescoring with Large Speech-Text Foundation Models | Prashanth Gurunath Shivakumar et al. | 2409.16654 | translate | read | null |
| 2024-09-24 | Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices | Leonid Velikovich et al. | 2409.16469 | translate | read | null |
| 2024-09-24 | FastTalker: Jointly Generating Speech and Conversational Gestures from Text | Zixin Guo et al. | 2409.16404 | translate | read | null |
| 2024-09-24 | Revisiting Acoustic Features for Robust ASR | Muhammad A. Shah et al. | 2409.16399 | translate | read | null |
| 2024-09-24 | Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Yunji Chu et al. | 2409.16203 | translate | read | null |
| 2024-09-24 | ComiCap: A VLMs pipeline for dense captioning of Comic Panels | Emanuele Vivoli et al. | 2409.16159 | translate | read | link |
| 2024-09-24 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs | Yang Yuhang et al. | 2409.16005 | translate | read | null |
| 2024-09-24 | Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification | Fengrun Zhang et al. | 2409.15974 | translate | read | null |
| 2024-09-24 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM | Fengrun Zhang et al. | 2409.15905 | translate | read | null |
| 2024-09-24 | Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization | Sotheara Leang et al. | 2409.15882 | translate | read | null |
| 2024-09-24 | WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction | Shuai Wang et al. | 2409.15799 | translate | read | null |
| 2024-09-24 | M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions | Shuai Wang et al. | 2409.15782 | translate | read | null |
| 2024-09-24 | Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample | Zhiyong Chen et al. | 2409.15742 | translate | read | null |
| 2024-09-24 | StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Zhiyong Chen et al. | 2409.15741 | translate | read | null |
| 2024-09-19 | WeHelp: A Shared Autonomy System for Wheelchair Users | Abulikemu Abuduweili et al. | 2409.12159 | translate | read | link |
| 2024-09-18 | ASR Benchmarking: Need for a More Representative Conversational Dataset | Gaurav Maheshwari et al. | 2409.12042 | translate | read | link |
| 2024-09-18 | Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0 | Zhiyong Wang et al. | 2409.11909 | translate | read | null |
| 2024-09-18 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper | Jiaming Zhou et al. | 2409.11889 | translate | read | null |
| 2024-09-18 | METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation | Dinh-Viet-Toan Le et al. | 2409.11753 | translate | read | link |
| 2024-09-19 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations | Haopeng Geng et al. | 2409.11742 | translate | read | null |
| 2024-09-17 | Discrete Unit based Masking for Improving Disentanglement in Voice Conversion | Philip H. Lee et al. | 2409.11560 | translate | read | null |
| 2024-09-17 | Chain-of-Thought Prompting for Speech Translation | Ke Hu et al. | 2409.11538 | translate | read | null |
| 2024-09-17 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses | Yufeng Yang et al. | 2409.11494 | translate | read | null |
| 2024-09-17 | Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models | Jiahao Qin et al. | 2409.11263 | translate | read | null |
| 2024-09-17 | WER We Stand: Benchmarking Urdu ASR Models | Samee Arif et al. | 2409.11252 | translate | read | null |
| 2024-09-17 | Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text | Hongfei Xue et al. | 2409.11214 | translate | read | null |
| 2024-09-17 | Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Francesco Nespoli et al. | 2409.11107 | translate | read | null |
| 2024-09-17 | Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation | Gerard I. Gállego et al. | 2409.11003 | translate | read | null |
| 2024-09-17 | Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models | Potsawee Manakul et al. | 2409.10999 | translate | read | null |
| 2024-09-17 | Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data | Jing Xu et al. | 2409.10969 | translate | read | null |
| 2024-09-17 | Speech Recognition for Analysis of Police Radio Communication | Tejes Srivastava et al. | 2409.10858 | translate | read | null |
| 2024-09-17 | PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing | Phillip Long et al. | 2409.10831 | translate | read | null |
| 2024-09-16 | Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels | Zakaria Aldeneh et al. | 2409.10791 | translate | read | null |
| 2024-09-16 | An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems | Hitesh Tulsiani et al. | 2409.10515 | translate | read | null |
| 2024-09-16 | Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages | Ming-Hao Hsu et al. | 2409.10429 | translate | read | null |
| 2024-09-16 | Voice control interface for surgical robot assistants | Ana Davila et al. | 2409.10225 | translate | read | null |
| 2024-09-16 | Augmenting Automatic Speech Recognition Models with Disfluency Detection | Robin Amann et al. | 2409.10177 | translate | read | null |
| 2024-09-16 | Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Xiaoxue Gao et al. | 2409.10157 | translate | read | null |
| 2024-09-16 | Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge | Shuiyun Liu et al. | 2409.10076 | translate | read | null |
| 2024-09-16 | Speaker Contrastive Learning for Source Speaker Tracing | Qing Wang et al. | 2409.10072 | translate | read | null |
| 2024-09-16 | StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li et al. | 2409.10058 | translate | read | null |
| 2024-09-16 | A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models | Ryandhimas E. Zezario et al. | 2409.09914 | translate | read | null |
| 2024-09-15 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition | Chao-Han Huck Yang et al. | 2409.09785 | translate | read | null |
| 2024-09-13 | Clean Label Attacks against SLU Systems | Henry Li Xinyuan et al. | 2409.08985 | translate | read | null |
| 2024-09-13 | HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Henry Li Xinyuan et al. | 2409.08913 | translate | read | null |
| 2024-09-13 | Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages | Yao-Fei Cheng et al. | 2409.08872 | translate | read | null |
| 2024-09-13 | Exploring SSL Discrete Tokens for Multilingual ASR | Mingyu Cui et al. | 2409.08805 | translate | read | null |
| 2024-09-13 | Text-To-Speech Synthesis In The Wild | Jee-weon Jung et al. | 2409.08711 | translate | read | null |
| 2024-09-13 | NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | Minglun Han et al. | 2409.08680 | translate | read | null |
| 2024-09-13 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation | Shaojun Li et al. | 2409.08597 | translate | read | null |
| 2024-09-13 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions | Lingwei Meng et al. | 2409.08596 | translate | read | link |
| 2024-09-13 | LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling | Yubo Huang et al. | 2409.08583 | translate | read | null |
| 2024-09-13 | LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study | Mahta Fetrat Qharabagh et al. | 2409.08554 | translate | read | null |
| 2024-09-12 | Hierarchical Symbolic Pop Music Generation with Graph Neural Networks | Wen Qing Lim et al. | 2409.08155 | translate | read | null |
| 2024-09-12 | Faster Speech-LLaMA Inference with Multi-token Prediction | Desh Raj et al. | 2409.08148 | translate | read | null |
| 2024-09-12 | WhisperNER: Unified Open Named Entity and Speech Recognition | Gil Ayache et al. | 2409.08107 | translate | read | null |
| 2024-09-12 | The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language | Michael Ong et al. | 2409.08103 | translate | read | null |
| 2024-09-12 | Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations | Wangjin Zhou et al. | 2409.08039 | translate | read | null |
| 2024-09-12 | Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction | Xiangyu Zhang et al. | 2409.07969 | translate | read | null |
| 2024-09-12 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models | Nikolai L. Kühne et al. | 2409.07936 | translate | read | null |
| 2024-09-12 | Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning | Elizabeth Wilson et al. | 2409.07918 | translate | read | null |
| 2024-09-12 | Bridging Paintings and Music – Exploring Emotion based Music Generation through Paintings | Tanisha Hisariya et al. | 2409.07827 | translate | read | null |
| 2024-09-12 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Zhiyuan Tang et al. | 2409.07790 | translate | read | null |
| 2024-09-11 | VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos | Yan-Bo Lin et al. | 2409.07450 | translate | read | null |
| 2024-09-11 | D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack | Hong-Hanh Nguyen-Le et al. | 2409.07390 | translate | read | null |
| 2024-09-11 | Rethinking Mamba in Speech Processing by Self-Supervised Models | Xiangyu Zhang et al. | 2409.07273 | translate | read | null |
| 2024-09-11 | ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages | Mahta Fetrat Qharabagh et al. | 2409.07259 | translate | read | null |
| 2024-09-11 | Enhancing CTC-Based Visual Speech Recognition | Hendrik Laux et al. | 2409.07210 | translate | read | null |
| 2024-09-11 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition | Titouan Parcollet et al. | 2409.07165 | translate | read | null |
| 2024-09-11 | The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction | Wen-Chin Huang et al. | 2409.07001 | translate | read | null |
| 2024-09-10 | An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition | Yi-Cheng Wang et al. | 2409.06468 | translate | read | null |
| 2024-09-10 | What happens to diffusion model likelihood when your model is conditional? | Mattias Cross et al. | 2409.06364 | translate | read | null |
| 2024-09-10 | VoiceWukong: Benchmarking Deepfake Voice Detection | Ziwei Yan et al. | 2409.06348 | translate | read | null |
| 2024-09-10 | Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches | Chang Zeng et al. | 2409.06327 | translate | read | null |
| 2024-09-10 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking | Jihyun Lee et al. | 2409.06263 | translate | read | null |
| 2024-09-10 | RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion | Wei Chen et al. | 2409.06237 | translate | read | null |
| 2024-09-10 | Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings | Sakshi Deo Shukla et al. | 2409.06222 | translate | read | null |
| 2024-09-10 | Multi-Source Music Generation with Latent Diffusion | Zhongweiyang Xu et al. | 2409.06190 | translate | read | link |
| 2024-09-10 | VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion | Kyungguen Byun et al. | 2409.06126 | translate | read | null |
| 2024-09-09 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors | Ernest Pusateri et al. | 2409.06062 | translate | read | null |
| 2024-09-09 | PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification | Massa Baali et al. | 2409.05799 | translate | read | null |
| 2024-09-09 | Consensus-based Distributed Quantum Kernel Learning for Speech Recognition | Kuan-Cheng Chen et al. | 2409.05770 | translate | read | null |
| 2024-09-09 | A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR | Giovanni Morrone et al. | 2409.05750 | translate | read | null |
| 2024-09-09 | AS-Speech: Adaptive Style For Speech Synthesis | Zhipeng Li et al. | 2409.05730 | translate | read | null |
| 2024-09-09 | Evaluation of real-time transcriptions using end-to-end ASR models | Carlos Arriaga et al. | 2409.05674 | translate | read | null |
| 2024-09-09 | Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation | Nithin Rao Koluguri et al. | 2409.05601 | translate | read | null |
| 2024-09-09 | An investigation of modularity for noise robustness in conformer-based ASR | Louise Coppieters de Gibson et al. | 2409.05589 | translate | read | null |
| 2024-09-09 | NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge | Naoyuki Kamo et al. | 2409.05554 | translate | read | null |
| 2024-09-09 | Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge | Hongfei Xue et al. | 2409.05430 | translate | read | null |
| 2024-09-08 | Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection | Theophile Stourbe et al. | 2409.05032 | translate | read | null |
| 2024-09-05 | Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization | Zexin Cai et al. | 2409.03655 | translate | read | null |
| 2024-09-05 | DiffEVC: Any-to-Any Emotion Voice Conversion with Expressive Guidance | Hsing-Hang Chou et al. | 2409.03636 | translate | read | null |
| 2024-09-05 | Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder | Yuying Xie et al. | 2409.03520 | translate | read | null |
| 2024-09-04 | Probing self-attention in self-supervised speech models for cross-linguistic differences | Sai Gopinath et al. | 2409.03115 | translate | read | null |
| 2024-09-04 | Quantification of stylistic differences in human- and ASR-produced transcripts of African American English | Annika Heuser et al. | 2409.03059 | translate | read | null |
| 2024-09-04 | SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints | Haonan Chen et al. | 2409.03055 | translate | read | null |
| 2024-09-04 | Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model | Tornike Karchkhadze et al. | 2409.02845 | translate | read | null |
| 2024-09-04 | Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models | Jakob Poncelet et al. | 2409.02565 | translate | read | null |
| 2024-09-04 | Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm | Sidonie Foulon et al. | 2409.02477 | translate | read | null |
| 2024-09-04 | Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP | Yisi Liu et al. | 2409.02451 | translate | read | null |
| 2024-09-04 | What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations | Kavya Manohar et al. | 2409.02449 | translate | read | null |
| 2024-09-04 | MusicMamba: A Dual-Feature Modeling Approach for Generating Chinese Traditional Music with Modal Precision | Jiatao Chen et al. | 2409.02421 | translate | read | link |
| 2024-09-03 | FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation | Takuhiro Kaneko et al. | 2409.02245 | translate | read | null |
| 2024-09-03 | Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR | Xugang Lu et al. | 2409.02239 | translate | read | null |
| 2024-09-03 | Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model | Hukai Huang et al. | 2409.02050 | translate | read | null |
| 2024-09-03 | The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge | Shutong Niu et al. | 2409.02041 | translate | read | null |
