Audio Processing - 2024-09
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-09-30 | Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding | Takafumi Moriya et.al. | 2409.20313 | translate | read | null |
| 2024-09-30 | Alignment-Free Training for Transducer-based Multi-Talker ASR | Takafumi Moriya et.al. | 2409.20301 | translate | read | null |
| 2024-09-30 | AfriHuBERT: A self-supervised speech representation model for African languages | Jesujoba O. Alabi et.al. | 2409.20201 | translate | read | null |
| 2024-09-30 | Melody Is All You Need For Music Generation | Shaopeng Wei et.al. | 2409.20196 | translate | read | link |
| 2024-09-30 | Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems | Oswald Zink et.al. | 2409.19990 | translate | read | null |
| 2024-09-30 | HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models | Bingshen Mu et.al. | 2409.19878 | translate | read | null |
| 2024-09-29 | Fine-Tuning Automatic Speech Recognition for People with Parkinson’s: An Effective Strategy for Enhancing Speech Technology Accessibility | Xiuwen Zheng et.al. | 2409.19818 | translate | read | null |
| 2024-09-29 | Efficient Long-Form Speech Recognition for General Speech In-Context Learning | Hao Yen et.al. | 2409.19757 | translate | read | null |
| 2024-09-29 | Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective | Chen Chen et.al. | 2409.19575 | translate | read | null |
| 2024-09-29 | CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought | Yexing Du et.al. | 2409.19510 | translate | read | link |
| 2024-09-27 | Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models | Xiaoxue Gao et.al. | 2409.18654 | translate | read | null |
| 2024-09-27 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 | Jiaming Zhou et.al. | 2409.18584 | translate | read | null |
| 2024-09-27 | EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis | Haoyu Wang et.al. | 2409.18512 | translate | read | null |
| 2024-09-27 | Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking | Brian Yan et.al. | 2409.18428 | translate | read | null |
| 2024-09-26 | Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey et.al. | 2409.18044 | translate | read | null |
| 2024-09-26 | Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study | Keyu An et.al. | 2409.17750 | translate | read | null |
| 2024-09-26 | Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition | Keyu An et.al. | 2409.17746 | translate | read | null |
| 2024-09-26 | Deep CLAS: Deep Contextual Listen, Attend and Spell | Shifu Xiong et.al. | 2409.17603 | translate | read | null |
| 2024-09-25 | Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion | Giuseppe Ruggiero et.al. | 2409.17387 | translate | read | null |
| 2024-09-25 | Exploring synthetic data for cross-speaker style transfer in style representation based TTS | Lucas H. Ueda et.al. | 2409.17364 | translate | read | null |
| 2024-09-25 | How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not | Francesco Verdini et.al. | 2409.17044 | translate | read | null |
| 2024-09-25 | MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events | Xiaoyu Yang et.al. | 2409.17010 | translate | read | null |
| 2024-09-25 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition | Andrés Piñeiro-Martín et.al. | 2409.16954 | translate | read | null |
| 2024-09-25 | Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling | Yuanchao Li et.al. | 2409.16937 | translate | read | link |
| 2024-09-25 | Speech Recognition Rescoring with Large Speech-Text Foundation Models | Prashanth Gurunath Shivakumar et.al. | 2409.16654 | translate | read | null |
| 2024-09-24 | Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices | Leonid Velikovich et.al. | 2409.16469 | translate | read | null |
| 2024-09-24 | FastTalker: Jointly Generating Speech and Conversational Gestures from Text | Zixin Guo et.al. | 2409.16404 | translate | read | null |
| 2024-09-24 | Revisiting Acoustic Features for Robust ASR | Muhammad A. Shah et.al. | 2409.16399 | translate | read | null |
| 2024-09-24 | Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech | Yunji Chu et.al. | 2409.16203 | translate | read | null |
| 2024-09-24 | ComiCap: A VLMs pipeline for dense captioning of Comic Panels | Emanuele Vivoli et.al. | 2409.16159 | translate | read | link |
| 2024-09-24 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs | Yang Yuhang et.al. | 2409.16005 | translate | read | null |
| 2024-09-24 | Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification | Fengrun Zhang et.al. | 2409.15974 | translate | read | null |
| 2024-09-24 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM | Fengrun Zhang et.al. | 2409.15905 | translate | read | null |
| 2024-09-24 | Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization | Sotheara Leang et.al. | 2409.15882 | translate | read | null |
| 2024-09-24 | WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction | Shuai Wang et.al. | 2409.15799 | translate | read | null |
| 2024-09-24 | M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions | Shuai Wang et.al. | 2409.15782 | translate | read | null |
| 2024-09-24 | Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample | Zhiyong Chen et.al. | 2409.15742 | translate | read | null |
| 2024-09-24 | StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis | Zhiyong Chen et.al. | 2409.15741 | translate | read | null |
| 2024-09-19 | WeHelp: A Shared Autonomy System for Wheelchair Users | Abulikemu Abuduweili et.al. | 2409.12159 | translate | read | link |
| 2024-09-18 | ASR Benchmarking: Need for a More Representative Conversational Dataset | Gaurav Maheshwari et.al. | 2409.12042 | translate | read | link |
| 2024-09-18 | Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0 | Zhiyong Wang et.al. | 2409.11909 | translate | read | null |
| 2024-09-18 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper | Jiaming Zhou et.al. | 2409.11889 | translate | read | null |
| 2024-09-18 | METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation | Dinh-Viet-Toan Le et.al. | 2409.11753 | translate | read | link |
| 2024-09-19 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations | Haopeng Geng et.al. | 2409.11742 | translate | read | null |
| 2024-09-17 | Discrete Unit based Masking for Improving Disentanglement in Voice Conversion | Philip H. Lee et.al. | 2409.11560 | translate | read | null |
| 2024-09-17 | Chain-of-Thought Prompting for Speech Translation | Ke Hu et.al. | 2409.11538 | translate | read | null |
| 2024-09-17 | M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses | Yufeng Yang et.al. | 2409.11494 | translate | read | null |
| 2024-09-17 | Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models | Jiahao Qin et.al. | 2409.11263 | translate | read | null |
| 2024-09-17 | WER We Stand: Benchmarking Urdu ASR Models | Samee Arif et.al. | 2409.11252 | translate | read | null |
| 2024-09-17 | Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text | Hongfei Xue et.al. | 2409.11214 | translate | read | null |
| 2024-09-17 | Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora | Francesco Nespoli et.al. | 2409.11107 | translate | read | null |
| 2024-09-17 | Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation | Gerard I. Gállego et.al. | 2409.11003 | translate | read | null |
| 2024-09-17 | Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models | Potsawee Manakul et.al. | 2409.10999 | translate | read | null |
| 2024-09-17 | Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data | Jing Xu et.al. | 2409.10969 | translate | read | null |
| 2024-09-17 | Speech Recognition for Analysis of Police Radio Communication | Tejes Srivastava et.al. | 2409.10858 | translate | read | null |
| 2024-09-17 | PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing | Phillip Long et.al. | 2409.10831 | translate | read | null |
| 2024-09-16 | Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels | Zakaria Aldeneh et.al. | 2409.10791 | translate | read | null |
| 2024-09-16 | An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems | Hitesh Tulsiani et.al. | 2409.10515 | translate | read | null |
| 2024-09-16 | Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages | Ming-Hao Hsu et.al. | 2409.10429 | translate | read | null |
| 2024-09-16 | Voice control interface for surgical robot assistants | Ana Davila et.al. | 2409.10225 | translate | read | null |
| 2024-09-16 | Augmenting Automatic Speech Recognition Models with Disfluency Detection | Robin Amann et.al. | 2409.10177 | translate | read | null |
| 2024-09-16 | Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Xiaoxue Gao et.al. | 2409.10157 | translate | read | null |
| 2024-09-16 | Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge | Shuiyun Liu et.al. | 2409.10076 | translate | read | null |
| 2024-09-16 | Speaker Contrastive Learning for Source Speaker Tracing | Qing Wang et.al. | 2409.10072 | translate | read | null |
| 2024-09-16 | StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li et.al. | 2409.10058 | translate | read | null |
| 2024-09-16 | A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models | Ryandhimas E. Zezario et.al. | 2409.09914 | translate | read | null |
| 2024-09-15 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition | Chao-Han Huck Yang et.al. | 2409.09785 | translate | read | null |
| 2024-09-13 | Clean Label Attacks against SLU Systems | Henry Li Xinyuan et.al. | 2409.08985 | translate | read | null |
| 2024-09-13 | HLTCOE JHU Submission to the Voice Privacy Challenge 2024 | Henry Li Xinyuan et.al. | 2409.08913 | translate | read | null |
| 2024-09-13 | Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages | Yao-Fei Cheng et.al. | 2409.08872 | translate | read | null |
| 2024-09-13 | Exploring SSL Discrete Tokens for Multilingual ASR | Mingyu Cui et.al. | 2409.08805 | translate | read | null |
| 2024-09-13 | Text-To-Speech Synthesis In The Wild | Jee-weon Jung et.al. | 2409.08711 | translate | read | null |
| 2024-09-13 | NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training | Minglun Han et.al. | 2409.08680 | translate | read | null |
| 2024-09-13 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation | Shaojun Li et.al. | 2409.08597 | translate | read | null |
| 2024-09-13 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions | Lingwei Meng et.al. | 2409.08596 | translate | read | link |
| 2024-09-13 | LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling | Yubo Huang et.al. | 2409.08583 | translate | read | null |
| 2024-09-13 | LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study | Mahta Fetrat Qharabagh et.al. | 2409.08554 | translate | read | null |
| 2024-09-12 | Hierarchical Symbolic Pop Music Generation with Graph Neural Networks | Wen Qing Lim et.al. | 2409.08155 | translate | read | null |
| 2024-09-12 | Faster Speech-LLaMA Inference with Multi-token Prediction | Desh Raj et.al. | 2409.08148 | translate | read | null |
| 2024-09-12 | WhisperNER: Unified Open Named Entity and Speech Recognition | Gil Ayache et.al. | 2409.08107 | translate | read | null |
| 2024-09-12 | The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language | Michael Ong et.al. | 2409.08103 | translate | read | null |
| 2024-09-12 | Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations | Wangjin Zhou et.al. | 2409.08039 | translate | read | null |
| 2024-09-12 | Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction | Xiangyu Zhang et.al. | 2409.07969 | translate | read | null |
| 2024-09-12 | Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models | Nikolai L. Kühne et.al. | 2409.07936 | translate | read | null |
| 2024-09-12 | Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning | Elizabeth Wilson et.al. | 2409.07918 | translate | read | null |
| 2024-09-12 | Bridging Paintings and Music – Exploring Emotion based Music Generation through Paintings | Tanisha Hisariya et.al. | 2409.07827 | translate | read | null |
| 2024-09-12 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model | Zhiyuan Tang et.al. | 2409.07790 | translate | read | null |
| 2024-09-11 | VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos | Yan-Bo Lin et.al. | 2409.07450 | translate | read | null |
| 2024-09-11 | D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack | Hong-Hanh Nguyen-Le et.al. | 2409.07390 | translate | read | null |
| 2024-09-11 | Rethinking Mamba in Speech Processing by Self-Supervised Models | Xiangyu Zhang et.al. | 2409.07273 | translate | read | null |
| 2024-09-11 | ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages | Mahta Fetrat Qharabagh et.al. | 2409.07259 | translate | read | null |
| 2024-09-11 | Enhancing CTC-Based Visual Speech Recognition | Hendrik Laux et.al. | 2409.07210 | translate | read | null |
| 2024-09-11 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition | Titouan Parcollet et.al. | 2409.07165 | translate | read | null |
| 2024-09-11 | The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction | Wen-Chin Huang et.al. | 2409.07001 | translate | read | null |
| 2024-09-10 | An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition | Yi-Cheng Wang et.al. | 2409.06468 | translate | read | null |
| 2024-09-10 | What happens to diffusion model likelihood when your model is conditional? | Mattias Cross et.al. | 2409.06364 | translate | read | null |
| 2024-09-10 | VoiceWukong: Benchmarking Deepfake Voice Detection | Ziwei Yan et.al. | 2409.06348 | translate | read | null |
| 2024-09-10 | Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches | Chang Zeng et.al. | 2409.06327 | translate | read | null |
| 2024-09-10 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking | Jihyun Lee et.al. | 2409.06263 | translate | read | null |
| 2024-09-10 | RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion | Wei Chen et.al. | 2409.06237 | translate | read | null |
| 2024-09-10 | Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings | Sakshi Deo Shukla et.al. | 2409.06222 | translate | read | null |
| 2024-09-10 | Multi-Source Music Generation with Latent Diffusion | Zhongweiyang Xu et.al. | 2409.06190 | translate | read | link |
| 2024-09-10 | VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion | Kyungguen Byun et.al. | 2409.06126 | translate | read | null |
| 2024-09-09 | Retrieval Augmented Correction of Named Entity Speech Recognition Errors | Ernest Pusateri et.al. | 2409.06062 | translate | read | null |
| 2024-09-09 | PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification | Massa Baali et.al. | 2409.05799 | translate | read | null |
| 2024-09-09 | Consensus-based Distributed Quantum Kernel Learning for Speech Recognition | Kuan-Cheng Chen et.al. | 2409.05770 | translate | read | null |
| 2024-09-09 | A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR | Giovanni Morrone et.al. | 2409.05750 | translate | read | null |
| 2024-09-09 | AS-Speech: Adaptive Style For Speech Synthesis | Zhipeng Li et.al. | 2409.05730 | translate | read | null |
| 2024-09-09 | Evaluation of real-time transcriptions using end-to-end ASR models | Carlos Arriaga et.al. | 2409.05674 | translate | read | null |
| 2024-09-09 | Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation | Nithin Rao Koluguri et.al. | 2409.05601 | translate | read | null |
| 2024-09-09 | An investigation of modularity for noise robustness in conformer-based ASR | Louise Coppieters de Gibson et.al. | 2409.05589 | translate | read | null |
| 2024-09-09 | NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge | Naoyuki Kamo et.al. | 2409.05554 | translate | read | null |
| 2024-09-09 | Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge | Hongfei Xue et.al. | 2409.05430 | translate | read | null |
| 2024-09-08 | Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection | Theophile Stourbe et.al. | 2409.05032 | translate | read | null |
| 2024-09-05 | Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization | Zexin Cai et.al. | 2409.03655 | translate | read | null |
| 2024-09-05 | DiffEVC: Any-to-Any Emotion Voice Conversion with Expressive Guidance | Hsing-Hang Chou et.al. | 2409.03636 | translate | read | null |
| 2024-09-05 | Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder | Yuying Xie et.al. | 2409.03520 | translate | read | null |
| 2024-09-04 | Probing self-attention in self-supervised speech models for cross-linguistic differences | Sai Gopinath et.al. | 2409.03115 | translate | read | null |
| 2024-09-04 | Quantification of stylistic differences in human- and ASR-produced transcripts of African American English | Annika Heuser et.al. | 2409.03059 | translate | read | null |
| 2024-09-04 | SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints | Haonan Chen et.al. | 2409.03055 | translate | read | null |
| 2024-09-04 | Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model | Tornike Karchkhadze et.al. | 2409.02845 | translate | read | null |
| 2024-09-04 | Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models | Jakob Poncelet et.al. | 2409.02565 | translate | read | null |
| 2024-09-04 | Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm | Sidonie Foulon et.al. | 2409.02477 | translate | read | null |
| 2024-09-04 | Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP | Yisi Liu et.al. | 2409.02451 | translate | read | null |
| 2024-09-04 | What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations | Kavya Manohar et.al. | 2409.02449 | translate | read | null |
| 2024-09-04 | MusicMamba: A Dual-Feature Modeling Approach for Generating Chinese Traditional Music with Modal Precision | Jiatao Chen et.al. | 2409.02421 | translate | read | link |
| 2024-09-03 | FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation | Takuhiro Kaneko et.al. | 2409.02245 | translate | read | null |
| 2024-09-03 | Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR | Xugang Lu et.al. | 2409.02239 | translate | read | null |
| 2024-09-03 | Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model | Hukai Huang et.al. | 2409.02050 | translate | read | null |
| 2024-09-03 | The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge | Shutong Niu et.al. | 2409.02041 | translate | read | null |