Audio Processing - 2024-09 | Paper Arxiv Daily

Audio Processing - 2024-09

Publish Date	Title	Authors	PDF	Translate	Read	Code
2024-09-30	Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding	Takafumi Moriya et.al.	2409.20313	translate	read	null
2024-09-30	Alignment-Free Training for Transducer-based Multi-Talker ASR	Takafumi Moriya et.al.	2409.20301	translate	read	null
2024-09-30	AfriHuBERT: A self-supervised speech representation model for African languages	Jesujoba O. Alabi et.al.	2409.20201	translate	read	null
2024-09-30	Melody Is All You Need For Music Generation	Shaopeng Wei et.al.	2409.20196	translate	read	link
2024-09-30	Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems	Oswald Zink et.al.	2409.19990	translate	read	null
2024-09-30	HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models	Bingshen Mu et.al.	2409.19878	translate	read	null
2024-09-29	Fine-Tuning Automatic Speech Recognition for People with Parkinson’s: An Effective Strategy for Enhancing Speech Technology Accessibility	Xiuwen Zheng et.al.	2409.19818	translate	read	null
2024-09-29	Efficient Long-Form Speech Recognition for General Speech In-Context Learning	Hao Yen et.al.	2409.19757	translate	read	null
2024-09-29	Quantitative Analysis of Audio-Visual Tasks: An Information-Theoretic Perspective	Chen Chen et.al.	2409.19575	translate	read	null
2024-09-29	CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought	Yexing Du et.al.	2409.19510	translate	read	link
2024-09-27	Speech-Mamba: Long-Context Speech Recognition with Selective State Spaces Models	Xiaoxue Gao et.al.	2409.18654	translate	read	null
2024-09-27	ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5	Jiaming Zhou et.al.	2409.18584	translate	read	null
2024-09-27	EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis	Haoyu Wang et.al.	2409.18512	translate	read	null
2024-09-27	Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking	Brian Yan et.al.	2409.18428	translate	read	null
2024-09-26	Unveiling the Role of Pretraining in Direct Speech Translation	Belen Alastruey et.al.	2409.18044	translate	read	null
2024-09-26	Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study	Keyu An et.al.	2409.17750	translate	read	null
2024-09-26	Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition	Keyu An et.al.	2409.17746	translate	read	null
2024-09-26	Deep CLAS: Deep Contextual Listen, Attend and Spell	Shifu Xiong et.al.	2409.17603	translate	read	null
2024-09-25	Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion	Giuseppe Ruggiero et.al.	2409.17387	translate	read	null
2024-09-25	Exploring synthetic data for cross-speaker style transfer in style representation based TTS	Lucas H. Ueda et.al.	2409.17364	translate	read	null
2024-09-25	How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not	Francesco Verdini et.al.	2409.17044	translate	read	null
2024-09-25	MT2KD: Towards A General-Purpose Encoder for Speech, Speaker, and Audio Events	Xiaoyu Yang et.al.	2409.17010	translate	read	null
2024-09-25	Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition	Andrés Piñeiro-Martín et.al.	2409.16954	translate	read	null
2024-09-25	Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling	Yuanchao Li et.al.	2409.16937	translate	read	link
2024-09-25	Speech Recognition Rescoring with Large Speech-Text Foundation Models	Prashanth Gurunath Shivakumar et.al.	2409.16654	translate	read	null
2024-09-24	Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices	Leonid Velikovich et.al.	2409.16469	translate	read	null
2024-09-24	FastTalker: Jointly Generating Speech and Conversational Gestures from Text	Zixin Guo et.al.	2409.16404	translate	read	null
2024-09-24	Revisiting Acoustic Features for Robust ASR	Muhammad A. Shah et.al.	2409.16399	translate	read	null
2024-09-24	Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech	Yunji Chu et.al.	2409.16203	translate	read	null
2024-09-24	ComiCap: A VLMs pipeline for dense captioning of Comic Panels	Emanuele Vivoli et.al.	2409.16159	translate	read	link
2024-09-24	Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs	Yang Yuhang et.al.	2409.16005	translate	read	null
2024-09-24	Disentangling Age and Identity with a Mutual Information Minimization Approach for Cross-Age Speaker Verification	Fengrun Zhang et.al.	2409.15974	translate	read	null
2024-09-24	Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM	Fengrun Zhang et.al.	2409.15905	translate	read	null
2024-09-24	Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization	Sotheara Leang et.al.	2409.15882	translate	read	null
2024-09-24	WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction	Shuai Wang et.al.	2409.15799	translate	read	null
2024-09-24	M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions	Shuai Wang et.al.	2409.15782	translate	read	null
2024-09-24	Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample	Zhiyong Chen et.al.	2409.15742	translate	read	null
2024-09-24	StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis	Zhiyong Chen et.al.	2409.15741	translate	read	null
2024-09-19	WeHelp: A Shared Autonomy System for Wheelchair Users	Abulikemu Abuduweili et.al.	2409.12159	translate	read	link
2024-09-18	ASR Benchmarking: Need for a More Representative Conversational Dataset	Gaurav Maheshwari et.al.	2409.12042	translate	read	link
2024-09-18	Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0	Zhiyong Wang et.al.	2409.11909	translate	read	null
2024-09-18	M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper	Jiaming Zhou et.al.	2409.11889	translate	read	null
2024-09-18	METEOR: Melody-aware Texture-controllable Symbolic Orchestral Music Generation	Dinh-Viet-Toan Le et.al.	2409.11753	translate	read	link
2024-09-19	Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations	Haopeng Geng et.al.	2409.11742	translate	read	null
2024-09-17	Discrete Unit based Masking for Improving Disentanglement in Voice Conversion	Philip H. Lee et.al.	2409.11560	translate	read	null
2024-09-17	Chain-of-Thought Prompting for Speech Translation	Ke Hu et.al.	2409.11538	translate	read	null
2024-09-17	M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses	Yufeng Yang et.al.	2409.11494	translate	read	null
2024-09-17	Bio-Inspired Mamba: Temporal Locality and Bioplausible Learning in Selective State Space Models	Jiahao Qin et.al.	2409.11263	translate	read	null
2024-09-17	WER We Stand: Benchmarking Urdu ASR Models	Samee Arif et.al.	2409.11252	translate	read	null
2024-09-17	Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text	Hongfei Xue et.al.	2409.11214	translate	read	null
2024-09-17	Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora	Francesco Nespoli et.al.	2409.11107	translate	read	null
2024-09-17	Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation	Gerard I. Gállego et.al.	2409.11003	translate	read	null
2024-09-17	Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models	Potsawee Manakul et.al.	2409.10999	translate	read	null
2024-09-17	Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data	Jing Xu et.al.	2409.10969	translate	read	null
2024-09-17	Speech Recognition for Analysis of Police Radio Communication	Tejes Srivastava et.al.	2409.10858	translate	read	null
2024-09-17	PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing	Phillip Long et.al.	2409.10831	translate	read	null
2024-09-16	Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels	Zakaria Aldeneh et.al.	2409.10791	translate	read	null
2024-09-16	An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems	Hitesh Tulsiani et.al.	2409.10515	translate	read	null
2024-09-16	Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages	Ming-Hao Hsu et.al.	2409.10429	translate	read	null
2024-09-16	Voice control interface for surgical robot assistants	Ana Davila et.al.	2409.10225	translate	read	null
2024-09-16	Augmenting Automatic Speech Recognition Models with Disfluency Detection	Robin Amann et.al.	2409.10177	translate	read	null
2024-09-16	Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization	Xiaoxue Gao et.al.	2409.10157	translate	read	null
2024-09-16	Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge	Shuiyun Liu et.al.	2409.10076	translate	read	null
2024-09-16	Speaker Contrastive Learning for Source Speaker Tracing	Qing Wang et.al.	2409.10072	translate	read	null
2024-09-16	StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion	Yinghao Aaron Li et.al.	2409.10058	translate	read	null
2024-09-16	A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models	Ryandhimas E. Zezario et.al.	2409.09914	translate	read	null
2024-09-15	Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition	Chao-Han Huck Yang et.al.	2409.09785	translate	read	null
2024-09-13	Clean Label Attacks against SLU Systems	Henry Li Xinyuan et.al.	2409.08985	translate	read	null
2024-09-13	HLTCOE JHU Submission to the Voice Privacy Challenge 2024	Henry Li Xinyuan et.al.	2409.08913	translate	read	null
2024-09-13	Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages	Yao-Fei Cheng et.al.	2409.08872	translate	read	null
2024-09-13	Exploring SSL Discrete Tokens for Multilingual ASR	Mingyu Cui et.al.	2409.08805	translate	read	null
2024-09-13	Text-To-Speech Synthesis In The Wild	Jee-weon Jung et.al.	2409.08711	translate	read	null
2024-09-13	NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training	Minglun Han et.al.	2409.08680	translate	read	null
2024-09-13	LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation	Shaojun Li et.al.	2409.08597	translate	read	null
2024-09-13	Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions	Lingwei Meng et.al.	2409.08596	translate	read	link
2024-09-13	LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling	Yubo Huang et.al.	2409.08583	translate	read	null
2024-09-13	LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study	Mahta Fetrat Qharabagh et.al.	2409.08554	translate	read	null
2024-09-12	Hierarchical Symbolic Pop Music Generation with Graph Neural Networks	Wen Qing Lim et.al.	2409.08155	translate	read	null
2024-09-12	Faster Speech-LLaMA Inference with Multi-token Prediction	Desh Raj et.al.	2409.08148	translate	read	null
2024-09-12	WhisperNER: Unified Open Named Entity and Speech Recognition	Gil Ayache et.al.	2409.08107	translate	read	null
2024-09-12	The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language	Michael Ong et.al.	2409.08103	translate	read	null
2024-09-12	Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations	Wangjin Zhou et.al.	2409.08039	translate	read	null
2024-09-12	Auto-Landmark: Acoustic Landmark Dataset and Open-Source Toolkit for Landmark Extraction	Xiangyu Zhang et.al.	2409.07969	translate	read	null
2024-09-12	Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models	Nikolai L. Kühne et.al.	2409.07936	translate	read	null
2024-09-12	Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning	Elizabeth Wilson et.al.	2409.07918	translate	read	null
2024-09-12	Bridging Paintings and Music – Exploring Emotion based Music Generation through Paintings	Tanisha Hisariya et.al.	2409.07827	translate	read	null
2024-09-12	Full-text Error Correction for Chinese Speech Recognition with Large Language Model	Zhiyuan Tang et.al.	2409.07790	translate	read	null
2024-09-11	VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos	Yan-Bo Lin et.al.	2409.07450	translate	read	null
2024-09-11	D-CAPTCHA++: A Study of Resilience of Deepfake CAPTCHA under Transferable Imperceptible Adversarial Attack	Hong-Hanh Nguyen-Le et.al.	2409.07390	translate	read	null
2024-09-11	Rethinking Mamba in Speech Processing by Self-Supervised Models	Xiangyu Zhang et.al.	2409.07273	translate	read	null
2024-09-11	ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages	Mahta Fetrat Qharabagh et.al.	2409.07259	translate	read	null
2024-09-11	Enhancing CTC-Based Visual Speech Recognition	Hendrik Laux et.al.	2409.07210	translate	read	null
2024-09-11	Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition	Titouan Parcollet et.al.	2409.07165	translate	read	null
2024-09-11	The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction	Wen-Chin Huang et.al.	2409.07001	translate	read	null
2024-09-10	An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition	Yi-Cheng Wang et.al.	2409.06468	translate	read	null
2024-09-10	What happens to diffusion model likelihood when your model is conditional?	Mattias Cross et.al.	2409.06364	translate	read	null
2024-09-10	VoiceWukong: Benchmarking Deepfake Voice Detection	Ziwei Yan et.al.	2409.06348	translate	read	null
2024-09-10	Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches	Chang Zeng et.al.	2409.06327	translate	read	null
2024-09-10	Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking	Jihyun Lee et.al.	2409.06263	translate	read	null
2024-09-10	RobustSVC: HuBERT-based Melody Extractor and Adversarial Learning for Robust Singing Voice Conversion	Wei Chen et.al.	2409.06237	translate	read	null
2024-09-10	Advancing Topic Segmentation of Broadcasted Speech with Multilingual Semantic Embeddings	Sakshi Deo Shukla et.al.	2409.06222	translate	read	null
2024-09-10	Multi-Source Music Generation with Latent Diffusion	Zhongweiyang Xu et.al.	2409.06190	translate	read	link
2024-09-10	VC-ENHANCE: Speech Restoration with Integrated Noise Suppression and Voice Conversion	Kyungguen Byun et.al.	2409.06126	translate	read	null
2024-09-09	Retrieval Augmented Correction of Named Entity Speech Recognition Errors	Ernest Pusateri et.al.	2409.06062	translate	read	null
2024-09-09	PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification	Massa Baali et.al.	2409.05799	translate	read	null
2024-09-09	Consensus-based Distributed Quantum Kernel Learning for Speech Recognition	Kuan-Cheng Chen et.al.	2409.05770	translate	read	null
2024-09-09	A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR	Giovanni Morrone et.al.	2409.05750	translate	read	null
2024-09-09	AS-Speech: Adaptive Style For Speech Synthesis	Zhipeng Li et.al.	2409.05730	translate	read	null
2024-09-09	Evaluation of real-time transcriptions using end-to-end ASR models	Carlos Arriaga et.al.	2409.05674	translate	read	null
2024-09-09	Longer is (Not Necessarily) Stronger: Punctuated Long-Sequence Training for Enhanced Speech Recognition and Translation	Nithin Rao Koluguri et.al.	2409.05601	translate	read	null
2024-09-09	An investigation of modularity for noise robustness in conformer-based ASR	Louise Coppieters de Gibson et.al.	2409.05589	translate	read	null
2024-09-09	NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge	Naoyuki Kamo et.al.	2409.05554	translate	read	null
2024-09-09	Findings of the 2024 Mandarin Stuttering Event Detection and Automatic Speech Recognition Challenge	Hongfei Xue et.al.	2409.05430	translate	read	null
2024-09-08	Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection	Theophile Stourbe et.al.	2409.05032	translate	read	null
2024-09-05	Privacy versus Emotion Preservation Trade-offs in Emotion-Preserving Speaker Anonymization	Zexin Cai et.al.	2409.03655	translate	read	null
2024-09-05	DiffEVC: Any-to-Any Emotion Voice Conversion with Expressive Guidance	Hsing-Hang Chou et.al.	2409.03636	translate	read	null
2024-09-05	Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder	Yuying Xie et.al.	2409.03520	translate	read	null
2024-09-04	Probing self-attention in self-supervised speech models for cross-linguistic differences	Sai Gopinath et.al.	2409.03115	translate	read	null
2024-09-04	Quantification of stylistic differences in human- and ASR-produced transcripts of African American English	Annika Heuser et.al.	2409.03059	translate	read	null
2024-09-04	SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints	Haonan Chen et.al.	2409.03055	translate	read	null
2024-09-04	Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model	Tornike Karchkhadze et.al.	2409.02845	translate	read	null
2024-09-04	Efficient Extraction of Noise-Robust Discrete Units from Self-Supervised Speech Models	Jakob Poncelet et.al.	2409.02565	translate	read	null
2024-09-04	Parameter estimation of hidden Markov models: comparison of EM and quasi-Newton methods with a new hybrid algorithm	Sidonie Foulon et.al.	2409.02477	translate	read	null
2024-09-04	Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP	Yisi Liu et.al.	2409.02451	translate	read	null
2024-09-04	What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations	Kavya Manohar et.al.	2409.02449	translate	read	null
2024-09-04	MusicMamba: A Dual-Feature Modeling Approach for Generating Chinese Traditional Music with Modal Precision	Jiatao Chen et.al.	2409.02421	translate	read	link
2024-09-03	FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation	Takuhiro Kaneko et.al.	2409.02245	translate	read	null
2024-09-03	Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR	Xugang Lu et.al.	2409.02239	translate	read	null
2024-09-03	Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model	Hukai Huang et.al.	2409.02050	translate	read	null
2024-09-03	The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge	Shutong Niu et.al.	2409.02041	translate	read	null