Audio Processing - 2025-01

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-01-31 | Language Bias in Self-Supervised Learning For Automatic Speech Recognition | Edward Storey et.al. | 2501.19321 | translate | read | null |
| 2025-01-30 | AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment | Yuqin Cao et.al. | 2501.18314 | translate | read | null |
| 2025-01-29 | Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling | Theo Lepage et.al. | 2501.17772 | translate | read | null |
| 2025-01-29 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition | Zhengdong Yang et.al. | 2501.17615 | translate | read | null |
| 2025-01-29 | VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching | Ha-Yeong Choi et.al. | 2501.17612 | translate | read | null |
| 2025-01-28 | Compact Neural TTS Voices for Accessibility | Kunal Jain et.al. | 2501.17332 | translate | read | null |
| 2025-01-28 | RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains | Shady Nasrat et.al. | 2501.16899 | translate | read | link |
| 2025-01-28 | AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals | Dongliang Zhou et.al. | 2501.16780 | translate | read | null |
| 2025-01-28 | SCDiar: a streaming diarization system based on speaker change detection and speech recognition | Naijun Zheng et.al. | 2501.16641 | translate | read | null |
| 2025-01-27 | UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification | Mufan Sang et.al. | 2501.16542 | translate | read | null |
| 2025-01-27 | Optimized Self-supervised Training with BEST-RQ for Speech Recognition | Ilja Baumann et.al. | 2501.16131 | translate | read | null |
| 2025-01-27 | Classification Error Bound for Low Bayes Error Conditions in Machine Learning | Zijian Yang et.al. | 2501.15977 | translate | read | null |
| 2025-01-26 | Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning | Qian Yang et.al. | 2501.15613 | translate | read | null |
| 2025-01-26 | End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario | Mohsen Ghane et.al. | 2501.15466 | translate | read | null |
| 2025-01-26 | Overview of the Amphion Toolkit (v0.2) | Jiaqi Li et.al. | 2501.15442 | translate | read | link |
| 2025-01-25 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? | Ayo Adedeji et.al. | 2501.15310 | translate | read | null |
| 2025-01-25 | Music Generation using Human-In-The-Loop Reinforcement Learning | Aju Ani Justus et.al. | 2501.15304 | translate | read | null |
| 2025-01-25 | Speech Translation Refinement using Large Language Models | Huaixia Dou et.al. | 2501.15090 | translate | read | link |
| 2025-01-25 | Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition | Satwinder Singh et.al. | 2501.14994 | translate | read | null |
| 2025-01-27 | Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning | Jisi Zhang et.al. | 2501.14680 | translate | read | null |
| 2025-01-24 | FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration | Kai-Tuo Xu et.al. | 2501.14350 | translate | read | link |
| 2025-01-24 | Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models | Tianrui Wang et.al. | 2501.14273 | translate | read | null |
| 2025-01-24 | Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation | Wen Huang et.al. | 2501.14240 | translate | read | null |
| 2025-01-24 | LoCoML: A Framework for Real-World ML Inference Pipelines | Kritin Maddireddy et.al. | 2501.14165 | translate | read | null |
| 2025-01-23 | Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction | Ali Farshian Abbasi et.al. | 2501.13996 | translate | read | null |
| 2025-01-23 | Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing | Hao Zhang et.al. | 2501.13831 | translate | read | null |
| 2025-01-23 | Learning-based A Posteriori Speech Presence Probability Estimation and Applications | Shuai Tao et.al. | 2501.13642 | translate | read | null |
| 2025-01-23 | DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition | Qijie Shao et.al. | 2501.13497 | translate | read | null |
| 2025-01-23 | Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement | Jae-Sung Bae et.al. | 2501.13372 | translate | read | null |
| 2025-01-23 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia | Xuelong Geng et.al. | 2501.13306 | translate | read | link |
| 2025-01-22 | Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | Yan Ru Pei et.al. | 2501.13230 | translate | read | link |
| 2025-01-22 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction | Moreno La Quatra et.al. | 2501.12979 | translate | read | link |
| 2025-01-21 | A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Minh Tran et.al. | 2501.12501 | translate | read | null |
| 2025-01-21 | DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset | Yupei Li et.al. | 2501.12122 | translate | read | null |
| 2025-01-20 | Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio | Mateusz Barański et.al. | 2501.11378 | translate | read | null |
| 2025-01-20 | SEF-PNet: Speaker Encoder-Free Personalized Speech Enhancement with Local and Global Contexts Aggregation | Ziling Huang et.al. | 2501.11274 | translate | read | null |
| 2025-01-19 | Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets | Or Haim Anidjar et.al. | 2501.11065 | translate | read | null |
| 2025-01-18 | A Benchmark of French ASR Systems Based on Error Severity | Antoine Tholly et.al. | 2501.10879 | translate | read | null |
| 2025-01-18 | GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems | Amin Robatian et.al. | 2501.10734 | translate | read | null |
| 2025-01-17 | Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR | Karl El Hajal et.al. | 2501.10256 | translate | read | null |
| 2025-01-17 | Automatic Speech Recognition for Sanskrit with Transfer Learning | Bidit Sadhukhan et.al. | 2501.10024 | translate | read | null |
| 2025-01-17 | GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions | Heda Zuo et.al. | 2501.09972 | translate | read | null |
| 2025-01-21 | PIER: A Novel Metric for Evaluating What Matters in Code-Switching | Enes Yavuz Ugan et.al. | 2501.09512 | translate | read | link |
| 2025-01-16 | Teaching Wav2Vec2 the Language of the Brain | Tobias Fiedler et.al. | 2501.09459 | translate | read | link |
| 2025-01-16 | Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition | Takaaki Hori et.al. | 2501.09258 | translate | read | null |
| 2025-01-17 | persoDA: Personalized Data Augmentation for Personalized ASR | Pablo Peso Parada et.al. | 2501.09113 | translate | read | null |
| 2025-01-15 | A Non-autoregressive Model for Joint STT and TTS | Vishal Sunder et.al. | 2501.09104 | translate | read | null |
| 2025-01-13 | Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations | Mareike Buhl et.al. | 2501.08921 | translate | read | null |
| 2025-01-15 | XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | Sida Tian et.al. | 2501.08809 | translate | read | null |
| 2025-01-15 | Speech Synthesis along Perceptual Voice Quality Dimensions | Frederik Rautenberg et.al. | 2501.08791 | translate | read | null |
| 2025-01-15 | Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification | Li Zhang et.al. | 2501.08691 | translate | read | null |
| 2025-01-15 | Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom | Melissa Torgbi et.al. | 2501.08502 | translate | read | null |
| 2025-01-14 | Selective Attention Merging for low resource tasks: A case study of Child ASR | Natarajan Balaji Shankar et.al. | 2501.08468 | translate | read | link |
| 2025-01-14 | Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications | Dimme de Groot et.al. | 2501.08104 | translate | read | null |
| 2025-01-13 | Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech | Bruno Ferenc Šegedin et.al. | 2501.07726 | translate | read | null |
| 2025-01-13 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding | Jiliang Hu et.al. | 2501.07329 | translate | read | null |
| 2025-01-13 | Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model | Ziyang Ma et.al. | 2501.07246 | translate | read | null |
| 2025-01-13 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR | The Chuong Chu et.al. | 2501.07102 | translate | read | null |
| 2025-01-11 | Discrete Speech Unit Extraction via Independent Component Analysis | Tomohiko Nakamura et.al. | 2501.06562 | translate | read | link |
| 2025-01-11 | A Survey on Spoken Italian Datasets and Corpora | Marco Giordano et.al. | 2501.06557 | translate | read | null |
| 2025-01-11 | Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives | Christiaan Jacobs et.al. | 2501.06478 | translate | read | null |
| 2025-01-11 | Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Rui Liu et.al. | 2501.06467 | translate | read | null |
| 2025-01-10 | TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Vladimir Bataev et.al. | 2501.06320 | translate | read | null |
| 2025-01-10 | Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI | Yuya Asano et.al. | 2501.06129 | translate | read | null |
| 2025-01-10 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Fabian David Schmidt et.al. | 2501.06117 | translate | read | link |
| 2025-01-10 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Shucong Zhang et.al. | 2501.06051 | translate | read | null |
| 2025-01-10 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing | Eklavya Sarkar et.al. | 2501.05987 | translate | read | link |
| 2025-01-10 | Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Kishor Kayyar Lakshminarayana et.al. | 2501.05976 | translate | read | null |
| 2025-01-10 | Universal-2-TF: Robust All-Neural Text Formatting for ASR | Yash Khare et.al. | 2501.05948 | translate | read | null |
| 2025-01-10 | ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification | Yi Ma et.al. | 2501.05729 | translate | read | link |
| 2025-01-09 | FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion | Alef Iury Siqueira Ferreira et.al. | 2501.05586 | translate | read | link |
| 2025-01-09 | Probing Speaker-specific Features in Speaker Representations | Aemon Yat Fei Chiu et.al. | 2501.05310 | translate | read | null |
| 2025-01-09 | DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification | Qing Wang et.al. | 2501.05127 | translate | read | null |
| 2025-01-09 | JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis | Jun-Hyeok Cha et.al. | 2501.04904 | translate | read | null |
| 2025-01-08 | FleSpeech: Flexibly Controllable Speech Generation with Various Prompts | Hanzhao Li et.al. | 2501.04644 | translate | read | null |
| 2025-01-09 | OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis | Run Luo et.al. | 2501.04561 | translate | read | null |
| 2025-01-09 | Right Label Context in End-to-End Training of Time-Synchronous ASR Models | Tina Raissi et.al. | 2501.04521 | translate | read | null |
| 2025-01-08 | PolInterviews – A Dataset of German Politician Public Broadcast Interviews | Lukas Birkenmaier et.al. | 2501.04484 | translate | read | null |
| 2025-01-08 | ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training | Xinfa Zhu et.al. | 2501.04416 | translate | read | null |
| 2025-01-08 | Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition | Huimeng Wang et.al. | 2501.04379 | translate | read | null |
| 2025-01-08 | DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions | Weidong Chen et.al. | 2501.04256 | translate | read | null |
| 2025-01-08 | LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition | Bowen Hao et.al. | 2501.04204 | translate | read | null |
| 2025-01-07 | Spectral-Aware Low-Rank Adaptation for Speaker Verification | Zhe Li et.al. | 2501.03829 | translate | read | link |
| 2025-01-07 | NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity | Owais Mujtaba Khanday et.al. | 2501.03757 | translate | read | null |
| 2025-01-07 | Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection | Bang Zeng et.al. | 2501.03612 | translate | read | null |
| 2025-01-07 | Towards a Generalizable Speech Marker for Parkinson’s Disease Diagnosis | Maksim Siniukov et.al. | 2501.03581 | translate | read | null |
| 2025-01-07 | Deep Learning for Pathological Speech: A Survey | Shakeel A. Sheikh et.al. | 2501.03536 | translate | read | null |
| 2025-01-02 | FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Tian-Hao Zhang et.al. | 2501.03181 | translate | read | null |
| 2025-01-06 | SYKI-SVC: Advancing Singing Voice Conversion with Post-Processing Innovations and an Open-Source Professional Testset | Yiquan Zhou et.al. | 2501.02953 | translate | read | null |
| 2025-01-07 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | Syed Abdul Gaffar Shakhadri et.al. | 2501.02832 | translate | read | null |
| 2025-01-05 | Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module | Zhongjian Cui et.al. | 2501.02452 | translate | read | null |
| 2025-01-03 | Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer | Vishal Sunder et.al. | 2501.01936 | translate | read | null |
| 2025-01-03 | CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation | Ziqi Liang et.al. | 2501.01861 | translate | read | null |
| 2025-01-03 | MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling | Simon Rouard et.al. | 2501.01757 | translate | read | null |
| 2025-01-03 | Controlling your Attributes in Voice | Xuyuan Li et.al. | 2501.01674 | translate | read | null |
| 2025-01-03 | AdaptVC: High Quality Voice Conversion with Adaptive Learning | Jaehun Kim et.al. | 2501.01347 | translate | read | null |
| 2025-01-02 | Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models | Bin Wang et.al. | 2501.01034 | translate | read | link |
| 2025-01-01 | Incremental Dialogue Management: Survey, Discussion, and Implications for HRI | Casey Kennington et.al. | 2501.00953 | translate | read | null |
| 2025-01-01 | Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation | Shoutao Guo et.al. | 2501.00868 | translate | read | link |
| 2025-01-01 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing | Gaofeng Cheng et.al. | 2501.00804 | translate | read | null |

(<a href="../Audio_Processing.md">back to Audio Processing</a>)