Audio Processing - 2025-01 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-01-31	Language Bias in Self-Supervised Learning For Automatic Speech Recognition	Edward Storey et.al.	2501.19321	translate	read	null
2025-01-30	AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment	Yuqin Cao et.al.	2501.18314	translate	read	null
2025-01-29	Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling	Theo Lepage et.al.	2501.17772	translate	read	null
2025-01-29	Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition	Zhengdong Yang et.al.	2501.17615	translate	read	null
2025-01-29	VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching	Ha-Yeong Choi et.al.	2501.17612	translate	read	null
2025-01-28	Compact Neural TTS Voices for Accessibility	Kunal Jain et.al.	2501.17332	translate	read	null
2025-01-28	RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains	Shady Nasrat et.al.	2501.16899	translate	read	link
2025-01-28	AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals	Dongliang Zhou et.al.	2501.16780	translate	read	null
2025-01-28	SCDiar: a streaming diarization system based on speaker change detection and speech recognition	Naijun Zheng et.al.	2501.16641	translate	read	null
2025-01-27	UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification	Mufan Sang et.al.	2501.16542	translate	read	null
2025-01-27	Optimized Self-supervised Training with BEST-RQ for Speech Recognition	Ilja Baumann et.al.	2501.16131	translate	read	null
2025-01-27	Classification Error Bound for Low Bayes Error Conditions in Machine Learning	Zijian Yang et.al.	2501.15977	translate	read	null
2025-01-26	Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning	Qian Yang et.al.	2501.15613	translate	read	null
2025-01-26	End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario	Mohsen Ghane et.al.	2501.15466	translate	read	null
2025-01-26	Overview of the Amphion Toolkit (v0.2)	Jiaqi Li et.al.	2501.15442	translate	read	link
2025-01-25	The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders?	Ayo Adedeji et.al.	2501.15310	translate	read	null
2025-01-25	Music Generation using Human-In-The-Loop Reinforcement Learning	Aju Ani Justus et.al.	2501.15304	translate	read	null
2025-01-25	Speech Translation Refinement using Large Language Models	Huaixia Dou et.al.	2501.15090	translate	read	link
2025-01-25	Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition	Satwinder Singh et.al.	2501.14994	translate	read	null
2025-01-27	Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning	Jisi Zhang et.al.	2501.14680	translate	read	null
2025-01-24	FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration	Kai-Tuo Xu et.al.	2501.14350	translate	read	link
2025-01-24	Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models	Tianrui Wang et.al.	2501.14273	translate	read	null
2025-01-24	Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation	Wen Huang et.al.	2501.14240	translate	read	null
2025-01-24	LoCoML: A Framework for Real-World ML Inference Pipelines	Kritin Maddireddy et.al.	2501.14165	translate	read	null
2025-01-23	Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction	Ali Farshian Abbasi et.al.	2501.13996	translate	read	null
2025-01-23	Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing	Hao Zhang et.al.	2501.13831	translate	read	null
2025-01-23	Learning-based A Posteriori Speech Presence Probability Estimation and Applications	Shuai Tao et.al.	2501.13642	translate	read	null
2025-01-23	DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition	Qijie Shao et.al.	2501.13497	translate	read	null
2025-01-23	Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement	Jae-Sung Bae et.al.	2501.13372	translate	read	null
2025-01-23	OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia	Xuelong Geng et.al.	2501.13306	translate	read	link
2025-01-22	Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions	Yan Ru Pei et.al.	2501.13230	translate	read	link
2025-01-22	FlanEC: Exploring Flan-T5 for Post-ASR Error Correction	Moreno La Quatra et.al.	2501.12979	translate	read	link
2025-01-21	A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data	Minh Tran et.al.	2501.12501	translate	read	null
2025-01-21	DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset	Yupei Li et.al.	2501.12122	translate	read	null
2025-01-20	Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio	Mateusz Barański et.al.	2501.11378	translate	read	null
2025-01-20	SEF-PNet: Speaker Encoder-Free Personalized Speech Enhancement with Local and Global Contexts Aggregation	Ziling Huang et.al.	2501.11274	translate	read	null
2025-01-19	Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets	Or Haim Anidjar et.al.	2501.11065	translate	read	null
2025-01-18	A Benchmark of French ASR Systems Based on Error Severity	Antoine Tholly et.al.	2501.10879	translate	read	null
2025-01-18	GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems	Amin Robatian et.al.	2501.10734	translate	read	null
2025-01-17	Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR	Karl El Hajal et.al.	2501.10256	translate	read	null
2025-01-17	Automatic Speech Recognition for Sanskrit with Transfer Learning	Bidit Sadhukhan et.al.	2501.10024	translate	read	null
2025-01-17	GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions	Heda Zuo et.al.	2501.09972	translate	read	null
2025-01-21	PIER: A Novel Metric for Evaluating What Matters in Code-Switching	Enes Yavuz Ugan et.al.	2501.09512	translate	read	link
2025-01-16	Teaching Wav2Vec2 the Language of the Brain	Tobias Fiedler et.al.	2501.09459	translate	read	link
2025-01-16	Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition	Takaaki Hori et.al.	2501.09258	translate	read	null
2025-01-17	persoDA: Personalized Data Augmentation for Personalized ASR	Pablo Peso Parada et.al.	2501.09113	translate	read	null
2025-01-15	A Non-autoregressive Model for Joint STT and TTS	Vishal Sunder et.al.	2501.09104	translate	read	null
2025-01-13	Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations	Mareike Buhl et.al.	2501.08921	translate	read	null
2025-01-15	XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework	Sida Tian et.al.	2501.08809	translate	read	null
2025-01-15	Speech Synthesis along Perceptual Voice Quality Dimensions	Frederik Rautenberg et.al.	2501.08791	translate	read	null
2025-01-15	Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification	Li Zhang et.al.	2501.08691	translate	read	null
2025-01-15	Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom	Melissa Torgbi et.al.	2501.08502	translate	read	null
2025-01-14	Selective Attention Merging for low resource tasks: A case study of Child ASR	Natarajan Balaji Shankar et.al.	2501.08468	translate	read	link
2025-01-14	Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications	Dimme de Groot et.al.	2501.08104	translate	read	null
2025-01-13	Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech	Bruno Ferenc Šegedin et.al.	2501.07726	translate	read	null
2025-01-13	Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding	Jiliang Hu et.al.	2501.07329	translate	read	null
2025-01-13	Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model	Ziyang Ma et.al.	2501.07246	translate	read	null
2025-01-13	AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR	The Chuong Chu et.al.	2501.07102	translate	read	null
2025-01-11	Discrete Speech Unit Extraction via Independent Component Analysis	Tomohiko Nakamura et.al.	2501.06562	translate	read	link
2025-01-11	A Survey on Spoken Italian Datasets and Corpora	Marco Giordano et.al.	2501.06557	translate	read	null
2025-01-11	Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives	Christiaan Jacobs et.al.	2501.06478	translate	read	null
2025-01-11	Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis	Rui Liu et.al.	2501.06467	translate	read	null
2025-01-10	TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer	Vladimir Bataev et.al.	2501.06320	translate	read	null
2025-01-10	Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI	Yuya Asano et.al.	2501.06129	translate	read	null
2025-01-10	Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding	Fabian David Schmidt et.al.	2501.06117	translate	read	link
2025-01-10	Benchmarking Rotary Position Embeddings for Automatic Speech Recognition	Shucong Zhang et.al.	2501.06051	translate	read	null
2025-01-10	Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing	Eklavya Sarkar et.al.	2501.05987	translate	read	link
2025-01-10	Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron	Kishor Kayyar Lakshminarayana et.al.	2501.05976	translate	read	null
2025-01-10	Universal-2-TF: Robust All-Neural Text Formatting for ASR	Yash Khare et.al.	2501.05948	translate	read	null
2025-01-10	ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification	Yi Ma et.al.	2501.05729	translate	read	link
2025-01-09	FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion	Alef Iury Siqueira Ferreira et.al.	2501.05586	translate	read	link
2025-01-09	Probing Speaker-specific Features in Speaker Representations	Aemon Yat Fei Chiu et.al.	2501.05310	translate	read	null
2025-01-09	DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification	Qing Wang et.al.	2501.05127	translate	read	null
2025-01-09	JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis	Jun-Hyeok Cha et.al.	2501.04904	translate	read	null
2025-01-08	FleSpeech: Flexibly Controllable Speech Generation with Various Prompts	Hanzhao Li et.al.	2501.04644	translate	read	null
2025-01-09	OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis	Run Luo et.al.	2501.04561	translate	read	null
2025-01-09	Right Label Context in End-to-End Training of Time-Synchronous ASR Models	Tina Raissi et.al.	2501.04521	translate	read	null
2025-01-08	PolInterviews – A Dataset of German Politician Public Broadcast Interviews	Lukas Birkenmaier et.al.	2501.04484	translate	read	null
2025-01-08	ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training	Xinfa Zhu et.al.	2501.04416	translate	read	null
2025-01-08	Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition	Huimeng Wang et.al.	2501.04379	translate	read	null
2025-01-08	DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions	Weidong Chen et.al.	2501.04256	translate	read	null
2025-01-08	LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition	Bowen Hao et.al.	2501.04204	translate	read	null
2025-01-07	Spectral-Aware Low-Rank Adaptation for Speaker Verification	Zhe Li et.al.	2501.03829	translate	read	link
2025-01-07	NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity	Owais Mujtaba Khanday et.al.	2501.03757	translate	read	null
2025-01-07	Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection	Bang Zeng et.al.	2501.03612	translate	read	null
2025-01-07	Towards a Generalizable Speech Marker for Parkinson’s Disease Diagnosis	Maksim Siniukov et.al.	2501.03581	translate	read	null
2025-01-07	Deep Learning for Pathological Speech: A Survey	Shakeel A. Sheikh et.al.	2501.03536	translate	read	null
2025-01-02	FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles	Tian-Hao Zhang et.al.	2501.03181	translate	read	null
2025-01-06	SYKI-SVC: Advancing Singing Voice Conversion with Post-Processing Innovations and an Open-Source Professional Testset	Yiquan Zhou et.al.	2501.02953	translate	read	null
2025-01-07	Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models	Syed Abdul Gaffar Shakhadri et.al.	2501.02832	translate	read	null
2025-01-05	Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module	Zhongjian Cui et.al.	2501.02452	translate	read	null
2025-01-03	Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer	Vishal Sunder et.al.	2501.01936	translate	read	null
2025-01-03	CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation	Ziqi Liang et.al.	2501.01861	translate	read	null
2025-01-03	MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling	Simon Rouard et.al.	2501.01757	translate	read	null
2025-01-03	Controlling your Attributes in Voice	Xuyuan Li et.al.	2501.01674	translate	read	null
2025-01-03	AdaptVC: High Quality Voice Conversion with Adaptive Learning	Jaehun Kim et.al.	2501.01347	translate	read	null
2025-01-02	Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models	Bin Wang et.al.	2501.01034	translate	read	link
2025-01-01	Incremental Dialogue Management: Survey, Discussion, and Implications for HRI	Casey Kennington et.al.	2501.00953	translate	read	null
2025-01-01	Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation	Shoutao Guo et.al.	2501.00868	translate	read	link
2025-01-01	Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing	Gaofeng Cheng et.al.	2501.00804	translate	read	null
