Audio Processing - 2025-02 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-02-28	InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation	Chong Zhang et.al.	2503.00084	translate	read	link
2025-02-27	LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation	Keisuke Kamahori et.al.	2502.20583	translate	read	link
2025-02-27	Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications	Marcus Yu Zhe Wee et.al.	2502.20311	translate	read	null
2025-02-27	CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR	Nian Shao et.al.	2502.20040	translate	read	link
2025-02-27	DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models	Weihao Wu et.al.	2502.19924	translate	read	null
2025-02-26	Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis	Ziyue Jiang et.al.	2502.18924	translate	read	null
2025-02-26	CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition	Jiaming Zhou et.al.	2502.18913	translate	read	null
2025-02-25	Exploring Gender Disparities in Automatic Speech Recognition Technology	Hend ElGhazaly et.al.	2502.18434	translate	read	null
2025-02-27	NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms	Yashan Wang et.al.	2502.18008	translate	read	null
2025-02-25	Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm	Yudong Xie et.al.	2502.17829	translate	read	null
2025-02-26	Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation	Qiuming Zhao et.al.	2502.17380	translate	read	null
2025-02-24	Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus	Golshid Shekoufandeh et.al.	2502.17284	translate	read	null
2025-02-24	Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM	Jiatong Shi et.al.	2502.16897	translate	read	null
2025-02-22	Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration	Haoxuan Wang et.al.	2502.16142	translate	read	null
2025-02-21	The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages	Jenalea Rajab et.al.	2502.15916	translate	read	null
2025-02-21	Retrieval-Augmented Speech Recognition Approach for Domain Challenges	Peng Shen et.al.	2502.15264	translate	read	null
2025-02-21	Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders	Weiqiao Shan et.al.	2502.15178	translate	read	null
2025-02-21	Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking	Khanh Le et.al.	2502.15158	translate	read	null
2025-02-20	WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models	Yifu Chen et.al.	2502.14727	translate	read	null
2025-02-20	SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition	Khanh Le et.al.	2502.14685	translate	read	null
2025-02-20	Moshi Moshi? A Model Selection Hijacking Adversarial Attack	Riccardo Petrucci et.al.	2502.14586	translate	read	null
2025-02-19	On the application of Visibility Graphs in the Spectral Domain for Speaker Recognition	Hernan Bocaccio et.al.	2502.14110	translate	read	null
2025-02-18	Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders	Seungbae Kim et.al.	2502.13983	translate	read	null
2025-02-19	Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks	Ori Shapira et.al.	2502.13645	translate	read	link
2025-02-21	VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation	Wei Zhao et.al.	2502.13508	translate	read	link
2025-02-19	Adopting Whisper for Confidence Estimation	Vaibhav Aggarwal et.al.	2502.13446	translate	read	null
2025-02-18	AV-Flow: Transforming Text to Audio-Visual Human-like Interactions	Aggelina Chatziagapi et.al.	2502.13133	translate	read	null
2025-02-18	Neuro-oscillatory models of cortical speech processing	Olesia Dogonasheva et.al.	2502.12935	translate	read	null
2025-02-18	High-Fidelity Music Vocoder using Neural Audio Codecs	Luca A. Lanzendörfer et.al.	2502.12759	translate	read	null
2025-02-18	Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge	Lian Remme et.al.	2502.12714	translate	read	null
2025-02-18	A Comprehensive Survey on Generative AI for Video-to-Music Generation	Shulei Ji et.al.	2502.12489	translate	read	null
2025-02-18	Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models	Hanin Atwany et.al.	2502.12414	translate	read	null
2025-02-18	On the Robust Approximation of ASR Metrics	Abdul Waheed et.al.	2502.12408	translate	read	null
2025-02-17	A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond	Shreya Shukla et.al.	2502.12048	translate	read	null
2025-02-17	NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing	Yifan Liang et.al.	2502.12002	translate	read	null
2025-02-17	Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration	Yan Zhang et.al.	2502.11720	translate	read	null
2025-02-17	Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models	Yingqing Guo et.al.	2502.11420	translate	read	null
2025-02-16	FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching	Hui Wang et.al.	2502.11128	translate	read	null
2025-02-16	In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback	Fyodor Morozko et.al.	2502.11126	translate	read	null
2025-02-16	DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities	Xiangyu Lu et.al.	2502.11123	translate	read	null
2025-02-14	Enhancing Age-Related Robustness in Children Speaker Verification	Vishwas M. Shetty et.al.	2502.10511	translate	read	null
2025-02-14	OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models	William Chen et.al.	2502.10373	translate	read	null
2025-02-14	VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect	Qingyuan Fei et.al.	2502.10329	translate	read	null
2025-02-14	Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries	Serkan Sulun et.al.	2502.10154	translate	read	null
2025-02-14	MTLM: an Innovative Language Model Training Paradigm for ASR	Qingliang Meng et.al.	2502.10058	translate	read	null
2025-02-14	A Preliminary Exploration with GPT-4o Voice Mode	Yu-Xiang Lin et.al.	2502.09940	translate	read	null
2025-02-14	Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge	Naoyuki Kamo et.al.	2502.09859	translate	read	null
2025-02-13	SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops	Eshaq Jamdar et.al.	2502.09553	translate	read	null
2025-02-13	Shortcut Learning Susceptibility in Vision Classifiers	Pirzada Suhail et.al.	2502.09150	translate	read	null
2025-02-13	Quantum Approaches for Dysphonia Assessment in Small Speech Datasets	Ha Tran et.al.	2502.08968	translate	read	null
2025-02-13	TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument	Kyungsu Kim et.al.	2502.08939	translate	read	link
2025-02-13	ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech	Xin Wang et.al.	2502.08857	translate	read	null
2025-02-12	Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors	Vishwanath Pratap Singh et.al.	2502.08587	translate	read	null
2025-02-11	LoRP-TTS: Low-Rank Personalized Text-To-Speech	Łukasz Bondaruk et.al.	2502.07562	translate	read	null
2025-02-12	Music for All: Exploring Multicultural Representations in Music Generation Models	Atharva Mehta et.al.	2502.07328	translate	read	link
2025-02-11	Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement	Xueyao Zhang et.al.	2502.07243	translate	read	null
2025-02-11	VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification	Pengyu Wang et.al.	2502.07205	translate	read	link
2025-02-10	A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication	Qutaiba I. Ali et.al.	2502.06969	translate	read	null
2025-02-10	Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset	Huw Cheston et.al.	2502.06364	translate	read	null
2025-02-09	Speech to Speech Translation with Translatotron: A State of the Art Review	Jules R. Kala et.al.	2502.05980	translate	read	null
2025-02-09	Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models	Jing-Xuan Zhang et.al.	2502.05766	translate	read	null
2025-02-09	Non-invasive electromyographic speech neuroprosthesis: a geometric perspective	Harshavardhana T. Gowda et.al.	2502.05762	translate	read	null
2025-02-09	BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting	Mohammad Jahid Ibna Basher et.al.	2502.05729	translate	read	null
2025-02-08	Gender Bias in Instruction-Guided Speech Synthesis Models	Chun-Yi Kuan et.al.	2502.05649	translate	read	null
2025-02-08	Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model	Jialong Zuo et.al.	2502.05471	translate	read	null
2025-02-07	Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance	Reihaneh Amooie et.al.	2502.04883	translate	read	null
2025-02-07	Lightweight Operations for Visual Speech Recognition	Iason Ioannis Panagos et.al.	2502.04834	translate	read	null
2025-02-07	Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features	Wei Chen et.al.	2502.04722	translate	read	null
2025-02-06	ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement	Keshav Bhandari et.al.	2502.04522	translate	read	link
2025-02-06	GenVC: Self-Supervised Zero-Shot Voice Conversion	Zexin Cai et.al.	2502.04519	translate	read	null
2025-02-06	FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks	Luca Della Libera et.al.	2502.04465	translate	read	link
2025-02-06	Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis	Zhen Ye et.al.	2502.04128	translate	read	link
2025-02-06	Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond	Mardhiyah Sanni et.al.	2502.03945	translate	read	null
2025-02-06	Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS	Afnan Al-Ali et.al.	2502.03895	translate	read	null
2025-02-05	Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality	Shiyi Tan et.al.	2502.03381	translate	read	null
2025-02-05	Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling	Jakob Poncelet et.al.	2502.03212	translate	read	link
2025-02-05	Metis: A Foundation Speech Generation Model with Masked Generative Pre-training	Yuancheng Wang et.al.	2502.03128	translate	read	null
2025-02-04	Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet	Shenran Wang et.al.	2502.02703	translate	read	null
2025-02-03	CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition	Martijn Bartelds et.al.	2502.01777	translate	read	null
2025-02-03	Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models	Christopher Simic et.al.	2502.01709	translate	read	null
2025-02-03	A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport	Yacouba Kaloga et.al.	2502.01588	translate	read	null
2025-02-03	mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition	Andrew Rouditchenko et.al.	2502.01547	translate	read	link
2025-02-03	Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition	Nanjun Zhou et.al.	2502.01152	translate	read	null
2025-02-03	Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis	Weiwei Lin et.al.	2502.01084	translate	read	null
2025-02-01	Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition	Anna Seo Gyeong Choi et.al.	2502.00583	translate	read	null
2025-02-01	Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions	David Gimeno-Gómez et.al.	2502.00464	translate	read	null
2025-02-01	Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language	Turi Abu et.al.	2502.00421	translate	read	link
2025-02-01	When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation	Anna Min et.al.	2502.00377	translate	read	null
2025-02-03	SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions	Dominik Wagner et.al.	2501.19377	translate	read	null
2025-02-03	DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition	Wonjun Lee et.al.	2501.19010	translate	read	null
