Audio Processing - 2025-06 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Translate	Read	Code
2025-06-29	You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties	Paige Tuttösí et.al.	2506.23367	translate	read	null
2025-06-29	The Florence Price Art Song Dataset and Piano Accompaniment Generator	Tao-Tao He et.al.	2506.23130	translate	read	null
2025-06-29	TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure	Qi He et.al.	2506.23094	translate	read	null
2025-06-29	Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models	Cong Xie et.al.	2506.23079	translate	read	null
2025-06-28	Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions	Duygu Altinok et.al.	2506.22858	translate	read	null
2025-06-28	Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization	Duygu Altinok et.al.	2506.22846	translate	read	null
2025-06-28	A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition	Shiyao Wang et.al.	2506.22810	translate	read	null
2025-06-27	Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR	Weiqing Wang et.al.	2506.22646	translate	read	null
2025-06-27	Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition	Shunsuke Mitsumori et.al.	2506.22194	translate	read	null
2025-06-27	SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition	Muhammad Umar Farooq et.al.	2506.22143	translate	read	null
2025-06-27	Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration	Noora Sassali et.al.	2506.22116	translate	read	null
2025-06-27	Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy	Bohan Li et.al.	2506.22023	translate	read	null
2025-06-27	Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit	Kartheek Kumar Reddy Nareddy et.al.	2506.21990	translate	read	null
2025-06-26	Exploring Adapter Design Tradeoffs for Low Resource Music Generation	Atharva Mehta et.al.	2506.21298	translate	read	null
2025-06-26	A Multi-Stage Framework for Multimodal Controllable Speech Synthesis	Rui Niu et.al.	2506.20945	translate	read	null
2025-06-25	Multimodal Representation Learning and Fusion	Qihang Jin et.al.	2506.20494	translate	read	null
2025-06-25	Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR	Aleš Pražák et.al.	2506.20288	translate	read	null
2025-06-24	Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR	Martin Ratajczak et.al.	2506.19761	translate	read	null
2025-06-23	A Fourier Explanation of AI-music Artifacts	Darius Afchar et.al.	2506.19108	translate	read	null
2025-06-23	Benchmarking Music Generation Models and Metrics via Human Preference Studies	Florian Grötschla et.al.	2506.19085	translate	read	null
2025-06-23	Let Your Video Listen to Your Music!	Xinyu Zhang et.al.	2506.18881	translate	read	null
2025-06-24	MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners	Fang-Duo Tsai et.al.	2506.18729	translate	read	link
2025-06-23	Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition	Christian Huber et.al.	2506.18703	translate	read	null
2025-06-23	Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders	Nasser-Eddine Monir et.al.	2506.18691	translate	read	null
2025-06-23	End-to-End Spoken Grammatical Error Correction	Mengjie Qian et.al.	2506.18532	translate	read	null
2025-06-23	AI-Generated Song Detection via Lyrics Transcripts	Markus Frohmann et.al.	2506.18488	translate	read	null
2025-06-23	Selecting N-lowest scores for training MOS prediction models	Yuto Kondo et.al.	2506.18326	translate	read	null
2025-06-23	Large-Scale Training Data Attribution for Music Generative Models via Unlearning	Woosung Choi et.al.	2506.18312	translate	read	null
2025-06-23	Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting	Yuto Kondo et.al.	2506.18307	translate	read	null
2025-06-23	JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles	Yuto Kondo et.al.	2506.18296	translate	read	null
2025-06-20	Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025	Dominik Macháček et.al.	2506.17077	translate	read	null
2025-06-20	Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning	Giuseppe Attanasio et.al.	2506.17019	translate	read	null
2025-06-20	State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition	Aref Farhadipour et.al.	2506.16969	translate	read	null
2025-06-20	Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training	Jianyuan Feng et.al.	2506.16833	translate	read	null
2025-06-20	RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching	Hyun Joon Park et.al.	2506.16741	translate	read	link
2025-06-20	LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization	Daejin Jo et.al.	2506.16738	translate	read	null
2025-06-20	V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos	Qixin Wang et.al.	2506.16716	translate	read	null
2025-06-19	Weight Factorization and Centralization for Continual Learning in Speech Recognition	Enes Yavuz Ugan et.al.	2506.16574	translate	read	null
2025-06-19	Automatic Speech Recognition Biases in Newcastle English: an Error Analysis	Dana Serditova et.al.	2506.16558	translate	read	null
2025-06-19	InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems	Kexin Huang et.al.	2506.16381	translate	read	link
2025-06-18	Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models	Teysir Baoueb et.al.	2506.15530	translate	read	null
2025-06-18	Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper	Jaza Syed et.al.	2506.15514	translate	read	link
2025-06-18	Foundation of Affective Computing and Interaction	Changzeng Fu et.al.	2506.15497	translate	read	null
2025-06-18	An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW	Prateek Mehta et.al.	2506.15029	translate	read	null
2025-06-17	A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments	Md Jahangir Alam Khondkar et.al.	2506.15000	translate	read	link
2025-06-17	Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition	Jiamin Xie et.al.	2506.14973	translate	read	null
2025-06-17	Unifying Streaming and Non-streaming Zipformer-based ASR	Bidisha Sharma et.al.	2506.14434	translate	read	null
2025-06-17	Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification	Yiyang Zhao et.al.	2506.14226	translate	read	null
2025-06-17	Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios	Aswin Shanmugam Subramanian et.al.	2506.14204	translate	read	null
2025-06-17	AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR	Tuan Nguyen et.al.	2506.14190	translate	read	null
2025-06-17	Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models	Tuan Dat Phuong et.al.	2506.14153	translate	read	null
2025-06-16	Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems	Tuan Nguyen et.al.	2506.13596	translate	read	null
2025-06-16	From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars	Pegah Salehi et.al.	2506.13477	translate	read	null
2025-06-16	BUT System for the MLC-SLM Challenge	Alexander Polok et.al.	2506.13414	translate	read	link
2025-06-16	Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR	Yizhou Peng et.al.	2506.13396	translate	read	null
2025-06-16	NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025	Yizhou Peng et.al.	2506.13339	translate	read	null
2025-06-16	Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models	Bo Li et.al.	2506.13300	translate	read	null
2025-06-16	Personalizable Long-Context Symbolic Music Infilling with MIDI-RWKV	Christian Zhou-Zheng et.al.	2506.13001	translate	read	link
2025-06-15	SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition	Yuta Hirano et.al.	2506.12672	translate	read	null
2025-06-14	Video-Guided Text-to-Music Generation Using Public Domain Movie Collections	Haven Kim et.al.	2506.12573	translate	read	null
2025-06-14	Mitigating Non-Target Speaker Bias in Guided Speaker Embedding	Shota Horiguchi et.al.	2506.12500	translate	read	null
2025-06-13	Enabling automatic transcription of child-centered audio recordings from real-world environments	Daniil Kocharov et.al.	2506.11747	translate	read	null
2025-06-13	Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform	Xiangzhu Kong et.al.	2506.11630	translate	read	null
2025-06-13	(SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test	Stefan Bleeck et.al.	2506.11620	translate	read	null
2025-06-13	Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments	Deliang Jin et.al.	2506.11615	translate	read	null
2025-06-12	Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms	Soumen Garai et.al.	2506.11169	translate	read	null
2025-06-12	Improving Named Entity Transcription with Contextual LLM-based Revision	Viet Anh Trinh et.al.	2506.10779	translate	read	null
2025-06-12	BNMusic: Blending Environmental Noises into Personalized Music	Chi Zuo et.al.	2506.10754	translate	read	null
2025-06-12	FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition	Jongsuk Kim et.al.	2506.10747	translate	read	null
2025-06-12	Joint ASR and Speaker Role Tagging with Serialized Output Training	Anfeng Xu et.al.	2506.10349	translate	read	null
2025-06-12	RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding	Yisi Liu et.al.	2506.10289	translate	read	null
2025-06-11	Fine-Grained control over Music Generation with Activation Steering	Dipanshu Panda et.al.	2506.10225	translate	read	null
2025-06-11	UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching	Neta Glazer et.al.	2506.09874	translate	read	null
2025-06-11	Regularizing Learnable Feature Extraction for Automatic Speech Recognition	Peter Vieting et.al.	2506.09804	translate	read	null
2025-06-11	Training-Free Voice Conversion with Factorized Optimal Transport	Alexander Lobashev et.al.	2506.09709	translate	read	link
2025-06-11	You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks	Ünal Ege Gaznepoglu et.al.	2506.09521	translate	read	null
2025-06-11	OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary	Yui Sudo et.al.	2506.09448	translate	read	null
2025-06-11	CoLMbo: Speaker Language Model for Descriptive Profiling	Massa Baali et.al.	2506.09375	translate	read	null
2025-06-11	OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment	Chao-Hong Tan et.al.	2506.09349	translate	read	null
2025-06-10	SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research	Ahmed Adel Attia et.al.	2506.09206	translate	read	null
2025-06-10	FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents	Satu Hopponen et.al.	2506.08981	translate	read	null
2025-06-10	Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model	Ailin Huang et.al.	2506.08967	translate	read	null
2025-06-09	Uncovering the Functional Roles of Nonlinearity in Memory	Manuel Brenner et.al.	2506.07919	translate	read	null
2025-06-09	Unified Semi-Supervised Pipeline for Automatic Speech Recognition	Nune Tadevosyan et.al.	2506.07659	translate	read	null
2025-06-09	Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation	Rui Hu et.al.	2506.07646	translate	read	null
2025-06-09	SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement	Chenyu Yang et.al.	2506.07634	translate	read	link
2025-06-09	Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing	Jin Li et.al.	2506.07536	translate	read	null
2025-06-09	LeVo: High-Quality Song Generation with Multi-Preference Alignment	Shun Lei et.al.	2506.07520	translate	read	link
2025-06-09	Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition	Asahi Sakuma et.al.	2506.07515	translate	read	null
2025-06-09	DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction	Solee Im et.al.	2506.07510	translate	read	null
2025-06-09	Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration	Peng Huang et.al.	2506.07494	translate	read	null
2025-06-08	Speech Recognition on TV Series with Video-guided Post-Correction	Haoyuan Yang et.al.	2506.07323	translate	read	null
2025-06-06	Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems	Bo Ren et.al.	2506.06252	translate	read	null
2025-06-06	Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction	Christophe Van Gysel et.al.	2506.06117	translate	read	null
2025-06-06	CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition	Yun-Shao Tsai et.al.	2506.06071	translate	read	null
2025-06-06	Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models	Yuke Lin et.al.	2506.05796	translate	read	null
2025-06-06	Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition	Mu Yang et.al.	2506.05706	translate	read	null
2025-06-06	Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning	Yangui Fang et.al.	2506.05671	translate	read	null
2025-06-05	Improving AI-generated music with user-guided training	Vishwa Mohan Singh et.al.	2506.04852	translate	read	null
2025-06-05	LLM-based phoneme-to-grapheme for phoneme-based speech recognition	Te Ma et.al.	2506.04711	translate	read	null
2025-06-05	ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition	Thai-Binh Nguyen et.al.	2506.04635	translate	read	null
2025-06-05	LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models	Wen Ding et.al.	2506.04586	translate	read	null
2025-06-04	French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement	Thomas Joubaud et.al.	2506.04495	translate	read	null
2025-06-04	Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR	Zheng-Xin Yong et.al.	2506.04364	translate	read	null
2025-06-04	HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset	Ryan Langman et.al.	2506.04152	translate	read	null
2025-06-04	A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions	Chung-Chun Wang et.al.	2506.04077	translate	read	null
2025-06-04	Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion	Seymanur Akti et.al.	2506.04013	translate	read	null
2025-06-04	MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition	Yinfeng Xia et.al.	2506.03722	translate	read	null
2025-06-04	Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments	Reo Yoneyama et.al.	2506.03554	translate	read	null
2025-06-04	Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network	Kanami Imamura et.al.	2506.03550	translate	read	null
2025-06-03	Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation	Yongqi Wang et.al.	2506.02997	translate	read	null
2025-06-03	A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation	Verena Blaschke et.al.	2506.02894	translate	read	link
2025-06-03	CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech	Helin Wang et.al.	2506.02863	translate	read	link
2025-06-05	DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization	Geonyoung Lee et.al.	2506.02858	translate	read	null
2025-06-03	On the influence of language similarity in non-target speaker verification trials	Paul M. Reuter et.al.	2506.02777	translate	read	null
2025-06-03	Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions	Xiaoxue Gao et.al.	2506.02742	translate	read	null
2025-06-03	Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning	Ömer Tarik Özyilmaz et.al.	2506.02627	translate	read	null
2025-06-03	On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs	Kemal Altwlkany et.al.	2506.02545	translate	read	null
2025-06-03	DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds	Takuya Hasumi et.al.	2506.02499	translate	read	null
2025-06-03	SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant	Yixuan Hou et.al.	2506.02457	translate	read	null
2025-06-02	MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR	Dimitrios Damianos et.al.	2505.24656	translate	read	null
