Audio Processing - 2025-07 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-07-23	AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer	Danny D. Leybzon et.al.	2507.17718	translate	read	null
2025-07-23	Synthetic Voice Data for Automatic Speech Recognition in African Languages	Brian DeRenzi et.al.	2507.17578	translate	read	null
2025-07-23	BoSS: Beyond-Semantic Speech	Qing Wang et.al.	2507.17563	translate	read	null
2025-07-23	Clustering-based hard negative sampling for supervised contrastive speaker verification	Piotr Masztalski et.al.	2507.17540	translate	read	null
2025-07-23	Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task	Milena Davudova et.al.	2507.17326	translate	read	null
2025-07-23	On Temporal Guidance and Iterative Refinement in Audio Source Separation	Tobias Morocutti et.al.	2507.17297	translate	read	null
2025-07-23	Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge	Miaomiao Gao et.al.	2507.17288	translate	read	null
2025-07-22	SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling	Yi Guo et.al.	2507.16884	translate	read	null
2025-07-22	Step-Audio 2 Technical Report	Boyong Wu et.al.	2507.16632	translate	read	link
2025-07-22	An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications	Sujith Pulikodan et.al.	2507.16456	translate	read	null
2025-07-21	Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks	Ziqiao Yu et.al.	2507.16043	translate	read	null
2025-07-21	Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR	Zhong-Qiu Wang et.al.	2507.15229	translate	read	null
2025-07-21	EchoVoices: Preserving Generational Voices and Memories for Seniors and Children	Haiying Xu et.al.	2507.15221	translate	read	null
2025-07-21	Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems	Natalia Tomashenko et.al.	2507.15214	translate	read	null
2025-07-20	DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis	Yinghao Aaron Li et.al.	2507.14988	translate	read	link
2025-07-19	Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion	Yu Zhang et.al.	2507.14534	translate	read	link
2025-07-19	Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications	Satwik Dutta et.al.	2507.14451	translate	read	link
2025-07-18	Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic	Lilit Grigoryan et.al.	2507.13977	translate	read	null
2025-07-18	Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies	Carlos Mena et.al.	2507.13875	translate	read	null
2025-07-17	A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models	Kirill Borodin et.al.	2507.13563	translate	read	link
2025-07-17	Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder	Feng Chen et.al.	2507.13551	translate	read	null
2025-07-18	Automatically assessing oral narratives of Afrikaans and isiXhosa children	Retief Louw et.al.	2507.13205	translate	read	null
2025-07-17	SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks	Kutub Uddin et.al.	2507.13170	translate	read	null
2025-07-17	NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech	Maksim Borisov et.al.	2507.13155	translate	read	null
2025-07-17	UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets	Zhichao Sheng et.al.	2507.12951	translate	read	null
2025-07-17	Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes	Zhou Feng et.al.	2507.12932	translate	read	null
2025-07-17	AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation	Potsawee Manakul et.al.	2507.12705	translate	read	null
2025-07-17	Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine	Anastasia Kuznetsova et.al.	2507.12701	translate	read	null
2025-07-16	Improving Contextual ASR via Multi-grained Fusion with Large Language Models	Shilin Zhou et.al.	2507.12252	translate	read	null
2025-07-16	EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis	Haoxun Li et.al.	2507.12015	translate	read	null
2025-07-15	Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection	Ivan Viakhirev et.al.	2507.11777	translate	read	link
2025-07-15	FasTUSS: Faster Task-Aware Unified Source Separation	Francesco Paissan et.al.	2507.11435	translate	read	null
2025-07-15	Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models	Paul A. Bereuter et.al.	2507.11427	translate	read	null
2025-07-14	WhisperKit: On-device Real-time ASR with Billion-Scale Transformers	Atila Orhon et.al.	2507.10860	translate	read	null
2025-07-14	Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition	Mengzhe Geng et.al.	2507.10827	translate	read	null
2025-07-14	WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling	Qihui Yang et.al.	2507.10534	translate	read	null
2025-07-14	DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation	Yiru Yang et.al.	2507.10313	translate	read	null
2025-07-13	The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge	Yuke Lin et.al.	2507.09499	translate	read	null
2025-07-12	Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning	Dominika Woszczyk et.al.	2507.09310	translate	read	null
2025-07-12	Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?	Shota Horiguchi et.al.	2507.09226	translate	read	null
2025-07-15	Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition	Bingshen Mu et.al.	2507.09116	translate	read	null
2025-07-11	SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment	Shivam Mehta et.al.	2507.09070	translate	read	null
2025-07-11	The Impact of Automatic Speech Transcription on Speaker Attribution	Cristina Aggazzotti et.al.	2507.08660	translate	read	null
2025-07-11	Unlocking Speech Instruction Data Potential with Query Rewriting	Yonghua Hei et.al.	2507.08603	translate	read	null
2025-07-11	ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition	Qingliang Meng et.al.	2507.08477	translate	read	null
2025-07-11	Active Learning for Text-to-Speech Synthesis with Informative Sample Collection	Kentaro Seki et.al.	2507.08319	translate	read	null
2025-07-11	RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing	Yang Xiao et.al.	2507.08227	translate	read	null
2025-07-10	DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation	Chunxi Wang et.al.	2507.08135	translate	read	null
2025-07-10	Modèle physique variationnel pour l’estimation de réponses impulsionnelles de salles	Louis Lalay et.al.	2507.08051	translate	read	null
2025-07-10	Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models	Chen Feng et.al.	2507.07877	translate	read	null
2025-07-10	SecureSpeech: Prompt-based Speaker and Content Protection	Belinda Soh Hui Hui et.al.	2507.07799	translate	read	null
2025-07-10	Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review	Maha Tufail Agro et.al.	2507.07741	translate	read	null
2025-07-08	Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition	Dipayan Bhadra et.al.	2507.07068	translate	read	null
2025-07-09	Speech Tokenizer is Key to Consistent Representation	Wonjin Jung et.al.	2507.06802	translate	read	null
2025-07-09	Exploring State-Space-Model based Language Model in Music Generation	Wei-Jaw Lee et.al.	2507.06674	translate	read	null
2025-07-09	Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents	Zackary Rackauckas et.al.	2507.06483	translate	read	null
2025-07-08	Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis	Xintong Hu et.al.	2507.06116	translate	read	null
2025-07-08	VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis	Alexandre Symeonidis-Herzig et.al.	2507.06060	translate	read	null
2025-07-08	MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation	Fathinah Izzati et.al.	2507.05894	translate	read	null
2025-07-08	How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures	Tanvina Patel et.al.	2507.05885	translate	read	null
2025-07-08	ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark	He Wang et.al.	2507.05727	translate	read	null
2025-07-08	Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition	Zijin Gu et.al.	2507.05724	translate	read	null
2025-07-07	EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation	Fathinah Izzati et.al.	2507.04955	translate	read	null
2025-07-07	Adaptive Slimming for Scalable and Efficient Speech Enhancement	Riccardo Miccini et.al.	2507.04879	translate	read	null
2025-07-07	Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters	Mathilde Abrassart et.al.	2507.04817	translate	read	null
2025-07-07	Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis	Sho Inoue et.al.	2507.04598	translate	read	null
2025-07-06	TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet	Jaeseok Jeong et.al.	2507.04349	translate	read	null
2025-07-05	Prosody Labeling with Phoneme-BERT and Speech Foundation Models	Tomoki Koriyama et.al.	2507.03912	translate	read	null
2025-07-04	Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion	Lea Fischbach et.al.	2507.03641	translate	read	null
2025-07-04	MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI	Roser Batlle-Roca et.al.	2507.03599	translate	read	null
2025-07-08	SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge	Yuxiang Mei et.al.	2507.03343	translate	read	null
2025-07-03	DeepGesture: A conversational gesture synthesis system based on emotions and semantics	Thanh Hoang-Minh et.al.	2507.03147	translate	read	null
2025-07-03	Multi-agent Auditory Scene Analysis	Caleb Rascon et.al.	2507.02755	translate	read	null
2025-07-03	Open-Source System for Multilingual Translation and Cloned Speech Synthesis	Mateo Cámara et.al.	2507.02530	translate	read	null
2025-07-03	A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages	Sumaya Ahmed Salihs et.al.	2507.02428	translate	read	null
2025-07-03	Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability	Mark Atta Mensah et.al.	2507.02407	translate	read	null
2025-07-02	Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis	Marc-André Carbonneau et.al.	2507.02176	translate	read	null
2025-07-02	Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams	Zirui Li et.al.	2507.02115	translate	read	null
2025-07-02	Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla	Md Sazzadul Islam Ridoy et.al.	2507.01931	translate	read	null
2025-07-02	First Steps Towards Voice Anonymization for Code-Switching Speech	Sarina Meyer et.al.	2507.01765	translate	read	null
2025-07-02	PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution	Omkar Shende et.al.	2507.01695	translate	read	null
2025-07-02	Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora	Hitoshi Suda et.al.	2507.01356	translate	read	null
2025-07-02	Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation	Andrei Jelea et.al.	2507.01347	translate	read	null
2025-07-02	AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance	Vishakha Lall et.al.	2507.01274	translate	read	null
2025-07-01	MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement	Nikolai Lund Kühne et.al.	2507.00966	translate	read	null
2025-07-02	Multi-interaction TTS toward professional recording reproduction	Hiroki Kanagawa et.al.	2507.00808	translate	read	null
2025-07-01	Rectifying Magnitude Neglect in Linear Attention	Qihang Fan et.al.	2507.00698	translate	read	null
2025-07-01	Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding	Duc Cao-Dinh et.al.	2507.00669	translate	read	null