Audio Processing - 2025-10 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-10-31	NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion	Zongyang Du et.al.	2511.00256	translate	read	null
2025-10-31	Holographic equation of state matched with hadron gas equation as a tool for the study of the quark-gluon plasma evolution	A. V. Anufriev et.al.	2510.27541	translate	read	null
2025-10-31	Referee: Reference-aware Audiovisual Deepfake Detection	Hyemin Boo et.al.	2510.27475	translate	read	null
2025-10-31	Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation	Alireza Gharahighehi et.al.	2510.27342	translate	read	null
2025-10-31	Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication	Deok-Seon Kim et.al.	2510.27247	translate	read	null
2025-10-31	Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm	Anselm Lohmann et.al.	2510.27198	translate	read	null
2025-10-31	Expressive Range Characterization of Open Text-to-Audio Models	Jonathan Morse et.al.	2510.27102	translate	read	null
2025-10-30	Are Online Sports Fan Communities Becoming More Offensive? A Quantitative Review of Topics, Trends, and Toxicity of r/PremierLeague	Muhammad Zeeshan Mazhar et.al.	2510.27003	translate	read	null
2025-10-30	Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations	Jean-Philippe Corbeil et.al.	2510.26974	translate	read	null
2025-10-29	Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition	Amine Razig et.al.	2510.26838	translate	read	null
2025-10-29	Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling	Jiarong Du et.al.	2510.26825	translate	read	null
2025-10-28	Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features	Unzela Talpur et.al.	2510.26823	translate	read	null
2025-10-28	See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement	Jinting Wang et.al.	2510.26819	translate	read	null
2025-10-28	GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment	Jinting Wang et.al.	2510.26818	translate	read	null
2025-10-30	HMM for short independent sequences: Multiple sequence Baum-Welch application	Margarita Cabrera-Bean et.al.	2510.26532	translate	read	null
2025-10-30	UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens	Chengwei Liu et.al.	2510.26372	translate	read	link
2025-10-30	Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages	Mérilin Sousa Silva et.al.	2510.26254	translate	read	null
2025-10-29	Efficient Vocal Source Separation Through Windowed Sink Attention	Christodoulos Benetatos et.al.	2510.25745	translate	read	null
2025-10-29	Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models	Harm Lameris et.al.	2510.25577	translate	read	null
2025-10-29	Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation	Yuxiang Mao et.al.	2510.25234	translate	read	null
2025-10-27	SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution	Dharma Teja Donepudi et.al.	2510.25178	translate	read	null
2025-10-29	Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels	Keisuke Imoto et.al.	2510.25075	translate	read	null
2025-10-29	Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech	Pedro Corrêa et.al.	2510.25054	translate	read	null
2025-10-28	POWSM: A Phonetic Open Whisper-Style Speech Foundation Model	Chin-Jou Li et.al.	2510.24992	translate	read	null
2025-10-28	The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems	Stefano Natangelo et.al.	2510.24831	translate	read	null
2025-10-28	Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation	Inclusion AI et.al.	2510.24821	translate	read	link
2025-10-28	BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation	Raphaël Bagat et.al.	2510.24570	translate	read	null
2025-10-28	Levée d’ambiguïtés par grammaires locales	Eric G. C. Laporte et.al.	2510.24530	translate	read	null
2025-10-28	Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient	Rinku Sebastian et.al.	2510.24519	translate	read	null
2025-10-28	Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes	Jonas Hein et.al.	2510.24332	translate	read	null
2025-10-28	V-SAT: Video Subtitle Annotation Tool	Arpita Kundu et.al.	2510.24180	translate	read	null
2025-10-28	RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects	Md. Rezuwan Hassan et.al.	2510.24096	translate	read	null
2025-10-27	A Neural Model for Contextual Biasing Score Learning and Filtering	Wanting Huang et.al.	2510.23849	translate	read	null
2025-10-27	Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders	Nathan Paek et.al.	2510.23802	translate	read	null
2025-10-27	SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity	Hanke Xie et.al.	2510.23541	translate	read	null
2025-10-27	LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization	Máté Gedeon et.al.	2510.23320	translate	read	null
2025-10-27	Arabic Little STT: Arabic Children Speech Recognition Dataset	Mouhand Alkadri et.al.	2510.23319	translate	read	null
2025-10-27	Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?	Tawsif Tashwar Dipto et.al.	2510.23252	translate	read	null
2025-10-27	Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement	Sarabeth S. Mullins et.al.	2510.23141	translate	read	null
2025-10-27	Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition	Jing-Xuan Zhang et.al.	2510.22961	translate	read	null
2025-10-26	LRW-Persian: Lip-reading in the Wild Dataset for Persian Language	Zahra Taghizadeh et.al.	2510.22716	translate	read	null
2025-10-26	Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs	Anand et.al.	2510.22603	translate	read	link
2025-10-26	UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models	Wenming Tu et.al.	2510.22588	translate	read	link
2025-10-26	A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus	Michael Scott et.al.	2510.22495	translate	read	null
2025-10-26	The Tonogenesis Continuum in Tibetan: A Computational Investigation	Siyu Liang et.al.	2510.22485	translate	read	null
2025-10-25	M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR	Ruixiang Mao et.al.	2510.22172	translate	read	null
2025-10-25	Streaming Generation for Music Accompaniment	Yusong Wu et.al.	2510.22105	translate	read	null
2025-10-23	GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer	Jackson Loth et.al.	2510.21872	translate	read	null
2025-10-24	StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks	Jingyue Huang et.al.	2510.21685	translate	read	null
2025-10-23	ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring	Ari Frummer et.al.	2510.21014	translate	read	null
2025-10-21	Can large audio language models understand child stuttering speech? speech summarization, and source separation	Chibuzor Okocha et.al.	2510.20850	translate	read	null
2025-10-23	R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion	Junjie Zheng et.al.	2510.20677	translate	read	null
2025-10-23	Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding	Xin Zhang et.al.	2510.20504	translate	read	link
2025-10-23	Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator	Hualei Wang et.al.	2510.20210	translate	read	null
2025-10-23	SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance	Haowei Lou et.al.	2510.20113	translate	read	null
2025-10-22	Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition	Yuu Jinnai et.al.	2510.19471	translate	read	null
2025-10-22	FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems	Ziheng Deng et.al.	2510.19301	translate	read	null
2025-10-22	Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges	Cheng Huang et.al.	2510.19144	translate	read	null
2025-10-21	Steering Autoregressive Music Generation with Recursive Feature Machines	Daniel Zhao et.al.	2510.19127	translate	read	link
2025-10-21	StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction	Qianheng Xu et.al.	2510.18938	translate	read	null
2025-10-21	RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling	Mandip Goswami et.al.	2510.18917	translate	read	link
2025-10-21	MLMA: Towards Multilingual ASR With Mamba-based Architectures	Mohamed Nabih Ali et.al.	2510.18684	translate	read	null
2025-10-21	Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification	Bin Gu et.al.	2510.18533	translate	read	null
2025-10-21	A Stage-Wise Learning Strategy with Fixed Anchors for Robust Speaker Verification	Bin Gu et.al.	2510.18530	translate	read	null
2025-10-20	DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model	Massa Baali et.al.	2510.17662	translate	read	null
2025-10-19	U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation	Xusheng Yang et.al.	2510.16718	translate	read	null
2025-10-19	Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios	Shiyao Wang et.al.	2510.16700	translate	read	null
2025-10-18	Hallucination Benchmark for Speech Foundation Models	Alkis Koudounas et.al.	2510.16567	translate	read	null
2025-10-18	Interpreting the Dimensions of Speaker Embedding Space	Mark Huckvale et.al.	2510.16489	translate	read	null
2025-10-18	Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment	Fu-An Chao et.al.	2510.16387	translate	read	null
2025-10-18	MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding	Jingyue Huang et.al.	2510.16273	translate	read	null
2025-10-17	SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling	Kadri Hacioglu et.al.	2510.15851	translate	read	null
2025-10-17	SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models	Rachmad Vidya Wicaksana Putra et.al.	2510.15566	translate	read	null
2025-10-16	RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF	Qing Yang et.al.	2510.14628	translate	read	null
2025-10-16	Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics?	Qixin Deng et.al.	2510.14249	translate	read	null
2025-10-15	Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks	Supriti Sinhamahapatra et.al.	2510.13979	translate	read	null
2025-10-15	Closing the Gap Between Text and Speech Understanding in LLMs	Santiago Cuervo et.al.	2510.13632	translate	read	null
2025-10-15	UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE	Zhenyu Liu et.al.	2510.13344	translate	read	link
2025-10-15	Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses	Sungnyun Kim et.al.	2510.13281	translate	read	null
2025-10-14	Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs	Xinlu He et.al.	2510.12995	translate	read	null
2025-10-14	VCTR: A Transformer-Based Model for Non-parallel Voice Conversion	Maharnab Saikia et.al.	2510.12964	translate	read	null
2025-10-14	A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation	Mohammed Hilal Al-Kharusi et.al.	2510.12858	translate	read	null
2025-10-14	Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models	Tsung-En Lin et.al.	2510.12851	translate	read	null
2025-10-11	Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation	Md. Nayeem et.al.	2510.12827	translate	read	null
2025-10-14	Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models	Prasenjit K Mudi et.al.	2510.12666	translate	read	null
2025-10-13	BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis	Jingyuan Xing et.al.	2510.11646	translate	read	null
2025-10-13	Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker	Cheng Gong et.al.	2510.11124	translate	read	null
2025-10-13	VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents	Jiliang Hu et.al.	2510.11098	translate	read	null
2025-10-12	ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis	Mohammad Javad Ranjbar Kalahroodi et.al.	2510.10774	translate	read	null
2025-10-12	End-to-end Speech Recognition with similar length speech and text	Peng Fan et.al.	2510.10453	translate	read	null
2025-10-12	MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations	Wenxiang Guo et.al.	2510.10396	translate	read	null
2025-10-11	End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs	Nam Luu et.al.	2510.10329	translate	read	null
2025-10-11	ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis	Stephen Ni-Hahn et.al.	2510.10249	translate	read	null
2025-10-11	SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation	Zeyu Ling et.al.	2510.10069	translate	read	null
2025-10-10	Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking	Mohammad Hossein Sameti et.al.	2510.09528	translate	read	null
2025-10-10	WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations	Hui Wang et.al.	2510.09344	translate	read	null
2025-10-10	SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion	Zhao Guo et.al.	2510.09245	translate	read	null
2025-10-10	Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality – an experimental evaluation	Michele Buccoli et.al.	2510.09236	translate	read	null
2025-10-10	FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms	Atul Shree et.al.	2510.09085	translate	read	null
2025-10-10	O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion	Huu Tuong Tu et.al.	2510.09061	translate	read	link
2025-10-08	Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization	Rui Hu et.al.	2510.08618	translate	read	null
2025-10-09	MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows	Guobin Ma et.al.	2510.08392	translate	read	link
2025-10-09	DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching	Hanke Xie et.al.	2510.08373	translate	read	null
2025-10-09	Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition	Yi-Cheng Lin et.al.	2510.08047	translate	read	null
2025-10-09	IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation	Wei Wang et.al.	2510.07979	translate	read	null
2025-10-09	VoiceAgentBench: Are Voice Assistants ready for agentic tasks?	Dhruv Jain et.al.	2510.07978	translate	read	null
2025-10-09	Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor	Kuan-Yu Chen et.al.	2510.07909	translate	read	null
2025-10-08	How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu	Benjamin Akera et.al.	2510.07221	translate	read	link
2025-10-08	Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis	Zhu Li et.al.	2510.07096	translate	read	null
2025-10-08	Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation	Vaibhav Srivastav et.al.	2510.06961	translate	read	null
