Audio Processing - 2025-12 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2025-12-31	Index-ASR Technical Report	Zheshu Song et.al.	2601.00890	translate	read	null
2025-12-31	Learning Speech Representations with Variational Predictive Coding	Sung-Lin Yeh et.al.	2601.00100	translate	read	null
2025-12-31	SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models	Yuan-Kuei Wu et.al.	2512.24739	translate	read	null
2025-12-29	MiMo-Audio: Audio Language Models are Few-Shot Learners	Xiaomi LLM-Core Team et.al.	2512.23808	translate	read	null
2025-12-29	PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech	Deepak Babu Piskala et.al.	2512.23686	translate	read	null
2025-12-29	AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration	Minjiang Huang et.al.	2512.23300	translate	read	null
2025-12-27	ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation	Suhua Wang et.al.	2512.22491	translate	read	null
2025-12-17	Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation	Xuanfan Ni et.al.	2512.22165	translate	read	null
2025-12-15	Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification	Jin Sob Kim et.al.	2512.22148	translate	read	null
2025-12-14	EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG	Hanbeot Park et.al.	2512.22146	translate	read	null
2025-12-26	Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning	YuXiang Kong et.al.	2512.21828	translate	read	null
2025-12-25	Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning	Most. Sharmin Sultana Samu et.al.	2512.21702	translate	read	null
2025-12-25	Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition	Lei Gao et.al.	2512.21566	translate	read	null
2025-12-23	QuarkAudio Technical Report	Chengwei Liu et.al.	2512.20151	translate	read	null
2025-12-23	VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance	Chang Sun et.al.	2512.20032	translate	read	null
2025-12-22	From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs	Alessandro Lucca et.al.	2512.19161	translate	read	null
2025-12-22	Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization	Jian You et.al.	2512.18967	translate	read	null
2025-12-21	Speaker Recognition – Wavelet Packet Based Multiresolution Feature Extraction Approach	Saurabh Bhardwaj et.al.	2512.18902	translate	read	null
2025-12-21	Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis	Pengchao Feng et.al.	2512.18699	translate	read	null
2025-12-20	Phoneme-based speech recognition driven by large language models and sampling marginalization	Te Ma et.al.	2512.18371	translate	read	null
2025-12-20	TICL+: A Case Study On Speech In-Context Learning for Children’s Speech Recognition	Haolong Zheng et.al.	2512.18263	translate	read	null
2025-12-19	SAM Audio: Segment Anything in Audio	Bowen Shi et.al.	2512.18099	translate	read	null
2025-12-19	Peeking Into The Future For Contextual Biasing	Ramaneswaran Selvakumar et.al.	2512.17657	translate	read	null
2025-12-19	When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems	Sujal Chondhekar et.al.	2512.17562	translate	read	null
2025-12-19	Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models	Ali Alsayegh et.al.	2512.17474	translate	read	null
2025-12-19	Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition	Zahra Rahmani et.al.	2512.17247	translate	read	null
2025-12-18	Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony	Darshil Chauhan et.al.	2512.16401	translate	read	null
2025-12-16	ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples	Yunfei Yang et.al.	2512.15641	translate	read	null
2025-12-16	Adapting Speech Language Model to Singing Voice Synthesis	Yiwen Zhao et.al.	2512.14657	translate	read	null
2025-12-16	MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation	Yash Vishe et.al.	2512.14629	translate	read	null
2025-12-16	GLM-TTS Technical Report	Jiayan Cui et.al.	2512.14291	translate	read	null
2025-12-16	Scalable Frameworks for Real-World Audio-Visual Speech Recognition	Sungnyun Kim et.al.	2512.14083	translate	read	null
2025-12-15	Reproducing and Dissecting Denoising Language Models for Speech Recognition	Dorian Koch et.al.	2512.13576	translate	read	null
2025-12-15	DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec	Tao Li et.al.	2512.13251	translate	read	null
2025-12-14	BUT Systems for WildSpoof Challenge: SASV in the Wild	Junyi Peng et.al.	2512.12851	translate	read	null
2025-12-14	Procedural Music Generation Systems in Games	Shangxuan Luo et.al.	2512.12834	translate	read	null
2025-12-14	Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models	Mohammad Jalili Torkamani et.al.	2512.12769	translate	read	null
2025-12-13	System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare	Maryam Mustafa et.al.	2512.12240	translate	read	null
2025-12-13	A comparative study of generative models for child voice conversion	Protima Nomo Sudro et.al.	2512.12129	translate	read	null
2025-12-12	All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR	Takafumi Moriya et.al.	2512.11543	translate	read	null
2025-12-12	PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation	Longshen Ou et.al.	2512.11348	translate	read	null
2025-12-12	The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection	Yupei Li et.al.	2512.11241	translate	read	null
2025-12-11	The TCG CREST – RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge	Nikhil Raghav et.al.	2512.11009	translate	read	null
2025-12-11	CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences	Yiyang Wang et.al.	2512.10918	translate	read	null
2025-12-11	TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage	Elroy Galbraith et.al.	2512.10741	translate	read	null
2025-12-11	MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation	Alon Ziv et.al.	2512.10264	translate	read	null
2025-12-10	Robust Speech Activity Detection in the Presence of Singing Voice	Philipp Grundhuber et.al.	2512.09713	translate	read	null
2025-12-09	LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge	Jinyoung Park et.al.	2512.09000	translate	read	null
2025-12-02	Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture	Karamvir Singh et.al.	2512.08973	translate	read	null
2025-12-09	Emovectors: assessing emotional content in jazz improvisations for creativity evaluation	Anna Jordanous et.al.	2512.08812	translate	read	null
2025-12-08	A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification	Nicolas Calbucura et.al.	2512.07571	translate	read	null
2025-12-08	Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data	Srihari Bandarupalli et.al.	2512.07277	translate	read	null
2025-12-06	Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction	Kush Revankar et.al.	2512.06485	translate	read	null
2025-12-06	Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation	Xining Song et.al.	2512.06304	translate	read	null
2025-12-01	KidSpeak: A General Multi-purpose LLM for Kids’ Speech Recognition and Screening	Rohan Sharma et.al.	2512.05994	translate	read	null
2025-12-04	YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases	Gongyu Chen et.al.	2512.04793	translate	read	null
2025-12-04	M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis	Xiaopeng Wang et.al.	2512.04720	translate	read	null
2025-12-02	Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR	Mohan Shi et.al.	2512.03301	translate	read	null
2025-12-02	DAWZY: A New Addition to AI powered “Human in the Loop” Music Co-creation	Aaron C Elkins et.al.	2512.03289	translate	read	null
2025-12-02	Bangla Hate Speech Classification with Fine-tuned Transformer Models	Yalda Keivan Jafari et.al.	2512.02845	translate	read	null
2025-12-01	Swivuriso: The South African Next Voices Multilingual Speech Dataset	Vukosi Marivate et.al.	2512.02201	translate	read	null
2025-12-01	Story2MIDI: Emotionally Aligned Music Generation from Text	Mohammad Shokri et.al.	2512.02192	translate	read	null
2025-12-01	MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark	Yuezhang Peng et.al.	2512.01603	translate	read	null
2025-12-01	ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation	Yuezhang Peng et.al.	2512.01267	translate	read	null
