Audio Processing - 2025-12

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-12-31 | Index-ASR Technical Report | Zheshu Song et al. | 2601.00890 | translate | read | null |
| 2025-12-31 | Learning Speech Representations with Variational Predictive Coding | Sung-Lin Yeh et al. | 2601.00100 | translate | read | null |
| 2025-12-31 | SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models | Yuan-Kuei Wu et al. | 2512.24739 | translate | read | null |
| 2025-12-29 | MiMo-Audio: Audio Language Models are Few-Shot Learners | Xiaomi LLM-Core Team et al. | 2512.23808 | translate | read | null |
| 2025-12-29 | PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech | Deepak Babu Piskala et al. | 2512.23686 | translate | read | null |
| 2025-12-29 | AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration | Minjiang Huang et al. | 2512.23300 | translate | read | null |
| 2025-12-27 | ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation | Suhua Wang et al. | 2512.22491 | translate | read | null |
| 2025-12-17 | Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation | Xuanfan Ni et al. | 2512.22165 | translate | read | null |
| 2025-12-15 | Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification | Jin Sob Kim et al. | 2512.22148 | translate | read | null |
| 2025-12-14 | EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG | Hanbeot Park et al. | 2512.22146 | translate | read | null |
| 2025-12-26 | Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning | YuXiang Kong et al. | 2512.21828 | translate | read | null |
| 2025-12-25 | Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning | Most. Sharmin Sultana Samu et al. | 2512.21702 | translate | read | null |
| 2025-12-25 | Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition | Lei Gao et al. | 2512.21566 | translate | read | null |
| 2025-12-23 | QuarkAudio Technical Report | Chengwei Liu et al. | 2512.20151 | translate | read | null |
| 2025-12-23 | VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance | Chang Sun et al. | 2512.20032 | translate | read | null |
| 2025-12-22 | From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs | Alessandro Lucca et al. | 2512.19161 | translate | read | null |
| 2025-12-22 | Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization | Jian You et al. | 2512.18967 | translate | read | null |
| 2025-12-21 | Speaker Recognition – Wavelet Packet Based Multiresolution Feature Extraction Approach | Saurabh Bhardwaj et al. | 2512.18902 | translate | read | null |
| 2025-12-21 | Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis | Pengchao Feng et al. | 2512.18699 | translate | read | null |
| 2025-12-20 | Phoneme-based speech recognition driven by large language models and sampling marginalization | Te Ma et al. | 2512.18371 | translate | read | null |
| 2025-12-20 | TICL+: A Case Study On Speech In-Context Learning for Children’s Speech Recognition | Haolong Zheng et al. | 2512.18263 | translate | read | null |
| 2025-12-19 | SAM Audio: Segment Anything in Audio | Bowen Shi et al. | 2512.18099 | translate | read | null |
| 2025-12-19 | Peeking Into The Future For Contextual Biasing | Ramaneswaran Selvakumar et al. | 2512.17657 | translate | read | null |
| 2025-12-19 | When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems | Sujal Chondhekar et al. | 2512.17562 | translate | read | null |
| 2025-12-19 | Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models | Ali Alsayegh et al. | 2512.17474 | translate | read | null |
| 2025-12-19 | Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition | Zahra Rahmani et al. | 2512.17247 | translate | read | null |
| 2025-12-18 | Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony | Darshil Chauhan et al. | 2512.16401 | translate | read | null |
| 2025-12-16 | ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples | Yunfei Yang et al. | 2512.15641 | translate | read | null |
| 2025-12-16 | Adapting Speech Language Model to Singing Voice Synthesis | Yiwen Zhao et al. | 2512.14657 | translate | read | null |
| 2025-12-16 | MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation | Yash Vishe et al. | 2512.14629 | translate | read | null |
| 2025-12-16 | GLM-TTS Technical Report | Jiayan Cui et al. | 2512.14291 | translate | read | null |
| 2025-12-16 | Scalable Frameworks for Real-World Audio-Visual Speech Recognition | Sungnyun Kim et al. | 2512.14083 | translate | read | null |
| 2025-12-15 | Reproducing and Dissecting Denoising Language Models for Speech Recognition | Dorian Koch et al. | 2512.13576 | translate | read | null |
| 2025-12-15 | DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec | Tao Li et al. | 2512.13251 | translate | read | null |
| 2025-12-14 | BUT Systems for WildSpoof Challenge: SASV in the Wild | Junyi Peng et al. | 2512.12851 | translate | read | null |
| 2025-12-14 | Procedural Music Generation Systems in Games | Shangxuan Luo et al. | 2512.12834 | translate | read | null |
| 2025-12-14 | Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models | Mohammad Jalili Torkamani et al. | 2512.12769 | translate | read | null |
| 2025-12-13 | System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare | Maryam Mustafa et al. | 2512.12240 | translate | read | null |
| 2025-12-13 | A comparative study of generative models for child voice conversion | Protima Nomo Sudro et al. | 2512.12129 | translate | read | null |
| 2025-12-12 | All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR | Takafumi Moriya et al. | 2512.11543 | translate | read | null |
| 2025-12-12 | PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation | Longshen Ou et al. | 2512.11348 | translate | read | null |
| 2025-12-12 | The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection | Yupei Li et al. | 2512.11241 | translate | read | null |
| 2025-12-11 | The TCG CREST – RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge | Nikhil Raghav et al. | 2512.11009 | translate | read | null |
| 2025-12-11 | CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences | Yiyang Wang et al. | 2512.10918 | translate | read | null |
| 2025-12-11 | TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage | Elroy Galbraith et al. | 2512.10741 | translate | read | null |
| 2025-12-11 | MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation | Alon Ziv et al. | 2512.10264 | translate | read | null |
| 2025-12-10 | Robust Speech Activity Detection in the Presence of Singing Voice | Philipp Grundhuber et al. | 2512.09713 | translate | read | null |
| 2025-12-09 | LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge | Jinyoung Park et al. | 2512.09000 | translate | read | null |
| 2025-12-02 | Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture | Karamvir Singh et al. | 2512.08973 | translate | read | null |
| 2025-12-09 | Emovectors: assessing emotional content in jazz improvisations for creativity evaluation | Anna Jordanous et al. | 2512.08812 | translate | read | null |
| 2025-12-08 | A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification | Nicolas Calbucura et al. | 2512.07571 | translate | read | null |
| 2025-12-08 | Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data | Srihari Bandarupalli et al. | 2512.07277 | translate | read | null |
| 2025-12-06 | Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction | Kush Revankar et al. | 2512.06485 | translate | read | null |
| 2025-12-06 | Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation | Xining Song et al. | 2512.06304 | translate | read | null |
| 2025-12-01 | KidSpeak: A General Multi-purpose LLM for Kids’ Speech Recognition and Screening | Rohan Sharma et al. | 2512.05994 | translate | read | null |
| 2025-12-04 | YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases | Gongyu Chen et al. | 2512.04793 | translate | read | null |
| 2025-12-04 | M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis | Xiaopeng Wang et al. | 2512.04720 | translate | read | null |
| 2025-12-02 | Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR | Mohan Shi et al. | 2512.03301 | translate | read | null |
| 2025-12-02 | DAWZY: A New Addition to AI powered “Human in the Loop” Music Co-creation | Aaron C Elkins et al. | 2512.03289 | translate | read | null |
| 2025-12-02 | Bangla Hate Speech Classification with Fine-tuned Transformer Models | Yalda Keivan Jafari et al. | 2512.02845 | translate | read | null |
| 2025-12-01 | Swivuriso: The South African Next Voices Multilingual Speech Dataset | Vukosi Marivatee et al. | 2512.02201 | translate | read | null |
| 2025-12-01 | Story2MIDI: Emotionally Aligned Music Generation from Text | Mohammad Shokri et al. | 2512.02192 | translate | read | null |
| 2025-12-01 | MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark | Yuezhang Peng et al. | 2512.01603 | translate | read | null |
| 2025-12-01 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation | Yuezhang Peng et al. | 2512.01267 | translate | read | null |

(<a href=../Audio_Processing.md>back to Audio Processing</a>)