Audio Processing - 2025-12
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-12-31 | Index-ASR Technical Report | Zheshu Song et al. | 2601.00890 | translate | read | null |
| 2025-12-31 | Learning Speech Representations with Variational Predictive Coding | Sung-Lin Yeh et al. | 2601.00100 | translate | read | null |
| 2025-12-31 | SLM-TTA: A Framework for Test-Time Adaptation of Generative Spoken Language Models | Yuan-Kuei Wu et al. | 2512.24739 | translate | read | null |
| 2025-12-29 | MiMo-Audio: Audio Language Models are Few-Shot Learners | Xiaomi LLM-Core Team et al. | 2512.23808 | translate | read | null |
| 2025-12-29 | PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech | Deepak Babu Piskala et al. | 2512.23686 | translate | read | null |
| 2025-12-29 | AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration | Minjiang Huang et al. | 2512.23300 | translate | read | null |
| 2025-12-27 | ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation | Suhua Wang et al. | 2512.22491 | translate | read | null |
| 2025-12-17 | Marco-ASR: A Principled and Metric-Driven Framework for Fine-Tuning Large-Scale ASR Models for Domain Adaptation | Xuanfan Ni et al. | 2512.22165 | translate | read | null |
| 2025-12-15 | Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification | Jin Sob Kim et al. | 2512.22148 | translate | read | null |
| 2025-12-14 | EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG | Hanbeot Park et al. | 2512.22146 | translate | read | null |
| 2025-12-26 | Contextual Biasing for LLM-Based ASR with Hotword Retrieval and Reinforcement Learning | YuXiang Kong et al. | 2512.21828 | translate | read | null |
| 2025-12-25 | Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning | Most. Sharmin Sultana Samu et al. | 2512.21702 | translate | read | null |
| 2025-12-25 | Broadband tunable microwave photonic radar for simultaneous detection of human respiration, heartbeat, and speech with deep learning-based speech recognition | Lei Gao et al. | 2512.21566 | translate | read | null |
| 2025-12-23 | QuarkAudio Technical Report | Chengwei Liu et al. | 2512.20151 | translate | read | null |
| 2025-12-23 | VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance | Chang Sun et al. | 2512.20032 | translate | read | null |
| 2025-12-22 | From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs | Alessandro Lucca et al. | 2512.19161 | translate | read | null |
| 2025-12-22 | Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization | Jian You et al. | 2512.18967 | translate | read | null |
| 2025-12-21 | Speaker Recognition – Wavelet Packet Based Multiresolution Feature Extraction Approach | Saurabh Bhardwaj et al. | 2512.18902 | translate | read | null |
| 2025-12-21 | Task Vector in TTS: Toward Emotionally Expressive Dialectal Speech Synthesis | Pengchao Feng et al. | 2512.18699 | translate | read | null |
| 2025-12-20 | Phoneme-based speech recognition driven by large language models and sampling marginalization | Te Ma et al. | 2512.18371 | translate | read | null |
| 2025-12-20 | TICL+: A Case Study On Speech In-Context Learning for Children’s Speech Recognition | Haolong Zheng et al. | 2512.18263 | translate | read | null |
| 2025-12-19 | SAM Audio: Segment Anything in Audio | Bowen Shi et al. | 2512.18099 | translate | read | null |
| 2025-12-19 | Peeking Into The Future For Contextual Biasing | Ramaneswaran Selvakumar et al. | 2512.17657 | translate | read | null |
| 2025-12-19 | When De-noising Hurts: A Systematic Study of Speech Enhancement Effects on Modern Medical ASR Systems | Sujal Chondhekar et al. | 2512.17562 | translate | read | null |
| 2025-12-19 | Zero-Shot Recognition of Dysarthric Speech Using Commercial Automatic Speech Recognition and Multimodal Large Language Models | Ali Alsayegh et al. | 2512.17474 | translate | read | null |
| 2025-12-19 | Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition | Zahra Rahmani et al. | 2512.17247 | translate | read | null |
| 2025-12-18 | Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony | Darshil Chauhan et al. | 2512.16401 | translate | read | null |
| 2025-12-16 | ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples | Yunfei Yang et al. | 2512.15641 | translate | read | null |
| 2025-12-16 | Adapting Speech Language Model to Singing Voice Synthesis | Yiwen Zhao et al. | 2512.14657 | translate | read | null |
| 2025-12-16 | MuseCPBench: an Empirical Study of Music Editing Methods through Music Context Preservation | Yash Vishe et al. | 2512.14629 | translate | read | null |
| 2025-12-16 | GLM-TTS Technical Report | Jiayan Cui et al. | 2512.14291 | translate | read | null |
| 2025-12-16 | Scalable Frameworks for Real-World Audio-Visual Speech Recognition | Sungnyun Kim et al. | 2512.14083 | translate | read | null |
| 2025-12-15 | Reproducing and Dissecting Denoising Language Models for Speech Recognition | Dorian Koch et al. | 2512.13576 | translate | read | null |
| 2025-12-15 | DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec | Tao Li et al. | 2512.13251 | translate | read | null |
| 2025-12-14 | BUT Systems for WildSpoof Challenge: SASV in the Wild | Junyi Peng et al. | 2512.12851 | translate | read | null |
| 2025-12-14 | Procedural Music Generation Systems in Games | Shangxuan Luo et al. | 2512.12834 | translate | read | null |
| 2025-12-14 | Adaptive Edge-Cloud Inference for Speech-to-Action Systems Using ASR and Large Language Models | Mohammad Jalili Torkamani et al. | 2512.12769 | translate | read | null |
| 2025-12-13 | System X: A Mobile Voice-Based AI System for EMR Generation and Clinical Decision Support in Low-Resource Maternal Healthcare | Maryam Mustafa et al. | 2512.12240 | translate | read | null |
| 2025-12-13 | A comparative study of generative models for child voice conversion | Protima Nomo Sudro et al. | 2512.12129 | translate | read | null |
| 2025-12-12 | All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR | Takafumi Moriya et al. | 2512.11543 | translate | read | null |
| 2025-12-12 | PhraseVAE and PhraseLDM: Latent Diffusion for Full-Song Multitrack Symbolic Music Generation | Longshen Ou et al. | 2512.11348 | translate | read | null |
| 2025-12-12 | The Affective Bridge: Unifying Feature Representations for Speech Deepfake Detection | Yupei Li et al. | 2512.11241 | translate | read | null |
| 2025-12-11 | The TCG CREST – RKMVERI Submission for the NCIIPC Startup India AI Grand Challenge | Nikhil Raghav et al. | 2512.11009 | translate | read | null |
| 2025-12-11 | CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences | Yiyang Wang et al. | 2512.10918 | translate | read | null |
| 2025-12-11 | TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage | Elroy Galbraith et al. | 2512.10741 | translate | read | null |
| 2025-12-11 | MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation | Alon Ziv et al. | 2512.10264 | translate | read | null |
| 2025-12-10 | Robust Speech Activity Detection in the Presence of Singing Voice | Philipp Grundhuber et al. | 2512.09713 | translate | read | null |
| 2025-12-09 | LG Uplus System with Multi-Speaker IDs and Discriminator-based Sub-Judges for the WildSpoof Challenge | Jinyoung Park et al. | 2512.09000 | translate | read | null |
| 2025-12-02 | Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture | Karamvir Singh et al. | 2512.08973 | translate | read | null |
| 2025-12-09 | Emovectors: assessing emotional content in jazz improvisations for creativity evaluation | Anna Jordanous et al. | 2512.08812 | translate | read | null |
| 2025-12-08 | A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification | Nicolas Calbucura et al. | 2512.07571 | translate | read | null |
| 2025-12-08 | Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data | Srihari Bandarupalli et al. | 2512.07277 | translate | read | null |
| 2025-12-06 | Sanvaad: A Multimodal Accessibility Framework for ISL Recognition and Voice-Based Interaction | Kush Revankar et al. | 2512.06485 | translate | read | null |
| 2025-12-06 | Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation | Xining Song et al. | 2512.06304 | translate | read | null |
| 2025-12-01 | KidSpeak: A General Multi-purpose LLM for Kids’ Speech Recognition and Screening | Rohan Sharma et al. | 2512.05994 | translate | read | null |
| 2025-12-04 | YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases | Gongyu Chen et al. | 2512.04793 | translate | read | null |
| 2025-12-04 | M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis | Xiaopeng Wang et al. | 2512.04720 | translate | read | null |
| 2025-12-02 | Comparing Unsupervised and Supervised Semantic Speech Tokens: A Case Study of Child ASR | Mohan Shi et al. | 2512.03301 | translate | read | null |
| 2025-12-02 | DAWZY: A New Addition to AI powered “Human in the Loop” Music Co-creation | Aaron C Elkins et al. | 2512.03289 | translate | read | null |
| 2025-12-02 | Bangla Hate Speech Classification with Fine-tuned Transformer Models | Yalda Keivan Jafari et al. | 2512.02845 | translate | read | null |
| 2025-12-01 | Swivuriso: The South African Next Voices Multilingual Speech Dataset | Vukosi Marivate et al. | 2512.02201 | translate | read | null |
| 2025-12-01 | Story2MIDI: Emotionally Aligned Music Generation from Text | Mohammad Shokri et al. | 2512.02192 | translate | read | null |
| 2025-12-01 | MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark | Yuezhang Peng et al. | 2512.01603 | translate | read | null |
| 2025-12-01 | ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation | Yuezhang Peng et al. | 2512.01267 | translate | read | null |