Audio Processing - 2026-01

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-01-31 | Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts | Chandrashekar M S et.al. | 2602.03868 | translate | read | null |
| 2026-01-31 | ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation | Junmin Gong et.al. | 2602.00744 | translate | read | null |
| 2026-01-30 | EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis | Li Zhou et.al. | 2601.22873 | translate | read | null |
| 2026-01-30 | CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR | Muhammad Shakeel et.al. | 2601.22792 | translate | read | null |
| 2026-01-30 | Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization | Genshun Wan et.al. | 2601.22779 | translate | read | null |
| 2026-01-29 | An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems | Chanwoo Park et.al. | 2601.22390 | translate | read | null |
| 2026-01-29 | TidyVoice 2026 Challenge Evaluation Plan | Aref Farhadipour et.al. | 2601.21960 | translate | read | null |
| 2026-01-29 | Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts | Michael Kuhlmann et.al. | 2601.21886 | translate | read | null |
| 2026-01-29 | Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER | Xiuwen Zheng et.al. | 2601.21347 | translate | read | null |
| 2026-01-29 | Qwen3-ASR Technical Report | Xian Shi et.al. | 2601.21337 | translate | read | null |
| 2026-01-28 | asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation | Oleg Sedukhin et.al. | 2601.20992 | translate | read | null |
| 2026-01-28 | Text-only adaptation in LLM-based ASR through text denoising | Sergio Burdisso et.al. | 2601.20900 | translate | read | null |
| 2026-01-28 | Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection | Sergio Burdisso et.al. | 2601.20898 | translate | read | null |
| 2026-01-28 | A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models | Ryan Whetten et.al. | 2601.20896 | translate | read | null |
| 2026-01-28 | SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition | Manali Sharma et.al. | 2601.20890 | translate | read | null |
| 2026-01-27 | VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings | Bharath Krishnamurthy et.al. | 2601.20883 | translate | read | link |
| 2026-01-27 | MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading | Matteo Rossi et.al. | 2601.20881 | translate | read | null |
| 2026-01-28 | Erasing Your Voice Before It’s Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech | Myungjin Lee et.al. | 2601.20481 | translate | read | null |
| 2026-01-28 | Self Voice Conversion as an Attack against Neural Audio Watermarking | Yigitcan Özer et.al. | 2601.20432 | translate | read | null |
| 2026-01-28 | ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy | Ya-Tse Wu et.al. | 2601.20319 | translate | read | null |
| 2026-01-28 | Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR | Zilai Wang et.al. | 2601.20142 | translate | read | null |
| 2026-01-27 | T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS | Haibin Wu et.al. | 2601.20094 | translate | read | null |
| 2026-01-27 | Do we really need Self-Attention for Streaming Automatic Speech Recognition? | Youness Dkhissi et.al. | 2601.19960 | translate | read | null |
| 2026-01-27 | HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs | Jeanne Malécot et.al. | 2601.19839 | translate | read | null |
| 2026-01-27 | Rethinking Discrete Speech Representation Tokens for Accent Generation | Jinzuomu Zhong et.al. | 2601.19786 | translate | read | null |
| 2026-01-27 | Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification | Zhihua Fang et.al. | 2601.19709 | translate | read | null |
| 2026-01-27 | SLM-SS: Speech Language Model for Generative Speech Separation | Tianhua Li et.al. | 2601.19533 | translate | read | null |
| 2026-01-27 | Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition | Isha Pandey et.al. | 2601.19451 | translate | read | null |
| 2026-01-27 | SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper | Alexander Polok et.al. | 2601.19194 | translate | read | null |
| 2026-01-26 | Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries | Yuchen Zhang et.al. | 2601.18899 | translate | read | null |
| 2026-01-26 | Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings | Aayush M. Shrestha et.al. | 2601.18694 | translate | read | null |
| 2026-01-26 | Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity | Onyedikachi Hope Amaechi-Okorie et.al. | 2601.18641 | translate | read | null |
| 2026-01-26 | UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment | Wei Wang et.al. | 2601.18438 | translate | read | null |
| 2026-01-26 | Pisets: A Robust Speech Recognition System for Lectures and Interviews | Ivan Bondarenko et.al. | 2601.18415 | translate | read | link |
| 2026-01-26 | Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder | Zhengyang Li et.al. | 2601.18396 | translate | read | null |
| 2026-01-26 | OCR-Enhanced Multimodal ASR Can Read While Listening | Junli Chen et.al. | 2601.18393 | translate | read | null |
| 2026-01-26 | Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning | Steven Vander Eeckt et.al. | 2601.18266 | translate | read | null |
| 2026-01-26 | VIBEVOICE-ASR Technical Report | Zhiliang Peng et.al. | 2601.18184 | translate | read | null |
| 2026-01-26 | OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion | Zhichao Wang et.al. | 2601.18094 | translate | read | null |
| 2026-01-22 | TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice | Aref Farhadipour et.al. | 2601.16358 | translate | read | null |
| 2026-01-21 | Test-Time Adaptation for Speech Emotion Recognition | Jiaheng Dong et.al. | 2601.16240 | translate | read | null |
| 2026-01-20 | SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models | Aafiya Hussain et.al. | 2601.16231 | translate | read | null |
| 2026-01-22 | Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization | Maximos Kaliakatsos-Papakostas et.al. | 2601.16150 | translate | read | null |
| 2026-01-22 | Quantum Dimension Reduction of Hidden Markov Models | Rishi Sundar et.al. | 2601.16126 | translate | read | null |
| 2026-01-22 | Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks | Abdul Hannan et.al. | 2601.16117 | translate | read | null |
| 2026-01-22 | Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs | Lalaram Arya et.al. | 2601.16023 | translate | read | null |
| 2026-01-22 | PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation | Jaekwon Im et.al. | 2601.15872 | translate | read | null |
| 2026-01-22 | U3-xi: Pushing the Boundaries of Speaker Recognition via Incorporating Uncertainty | Junjie Li et.al. | 2601.15719 | translate | read | null |
| 2026-01-22 | DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice | Leying Zhang et.al. | 2601.15596 | translate | read | null |
| 2026-01-20 | Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding | Jayant Havare et.al. | 2601.15339 | translate | read | null |
| 2026-01-21 | Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface | Paige S. DeVries et.al. | 2601.15209 | translate | read | null |
| 2026-01-21 | Training-Efficient Text-to-Music Generation with State-Space Modeling | Wei-Jaw Lee et.al. | 2601.14786 | translate | read | null |
| 2026-01-21 | Inverse-Hessian Regularization for Continual Learning in ASR | Steven Vander Eeckt et.al. | 2601.14751 | translate | read | null |
| 2026-01-21 | Triage knowledge distillation for speaker verification | Ju-ho Kim et.al. | 2601.14699 | translate | read | null |
| 2026-01-21 | Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch | Kanami Imamura et.al. | 2601.14684 | translate | read | null |
| 2026-01-20 | Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum | Mohammed Salah Al-Radhi et.al. | 2601.14472 | translate | read | null |
| 2026-01-20 | Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis | Thanathai Lertpetchpun et.al. | 2601.14417 | translate | read | null |
| 2026-01-20 | DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification | Youngmoon Jung et.al. | 2601.13999 | translate | read | null |
| 2026-01-20 | Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models | Nikita Kuzmin et.al. | 2601.13948 | translate | read | null |
| 2026-01-20 | Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis | Yushen Chen et.al. | 2601.13802 | translate | read | null |
| 2026-01-20 | S $^2$ Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion | Ziqian Wang et.al. | 2601.13629 | translate | read | null |
| 2026-01-19 | The Achilles’ Heel of Angular Margins: A Chebyshev Polynomial Fix for Speaker Verification | Yang Wang et.al. | 2601.13198 | translate | read | null |
| 2026-01-19 | Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition | Warit Sirichotedumrong et.al. | 2601.13044 | translate | read | link |
| 2026-01-19 | Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings | Seymanur Akti et.al. | 2601.12966 | translate | read | null |
| 2026-01-19 | Supervised Learning for Game Music Segmentation | Shangxuan Luo et.al. | 2601.12961 | translate | read | null |
| 2026-01-19 | DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems | Suyang Sun et.al. | 2601.12786 | translate | read | null |
| 2026-01-16 | F-Actor: Controllable Conversational Behaviour in Full-Duplex Models | Maike Züfle et.al. | 2601.11329 | translate | read | null |
| 2026-01-16 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem | Chengyou Wang et.al. | 2601.11027 | translate | read | null |
| 2026-01-15 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers | Runyuan Cai et.al. | 2601.10770 | translate | read | null |
| 2026-01-15 | VoiceSculptor: Your Voice, Designed By You | Jingbin Hu et.al. | 2601.10629 | translate | read | null |
| 2026-01-15 | HeartMuLa: A Family of Open Sourced Music Foundation Models | Dongchao Yang et.al. | 2601.10547 | translate | read | link |
| 2026-01-15 | ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios | Aniket Deroy et.al. | 2601.10315 | translate | read | null |
| 2026-01-15 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter | Ziqi Xu et.al. | 2601.10223 | translate | read | null |
| 2026-01-14 | Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer | Petros Vavaroutsos et.al. | 2601.09603 | translate | read | null |
| 2026-01-14 | Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception | Zhen Wan et.al. | 2601.09413 | translate | read | null |
| 2026-01-14 | SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing | Ziyang Ma et.al. | 2601.09385 | translate | read | null |
| 2026-01-14 | Research on Piano Timbre Transformation System Based on Diffusion Model | Chun-Chieh Hsu et.al. | 2601.09333 | translate | read | null |
| 2026-01-14 | MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus | Yexing Du et.al. | 2601.09270 | translate | read | link |
| 2026-01-13 | Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances | Ziqi Ding et.al. | 2601.08516 | translate | read | null |
| 2026-01-13 | Decoding Order Matters in Autoregressive Speech Synthesis | Minghui Zhao et.al. | 2601.08450 | translate | read | null |
| 2026-01-12 | ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan | Xueping Zhang et.al. | 2601.07303 | translate | read | null |
| 2026-01-12 | Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects | Kalvin Chang et.al. | 2601.07274 | translate | read | link |
| 2026-01-12 | The ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge | Guobin Ma et.al. | 2601.07237 | translate | read | null |
| 2026-01-11 | Task Arithmetic with Support Languages for Low-Resource ASR | Emma Rafkin et.al. | 2601.07038 | translate | read | null |
| 2026-01-11 | Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition | Nathan Roll et.al. | 2601.06972 | translate | read | null |
| 2026-01-11 | Variational decomposition autoencoding improves disentanglement of latent representations | Ioannis Ziogas et.al. | 2601.06844 | translate | read | null |
| 2026-01-11 | Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition | Ayman Mansour et.al. | 2601.06802 | translate | read | null |
| 2026-01-10 | QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models | Zixing Lin et.al. | 2601.06573 | translate | read | null |
| 2026-01-10 | Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning | K. A. Shahriar et.al. | 2601.06560 | translate | read | null |
| 2026-01-09 | An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution | Sheng-Kai Chen et.al. | 2601.06235 | translate | read | null |
| 2026-01-09 | Two-step Authentication: Multi-biometric System Using Voice and Facial Recognition | Kuan Wei Chen et.al. | 2601.06218 | translate | read | null |
| 2026-01-09 | Multimodal In-context Learning for ASR of Low-resource Languages | Zhaolin Li et.al. | 2601.05707 | translate | read | null |
| 2026-01-08 | LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models | Ryutaro Oshima et.al. | 2601.04654 | translate | read | null |
| 2026-01-08 | WESR: Scaling and Evaluating Word-level Event-Speech Recognition | Chenchen Yang et.al. | 2601.04508 | translate | read | null |
| 2026-01-08 | Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition | Da-Hee Yang et.al. | 2601.04459 | translate | read | null |
| 2026-01-07 | Lightweight and perceptually-guided voice conversion for electro-laryngeal speech | Benedikt Mayrhofer et.al. | 2601.03892 | translate | read | null |
| 2026-01-07 | Stuttering-Aware Automatic Speech Recognition for Indonesian Language | Fadhil Muhammad et.al. | 2601.03727 | translate | read | null |
| 2026-01-07 | TellWhisper: Tell Whisper Who Speaks When | Yifan Hu et.al. | 2601.03712 | translate | read | null |
| 2026-01-07 | ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis | Haitao Li et.al. | 2601.03632 | translate | read | null |
| 2026-01-07 | Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias | Joonwon Seo et.al. | 2601.03612 | translate | read | null |
| 2026-01-06 | Tigrinya Number Verbalization: Rules, Algorithm, and Implementation | Fitsum Gaim et.al. | 2601.03403 | translate | read | null |
| 2026-01-06 | A Versatile Multimodal Agent for Multimedia Content Generation | Daoan Zhang et.al. | 2601.03250 | translate | read | null |
| 2026-01-06 | XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection | Kwok-Ho Ng et.al. | 2601.02944 | translate | read | null |
| 2026-01-06 | Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis | Mengze Hong et.al. | 2601.02914 | translate | read | null |
| 2026-01-06 | Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration | Ryan Soh-Eun Shim et.al. | 2601.02906 | translate | read | null |
| 2026-01-06 | Vclip: Face-based Speaker Generation by Face-voice Association Learning | Yao Shi et.al. | 2601.02753 | translate | read | null |
| 2026-01-06 | Multi-channel multi-speaker transformer for speech recognition | Guo Yifan et.al. | 2601.02688 | translate | read | null |
| 2026-01-05 | Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization | Xinyu Wang et.al. | 2601.02455 | translate | read | null |
| 2026-01-05 | VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses | Maryam Abbasihafshejani et.al. | 2601.02444 | translate | read | null |
| 2026-01-05 | MORE: Multi-Objective Adversarial Attacks on Speech Recognition | Xiaoxue Gao et.al. | 2601.01852 | translate | read | null |
| 2026-01-04 | OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech | Yong Ren et.al. | 2601.01459 | translate | read | null |
| 2026-01-03 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection | Jiajie Zhu et.al. | 2601.01239 | translate | read | null |
| 2026-01-02 | Improving Code-Switching Speech Recognition with TTS Data Augmentation | Yue Heng Yeo et.al. | 2601.00935 | translate | read | null |
| 2026-01-02 | Three factor delay learning rules for spiking neural networks | Luke Vassallo et.al. | 2601.00668 | translate | read | null |
| 2026-01-01 | IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition | Zhuoran Zhuang et.al. | 2601.00160 | translate | read | null |
