Audio Processing - 2026-01
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-01-31 | Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts | Chandrashekar M S et.al. | 2602.03868 | translate | read | null |
| 2026-01-31 | ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation | Junmin Gong et.al. | 2602.00744 | translate | read | null |
| 2026-01-30 | EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis | Li Zhou et.al. | 2601.22873 | translate | read | null |
| 2026-01-30 | CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR | Muhammad Shakeel et.al. | 2601.22792 | translate | read | null |
| 2026-01-30 | Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization | Genshun Wan et.al. | 2601.22779 | translate | read | null |
| 2026-01-29 | An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems | Chanwoo Park et.al. | 2601.22390 | translate | read | null |
| 2026-01-29 | TidyVoice 2026 Challenge Evaluation Plan | Aref Farhadipour et.al. | 2601.21960 | translate | read | null |
| 2026-01-29 | Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts | Michael Kuhlmann et.al. | 2601.21886 | translate | read | null |
| 2026-01-29 | Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER | Xiuwen Zheng et.al. | 2601.21347 | translate | read | null |
| 2026-01-29 | Qwen3-ASR Technical Report | Xian Shi et.al. | 2601.21337 | translate | read | null |
| 2026-01-28 | asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation | Oleg Sedukhin et.al. | 2601.20992 | translate | read | null |
| 2026-01-28 | Text-only adaptation in LLM-based ASR through text denoising | Sergio Burdisso et.al. | 2601.20900 | translate | read | null |
| 2026-01-28 | Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection | Sergio Burdisso et.al. | 2601.20898 | translate | read | null |
| 2026-01-28 | A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models | Ryan Whetten et.al. | 2601.20896 | translate | read | null |
| 2026-01-28 | SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition | Manali Sharma et.al. | 2601.20890 | translate | read | null |
| 2026-01-27 | VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings | Bharath Krishnamurthy et.al. | 2601.20883 | translate | read | link |
| 2026-01-27 | MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading | Matteo Rossi et.al. | 2601.20881 | translate | read | null |
| 2026-01-28 | Erasing Your Voice Before It’s Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech | Myungjin Lee et.al. | 2601.20481 | translate | read | null |
| 2026-01-28 | Self Voice Conversion as an Attack against Neural Audio Watermarking | Yigitcan Özer et.al. | 2601.20432 | translate | read | null |
| 2026-01-28 | ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy | Ya-Tse Wu et.al. | 2601.20319 | translate | read | null |
| 2026-01-28 | Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR | Zilai Wang et.al. | 2601.20142 | translate | read | null |
| 2026-01-27 | T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS | Haibin Wu et.al. | 2601.20094 | translate | read | null |
| 2026-01-27 | Do we really need Self-Attention for Streaming Automatic Speech Recognition? | Youness Dkhissi et.al. | 2601.19960 | translate | read | null |
| 2026-01-27 | HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs | Jeanne Malécot et.al. | 2601.19839 | translate | read | null |
| 2026-01-27 | Rethinking Discrete Speech Representation Tokens for Accent Generation | Jinzuomu Zhong et.al. | 2601.19786 | translate | read | null |
| 2026-01-27 | Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification | Zhihua Fang et.al. | 2601.19709 | translate | read | null |
| 2026-01-27 | SLM-SS: Speech Language Model for Generative Speech Separation | Tianhua Li et.al. | 2601.19533 | translate | read | null |
| 2026-01-27 | Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition | Isha Pandey et.al. | 2601.19451 | translate | read | null |
| 2026-01-27 | SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper | Alexander Polok et.al. | 2601.19194 | translate | read | null |
| 2026-01-26 | Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries | Yuchen Zhang et.al. | 2601.18899 | translate | read | null |
| 2026-01-26 | Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings | Aayush M. Shrestha et.al. | 2601.18694 | translate | read | null |
| 2026-01-26 | Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity | Onyedikachi Hope Amaechi-Okorie et.al. | 2601.18641 | translate | read | null |
| 2026-01-26 | UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment | Wei Wang et.al. | 2601.18438 | translate | read | null |
| 2026-01-26 | Pisets: A Robust Speech Recognition System for Lectures and Interviews | Ivan Bondarenko et.al. | 2601.18415 | translate | read | link |
| 2026-01-26 | Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder | Zhengyang Li et.al. | 2601.18396 | translate | read | null |
| 2026-01-26 | OCR-Enhanced Multimodal ASR Can Read While Listening | Junli Chen et.al. | 2601.18393 | translate | read | null |
| 2026-01-26 | Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning | Steven Vander Eeckt et.al. | 2601.18266 | translate | read | null |
| 2026-01-26 | VIBEVOICE-ASR Technical Report | Zhiliang Peng et.al. | 2601.18184 | translate | read | null |
| 2026-01-26 | OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion | Zhichao Wang et.al. | 2601.18094 | translate | read | null |
| 2026-01-22 | TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice | Aref Farhadipour et.al. | 2601.16358 | translate | read | null |
| 2026-01-21 | Test-Time Adaptation for Speech Emotion Recognition | Jiaheng Dong et.al. | 2601.16240 | translate | read | null |
| 2026-01-20 | SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models | Aafiya Hussain et.al. | 2601.16231 | translate | read | null |
| 2026-01-22 | Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization | Maximos Kaliakatsos-Papakostas et.al. | 2601.16150 | translate | read | null |
| 2026-01-22 | Quantum Dimension Reduction of Hidden Markov Models | Rishi Sundar et.al. | 2601.16126 | translate | read | null |
| 2026-01-22 | Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks | Abdul Hannan et.al. | 2601.16117 | translate | read | null |
| 2026-01-22 | Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs | Lalaram Arya et.al. | 2601.16023 | translate | read | null |
| 2026-01-22 | PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation | Jaekwon Im et.al. | 2601.15872 | translate | read | null |
| 2026-01-22 | U3-xi: Pushing the Boundaries of Speaker Recognition via Incorporating Uncertainty | Junjie Li et.al. | 2601.15719 | translate | read | null |
| 2026-01-22 | DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice | Leying Zhang et.al. | 2601.15596 | translate | read | null |
| 2026-01-20 | Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding | Jayant Havare et.al. | 2601.15339 | translate | read | null |
| 2026-01-21 | Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface | Paige S. DeVries et.al. | 2601.15209 | translate | read | null |
| 2026-01-21 | Training-Efficient Text-to-Music Generation with State-Space Modeling | Wei-Jaw Lee et.al. | 2601.14786 | translate | read | null |
| 2026-01-21 | Inverse-Hessian Regularization for Continual Learning in ASR | Steven Vander Eeckt et.al. | 2601.14751 | translate | read | null |
| 2026-01-21 | Triage knowledge distillation for speaker verification | Ju-ho Kim et.al. | 2601.14699 | translate | read | null |
| 2026-01-21 | Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch | Kanami Imamura et.al. | 2601.14684 | translate | read | null |
| 2026-01-20 | Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum | Mohammed Salah Al-Radhi et.al. | 2601.14472 | translate | read | null |
| 2026-01-20 | Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis | Thanathai Lertpetchpun et.al. | 2601.14417 | translate | read | null |
| 2026-01-20 | DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification | Youngmoon Jung et.al. | 2601.13999 | translate | read | null |
| 2026-01-20 | Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models | Nikita Kuzmin et.al. | 2601.13948 | translate | read | null |
| 2026-01-20 | Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis | Yushen Chen et.al. | 2601.13802 | translate | read | null |
| 2026-01-20 | S$^2$Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion | Ziqian Wang et.al. | 2601.13629 | translate | read | null |
| 2026-01-19 | The Achilles’ Heel of Angular Margins: A Chebyshev Polynomial Fix for Speaker Verification | Yang Wang et.al. | 2601.13198 | translate | read | null |
| 2026-01-19 | Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition | Warit Sirichotedumrong et.al. | 2601.13044 | translate | read | link |
| 2026-01-19 | Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings | Seymanur Akti et.al. | 2601.12966 | translate | read | null |
| 2026-01-19 | Supervised Learning for Game Music Segmentation | Shangxuan Luo et.al. | 2601.12961 | translate | read | null |
| 2026-01-19 | DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems | Suyang Sun et.al. | 2601.12786 | translate | read | null |
| 2026-01-16 | F-Actor: Controllable Conversational Behaviour in Full-Duplex Models | Maike Züfle et.al. | 2601.11329 | translate | read | null |
| 2026-01-16 | WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem | Chengyou Wang et.al. | 2601.11027 | translate | read | null |
| 2026-01-15 | Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers | Runyuan Cai et.al. | 2601.10770 | translate | read | null |
| 2026-01-15 | VoiceSculptor: Your Voice, Designed By You | Jingbin Hu et.al. | 2601.10629 | translate | read | null |
| 2026-01-15 | HeartMuLa: A Family of Open Sourced Music Foundation Models | Dongchao Yang et.al. | 2601.10547 | translate | read | link |
| 2026-01-15 | ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios | Aniket Deroy et.al. | 2601.10315 | translate | read | null |
| 2026-01-15 | STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter | Ziqi Xu et.al. | 2601.10223 | translate | read | null |
| 2026-01-14 | Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer | Petros Vavaroutsos et.al. | 2601.09603 | translate | read | null |
| 2026-01-14 | Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception | Zhen Wan et.al. | 2601.09413 | translate | read | null |
| 2026-01-14 | SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing | Ziyang Ma et.al. | 2601.09385 | translate | read | null |
| 2026-01-14 | Research on Piano Timbre Transformation System Based on Diffusion Model | Chun-Chieh Hsu et.al. | 2601.09333 | translate | read | null |
| 2026-01-14 | MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus | Yexing Du et.al. | 2601.09270 | translate | read | link |
| 2026-01-13 | Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances | Ziqi Ding et.al. | 2601.08516 | translate | read | null |
| 2026-01-13 | Decoding Order Matters in Autoregressive Speech Synthesis | Minghui Zhao et.al. | 2601.08450 | translate | read | null |
| 2026-01-12 | ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan | Xueping Zhang et.al. | 2601.07303 | translate | read | null |
| 2026-01-12 | Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects | Kalvin Chang et.al. | 2601.07274 | translate | read | link |
| 2026-01-12 | The ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge | Guobin Ma et.al. | 2601.07237 | translate | read | null |
| 2026-01-11 | Task Arithmetic with Support Languages for Low-Resource ASR | Emma Rafkin et.al. | 2601.07038 | translate | read | null |
| 2026-01-11 | Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition | Nathan Roll et.al. | 2601.06972 | translate | read | null |
| 2026-01-11 | Variational decomposition autoencoding improves disentanglement of latent representations | Ioannis Ziogas et.al. | 2601.06844 | translate | read | null |
| 2026-01-11 | Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition | Ayman Mansour et.al. | 2601.06802 | translate | read | null |
| 2026-01-10 | QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models | Zixing Lin et.al. | 2601.06573 | translate | read | null |
| 2026-01-10 | Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning | K. A. Shahriar et.al. | 2601.06560 | translate | read | null |
| 2026-01-09 | An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution | Sheng-Kai Chen et.al. | 2601.06235 | translate | read | null |
| 2026-01-09 | Two-step Authentication: Multi-biometric System Using Voice and Facial Recognition | Kuan Wei Chen et.al. | 2601.06218 | translate | read | null |
| 2026-01-09 | Multimodal In-context Learning for ASR of Low-resource Languages | Zhaolin Li et.al. | 2601.05707 | translate | read | null |
| 2026-01-08 | LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models | Ryutaro Oshima et.al. | 2601.04654 | translate | read | null |
| 2026-01-08 | WESR: Scaling and Evaluating Word-level Event-Speech Recognition | Chenchen Yang et.al. | 2601.04508 | translate | read | null |
| 2026-01-08 | Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition | Da-Hee Yang et.al. | 2601.04459 | translate | read | null |
| 2026-01-07 | Lightweight and perceptually-guided voice conversion for electro-laryngeal speech | Benedikt Mayrhofer et.al. | 2601.03892 | translate | read | null |
| 2026-01-07 | Stuttering-Aware Automatic Speech Recognition for Indonesian Language | Fadhil Muhammad et.al. | 2601.03727 | translate | read | null |
| 2026-01-07 | TellWhisper: Tell Whisper Who Speaks When | Yifan Hu et.al. | 2601.03712 | translate | read | null |
| 2026-01-07 | ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis | Haitao Li et.al. | 2601.03632 | translate | read | null |
| 2026-01-07 | Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias | Joonwon Seo et.al. | 2601.03612 | translate | read | null |
| 2026-01-06 | Tigrinya Number Verbalization: Rules, Algorithm, and Implementation | Fitsum Gaim et.al. | 2601.03403 | translate | read | null |
| 2026-01-06 | A Versatile Multimodal Agent for Multimedia Content Generation | Daoan Zhang et.al. | 2601.03250 | translate | read | null |
| 2026-01-06 | XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection | Kwok-Ho Ng et.al. | 2601.02944 | translate | read | null |
| 2026-01-06 | Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis | Mengze Hong et.al. | 2601.02914 | translate | read | null |
| 2026-01-06 | Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration | Ryan Soh-Eun Shim et.al. | 2601.02906 | translate | read | null |
| 2026-01-06 | Vclip: Face-based Speaker Generation by Face-voice Association Learning | Yao Shi et.al. | 2601.02753 | translate | read | null |
| 2026-01-06 | Multi-channel multi-speaker transformer for speech recognition | Guo Yifan et.al. | 2601.02688 | translate | read | null |
| 2026-01-05 | Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization | Xinyu Wang et.al. | 2601.02455 | translate | read | null |
| 2026-01-05 | VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses | Maryam Abbasihafshejani et.al. | 2601.02444 | translate | read | null |
| 2026-01-05 | MORE: Multi-Objective Adversarial Attacks on Speech Recognition | Xiaoxue Gao et.al. | 2601.01852 | translate | read | null |
| 2026-01-04 | OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech | Yong Ren et.al. | 2601.01459 | translate | read | null |
| 2026-01-03 | IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection | Jiajie Zhu et.al. | 2601.01239 | translate | read | null |
| 2026-01-02 | Improving Code-Switching Speech Recognition with TTS Data Augmentation | Yue Heng Yeo et.al. | 2601.00935 | translate | read | null |
| 2026-01-02 | Three factor delay learning rules for spiking neural networks | Luke Vassallo et.al. | 2601.00668 | translate | read | null |
| 2026-01-01 | IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition | Zhuoran Zhuang et.al. | 2601.00160 | translate | read | null |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)