Audio Processing - 2026-01 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2026-01-31	Benchmarking Automatic Speech Recognition for Indian Languages in Agricultural Contexts	Chandrashekar M S et.al.	2602.03868	translate	read	null
2026-01-31	ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation	Junmin Gong et.al.	2602.00744	translate	read	null
2026-01-30	EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis	Li Zhou et.al.	2601.22873	translate	read	null
2026-01-30	CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR	Muhammad Shakeel et.al.	2601.22792	translate	read	null
2026-01-30	Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization	Genshun Wan et.al.	2601.22779	translate	read	null
2026-01-29	An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems	Chanwoo Park et.al.	2601.22390	translate	read	null
2026-01-29	TidyVoice 2026 Challenge Evaluation Plan	Aref Farhadipour et.al.	2601.21960	translate	read	null
2026-01-29	Speech Quality-Based Localization of Low-Quality Speech and Text-to-Speech Synthesis Artefacts	Michael Kuhlmann et.al.	2601.21886	translate	read	null
2026-01-29	Towards Robust Dysarthric Speech Recognition: LLM-Agent Post-ASR Correction Beyond WER	Xiuwen Zheng et.al.	2601.21347	translate	read	null
2026-01-29	Qwen3-ASR Technical Report	Xian Shi et.al.	2601.21337	translate	read	null
2026-01-28	asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation	Oleg Sedukhin et.al.	2601.20992	translate	read	null
2026-01-28	Text-only adaptation in LLM-based ASR through text denoising	Sergio Burdisso et.al.	2601.20900	translate	read	null
2026-01-28	Reducing Prompt Sensitivity in LLM-based Speech Recognition Through Learnable Projection	Sergio Burdisso et.al.	2601.20898	translate	read	null
2026-01-28	A Study of Data Selection Strategies for Pre-training Self-Supervised Speech Models	Ryan Whetten et.al.	2601.20896	translate	read	null
2026-01-28	SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition	Manali Sharma et.al.	2601.20890	translate	read	null
2026-01-27	VoxMorph: Scalable Zero-shot Voice Identity Morphing via Disentangled Embeddings	Bharath Krishnamurthy et.al.	2601.20883	translate	read	link
2026-01-27	MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading	Matteo Rossi et.al.	2601.20881	translate	read	null
2026-01-28	Erasing Your Voice Before It’s Heard: Training-free Speaker Unlearning for Zero-shot Text-to-Speech	Myungjin Lee et.al.	2601.20481	translate	read	null
2026-01-28	Self Voice Conversion as an Attack against Neural Audio Watermarking	Yigitcan Özer et.al.	2601.20432	translate	read	null
2026-01-28	ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy	Ya-Tse Wu et.al.	2601.20319	translate	read	null
2026-01-28	Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR	Zilai Wang et.al.	2601.20142	translate	read	null
2026-01-27	T-Mimi: A Transformer-based Mimi Decoder for Real-Time On-Phone TTS	Haibin Wu et.al.	2601.20094	translate	read	null
2026-01-27	Do we really need Self-Attention for Streaming Automatic Speech Recognition?	Youness Dkhissi et.al.	2601.19960	translate	read	null
2026-01-27	HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs	Jeanne Malécot et.al.	2601.19839	translate	read	null
2026-01-27	Rethinking Discrete Speech Representation Tokens for Accent Generation	Jinzuomu Zhong et.al.	2601.19786	translate	read	null
2026-01-27	Hyperbolic Additive Margin Softmax with Hierarchical Information for Speaker Verification	Zhihua Fang et.al.	2601.19709	translate	read	null
2026-01-27	SLM-SS: Speech Language Model for Generative Speech Separation	Tianhua Li et.al.	2601.19533	translate	read	null
2026-01-27	Dynamic Multi-Expert Projectors with Stabilized Routing for Multilingual Speech Recognition	Isha Pandey et.al.	2601.19451	translate	read	null
2026-01-27	SE-DiCoW: Self-Enrolled Diarization-Conditioned Whisper	Alexander Polok et.al.	2601.19194	translate	read	null
2026-01-26	Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries	Yuchen Zhang et.al.	2601.18899	translate	read	null
2026-01-26	Neural Multi-Speaker Voice Cloning for Nepali in Low-Resource Settings	Aayush M. Shrestha et.al.	2601.18694	translate	read	null
2026-01-26	Unheard in the Digital Age: Rethinking AI Bias and Speech Diversity	Onyedikachi Hope Amaechi-Okorie et.al.	2601.18641	translate	read	null
2026-01-26	UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment	Wei Wang et.al.	2601.18438	translate	read	null
2026-01-26	Pisets: A Robust Speech Recognition System for Lectures and Interviews	Ivan Bondarenko et.al.	2601.18415	translate	read	link
2026-01-26	Noise-Robust AV-ASR Using Visual Features Both in the Whisper Encoder and Decoder	Zhengyang Li et.al.	2601.18396	translate	read	null
2026-01-26	OCR-Enhanced Multimodal ASR Can Read While Listening	Junli Chen et.al.	2601.18393	translate	read	null
2026-01-26	Efficient Rehearsal for Continual Learning in ASR via Singular Value Tuning	Steven Vander Eeckt et.al.	2601.18266	translate	read	null
2026-01-26	VIBEVOICE-ASR Technical Report	Zhiliang Peng et.al.	2601.18184	translate	read	null
2026-01-26	OneVoice: One Model, Triple Scenarios-Towards Unified Zero-shot Voice Conversion	Zhichao Wang et.al.	2601.18094	translate	read	null
2026-01-22	TidyVoice: A Curated Multilingual Dataset for Speaker Verification Derived from Common Voice	Aref Farhadipour et.al.	2601.16358	translate	read	null
2026-01-21	Test-Time Adaptation for Speech Emotion Recognition	Jiaheng Dong et.al.	2601.16240	translate	read	null
2026-01-20	SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models	Aafiya Hussain et.al.	2601.16231	translate	read	null
2026-01-22	Pay (Cross) Attention to the Melody: Curriculum Masking for Single-Encoder Melodic Harmonization	Maximos Kaliakatsos-Papakostas et.al.	2601.16150	translate	read	null
2026-01-22	Quantum Dimension Reduction of Hidden Markov Models	Rishi Sundar et.al.	2601.16126	translate	read	null
2026-01-22	Distillation-based Layer Dropping (DLD) Effective End-to-end Framework for Dynamic Speech Networks	Abdul Hannan et.al.	2601.16117	translate	read	null
2026-01-22	Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs	Lalaram Arya et.al.	2601.16023	translate	read	null
2026-01-22	PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation	Jaekwon Im et.al.	2601.15872	translate	read	null
2026-01-22	U3-xi: Pushing the Boundaries of Speaker Recognition via Incorporating Uncertainty	Junjie Li et.al.	2601.15719	translate	read	null
2026-01-22	DeepASMR: LLM-Based Zero-Shot ASMR Speech Generation for Anyone of Any Voice	Leying Zhang et.al.	2601.15596	translate	read	null
2026-01-20	Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding	Jayant Havare et.al.	2601.15339	translate	read	null
2026-01-21	Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface	Paige S. DeVries et.al.	2601.15209	translate	read	null
2026-01-21	Training-Efficient Text-to-Music Generation with State-Space Modeling	Wei-Jaw Lee et.al.	2601.14786	translate	read	null
2026-01-21	Inverse-Hessian Regularization for Continual Learning in ASR	Steven Vander Eeckt et.al.	2601.14751	translate	read	null
2026-01-21	Triage knowledge distillation for speaker verification	Ju-ho Kim et.al.	2601.14699	translate	read	null
2026-01-21	Dissecting Performance Degradation in Audio Source Separation under Sampling Frequency Mismatch	Kanami Imamura et.al.	2601.14684	translate	read	null
2026-01-20	Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum	Mohammed Salah Al-Radhi et.al.	2601.14472	translate	read	null
2026-01-20	Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis	Thanathai Lertpetchpun et.al.	2601.14417	translate	read	null
2026-01-20	DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification	Youngmoon Jung et.al.	2601.13999	translate	read	null
2026-01-20	Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models	Nikita Kuzmin et.al.	2601.13948	translate	read	null
2026-01-20	Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis	Yushen Chen et.al.	2601.13802	translate	read	null
2026-01-20	S$^2$Voice: Style-Aware Autoregressive Modeling with Enhanced Conditioning for Singing Style Conversion	Ziqian Wang et.al.	2601.13629	translate	read	null
2026-01-19	The Achilles’ Heel of Angular Margins: A Chebyshev Polynomial Fix for Speaker Verification	Yang Wang et.al.	2601.13198	translate	read	null
2026-01-19	Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition	Warit Sirichotedumrong et.al.	2601.13044	translate	read	link
2026-01-19	Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings	Seymanur Akti et.al.	2601.12966	translate	read	null
2026-01-19	Supervised Learning for Game Music Segmentation	Shangxuan Luo et.al.	2601.12961	translate	read	null
2026-01-19	DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems	Suyang Sun et.al.	2601.12786	translate	read	null
2026-01-16	F-Actor: Controllable Conversational Behaviour in Full-Duplex Models	Maike Züfle et.al.	2601.11329	translate	read	null
2026-01-16	WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem	Chengyou Wang et.al.	2601.11027	translate	read	null
2026-01-15	Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers	Runyuan Cai et.al.	2601.10770	translate	read	null
2026-01-15	VoiceSculptor: Your Voice, Designed By You	Jingbin Hu et.al.	2601.10629	translate	read	null
2026-01-15	HeartMuLa: A Family of Open Sourced Music Foundation Models	Dongchao Yang et.al.	2601.10547	translate	read	link
2026-01-15	ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios	Aniket Deroy et.al.	2601.10315	translate	read	null
2026-01-15	STEAMROLLER: A Multi-Agent System for Inclusive Automatic Speech Recognition for People who Stutter	Ziqi Xu et.al.	2601.10223	translate	read	null
2026-01-14	Linear Complexity Self-Supervised Learning for Music Understanding with Random Quantizer	Petros Vavaroutsos et.al.	2601.09603	translate	read	null
2026-01-14	Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception	Zhen Wan et.al.	2601.09413	translate	read	null
2026-01-14	SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing	Ziyang Ma et.al.	2601.09385	translate	read	null
2026-01-14	Research on Piano Timbre Transformation System Based on Diffusion Model	Chun-Chieh Hsu et.al.	2601.09333	translate	read	null
2026-01-14	MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus	Yexing Du et.al.	2601.09270	translate	read	link
2026-01-13	Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances	Ziqi Ding et.al.	2601.08516	translate	read	null
2026-01-13	Decoding Order Matters in Autoregressive Speech Synthesis	Minghui Zhao et.al.	2601.08450	translate	read	null
2026-01-12	ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan	Xueping Zhang et.al.	2601.07303	translate	read	null
2026-01-12	Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects	Kalvin Chang et.al.	2601.07274	translate	read	link
2026-01-12	The ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge	Guobin Ma et.al.	2601.07237	translate	read	null
2026-01-11	Task Arithmetic with Support Languages for Low-Resource ASR	Emma Rafkin et.al.	2601.07038	translate	read	null
2026-01-11	Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition	Nathan Roll et.al.	2601.06972	translate	read	null
2026-01-11	Variational decomposition autoencoding improves disentanglement of latent representations	Ioannis Ziogas et.al.	2601.06844	translate	read	null
2026-01-11	Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition	Ayman Mansour et.al.	2601.06802	translate	read	null
2026-01-10	QMAVIS: Long Video-Audio Understanding using Fusion of Large Multimodal Models	Zixing Lin et.al.	2601.06573	translate	read	null
2026-01-10	Lightweight Resolution-Aware Audio Deepfake Detection via Cross-Scale Attention and Consistency Learning	K. A. Shahriar et.al.	2601.06560	translate	read	null
2026-01-09	An Intelligent AI glasses System with Multi-Agent Architecture for Real-Time Voice Processing and Task Execution	Sheng-Kai Chen et.al.	2601.06235	translate	read	null
2026-01-09	Two-step Authentication: Multi-biometric System Using Voice and Facial Recognition	Kuan Wei Chen et.al.	2601.06218	translate	read	null
2026-01-09	Multimodal In-context Learning for ASR of Low-resource Languages	Zhaolin Li et.al.	2601.05707	translate	read	null
2026-01-08	LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models	Ryutaro Oshima et.al.	2601.04654	translate	read	null
2026-01-08	WESR: Scaling and Evaluating Word-level Event-Speech Recognition	Chenchen Yang et.al.	2601.04508	translate	read	null
2026-01-08	Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition	Da-Hee Yang et.al.	2601.04459	translate	read	null
2026-01-07	Lightweight and perceptually-guided voice conversion for electro-laryngeal speech	Benedikt Mayrhofer et.al.	2601.03892	translate	read	null
2026-01-07	Stuttering-Aware Automatic Speech Recognition for Indonesian Language	Fadhil Muhammad et.al.	2601.03727	translate	read	null
2026-01-07	TellWhisper: Tell Whisper Who Speaks When	Yifan Hu et.al.	2601.03712	translate	read	null
2026-01-07	ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis	Haitao Li et.al.	2601.03632	translate	read	null
2026-01-07	Mathematical Foundations of Polyphonic Music Generation via Structural Inductive Bias	Joonwon Seo et.al.	2601.03612	translate	read	null
2026-01-06	Tigrinya Number Verbalization: Rules, Algorithm, and Implementation	Fitsum Gaim et.al.	2601.03403	translate	read	null
2026-01-06	A Versatile Multimodal Agent for Multimedia Content Generation	Daoan Zhang et.al.	2601.03250	translate	read	null
2026-01-06	XLSR-MamBo: Scaling the Hybrid Mamba-Attention Backbone for Audio Deepfake Detection	Kwok-Ho Ng et.al.	2601.02944	translate	read	null
2026-01-06	Vulnerabilities of Audio-Based Biometric Authentication Systems Against Deepfake Speech Synthesis	Mengze Hong et.al.	2601.02914	translate	read	null
2026-01-06	Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration	Ryan Soh-Eun Shim et.al.	2601.02906	translate	read	null
2026-01-06	Vclip: Face-based Speaker Generation by Face-voice Association Learning	Yao Shi et.al.	2601.02753	translate	read	null
2026-01-06	Multi-channel multi-speaker transformer for speech recognition	Guo Yifan et.al.	2601.02688	translate	read	null
2026-01-05	Dynamic Quantization Error Propagation in Encoder-Decoder ASR Quantization	Xinyu Wang et.al.	2601.02455	translate	read	null
2026-01-05	VocalBridge: Latent Diffusion-Bridge Purification for Defeating Perturbation-Based Voiceprint Defenses	Maryam Abbasihafshejani et.al.	2601.02444	translate	read	null
2026-01-05	MORE: Multi-Objective Adversarial Attacks on Speech Recognition	Xiaoxue Gao et.al.	2601.01852	translate	read	null
2026-01-04	OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech	Yong Ren et.al.	2601.01459	translate	read	null
2026-01-03	IO-RAE: Information-Obfuscation Reversible Adversarial Example for Audio Privacy Protection	Jiajie Zhu et.al.	2601.01239	translate	read	null
2026-01-02	Improving Code-Switching Speech Recognition with TTS Data Augmentation	Yue Heng Yeo et.al.	2601.00935	translate	read	null
2026-01-02	Three factor delay learning rules for spiking neural networks	Luke Vassallo et.al.	2601.00668	translate	read	null
2026-01-01	IKFST: IOO and KOO Algorithms for Accelerated and Precise WFST-based End-to-End Automatic Speech Recognition	Zhuoran Zhuang et.al.	2601.00160	translate	read	null
