Audio Processing - 2026-02
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2026-02-28 | Polynomial Mixing for Efficient Self-supervised Speech Encoders | Eva Feillet et.al. | 2603.00683 | translate | read | null |
| 2026-02-28 | CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction | Yinghao Ma et.al. | 2603.00610 | translate | read | null |
| 2026-02-28 | Efficient Long-Sequence Diffusion Modeling for Symbolic Music Generation | Jinhan Xu et.al. | 2603.00576 | translate | read | null |
| 2026-02-28 | Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion | Sen Zhang et.al. | 2603.00563 | translate | read | null |
| 2026-02-26 | Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems | Siyuan Liu et.al. | 2602.23266 | translate | read | null |
| 2026-02-26 | Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment | Sanjid Hasan et.al. | 2602.23070 | translate | read | null |
| 2026-02-26 | A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment | Zarif Ishmam et.al. | 2602.22935 | translate | read | null |
| 2026-02-26 | Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing | An-Ci Peng et.al. | 2602.22522 | translate | read | null |
| 2026-02-25 | TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition | Cheng-Yeh Yang et.al. | 2602.22039 | translate | read | null |
| 2026-02-25 | Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization | MD. Sagor Chowdhury et.al. | 2602.21741 | translate | read | null |
| 2026-02-25 | Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration | Tangsang Chongbang et.al. | 2602.21647 | translate | read | null |
| 2026-02-25 | A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation | Chun-wei Ho et.al. | 2602.21476 | translate | read | null |
| 2026-02-24 | 823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio | Ratnajit Dhar et.al. | 2602.21183 | translate | read | null |
| 2026-02-24 | Training-Free Intelligibility-Guided Observation Addition for Noisy ASR | Haoyang Li et.al. | 2602.20967 | translate | read | null |
| 2026-02-23 | An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction | Guanting Shen et.al. | 2602.20219 | translate | read | null |
| 2026-02-23 | Can You Tell It’s AI? Human Perception of Synthetic Voices in Vishing Scenarios | Zoha Hayat Bhatti et.al. | 2602.20061 | translate | read | null |
| 2026-02-23 | Depth-Structured Music Recurrence: Budgeted Recurrent Attention for Full-Piece Symbolic Music Modeling | Yungang Yi et.al. | 2602.19816 | translate | read | null |
| 2026-02-22 | Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition | Alexandros Haliassos et.al. | 2602.19316 | translate | read | null |
| 2026-02-21 | Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation | Yonathan Ron et.al. | 2602.18966 | translate | read | null |
| 2026-02-21 | ReHear: Iterative Pseudo-Label Refinement for Semi-Supervised Speech Recognition via Audio Large Language Models | Zefang Liu et.al. | 2602.18721 | translate | read | null |
| 2026-02-18 | Fine-Pruning: A Biologically Inspired Algorithm for Personalization of Machine Learning Models | Joseph Bingham et.al. | 2602.18507 | translate | read | null |
| 2026-02-20 | MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows | Takuhiro Kaneko et.al. | 2602.18104 | translate | read | null |
| 2026-02-19 | MusicSem: A Semantically Rich Language–Audio Dataset of Natural Music Descriptions | Rebecca Salganik et.al. | 2602.17769 | translate | read | null |
| 2026-02-19 | Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment | Ivan Rinaldi et.al. | 2602.17599 | translate | read | null |
| 2026-02-19 | Voice-Driven Semantic Perception for UAV-Assisted Emergency Networks | Nuno Saavedra et.al. | 2602.17394 | translate | read | null |
| 2026-02-13 | Speech to Speech Synthesis for Voice Impersonation | Bjorn Johnson et.al. | 2602.16721 | translate | read | null |
| 2026-02-18 | Multi-Channel Replay Speech Detection using Acoustic Maps | Michael Neri et.al. | 2602.16399 | translate | read | null |
| 2026-02-18 | How to Label Resynthesized Audio: The Dual Role of Neural Audio Codecs in Audio Deepfake Detection | Yixuan Xiao et.al. | 2602.16343 | translate | read | null |
| 2026-02-17 | LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models | Ahmed Khaled Khamis et.al. | 2602.15675 | translate | read | null |
| 2026-02-17 | Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios | Yiming Yang et.al. | 2602.15519 | translate | read | null |
| 2026-02-17 | Joint Enhancement and Classification using Coupled Diffusion Models of Signals and Logits | Gilad Nurko et.al. | 2602.15405 | translate | read | null |
| 2026-02-16 | Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis | Frederik Rautenberg et.al. | 2602.14686 | translate | read | null |
| 2026-02-16 | Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer’s Disease Detection via Speech | Xiao Wei et.al. | 2602.14655 | translate | read | null |
| 2026-02-16 | CLAP-Based Automatic Word Naming Recognition in Post-Stroke Aphasia | Yacouba Kaloga et.al. | 2602.14584 | translate | read | null |
| 2026-02-15 | From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset | Jandad Jahani et.al. | 2602.14062 | translate | read | null |
| 2026-02-15 | Eureka-Audio: Triggering Audio Intelligence in Compact Language Models | Dan Zhang et.al. | 2602.13954 | translate | read | null |
| 2026-02-14 | voice2mode: Phonation Mode Classification in Singing using Self-Supervised Speech Models | Aju Ani Justus et.al. | 2602.13928 | translate | read | null |
| 2026-02-14 | ELEAT-SAGA: Early & Late Integration with Evading Alternating Training for Spoof-Robust Speaker Verification | Amro Asali et.al. | 2602.13761 | translate | read | null |
| 2026-02-13 | ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset & Benchmark | Tung X. Nguyen et.al. | 2602.12911 | translate | read | null |
| 2026-02-13 | Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting | Jing Xu et.al. | 2602.12746 | translate | read | null |
| 2026-02-13 | PISHYAR: A Socially Intelligent Smart Cane for Indoor Social Navigation and Multimodal Human-Robot Interaction for Visually Impaired People | Mahdi Haghighat Joo et.al. | 2602.12597 | translate | read | null |
| 2026-02-13 | Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR | Jaeyoung Lee et.al. | 2602.12546 | translate | read | null |
| 2026-02-12 | “Sorry, I Didn’t Catch That”: How Speech Models Miss What Matters Most | Kaitlyn Zhou et.al. | 2602.12249 | translate | read | null |
| 2026-02-12 | Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications | Manjunath Kudlur et.al. | 2602.12241 | translate | read | null |
| 2026-02-12 | On the Sensitivity of Firing Rate-Based Federated Spiking Neural Networks to Differential Privacy | Luiz Pereira et.al. | 2602.12009 | translate | read | null |
| 2026-02-12 | TC-BiMamba: Trans-Chunk bidirectionally within BiMamba for unified streaming and non-streaming ASR | Qingshun She et.al. | 2602.11546 | translate | read | null |
| 2026-02-12 | SLD-L2S: Hierarchical Subspace Latent Diffusion for High-Fidelity Lip to Speech Synthesis | Yifan Liang et.al. | 2602.11477 | translate | read | null |
| 2026-02-11 | Voxtral Realtime | Alexander H. Liu et.al. | 2602.11298 | translate | read | null |
| 2026-02-11 | Self-Supervised Learning for Speaker Recognition: A study and review | Theo Lepage et.al. | 2602.10829 | translate | read | null |
| 2026-02-05 | Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language | Isaac Wiafe et.al. | 2602.05406 | translate | read | null |
| 2026-02-03 | Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization | Sai Sindhur Malleni et.al. | 2602.04900 | translate | read | null |
| 2026-02-04 | Speaker-Aware Simulation Improves Conversational Speech Recognition | Máté Gedeon et.al. | 2602.04776 | translate | read | null |
| 2026-02-04 | HoliAntiSpoof: Audio LLM for Holistic Speech Anti-Spoofing | Xuenan Xu et.al. | 2602.04535 | translate | read | null |
| 2026-02-04 | Universal Robust Speech Adaptation for Cross-Domain Speech Recognition and Enhancement | Chien-Chun Wang et.al. | 2602.04307 | translate | read | null |
| 2026-02-04 | Frontend Token Enhancement for Token-Based Speech Recognition | Takanori Ashihara et.al. | 2602.04217 | translate | read | null |
| 2026-02-03 | Mići Princ – A Little Boy Teaching Speech Technologies the Chakavian Dialect | Nikola Ljubešić et.al. | 2602.03245 | translate | read | null |
| 2026-02-03 | Rethinking Music Captioning with Music Metadata LLMs | Irmak Bukey et.al. | 2602.03023 | translate | read | null |
| 2026-02-02 | WAXAL: A Large-Scale Multilingual African Language Speech Corpus | Abdoulaye Diack et.al. | 2602.02734 | translate | read | null |
| 2026-02-01 | VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis | Chengyuan Ma et.al. | 2602.02591 | translate | read | null |
| 2026-02-02 | DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild | Arnab Das et.al. | 2602.02286 | translate | read | null |
| 2026-02-02 | Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition | Wonjun Lee et.al. | 2602.01967 | translate | read | null |
| 2026-02-02 | LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency | Jaejun Lee et.al. | 2602.01908 | translate | read | null |
| 2026-02-02 | Joint Optimization of ASV and CM tasks: BTUEF Team’s Submission for WildSpoof Challenge | Oguzhan Kurnaz et.al. | 2602.01722 | translate | read | null |
| 2026-02-02 | BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition | Hyunsik Kim et.al. | 2602.01717 | translate | read | null |
| 2026-02-01 | Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings | Mariëtte Olijslager et.al. | 2602.01363 | translate | read | null |
| 2026-02-01 | EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech | Besher Hassan et.al. | 2602.01170 | translate | read | null |
| 2026-02-01 | HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection | Zhili Nicholas Liang et.al. | 2602.01032 | translate | read | null |
| 2026-02-01 | Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages | Yang Xiao et.al. | 2602.01008 | translate | read | null |
| 2026-02-01 | MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA | Yutong Song et.al. | 2602.00981 | translate | read | null |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)