Audio Processing - 2025-06

| Publish Date | Title | Authors | PDF (arXiv ID) | Code |
|---|---|---|---|---|
| 2025-06-29 | You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties | Paige Tuttösí et al. | 2506.23367 | - |
| 2025-06-29 | The Florence Price Art Song Dataset and Piano Accompaniment Generator | Tao-Tao He et al. | 2506.23130 | - |
| 2025-06-29 | TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure | Qi He et al. | 2506.23094 | - |
| 2025-06-29 | Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models | Cong Xie et al. | 2506.23079 | - |
| 2025-06-28 | Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions | Duygu Altinok et al. | 2506.22858 | - |
| 2025-06-28 | Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization | Duygu Altinok et al. | 2506.22846 | - |
| 2025-06-28 | A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition | Shiyao Wang et al. | 2506.22810 | - |
| 2025-06-27 | Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR | Weiqing Wang et al. | 2506.22646 | - |
| 2025-06-27 | Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition | Shunsuke Mitsumori et al. | 2506.22194 | - |
| 2025-06-27 | SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition | Muhammad Umar Farooq et al. | 2506.22143 | - |
| 2025-06-27 | Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration | Noora Sassali et al. | 2506.22116 | - |
| 2025-06-27 | Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy | Bohan Li et al. | 2506.22023 | - |
| 2025-06-27 | Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit | Kartheek Kumar Reddy Nareddy et al. | 2506.21990 | - |
| 2025-06-26 | Exploring Adapter Design Tradeoffs for Low Resource Music Generation | Atharva Mehta et al. | 2506.21298 | - |
| 2025-06-26 | A Multi-Stage Framework for Multimodal Controllable Speech Synthesis | Rui Niu et al. | 2506.20945 | - |
| 2025-06-25 | Multimodal Representation Learning and Fusion | Qihang Jin et al. | 2506.20494 | - |
| 2025-06-25 | Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR | Aleš Pražák et al. | 2506.20288 | - |
| 2025-06-24 | Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR | Martin Ratajczak et al. | 2506.19761 | - |
| 2025-06-23 | A Fourier Explanation of AI-music Artifacts | Darius Afchar et al. | 2506.19108 | - |
| 2025-06-23 | Benchmarking Music Generation Models and Metrics via Human Preference Studies | Florian Grötschla et al. | 2506.19085 | - |
| 2025-06-23 | Let Your Video Listen to Your Music! | Xinyu Zhang et al. | 2506.18881 | - |
| 2025-06-24 | MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners | Fang-Duo Tsai et al. | 2506.18729 | link |
| 2025-06-23 | Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition | Christian Huber et al. | 2506.18703 | - |
| 2025-06-23 | Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders | Nasser-Eddine Monir et al. | 2506.18691 | - |
| 2025-06-23 | End-to-End Spoken Grammatical Error Correction | Mengjie Qian et al. | 2506.18532 | - |
| 2025-06-23 | AI-Generated Song Detection via Lyrics Transcripts | Markus Frohmann et al. | 2506.18488 | - |
| 2025-06-23 | Selecting N-lowest scores for training MOS prediction models | Yuto Kondo et al. | 2506.18326 | - |
| 2025-06-23 | Large-Scale Training Data Attribution for Music Generative Models via Unlearning | Woosung Choi et al. | 2506.18312 | - |
| 2025-06-23 | Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting | Yuto Kondo et al. | 2506.18307 | - |
| 2025-06-23 | JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles | Yuto Kondo et al. | 2506.18296 | - |
| 2025-06-20 | Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 | Dominik Macháček et al. | 2506.17077 | - |
| 2025-06-20 | Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning | Giuseppe Attanasio et al. | 2506.17019 | - |
| 2025-06-20 | State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition | Aref Farhadipour et al. | 2506.16969 | - |
| 2025-06-20 | Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training | Jianyuan Feng et al. | 2506.16833 | - |
| 2025-06-20 | RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Hyun Joon Park et al. | 2506.16741 | link |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et al. | 2506.16738 | - |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Qixin Wang et al. | 2506.16716 | - |
| 2025-06-19 | Weight Factorization and Centralization for Continual Learning in Speech Recognition | Enes Yavuz Ugan et al. | 2506.16574 | - |
| 2025-06-19 | Automatic Speech Recognition Biases in Newcastle English: an Error Analysis | Dana Serditova et al. | 2506.16558 | - |
| 2025-06-19 | InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Kexin Huang et al. | 2506.16381 | link |
| 2025-06-18 | Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models | Teysir Baoueb et al. | 2506.15530 | - |
| 2025-06-18 | Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper | Jaza Syed et al. | 2506.15514 | link |
| 2025-06-18 | Foundation of Affective Computing and Interaction | Changzeng Fu et al. | 2506.15497 | - |
| 2025-06-18 | An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW | Prateek Mehta et al. | 2506.15029 | - |
| 2025-06-17 | A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments | Md Jahangir Alam Khondkar et al. | 2506.15000 | link |
| 2025-06-17 | Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition | Jiamin Xie et al. | 2506.14973 | - |
| 2025-06-17 | Unifying Streaming and Non-streaming Zipformer-based ASR | Bidisha Sharma et al. | 2506.14434 | - |
| 2025-06-17 | Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification | Yiyang Zhao et al. | 2506.14226 | - |
| 2025-06-17 | Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios | Aswin Shanmugam Subramanian et al. | 2506.14204 | - |
| 2025-06-17 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR | Tuan Nguyen et al. | 2506.14190 | - |
| 2025-06-17 | Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models | Tuan Dat Phuong et al. | 2506.14153 | - |
| 2025-06-16 | Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems | Tuan Nguyen et al. | 2506.13596 | - |
| 2025-06-16 | From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars | Pegah Salehi et al. | 2506.13477 | - |
| 2025-06-16 | BUT System for the MLC-SLM Challenge | Alexander Polok et al. | 2506.13414 | link |
| 2025-06-16 | Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR | Yizhou Peng et al. | 2506.13396 | - |
| 2025-06-16 | NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 | Yizhou Peng et al. | 2506.13339 | - |
| 2025-06-16 | Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models | Bo Li et al. | 2506.13300 | - |
| 2025-06-16 | Personalizable Long-Context Symbolic Music Infilling with MIDI-RWKV | Christian Zhou-Zheng et al. | 2506.13001 | link |
| 2025-06-15 | SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition | Yuta Hirano et al. | 2506.12672 | - |
| 2025-06-14 | Video-Guided Text-to-Music Generation Using Public Domain Movie Collections | Haven Kim et al. | 2506.12573 | - |
| 2025-06-14 | Mitigating Non-Target Speaker Bias in Guided Speaker Embedding | Shota Horiguchi et al. | 2506.12500 | - |
| 2025-06-13 | Enabling automatic transcription of child-centered audio recordings from real-world environments | Daniil Kocharov et al. | 2506.11747 | - |
| 2025-06-13 | Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform | Xiangzhu Kong et al. | 2506.11630 | - |
| 2025-06-13 | (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test | Stefan Bleeck et al. | 2506.11620 | - |
| 2025-06-13 | Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments | Deliang Jin et al. | 2506.11615 | - |
| 2025-06-12 | Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms | Soumen Garai et al. | 2506.11169 | - |
| 2025-06-12 | Improving Named Entity Transcription with Contextual LLM-based Revision | Viet Anh Trinh et al. | 2506.10779 | - |
| 2025-06-12 | BNMusic: Blending Environmental Noises into Personalized Music | Chi Zuo et al. | 2506.10754 | - |
| 2025-06-12 | FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition | Jongsuk Kim et al. | 2506.10747 | - |
| 2025-06-12 | Joint ASR and Speaker Role Tagging with Serialized Output Training | Anfeng Xu et al. | 2506.10349 | - |
| 2025-06-12 | RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding | Yisi Liu et al. | 2506.10289 | - |
| 2025-06-11 | Fine-Grained control over Music Generation with Activation Steering | Dipanshu Panda et al. | 2506.10225 | - |
| 2025-06-11 | UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching | Neta Glazer et al. | 2506.09874 | - |
| 2025-06-11 | Regularizing Learnable Feature Extraction for Automatic Speech Recognition | Peter Vieting et al. | 2506.09804 | - |
| 2025-06-11 | Training-Free Voice Conversion with Factorized Optimal Transport | Alexander Lobashev et al. | 2506.09709 | link |
| 2025-06-11 | You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks | Ünal Ege Gaznepoglu et al. | 2506.09521 | - |
| 2025-06-11 | OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et al. | 2506.09448 | - |
| 2025-06-11 | CoLMbo: Speaker Language Model for Descriptive Profiling | Massa Baali et al. | 2506.09375 | - |
| 2025-06-11 | OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Chao-Hong Tan et al. | 2506.09349 | - |
| 2025-06-10 | SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research | Ahmed Adel Attia et al. | 2506.09206 | - |
| 2025-06-10 | FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents | Satu Hopponen et al. | 2506.08981 | - |
| 2025-06-10 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | Ailin Huang et al. | 2506.08967 | - |
| 2025-06-09 | Uncovering the Functional Roles of Nonlinearity in Memory | Manuel Brenner et al. | 2506.07919 | - |
| 2025-06-09 | Unified Semi-Supervised Pipeline for Automatic Speech Recognition | Nune Tadevosyan et al. | 2506.07659 | - |
| 2025-06-09 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Rui Hu et al. | 2506.07646 | - |
| 2025-06-09 | SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement | Chenyu Yang et al. | 2506.07634 | link |
| 2025-06-09 | Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing | Jin Li et al. | 2506.07536 | - |
| 2025-06-09 | LeVo: High-Quality Song Generation with Multi-Preference Alignment | Shun Lei et al. | 2506.07520 | link |
| 2025-06-09 | Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition | Asahi Sakuma et al. | 2506.07515 | - |
| 2025-06-09 | DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction | Solee Im et al. | 2506.07510 | - |
| 2025-06-09 | Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration | Peng Huang et al. | 2506.07494 | - |
| 2025-06-08 | Speech Recognition on TV Series with Video-guided Post-Correction | Haoyuan Yang et al. | 2506.07323 | - |
| 2025-06-06 | Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems | Bo Ren et al. | 2506.06252 | - |
| 2025-06-06 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction | Christophe Van Gysel et al. | 2506.06117 | - |
| 2025-06-06 | CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition | Yun-Shao Tsai et al. | 2506.06071 | - |
| 2025-06-06 | Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models | Yuke Lin et al. | 2506.05796 | - |
| 2025-06-06 | Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition | Mu Yang et al. | 2506.05706 | - |
| 2025-06-06 | Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning | Yangui Fang et al. | 2506.05671 | - |
| 2025-06-05 | Improving AI-generated music with user-guided training | Vishwa Mohan Singh et al. | 2506.04852 | - |
| 2025-06-05 | LLM-based phoneme-to-grapheme for phoneme-based speech recognition | Te Ma et al. | 2506.04711 | - |
| 2025-06-05 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition | Thai-Binh Nguyen et al. | 2506.04635 | - |
| 2025-06-05 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models | Wen Ding et al. | 2506.04586 | - |
| 2025-06-04 | French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement | Thomas Joubaud et al. | 2506.04495 | - |
| 2025-06-04 | Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR | Zheng-Xin Yong et al. | 2506.04364 | - |
| 2025-06-04 | HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset | Ryan Langman et al. | 2506.04152 | - |
| 2025-06-04 | A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Chung-Chun Wang et al. | 2506.04077 | - |
| 2025-06-04 | Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion | Seymanur Akti et al. | 2506.04013 | - |
| 2025-06-04 | MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition | Yinfeng Xia et al. | 2506.03722 | - |
| 2025-06-04 | Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments | Reo Yoneyama et al. | 2506.03554 | - |
| 2025-06-04 | Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network | Kanami Imamura et al. | 2506.03550 | - |
| 2025-06-03 | Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation | Yongqi Wang et al. | 2506.02997 | - |
| 2025-06-03 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation | Verena Blaschke et al. | 2506.02894 | link |
| 2025-06-03 | CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Helin Wang et al. | 2506.02863 | link |
| 2025-06-05 | DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization | Geonyoung Lee et al. | 2506.02858 | - |
| 2025-06-03 | On the influence of language similarity in non-target speaker verification trials | Paul M. Reuter et al. | 2506.02777 | - |
| 2025-06-03 | Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Xiaoxue Gao et al. | 2506.02742 | - |
| 2025-06-03 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning | Ömer Tarik Özyilmaz et al. | 2506.02627 | - |
| 2025-06-03 | On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs | Kemal Altwlkany et al. | 2506.02545 | - |
| 2025-06-03 | DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds | Takuya Hasumi et al. | 2506.02499 | - |
| 2025-06-03 | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant | Yixuan Hou et al. | 2506.02457 | - |
| 2025-06-02 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR | Dimitrios Damianos et al. | 2505.24656 | - |
