Audio Processing - 2025-06
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-06-29 | You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties | Paige Tuttösí et.al. | 2506.23367 | translate | read | null |
| 2025-06-29 | The Florence Price Art Song Dataset and Piano Accompaniment Generator | Tao-Tao He et.al. | 2506.23130 | translate | read | null |
| 2025-06-29 | TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure | Qi He et.al. | 2506.23094 | translate | read | null |
| 2025-06-29 | Research on Comprehensive Classroom Evaluation System Based on Multiple AI Models | Cong Xie et.al. | 2506.23079 | translate | read | null |
| 2025-06-28 | Mind the Gap: Entity-Preserved Context-Aware ASR Structured Transcriptions | Duygu Altinok et.al. | 2506.22858 | translate | read | null |
| 2025-06-28 | Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization | Duygu Altinok et.al. | 2506.22846 | translate | read | null |
| 2025-06-28 | A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition | Shiyao Wang et.al. | 2506.22810 | translate | read | null |
| 2025-06-27 | Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR | Weiqing Wang et.al. | 2506.22646 | translate | read | null |
| 2025-06-27 | Cross-lingual Data Selection Using Clip-level Acoustic Similarity for Enhancing Low-resource Automatic Speech Recognition | Shunsuke Mitsumori et.al. | 2506.22194 | translate | read | null |
| 2025-06-27 | SAGE: Spliced-Audio Generated Data for Enhancing Foundational Models in Low-Resource Arabic-English Code-Switched Speech Recognition | Muhammad Umar Farooq et.al. | 2506.22143 | translate | read | null |
| 2025-06-27 | Evaluating Pointing Gestures for Target Selection in Human-Robot Collaboration | Noora Sassali et.al. | 2506.22116 | translate | read | null |
| 2025-06-27 | Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy | Bohan Li et.al. | 2506.22023 | translate | read | null |
| 2025-06-27 | Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit | Kartheek Kumar Reddy Nareddy et.al. | 2506.21990 | translate | read | null |
| 2025-06-26 | Exploring Adapter Design Tradeoffs for Low Resource Music Generation | Atharva Mehta et.al. | 2506.21298 | translate | read | null |
| 2025-06-26 | A Multi-Stage Framework for Multimodal Controllable Speech Synthesis | Rui Niu et.al. | 2506.20945 | translate | read | null |
| 2025-06-25 | Multimodal Representation Learning and Fusion | Qihang Jin et.al. | 2506.20494 | translate | read | null |
| 2025-06-25 | Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR | Aleš Pražák et.al. | 2506.20288 | translate | read | null |
| 2025-06-24 | Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR | Martin Ratajczak et.al. | 2506.19761 | translate | read | null |
| 2025-06-23 | A Fourier Explanation of AI-music Artifacts | Darius Afchar et.al. | 2506.19108 | translate | read | null |
| 2025-06-23 | Benchmarking Music Generation Models and Metrics via Human Preference Studies | Florian Grötschla et.al. | 2506.19085 | translate | read | null |
| 2025-06-23 | Let Your Video Listen to Your Music! | Xinyu Zhang et.al. | 2506.18881 | translate | read | null |
| 2025-06-24 | MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners | Fang-Duo Tsai et.al. | 2506.18729 | translate | read | link |
| 2025-06-23 | Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition | Christian Huber et.al. | 2506.18703 | translate | read | null |
| 2025-06-23 | Evaluating Multichannel Speech Enhancement Algorithms at the Phoneme Scale Across Genders | Nasser-Eddine Monir et.al. | 2506.18691 | translate | read | null |
| 2025-06-23 | End-to-End Spoken Grammatical Error Correction | Mengjie Qian et.al. | 2506.18532 | translate | read | null |
| 2025-06-23 | AI-Generated Song Detection via Lyrics Transcripts | Markus Frohmann et.al. | 2506.18488 | translate | read | null |
| 2025-06-23 | Selecting N-lowest scores for training MOS prediction models | Yuto Kondo et.al. | 2506.18326 | translate | read | null |
| 2025-06-23 | Large-Scale Training Data Attribution for Music Generative Models via Unlearning | Woosung Choi et.al. | 2506.18312 | translate | read | null |
| 2025-06-23 | Rethinking Mean Opinion Scores in Speech Quality Assessment: Aggregation through Quantized Distribution Fitting | Yuto Kondo et.al. | 2506.18307 | translate | read | null |
| 2025-06-23 | JIS: A Speech Corpus of Japanese Idol Speakers with Various Speaking Styles | Yuto Kondo et.al. | 2506.18296 | translate | read | null |
| 2025-06-20 | Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 | Dominik Macháček et.al. | 2506.17077 | translate | read | null |
| 2025-06-20 | Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning | Giuseppe Attanasio et.al. | 2506.17019 | translate | read | null |
| 2025-06-20 | State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition | Aref Farhadipour et.al. | 2506.16969 | translate | read | null |
| 2025-06-20 | Hybrid-Sep: Language-queried audio source separation via pre-trained Model Fusion and Adversarial Diffusion Training | Jianyuan Feng et.al. | 2506.16833 | translate | read | null |
| 2025-06-20 | RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching | Hyun Joon Park et.al. | 2506.16741 | translate | read | link |
| 2025-06-20 | LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization | Daejin Jo et.al. | 2506.16738 | translate | read | null |
| 2025-06-20 | V-CASS: Vision-context-aware Expressive Speech Synthesis for Enhancing User Understanding of Videos | Qixin Wang et.al. | 2506.16716 | translate | read | null |
| 2025-06-19 | Weight Factorization and Centralization for Continual Learning in Speech Recognition | Enes Yavuz Ugan et.al. | 2506.16574 | translate | read | null |
| 2025-06-19 | Automatic Speech Recognition Biases in Newcastle English: an Error Analysis | Dana Serditova et.al. | 2506.16558 | translate | read | null |
| 2025-06-19 | InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems | Kexin Huang et.al. | 2506.16381 | translate | read | link |
| 2025-06-18 | Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models | Teysir Baoueb et.al. | 2506.15530 | translate | read | null |
| 2025-06-18 | Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper | Jaza Syed et.al. | 2506.15514 | translate | read | link |
| 2025-06-18 | Foundation of Affective Computing and Interaction | Changzeng Fu et.al. | 2506.15497 | translate | read | null |
| 2025-06-18 | An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW | Prateek Mehta et.al. | 2506.15029 | translate | read | null |
| 2025-06-17 | A Comparative Evaluation of Deep Learning Models for Speech Enhancement in Real-World Noisy Environments | Md Jahangir Alam Khondkar et.al. | 2506.15000 | translate | read | link |
| 2025-06-17 | Thinking in Directivity: Speech Large Language Model for Multi-Talker Directional Speech Recognition | Jiamin Xie et.al. | 2506.14973 | translate | read | null |
| 2025-06-17 | Unifying Streaming and Non-streaming Zipformer-based ASR | Bidisha Sharma et.al. | 2506.14434 | translate | read | null |
| 2025-06-17 | Investigation of Zero-shot Text-to-Speech Models for Enhancing Short-Utterance Speaker Verification | Yiyang Zhao et.al. | 2506.14226 | translate | read | null |
| 2025-06-17 | Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios | Aswin Shanmugam Subramanian et.al. | 2506.14204 | translate | read | null |
| 2025-06-17 | AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR | Tuan Nguyen et.al. | 2506.14190 | translate | read | null |
| 2025-06-17 | Pushing the Performance of Synthetic Speech Detection with Kolmogorov-Arnold Networks and Self-Supervised Learning Models | Tuan Dat Phuong et.al. | 2506.14153 | translate | read | null |
| 2025-06-16 | Qwen vs. Gemma Integration with Whisper: A Comparative Study in Multilingual SpeechLLM Systems | Tuan Nguyen et.al. | 2506.13596 | translate | read | null |
| 2025-06-16 | From Flat to Feeling: A Feasibility and Impact Study on Dynamic Facial Emotions in AI-Generated Avatars | Pegah Salehi et.al. | 2506.13477 | translate | read | null |
| 2025-06-16 | BUT System for the MLC-SLM Challenge | Alexander Polok et.al. | 2506.13414 | translate | read | link |
| 2025-06-16 | Bi-directional Context-Enhanced Speech Large Language Models for Multilingual Conversational ASR | Yizhou Peng et.al. | 2506.13396 | translate | read | null |
| 2025-06-16 | NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025 | Yizhou Peng et.al. | 2506.13339 | translate | read | null |
| 2025-06-16 | Seewo’s Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models | Bo Li et.al. | 2506.13300 | translate | read | null |
| 2025-06-16 | Personalizable Long-Context Symbolic Music Infilling with MIDI-RWKV | Christian Zhou-Zheng et.al. | 2506.13001 | translate | read | link |
| 2025-06-15 | SC-SOT: Conditioning the Decoder on Diarized Speaker Information for End-to-End Overlapped Speech Recognition | Yuta Hirano et.al. | 2506.12672 | translate | read | null |
| 2025-06-14 | Video-Guided Text-to-Music Generation Using Public Domain Movie Collections | Haven Kim et.al. | 2506.12573 | translate | read | null |
| 2025-06-14 | Mitigating Non-Target Speaker Bias in Guided Speaker Embedding | Shota Horiguchi et.al. | 2506.12500 | translate | read | null |
| 2025-06-13 | Enabling automatic transcription of child-centered audio recordings from real-world environments | Daniil Kocharov et.al. | 2506.11747 | translate | read | null |
| 2025-06-13 | Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform | Xiangzhu Kong et.al. | 2506.11630 | translate | read | null |
| 2025-06-13 | (SimPhon Speech Test): A Data-Driven Method for In Silico Design and Validation of a Phonetically Balanced Speech Test | Stefan Bleeck et.al. | 2506.11620 | translate | read | null |
| 2025-06-13 | Machine Unlearning for Robust DNNs: Attribution-Guided Partitioning and Neuron Pruning in Noisy Environments | Deliang Jin et.al. | 2506.11615 | translate | read | null |
| 2025-06-12 | Advances in Small-Footprint Keyword Spotting: A Comprehensive Review of Efficient Models and Algorithms | Soumen Garai et.al. | 2506.11169 | translate | read | null |
| 2025-06-12 | Improving Named Entity Transcription with Contextual LLM-based Revision | Viet Anh Trinh et.al. | 2506.10779 | translate | read | null |
| 2025-06-12 | BNMusic: Blending Environmental Noises into Personalized Music | Chi Zuo et.al. | 2506.10754 | translate | read | null |
| 2025-06-12 | FairASR: Fair Audio Contrastive Learning for Automatic Speech Recognition | Jongsuk Kim et.al. | 2506.10747 | translate | read | null |
| 2025-06-12 | Joint ASR and Speaker Role Tagging with Serialized Output Training | Anfeng Xu et.al. | 2506.10349 | translate | read | null |
| 2025-06-12 | RT-VC: Real-Time Zero-Shot Voice Conversion with Speech Articulatory Coding | Yisi Liu et.al. | 2506.10289 | translate | read | null |
| 2025-06-11 | Fine-Grained control over Music Generation with Activation Steering | Dipanshu Panda et.al. | 2506.10225 | translate | read | null |
| 2025-06-11 | UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching | Neta Glazer et.al. | 2506.09874 | translate | read | null |
| 2025-06-11 | Regularizing Learnable Feature Extraction for Automatic Speech Recognition | Peter Vieting et.al. | 2506.09804 | translate | read | null |
| 2025-06-11 | Training-Free Voice Conversion with Factorized Optimal Transport | Alexander Lobashev et.al. | 2506.09709 | translate | read | link |
| 2025-06-11 | You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks | Ünal Ege Gaznepoglu et.al. | 2506.09521 | translate | read | null |
| 2025-06-11 | OWSM-Biasing: Contextualizing Open Whisper-Style Speech Models for Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et.al. | 2506.09448 | translate | read | null |
| 2025-06-11 | CoLMbo: Speaker Language Model for Descriptive Profiling | Massa Baali et.al. | 2506.09375 | translate | read | null |
| 2025-06-11 | OmniDRCA: Parallel Speech-Text Foundation Model via Dual-Resolution Speech Representations and Contrastive Alignment | Chao-Hong Tan et.al. | 2506.09349 | translate | read | null |
| 2025-06-10 | SimClass: A Classroom Speech Dataset Generated via Game Engine Simulation For Automatic Speech Recognition Research | Ahmed Adel Attia et.al. | 2506.09206 | translate | read | null |
| 2025-06-10 | FROST-EMA: Finnish and Russian Oral Speech Dataset of Electromagnetic Articulography Measurements with L1, L2 and Imitated L2 Accents | Satu Hopponen et.al. | 2506.08981 | translate | read | null |
| 2025-06-10 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | Ailin Huang et.al. | 2506.08967 | translate | read | null |
| 2025-06-09 | Uncovering the Functional Roles of Nonlinearity in Memory | Manuel Brenner et.al. | 2506.07919 | translate | read | null |
| 2025-06-09 | Unified Semi-Supervised Pipeline for Automatic Speech Recognition | Nune Tadevosyan et.al. | 2506.07659 | translate | read | null |
| 2025-06-09 | Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation | Rui Hu et.al. | 2506.07646 | translate | read | null |
| 2025-06-09 | SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement | Chenyu Yang et.al. | 2506.07634 | translate | read | link |
| 2025-06-09 | Bayesian Learning for Domain-Invariant Speaker Verification and Anti-Spoofing | Jin Li et.al. | 2506.07536 | translate | read | null |
| 2025-06-09 | LeVo: High-Quality Song Generation with Multi-Preference Alignment | Shun Lei et.al. | 2506.07520 | translate | read | link |
| 2025-06-09 | Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition | Asahi Sakuma et.al. | 2506.07515 | translate | read | null |
| 2025-06-09 | DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction | Solee Im et.al. | 2506.07510 | translate | read | null |
| 2025-06-09 | Towards Energy-Efficient and Low-Latency Voice-Controlled Smart Homes: A Proposal for Offline Speech Recognition and IoT Integration | Peng Huang et.al. | 2506.07494 | translate | read | null |
| 2025-06-08 | Speech Recognition on TV Series with Video-guided Post-Correction | Haoyuan Yang et.al. | 2506.07323 | translate | read | null |
| 2025-06-06 | Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems | Bo Ren et.al. | 2506.06252 | translate | read | null |
| 2025-06-06 | Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction | Christophe Van Gysel et.al. | 2506.06117 | translate | read | null |
| 2025-06-06 | CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition | Yun-Shao Tsai et.al. | 2506.06071 | translate | read | null |
| 2025-06-06 | Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models | Yuke Lin et.al. | 2506.05796 | translate | read | null |
| 2025-06-06 | Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition | Mu Yang et.al. | 2506.05706 | translate | read | null |
| 2025-06-06 | Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning | Yangui Fang et.al. | 2506.05671 | translate | read | null |
| 2025-06-05 | Improving AI-generated music with user-guided training | Vishwa Mohan Singh et.al. | 2506.04852 | translate | read | null |
| 2025-06-05 | LLM-based phoneme-to-grapheme for phoneme-based speech recognition | Te Ma et.al. | 2506.04711 | translate | read | null |
| 2025-06-05 | ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition | Thai-Binh Nguyen et.al. | 2506.04635 | translate | read | null |
| 2025-06-05 | LESS: Large Language Model Enhanced Semi-Supervised Learning for Speech Foundational Models | Wen Ding et.al. | 2506.04586 | translate | read | null |
| 2025-06-04 | French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement | Thomas Joubaud et.al. | 2506.04495 | translate | read | null |
| 2025-06-04 | Effects of Speaker Count, Duration, and Accent Diversity on Zero-Shot Accent Robustness in Low-Resource ASR | Zheng-Xin Yong et.al. | 2506.04364 | translate | read | null |
| 2025-06-04 | HiFiTTS-2: A Large-Scale High Bandwidth Speech Dataset | Ryan Langman et.al. | 2506.04152 | translate | read | null |
| 2025-06-04 | A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions | Chung-Chun Wang et.al. | 2506.04077 | translate | read | null |
| 2025-06-04 | Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion | Seymanur Akti et.al. | 2506.04013 | translate | read | null |
| 2025-06-04 | MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition | Yinfeng Xia et.al. | 2506.03722 | translate | read | null |
| 2025-06-04 | Comparative Analysis of Fast and High-Fidelity Neural Vocoders for Low-Latency Streaming Synthesis in Resource-Constrained Environments | Reo Yoneyama et.al. | 2506.03554 | translate | read | null |
| 2025-06-04 | Local Equivariance Error-Based Metrics for Evaluating Sampling-Frequency-Independent Property of Neural Network | Kanami Imamura et.al. | 2506.03550 | translate | read | null |
| 2025-06-03 | Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation | Yongqi Wang et.al. | 2506.02997 | translate | read | null |
| 2025-06-03 | A Multi-Dialectal Dataset for German Dialect ASR and Dialect-to-Standard Speech Translation | Verena Blaschke et.al. | 2506.02894 | translate | read | link |
| 2025-06-03 | CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech | Helin Wang et.al. | 2506.02863 | translate | read | link |
| 2025-06-05 | DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization | Geonyoung Lee et.al. | 2506.02858 | translate | read | null |
| 2025-06-03 | On the influence of language similarity in non-target speaker verification trials | Paul M. Reuter et.al. | 2506.02777 | translate | read | null |
| 2025-06-03 | Prompt-Unseen-Emotion: Zero-shot Expressive Speech Synthesis with Prompt-LLM Contextual Knowledge for Mixed Emotions | Xiaoxue Gao et.al. | 2506.02742 | translate | read | null |
| 2025-06-03 | Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning | Ömer Tarik Özyilmaz et.al. | 2506.02627 | translate | read | null |
| 2025-06-03 | On the Language and Gender Biases in PSTN, VoIP and Neural Audio Codecs | Kemal Altwlkany et.al. | 2506.02545 | translate | read | null |
| 2025-06-03 | DnR-nonverbal: Cinematic Audio Source Separation Dataset Containing Non-Verbal Sounds | Takuya Hasumi et.al. | 2506.02499 | translate | read | null |
| 2025-06-03 | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant | Yixuan Hou et.al. | 2506.02457 | translate | read | null |
| 2025-06-02 | MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR | Dimitrios Damianos et.al. | 2505.24656 | translate | read | null |