Audio Processing - 2025-07

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-07-23 | AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer | Danny D. Leybzon et al. | 2507.17718 | translate | read | null |
| 2025-07-23 | Synthetic Voice Data for Automatic Speech Recognition in African Languages | Brian DeRenzi et al. | 2507.17578 | translate | read | null |
| 2025-07-23 | BoSS: Beyond-Semantic Speech | Qing Wang et al. | 2507.17563 | translate | read | null |
| 2025-07-23 | Clustering-based hard negative sampling for supervised contrastive speaker verification | Piotr Masztalski et al. | 2507.17540 | translate | read | null |
| 2025-07-23 | Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task | Milena Davudova et al. | 2507.17326 | translate | read | null |
| 2025-07-23 | On Temporal Guidance and Iterative Refinement in Audio Source Separation | Tobias Morocutti et al. | 2507.17297 | translate | read | null |
| 2025-07-23 | Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge | Miaomiao Gao et al. | 2507.17288 | translate | read | null |
| 2025-07-22 | SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling | Yi Guo et al. | 2507.16884 | translate | read | null |
| 2025-07-22 | Step-Audio 2 Technical Report | Boyong Wu et al. | 2507.16632 | translate | read | link |
| 2025-07-22 | An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications | Sujith Pulikodan et al. | 2507.16456 | translate | read | null |
| 2025-07-21 | Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks | Ziqiao Yu et al. | 2507.16043 | translate | read | null |
| 2025-07-21 | Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR | Zhong-Qiu Wang et al. | 2507.15229 | translate | read | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et al. | 2507.15221 | translate | read | null |
| 2025-07-21 | Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems | Natalia Tomashenko et al. | 2507.15214 | translate | read | null |
| 2025-07-20 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis | Yinghao Aaron Li et al. | 2507.14988 | translate | read | link |
| 2025-07-19 | Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion | Yu Zhang et al. | 2507.14534 | translate | read | link |
| 2025-07-19 | Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications | Satwik Dutta et al. | 2507.14451 | translate | read | link |
| 2025-07-18 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic | Lilit Grigoryan et al. | 2507.13977 | translate | read | null |
| 2025-07-18 | Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies | Carlos Mena et al. | 2507.13875 | translate | read | null |
| 2025-07-17 | A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models | Kirill Borodin et al. | 2507.13563 | translate | read | link |
| 2025-07-17 | Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder | Feng Chen et al. | 2507.13551 | translate | read | null |
| 2025-07-18 | Automatically assessing oral narratives of Afrikaans and isiXhosa children | Retief Louw et al. | 2507.13205 | translate | read | null |
| 2025-07-17 | SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks | Kutub Uddin et al. | 2507.13170 | translate | read | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et al. | 2507.13155 | translate | read | null |
| 2025-07-17 | UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets | Zhichao Sheng et al. | 2507.12951 | translate | read | null |
| 2025-07-17 | Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes | Zhou Feng et al. | 2507.12932 | translate | read | null |
| 2025-07-17 | AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation | Potsawee Manakul et al. | 2507.12705 | translate | read | null |
| 2025-07-17 | Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine | Anastasia Kuznetsova et al. | 2507.12701 | translate | read | null |
| 2025-07-16 | Improving Contextual ASR via Multi-grained Fusion with Large Language Models | Shilin Zhou et al. | 2507.12252 | translate | read | null |
| 2025-07-16 | EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis | Haoxun Li et al. | 2507.12015 | translate | read | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et al. | 2507.11777 | translate | read | link |
| 2025-07-15 | FasTUSS: Faster Task-Aware Unified Source Separation | Francesco Paissan et al. | 2507.11435 | translate | read | null |
| 2025-07-15 | Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models | Paul A. Bereuter et al. | 2507.11427 | translate | read | null |
| 2025-07-14 | WhisperKit: On-device Real-time ASR with Billion-Scale Transformers | Atila Orhon et al. | 2507.10860 | translate | read | null |
| 2025-07-14 | Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition | Mengzhe Geng et al. | 2507.10827 | translate | read | null |
| 2025-07-14 | WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling | Qihui Yang et al. | 2507.10534 | translate | read | null |
| 2025-07-14 | DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation | Yiru Yang et al. | 2507.10313 | translate | read | null |
| 2025-07-13 | The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge | Yuke Lin et al. | 2507.09499 | translate | read | null |
| 2025-07-12 | Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Dominika Woszczyk et al. | 2507.09310 | translate | read | null |
| 2025-07-12 | Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? | Shota Horiguchi et al. | 2507.09226 | translate | read | null |
| 2025-07-15 | Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition | Bingshen Mu et al. | 2507.09116 | translate | read | null |
| 2025-07-11 | SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment | Shivam Mehta et al. | 2507.09070 | translate | read | null |
| 2025-07-11 | The Impact of Automatic Speech Transcription on Speaker Attribution | Cristina Aggazzotti et al. | 2507.08660 | translate | read | null |
| 2025-07-11 | Unlocking Speech Instruction Data Potential with Query Rewriting | Yonghua Hei et al. | 2507.08603 | translate | read | null |
| 2025-07-11 | ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition | Qingliang Meng et al. | 2507.08477 | translate | read | null |
| 2025-07-11 | Active Learning for Text-to-Speech Synthesis with Informative Sample Collection | Kentaro Seki et al. | 2507.08319 | translate | read | null |
| 2025-07-11 | RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing | Yang Xiao et al. | 2507.08227 | translate | read | null |
| 2025-07-10 | DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation | Chunxi Wang et al. | 2507.08135 | translate | read | null |
| 2025-07-10 | Modèle physique variationnel pour l’estimation de réponses impulsionnelles de salles | Louis Lalay et al. | 2507.08051 | translate | read | null |
| 2025-07-10 | Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models | Chen Feng et al. | 2507.07877 | translate | read | null |
| 2025-07-10 | SecureSpeech: Prompt-based Speaker and Content Protection | Belinda Soh Hui Hui et al. | 2507.07799 | translate | read | null |
| 2025-07-10 | Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review | Maha Tufail Agro et al. | 2507.07741 | translate | read | null |
| 2025-07-08 | Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition | Dipayan Bhadra et al. | 2507.07068 | translate | read | null |
| 2025-07-09 | Speech Tokenizer is Key to Consistent Representation | Wonjin Jung et al. | 2507.06802 | translate | read | null |
| 2025-07-09 | Exploring State-Space-Model based Language Model in Music Generation | Wei-Jaw Lee et al. | 2507.06674 | translate | read | null |
| 2025-07-09 | Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents | Zackary Rackauckas et al. | 2507.06483 | translate | read | null |
| 2025-07-08 | Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis | Xintong Hu et al. | 2507.06116 | translate | read | null |
| 2025-07-08 | VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis | Alexandre Symeonidis-Herzig et al. | 2507.06060 | translate | read | null |
| 2025-07-08 | MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation | Fathinah Izzati et al. | 2507.05894 | translate | read | null |
| 2025-07-08 | How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures | Tanvina Patel et al. | 2507.05885 | translate | read | null |
| 2025-07-08 | ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark | He Wang et al. | 2507.05727 | translate | read | null |
| 2025-07-08 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Zijin Gu et al. | 2507.05724 | translate | read | null |
| 2025-07-07 | EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation | Fathinah Izzati et al. | 2507.04955 | translate | read | null |
| 2025-07-07 | Adaptive Slimming for Scalable and Efficient Speech Enhancement | Riccardo Miccini et al. | 2507.04879 | translate | read | null |
| 2025-07-07 | Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters | Mathilde Abrassart et al. | 2507.04817 | translate | read | null |
| 2025-07-07 | Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis | Sho Inoue et al. | 2507.04598 | translate | read | null |
| 2025-07-06 | TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet | Jaeseok Jeong et al. | 2507.04349 | translate | read | null |
| 2025-07-05 | Prosody Labeling with Phoneme-BERT and Speech Foundation Models | Tomoki Koriyama et al. | 2507.03912 | translate | read | null |
| 2025-07-04 | Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion | Lea Fischbach et al. | 2507.03641 | translate | read | null |
| 2025-07-04 | MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI | Roser Batlle-Roca et al. | 2507.03599 | translate | read | null |
| 2025-07-08 | SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge | Yuxiang Mei et al. | 2507.03343 | translate | read | null |
| 2025-07-03 | DeepGesture: A conversational gesture synthesis system based on emotions and semantics | Thanh Hoang-Minh et al. | 2507.03147 | translate | read | null |
| 2025-07-03 | Multi-agent Auditory Scene Analysis | Caleb Rascon et al. | 2507.02755 | translate | read | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et al. | 2507.02530 | translate | read | null |
| 2025-07-03 | A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages | Sumaya Ahmed Salihs et al. | 2507.02428 | translate | read | null |
| 2025-07-03 | Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability | Mark Atta Mensah et al. | 2507.02407 | translate | read | null |
| 2025-07-02 | Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Marc-André Carbonneau et al. | 2507.02176 | translate | read | null |
| 2025-07-02 | Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams | Zirui Li et al. | 2507.02115 | translate | read | null |
| 2025-07-02 | Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla | Md Sazzadul Islam Ridoy et al. | 2507.01931 | translate | read | null |
| 2025-07-02 | First Steps Towards Voice Anonymization for Code-Switching Speech | Sarina Meyer et al. | 2507.01765 | translate | read | null |
| 2025-07-02 | PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution | Omkar Shende et al. | 2507.01695 | translate | read | null |
| 2025-07-02 | Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora | Hitoshi Suda et al. | 2507.01356 | translate | read | null |
| 2025-07-02 | Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation | Andrei Jelea et al. | 2507.01347 | translate | read | null |
| 2025-07-02 | AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance | Vishakha Lall et al. | 2507.01274 | translate | read | null |
| 2025-07-01 | MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement | Nikolai Lund Kühne et al. | 2507.00966 | translate | read | null |
| 2025-07-02 | Multi-interaction TTS toward professional recording reproduction | Hiroki Kanagawa et al. | 2507.00808 | translate | read | null |
| 2025-07-01 | Rectifying Magnitude Neglect in Linear Attention | Qihang Fan et al. | 2507.00698 | translate | read | null |
| 2025-07-01 | Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding | Duc Cao-Dinh et al. | 2507.00669 | translate | read | null |
