Audio Processing - 2025-07
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-07-23 | AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer | Danny D. Leybzon et.al. | 2507.17718 | translate | read | null |
| 2025-07-23 | Synthetic Voice Data for Automatic Speech Recognition in African Languages | Brian DeRenzi et.al. | 2507.17578 | translate | read | null |
| 2025-07-23 | BoSS: Beyond-Semantic Speech | Qing Wang et.al. | 2507.17563 | translate | read | null |
| 2025-07-23 | Clustering-based hard negative sampling for supervised contrastive speaker verification | Piotr Masztalski et.al. | 2507.17540 | translate | read | null |
| 2025-07-23 | Application of Whisper in Clinical Practice: the Post-Stroke Speech Assessment during a Naming Task | Milena Davudova et.al. | 2507.17326 | translate | read | null |
| 2025-07-23 | On Temporal Guidance and Iterative Refinement in Audio Source Separation | Tobias Morocutti et.al. | 2507.17297 | translate | read | null |
| 2025-07-23 | Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge | Miaomiao Gao et.al. | 2507.17288 | translate | read | null |
| 2025-07-22 | SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling | Yi Guo et.al. | 2507.16884 | translate | read | null |
| 2025-07-22 | Step-Audio 2 Technical Report | Boyong Wu et.al. | 2507.16632 | translate | read | link |
| 2025-07-22 | An approach to measuring the performance of Automatic Speech Recognition (ASR) models in the context of Large Language Model (LLM) powered applications | Sujith Pulikodan et.al. | 2507.16456 | translate | read | null |
| 2025-07-21 | Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks | Ziqiao Yu et.al. | 2507.16043 | translate | read | null |
| 2025-07-21 | Mixture to Beamformed Mixture: Leveraging Beamformed Mixture as Weak-Supervision for Speech Enhancement and Noise-Robust ASR | Zhong-Qiu Wang et.al. | 2507.15229 | translate | read | null |
| 2025-07-21 | EchoVoices: Preserving Generational Voices and Memories for Seniors and Children | Haiying Xu et.al. | 2507.15221 | translate | read | null |
| 2025-07-21 | Exploiting Context-dependent Duration Features for Voice Anonymization Attack Systems | Natalia Tomashenko et.al. | 2507.15214 | translate | read | null |
| 2025-07-20 | DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis | Yinghao Aaron Li et.al. | 2507.14988 | translate | read | link |
| 2025-07-19 | Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion | Yu Zhang et.al. | 2507.14534 | translate | read | link |
| 2025-07-19 | Adapting Whisper for Lightweight and Efficient Automatic Speech Recognition of Children for On-device Edge Applications | Satwik Dutta et.al. | 2507.14451 | translate | read | link |
| 2025-07-18 | Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic | Lilit Grigoryan et.al. | 2507.13977 | translate | read | null |
| 2025-07-18 | Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies | Carlos Mena et.al. | 2507.13875 | translate | read | null |
| 2025-07-17 | A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models | Kirill Borodin et.al. | 2507.13563 | translate | read | link |
| 2025-07-17 | Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder | Feng Chen et.al. | 2507.13551 | translate | read | null |
| 2025-07-18 | Automatically assessing oral narratives of Afrikaans and isiXhosa children | Retief Louw et.al. | 2507.13205 | translate | read | null |
| 2025-07-17 | SHIELD: A Secure and Highly Enhanced Integrated Learning for Robust Deepfake Detection against Adversarial Attacks | Kutub Uddin et.al. | 2507.13170 | translate | read | null |
| 2025-07-17 | NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech | Maksim Borisov et.al. | 2507.13155 | translate | read | null |
| 2025-07-17 | UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets | Zhichao Sheng et.al. | 2507.12951 | translate | read | null |
| 2025-07-17 | Enkidu: Universal Frequential Perturbation for Real-Time Audio Privacy Protection against Voice Deepfakes | Zhou Feng et.al. | 2507.12932 | translate | read | null |
| 2025-07-17 | AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation | Potsawee Manakul et.al. | 2507.12705 | translate | read | null |
| 2025-07-17 | Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine | Anastasia Kuznetsova et.al. | 2507.12701 | translate | read | null |
| 2025-07-16 | Improving Contextual ASR via Multi-grained Fusion with Large Language Models | Shilin Zhou et.al. | 2507.12252 | translate | read | null |
| 2025-07-16 | EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis | Haoxun Li et.al. | 2507.12015 | translate | read | null |
| 2025-07-15 | Towards Scalable AASIST: Refining Graph Attention for Speech Deepfake Detection | Ivan Viakhirev et.al. | 2507.11777 | translate | read | link |
| 2025-07-15 | FasTUSS: Faster Task-Aware Unified Source Separation | Francesco Paissan et.al. | 2507.11435 | translate | read | null |
| 2025-07-15 | Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models | Paul A. Bereuter et.al. | 2507.11427 | translate | read | null |
| 2025-07-14 | WhisperKit: On-device Real-time ASR with Billion-Scale Transformers | Atila Orhon et.al. | 2507.10860 | translate | read | null |
| 2025-07-14 | Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition | Mengzhe Geng et.al. | 2507.10827 | translate | read | null |
| 2025-07-14 | WildFX: A DAW-Powered Pipeline for In-the-Wild Audio FX Graph Modeling | Qihui Yang et.al. | 2507.10534 | translate | read | null |
| 2025-07-14 | DQLoRA: A Lightweight Domain-Aware Denoising ASR via Adapter-guided Distillation | Yiru Yang et.al. | 2507.10313 | translate | read | null |
| 2025-07-13 | The DKU System for Multi-Speaker Automatic Speech Recognition in MLC-SLM Challenge | Yuke Lin et.al. | 2507.09499 | translate | read | null |
| 2025-07-12 | Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Dominika Woszczyk et.al. | 2507.09310 | translate | read | null |
| 2025-07-12 | Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? | Shota Horiguchi et.al. | 2507.09226 | translate | read | null |
| 2025-07-15 | Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition | Bingshen Mu et.al. | 2507.09116 | translate | read | null |
| 2025-07-11 | SemAlignVC: Enhancing zero-shot timbre conversion using semantic alignment | Shivam Mehta et.al. | 2507.09070 | translate | read | null |
| 2025-07-11 | The Impact of Automatic Speech Transcription on Speaker Attribution | Cristina Aggazzotti et.al. | 2507.08660 | translate | read | null |
| 2025-07-11 | Unlocking Speech Instruction Data Potential with Query Rewriting | Yonghua Hei et.al. | 2507.08603 | translate | read | null |
| 2025-07-11 | ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition | Qingliang Meng et.al. | 2507.08477 | translate | read | null |
| 2025-07-11 | Active Learning for Text-to-Speech Synthesis with Informative Sample Collection | Kentaro Seki et.al. | 2507.08319 | translate | read | null |
| 2025-07-11 | RawTFNet: A Lightweight CNN Architecture for Speech Anti-spoofing | Yang Xiao et.al. | 2507.08227 | translate | read | null |
| 2025-07-10 | DARAS: Dynamic Audio-Room Acoustic Synthesis for Blind Room Impulse Response Estimation | Chunxi Wang et.al. | 2507.08135 | translate | read | null |
| 2025-07-10 | Modèle physique variationnel pour l’estimation de réponses impulsionnelles de salles | Louis Lalay et.al. | 2507.08051 | translate | read | null |
| 2025-07-10 | Edge-ASR: Towards Low-Bit Quantization of Automatic Speech Recognition Models | Chen Feng et.al. | 2507.07877 | translate | read | null |
| 2025-07-10 | SecureSpeech: Prompt-based Speaker and Content Protection | Belinda Soh Hui Hui et.al. | 2507.07799 | translate | read | null |
| 2025-07-10 | Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review | Maha Tufail Agro et.al. | 2507.07741 | translate | read | null |
| 2025-07-08 | Deep Feed-Forward Neural Network for Bangla Isolated Speech Recognition | Dipayan Bhadra et.al. | 2507.07068 | translate | read | null |
| 2025-07-09 | Speech Tokenizer is Key to Consistent Representation | Wonjin Jung et.al. | 2507.06802 | translate | read | null |
| 2025-07-09 | Exploring State-Space-Model based Language Model in Music Generation | Wei-Jaw Lee et.al. | 2507.06674 | translate | read | null |
| 2025-07-09 | Learning Japanese with Jouzu: Interaction Outcomes with Stylized Dialogue Fictional Agents | Zackary Rackauckas et.al. | 2507.06483 | translate | read | null |
| 2025-07-08 | Speech Quality Assessment Model Based on Mixture of Experts: System-Level Performance Enhancement and Utterance-Level Challenge Analysis | Xintong Hu et.al. | 2507.06116 | translate | read | null |
| 2025-07-08 | VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis | Alexandre Symeonidis-Herzig et.al. | 2507.06060 | translate | read | null |
| 2025-07-08 | MusiScene: Leveraging MU-LLaMA for Scene Imagination and Enhanced Video Background Music Generation | Fathinah Izzati et.al. | 2507.05894 | translate | read | null |
| 2025-07-08 | How to Evaluate Automatic Speech Recognition: Comparing Different Performance and Bias Measures | Tanvina Patel et.al. | 2507.05885 | translate | read | null |
| 2025-07-08 | ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark | He Wang et.al. | 2507.05727 | translate | read | null |
| 2025-07-08 | Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition | Zijin Gu et.al. | 2507.05724 | translate | read | null |
| 2025-07-07 | EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation | Fathinah Izzati et.al. | 2507.04955 | translate | read | null |
| 2025-07-07 | Adaptive Slimming for Scalable and Efficient Speech Enhancement | Riccardo Miccini et.al. | 2507.04879 | translate | read | null |
| 2025-07-07 | Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters | Mathilde Abrassart et.al. | 2507.04817 | translate | read | null |
| 2025-07-07 | Multi-Step Prediction and Control of Hierarchical Emotion Distribution in Text-to-Speech Synthesis | Sho Inoue et.al. | 2507.04598 | translate | read | null |
| 2025-07-06 | TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet | Jaeseok Jeong et.al. | 2507.04349 | translate | read | null |
| 2025-07-05 | Prosody Labeling with Phoneme-BERT and Speech Foundation Models | Tomoki Koriyama et.al. | 2507.03912 | translate | read | null |
| 2025-07-04 | Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion | Lea Fischbach et.al. | 2507.03641 | translate | read | null |
| 2025-07-04 | MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI | Roser Batlle-Roca et.al. | 2507.03599 | translate | read | null |
| 2025-07-08 | SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge | Yuxiang Mei et.al. | 2507.03343 | translate | read | null |
| 2025-07-03 | DeepGesture: A conversational gesture synthesis system based on emotions and semantics | Thanh Hoang-Minh et.al. | 2507.03147 | translate | read | null |
| 2025-07-03 | Multi-agent Auditory Scene Analysis | Caleb Rascon et.al. | 2507.02755 | translate | read | null |
| 2025-07-03 | Open-Source System for Multilingual Translation and Cloned Speech Synthesis | Mateo Cámara et.al. | 2507.02530 | translate | read | null |
| 2025-07-03 | A Cookbook for Community-driven Data Collection of Impaired Speech in LowResource Languages | Sumaya Ahmed Salihs et.al. | 2507.02428 | translate | read | null |
| 2025-07-03 | Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability | Mark Atta Mensah et.al. | 2507.02407 | translate | read | null |
| 2025-07-02 | Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis | Marc-André Carbonneau et.al. | 2507.02176 | translate | read | null |
| 2025-07-02 | Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams | Zirui Li et.al. | 2507.02115 | translate | read | null |
| 2025-07-02 | Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla | Md Sazzadul Islam Ridoy et.al. | 2507.01931 | translate | read | null |
| 2025-07-02 | First Steps Towards Voice Anonymization for Code-Switching Speech | Sarina Meyer et.al. | 2507.01765 | translate | read | null |
| 2025-07-02 | PERTINENCE: Input-based Opportunistic Neural Network Dynamic Execution | Omkar Shende et.al. | 2507.01695 | translate | read | null |
| 2025-07-02 | Voice Conversion for Likability Control via Automated Rating of Speech Synthesis Corpora | Hitoshi Suda et.al. | 2507.01356 | translate | read | null |
| 2025-07-02 | Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation | Andrei Jelea et.al. | 2507.01347 | translate | read | null |
| 2025-07-02 | AI Meets Maritime Training: Precision Analytics for Enhanced Safety and Performance | Vishakha Lall et.al. | 2507.01274 | translate | read | null |
| 2025-07-01 | MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement | Nikolai Lund Kühne et.al. | 2507.00966 | translate | read | null |
| 2025-07-02 | Multi-interaction TTS toward professional recording reproduction | Hiroki Kanagawa et.al. | 2507.00808 | translate | read | null |
| 2025-07-01 | Rectifying Magnitude Neglect in Linear Attention | Qihang Fan et.al. | 2507.00698 | translate | read | null |
| 2025-07-01 | Audio-3DVG: Unified Audio - Point Cloud Fusion for 3D Visual Grounding | Duc Cao-Dinh et.al. | 2507.00669 | translate | read | null |