Audio Processing - 2025-10

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-10-31 | NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion | Zongyang Du et al. | 2511.00256 | translate | read | null |
| 2025-10-31 | Holographic equation of state matched with hadron gas equation as a tool for the study of the quark-gluon plasma evolution | A. V. Anufriev et al. | 2510.27541 | translate | read | null |
| 2025-10-31 | Referee: Reference-aware Audiovisual Deepfake Detection | Hyemin Boo et al. | 2510.27475 | translate | read | null |
| 2025-10-31 | Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation | Alireza Gharahighehi et al. | 2510.27342 | translate | read | null |
| 2025-10-31 | Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication | Deok-Seon Kim et al. | 2510.27247 | translate | read | null |
| 2025-10-31 | Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm | Anselm Lohmann et al. | 2510.27198 | translate | read | null |
| 2025-10-31 | Expressive Range Characterization of Open Text-to-Audio Models | Jonathan Morse et al. | 2510.27102 | translate | read | null |
| 2025-10-30 | Are Online Sports Fan Communities Becoming More Offensive? A Quantitative Review of Topics, Trends, and Toxicity of r/PremierLeague | Muhammad Zeeshan Mazhar et al. | 2510.27003 | translate | read | null |
| 2025-10-30 | Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations | Jean-Philippe Corbeil et al. | 2510.26974 | translate | read | null |
| 2025-10-29 | Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition | Amine Razig et al. | 2510.26838 | translate | read | null |
| 2025-10-29 | Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling | Jiarong Du et al. | 2510.26825 | translate | read | null |
| 2025-10-28 | Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features | Unzela Talpur et al. | 2510.26823 | translate | read | null |
| 2025-10-28 | See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement | Jinting Wang et al. | 2510.26819 | translate | read | null |
| 2025-10-28 | GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment | Jinting Wang et al. | 2510.26818 | translate | read | null |
| 2025-10-30 | HMM for short independent sequences: Multiple sequence Baum-Welch application | Margarita Cabrera-Bean et al. | 2510.26532 | translate | read | null |
| 2025-10-30 | UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens | Chengwei Liu et al. | 2510.26372 | translate | read | link |
| 2025-10-30 | Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages | Mérilin Sousa Silva et al. | 2510.26254 | translate | read | null |
| 2025-10-29 | Efficient Vocal Source Separation Through Windowed Sink Attention | Christodoulos Benetatos et al. | 2510.25745 | translate | read | null |
| 2025-10-29 | Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models | Harm Lameris et al. | 2510.25577 | translate | read | null |
| 2025-10-29 | Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation | Yuxiang Mao et al. | 2510.25234 | translate | read | null |
| 2025-10-27 | SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution | Dharma Teja Donepudi et al. | 2510.25178 | translate | read | null |
| 2025-10-29 | Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels | Keisuke Imoto et al. | 2510.25075 | translate | read | null |
| 2025-10-29 | Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech | Pedro Corrêa et al. | 2510.25054 | translate | read | null |
| 2025-10-28 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model | Chin-Jou Li et al. | 2510.24992 | translate | read | null |
| 2025-10-28 | The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems | Stefano Natangelo et al. | 2510.24831 | translate | read | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et al. | 2510.24821 | translate | read | link |
| 2025-10-28 | BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation | Raphaël Bagat et al. | 2510.24570 | translate | read | null |
| 2025-10-28 | Levée d’ambiguïtés par grammaires locales | Eric G. C. Laporte et al. | 2510.24530 | translate | read | null |
| 2025-10-28 | Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient | Rinku Sebastian et al. | 2510.24519 | translate | read | null |
| 2025-10-28 | Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes | Jonas Hein et al. | 2510.24332 | translate | read | null |
| 2025-10-28 | V-SAT: Video Subtitle Annotation Tool | Arpita Kundu et al. | 2510.24180 | translate | read | null |
| 2025-10-28 | RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects | Md. Rezuwan Hassan et al. | 2510.24096 | translate | read | null |
| 2025-10-27 | A Neural Model for Contextual Biasing Score Learning and Filtering | Wanting Huang et al. | 2510.23849 | translate | read | null |
| 2025-10-27 | Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders | Nathan Paek et al. | 2510.23802 | translate | read | null |
| 2025-10-27 | SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity | Hanke Xie et al. | 2510.23541 | translate | read | null |
| 2025-10-27 | LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization | Máté Gedeon et al. | 2510.23320 | translate | read | null |
| 2025-10-27 | Arabic Little STT: Arabic Children Speech Recognition Dataset | Mouhand Alkadri et al. | 2510.23319 | translate | read | null |
| 2025-10-27 | Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? | Tawsif Tashwar Dipto et al. | 2510.23252 | translate | read | null |
| 2025-10-27 | Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement | Sarabeth S. Mullins et al. | 2510.23141 | translate | read | null |
| 2025-10-27 | Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition | Jing-Xuan Zhang et al. | 2510.22961 | translate | read | null |
| 2025-10-26 | LRW-Persian: Lip-reading in the Wild Dataset for Persian Language | Zahra Taghizadeh et al. | 2510.22716 | translate | read | null |
| 2025-10-26 | Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs | Anand et al. | 2510.22603 | translate | read | link |
| 2025-10-26 | UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models | Wenming Tu et al. | 2510.22588 | translate | read | link |
| 2025-10-26 | A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus | Michael Scott et al. | 2510.22495 | translate | read | null |
| 2025-10-26 | The Tonogenesis Continuum in Tibetan: A Computational Investigation | Siyu Liang et al. | 2510.22485 | translate | read | null |
| 2025-10-25 | M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR | Ruixiang Mao et al. | 2510.22172 | translate | read | null |
| 2025-10-25 | Streaming Generation for Music Accompaniment | Yusong Wu et al. | 2510.22105 | translate | read | null |
| 2025-10-23 | GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer | Jackson Loth et al. | 2510.21872 | translate | read | null |
| 2025-10-24 | StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks | Jingyue Huang et al. | 2510.21685 | translate | read | null |
| 2025-10-23 | ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring | Ari Frummer et al. | 2510.21014 | translate | read | null |
| 2025-10-21 | Can large audio language models understand child stuttering speech? speech summarization, and source separation | Chibuzor Okocha et al. | 2510.20850 | translate | read | null |
| 2025-10-23 | R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion | Junjie Zheng et al. | 2510.20677 | translate | read | null |
| 2025-10-23 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding | Xin Zhang et al. | 2510.20504 | translate | read | link |
| 2025-10-23 | Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator | Hualei Wang et al. | 2510.20210 | translate | read | null |
| 2025-10-23 | SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance | Haowei Lou et al. | 2510.20113 | translate | read | null |
| 2025-10-22 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition | Yuu Jinnai et al. | 2510.19471 | translate | read | null |
| 2025-10-22 | FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems | Ziheng Deng et al. | 2510.19301 | translate | read | null |
| 2025-10-22 | Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges | Cheng Huang et al. | 2510.19144 | translate | read | null |
| 2025-10-21 | Steering Autoregressive Music Generation with Recursive Feature Machines | Daniel Zhao et al. | 2510.19127 | translate | read | link |
| 2025-10-21 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et al. | 2510.18938 | translate | read | null |
| 2025-10-21 | RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling | Mandip Goswami et al. | 2510.18917 | translate | read | link |
| 2025-10-21 | MLMA: Towards Multilingual ASR With Mamba-based Architectures | Mohamed Nabih Ali et al. | 2510.18684 | translate | read | null |
| 2025-10-21 | Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification | Bin Gu et al. | 2510.18533 | translate | read | null |
| 2025-10-21 | A Stage-Wise Learning Strategy with Fixed Anchors for Robust Speaker Verification | Bin Gu et al. | 2510.18530 | translate | read | null |
| 2025-10-20 | DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model | Massa Baali et al. | 2510.17662 | translate | read | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et al. | 2510.16718 | translate | read | null |
| 2025-10-19 | Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios | Shiyao Wang et al. | 2510.16700 | translate | read | null |
| 2025-10-18 | Hallucination Benchmark for Speech Foundation Models | Alkis Koudounas et al. | 2510.16567 | translate | read | null |
| 2025-10-18 | Interpreting the Dimensions of Speaker Embedding Space | Mark Huckvale et al. | 2510.16489 | translate | read | null |
| 2025-10-18 | Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment | Fu-An Chao et al. | 2510.16387 | translate | read | null |
| 2025-10-18 | MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding | Jingyue Huang et al. | 2510.16273 | translate | read | null |
| 2025-10-17 | SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling | Kadri Hacioglu et al. | 2510.15851 | translate | read | null |
| 2025-10-17 | SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models | Rachmad Vidya Wicaksana Putra et al. | 2510.15566 | translate | read | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et al. | 2510.14628 | translate | read | null |
| 2025-10-16 | Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? | Qixin Deng et al. | 2510.14249 | translate | read | null |
| 2025-10-15 | Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks | Supriti Sinhamahapatra et al. | 2510.13979 | translate | read | null |
| 2025-10-15 | Closing the Gap Between Text and Speech Understanding in LLMs | Santiago Cuervo et al. | 2510.13632 | translate | read | null |
| 2025-10-15 | UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE | Zhenyu Liu et al. | 2510.13344 | translate | read | link |
| 2025-10-15 | Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses | Sungnyun Kim et al. | 2510.13281 | translate | read | null |
| 2025-10-14 | Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs | Xinlu He et al. | 2510.12995 | translate | read | null |
| 2025-10-14 | VCTR: A Transformer-Based Model for Non-parallel Voice Conversion | Maharnab Saikia et al. | 2510.12964 | translate | read | null |
| 2025-10-14 | A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation | Mohammed Hilal Al-Kharusi et al. | 2510.12858 | translate | read | null |
| 2025-10-14 | Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models | Tsung-En Lin et al. | 2510.12851 | translate | read | null |
| 2025-10-11 | Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation | Md. Nayeem et al. | 2510.12827 | translate | read | null |
| 2025-10-14 | Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models | Prasenjit K Mudi et al. | 2510.12666 | translate | read | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et al. | 2510.11646 | translate | read | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et al. | 2510.11124 | translate | read | null |
| 2025-10-13 | VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents | Jiliang Hu et al. | 2510.11098 | translate | read | null |
| 2025-10-12 | ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis | Mohammad Javad Ranjbar Kalahroodi et al. | 2510.10774 | translate | read | null |
| 2025-10-12 | End-to-end Speech Recognition with similar length speech and text | Peng Fan et al. | 2510.10453 | translate | read | null |
| 2025-10-12 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et al. | 2510.10396 | translate | read | null |
| 2025-10-11 | End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs | Nam Luu et al. | 2510.10329 | translate | read | null |
| 2025-10-11 | ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis | Stephen Ni-Hahn et al. | 2510.10249 | translate | read | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et al. | 2510.10069 | translate | read | null |
| 2025-10-10 | Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking | Mohammad Hossein Sameti et al. | 2510.09528 | translate | read | null |
| 2025-10-10 | WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations | Hui Wang et al. | 2510.09344 | translate | read | null |
| 2025-10-10 | SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion | Zhao Guo et al. | 2510.09245 | translate | read | null |
| 2025-10-10 | Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality – an experimental evaluation | Michele Buccoli et al. | 2510.09236 | translate | read | null |
| 2025-10-10 | FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms | Atul Shree et al. | 2510.09085 | translate | read | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et al. | 2510.09061 | translate | read | link |
| 2025-10-08 | Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization | Rui Hu et al. | 2510.08618 | translate | read | null |
| 2025-10-09 | MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows | Guobin Ma et al. | 2510.08392 | translate | read | link |
| 2025-10-09 | DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching | Hanke Xie et al. | 2510.08373 | translate | read | null |
| 2025-10-09 | Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition | Yi-Cheng Lin et al. | 2510.08047 | translate | read | null |
| 2025-10-09 | IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation | Wei Wang et al. | 2510.07979 | translate | read | null |
| 2025-10-09 | VoiceAgentBench: Are Voice Assistants ready for agentic tasks? | Dhruv Jain et al. | 2510.07978 | translate | read | null |
| 2025-10-09 | Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor | Kuan-Yu Chen et al. | 2510.07909 | translate | read | null |
| 2025-10-08 | How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu | Benjamin Akera et al. | 2510.07221 | translate | read | link |
| 2025-10-08 | Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis | Zhu Li et al. | 2510.07096 | translate | read | null |
| 2025-10-08 | Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation | Vaibhav Srivastav et al. | 2510.06961 | translate | read | null |
