Audio Processing - 2025-10
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-10-31 | NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion | Zongyang Du et.al. | 2511.00256 | translate | read | null |
| 2025-10-31 | Holographic equation of state matched with hadron gas equation as a tool for the study of the quark-gluon plasma evolution | A. V. Anufriev et.al. | 2510.27541 | translate | read | null |
| 2025-10-31 | Referee: Reference-aware Audiovisual Deepfake Detection | Hyemin Boo et.al. | 2510.27475 | translate | read | null |
| 2025-10-31 | Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation | Alireza Gharahighehi et.al. | 2510.27342 | translate | read | null |
| 2025-10-31 | Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication | Deok-Seon Kim et.al. | 2510.27247 | translate | read | null |
| 2025-10-31 | Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm | Anselm Lohmann et.al. | 2510.27198 | translate | read | null |
| 2025-10-31 | Expressive Range Characterization of Open Text-to-Audio Models | Jonathan Morse et.al. | 2510.27102 | translate | read | null |
| 2025-10-30 | Are Online Sports Fan Communities Becoming More Offensive? A Quantitative Review of Topics, Trends, and Toxicity of r/PremierLeague | Muhammad Zeeshan Mazhar et.al. | 2510.27003 | translate | read | null |
| 2025-10-30 | Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations | Jean-Philippe Corbeil et.al. | 2510.26974 | translate | read | null |
| 2025-10-29 | Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition | Amine Razig et.al. | 2510.26838 | translate | read | null |
| 2025-10-29 | Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling | Jiarong Du et.al. | 2510.26825 | translate | read | null |
| 2025-10-28 | Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features | Unzela Talpur et.al. | 2510.26823 | translate | read | null |
| 2025-10-28 | See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement | Jinting Wang et.al. | 2510.26819 | translate | read | null |
| 2025-10-28 | GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment | Jinting Wang et.al. | 2510.26818 | translate | read | null |
| 2025-10-30 | HMM for short independent sequences: Multiple sequence Baum-Welch application | Margarita Cabrera-Bean et.al. | 2510.26532 | translate | read | null |
| 2025-10-30 | UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens | Chengwei Liu et.al. | 2510.26372 | translate | read | link |
| 2025-10-30 | Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages | Mérilin Sousa Silva et.al. | 2510.26254 | translate | read | null |
| 2025-10-29 | Efficient Vocal Source Separation Through Windowed Sink Attention | Christodoulos Benetatos et.al. | 2510.25745 | translate | read | null |
| 2025-10-29 | Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models | Harm Lameris et.al. | 2510.25577 | translate | read | null |
| 2025-10-29 | Learning Disentangled Speech- and Expression-Driven Blendshapes for 3D Talking Face Animation | Yuxiang Mao et.al. | 2510.25234 | translate | read | null |
| 2025-10-27 | SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution | Dharma Teja Donepudi et.al. | 2510.25178 | translate | read | null |
| 2025-10-29 | Joint Analysis of Acoustic Scenes and Sound Events Based on Semi-Supervised Training of Sound Events With Partial Labels | Keisuke Imoto et.al. | 2510.25075 | translate | read | null |
| 2025-10-29 | Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech | Pedro Corrêa et.al. | 2510.25054 | translate | read | null |
| 2025-10-28 | POWSM: A Phonetic Open Whisper-Style Speech Foundation Model | Chin-Jou Li et.al. | 2510.24992 | translate | read | null |
| 2025-10-28 | The Narrative Continuity Test: A Conceptual Framework for Evaluating Identity Persistence in AI Systems | Stefano Natangelo et.al. | 2510.24831 | translate | read | null |
| 2025-10-28 | Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation | Inclusion AI et.al. | 2510.24821 | translate | read | link |
| 2025-10-28 | BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation | Raphaël Bagat et.al. | 2510.24570 | translate | read | null |
| 2025-10-28 | Levée d’ambiguïtés par grammaires locales | Eric G. C. Laporte et.al. | 2510.24530 | translate | read | null |
| 2025-10-28 | Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient | Rinku Sebastian et.al. | 2510.24519 | translate | read | null |
| 2025-10-28 | Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes | Jonas Hein et.al. | 2510.24332 | translate | read | null |
| 2025-10-28 | V-SAT: Video Subtitle Annotation Tool | Arpita Kundu et.al. | 2510.24180 | translate | read | null |
| 2025-10-28 | RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects | Md. Rezuwan Hassan et.al. | 2510.24096 | translate | read | null |
| 2025-10-27 | A Neural Model for Contextual Biasing Score Learning and Filtering | Wanting Huang et.al. | 2510.23849 | translate | read | null |
| 2025-10-27 | Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders | Nathan Paek et.al. | 2510.23802 | translate | read | null |
| 2025-10-27 | SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity | Hanke Xie et.al. | 2510.23541 | translate | read | null |
| 2025-10-27 | LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization | Máté Gedeon et.al. | 2510.23320 | translate | read | null |
| 2025-10-27 | Arabic Little STT: Arabic Children Speech Recognition Dataset | Mouhand Alkadri et.al. | 2510.23319 | translate | read | null |
| 2025-10-27 | Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? | Tawsif Tashwar Dipto et.al. | 2510.23252 | translate | read | null |
| 2025-10-27 | Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement | Sarabeth S. Mullins et.al. | 2510.23141 | translate | read | null |
| 2025-10-27 | Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition | Jing-Xuan Zhang et.al. | 2510.22961 | translate | read | null |
| 2025-10-26 | LRW-Persian: Lip-reading in the Wild Dataset for Persian Language | Zahra Taghizadeh et.al. | 2510.22716 | translate | read | null |
| 2025-10-26 | Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs | Anand et.al. | 2510.22603 | translate | read | link |
| 2025-10-26 | UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models | Wenming Tu et.al. | 2510.22588 | translate | read | link |
| 2025-10-26 | A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus | Michael Scott et.al. | 2510.22495 | translate | read | null |
| 2025-10-26 | The Tonogenesis Continuum in Tibetan: A Computational Investigation | Siyu Liang et.al. | 2510.22485 | translate | read | null |
| 2025-10-25 | M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR | Ruixiang Mao et.al. | 2510.22172 | translate | read | null |
| 2025-10-25 | Streaming Generation for Music Accompaniment | Yusong Wu et.al. | 2510.22105 | translate | read | null |
| 2025-10-23 | GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer | Jackson Loth et.al. | 2510.21872 | translate | read | null |
| 2025-10-24 | StylePitcher: Generating Style-Following and Expressive Pitch Curves for Versatile Singing Tasks | Jingyue Huang et.al. | 2510.21685 | translate | read | null |
| 2025-10-23 | ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring | Ari Frummer et.al. | 2510.21014 | translate | read | null |
| 2025-10-21 | Can large audio language models understand child stuttering speech? speech summarization, and source separation | Chibuzor Okocha et.al. | 2510.20850 | translate | read | null |
| 2025-10-23 | R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion | Junjie Zheng et.al. | 2510.20677 | translate | read | null |
| 2025-10-23 | Speaking Clearly: A Simplified Whisper-Based Codec for Low-Bitrate Speech Coding | Xin Zhang et.al. | 2510.20504 | translate | read | link |
| 2025-10-23 | Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator | Hualei Wang et.al. | 2510.20210 | translate | read | null |
| 2025-10-23 | SpeechAgent: An End-to-End Mobile Infrastructure for Speech Impairment Assistance | Haowei Lou et.al. | 2510.20113 | translate | read | null |
| 2025-10-22 | Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition | Yuu Jinnai et.al. | 2510.19471 | translate | read | null |
| 2025-10-22 | FLASH Viterbi: Fast and Adaptive Viterbi Decoding for Modern Data Systems | Ziheng Deng et.al. | 2510.19301 | translate | read | null |
| 2025-10-22 | Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges | Cheng Huang et.al. | 2510.19144 | translate | read | null |
| 2025-10-21 | Steering Autoregressive Music Generation with Recursive Feature Machines | Daniel Zhao et.al. | 2510.19127 | translate | read | link |
| 2025-10-21 | StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction | Qianheng Xu et.al. | 2510.18938 | translate | read | null |
| 2025-10-21 | RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling | Mandip Goswami et.al. | 2510.18917 | translate | read | link |
| 2025-10-21 | MLMA: Towards Multilingual ASR With Mamba-based Architectures | Mohamed Nabih Ali et.al. | 2510.18684 | translate | read | null |
| 2025-10-21 | Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification | Bin Gu et.al. | 2510.18533 | translate | read | null |
| 2025-10-21 | A Stage-Wise Learning Strategy with Fixed Anchors for Robust Speaker Verification | Bin Gu et.al. | 2510.18530 | translate | read | null |
| 2025-10-20 | DELULU: Discriminative Embedding Learning Using Latent Units for Speaker-Aware Self-Supervised Speech Foundational Model | Massa Baali et.al. | 2510.17662 | translate | read | null |
| 2025-10-19 | U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation | Xusheng Yang et.al. | 2510.16718 | translate | read | null |
| 2025-10-19 | Zero- and One-Shot Data Augmentation for Sentence-Level Dysarthric Speech Recognition in Constrained Scenarios | Shiyao Wang et.al. | 2510.16700 | translate | read | null |
| 2025-10-18 | Hallucination Benchmark for Speech Foundation Models | Alkis Koudounas et.al. | 2510.16567 | translate | read | null |
| 2025-10-18 | Interpreting the Dimensions of Speaker Embedding Space | Mark Huckvale et.al. | 2510.16489 | translate | read | null |
| 2025-10-18 | Probing the Hidden Talent of ASR Foundation Models for L2 English Oral Assessment | Fu-An Chao et.al. | 2510.16387 | translate | read | null |
| 2025-10-18 | MuseTok: Symbolic Music Tokenization for Generation and Semantic Understanding | Jingyue Huang et.al. | 2510.16273 | translate | read | null |
| 2025-10-17 | SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling | Kadri Hacioglu et.al. | 2510.15851 | translate | read | null |
| 2025-10-17 | SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models | Rachmad Vidya Wicaksana Putra et.al. | 2510.15566 | translate | read | null |
| 2025-10-16 | RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF | Qing Yang et.al. | 2510.14628 | translate | read | null |
| 2025-10-16 | Do Joint Language-Audio Embeddings Encode Perceptual Timbre Semantics? | Qixin Deng et.al. | 2510.14249 | translate | read | null |
| 2025-10-15 | Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks | Supriti Sinhamahapatra et.al. | 2510.13979 | translate | read | null |
| 2025-10-15 | Closing the Gap Between Text and Speech Understanding in LLMs | Santiago Cuervo et.al. | 2510.13632 | translate | read | null |
| 2025-10-15 | UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE | Zhenyu Liu et.al. | 2510.13344 | translate | read | link |
| 2025-10-15 | Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses | Sungnyun Kim et.al. | 2510.13281 | translate | read | null |
| 2025-10-14 | Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs | Xinlu He et.al. | 2510.12995 | translate | read | null |
| 2025-10-14 | VCTR: A Transformer-Based Model for Non-parallel Voice Conversion | Maharnab Saikia et.al. | 2510.12964 | translate | read | null |
| 2025-10-14 | A Critical Review of the Need for Knowledge-Centric Evaluation of Quranic Recitation | Mohammed Hilal Al-Kharusi et.al. | 2510.12858 | translate | read | null |
| 2025-10-14 | Adaptive vector steering: A training-free, layer-wise intervention for hallucination mitigation in large audio and multimodal models | Tsung-En Lin et.al. | 2510.12851 | translate | read | null |
| 2025-10-11 | Automatic Speech Recognition in the Modern Era: Architectures, Training, and Evaluation | Md. Nayeem et.al. | 2510.12827 | translate | read | null |
| 2025-10-14 | Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models | Prasenjit K Mudi et.al. | 2510.12666 | translate | read | null |
| 2025-10-13 | BridgeCode: A Dual Speech Representation Paradigm for Autoregressive Zero-Shot Text-to-Speech Synthesis | Jingyuan Xing et.al. | 2510.11646 | translate | read | null |
| 2025-10-13 | Perturbation Self-Supervised Representations for Cross-Lingual Emotion TTS: Stage-Wise Modeling of Emotion and Speaker | Cheng Gong et.al. | 2510.11124 | translate | read | null |
| 2025-10-13 | VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents | Jiliang Hu et.al. | 2510.11098 | translate | read | null |
| 2025-10-12 | ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis | Mohammad Javad Ranjbar Kalahroodi et.al. | 2510.10774 | translate | read | null |
| 2025-10-12 | End-to-end Speech Recognition with similar length speech and text | Peng Fan et.al. | 2510.10453 | translate | read | null |
| 2025-10-12 | MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations | Wenxiang Guo et.al. | 2510.10396 | translate | read | null |
| 2025-10-11 | End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs | Nam Luu et.al. | 2510.10329 | translate | read | null |
| 2025-10-11 | ProGress: Structured Music Generation via Graph Diffusion and Hierarchical Music Analysis | Stephen Ni-Hahn et.al. | 2510.10249 | translate | read | null |
| 2025-10-11 | SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation | Zeyu Ling et.al. | 2510.10069 | translate | read | null |
| 2025-10-10 | Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking | Mohammad Hossein Sameti et.al. | 2510.09528 | translate | read | null |
| 2025-10-10 | WildElder: A Chinese Elderly Speech Dataset from the Wild with Fine-Grained Manual Annotations | Hui Wang et.al. | 2510.09344 | translate | read | null |
| 2025-10-10 | SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion | Zhao Guo et.al. | 2510.09245 | translate | read | null |
| 2025-10-10 | Effects of automotive microphone frequency response characteristics and noise conditions on speech and ASR quality – an experimental evaluation | Michele Buccoli et.al. | 2510.09236 | translate | read | null |
| 2025-10-10 | FLToP CTC: Frame-Level Token Pruning via Relative Threshold for Efficient and Memory-Saving Decoding on Diverse Platforms | Atul Shree et.al. | 2510.09085 | translate | read | null |
| 2025-10-10 | O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion | Huu Tuong Tu et.al. | 2510.09061 | translate | read | link |
| 2025-10-08 | Look before Transcription: End-to-End SlideASR with Visually-Anchored Policy Optimization | Rui Hu et.al. | 2510.08618 | translate | read | null |
| 2025-10-09 | MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows | Guobin Ma et.al. | 2510.08392 | translate | read | link |
| 2025-10-09 | DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching | Hanke Xie et.al. | 2510.08373 | translate | read | null |
| 2025-10-09 | Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition | Yi-Cheng Lin et.al. | 2510.08047 | translate | read | null |
| 2025-10-09 | IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation | Wei Wang et.al. | 2510.07979 | translate | read | null |
| 2025-10-09 | VoiceAgentBench: Are Voice Assistants ready for agentic tasks? | Dhruv Jain et.al. | 2510.07978 | translate | read | null |
| 2025-10-09 | Bloodroot: When Watermarking Turns Poisonous For Stealthy Backdoor | Kuan-Yu Chen et.al. | 2510.07909 | translate | read | null |
| 2025-10-08 | How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu | Benjamin Akera et.al. | 2510.07221 | translate | read | link |
| 2025-10-08 | Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis | Zhu Li et.al. | 2510.07096 | translate | read | null |
| 2025-10-08 | Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation | Vaibhav Srivastav et.al. | 2510.06961 | translate | read | null |