Audio Processing - 2025-02

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-02-28 | InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation | Chong Zhang et al. | 2503.00084 | translate | read | link |
| 2025-02-27 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation | Keisuke Kamahori et al. | 2502.20583 | translate | read | link |
| 2025-02-27 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications | Marcus Yu Zhe Wee et al. | 2502.20311 | translate | read | null |
| 2025-02-27 | CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR | Nian Shao et al. | 2502.20040 | translate | read | link |
| 2025-02-27 | DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models | Weihao Wu et al. | 2502.19924 | translate | read | null |
| 2025-02-26 | Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | Ziyue Jiang et al. | 2502.18924 | translate | read | null |
| 2025-02-26 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition | Jiaming Zhou et al. | 2502.18913 | translate | read | null |
| 2025-02-25 | Exploring Gender Disparities in Automatic Speech Recognition Technology | Hend ElGhazaly et al. | 2502.18434 | translate | read | null |
| 2025-02-27 | NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms | Yashan Wang et al. | 2502.18008 | translate | read | null |
| 2025-02-25 | Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm | Yudong Xie et al. | 2502.17829 | translate | read | null |
| 2025-02-26 | Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation | Qiuming Zhao et al. | 2502.17380 | translate | read | null |
| 2025-02-24 | Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus | Golshid Shekoufandeh et al. | 2502.17284 | translate | read | null |
| 2025-02-24 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi et al. | 2502.16897 | translate | read | null |
| 2025-02-22 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration | Haoxuan Wang et al. | 2502.16142 | translate | read | null |
| 2025-02-21 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages | Jenalea Rajab et al. | 2502.15916 | translate | read | null |
| 2025-02-21 | Retrieval-Augmented Speech Recognition Approach for Domain Challenges | Peng Shen et al. | 2502.15264 | translate | read | null |
| 2025-02-21 | Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders | Weiqiao Shan et al. | 2502.15178 | translate | read | null |
| 2025-02-21 | Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking | Khanh Le et al. | 2502.15158 | translate | read | null |
| 2025-02-20 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models | Yifu Chen et al. | 2502.14727 | translate | read | null |
| 2025-02-20 | SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition | Khanh Le et al. | 2502.14685 | translate | read | null |
| 2025-02-20 | Moshi Moshi? A Model Selection Hijacking Adversarial Attack | Riccardo Petrucci et al. | 2502.14586 | translate | read | null |
| 2025-02-19 | On the application of Visibility Graphs in the Spectral Domain for Speaker Recognition | Hernan Bocaccio et al. | 2502.14110 | translate | read | null |
| 2025-02-18 | Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | Seungbae Kim et al. | 2502.13983 | translate | read | null |
| 2025-02-19 | Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks | Ori Shapira et al. | 2502.13645 | translate | read | link |
| 2025-02-21 | VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation | Wei Zhao et al. | 2502.13508 | translate | read | link |
| 2025-02-19 | Adopting Whisper for Confidence Estimation | Vaibhav Aggarwal et al. | 2502.13446 | translate | read | null |
| 2025-02-18 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions | Aggelina Chatziagapi et al. | 2502.13133 | translate | read | null |
| 2025-02-18 | Neuro-oscillatory models of cortical speech processing | Olesia Dogonasheva et al. | 2502.12935 | translate | read | null |
| 2025-02-18 | High-Fidelity Music Vocoder using Neural Audio Codecs | Luca A. Lanzendörfer et al. | 2502.12759 | translate | read | null |
| 2025-02-18 | Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge | Lian Remme et al. | 2502.12714 | translate | read | null |
| 2025-02-18 | A Comprehensive Survey on Generative AI for Video-to-Music Generation | Shulei Ji et al. | 2502.12489 | translate | read | null |
| 2025-02-18 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models | Hanin Atwany et al. | 2502.12414 | translate | read | null |
| 2025-02-18 | On the Robust Approximation of ASR Metrics | Abdul Waheed et al. | 2502.12408 | translate | read | null |
| 2025-02-17 | A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond | Shreya Shukla et al. | 2502.12048 | translate | read | null |
| 2025-02-17 | NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Yifan Liang et al. | 2502.12002 | translate | read | null |
| 2025-02-17 | Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration | Yan Zhang et al. | 2502.11720 | translate | read | null |
| 2025-02-17 | Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models | Yingqing Guo et al. | 2502.11420 | translate | read | null |
| 2025-02-16 | FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching | Hui Wang et al. | 2502.11128 | translate | read | null |
| 2025-02-16 | In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback | Fyodor Morozko et al. | 2502.11126 | translate | read | null |
| 2025-02-16 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities | Xiangyu Lu et al. | 2502.11123 | translate | read | null |
| 2025-02-14 | Enhancing Age-Related Robustness in Children Speaker Verification | Vishwas M. Shetty et al. | 2502.10511 | translate | read | null |
| 2025-02-14 | OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models | William Chen et al. | 2502.10373 | translate | read | null |
| 2025-02-14 | VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect | Qingyuan Fei et al. | 2502.10329 | translate | read | null |
| 2025-02-14 | Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries | Serkan Sulun et al. | 2502.10154 | translate | read | null |
| 2025-02-14 | MTLM: an Innovative Language Model Training Paradigm for ASR | Qingliang Meng et al. | 2502.10058 | translate | read | null |
| 2025-02-14 | A Preliminary Exploration with GPT-4o Voice Mode | Yu-Xiang Lin et al. | 2502.09940 | translate | read | null |
| 2025-02-14 | Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge | Naoyuki Kamo et al. | 2502.09859 | translate | read | null |
| 2025-02-13 | SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops | Eshaq Jamdar et al. | 2502.09553 | translate | read | null |
| 2025-02-13 | Shortcut Learning Susceptibility in Vision Classifiers | Pirzada Suhail et al. | 2502.09150 | translate | read | null |
| 2025-02-13 | Quantum Approaches for Dysphonia Assessment in Small Speech Datasets | Ha Tran et al. | 2502.08968 | translate | read | null |
| 2025-02-13 | TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Kyungsu Kim et al. | 2502.08939 | translate | read | link |
| 2025-02-13 | ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech | Xin Wang et al. | 2502.08857 | translate | read | null |
| 2025-02-12 | Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | Vishwanath Pratap Singh et al. | 2502.08587 | translate | read | null |
| 2025-02-11 | LoRP-TTS: Low-Rank Personalized Text-To-Speech | Łukasz Bondaruk et al. | 2502.07562 | translate | read | null |
| 2025-02-12 | Music for All: Exploring Multicultural Representations in Music Generation Models | Atharva Mehta et al. | 2502.07328 | translate | read | link |
| 2025-02-11 | Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Xueyao Zhang et al. | 2502.07243 | translate | read | null |
| 2025-02-11 | VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification | Pengyu Wang et al. | 2502.07205 | translate | read | link |
| 2025-02-10 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication | Qutaiba I. Ali et al. | 2502.06969 | translate | read | null |
| 2025-02-10 | Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset | Huw Cheston et al. | 2502.06364 | translate | read | null |
| 2025-02-09 | Speech to Speech Translation with Translatotron: A State of the Art Review | Jules R. Kala et al. | 2502.05980 | translate | read | null |
| 2025-02-09 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Jing-Xuan Zhang et al. | 2502.05766 | translate | read | null |
| 2025-02-09 | Non-invasive electromyographic speech neuroprosthesis: a geometric perspective | Harshavardhana T. Gowda et al. | 2502.05762 | translate | read | null |
| 2025-02-09 | BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting | Mohammad Jahid Ibna Basher et al. | 2502.05729 | translate | read | null |
| 2025-02-08 | Gender Bias in Instruction-Guided Speech Synthesis Models | Chun-Yi Kuan et al. | 2502.05649 | translate | read | null |
| 2025-02-08 | Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model | Jialong Zuo et al. | 2502.05471 | translate | read | null |
| 2025-02-07 | Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance | Reihaneh Amooie et al. | 2502.04883 | translate | read | null |
| 2025-02-07 | Lightweight Operations for Visual Speech Recognition | Iason Ioannis Panagos et al. | 2502.04834 | translate | read | null |
| 2025-02-07 | Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features | Wei Chen et al. | 2502.04722 | translate | read | null |
| 2025-02-06 | ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement | Keshav Bhandari et al. | 2502.04522 | translate | read | link |
| 2025-02-06 | GenVC: Self-Supervised Zero-Shot Voice Conversion | Zexin Cai et al. | 2502.04519 | translate | read | null |
| 2025-02-06 | FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks | Luca Della Libera et al. | 2502.04465 | translate | read | link |
| 2025-02-06 | Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis | Zhen Ye et al. | 2502.04128 | translate | read | link |
| 2025-02-06 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond | Mardhiyah Sanni et al. | 2502.03945 | translate | read | null |
| 2025-02-06 | Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS | Afnan Al-Ali et al. | 2502.03895 | translate | read | null |
| 2025-02-05 | Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality | Shiyi Tan et al. | 2502.03381 | translate | read | null |
| 2025-02-05 | Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling | Jakob Poncelet et al. | 2502.03212 | translate | read | link |
| 2025-02-05 | Metis: A Foundation Speech Generation Model with Masked Generative Pre-training | Yuancheng Wang et al. | 2502.03128 | translate | read | null |
| 2025-02-04 | Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet | Shenran Wang et al. | 2502.02703 | translate | read | null |
| 2025-02-03 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition | Martijn Bartelds et al. | 2502.01777 | translate | read | null |
| 2025-02-03 | Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models | Christopher Simic et al. | 2502.01709 | translate | read | null |
| 2025-02-03 | A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport | Yacouba Kaloga et al. | 2502.01588 | translate | read | null |
| 2025-02-03 | mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Andrew Rouditchenko et al. | 2502.01547 | translate | read | link |
| 2025-02-03 | Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition | Nanjun Zhou et al. | 2502.01152 | translate | read | null |
| 2025-02-03 | Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | Weiwei Lin et al. | 2502.01084 | translate | read | null |
| 2025-02-01 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition | Anna Seo Gyeong Choi et al. | 2502.00583 | translate | read | null |
| 2025-02-01 | Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions | David Gimeno-Gómez et al. | 2502.00464 | translate | read | null |
| 2025-02-01 | Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language | Turi Abu et al. | 2502.00421 | translate | read | link |
| 2025-02-01 | When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation | Anna Min et al. | 2502.00377 | translate | read | null |
| 2025-02-03 | SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions | Dominik Wagner et al. | 2501.19377 | translate | read | null |
| 2025-02-03 | DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition | Wonjun Lee et al. | 2501.19010 | translate | read | null |
