Audio Processing - 2025-02
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-02-28 | InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation | Chong Zhang et.al. | 2503.00084 | translate | read | link |
| 2025-02-27 | LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation | Keisuke Kamahori et.al. | 2502.20583 | translate | read | link |
| 2025-02-27 | Adapting Automatic Speech Recognition for Accented Air Traffic Control Communications | Marcus Yu Zhe Wee et.al. | 2502.20311 | translate | read | null |
| 2025-02-27 | CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR | Nian Shao et.al. | 2502.20040 | translate | read | link |
| 2025-02-27 | DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models | Weihao Wu et.al. | 2502.19924 | translate | read | null |
| 2025-02-26 | Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | Ziyue Jiang et.al. | 2502.18924 | translate | read | null |
| 2025-02-26 | CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition | Jiaming Zhou et.al. | 2502.18913 | translate | read | null |
| 2025-02-25 | Exploring Gender Disparities in Automatic Speech Recognition Technology | Hend ElGhazaly et.al. | 2502.18434 | translate | read | null |
| 2025-02-27 | NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms | Yashan Wang et.al. | 2502.18008 | translate | read | null |
| 2025-02-25 | Silent Speech Sentence Recognition with Six-Axis Accelerometers using Conformer and CTC Algorithm | Yudong Xie et.al. | 2502.17829 | translate | read | null |
| 2025-02-26 | Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation | Qiuming Zhao et.al. | 2502.17380 | translate | read | null |
| 2025-02-24 | Improving the Inclusivity of Dutch Speech Recognition by Fine-tuning Whisper on the JASMIN-CGN Corpus | Golshid Shekoufandeh et.al. | 2502.17284 | translate | read | null |
| 2025-02-24 | Balancing Speech Understanding and Generation Using Continual Pre-training for Codec-based Speech LLM | Jiatong Shi et.al. | 2502.16897 | translate | read | null |
| 2025-02-22 | Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration | Haoxuan Wang et.al. | 2502.16142 | translate | read | null |
| 2025-02-21 | The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages | Jenalea Rajab et.al. | 2502.15916 | translate | read | null |
| 2025-02-21 | Retrieval-Augmented Speech Recognition Approach for Domain Challenges | Peng Shen et.al. | 2502.15264 | translate | read | null |
| 2025-02-21 | Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders | Weiqiao Shan et.al. | 2502.15178 | translate | read | null |
| 2025-02-21 | Improving Streaming Speech Recognition With Time-Shifted Contextual Attention And Dynamic Right Context Masking | Khanh Le et.al. | 2502.15158 | translate | read | null |
| 2025-02-20 | WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models | Yifu Chen et.al. | 2502.14727 | translate | read | null |
| 2025-02-20 | SegAug: CTC-Aligned Segmented Augmentation For Robust RNN-Transducer Based Speech Recognition | Khanh Le et.al. | 2502.14685 | translate | read | null |
| 2025-02-20 | Moshi Moshi? A Model Selection Hijacking Adversarial Attack | Riccardo Petrucci et.al. | 2502.14586 | translate | read | null |
| 2025-02-19 | On the application of Visibility Graphs in the Spectral Domain for Speaker Recognition | Hernan Bocaccio et.al. | 2502.14110 | translate | read | null |
| 2025-02-18 | Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders | Seungbae Kim et.al. | 2502.13983 | translate | read | null |
| 2025-02-19 | Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks | Ori Shapira et.al. | 2502.13645 | translate | read | link |
| 2025-02-21 | VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation | Wei Zhao et.al. | 2502.13508 | translate | read | link |
| 2025-02-19 | Adopting Whisper for Confidence Estimation | Vaibhav Aggarwal et.al. | 2502.13446 | translate | read | null |
| 2025-02-18 | AV-Flow: Transforming Text to Audio-Visual Human-like Interactions | Aggelina Chatziagapi et.al. | 2502.13133 | translate | read | null |
| 2025-02-18 | Neuro-oscillatory models of cortical speech processing | Olesia Dogonasheva et.al. | 2502.12935 | translate | read | null |
| 2025-02-18 | High-Fidelity Music Vocoder using Neural Audio Codecs | Luca A. Lanzendörfer et.al. | 2502.12759 | translate | read | null |
| 2025-02-18 | Playing with Voices: Tabletop Role-Playing Game Recordings as a Diarization Challenge | Lian Remme et.al. | 2502.12714 | translate | read | null |
| 2025-02-18 | A Comprehensive Survey on Generative AI for Video-to-Music Generation | Shulei Ji et.al. | 2502.12489 | translate | read | null |
| 2025-02-18 | Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models | Hanin Atwany et.al. | 2502.12414 | translate | read | null |
| 2025-02-18 | On the Robust Approximation of ASR Metrics | Abdul Waheed et.al. | 2502.12408 | translate | read | null |
| 2025-02-17 | A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond | Shreya Shukla et.al. | 2502.12048 | translate | read | null |
| 2025-02-17 | NaturalL2S: End-to-End High-quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing | Yifan Liang et.al. | 2502.12002 | translate | read | null |
| 2025-02-17 | Can you pass that tool?: Implications of Indirect Speech in Physical Human-Robot Collaboration | Yan Zhang et.al. | 2502.11720 | translate | read | null |
| 2025-02-17 | Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models | Yingqing Guo et.al. | 2502.11420 | translate | read | null |
| 2025-02-16 | FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching | Hui Wang et.al. | 2502.11128 | translate | read | null |
| 2025-02-16 | In Situ Optimization of an Optoelectronic Reservoir Computer with Digital Delayed Feedback | Fyodor Morozko et.al. | 2502.11126 | translate | read | null |
| 2025-02-16 | DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities | Xiangyu Lu et.al. | 2502.11123 | translate | read | null |
| 2025-02-14 | Enhancing Age-Related Robustness in Children Speaker Verification | Vishwas M. Shetty et.al. | 2502.10511 | translate | read | null |
| 2025-02-14 | OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models | William Chen et.al. | 2502.10373 | translate | read | null |
| 2025-02-14 | VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect | Qingyuan Fei et.al. | 2502.10329 | translate | read | null |
| 2025-02-14 | Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries | Serkan Sulun et.al. | 2502.10154 | translate | read | null |
| 2025-02-14 | MTLM: an Innovative Language Model Training Paradigm for ASR | Qingliang Meng et.al. | 2502.10058 | translate | read | null |
| 2025-02-14 | A Preliminary Exploration with GPT-4o Voice Mode | Yu-Xiang Lin et.al. | 2502.09940 | translate | read | null |
| 2025-02-14 | Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge | Naoyuki Kamo et.al. | 2502.09859 | translate | read | null |
| 2025-02-13 | SyntheticPop: Attacking Speaker Verification Systems With Synthetic VoicePops | Eshaq Jamdar et.al. | 2502.09553 | translate | read | null |
| 2025-02-13 | Shortcut Learning Susceptibility in Vision Classifiers | Pirzada Suhail et.al. | 2502.09150 | translate | read | null |
| 2025-02-13 | Quantum Approaches for Dysphonia Assessment in Small Speech Datasets | Ha Tran et.al. | 2502.08968 | translate | read | null |
| 2025-02-13 | TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument | Kyungsu Kim et.al. | 2502.08939 | translate | read | link |
| 2025-02-13 | ASVspoof 5: Design, Collection and Validation of Resources for Spoofing, Deepfake, and Adversarial Attack Detection Using Crowdsourced Speech | Xin Wang et.al. | 2502.08857 | translate | read | null |
| 2025-02-12 | Causal Analysis of ASR Errors for Children: Quantifying the Impact of Physiological, Cognitive, and Extrinsic Factors | Vishwanath Pratap Singh et.al. | 2502.08587 | translate | read | null |
| 2025-02-11 | LoRP-TTS: Low-Rank Personalized Text-To-Speech | Łukasz Bondaruk et.al. | 2502.07562 | translate | read | null |
| 2025-02-12 | Music for All: Exploring Multicultural Representations in Music Generation Models | Atharva Mehta et.al. | 2502.07328 | translate | read | link |
| 2025-02-11 | Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | Xueyao Zhang et.al. | 2502.07243 | translate | read | null |
| 2025-02-11 | VINP: Variational Bayesian Inference with Neural Speech Prior for Joint ASR-Effective Speech Dereverberation and Blind RIR Identification | Pengyu Wang et.al. | 2502.07205 | translate | read | link |
| 2025-02-10 | A Comparative Study of ASR Implementations in Resource-Constrained Wireless Sensor Networks for Real-Time Voice Communication | Qutaiba I. Ali et.al. | 2502.06969 | translate | read | null |
| 2025-02-10 | Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset | Huw Cheston et.al. | 2502.06364 | translate | read | null |
| 2025-02-09 | Speech to Speech Translation with Translatotron: A State of the Art Review | Jules R. Kala et.al. | 2502.05980 | translate | read | null |
| 2025-02-09 | Audio-Visual Representation Learning via Knowledge Distillation from Speech Foundation Models | Jing-Xuan Zhang et.al. | 2502.05766 | translate | read | null |
| 2025-02-09 | Non-invasive electromyographic speech neuroprosthesis: a geometric perspective | Harshavardhana T. Gowda et.al. | 2502.05762 | translate | read | null |
| 2025-02-09 | BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting | Mohammad Jahid Ibna Basher et.al. | 2502.05729 | translate | read | null |
| 2025-02-08 | Gender Bias in Instruction-Guided Speech Synthesis Models | Chun-Yi Kuan et.al. | 2502.05649 | translate | read | null |
| 2025-02-08 | Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model | Jialong Zuo et.al. | 2502.05471 | translate | read | null |
| 2025-02-07 | Evaluating Standard and Dialectal Frisian ASR: Multilingual Fine-tuning and Language Identification for Improved Low-resource Performance | Reihaneh Amooie et.al. | 2502.04883 | translate | read | null |
| 2025-02-07 | Lightweight Operations for Visual Speech Recognition | Iason Ioannis Panagos et.al. | 2502.04834 | translate | read | null |
| 2025-02-07 | Singing Voice Conversion with Accompaniment Using Self-Supervised Representation-Based Melody Features | Wei Chen et.al. | 2502.04722 | translate | read | null |
| 2025-02-06 | ImprovNet: Generating Controllable Musical Improvisations with Iterative Corruption Refinement | Keshav Bhandari et.al. | 2502.04522 | translate | read | link |
| 2025-02-06 | GenVC: Self-Supervised Zero-Shot Voice Conversion | Zexin Cai et.al. | 2502.04519 | translate | read | null |
| 2025-02-06 | FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks | Luca Della Libera et.al. | 2502.04465 | translate | read | link |
| 2025-02-06 | Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis | Zhen Ye et.al. | 2502.04128 | translate | read | link |
| 2025-02-06 | Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond | Mardhiyah Sanni et.al. | 2502.03945 | translate | read | null |
| 2025-02-06 | Rule-Based Modeling of Low-Dimensional Data with PCA and Binary Particle Swarm Optimization (BPSO) in ANFIS | Afnan Al-Ali et.al. | 2502.03895 | translate | read | null |
| 2025-02-05 | Integrating automatic speech recognition into remote healthcare interpreting: A pilot study of its impact on interpreting quality | Shiyi Tan et.al. | 2502.03381 | translate | read | null |
| 2025-02-05 | Leveraging Broadcast Media Subtitle Transcripts for Automatic Speech Recognition and Subtitling | Jakob Poncelet et.al. | 2502.03212 | translate | read | link |
| 2025-02-05 | Metis: A Foundation Speech Generation Model with Masked Generative Pre-training | Yuancheng Wang et.al. | 2502.03128 | translate | read | null |
| 2025-02-04 | Developing multilingual speech synthesis system for Ojibwe, Mi’kmaq, and Maliseet | Shenran Wang et.al. | 2502.02703 | translate | read | null |
| 2025-02-03 | CTC-DRO: Robust Optimization for Reducing Language Disparities in Speech Recognition | Martijn Bartelds et.al. | 2502.01777 | translate | read | null |
| 2025-02-03 | Adapter-Based Multi-Agent AVSR Extension for Pre-Trained ASR Models | Christopher Simic et.al. | 2502.01709 | translate | read | null |
| 2025-02-03 | A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport | Yacouba Kaloga et.al. | 2502.01588 | translate | read | null |
| 2025-02-03 | mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition | Andrew Rouditchenko et.al. | 2502.01547 | translate | read | link |
| 2025-02-03 | Gradient Norm-based Fine-Tuning for Backdoor Defense in Automatic Speech Recognition | Nanjun Zhou et.al. | 2502.01152 | translate | read | null |
| 2025-02-03 | Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | Weiwei Lin et.al. | 2502.01084 | translate | read | null |
| 2025-02-01 | Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition | Anna Seo Gyeong Choi et.al. | 2502.00583 | translate | read | null |
| 2025-02-01 | Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions | David Gimeno-Gómez et.al. | 2502.00464 | translate | read | null |
| 2025-02-01 | Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language | Turi Abu et.al. | 2502.00421 | translate | read | link |
| 2025-02-01 | When End-to-End is Overkill: Rethinking Cascaded Speech-to-Text Translation | Anna Min et.al. | 2502.00377 | translate | read | null |
| 2025-02-03 | SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions | Dominik Wagner et.al. | 2501.19377 | translate | read | null |
| 2025-02-03 | DyPCL: Dynamic Phoneme-level Contrastive Learning for Dysarthric Speech Recognition | Wonjun Lee et.al. | 2501.19010 | translate | read | null |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)