Audio Processing - 2025-03
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-03-31 | Can Diffusion Models Disentangle? A Theoretical Perspective | Liming Wang et.al. | 2504.00220 | translate | read | null |
| 2025-03-31 | SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation | Ngoc Dung Huynh et.al. | 2503.24164 | translate | read | null |
| 2025-03-31 | SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development | Minghan Wang et.al. | 2503.23848 | translate | read | link |
| 2025-03-30 | The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR | Injy Hamed et.al. | 2503.23576 | translate | read | null |
| 2025-03-30 | Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages | Xabier de Zuazo et.al. | 2503.23542 | translate | read | link |
| 2025-03-30 | Scaling Auditory Cognition via Test-Time Compute in Audio Language Models | Ting Dang et.al. | 2503.23395 | translate | read | null |
| 2025-03-29 | SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System | Hyeongju Kim et.al. | 2503.23108 | translate | read | null |
| 2025-03-28 | Enhancing Dance-to-Music Generation via Negative Conditioning Latent Diffusion Model | Changchang Sun et.al. | 2503.22138 | translate | read | null |
| 2025-03-27 | VALLR: Visual ASR Language Model for Lip Reading | Marshall Thomas et.al. | 2503.21408 | translate | read | null |
| 2025-03-27 | A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network | Chih-Chyau Yang et.al. | 2503.21337 | translate | read | null |
| 2025-03-27 | Vision-to-Music Generation: A Survey | Zhaokai Wang et.al. | 2503.21254 | translate | read | link |
| 2025-03-26 | Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit | Aniket Abhishek Soni et.al. | 2503.21025 | translate | read | null |
| 2025-03-26 | Text-Driven Voice Conversion via Latent State-Space Modeling | Wen Li et.al. | 2503.20999 | translate | read | null |
| 2025-03-26 | FinAudio: A Benchmark for Audio Large Language Models in Financial Applications | Yupeng Cao et.al. | 2503.20990 | translate | read | null |
| 2025-03-26 | Dolphin: A Large-Scale Automatic Speech Recognition Model for Eastern Languages | Yangyang Meng et.al. | 2503.20212 | translate | read | link |
| 2025-03-25 | Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy | Athiya Deviyani et.al. | 2503.19828 | translate | read | null |
| 2025-03-25 | Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation | Max W. Y. Lam et.al. | 2503.19611 | translate | read | null |
| 2025-03-25 | Boosting the Transferability of Audio Adversarial Examples with Acoustic Representation Optimization | Weifei Jin et.al. | 2503.19591 | translate | read | null |
| 2025-03-25 | Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment | Ghazanfar Ali et.al. | 2503.19334 | translate | read | null |
| 2025-03-22 | A Survey on Structured State Space Sequence (S4) Models | Shriyank Somvanshi et.al. | 2503.18970 | translate | read | link |
| 2025-03-24 | Towards Responsible AI Music: an Investigation of Trustworthy Features for Creative Systems | Jacopo de Berardinis et.al. | 2503.18814 | translate | read | null |
| 2025-03-24 | Whispering in Amharic: Fine-tuning Whisper for Low-resource Language | Dawit Ketema Gete et.al. | 2503.18485 | translate | read | null |
| 2025-03-23 | Elevating Robust Multi-Talker ASR by Decoupling Speaker Separation and Speech Recognition | Yufeng Yang et.al. | 2503.17886 | translate | read | null |
| 2025-03-22 | LZMidi: Compression-Based Symbolic Music Generation | Connor Ding et.al. | 2503.17654 | translate | read | null |
| 2025-03-21 | Your voice is your voice: Supporting Self-expression through Speech Generation and LLMs in Augmented and Alternative Communication | Yiwen Xu et.al. | 2503.17479 | translate | read | null |
| 2025-03-21 | From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech | Ji-Hoon Kim et.al. | 2503.16956 | translate | read | null |
| 2025-03-20 | CAARMA: Class Augmentation with Adversarial Mixup Regularization | Massa Baali et.al. | 2503.16718 | translate | read | null |
| 2025-03-20 | WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching | Tianze Luo et.al. | 2503.16689 | translate | read | null |
| 2025-03-20 | SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors | Yang Chen et.al. | 2503.16578 | translate | read | null |
| 2025-03-19 | A Comprehensive Survey on Architectural Advances in Deep CNNs: Challenges, Applications, and Emerging Research Directions | Saddam Hussain Khan et.al. | 2503.16546 | translate | read | null |
| 2025-03-19 | Evaluating ASR Confidence Scores for Automated Error Detection in User-Assisted Correction Interfaces | Korbinian Kuhn et.al. | 2503.15124 | translate | read | null |
| 2025-03-19 | Communication Access Real-Time Translation Through Collaborative Correction of Automatic Speech Recognition | Korbinian Kuhn et.al. | 2503.15120 | translate | read | null |
| 2025-03-19 | MoonCast: High-Quality Zero-Shot Podcast Generation | Zeqian Ju et.al. | 2503.14345 | translate | read | link |
| 2025-03-18 | InnerSelf: Designing Self-Deepfaked Voice for Emotional Well-being | Guang Dai et.al. | 2503.14257 | translate | read | null |
| 2025-03-17 | Halving transcription time: A fast, user-friendly and GDPR-compliant workflow to create AI-assisted transcripts for content analysis | Jakob Sponholz et.al. | 2503.13031 | translate | read | null |
| 2025-03-14 | MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens | Jeong Hun Yeo et.al. | 2503.11315 | translate | read | link |
| 2025-03-13 | AudioX: Diffusion Transformer for Anything-to-Audio Generation | Zeyue Tian et.al. | 2503.10522 | translate | read | link |
| 2025-03-13 | Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings | Jakaria Islam Emon et.al. | 2503.10446 | translate | read | link |
| 2025-03-14 | Proceedings of the ISCA/ITG Workshop on Diversity in Large Speech and Language Models | Sebastian Möller et.al. | 2503.10298 | translate | read | null |
| 2025-03-12 | ValSub: Subsampling Validation Data to Mitigate Forgetting during ASR Personalization | Haaris Mehmood et.al. | 2503.09906 | translate | read | null |
| 2025-03-12 | Quantization for OpenAI’s Whisper Models: A Comparative Analysis | Allison Andreyev et.al. | 2503.09905 | translate | read | link |
| 2025-03-12 | Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment | Xiaowei Bi et.al. | 2503.09081 | translate | read | null |
| 2025-03-11 | An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR | Sewade Ogun et.al. | 2503.08954 | translate | read | null |
| 2025-03-11 | YuE: Scaling Open Foundation Models for Long-Form Music Generation | Ruibin Yuan et.al. | 2503.08638 | translate | read | link |
| 2025-03-11 | Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos | Soumya Shamarao Jahagirdar et.al. | 2503.08335 | translate | read | null |
| 2025-03-11 | FilmComposer: LLM-Driven Music Production for Silent Film Clips | Zhifeng Xie et.al. | 2503.08147 | translate | read | link |
| 2025-03-11 | Boundary Regression for Leitmotif Detection in Music Audio | Sihun Lee et.al. | 2503.07977 | translate | read | null |
| 2025-03-10 | Building English ASR model with regional language support | Purvi Agrawal et.al. | 2503.07522 | translate | read | null |
| 2025-03-10 | Impact of Microphone Array Mismatches to Learning-based Replay Speech Detection | Michael Neri et.al. | 2503.07357 | translate | read | null |
| 2025-03-10 | Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling | Michael McGuire et.al. | 2503.06924 | translate | read | null |
| 2025-03-09 | Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs | Umberto Cappellazzo et.al. | 2503.06362 | translate | read | null |
| 2025-03-08 | Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations | Jeong Hun Yeo et.al. | 2503.06273 | translate | read | link |
| 2025-03-08 | A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment | Koji Inoue et.al. | 2503.06241 | translate | read | null |
| 2025-03-07 | DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility | Yifan Liu et.al. | 2503.05223 | translate | read | null |
| 2025-03-06 | From Voice to Safety: Language AI Powered Pilot-ATC Communication Understanding for Airport Surface Movement Collision Risk Assessment | Yutian Pang et.al. | 2503.04974 | translate | read | null |
| 2025-03-04 | Normalization through Fine-tuning: Understanding Wav2vec 2.0 Embeddings for Phonetic Analysis | Yiming Wang et.al. | 2503.04814 | translate | read | null |
| 2025-03-06 | LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM | Sambal Shikhar et.al. | 2503.04724 | translate | read | link |
| 2025-03-06 | Self-Supervised Models for Phoneme Recognition: Applications in Children’s Speech for Reading Learning | Lucas Block Medin et.al. | 2503.04710 | translate | read | null |
| 2025-03-05 | Good practices for evaluation of synthesized speech | Erica Cooper et.al. | 2503.03250 | translate | read | null |
| 2025-03-03 | Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis | Samuel S. Sohn et.al. | 2503.02907 | translate | read | null |
| 2025-03-04 | Go Beyond Your Means: Unlearning with Per-Sample Gradient Orthogonalization | Aviv Shamsian et.al. | 2503.02312 | translate | read | null |
| 2025-03-05 | Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization | Leonid Berlyand et.al. | 2503.01922 | translate | read | null |
| 2025-03-03 | Augmenting Online Meetings with Context-Aware Real-time Music Generation | Haruki Suzawa et.al. | 2503.01354 | translate | read | null |
| 2025-03-03 | Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology | Birger Moell et.al. | 2503.01266 | translate | read | null |
| 2025-03-03 | DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion | Ziqian Ning et.al. | 2503.01183 | translate | read | link |
| 2025-03-02 | Unveiling Biases while Embracing Sustainability: Assessing the Dual Challenges of Automatic Speech Recognition Systems | Ajinkya Kulkarni et.al. | 2503.00907 | translate | read | null |
| 2025-03-02 | UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | Alexander H. Liu et.al. | 2503.00733 | translate | read | null |
| 2025-03-01 | PodAgent: A Comprehensive Framework for Podcast Generation | Yujia Xiao et.al. | 2503.00455 | translate | read | link |