Audio Processing - 2024-05

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-05-31 | Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction | Jean-Marc Valin et al. | 2405.21069 | translate | read | null |
| 2024-05-30 | DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation | Zachary Novack et al. | 2405.20289 | translate | read | null |
| 2024-05-30 | Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation | Adam Sorrenti et al. | 2405.20059 | translate | read | link |
| 2024-05-30 | Explainable Attribute-Based Speaker Verification | Xiaoliang Wu et al. | 2405.19796 | translate | read | null |
| 2024-05-31 | Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | Vicky Zayats et al. | 2405.18669 | translate | read | null |
| 2024-05-28 | Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR | Shivesh Jadon et al. | 2405.18537 | translate | read | null |
| 2024-05-28 | Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation | Anjanava Biswas et al. | 2405.18346 | translate | read | null |
| 2024-05-28 | NUTS, NARS, and Speech | D. van der Sluis et al. | 2405.17874 | translate | read | null |
| 2024-05-28 | TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Chenyang Le et al. | 2405.17809 | translate | read | null |
| 2024-05-27 | Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients | Mohamed Nabih Ali et al. | 2405.17376 | translate | read | link |
| 2024-05-27 | “Pass the butter”: A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT | Haohua Que et al. | 2405.17250 | translate | read | null |
| 2024-05-27 | RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis | Haoxiang Shi et al. | 2405.17028 | translate | read | null |
| 2024-05-27 | A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition | Zilu Guo et al. | 2405.16952 | translate | read | null |
| 2024-05-24 | Quality-aware Masked Diffusion Transformer for Enhanced Music Generation | Chang Li et al. | 2405.15863 | translate | read | null |
| 2024-05-27 | HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System | Zhisheng Zhang et al. | 2405.15655 | translate | read | null |
| 2024-05-24 | Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | Zijin Gu et al. | 2405.15216 | translate | read | null |
| 2024-05-23 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding | Suyoung Kim et al. | 2405.15097 | translate | read | null |
| 2024-05-23 | Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis | Hui Li et al. | 2405.15093 | translate | read | null |
| 2024-05-23 | Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models | Jingyi Chen et al. | 2405.14632 | translate | read | null |
| 2024-05-23 | Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | Chan-Jan Hsu et al. | 2405.14259 | translate | read | link |
| 2024-05-23 | Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | Yuchen Hu et al. | 2405.14161 | translate | read | null |
| 2024-05-23 | A Survey on Vision-Language-Action Models for Embodied AI | Yueen Ma et al. | 2405.14093 | translate | read | link |
| 2024-05-22 | ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos | Maria Luísa Lima et al. | 2405.13903 | translate | read | null |
| 2024-05-22 | Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation | Muhammad Shakeel et al. | 2405.13514 | translate | read | null |
| 2024-05-22 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction | Yue Li et al. | 2405.13477 | translate | read | null |
| 2024-05-22 | You don’t understand me!: Comparing ASR results for L1 and L2 speakers of Swedish | Ronald Cumbal et al. | 2405.13379 | translate | read | null |
| 2024-05-22 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et al. | 2405.13344 | translate | read | null |
| 2024-05-21 | FairLENS: Assessing Fairness in Law Enforcement Speech Recognition | Yicheng Wang et al. | 2405.13166 | translate | read | null |
| 2024-05-21 | Could a Computer Architect Understand our Brain? | Valentin Puente-Varona et al. | 2405.12815 | translate | read | null |
| 2024-05-21 | SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors | Nicolas Jonason et al. | 2405.12666 | translate | read | null |
| 2024-05-21 | Mamba in Speech: Towards an Alternative to Self-Attention | Xiangyu Zhang et al. | 2405.12609 | translate | read | link |
| 2024-05-20 | Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification | Nian Li et al. | 2405.12031 | translate | read | null |
| 2024-05-20 | Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining | Neena Aloysius et al. | 2405.12018 | translate | read | null |
| 2024-05-20 | Diff-BGM: A Diffusion Model for Video Background Music Generation | Sizhe Li et al. | 2405.11913 | translate | read | null |
| 2024-05-20 | SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model | Siavash Shams et al. | 2405.11831 | translate | read | link |
| 2024-05-17 | Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System | Vimal Manohar et al. | 2405.11078 | translate | read | null |
| 2024-05-17 | Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix | Jixun Yao et al. | 2405.10786 | translate | read | null |
| 2024-05-16 | Speaker Verification in Agent-Generated Conversations | Yizhe Yang et al. | 2405.10150 | translate | read | null |
| 2024-05-16 | Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models | Yuchen Hu et al. | 2405.10025 | translate | read | null |
| 2024-05-16 | Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models | Ziyu Wang et al. | 2405.09901 | translate | read | link |
| 2024-05-16 | Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model | Siyang Wang et al. | 2405.09768 | translate | read | null |
| 2024-05-15 | No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation | Qiaoqiao Ren et al. | 2405.09708 | translate | read | link |
| 2024-05-15 | Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | Weifei Jin et al. | 2405.09470 | translate | read | null |
| 2024-05-15 | Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis | Sho Inoue et al. | 2405.09171 | translate | read | null |
| 2024-05-15 | Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization | Jenthe Thienpondt et al. | 2405.09142 | translate | read | null |
| 2024-05-14 | Investigating the ‘Autoencoder Behavior’ in Speech Self-Supervised Models: a focus on HuBERT’s Pretraining | Valentin Vielzeuf et al. | 2405.08402 | translate | read | null |
| 2024-05-14 | SpeechVerse: A Large-scale Generalizable Audio Language Model | Nilaksh Das et al. | 2405.08295 | translate | read | null |
| 2024-05-13 | Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases | Pengfei Zhang et al. | 2405.07442 | translate | read | null |
| 2024-05-12 | SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | Sushant Gautam et al. | 2405.07354 | translate | read | link |
| 2024-05-11 | Towards an Accessible and Rapidly Trainable Rhythm Sequencer Using a Generative Stacked Autoencoder | Alex Wastnidge et al. | 2405.07034 | translate | read | null |
| 2024-05-11 | A framework of text-dependent speaker verification for chinese numerical string corpus | Litong Zheng et al. | 2405.07029 | translate | read | null |
| 2024-05-10 | DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation | Jie Xu et al. | 2405.06368 | translate | read | null |
| 2024-05-10 | Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech | Dena Mujtaba et al. | 2405.06150 | translate | read | null |
| 2024-05-09 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina et al. | 2405.06134 | translate | read | link |
| 2024-05-09 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge | Jingguang Tian et al. | 2405.05498 | translate | read | null |
| 2024-05-07 | Open Implementation and Study of BEST-RQ for Speech Processing | Ryan Whetten et al. | 2405.04296 | translate | read | link |
| 2024-05-07 | Speaker Characterization by means of Attention Pooling | Federico Costa et al. | 2405.04096 | translate | read | null |
| 2024-05-06 | Whispy: Adapting STT Whisper Models to Real-Time Environments | Antonio Bevilacqua et al. | 2405.03484 | translate | read | null |
| 2024-05-06 | MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition | Bingshen Mu et al. | 2405.03152 | translate | read | null |
| 2024-05-06 | Determined Multichannel Blind Source Separation with Clustered Source Model | Jianyu Wang et al. | 2405.03118 | translate | read | null |
| 2024-05-11 | Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences | Kyudan Jung et al. | 2405.02995 | translate | read | null |
| 2024-05-07 | Mozart’s Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models | Tianze Xu et al. | 2405.02801 | translate | read | link |
| 2024-05-04 | Mixat: A Data Set of Bilingual Emirati-English Speech | Maryam Al Ali et al. | 2405.02578 | translate | read | link |
| 2024-05-06 | Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models | Alessandro Pianese et al. | 2405.02179 | translate | read | null |
| 2024-05-06 | Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets | Xuelong Geng et al. | 2405.02132 | translate | read | null |
| 2024-05-02 | Converting Anyone’s Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model | Zongyang Du et al. | 2405.01730 | translate | read | null |
| 2024-05-01 | Efficient Sample-Specific Encoder Perturbations | Yassir Fathullah et al. | 2405.01601 | translate | read | null |
| 2024-05-02 | Low-resource speech recognition and dialect identification of Irish in a multi-task framework | Liam Lonergan et al. | 2405.01293 | translate | read | null |
| 2024-05-02 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features | Francisco Teixeira et al. | 2405.01207 | translate | read | null |
| 2024-05-02 | Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment | Aditya Chakravarty et al. | 2405.01004 | translate | read | link |
| 2024-05-02 | Efficient Compression of Multitask Multilingual Speech Models | Thomas Palmeira Ferraz et al. | 2405.00966 | translate | read | null |
| 2024-05-02 | MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion | Pengcheng Li et al. | 2405.00930 | translate | read | null |
| 2024-05-01 | Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation | Yimin Deng et al. | 2405.00603 | translate | read | null |
| 2024-05-01 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition | Dongyuan Li et al. | 2405.00307 | translate | read | link |

(<a href=../Audio_Processing.md>back to Audio Processing</a>)