Audio Processing - 2024-05
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-05-31 | Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction | Jean-Marc Valin et.al. | 2405.21069 | translate | read | null |
| 2024-05-30 | DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation | Zachary Novack et.al. | 2405.20289 | translate | read | null |
| 2024-05-30 | Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation | Adam Sorrenti et.al. | 2405.20059 | translate | read | link |
| 2024-05-30 | Explainable Attribute-Based Speaker Verification | Xiaoliang Wu et.al. | 2405.19796 | translate | read | null |
| 2024-05-31 | Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities | Vicky Zayats et.al. | 2405.18669 | translate | read | null |
| 2024-05-28 | Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR | Shivesh Jadon et.al. | 2405.18537 | translate | read | null |
| 2024-05-28 | Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation | Anjanava Biswas et.al. | 2405.18346 | translate | read | null |
| 2024-05-28 | NUTS, NARS, and Speech | D. van der Sluis et.al. | 2405.17874 | translate | read | null |
| 2024-05-28 | TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation | Chenyang Le et.al. | 2405.17809 | translate | read | null |
| 2024-05-27 | Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients | Mohamed Nabih Ali et.al. | 2405.17376 | translate | read | link |
| 2024-05-27 | “Pass the butter”: A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT | Haohua Que et.al. | 2405.17250 | translate | read | null |
| 2024-05-27 | RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis | Haoxiang Shi et.al. | 2405.17028 | translate | read | null |
| 2024-05-27 | A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition | Zilu Guo et.al. | 2405.16952 | translate | read | null |
| 2024-05-24 | Quality-aware Masked Diffusion Transformer for Enhanced Music Generation | Chang Li et.al. | 2405.15863 | translate | read | null |
| 2024-05-27 | HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System | Zhisheng Zhang et.al. | 2405.15655 | translate | read | null |
| 2024-05-24 | Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition | Zijin Gu et.al. | 2405.15216 | translate | read | null |
| 2024-05-23 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding | Suyoung Kim et.al. | 2405.15097 | translate | read | null |
| 2024-05-23 | Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis | Hui Li et.al. | 2405.15093 | translate | read | null |
| 2024-05-23 | Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models | Jingyi Chen et.al. | 2405.14632 | translate | read | null |
| 2024-05-23 | Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition | Chan-Jan Hsu et.al. | 2405.14259 | translate | read | link |
| 2024-05-23 | Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models | Yuchen Hu et.al. | 2405.14161 | translate | read | null |
| 2024-05-23 | A Survey on Vision-Language-Action Models for Embodied AI | Yueen Ma et.al. | 2405.14093 | translate | read | link |
| 2024-05-22 | ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos | Maria Luísa Lima et.al. | 2405.13903 | translate | read | null |
| 2024-05-22 | Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation | Muhammad Shakeel et.al. | 2405.13514 | translate | read | null |
| 2024-05-22 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction | Yue Li et.al. | 2405.13477 | translate | read | null |
| 2024-05-22 | You don’t understand me!: Comparing ASR results for L1 and L2 speakers of Swedish | Ronald Cumbal et.al. | 2405.13379 | translate | read | null |
| 2024-05-22 | Contextualized Automatic Speech Recognition with Dynamic Vocabulary | Yui Sudo et.al. | 2405.13344 | translate | read | null |
| 2024-05-21 | FairLENS: Assessing Fairness in Law Enforcement Speech Recognition | Yicheng Wang et.al. | 2405.13166 | translate | read | null |
| 2024-05-21 | Could a Computer Architect Understand our Brain? | Valentin Puente-Varona et.al. | 2405.12815 | translate | read | null |
| 2024-05-21 | SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors | Nicolas Jonason et.al. | 2405.12666 | translate | read | null |
| 2024-05-21 | Mamba in Speech: Towards an Alternative to Self-Attention | Xiangyu Zhang et.al. | 2405.12609 | translate | read | link |
| 2024-05-20 | Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification | Nian Li et.al. | 2405.12031 | translate | read | null |
| 2024-05-20 | Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining | Neena Aloysius et.al. | 2405.12018 | translate | read | null |
| 2024-05-20 | Diff-BGM: A Diffusion Model for Video Background Music Generation | Sizhe Li et.al. | 2405.11913 | translate | read | null |
| 2024-05-20 | SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model | Siavash Shams et.al. | 2405.11831 | translate | read | link |
| 2024-05-17 | Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System | Vimal Manohar et.al. | 2405.11078 | translate | read | null |
| 2024-05-17 | Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix | Jixun Yao et.al. | 2405.10786 | translate | read | null |
| 2024-05-16 | Speaker Verification in Agent-Generated Conversations | Yizhe Yang et.al. | 2405.10150 | translate | read | null |
| 2024-05-16 | Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models | Yuchen Hu et.al. | 2405.10025 | translate | read | null |
| 2024-05-16 | Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models | Ziyu Wang et.al. | 2405.09901 | translate | read | link |
| 2024-05-16 | Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model | Siyang Wang et.al. | 2405.09768 | translate | read | null |
| 2024-05-15 | No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation | Qiaoqiao Ren et.al. | 2405.09708 | translate | read | link |
| 2024-05-15 | Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer | Weifei Jin et.al. | 2405.09470 | translate | read | null |
| 2024-05-15 | Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis | Sho Inoue et.al. | 2405.09171 | translate | read | null |
| 2024-05-15 | Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization | Jenthe Thienpondt et.al. | 2405.09142 | translate | read | null |
| 2024-05-14 | Investigating the ‘Autoencoder Behavior’ in Speech Self-Supervised Models: a focus on HuBERT’s Pretraining | Valentin Vielzeuf et.al. | 2405.08402 | translate | read | null |
| 2024-05-14 | SpeechVerse: A Large-scale Generalizable Audio Language Model | Nilaksh Das et.al. | 2405.08295 | translate | read | null |
| 2024-05-13 | Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases | Pengfei Zhang et.al. | 2405.07442 | translate | read | null |
| 2024-05-12 | SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset | Sushant Gautam et.al. | 2405.07354 | translate | read | link |
| 2024-05-11 | Towards an Accessible and Rapidly Trainable Rhythm Sequencer Using a Generative Stacked Autoencoder | Alex Wastnidge et.al. | 2405.07034 | translate | read | null |
| 2024-05-11 | A framework of text-dependent speaker verification for chinese numerical string corpus | Litong Zheng et.al. | 2405.07029 | translate | read | null |
| 2024-05-10 | DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation | Jie Xu et.al. | 2405.06368 | translate | read | null |
| 2024-05-10 | Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech | Dena Mujtaba et.al. | 2405.06150 | translate | read | null |
| 2024-05-09 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina et.al. | 2405.06134 | translate | read | link |
| 2024-05-09 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge | Jingguang Tian et.al. | 2405.05498 | translate | read | null |
| 2024-05-07 | Open Implementation and Study of BEST-RQ for Speech Processing | Ryan Whetten et.al. | 2405.04296 | translate | read | link |
| 2024-05-07 | Speaker Characterization by means of Attention Pooling | Federico Costa et.al. | 2405.04096 | translate | read | null |
| 2024-05-06 | Whispy: Adapting STT Whisper Models to Real-Time Environments | Antonio Bevilacqua et.al. | 2405.03484 | translate | read | null |
| 2024-05-06 | MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition | Bingshen Mu et.al. | 2405.03152 | translate | read | null |
| 2024-05-06 | Determined Multichannel Blind Source Separation with Clustered Source Model | Jianyu Wang et.al. | 2405.03118 | translate | read | null |
| 2024-05-11 | Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences | Kyudan Jung et.al. | 2405.02995 | translate | read | null |
| 2024-05-07 | Mozart’s Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models | Tianze Xu et.al. | 2405.02801 | translate | read | link |
| 2024-05-04 | Mixat: A Data Set of Bilingual Emirati-English Speech | Maryam Al Ali et.al. | 2405.02578 | translate | read | link |
| 2024-05-06 | Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models | Alessandro Pianese et.al. | 2405.02179 | translate | read | null |
| 2024-05-06 | Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets | Xuelong Geng et.al. | 2405.02132 | translate | read | null |
| 2024-05-02 | Converting Anyone’s Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model | Zongyang Du et.al. | 2405.01730 | translate | read | null |
| 2024-05-01 | Efficient Sample-Specific Encoder Perturbations | Yassir Fathullah et.al. | 2405.01601 | translate | read | null |
| 2024-05-02 | Low-resource speech recognition and dialect identification of Irish in a multi-task framework | Liam Lonergan et.al. | 2405.01293 | translate | read | null |
| 2024-05-02 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features | Francisco Teixeira et.al. | 2405.01207 | translate | read | null |
| 2024-05-02 | Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment | Aditya Chakravarty et.al. | 2405.01004 | translate | read | link |
| 2024-05-02 | Efficient Compression of Multitask Multilingual Speech Models | Thomas Palmeira Ferraz et.al. | 2405.00966 | translate | read | null |
| 2024-05-02 | MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion | Pengcheng Li et.al. | 2405.00930 | translate | read | null |
| 2024-05-01 | Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation | Yimin Deng et.al. | 2405.00603 | translate | read | null |
| 2024-05-01 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition | Dongyuan Li et.al. | 2405.00307 | translate | read | link |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)