Audio Processing - 2024-05 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Translate	Read	Code
2024-05-31	Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction	Jean-Marc Valin et.al.	2405.21069	translate	read	null
2024-05-30	DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation	Zachary Novack et.al.	2405.20289	translate	read	null
2024-05-30	Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation	Adam Sorrenti et.al.	2405.20059	translate	read	link
2024-05-30	Explainable Attribute-Based Speaker Verification	Xiaoliang Wu et.al.	2405.19796	translate	read	null
2024-05-31	Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities	Vicky Zayats et.al.	2405.18669	translate	read	null
2024-05-28	Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR	Shivesh Jadon et.al.	2405.18537	translate	read	null
2024-05-28	Intelligent Clinical Documentation: Harnessing Generative AI for Patient-Centric Clinical Note Generation	Anjanava Biswas et.al.	2405.18346	translate	read	null
2024-05-28	NUTS, NARS, and Speech	D. van der Sluis et.al.	2405.17874	translate	read	null
2024-05-28	TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation	Chenyang Le et.al.	2405.17809	translate	read	null
2024-05-27	Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients	Mohamed Nabih Ali et.al.	2405.17376	translate	read	link
2024-05-27	“Pass the butter”: A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT	Haohua Que et.al.	2405.17250	translate	read	null
2024-05-27	RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis	Haoxiang Shi et.al.	2405.17028	translate	read	null
2024-05-27	A Variance-Preserving Interpolation Approach for Diffusion Models with Applications to Single Channel Speech Enhancement and Recognition	Zilu Guo et.al.	2405.16952	translate	read	null
2024-05-24	Quality-aware Masked Diffusion Transformer for Enhanced Music Generation	Chang Li et.al.	2405.15863	translate	read	null
2024-05-27	HiddenSpeaker: Generate Imperceptible Unlearnable Audios for Speaker Verification System	Zhisheng Zhang et.al.	2405.15655	translate	read	null
2024-05-24	Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition	Zijin Gu et.al.	2405.15216	translate	read	null
2024-05-23	Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding	Suyoung Kim et.al.	2405.15097	translate	read	null
2024-05-23	Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis	Hui Li et.al.	2405.15093	translate	read	null
2024-05-23	Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models	Jingyi Chen et.al.	2405.14632	translate	read	null
2024-05-23	Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition	Chan-Jan Hsu et.al.	2405.14259	translate	read	link
2024-05-23	Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models	Yuchen Hu et.al.	2405.14161	translate	read	null
2024-05-23	A Survey on Vision-Language-Action Models for Embodied AI	Yueen Ma et.al.	2405.14093	translate	read	link
2024-05-22	ST-Gait++: Leveraging spatio-temporal convolutions for gait-based emotion recognition on videos	Maria Luísa Lima et.al.	2405.13903	translate	read	null
2024-05-22	Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation	Muhammad Shakeel et.al.	2405.13514	translate	read	null
2024-05-22	A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction	Yue Li et.al.	2405.13477	translate	read	null
2024-05-22	You don’t understand me!: Comparing ASR results for L1 and L2 speakers of Swedish	Ronald Cumbal et.al.	2405.13379	translate	read	null
2024-05-22	Contextualized Automatic Speech Recognition with Dynamic Vocabulary	Yui Sudo et.al.	2405.13344	translate	read	null
2024-05-21	FairLENS: Assessing Fairness in Law Enforcement Speech Recognition	Yicheng Wang et.al.	2405.13166	translate	read	null
2024-05-21	Could a Computer Architect Understand our Brain?	Valentin Puente-Varona et.al.	2405.12815	translate	read	null
2024-05-21	SYMPLEX: Controllable Symbolic Music Generation using Simplex Diffusion with Vocabulary Priors	Nicolas Jonason et.al.	2405.12666	translate	read	null
2024-05-21	Mamba in Speech: Towards an Alternative to Self-Attention	Xiangyu Zhang et.al.	2405.12609	translate	read	link
2024-05-20	Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification	Nian Li et.al.	2405.12031	translate	read	null
2024-05-20	Continuous Sign Language Recognition with Adapted Conformer via Unsupervised Pretraining	Neena Aloysius et.al.	2405.12018	translate	read	null
2024-05-20	Diff-BGM: A Diffusion Model for Video Background Music Generation	Sizhe Li et.al.	2405.11913	translate	read	null
2024-05-20	SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model	Siavash Shams et.al.	2405.11831	translate	read	link
2024-05-17	Acoustic modeling for Overlapping Speech Recognition: JHU Chime-5 Challenge System	Vimal Manohar et.al.	2405.11078	translate	read	null
2024-05-17	Distinctive and Natural Speaker Anonymization via Singular Value Transformation-assisted Matrix	Jixun Yao et.al.	2405.10786	translate	read	null
2024-05-16	Speaker Verification in Agent-Generated Conversations	Yizhe Yang et.al.	2405.10150	translate	read	null
2024-05-16	Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models	Yuchen Hu et.al.	2405.10025	translate	read	null
2024-05-16	Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models	Ziyu Wang et.al.	2405.09901	translate	read	link
2024-05-16	Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model	Siyang Wang et.al.	2405.09768	translate	read	null
2024-05-15	No More Mumbles: Enhancing Robot Intelligibility through Speech Adaptation	Qiaoqiao Ren et.al.	2405.09708	translate	read	link
2024-05-15	Towards Evaluating the Robustness of Automatic Speech Recognition Systems via Audio Style Transfer	Weifei Jin et.al.	2405.09470	translate	read	null
2024-05-15	Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis	Sho Inoue et.al.	2405.09171	translate	read	null
2024-05-15	Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization	Jenthe Thienpondt et.al.	2405.09142	translate	read	null
2024-05-14	Investigating the ‘Autoencoder Behavior’ in Speech Self-Supervised Models: a focus on HuBERT’s Pretraining	Valentin Vielzeuf et.al.	2405.08402	translate	read	null
2024-05-14	SpeechVerse: A Large-scale Generalizable Audio Language Model	Nilaksh Das et.al.	2405.08295	translate	read	null
2024-05-13	Rene: A Pre-trained Multi-modal Architecture for Auscultation of Respiratory Diseases	Pengfei Zhang et.al.	2405.07442	translate	read	null
2024-05-12	SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset	Sushant Gautam et.al.	2405.07354	translate	read	link
2024-05-11	Towards an Accessible and Rapidly Trainable Rhythm Sequencer Using a Generative Stacked Autoencoder	Alex Wastnidge et.al.	2405.07034	translate	read	null
2024-05-11	A framework of text-dependent speaker verification for chinese numerical string corpus	Litong Zheng et.al.	2405.07029	translate	read	null
2024-05-10	DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation	Jie Xu et.al.	2405.06368	translate	read	null
2024-05-10	Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech	Dena Mujtaba et.al.	2405.06150	translate	read	null
2024-05-09	Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models	Vyas Raina et.al.	2405.06134	translate	read	link
2024-05-09	The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge	Jingguang Tian et.al.	2405.05498	translate	read	null
2024-05-07	Open Implementation and Study of BEST-RQ for Speech Processing	Ryan Whetten et.al.	2405.04296	translate	read	link
2024-05-07	Speaker Characterization by means of Attention Pooling	Federico Costa et.al.	2405.04096	translate	read	null
2024-05-06	Whispy: Adapting STT Whisper Models to Real-Time Environments	Antonio Bevilacqua et.al.	2405.03484	translate	read	null
2024-05-06	MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition	Bingshen Mu et.al.	2405.03152	translate	read	null
2024-05-06	Determined Multichannel Blind Source Separation with Clustered Source Model	Jianyu Wang et.al.	2405.03118	translate	read	null
2024-05-11	Analysis about Theoretical Foundations for Method to Enhancing ASR Performance using OCR Word Frequency Differences	Kyudan Jung et.al.	2405.02995	translate	read	null
2024-05-07	Mozart’s Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models	Tianze Xu et.al.	2405.02801	translate	read	link
2024-05-04	Mixat: A Data Set of Bilingual Emirati-English Speech	Maryam Al Ali et.al.	2405.02578	translate	read	link
2024-05-06	Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models	Alessandro Pianese et.al.	2405.02179	translate	read	null
2024-05-06	Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets	Xuelong Geng et.al.	2405.02132	translate	read	null
2024-05-02	Converting Anyone’s Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model	Zongyang Du et.al.	2405.01730	translate	read	null
2024-05-01	Efficient Sample-Specific Encoder Perturbations	Yassir Fathullah et.al.	2405.01601	translate	read	null
2024-05-02	Low-resource speech recognition and dialect identification of Irish in a multi-task framework	Liam Lonergan et.al.	2405.01293	translate	read	null
2024-05-02	Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features	Francisco Teixeira et.al.	2405.01207	translate	read	null
2024-05-02	Deep Learning Models in Speech Recognition: Measuring GPU Energy Consumption, Impact of Noise and Model Quantization for Edge Deployment	Aditya Chakravarty et.al.	2405.01004	translate	read	link
2024-05-02	Efficient Compression of Multitask Multilingual Speech Models	Thomas Palmeira Ferraz et.al.	2405.00966	translate	read	null
2024-05-02	MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion	Pengcheng Li et.al.	2405.00930	translate	read	null
2024-05-01	Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation	Yimin Deng et.al.	2405.00603	translate	read	null
2024-05-01	Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition	Dongyuan Li et.al.	2405.00307	translate	read	link
