Audio Processing - 2024-11 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2024-11-30	From Audio Deepfake Detection to AI-Generated Music Detection – A Pathway and Overview	Yupei Li et.al.	2412.00571	translate	read	null
2024-11-30	Sample adaptive data augmentation with progressive scheduling	Hongxuan Lu et.al.	2412.00415	translate	read	null
2024-11-30	Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models	Nadeen Fathallah et.al.	2412.00342	translate	read	null
2024-11-30	MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI	Jongmin Jung et.al.	2412.00325	translate	read	null
2024-11-30	Improving speaker verification robustness with synthetic emotional utterances	Nikhil Kumar Koditala et.al.	2412.00319	translate	read	null
2024-11-29	Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities	Haorui He et.al.	2411.19770	translate	read	null
2024-11-29	Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency	Akshaya Rajesh et.al.	2411.19611	translate	read	null
2024-11-28	ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words	Hazem Darwish et.al.	2411.18888	translate	read	null
2024-11-27	EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement	Suzanne Oliver et.al.	2411.18587	translate	read	null
2024-11-27	AMPS: ASR with Multimodal Paraphrase Supervision	Amruta Parulekar et.al.	2411.18368	translate	read	null
2024-11-27	Continual Learning in Machine Speech Chain Using Gradient Episodic Memory	Geoffrey Tyndall et.al.	2411.18320	translate	read	null
2024-11-27	Aligning Pre-trained Models for Spoken Language Translation	Šimon Sedláček et.al.	2411.18294	translate	read	null
2024-11-27	Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks	Junyi Yang et.al.	2411.18271	translate	read	null
2024-11-27	How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario	Shih-Heng Wang et.al.	2411.18217	translate	read	null
2024-11-27	Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting	Emma Reyner-Fuentes et.al.	2411.18177	translate	read	null
2024-11-27	MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models	Thai-Binh Nguyen et.al.	2411.18152	translate	read	null
2024-11-27	SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation	Wenyi Yu et.al.	2411.18138	translate	read	null
2024-11-27	Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition	Shih-heng Wang et.al.	2411.18107	translate	read	null
2024-11-26	Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis	Akshita Gupta et.al.	2411.17690	translate	read	null
2024-11-26	Scaling Speech-Text Pre-training with Synthetic Interleaved Data	Aohan Zeng et.al.	2411.17607	translate	read	null
2024-11-26	Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition	Hyeonseung Lee et.al.	2411.17537	translate	read	null
2024-11-26	Comparative Analysis of ASR Methods for Speech Deepfake Detection	Davide Salvi et.al.	2411.17349	translate	read	null
2024-11-26	k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning	Yifan Yang et.al.	2411.17100	translate	read	null
2024-11-25	Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN	Elona Shatri et.al.	2411.16405	translate	read	null
2024-11-25	The SVASR System for Text-dependent Speaker Verification (TdSV) AAIC Challenge 2024	Mohammadreza Molavi et.al.	2411.16276	translate	read	null
2024-11-25	SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations	Youngjun Sim et.al.	2411.16147	translate	read	null
2024-11-24	A Training-Free Approach for Music Style Transfer with Latent Diffusion Models	Sooyoung Kim et.al.	2411.15913	translate	read	null
2024-11-22	Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering	Mostafa Varzaneh et.al.	2411.15372	translate	read	null
2024-11-22	Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network	Irfan Nafiz Shahan et.al.	2411.15082	translate	read	link
2024-11-22	VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space	Armani Rodriguez et.al.	2411.14642	translate	read	null
2024-11-21	Generative AI for Music and Audio	Hao-Wen Dong et.al.	2411.14627	translate	read	null
2024-11-20	From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language	Muhammad Sharif et.al.	2411.14493	translate	read	null
2024-11-21	Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge	Ruiyang Qin et.al.	2411.13766	translate	read	null
2024-11-18	A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children	Lamia Berriche et.al.	2411.13592	translate	read	null
2024-11-20	CAFE A Novel Code switching Dataset for Algerian Dialect French and English	Houssam Eddine-Othman Lachemat et.al.	2411.13424	translate	read	null
2024-11-20	I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception	Jiawei Zhang et.al.	2411.13314	translate	read	null
2024-11-20	Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM	Jiawei Yu et.al.	2411.13159	translate	read	null
2024-11-21	Improving Controllability and Editability for Pretrained Text-to-Music Generation Models	Yixiao Zhang et.al.	2411.12641	translate	read	null
2024-11-19	Whisper Finetuning on Nepali Language	Sanjay Rijal et.al.	2411.12587	translate	read	null
2024-11-18	An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems	Jingyu Li et.al.	2411.11353	translate	read	null
2024-11-18	Study of the Performance of CEEMDAN in Underdetermined Speech Separation	Rawad Melhem et.al.	2411.11312	translate	read	null
2024-11-18	SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features	Yu-Fei Shi et.al.	2411.11232	translate	read	null
2024-11-17	Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation	Jisang Park et.al.	2411.10927	translate	read	null
2024-11-16	BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization	Md. Nazmus Sadat Samin et.al.	2411.10879	translate	read	link
2024-11-16	Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024	Seyed Ali Farokh et.al.	2411.10828	translate	read	null
2024-11-15	SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers	Joseph Liu et.al.	2411.10510	translate	read	link
2024-11-15	Interactive Cycle Model – The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses	Libo Wang et.al.	2411.10362	translate	read	null
2024-11-15	Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems	Pedro Palacios et.al.	2411.10285	translate	read	null
2024-11-15	DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization	Christos Koutlis et.al.	2411.10193	translate	read	link
2024-11-15	XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection	Yang Xiao et.al.	2411.10027	translate	read	link
2024-11-15	Zero-shot Voice Conversion with Diffusion Transformers	Songting Liu et.al.	2411.09943	translate	read	null
2024-11-14	Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data	Rik Raes et.al.	2411.09431	translate	read	null
2024-11-14	Transferable Adversarial Attacks against ASR	Xiaoxue Gao et.al.	2411.09220	translate	read	null
2024-11-14	Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation	Kuiyuan Zhang et.al.	2411.09167	translate	read	null
2024-11-13	Language Models for Music Medicine Generation	Emmanouil Nikolakakis et.al.	2411.09080	translate	read	null
2024-11-14	Evaluating Synthetic Command Attacks on Smart Voice Assistants	Zhengxian He et.al.	2411.08316	translate	read	null
2024-11-13	PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation	Yungang Yi et.al.	2411.08307	translate	read	null
2024-11-11	Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition	Yoshiki Masuyama et.al.	2411.06968	translate	read	link
2024-11-11	DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions	Shu-Tong Niu et.al.	2411.06667	translate	read	null
2024-11-10	Debatts: Zero-Shot Debating Text-to-Speech Synthesis	Yiqiao Huang et.al.	2411.06540	translate	read	null
2024-11-10	CTC-Assisted LLM-Based Contextual ASR	Guanrou Yang et.al.	2411.06437	translate	read	link
2024-11-07	Dialectal Coverage And Generalization in Arabic Speech Recognition	Amirbek Djanibekov et.al.	2411.05872	translate	read	null
2024-11-07	Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models	Chuqiao Song et.al.	2411.04862	translate	read	null
2024-11-07	Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages	Leena G Pillai et.al.	2411.04573	translate	read	null
2024-11-06	Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks	Felipe Marra et.al.	2411.03948	translate	read	null
2024-11-04	Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs	Alexandros Haliassos et.al.	2411.02256	translate	read	link
2024-11-04	Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data	Sofiane Azzouz et.al.	2411.02037	translate	read	null
2024-11-04	CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching	Yu Pan et.al.	2411.02026	translate	read	null
2024-11-04	MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence	Fuming You et.al.	2411.01805	translate	read	null
2024-11-03	SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation	Dennis Fucci et.al.	2411.01710	translate	read	null
2024-11-02	Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection	Han Yin et.al.	2411.01174	translate	read	link
2024-11-02	Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis	Shijia Liao et.al.	2411.01156	translate	read	link
2024-11-01	Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO	Macarious Hui et.al.	2411.00980	translate	read	null
2024-11-04	Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval	Nikolaos Flemotomos et.al.	2411.00664	translate	read	null