Audio Processing - 2024-07 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF (arXiv ID)	Translate	Read	Code
2024-07-31	Combining audio control and style transfer using latent diffusion	Nils Demerlé et al.	2408.00196	translate	read	null
2024-07-31	The Llama 3 Herd of Models	Abhimanyu Dubey et al.	2407.21783	translate	read	null
2024-07-31	Between the AI and Me: Analysing Listeners’ Perspectives on AI- and Human-Composed Progressive Metal Music	Pedro Sarmento et al.	2407.21615	translate	read	null
2024-07-31	On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition	Nick Rossenbach et al.	2407.21476	translate	read	null
2024-07-31	Towards interfacing large language models with ASR systems using confidence measures and prompting	Maryam Naderi et al.	2407.21414	translate	read	null
2024-07-30	Self-Supervised Models in Automatic Whispered Speech Recognition	Aref Farhadipour et al.	2407.21211	translate	read	null
2024-07-28	ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks	Nakamasa Inoue et al.	2407.21066	translate	read	null
2024-07-30	Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation	Jingyue Huang et al.	2407.20955	translate	read	link
2024-07-29	Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation	Junda Wu et al.	2407.20445	translate	read	null
2024-07-29	Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings	Seungyeon Rhyu et al.	2407.19900	translate	read	null
2024-07-26	Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition	Hukai Huang et al.	2407.18581	translate	read	null
2024-07-29	Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks	Mahmoud Salhab et al.	2407.18571	translate	read	null
2024-07-26	Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models	Neil Shah et al.	2407.18541	translate	read	null
2024-07-26	VoxSim: A perceptual voice similarity dataset	Junseok Ahn et al.	2407.18505	translate	read	null
2024-07-26	Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation	Shiyao Wang et al.	2407.18461	translate	read	link
2024-07-25	On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures	Nick Rossenbach et al.	2407.17997	translate	read	null
2024-07-25	Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization	Ruijie Tao et al.	2407.17902	translate	read	link
2024-07-25	Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions	Jiwon Suh et al.	2407.17874	translate	read	null
2024-07-25	Scaling A Simple Approach to Zero-Shot Speech Recognition	Jinming Zhao et al.	2407.17852	translate	read	link
2024-07-24	Coupling Speech Encoders with Downstream Text Models	Ciprian Chelba et al.	2407.17605	translate	read	null
2024-07-24	A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives	Jan Lehečka et al.	2407.17160	translate	read	null
2024-07-24	Long-Term, Store-Front Robotics: Interactive Music for Robotic Arm, Caxixi and Frame Drums	Richard Savery et al.	2407.16956	translate	read	null
2024-07-23	Quantifying the Role of Textual Predictability in Automatic Speech Recognition	Sean Robertson et al.	2407.16537	translate	read	null
2024-07-23	The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization	Samuele Cornell et al.	2407.16447	translate	read	null
2024-07-23	Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction	Rithik Sachdev et al.	2407.16370	translate	read	link
2024-07-22	dMel: Speech Tokenization made Simple	He Bai et al.	2407.15835	translate	read	null
2024-07-22	Robustness of Speech Separation Models for Similar-pitch Speakers	Bunlong Lay et al.	2407.15749	translate	read	null
2024-07-22	SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios	Hazim Bukhari et al.	2407.15300	translate	read	null
2024-07-21	Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning	Shuai Wang et al.	2407.15188	translate	read	null
2024-07-21	MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation	Yun-Han Lan et al.	2407.15060	translate	read	null
2024-07-20	Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity	Tianhua Qi et al.	2407.14800	translate	read	null
2024-07-21	Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization	Orson Mengara et al.	2407.14573	translate	read	null
2024-07-19	Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio	Roser Batlle-Roca et al.	2407.14364	translate	read	link
2024-07-19	Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings	Praveen Srinivasa Varadhan et al.	2407.14056	translate	read	link
2024-07-19	GE2E-AC: Generalized End-to-End Loss Training for Accent Classification	Chihiro Watanabe et al.	2407.14021	translate	read	null
2024-07-19	MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis	Qian Yang et al.	2407.14006	translate	read	null
2024-07-19	Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance	Changye Li et al.	2407.13982	translate	read	link
2024-07-18	Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models	Weiqin Li et al.	2407.13509	translate	read	null
2024-07-18	Reducing Barriers to the Use of Marginalised Music Genres in AI	Nick Bryan-Kinns et al.	2407.13439	translate	read	null
2024-07-18	Robust ASR Error Correction with Conservative Data Filtering	Takuma Udagawa et al.	2407.13300	translate	read	null
2024-07-18	Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training	Lukuan Dong et al.	2407.13292	translate	read	null
2024-07-18	How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines	Ailin Liu et al.	2407.13266	translate	read	null
2024-07-18	A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR	Jian You et al.	2407.13142	translate	read	null
2024-07-17	Audio Conditioning for Music Generation via Discrete Bottleneck Features	Simon Rouard et al.	2407.12563	translate	read	null
2024-07-17	Morphosyntactic Analysis for CHILDES	Houjun Liu et al.	2407.12389	translate	read	null
2024-07-17	Adaptive Cascading Network for Continual Test-Time Adaptation	Kien X. Nguyen et al.	2407.12240	translate	read	null
2024-07-16	Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models	Minh Nguyen et al.	2407.12094	translate	read	link
2024-07-17	Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors	Julien Hauret et al.	2407.11828	translate	read	link
2024-07-16	Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality	Tina Raissi et al.	2407.11641	translate	read	null
2024-07-16	The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation	Michele Panariello et al.	2407.11516	translate	read	null
2024-07-16	VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark	Yuke Lin et al.	2407.11510	translate	read	null
2024-07-16	Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models	Matthew Perez et al.	2407.11345	translate	read	null
2024-07-15	Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data	Liang-Hsuan Tseng et al.	2407.10603	translate	read	null
2024-07-15	BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features	Jing Luo et al.	2407.10462	translate	read	link
2024-07-14	The Interpretation Gap in Text-to-Music Generation Models	Yongyi Zang et al.	2407.10328	translate	read	null
2024-07-14	Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation	Ruizhe Huang et al.	2407.10303	translate	read	null
2024-07-14	CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR	Wenbo Zhao et al.	2407.10255	translate	read	null
2024-07-14	Textless Dependency Parsing by Labeled Sequence Prediction	Shunsuke Kando et al.	2407.10118	translate	read	link
2024-07-14	Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification	Li Zhang et al.	2407.10048	translate	read	null
2024-07-13	Text-Based Detection of On-Hold Scripts in Contact Center Calls	Dmitrii Galimzianov et al.	2407.09849	translate	read	link
2024-07-13	Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System	Lingwei Meng et al.	2407.09817	translate	read	null
2024-07-13	A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations	Xiangzhu Kong et al.	2407.09807	translate	read	null
2024-07-12	Music Proofreading with RefinPaint: Where and How to Modify Compositions given Context	Pedro Ramoneda et al.	2407.09099	translate	read	link
2024-07-12	Optimization of DNN-based speaker verification model through efficient quantization technique	Yeona Hong et al.	2407.08991	translate	read	null
2024-07-10	Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks	Lucca Emmanuel Pineli Simões et al.	2407.08658	translate	read	null
2024-07-11	Tamil Language Computing: the Present and the Future	Kengatharaiyer Sarveswaran et al.	2407.08618	translate	read	null
2024-07-11	Autoregressive Speech Synthesis without Vector Quantization	Lingwei Meng et al.	2407.08551	translate	read	null
2024-07-11	Toward accessible comics for blind and low vision readers	Christophe Rigaud et al.	2407.08248	translate	read	null
2024-07-10	Phonetic Richness for Improved Automatic Speaker Verification	Nicholas Klein et al.	2407.08017	translate	read	null
2024-07-10	Source Tracing of Audio Deepfake Systems	Nicholas Klein et al.	2407.08016	translate	read	null
2024-07-11	SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis	Zihao Wang et al.	2407.07728	translate	read	link
2024-07-10	HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing	Arnon Turetzky et al.	2407.07566	translate	read	null
2024-07-09	Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support	Karn N. Watcharasupat et al.	2407.07275	translate	read	null
2024-07-09	Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology	Robin Netzorg et al.	2407.07235	translate	read	null
2024-07-09	Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models	Yi-Cheng Lin et al.	2407.06957	translate	read	link
2024-07-09	Tailored Design of Audio-Visual Speech Recognition Models using Branchformers	David Gimeno-Gómez et al.	2407.06606	translate	read	link
2024-07-08	Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation	Mengzhe Geng et al.	2407.06310	translate	read	null
2024-07-08	Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection	Zhenchun Lei et al.	2407.05605	translate	read	null
2024-07-07	Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation	Jin Woo Lee et al.	2407.05516	translate	read	null
2024-07-07	Fine-Grained and Interpretable Neural Speech Editing	Max Morrison et al.	2407.05471	translate	read	null
2024-07-09	CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens	Zhihao Du et al.	2407.05407	translate	read	null
2024-07-06	A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining	Feiyang Xiao et al.	2407.04936	translate	read	null
2024-07-05	MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study	Rajat Bhattacharjya et al.	2407.04849	translate	read	null
2024-07-05	Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition	Ye Bai et al.	2407.04675	translate	read	null
2024-07-05	Multitaper mel-spectrograms for keyword spotting	Douglas Baptista de Souza et al.	2407.04662	translate	read	null
2024-07-05	Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units	Bolaji Yusuf et al.	2407.04652	translate	read	link
2024-07-05	Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models	Bolaji Yusuf et al.	2407.04641	translate	read	null
2024-07-05	Written Term Detection Improves Spoken Term Detection	Bolaji Yusuf et al.	2407.04601	translate	read	link
2024-07-05	FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder	Rubing Shen et al.	2407.04575	translate	read	null
2024-07-05	Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect	Salima Mdhaffar et al.	2407.04533	translate	read	null
2024-07-05	Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models	Vyas Raina et al.	2407.04482	translate	read	null
2024-07-05	XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models	Shashi Kumar et al.	2407.04439	translate	read	null
2024-07-05	Romanization Encoding For Multilingual ASR	Wen Ding et al.	2407.04368	translate	read	null
2024-07-03	GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification	Hui Yan et al.	2407.03135	translate	read	null
2024-07-03	Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition	Jinming Chen et al.	2407.03026	translate	read	null
2024-07-03	Probing the Feasibility of Multilingual Speaker Anonymization	Sarina Meyer et al.	2407.02937	translate	read	link
2024-07-02	Towards the Next Frontier in Speech Representation Learning Using Disentanglement	Varun Krishna et al.	2407.02543	translate	read	null
2024-07-02	Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization	Yuchen Hu et al.	2407.02243	translate	read	null
2024-07-02	The USTC-NERCSLIP Systems for The ICMC-ASR Challenge	Minghui Wu et al.	2407.02052	translate	read	null
2024-07-02	Accompanied Singing Voice Synthesis with Fully Text-controlled Melody	Ruiqi Li et al.	2407.02049	translate	read	null
2024-07-02	Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models	Zhiyuan Tang et al.	2407.01909	translate	read	link
2024-07-01	Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting	Scott H. Hawley et al.	2407.01499	translate	read	null
2024-07-01	Lightweight Zero-shot Text-to-Speech with Mixture of Adapters	Kenichi Fujita et al.	2407.01291	translate	read	null
