Audio Processing - 2024-07

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-07-31 | Combining audio control and style transfer using latent diffusion | Nils Demerlé et.al. | 2408.00196 | translate | read | null |
| 2024-07-31 | The Llama 3 Herd of Models | Abhimanyu Dubey et.al. | 2407.21783 | translate | read | null |
| 2024-07-31 | Between the AI and Me: Analysing Listeners’ Perspectives on AI- and Human-Composed Progressive Metal Music | Pedro Sarmento et.al. | 2407.21615 | translate | read | null |
| 2024-07-31 | On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition | Nick Rossenbach et.al. | 2407.21476 | translate | read | null |
| 2024-07-31 | Towards interfacing large language models with ASR systems using confidence measures and prompting | Maryam Naderi et.al. | 2407.21414 | translate | read | null |
| 2024-07-30 | Self-Supervised Models in Automatic Whispered Speech Recognition | Aref Farhadipour et.al. | 2407.21211 | translate | read | null |
| 2024-07-28 | ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks | Nakamasa Inoue et.al. | 2407.21066 | translate | read | null |
| 2024-07-30 | Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation | Jingyue Huang et.al. | 2407.20955 | translate | read | link |
| 2024-07-29 | Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation | Junda Wu et.al. | 2407.20445 | translate | read | null |
| 2024-07-29 | Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings | Seungyeon Rhyu et.al. | 2407.19900 | translate | read | null |
| 2024-07-26 | Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition | Hukai Huang et.al. | 2407.18581 | translate | read | null |
| 2024-07-29 | Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | Mahmoud Salhab et.al. | 2407.18571 | translate | read | null |
| 2024-07-26 | Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models | Neil Shah et.al. | 2407.18541 | translate | read | null |
| 2024-07-26 | VoxSim: A perceptual voice similarity dataset | Junseok Ahn et.al. | 2407.18505 | translate | read | null |
| 2024-07-26 | Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation | Shiyao Wang et.al. | 2407.18461 | translate | read | link |
| 2024-07-25 | On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures | Nick Rossenbach et.al. | 2407.17997 | translate | read | null |
| 2024-07-25 | Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization | Ruijie Tao et.al. | 2407.17902 | translate | read | link |
| 2024-07-25 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions | Jiwon Suh et.al. | 2407.17874 | translate | read | null |
| 2024-07-25 | Scaling A Simple Approach to Zero-Shot Speech Recognition | Jinming Zhao et.al. | 2407.17852 | translate | read | link |
| 2024-07-24 | Coupling Speech Encoders with Downstream Text Models | Ciprian Chelba et.al. | 2407.17605 | translate | read | null |
| 2024-07-24 | A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives | Jan Lehečka et.al. | 2407.17160 | translate | read | null |
| 2024-07-24 | Long-Term, Store-Front Robotics: Interactive Music for Robotic Arm, Caxixi and Frame Drums | Richard Savery et.al. | 2407.16956 | translate | read | null |
| 2024-07-23 | Quantifying the Role of Textual Predictability in Automatic Speech Recognition | Sean Robertson et.al. | 2407.16537 | translate | read | null |
| 2024-07-23 | The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization | Samuele Cornell et.al. | 2407.16447 | translate | read | null |
| 2024-07-23 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction | Rithik Sachdev et.al. | 2407.16370 | translate | read | link |
| 2024-07-22 | dMel: Speech Tokenization made Simple | He Bai et.al. | 2407.15835 | translate | read | null |
| 2024-07-22 | Robustness of Speech Separation Models for Similar-pitch Speakers | Bunlong Lay et.al. | 2407.15749 | translate | read | null |
| 2024-07-22 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios | Hazim Bukhari et.al. | 2407.15300 | translate | read | null |
| 2024-07-21 | Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning | Shuai Wang et.al. | 2407.15188 | translate | read | null |
| 2024-07-21 | MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation | Yun-Han Lan et.al. | 2407.15060 | translate | read | null |
| 2024-07-20 | Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity | Tianhua Qi et.al. | 2407.14800 | translate | read | null |
| 2024-07-21 | Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization | Orson Mengara et.al. | 2407.14573 | translate | read | null |
| 2024-07-19 | Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio | Roser Batlle-Roca et.al. | 2407.14364 | translate | read | link |
| 2024-07-19 | Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings | Praveen Srinivasa Varadhan et.al. | 2407.14056 | translate | read | link |
| 2024-07-19 | GE2E-AC: Generalized End-to-End Loss Training for Accent Classification | Chihiro Watanabe et.al. | 2407.14021 | translate | read | null |
| 2024-07-19 | MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis | Qian Yang et.al. | 2407.14006 | translate | read | null |
| 2024-07-19 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance | Changye Li et.al. | 2407.13982 | translate | read | link |
| 2024-07-18 | Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | Weiqin Li et.al. | 2407.13509 | translate | read | null |
| 2024-07-18 | Reducing Barriers to the Use of Marginalised Music Genres in AI | Nick Bryan-Kinns et.al. | 2407.13439 | translate | read | null |
| 2024-07-18 | Robust ASR Error Correction with Conservative Data Filtering | Takuma Udagawa et.al. | 2407.13300 | translate | read | null |
| 2024-07-18 | Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training | Lukuan Dong et.al. | 2407.13292 | translate | read | null |
| 2024-07-18 | How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines | Ailin Liu et.al. | 2407.13266 | translate | read | null |
| 2024-07-18 | A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR | Jian You et.al. | 2407.13142 | translate | read | null |
| 2024-07-17 | Audio Conditioning for Music Generation via Discrete Bottleneck Features | Simon Rouard et.al. | 2407.12563 | translate | read | null |
| 2024-07-17 | Morphosyntactic Analysis for CHILDES | Houjun Liu et.al. | 2407.12389 | translate | read | null |
| 2024-07-17 | Adaptive Cascading Network for Continual Test-Time Adaptation | Kien X. Nguyen et.al. | 2407.12240 | translate | read | null |
| 2024-07-16 | Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models | Minh Nguyen et.al. | 2407.12094 | translate | read | link |
| 2024-07-17 | Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors | Julien Hauret et.al. | 2407.11828 | translate | read | link |
| 2024-07-16 | Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality | Tina Raissi et.al. | 2407.11641 | translate | read | null |
| 2024-07-16 | The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation | Michele Panariello et.al. | 2407.11516 | translate | read | null |
| 2024-07-16 | VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark | Yuke Lin et.al. | 2407.11510 | translate | read | null |
| 2024-07-16 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models | Matthew Perez et.al. | 2407.11345 | translate | read | null |
| 2024-07-15 | Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data | Liang-Hsuan Tseng et.al. | 2407.10603 | translate | read | null |
| 2024-07-15 | BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features | Jing Luo et.al. | 2407.10462 | translate | read | link |
| 2024-07-14 | The Interpretation Gap in Text-to-Music Generation Models | Yongyi Zang et.al. | 2407.10328 | translate | read | null |
| 2024-07-14 | Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation | Ruizhe Huang et.al. | 2407.10303 | translate | read | null |
| 2024-07-14 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR | Wenbo Zhao et.al. | 2407.10255 | translate | read | null |
| 2024-07-14 | Textless Dependency Parsing by Labeled Sequence Prediction | Shunsuke Kando et.al. | 2407.10118 | translate | read | link |
| 2024-07-14 | Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification | Li Zhang et.al. | 2407.10048 | translate | read | null |
| 2024-07-13 | Text-Based Detection of On-Hold Scripts in Contact Center Calls | Dmitrii Galimzianov et.al. | 2407.09849 | translate | read | link |
| 2024-07-13 | Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System | Lingwei Meng et.al. | 2407.09817 | translate | read | null |
| 2024-07-13 | A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations | Xiangzhu Kong et.al. | 2407.09807 | translate | read | null |
| 2024-07-12 | Music Proofreading with RefinPaint: Where and How to Modify Compositions given Context | Pedro Ramoneda et.al. | 2407.09099 | translate | read | link |
| 2024-07-12 | Optimization of DNN-based speaker verification model through efficient quantization technique | Yeona Hong et.al. | 2407.08991 | translate | read | null |
| 2024-07-10 | Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks | Lucca Emmanuel Pineli Simões et.al. | 2407.08658 | translate | read | null |
| 2024-07-11 | Tamil Language Computing: the Present and the Future | Kengatharaiyer Sarveswaran et.al. | 2407.08618 | translate | read | null |
| 2024-07-11 | Autoregressive Speech Synthesis without Vector Quantization | Lingwei Meng et.al. | 2407.08551 | translate | read | null |
| 2024-07-11 | Toward accessible comics for blind and low vision readers | Christophe Rigaud et.al. | 2407.08248 | translate | read | null |
| 2024-07-10 | Phonetic Richness for Improved Automatic Speaker Verification | Nicholas Klein et.al. | 2407.08017 | translate | read | null |
| 2024-07-10 | Source Tracing of Audio Deepfake Systems | Nicholas Klein et.al. | 2407.08016 | translate | read | null |
| 2024-07-11 | SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis | Zihao Wang et.al. | 2407.07728 | translate | read | link |
| 2024-07-10 | HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing | Arnon Turetzky et.al. | 2407.07566 | translate | read | null |
| 2024-07-09 | Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support | Karn N. Watcharasupat et.al. | 2407.07275 | translate | read | null |
| 2024-07-09 | Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology | Robin Netzorg et.al. | 2407.07235 | translate | read | null |
| 2024-07-09 | Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models | Yi-Cheng Lin et.al. | 2407.06957 | translate | read | link |
| 2024-07-09 | Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | David Gimeno-Gómez et.al. | 2407.06606 | translate | read | link |
| 2024-07-08 | Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation | Mengzhe Geng et.al. | 2407.06310 | translate | read | null |
| 2024-07-08 | Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection | Zhenchun Lei et.al. | 2407.05605 | translate | read | null |
| 2024-07-07 | Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation | Jin Woo Lee et.al. | 2407.05516 | translate | read | null |
| 2024-07-07 | Fine-Grained and Interpretable Neural Speech Editing | Max Morrison et.al. | 2407.05471 | translate | read | null |
| 2024-07-09 | CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Zhihao Du et.al. | 2407.05407 | translate | read | null |
| 2024-07-06 | A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining | Feiyang Xiao et.al. | 2407.04936 | translate | read | null |
| 2024-07-05 | MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study | Rajat Bhattacharjya et.al. | 2407.04849 | translate | read | null |
| 2024-07-05 | Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition | Ye Bai et.al. | 2407.04675 | translate | read | null |
| 2024-07-05 | Multitaper mel-spectrograms for keyword spotting | Douglas Baptista de Souza et.al. | 2407.04662 | translate | read | null |
| 2024-07-05 | Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units | Bolaji Yusuf et.al. | 2407.04652 | translate | read | link |
| 2024-07-05 | Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models | Bolaji Yusuf et.al. | 2407.04641 | translate | read | null |
| 2024-07-05 | Written Term Detection Improves Spoken Term Detection | Bolaji Yusuf et.al. | 2407.04601 | translate | read | link |
| 2024-07-05 | FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder | Rubing Shen et.al. | 2407.04575 | translate | read | null |
| 2024-07-05 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect | Salima Mdhaffar et.al. | 2407.04533 | translate | read | null |
| 2024-07-05 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models | Vyas Raina et.al. | 2407.04482 | translate | read | null |
| 2024-07-05 | XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models | Shashi Kumar et.al. | 2407.04439 | translate | read | null |
| 2024-07-05 | Romanization Encoding For Multilingual ASR | Wen Ding et.al. | 2407.04368 | translate | read | null |
| 2024-07-03 | GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification | Hui Yan et.al. | 2407.03135 | translate | read | null |
| 2024-07-03 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition | Jinming Chen et.al. | 2407.03026 | translate | read | null |
| 2024-07-03 | Probing the Feasibility of Multilingual Speaker Anonymization | Sarina Meyer et.al. | 2407.02937 | translate | read | link |
| 2024-07-02 | Towards the Next Frontier in Speech Representation Learning Using Disentanglement | Varun Krishna et.al. | 2407.02543 | translate | read | null |
| 2024-07-02 | Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization | Yuchen Hu et.al. | 2407.02243 | translate | read | null |
| 2024-07-02 | The USTC-NERCSLIP Systems for The ICMC-ASR Challenge | Minghui Wu et.al. | 2407.02052 | translate | read | null |
| 2024-07-02 | Accompanied Singing Voice Synthesis with Fully Text-controlled Melody | Ruiqi Li et.al. | 2407.02049 | translate | read | null |
| 2024-07-02 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2407.01909 | translate | read | link |
| 2024-07-01 | Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting | Scott H. Hawley et.al. | 2407.01499 | translate | read | null |
| 2024-07-01 | Lightweight Zero-shot Text-to-Speech with Mixture of Adapters | Kenichi Fujita et.al. | 2407.01291 | translate | read | null |

(<a href=../Audio_Processing.md>back to Audio Processing</a>)