Audio Processing - 2024-07
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-07-31 | Combining audio control and style transfer using latent diffusion | Nils Demerlé et.al. | 2408.00196 | translate | read | null |
| 2024-07-31 | The Llama 3 Herd of Models | Abhimanyu Dubey et.al. | 2407.21783 | translate | read | null |
| 2024-07-31 | Between the AI and Me: Analysing Listeners’ Perspectives on AI- and Human-Composed Progressive Metal Music | Pedro Sarmento et.al. | 2407.21615 | translate | read | null |
| 2024-07-31 | On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition | Nick Rossenbach et.al. | 2407.21476 | translate | read | null |
| 2024-07-31 | Towards interfacing large language models with ASR systems using confidence measures and prompting | Maryam Naderi et.al. | 2407.21414 | translate | read | null |
| 2024-07-30 | Self-Supervised Models in Automatic Whispered Speech Recognition | Aref Farhadipour et.al. | 2407.21211 | translate | read | null |
| 2024-07-28 | ELP-Adapters: Parameter Efficient Adapter Tuning for Various Speech Processing Tasks | Nakamasa Inoue et.al. | 2407.21066 | translate | read | null |
| 2024-07-30 | Emotion-driven Piano Music Generation via Two-stage Disentanglement and Functional Representation | Jingyue Huang et.al. | 2407.20955 | translate | read | link |
| 2024-07-29 | Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation | Junda Wu et.al. | 2407.20445 | translate | read | null |
| 2024-07-29 | Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings | Seungyeon Rhyu et.al. | 2407.19900 | translate | read | null |
| 2024-07-26 | Dynamic Language Group-Based MoE: Enhancing Efficiency and Flexibility for Code-Switching Speech Recognition | Hukai Huang et.al. | 2407.18581 | translate | read | null |
| 2024-07-29 | Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks | Mahmoud Salhab et.al. | 2407.18571 | translate | read | null |
| 2024-07-26 | Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models | Neil Shah et.al. | 2407.18541 | translate | read | null |
| 2024-07-26 | VoxSim: A perceptual voice similarity dataset | Junseok Ahn et.al. | 2407.18505 | translate | read | null |
| 2024-07-26 | Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation | Shiyao Wang et.al. | 2407.18461 | translate | read | link |
| 2024-07-25 | On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures | Nick Rossenbach et.al. | 2407.17997 | translate | read | null |
| 2024-07-25 | Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization | Ruijie Tao et.al. | 2407.17902 | translate | read | link |
| 2024-07-25 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions | Jiwon Suh et.al. | 2407.17874 | translate | read | null |
| 2024-07-25 | Scaling A Simple Approach to Zero-Shot Speech Recognition | Jinming Zhao et.al. | 2407.17852 | translate | read | link |
| 2024-07-24 | Coupling Speech Encoders with Downstream Text Models | Ciprian Chelba et.al. | 2407.17605 | translate | read | null |
| 2024-07-24 | A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives | Jan Lehečka et.al. | 2407.17160 | translate | read | null |
| 2024-07-24 | Long-Term, Store-Front Robotics: Interactive Music for Robotic Arm, Caxixi and Frame Drums | Richard Savery et.al. | 2407.16956 | translate | read | null |
| 2024-07-23 | Quantifying the Role of Textual Predictability in Automatic Speech Recognition | Sean Robertson et.al. | 2407.16537 | translate | read | null |
| 2024-07-23 | The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization | Samuele Cornell et.al. | 2407.16447 | translate | read | null |
| 2024-07-23 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction | Rithik Sachdev et.al. | 2407.16370 | translate | read | link |
| 2024-07-22 | dMel: Speech Tokenization made Simple | He Bai et.al. | 2407.15835 | translate | read | null |
| 2024-07-22 | Robustness of Speech Separation Models for Similar-pitch Speakers | Bunlong Lay et.al. | 2407.15749 | translate | read | null |
| 2024-07-22 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios | Hazim Bukhari et.al. | 2407.15300 | translate | read | null |
| 2024-07-21 | Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning | Shuai Wang et.al. | 2407.15188 | translate | read | null |
| 2024-07-21 | MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation | Yun-Han Lan et.al. | 2407.15060 | translate | read | null |
| 2024-07-20 | Towards Realistic Emotional Voice Conversion using Controllable Emotional Intensity | Tianhua Qi et.al. | 2407.14800 | translate | read | null |
| 2024-07-21 | Trading Devil Final: Backdoor attack via Stock market and Bayesian Optimization | Orson Mengara et.al. | 2407.14573 | translate | read | null |
| 2024-07-19 | Towards Assessing Data Replication in Music Generation with Music Similarity Metrics on Raw Audio | Roser Batlle-Roca et.al. | 2407.14364 | translate | read | link |
| 2024-07-19 | Rasa: Building Expressive Speech Synthesis Systems for Indian Languages in Low-resource Settings | Praveen Srinivasa Varadhan et.al. | 2407.14056 | translate | read | link |
| 2024-07-19 | GE2E-AC: Generalized End-to-End Loss Training for Accent Classification | Chihiro Watanabe et.al. | 2407.14021 | translate | read | null |
| 2024-07-19 | MSceneSpeech: A Multi-Scene Speech Dataset For Expressive Speech Synthesis | Qian Yang et.al. | 2407.14006 | translate | read | null |
| 2024-07-19 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance | Changye Li et.al. | 2407.13982 | translate | read | link |
| 2024-07-18 | Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models | Weiqin Li et.al. | 2407.13509 | translate | read | null |
| 2024-07-18 | Reducing Barriers to the Use of Marginalised Music Genres in AI | Nick Bryan-Kinns et.al. | 2407.13439 | translate | read | null |
| 2024-07-18 | Robust ASR Error Correction with Conservative Data Filtering | Takuma Udagawa et.al. | 2407.13300 | translate | read | null |
| 2024-07-18 | Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training | Lukuan Dong et.al. | 2407.13292 | translate | read | null |
| 2024-07-18 | How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines | Ailin Liu et.al. | 2407.13266 | translate | read | null |
| 2024-07-18 | A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR | Jian You et.al. | 2407.13142 | translate | read | null |
| 2024-07-17 | Audio Conditioning for Music Generation via Discrete Bottleneck Features | Simon Rouard et.al. | 2407.12563 | translate | read | null |
| 2024-07-17 | Morphosyntactic Analysis for CHILDES | Houjun Liu et.al. | 2407.12389 | translate | read | null |
| 2024-07-17 | Adaptive Cascading Network for Continual Test-Time Adaptation | Kien X. Nguyen et.al. | 2407.12240 | translate | read | null |
| 2024-07-16 | Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models | Minh Nguyen et.al. | 2407.12094 | translate | read | link |
| 2024-07-17 | Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors | Julien Hauret et.al. | 2407.11828 | translate | read | link |
| 2024-07-16 | Investigating the Effect of Label Topology and Training Criterion on ASR Performance and Alignment Quality | Tina Raissi et.al. | 2407.11641 | translate | read | null |
| 2024-07-16 | The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation | Michele Panariello et.al. | 2407.11516 | translate | read | null |
| 2024-07-16 | VoxBlink2: A 100K+ Speaker Recognition Corpus and the Open-Set Speaker-Identification Benchmark | Yuke Lin et.al. | 2407.11510 | translate | read | null |
| 2024-07-16 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models | Matthew Perez et.al. | 2407.11345 | translate | read | null |
| 2024-07-15 | Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data | Liang-Hsuan Tseng et.al. | 2407.10603 | translate | read | null |
| 2024-07-15 | BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features | Jing Luo et.al. | 2407.10462 | translate | read | link |
| 2024-07-14 | The Interpretation Gap in Text-to-Music Generation Models | Yongyi Zang et.al. | 2407.10328 | translate | read | null |
| 2024-07-14 | Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation | Ruizhe Huang et.al. | 2407.10303 | translate | read | null |
| 2024-07-14 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer based Streaming ASR | Wenbo Zhao et.al. | 2407.10255 | translate | read | null |
| 2024-07-14 | Textless Dependency Parsing by Labeled Sequence Prediction | Shunsuke Kando et.al. | 2407.10118 | translate | read | link |
| 2024-07-14 | Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification | Li Zhang et.al. | 2407.10048 | translate | read | null |
| 2024-07-13 | Text-Based Detection of On-Hold Scripts in Contact Center Calls | Dmitrii Galimzianov et.al. | 2407.09849 | translate | read | link |
| 2024-07-13 | Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System | Lingwei Meng et.al. | 2407.09817 | translate | read | null |
| 2024-07-13 | A Streaming Multi-Channel End-to-End Speech Recognition System with Realistic Evaluations | Xiangzhu Kong et.al. | 2407.09807 | translate | read | null |
| 2024-07-12 | Music Proofreading with RefinPaint: Where and How to Modify Compositions given Context | Pedro Ramoneda et.al. | 2407.09099 | translate | read | link |
| 2024-07-12 | Optimization of DNN-based speaker verification model through efficient quantization technique | Yeona Hong et.al. | 2407.08991 | translate | read | null |
| 2024-07-10 | Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks | Lucca Emmanuel Pineli Simões et.al. | 2407.08658 | translate | read | null |
| 2024-07-11 | Tamil Language Computing: the Present and the Future | Kengatharaiyer Sarveswaran et.al. | 2407.08618 | translate | read | null |
| 2024-07-11 | Autoregressive Speech Synthesis without Vector Quantization | Lingwei Meng et.al. | 2407.08551 | translate | read | null |
| 2024-07-11 | Toward accessible comics for blind and low vision readers | Christophe Rigaud et.al. | 2407.08248 | translate | read | null |
| 2024-07-10 | Phonetic Richness for Improved Automatic Speaker Verification | Nicholas Klein et.al. | 2407.08017 | translate | read | null |
| 2024-07-10 | Source Tracing of Audio Deepfake Systems | Nicholas Klein et.al. | 2407.08016 | translate | read | null |
| 2024-07-11 | SaMoye: Zero-shot Singing Voice Conversion Based on Feature Disentanglement and Synthesis | Zihao Wang et.al. | 2407.07728 | translate | read | link |
| 2024-07-10 | HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing | Arnon Turetzky et.al. | 2407.07566 | translate | read | null |
| 2024-07-09 | Remastering Divide and Remaster: A Cinematic Audio Source Separation Dataset with Multilingual Support | Karn N. Watcharasupat et.al. | 2407.07275 | translate | read | null |
| 2024-07-09 | Speech After Gender: A Trans-Feminine Perspective on Next Steps for Speech Science and Technology | Robin Netzorg et.al. | 2407.07235 | translate | read | null |
| 2024-07-09 | Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models | Yi-Cheng Lin et.al. | 2407.06957 | translate | read | link |
| 2024-07-09 | Tailored Design of Audio-Visual Speech Recognition Models using Branchformers | David Gimeno-Gómez et.al. | 2407.06606 | translate | read | link |
| 2024-07-08 | Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation | Mengzhe Geng et.al. | 2407.06310 | translate | read | null |
| 2024-07-08 | Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection | Zhenchun Lei et.al. | 2407.05605 | translate | read | null |
| 2024-07-07 | Differentiable Modal Synthesis for Physical Modeling of Planar String Sound and Motion Simulation | Jin Woo Lee et.al. | 2407.05516 | translate | read | null |
| 2024-07-07 | Fine-Grained and Interpretable Neural Speech Editing | Max Morrison et.al. | 2407.05471 | translate | read | null |
| 2024-07-09 | CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens | Zhihao Du et.al. | 2407.05407 | translate | read | null |
| 2024-07-06 | A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining | Feiyang Xiao et.al. | 2407.04936 | translate | read | null |
| 2024-07-05 | MUSIC-lite: Efficient MUSIC using Approximate Computing: An OFDM Radar Case Study | Rajat Bhattacharjya et.al. | 2407.04849 | translate | read | null |
| 2024-07-05 | Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition | Ye Bai et.al. | 2407.04675 | translate | read | null |
| 2024-07-05 | Multitaper mel-spectrograms for keyword spotting | Douglas Baptista de Souza et.al. | 2407.04662 | translate | read | null |
| 2024-07-05 | Pretraining End-to-End Keyword Search with Automatically Discovered Acoustic Units | Bolaji Yusuf et.al. | 2407.04652 | translate | read | link |
| 2024-07-05 | Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models | Bolaji Yusuf et.al. | 2407.04641 | translate | read | null |
| 2024-07-05 | Written Term Detection Improves Spoken Term Detection | Bolaji Yusuf et.al. | 2407.04601 | translate | read | link |
| 2024-07-05 | FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder | Rubing Shen et.al. | 2407.04575 | translate | read | null |
| 2024-07-05 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect | Salima Mdhaffar et.al. | 2407.04533 | translate | read | null |
| 2024-07-05 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models | Vyas Raina et.al. | 2407.04482 | translate | read | null |
| 2024-07-05 | XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models | Shashi Kumar et.al. | 2407.04439 | translate | read | null |
| 2024-07-05 | Romanization Encoding For Multilingual ASR | Wen Ding et.al. | 2407.04368 | translate | read | null |
| 2024-07-03 | GMM-ResNext: Combining Generative and Discriminative Models for Speaker Verification | Hui Yan et.al. | 2407.03135 | translate | read | null |
| 2024-07-03 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition | Jinming Chen et.al. | 2407.03026 | translate | read | null |
| 2024-07-03 | Probing the Feasibility of Multilingual Speaker Anonymization | Sarina Meyer et.al. | 2407.02937 | translate | read | link |
| 2024-07-02 | Towards the Next Frontier in Speech Representation Learning Using Disentanglement | Varun Krishna et.al. | 2407.02543 | translate | read | null |
| 2024-07-02 | Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization | Yuchen Hu et.al. | 2407.02243 | translate | read | null |
| 2024-07-02 | The USTC-NERCSLIP Systems for The ICMC-ASR Challenge | Minghui Wu et.al. | 2407.02052 | translate | read | null |
| 2024-07-02 | Accompanied Singing Voice Synthesis with Fully Text-controlled Melody | Ruiqi Li et.al. | 2407.02049 | translate | read | null |
| 2024-07-02 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models | Zhiyuan Tang et.al. | 2407.01909 | translate | read | link |
| 2024-07-01 | Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting | Scott H. Hawley et.al. | 2407.01499 | translate | read | null |
| 2024-07-01 | Lightweight Zero-shot Text-to-Speech with Mixture of Adapters | Kenichi Fujita et.al. | 2407.01291 | translate | read | null |