Audio Processing - 2024-04
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-04-30 | Who is Authentic Speaker | Qiang Huang et.al. | 2405.00248 | translate | read | null |
| 2024-04-30 | ConFides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration | Sunwoo Ha et.al. | 2405.00223 | translate | read | null |
| 2024-04-30 | Expressivity and Speech Synthesis | Andreas Triantafyllopoulos et.al. | 2404.19363 | translate | read | null |
| 2024-04-30 | Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation | Eyal Liron Dolev et.al. | 2404.19310 | translate | read | null |
| 2024-04-30 | EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization | Jianzong Wang et.al. | 2404.19214 | translate | read | null |
| 2024-04-30 | EAD-VC: Enhancing Speech Auto-Disentanglement for Voice Conversion with IFUB Estimator and Joint Text-Guided Consistent Learning | Ziqi Liang et.al. | 2404.19212 | translate | read | null |
| 2024-04-29 | Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification | Artem Abzaliev et.al. | 2404.18739 | translate | read | null |
| 2024-04-29 | MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis | Xiang Li et.al. | 2404.18398 | translate | read | link |
| 2024-04-30 | ComposerX: Multi-Agent Symbolic Music Composition with LLMs | Qixin Deng et.al. | 2404.18081 | translate | read | link |
| 2024-04-27 | A Comparison of Differential Performance Metrics for the Evaluation of Automatic Speaker Verification Fairness | Oubaida Chouchane et.al. | 2404.17810 | translate | read | null |
| 2024-04-26 | An RFP dataset for Real, Fake, and Partially fake audio detection | Abdulazeez AlAli et.al. | 2404.17721 | translate | read | null |
| 2024-04-26 | A Semi-Automatic Approach to Create Large Gender- and Age-Balanced Speaker Corpora: Usefulness of Speaker Diarization & Identification | Rémi Uro et.al. | 2404.17552 | translate | read | null |
| 2024-04-26 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? | Ruben Janssens et.al. | 2404.17394 | translate | read | null |
| 2024-04-26 | Device Feature based on Graph Fourier Transformation with Logarithmic Processing For Detection of Replay Speech Attacks | Mingrui He et.al. | 2404.17280 | translate | read | null |
| 2024-04-29 | COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations | Ruben Ciranni et.al. | 2404.16969 | translate | read | null |
| 2024-04-26 | Automatic Speech Recognition System-Independent Word Error Rate Estimation | Chanho Park et.al. | 2404.16743 | translate | read | null |
| 2024-04-25 | Developing Acoustic Models for Automatic Speech Recognition in Swedish | Giampiero Salvi et.al. | 2404.16547 | translate | read | null |
| 2024-04-25 | U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF | Xingchen Song et.al. | 2404.16407 | translate | read | null |
| 2024-04-24 | Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges | Badri Narayana Patro et.al. | 2404.16112 | translate | read | link |
| 2024-04-24 | Efficient Multi-Model Fusion with Adversarial Complementary Representation Learning | Zuheng Kang et.al. | 2404.15704 | translate | read | null |
| 2024-04-24 | HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts | Xinlei Niu et.al. | 2404.15637 | translate | read | null |
| 2024-04-23 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information | Chihiro Taguchi et.al. | 2404.15501 | translate | read | link |
| 2024-04-23 | Additive Margin in Contrastive Self-Supervised Frameworks to Learn Discriminative Speaker Representations | Theo Lepage et.al. | 2404.14913 | translate | read | null |
| 2024-04-23 | Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance | Tsubasa Ochiai et.al. | 2404.14860 | translate | read | null |
| 2024-04-25 | FlashSpeech: Efficient Zero-Shot Speech Synthesis | Zhen Ye et.al. | 2404.14700 | translate | read | null |
| 2024-04-22 | Assessment of Sign Language-Based versus Touch-Based Input for Deaf Users Interacting with Intelligent Personal Assistants | Nina Tran et.al. | 2404.14605 | translate | read | null |
| 2024-04-22 | Exploring neural oscillations during speech perception via surrogate gradient spiking neural networks | Alexandre Bittar et.al. | 2404.14024 | translate | read | null |
| 2024-04-23 | Retrieval-Augmented Audio Deepfake Detection | Zuheng Kang et.al. | 2404.13892 | translate | read | null |
| 2024-04-23 | Parameter Efficient Fine Tuning: A Comprehensive Analysis Across Applications | Charith Chandra Sai Balne et.al. | 2404.13506 | translate | read | null |
| 2024-04-20 | Text-dependent Speaker Verification (TdSV) Challenge 2024: Challenge Evaluation Plan | Zeinali Hossein et.al. | 2404.13428 | translate | read | null |
| 2024-04-20 | Semantically Corrected Amharic Automatic Speech Recognition | Samuael Adnew et.al. | 2404.13362 | translate | read | link |
| 2024-04-20 | Music Consistency Models | Zhengcong Fei et.al. | 2404.13358 | translate | read | null |
| 2024-04-20 | Track Role Prediction of Single-Instrumental Sequences | Changheon Han et.al. | 2404.13286 | translate | read | null |
| 2024-04-19 | Learn2Talk: 3D Talking Face Learns from 2D Talking Face | Yixiang Zhuang et.al. | 2404.12888 | translate | read | null |
| 2024-04-19 | Efficient infusion of self-supervised representations in Automatic Speech Recognition | Darshan Prabhu et.al. | 2404.12628 | translate | read | null |
| 2024-04-18 | TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches | Rong Wang et.al. | 2404.12077 | translate | read | null |
| 2024-04-18 | Large Language Models: From Notes to Musical Form | Lilac Atassi et.al. | 2404.11976 | translate | read | null |
| 2024-04-17 | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation | Ye Bai et.al. | 2404.11275 | translate | read | null |
| 2024-04-16 | Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training | Pavel Denisov et.al. | 2404.10922 | translate | read | link |
| 2024-04-16 | Long-form music generation with latent diffusion | Zach Evans et.al. | 2404.10301 | translate | read | null |
| 2024-04-16 | Anatomy of Industrial Scale Multilingual ASR | Francis McCann Ramirez et.al. | 2404.09841 | translate | read | null |
| 2024-04-15 | Resilience of Large Language Models for Noisy Instructions | Bin Wang et.al. | 2404.09754 | translate | read | null |
| 2024-04-16 | Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment | Zhiqing Hong et.al. | 2404.09313 | translate | read | null |
| 2024-04-12 | Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task | Hassan Ali et.al. | 2404.08424 | translate | read | null |
| 2024-04-12 | ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa’ikhana | Monica Romero et.al. | 2404.08368 | translate | read | null |
| 2024-04-10 | An inclusive review on deep learning techniques and their scope in handwriting recognition | Sukhdeep Singh et.al. | 2404.08011 | translate | read | null |
| 2024-04-12 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution | Tien-Hong Lo et.al. | 2404.07575 | translate | read | null |
| 2024-04-12 | Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping | Kevin Zhang et.al. | 2404.07341 | translate | read | null |
| 2024-04-12 | Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness | Xincan Feng et.al. | 2404.06714 | translate | read | link |
| 2024-04-10 | MuPT: A Generative Symbolic Music Pretrained Transformer | Xingwei Qu et.al. | 2404.06393 | translate | read | null |
| 2024-04-10 | The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge | Yiwei Guo et.al. | 2404.06079 | translate | read | null |
| 2024-04-06 | A Novel Bi-LSTM And Transformer Architecture For Generating Tabla Music | Roopa Mayya et.al. | 2404.05765 | translate | read | null |
| 2024-04-08 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain | Khai Le-Duc et.al. | 2404.05659 | translate | read | link |
| 2024-04-07 | Gull: A Generative Multifunctional Audio Codec | Yi Luo et.al. | 2404.04947 | translate | read | null |
| 2024-04-07 | Safeguarding Voice Privacy: Harnessing Near-Ultrasonic Interference To Protect Against Unauthorized Audio Recording | Forrest McKee et.al. | 2404.04769 | translate | read | null |
| 2024-04-06 | HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks | Yingting Li et.al. | 2404.04645 | translate | read | link |
| 2024-04-05 | The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos | Igor Cardoso et.al. | 2404.04420 | translate | read | null |
| 2024-04-04 | Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition | Hainan Xu et.al. | 2404.04295 | translate | read | null |
| 2024-04-05 | Open vocabulary keyword spotting through transfer learning from speech synthesis | Kesavaraj V et.al. | 2404.03914 | translate | read | null |
| 2024-04-06 | RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis | Detai Xin et.al. | 2404.03204 | translate | read | null |
| 2024-04-03 | Mai Ho’omāuna i ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian | Kaavya Chaparala et.al. | 2404.03073 | translate | read | null |
| 2024-04-03 | PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders | Yu Pan et.al. | 2404.02702 | translate | read | null |
| 2024-04-03 | Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation | Yejin Jeon et.al. | 2404.02592 | translate | read | null |
| 2024-04-03 | CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models | Zaid Sheikh et.al. | 2404.02408 | translate | read | link |
| 2024-04-02 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition | Alexandros Haliassos et.al. | 2404.02098 | translate | read | link |
| 2024-04-02 | Noise Masking Attacks and Defenses for Pretrained Speech Models | Matthew Jagielski et.al. | 2404.02052 | translate | read | null |
| 2024-04-02 | Kallaama: A Transcribed Speech Dataset about Agriculture in the Three Most Widely Spoken Languages in Senegal | Elodie Gauthier et.al. | 2404.01991 | translate | read | link |
| 2024-04-05 | Zero-Shot Multi-Lingual Speaker Verification in Clinical Trials | Ali Akram et.al. | 2404.01981 | translate | read | null |
| 2024-04-02 | Transfer Learning from Whisper for Microscopic Intelligibility Prediction | Paul Best et.al. | 2404.01737 | translate | read | null |
| 2024-04-01 | KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis | Adal Abilbekov et.al. | 2404.01033 | translate | read | null |
| 2024-04-01 | Voice Conversion Augmentation for Speaker Recognition on Defective Datasets | Ruijie Tao et.al. | 2404.00863 | translate | read | null |
| 2024-04-01 | Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling | Injune Hwang et.al. | 2404.00856 | translate | read | null |