Audio Processing - 2024-06
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-06-30 | An Attribute Interpolation Method in Speech Synthesis by Model Merging | Masato Murata et.al. | 2407.00766 | translate | read | null |
| 2024-06-30 | Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations | Salah Zaiem et.al. | 2407.00756 | translate | read | null |
| 2024-06-30 | FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis | Yinlin Guo et.al. | 2407.00753 | translate | read | null |
| 2024-06-29 | When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration | Philipp Allgeuer et.al. | 2407.00518 | translate | read | null |
| 2024-06-28 | SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR | Qiuming Zhao et.al. | 2406.19706 | translate | read | null |
| 2024-06-28 | Less is More: Accurate Speech Recognition & Translation without Web-Scale Data | Krishna C. Puvvada et.al. | 2406.19674 | translate | read | null |
| 2024-06-27 | Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects | Orevaoghene Ahia et.al. | 2406.19564 | translate | read | null |
| 2024-06-27 | Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment | Rotem Rousso et.al. | 2406.19363 | translate | read | null |
| 2024-06-27 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems | Zheng Fang et.al. | 2406.19311 | translate | read | null |
| 2024-06-27 | Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models | Borodin Kirill Nikolayevich et.al. | 2406.19243 | translate | read | null |
| 2024-06-27 | DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability | Hyun Joon Park et.al. | 2406.19135 | translate | read | link |
| 2024-06-27 | Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over | Atsunori Ogawa et.al. | 2406.18972 | translate | read | null |
| 2024-06-27 | Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network | Yehoshua Dissen et.al. | 2406.18928 | translate | read | null |
| 2024-06-27 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study | Peikun Chen et.al. | 2406.18862 | translate | read | null |
| 2024-06-26 | A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems | Karn N. Watcharasupat et.al. | 2406.18747 | translate | read | link |
| 2024-06-26 | Dynamic Data Pruning for Automatic Speech Recognition | Qiao Xiao et.al. | 2406.18373 | translate | read | null |
| 2024-06-26 | MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research | Song Li et.al. | 2406.18301 | translate | read | null |
| 2024-06-26 | Automatic Speech Recognition for Hindi | Anish Saha et.al. | 2406.18135 | translate | read | null |
| 2024-06-26 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs | Ahmed Heakl et.al. | 2406.18120 | translate | read | link |
| 2024-06-26 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR | Shuaishuai Ye et.al. | 2406.18021 | translate | read | null |
| 2024-06-25 | Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment | Paarth Neekhara et.al. | 2406.17957 | translate | read | null |
| 2024-06-25 | Sequential Editing for Lifelong Training of Speech Recognition Models | Devang Kulshreshtha et.al. | 2406.17935 | translate | read | null |
| 2024-06-25 | FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data | Dancheng Liu et.al. | 2406.17926 | translate | read | link |
| 2024-06-25 | Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals | Kentaro Seki et.al. | 2406.17722 | translate | read | null |
| 2024-06-25 | Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model | Jiawen Huang et.al. | 2406.17618 | translate | read | link |
| 2024-06-25 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization | Adriana Fernandez-Lopez et.al. | 2406.17614 | translate | read | null |
| 2024-06-25 | High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model | Joun Yeop Lee et.al. | 2406.17310 | translate | read | null |
| 2024-06-25 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR | Van Tung Pham et.al. | 2406.17272 | translate | read | null |
| 2024-06-25 | Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation | Yingting Li et.al. | 2406.17257 | translate | read | null |
| 2024-06-24 | Investigating Confidence Estimation Measures for Speaker Diarization | Anurag Chowdhury et.al. | 2406.17124 | translate | read | null |
| 2024-06-24 | Exploring the Capability of Mamba in Speech Applications | Koichi Miyazaki et.al. | 2406.16808 | translate | read | null |
| 2024-06-24 | Blending LLMs into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 | Sai Koneru et.al. | 2406.16777 | translate | read | null |
| 2024-06-25 | Towards Zero-Shot Text-To-Speech for Arabic Dialects | Khai Duy Doan et.al. | 2406.16751 | translate | read | null |
| 2024-06-24 | One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection | Hyun Myung Kim et.al. | 2406.16716 | translate | read | null |
| 2024-06-24 | RefXVC: Cross-Lingual Voice Conversion with Enhanced Reference Leveraging | Mingyang Zhang et.al. | 2406.16326 | translate | read | null |
| 2024-06-24 | DreamVoice: Text-Guided Voice Conversion | Jiarui Hai et.al. | 2406.16314 | translate | read | null |
| 2024-06-23 | Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss | Muhammad Shakeel et.al. | 2406.16120 | translate | read | null |
| 2024-06-23 | Decoder-only Architecture for Streaming End-to-end Speech Recognition | Emiru Tsunoo et.al. | 2406.16107 | translate | read | null |
| 2024-06-22 | Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment | Heejin Do et.al. | 2406.15723 | translate | read | null |
| 2024-06-21 | PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics | Amir Nassereldine et.al. | 2406.15668 | translate | read | null |
| 2024-06-21 | Perception of Phonological Assimilation by Neural Speech Recognition Models | Charlotte Pouw et.al. | 2406.15265 | translate | read | null |
| 2024-06-21 | InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions | Yu Nakagome et.al. | 2406.14890 | translate | read | null |
| 2024-06-20 | An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks | Varsha Suresh et.al. | 2406.14747 | translate | read | null |
| 2024-06-21 | DASB – Discrete Audio and Speech Benchmark | Pooneh Mousavi et.al. | 2406.14294 | translate | read | null |
| 2024-06-20 | Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries | Anna Wróblewska et.al. | 2406.14266 | translate | read | null |
| 2024-06-19 | Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control | Alexander Blatt et.al. | 2406.13842 | translate | read | null |
| 2024-06-19 | ManWav: The First Manchu ASR Model | Jean Seo et.al. | 2406.13502 | translate | read | null |
| 2024-06-19 | Children’s Speech Recognition through Discrete Token Enhancement | Vrunda N. Sukhadia et.al. | 2406.13431 | translate | read | null |
| 2024-06-19 | CEC: A Noisy Label Detection Method for Speaker Recognition | Yao Shen et.al. | 2406.13268 | translate | read | null |
| 2024-06-18 | Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech | Cheol Jun Cho et.al. | 2406.12998 | translate | read | null |
| 2024-06-18 | Bridging the Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition | Kuan-Chen Wang et.al. | 2406.12699 | translate | read | null |
| 2024-06-18 | Transcribe, Align and Segment: Creating speech datasets for low-resource languages | Taras Sereda et.al. | 2406.12674 | translate | read | null |
| 2024-06-18 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Adrien Pupier et.al. | 2406.12621 | translate | read | null |
| 2024-06-18 | Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting | Yosuke Kashiwagi et.al. | 2406.12611 | translate | read | null |
| 2024-06-18 | Unsupervised Online Continual Learning for Automatic Speech Recognition | Steven Vander Eeckt et.al. | 2406.12503 | translate | read | null |
| 2024-06-18 | Performant ASR Models for Medical Entities in Accented Speech | Tejumade Afonja et.al. | 2406.12387 | translate | read | null |
| 2024-06-18 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model | Hayato Futami et.al. | 2406.12317 | translate | read | null |
| 2024-06-18 | JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning | Boyu Chen et.al. | 2406.12292 | translate | read | null |
| 2024-06-18 | SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization | Young Jin Ahn et.al. | 2406.12233 | translate | read | null |
| 2024-06-18 | A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis | Guoqiang Hu et.al. | 2406.12164 | translate | read | null |
| 2024-06-17 | 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis | Sewade Ogun et.al. | 2406.11727 | translate | read | null |
| 2024-06-17 | GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement | Yifan Yang et.al. | 2406.11546 | translate | read | link |
| 2024-06-17 | Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9 | Do Hyun Lee et.al. | 2406.11248 | translate | read | null |
| 2024-06-17 | Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision | Yafeng Chen et.al. | 2406.11169 | translate | read | null |
| 2024-06-16 | Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Guan-Ting Lin et.al. | 2406.11064 | translate | read | null |
| 2024-06-16 | NAST: Noise Aware Speech Tokenization for Speech Language Models | Shoval Messica et.al. | 2406.11037 | translate | read | link |
| 2024-06-16 | Large Language Models for Dysfluency Detection in Stuttered Speech | Dominik Wagner et.al. | 2406.11025 | translate | read | null |
| 2024-06-16 | Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models | Dominik Wagner et.al. | 2406.11022 | translate | read | null |
| 2024-06-16 | Optimized Speculative Sampling for GPU Hardware Accelerators | Dominik Wagner et.al. | 2406.11016 | translate | read | null |
| 2024-06-16 | CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving | Bhavani Shankar et.al. | 2406.10993 | translate | read | null |
| 2024-06-14 | Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation | Dena Mujtaba et.al. | 2406.10177 | translate | read | null |
| 2024-06-14 | On the Evaluation of Speech Foundation Models for Spoken Language Understanding | Siddhant Arora et.al. | 2406.10083 | translate | read | null |
| 2024-06-14 | Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation | Andrew Rouditchenko et.al. | 2406.10082 | translate | read | link |
| 2024-06-14 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection | Haoyu Wang et.al. | 2406.10052 | translate | read | link |
| 2024-06-14 | ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR | Vishwanath Pratap Singh et.al. | 2406.09999 | translate | read | null |
| 2024-06-14 | An efficient text augmentation approach for contextualized Mandarin speech recognition | Naijun Zheng et.al. | 2406.09950 | translate | read | null |
| 2024-06-14 | Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition | Yicong Jiang et.al. | 2406.09873 | translate | read | null |
| 2024-06-14 | MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model | Jiatong Shi et.al. | 2406.09869 | translate | read | null |
| 2024-06-14 | Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy | Linhan Ma et.al. | 2406.09844 | translate | read | null |
| 2024-06-14 | Low algorithmic delay implementation of convolutional beamformer for online joint source separation and dereverberation | Kaien Mo et.al. | 2406.09821 | translate | read | null |
| 2024-06-13 | Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech | Martina Valente et.al. | 2406.09290 | translate | read | null |
| 2024-06-13 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t | Chihiro Taguchi et.al. | 2406.09202 | translate | read | null |
| 2024-06-13 | LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks | Amit Meghanani et.al. | 2406.09153 | translate | read | null |
| 2024-06-13 | ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis | Dehua Tao et.al. | 2406.08989 | translate | read | null |
| 2024-06-13 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition | William Ravenscroft et.al. | 2406.08914 | translate | read | null |
| 2024-06-13 | AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers | Emil Biju et.al. | 2406.08904 | translate | read | null |
| 2024-06-13 | A Single-Step Non-Autoregressive Automatic Speech Recognition Architecture with High Accuracy and Inference Speed | Ziyang Zhuang et.al. | 2406.08835 | translate | read | null |
| 2024-06-13 | Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems | Zhengyang Chen et.al. | 2406.08812 | translate | read | null |
| 2024-06-12 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets | Jiatong Shi et.al. | 2406.08641 | translate | read | null |
| 2024-06-12 | Emotion Manipulation Through Music – A Deep Learning Interactive Visual Approach | Adel N. Abdalla et.al. | 2406.08623 | translate | read | null |
| 2024-06-12 | SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models | Chun Yin et.al. | 2406.08445 | translate | read | null |
| 2024-06-12 | TokSing: Singing Voice Synthesis based on Discrete Tokens | Yuning Wu et.al. | 2406.08416 | translate | read | null |
| 2024-06-12 | Neural Blind Source Separation and Diarization for Distant Speech Recognition | Yoshiaki Bando et.al. | 2406.08396 | translate | read | null |
| 2024-06-12 | Towards Unsupervised Speech Recognition Without Pronunciation Models | Junrui Ni et.al. | 2406.08380 | translate | read | null |
| 2024-06-12 | Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques | Yuanchao Li et.al. | 2406.08353 | translate | read | link |
| 2024-06-12 | Refining Self-Supervised Learnt Speech Representation using Brain Activations | Hengyu Li et.al. | 2406.08266 | translate | read | null |
| 2024-06-12 | Transformer-based Model for ASR N-Best Rescoring and Rewriting | Iwen E. Kang et.al. | 2406.08207 | translate | read | null |
| 2024-06-12 | FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter | Yuanjun Lv et.al. | 2406.08196 | translate | read | link |
| 2024-06-12 | Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data | Yuma Shirahata et.al. | 2406.08111 | translate | read | null |
| 2024-06-12 | Can Large Language Models Understand Spatial Audio? | Changli Tang et.al. | 2406.07914 | translate | read | null |
| 2024-06-11 | Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Qingkai Fang et.al. | 2406.07289 | translate | read | null |
| 2024-06-11 | Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment | Takuto Igarashi et.al. | 2406.07280 | translate | read | null |
| 2024-06-11 | AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection | Rong Gong et.al. | 2406.07256 | translate | read | null |
| 2024-06-11 | SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark | Yuki Saito et.al. | 2406.07254 | translate | read | null |
| 2024-06-11 | CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems | Haibin Wu et.al. | 2406.07237 | translate | read | null |
| 2024-06-11 | MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms | Seung-bin Kim et.al. | 2406.07103 | translate | read | link |
| 2024-06-11 | Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter | Andrei Andrusenko et.al. | 2406.07096 | translate | read | null |
| 2024-06-11 | Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech | Mateusz Czyżnikiewicz et.al. | 2406.07090 | translate | read | null |
| 2024-06-11 | Reading Miscue Detection in Primary School through Automatic Speech Recognition | Lingyun Gao et.al. | 2406.07060 | translate | read | null |
| 2024-06-10 | Synthetic Query Generation using Large Language Models for Virtual Assistants | Sonal Sannigrahi et.al. | 2406.06729 | translate | read | null |
| 2024-06-10 | Meta Learning Text-to-Speech Synthesis in over 7000 Languages | Florian Lux et.al. | 2406.06403 | translate | read | link |
| 2024-06-10 | A Parameter-efficient Language Extension Framework for Multilingual ASR | Wei Liu et.al. | 2406.06329 | translate | read | null |
| 2024-06-10 | Quantifying the effect of speech pathology on automatic and human speaker verification | Bence Mark Halpern et.al. | 2406.06208 | translate | read | null |
| 2024-06-10 | JenGAN: Stacked Shifted Filters in GAN-Based Speech Synthesis | Hyunjae Cho et.al. | 2406.06111 | translate | read | null |
| 2024-06-10 | Prompting Large Language Models with Audio for General-Purpose Speech Summarization | Wonjune Kang et.al. | 2406.05968 | translate | read | link |
| 2024-06-09 | Conserving Human Creativity with Evolutionary Generative Algorithms: A Case Study in Music Generation | Justin Kilb et.al. | 2406.05873 | translate | read | null |
| 2024-06-09 | Source-Free Domain Adaptation for Speaker Verification in Data-Scarce Languages and Noisy Channels | Shlomo Salo Elia et.al. | 2406.05863 | translate | read | null |
| 2024-06-09 | Do Prompts Really Prompt? Exploring the Prompt Understanding Capability of Whisper | Chih-Kai Yang et.al. | 2406.05806 | translate | read | null |
| 2024-06-09 | Optimizing Multi-Stuttered Speech Classification: Leveraging Whisper’s Encoder for Efficient Parameter Reduction in Automated Assessment | Huma Ameer et.al. | 2406.05784 | translate | read | null |
| 2024-06-09 | SPA-SVC: Self-supervised Pitch Augmentation for Singing Voice Conversion | Bingsong Bai et.al. | 2406.05692 | translate | read | null |
| 2024-06-07 | The Database and Benchmark for Source Speaker Verification Against Voice Conversion | Ze Li et.al. | 2406.04951 | translate | read | null |
| 2024-06-07 | LLM-based speaker diarization correction: A generalizable approach | Georgios Efstathiadis et.al. | 2406.04927 | translate | read | link |
| 2024-06-07 | Speaker-Smoothed kNN Speaker Adaptation for End-to-End ASR | Shaojun Li et.al. | 2406.04791 | translate | read | null |
| 2024-06-07 | Pitch-Aware RNN-T for Mandarin Chinese Mispronunciation Detection and Diagnosis | Xintong Wang et.al. | 2406.04595 | translate | read | null |
| 2024-06-07 | Neural Codec-based Adversarial Sample Detection for Speaker Verification | Xuanjun Chen et.al. | 2406.04582 | translate | read | null |
| 2024-06-06 | Flexible Multichannel Speech Enhancement for Noise-Robust Frontend | Ante Jukić et.al. | 2406.04552 | translate | read | null |
| 2024-06-06 | Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation | Keqi Deng et.al. | 2406.04541 | translate | read | null |
| 2024-06-06 | To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation | Abdul Waheed et.al. | 2406.04512 | translate | read | link |
| 2024-06-06 | Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline | Ali N. Salman et.al. | 2406.04494 | translate | read | null |
| 2024-06-06 | Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis | Théodor Lemerle et.al. | 2406.04467 | translate | read | link |
| 2024-06-06 | VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling | Zeyue Tian et.al. | 2406.04321 | translate | read | link |
| 2024-06-06 | Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement | Wangyou Zhang et.al. | 2406.04269 | translate | read | null |
| 2024-06-06 | Hypernetworks for Personalizing ASR to Atypical Speech | Max Mueller-Eberstein et.al. | 2406.04240 | translate | read | null |
| 2024-06-06 | Helsinki Speech Challenge 2024 | Martin Ludvigsen et.al. | 2406.04123 | translate | read | null |
| 2024-06-06 | BLSP-Emo: Towards Empathetic Large Speech-Language Models | Chen Wang et.al. | 2406.03872 | translate | read | link |
| 2024-06-06 | Improving Zero-Shot Chinese-English Code-Switching ASR with kNN-CTC and Gated Monolingual Datastores | Jiaming Zhou et.al. | 2406.03814 | translate | read | null |
| 2024-06-06 | Speed of Light Exact Greedy Decoding for RNN-T Speech Recognition Models on GPU | Daniel Galvez et.al. | 2406.03791 | translate | read | null |
| 2024-06-06 | Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining | Jinlong Xue et.al. | 2406.03714 | translate | read | null |
| 2024-06-06 | Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model | Jinlong Xue et.al. | 2406.03706 | translate | read | null |
| 2024-06-05 | Style Mixture of Experts for Expressive Text-To-Speech Synthesis | Ahad Jawaid et.al. | 2406.03637 | translate | read | null |
| 2024-06-05 | Enhancing CTC-based speech recognition with diverse modeling units | Shiyi Han et.al. | 2406.03274 | translate | read | null |
| 2024-06-05 | Error-preserving Automatic Speech Recognition of Young English Learners’ Language | Janick Michot et.al. | 2406.03235 | translate | read | link |
| 2024-06-05 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Shaolei Zhang et.al. | 2406.03049 | translate | read | link |
| 2024-06-05 | 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders | Yui Sudo et.al. | 2406.02950 | translate | read | null |
| 2024-06-05 | SYN2REAL: Leveraging Task Arithmetic for Mitigating Synthetic-Real Discrepancies in ASR Domain Adaptation | Hsuan Su et.al. | 2406.02925 | translate | read | null |
| 2024-06-05 | Text Injection for Neural Contextual Biasing | Zhong Meng et.al. | 2406.02921 | translate | read | null |
| 2024-06-04 | Keyword-Guided Adaptation of Automatic Speech Recognition | Aviv Shamsian et.al. | 2406.02649 | translate | read | null |
| 2024-06-04 | Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion | Ruiqi Li et.al. | 2406.02429 | translate | read | null |
| 2024-06-04 | An Independence-promoting Loss for Music Generation with Language Models | Jean-Marie Lemercier et.al. | 2406.02315 | translate | read | null |
| 2024-06-04 | Towards Supervised Performance on Speaker Verification with Self-Supervised Learning by Leveraging Large-Scale ASR Models | Victor Miara et.al. | 2406.02285 | translate | read | link |
| 2024-06-04 | ERes2NetV2: Boosting Short-Duration Speaker Verification Performance with Computational Efficiency | Yafeng Chen et.al. | 2406.02167 | translate | read | null |
| 2024-06-04 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision | Saierdaer Yusuyin et.al. | 2406.02166 | translate | read | link |
| 2024-06-04 | Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis | Kun Zhou et.al. | 2406.02009 | translate | read | null |
| 2024-06-04 | Efficiently Train ASR Models that Memorize Less and Perform Better with Per-core Clipping | Lun Wang et.al. | 2406.02004 | translate | read | null |
| 2024-06-03 | TinySV: Speaker Verification in TinyML with On-device Learning | Massimo Pavan et.al. | 2406.01655 | translate | read | null |
| 2024-06-03 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach | Ara Yeroyan et.al. | 2406.01446 | translate | read | null |
| 2024-06-03 | Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization | Firas Khader et.al. | 2406.01314 | translate | read | null |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)