Audio Processing - 2024-03
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-03-31 | Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation | Rohan Chaudhury et.al. | 2404.01339 | translate | read | link |
| 2024-03-31 | CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models | Xiang Li et.al. | 2404.00569 | translate | read | link |
| 2024-03-29 | ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models | Thibaut Thonet et.al. | 2403.20262 | translate | read | null |
| 2024-03-29 | 3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization | Yafeng Chen et.al. | 2403.19971 | translate | read | link |
| 2024-03-28 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition | Yash Jain et.al. | 2403.19822 | translate | read | null |
| 2024-03-28 | Asymmetric and trial-dependent modeling: the contribution of LIA to SdSV Challenge Task 2 | Pierre-Michel Bousquet et.al. | 2403.19634 | translate | read | null |
| 2024-03-28 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition | Siyuan Shen et.al. | 2403.19224 | translate | read | link |
| 2024-03-28 | LV-CTC: Non-autoregressive ASR with CTC and latent variable models | Yuya Fujita et.al. | 2403.19207 | translate | read | null |
| 2024-03-27 | PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations | Ehsan Latif et.al. | 2403.18721 | translate | read | null |
| 2024-03-27 | ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus | Injy Hamed et.al. | 2403.18182 | translate | read | null |
| 2024-03-28 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition | Yi-Cheng Wang et.al. | 2403.17645 | translate | read | null |
| 2024-03-26 | Extracting Biomedical Entities from Noisy Audio Transcripts | Nima Ebadi et.al. | 2403.17363 | translate | read | null |
| 2024-03-25 | Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT | Rohit Raju et.al. | 2403.16655 | translate | read | null |
| 2024-03-25 | Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator | Takuhiro Kaneko et.al. | 2403.16464 | translate | read | null |
| 2024-03-22 | Privacy-Preserving End-to-End Spoken Language Understanding | Yinggui Wang et.al. | 2403.15510 | translate | read | null |
| 2024-03-26 | A Multimodal Approach to Device-Directed Speech Detection with Large Language Models | Dominik Wagner et.al. | 2403.14438 | translate | read | null |
| 2024-03-21 | XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han et.al. | 2403.14402 | translate | read | null |
| 2024-03-21 | M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset | Zhe Chen et.al. | 2403.14168 | translate | read | null |
| 2024-03-21 | The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data | Alice Baird et.al. | 2403.14048 | translate | read | null |
| 2024-03-20 | Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot | Antonio Bono et.al. | 2403.13960 | translate | read | null |
| 2024-03-20 | BanglaNum – A Public Dataset for Bengali Digit Recognition from Speech | Mir Sayeed Mohammad et.al. | 2403.13465 | translate | read | null |
| 2024-03-20 | Advanced Long-Content Speech Recognition With Factorized Neural Transducer | Xun Gong et.al. | 2403.13423 | translate | read | null |
| 2024-03-20 | KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario | Huali Zhou et.al. | 2403.13356 | translate | read | null |
| 2024-03-20 | Building speech corpus with diverse voice characteristics for its prompt-based representation | Aya Watanabe et.al. | 2403.13353 | translate | read | null |
| 2024-03-20 | Polaris: A Safety-focused LLM Constellation Architecture for Healthcare | Subhabrata Mukherjee et.al. | 2403.13313 | translate | read | null |
| 2024-03-19 | FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer | Dongyeong Hwang et.al. | 2403.12821 | translate | read | link |
| 2024-03-19 | Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation | Yuto Ishikawa et.al. | 2403.12477 | translate | read | null |
| 2024-03-19 | An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis | Yifan Peng et.al. | 2403.12402 | translate | read | null |
| 2024-03-18 | Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models | Linus Nwankwo et.al. | 2403.12273 | translate | read | null |
| 2024-03-18 | Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models | Emilian Postolache et.al. | 2403.11706 | translate | read | link |
| 2024-03-18 | QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation | Zhizhen Zhou et.al. | 2403.11626 | translate | read | null |
| 2024-03-18 | AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition | SooHwan Eom et.al. | 2403.11578 | translate | read | null |
| 2024-03-16 | Energy-Based Models with Applications to Speech and Language Processing | Zhijian Ou et.al. | 2403.10961 | translate | read | null |
| 2024-03-16 | Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR | Savitha Murthy et.al. | 2403.10937 | translate | read | null |
| 2024-03-15 | MusicHiFi: Fast High-Fidelity Stereo Vocoding | Ge Zhu et.al. | 2403.10493 | translate | read | null |
| 2024-03-15 | Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks | Peter Leer et.al. | 2403.10420 | translate | read | null |
| 2024-03-14 | SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages | René Groh et.al. | 2403.09753 | translate | read | link |
| 2024-03-14 | More than words: Advancements and challenges in speech recognition for singing | Anna Kruspe et.al. | 2403.09298 | translate | read | null |
| 2024-03-13 | Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition | Wenjing Zhu et.al. | 2403.08258 | translate | read | null |
| 2024-03-13 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation | Jiayu Du et.al. | 2403.08196 | translate | read | link |
| 2024-03-13 | Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children | Taekyung Ahn et.al. | 2403.08187 | translate | read | null |
| 2024-03-13 | EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech | Ziqi Liang et.al. | 2403.08164 | translate | read | null |
| 2024-03-12 | Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language | Yash Sharma et.al. | 2403.08011 | translate | read | null |
| 2024-03-12 | Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation | Keshav Bhandari et.al. | 2403.07995 | translate | read | null |
| 2024-03-11 | The evaluation of a code-switched Sepedi-English automatic speech recognition system | Amanda Phaladi et.al. | 2403.07947 | translate | read | null |
| 2024-03-12 | Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets | Jan Pešán et.al. | 2403.07767 | translate | read | null |
| 2024-03-11 | Real-Time Multimodal Cognitive Assistant for Emergency Medical Services | Keshara Weerasinghe et.al. | 2403.06734 | translate | read | null |
| 2024-03-11 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR | Yufeng Yang et.al. | 2403.06387 | translate | read | null |
| 2024-03-10 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations | Amit Meghanani et.al. | 2403.06260 | translate | read | null |
| 2024-03-09 | HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling | Chunhui Wang et.al. | 2403.05989 | translate | read | null |
| 2024-03-09 | Aligning Speech to Languages to Enhance Code-switching Speech Recognition | Hexin Liu et.al. | 2403.05887 | translate | read | null |
| 2024-03-07 | Classist Tools: Social Class Correlates with Performance in NLP | Amanda Cercas Curry et.al. | 2403.04445 | translate | read | null |
| 2024-03-07 | A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain | Qusai Abo Obaidah et.al. | 2403.04280 | translate | read | null |
| 2024-03-07 | A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition | Yusheng Dai et.al. | 2403.04245 | translate | read | link |
| 2024-03-06 | RADIA – Radio Advertisement Detection with Intelligent Analytics | Jorge Álvarez et.al. | 2403.03538 | translate | read | null |
| 2024-03-06 | Non-verbal information in spontaneous speech – towards a new framework of analysis | Tirza Biron et.al. | 2403.03522 | translate | read | null |
| 2024-03-05 | NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | Zeqian Ju et.al. | 2403.03100 | translate | read | null |
| 2024-03-05 | AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models | Kazuki Kawamura et.al. | 2403.02938 | translate | read | null |
| 2024-03-05 | Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction | Yue Li et.al. | 2403.02918 | translate | read | null |
| 2024-03-04 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings | Joonas Kalda et.al. | 2403.02288 | translate | read | null |
| 2024-03-04 | What has LeBenchmark Learnt about French Syntax? | Zdravko Dugonjić et.al. | 2403.02173 | translate | read | null |
| 2024-03-04 | SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR | Zhiyun Fan et.al. | 2403.02010 | translate | read | null |
| 2024-03-04 | Language and Speech Technology for Central Kurdish Varieties | Sina Ahmadi et.al. | 2403.01983 | translate | read | link |
| 2024-03-03 | PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion | Tianhua Qi et.al. | 2403.01494 | translate | read | null |
| 2024-03-03 | A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement | Ravi Shankar et.al. | 2403.01369 | translate | read | null |
| 2024-03-03 | a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification | Hye-jin Shim et.al. | 2403.01355 | translate | read | link |
| 2024-03-02 | Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey | Hamza Kheddar et.al. | 2403.01255 | translate | read | null |
| 2024-03-02 | Towards Accurate Lip-to-Speech Synthesis in-the-Wild | Sindhu Hegde et.al. | 2403.01087 | translate | read | null |
| 2024-03-01 | VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis | Weiwei Lin et.al. | 2403.00529 | translate | read | null |
| 2024-03-01 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview | Heyang Liu et.al. | 2403.00370 | translate | read | null |
| 2024-03-01 | Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification | Mufan Sang et.al. | 2403.00293 | translate | read | null |
| 2024-03-01 | Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART | Aniket Tathe et.al. | 2403.00212 | translate | read | null |