Audio Processing - 2024-03 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2024-03-31	Humane Speech Synthesis through Zero-Shot Emotion and Disfluency Generation	Rohan Chaudhury et.al.	2404.01339	translate	read	link
2024-03-31	CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models	Xiang Li et.al.	2404.00569	translate	read	link
2024-03-29	ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models	Thibaut Thonet et.al.	2403.20262	translate	read	null
2024-03-29	3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization	Yafeng Chen et.al.	2403.19971	translate	read	link
2024-03-28	Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition	Yash Jain et.al.	2403.19822	translate	read	null
2024-03-28	Asymmetric and trial-dependent modeling: the contribution of LIA to SdSV Challenge Task 2	Pierre-Michel Bousquet et.al.	2403.19634	translate	read	null
2024-03-28	Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition	Siyuan Shen et.al.	2403.19224	translate	read	link
2024-03-28	LV-CTC: Non-autoregressive ASR with CTC and latent variable models	Yuya Fujita et.al.	2403.19207	translate	read	null
2024-03-27	PhysicsAssistant: An LLM-Powered Interactive Learning Robot for Physics Lab Investigations	Ehsan Latif et.al.	2403.18721	translate	read	null
2024-03-27	ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus	Injy Hamed et.al.	2403.18182	translate	read	null
2024-03-28	DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition	Yi-Cheng Wang et.al.	2403.17645	translate	read	null
2024-03-26	Extracting Biomedical Entities from Noisy Audio Transcripts	Nima Ebadi et.al.	2403.17363	translate	read	null
2024-03-25	Grammatical vs Spelling Error Correction: An Investigation into the Responsiveness of Transformer-based Language Models using BART and MarianMT	Rohit Raju et.al.	2403.16655	translate	read	null
2024-03-25	Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator	Takuhiro Kaneko et.al.	2403.16464	translate	read	null
2024-03-22	Privacy-Preserving End-to-End Spoken Language Understanding	Yinggui Wang et.al.	2403.15510	translate	read	null
2024-03-26	A Multimodal Approach to Device-Directed Speech Detection with Large Language Models	Dominik Wagner et.al.	2403.14438	translate	read	null
2024-03-21	XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception	HyoJung Han et.al.	2403.14402	translate	read	null
2024-03-21	M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset	Zhe Chen et.al.	2403.14168	translate	read	null
2024-03-21	The NeurIPS 2023 Machine Learning for Audio Workshop: Affective Audio Benchmarks and Novel Data	Alice Baird et.al.	2403.14048	translate	read	null
2024-03-20	Open Access NAO (OAN): a ROS2-based software framework for HRI applications with the NAO robot	Antonio Bono et.al.	2403.13960	translate	read	null
2024-03-20	BanglaNum – A Public Dataset for Bengali Digit Recognition from Speech	Mir Sayeed Mohammad et.al.	2403.13465	translate	read	null
2024-03-20	Advanced Long-Content Speech Recognition With Factorized Neural Transducer	Xun Gong et.al.	2403.13423	translate	read	null
2024-03-20	KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario	Huali Zhou et.al.	2403.13356	translate	read	null
2024-03-20	Building speech corpus with diverse voice characteristics for its prompt-based representation	Aya Watanabe et.al.	2403.13353	translate	read	null
2024-03-20	Polaris: A Safety-focused LLM Constellation Architecture for Healthcare	Subhabrata Mukherjee et.al.	2403.13313	translate	read	null
2024-03-19	FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer	Dongyeong Hwang et.al.	2403.12821	translate	read	link
2024-03-19	Real-time Speech Extraction Using Spatially Regularized Independent Low-rank Matrix Analysis and Rank-constrained Spatial Covariance Matrix Estimation	Yuto Ishikawa et.al.	2403.12477	translate	read	null
2024-03-19	An Empirical Study of Speech Language Models for Prompt-Conditioned Speech Synthesis	Yifan Peng et.al.	2403.12402	translate	read	null
2024-03-18	Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models	Linus Nwankwo et.al.	2403.12273	translate	read	null
2024-03-18	Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models	Emilian Postolache et.al.	2403.11706	translate	read	link
2024-03-18	QEAN: Quaternion-Enhanced Attention Network for Visual Dance Generation	Zhizhen Zhou et.al.	2403.11626	translate	read	null
2024-03-18	AdaMER-CTC: Connectionist Temporal Classification with Adaptive Maximum Entropy Regularization for Automatic Speech Recognition	SooHwan Eom et.al.	2403.11578	translate	read	null
2024-03-16	Energy-Based Models with Applications to Speech and Language Processing	Zhijian Ou et.al.	2403.10961	translate	read	null
2024-03-16	Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR	Savitha Murthy et.al.	2403.10937	translate	read	null
2024-03-15	MusicHiFi: Fast High-Fidelity Stereo Vocoding	Ge Zhu et.al.	2403.10493	translate	read	null
2024-03-15	Neural Networks Hear You Loud And Clear: Hearing Loss Compensation Using Deep Neural Networks	Peter Leer et.al.	2403.10420	translate	read	null
2024-03-14	SpokeN-100: A Cross-Lingual Benchmarking Dataset for The Classification of Spoken Numbers in Different Languages	René Groh et.al.	2403.09753	translate	read	link
2024-03-14	More than words: Advancements and challenges in speech recognition for singing	Anna Kruspe et.al.	2403.09298	translate	read	null
2024-03-13	Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition	Wenjing Zhu et.al.	2403.08258	translate	read	null
2024-03-13	SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation	Jiayu Du et.al.	2403.08196	translate	read	link
2024-03-13	Automatic Speech Recognition (ASR) for the Diagnosis of pronunciation of Speech Sound Disorders in Korean children	Taekyung Ahn et.al.	2403.08187	translate	read	null
2024-03-13	EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech	Ziqi Liang et.al.	2403.08164	translate	read	null
2024-03-12	Gujarati-English Code-Switching Speech Recognition using ensemble prediction of spoken language	Yash Sharma et.al.	2403.08011	translate	read	null
2024-03-12	Motifs, Phrases, and Beyond: The Modelling of Structure in Symbolic Music Generation	Keshav Bhandari et.al.	2403.07995	translate	read	null
2024-03-11	The evaluation of a code-switched Sepedi-English automatic speech recognition system	Amanda Phaladi et.al.	2403.07947	translate	read	null
2024-03-12	Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets	Jan Pešán et.al.	2403.07767	translate	read	null
2024-03-11	Real-Time Multimodal Cognitive Assistant for Emergency Medical Services	Keshara Weerasinghe et.al.	2403.06734	translate	read	null
2024-03-11	Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR	Yufeng Yang et.al.	2403.06387	translate	read	null
2024-03-10	SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations	Amit Meghanani et.al.	2403.06260	translate	read	null
2024-03-09	HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling	Chunhui Wang et.al.	2403.05989	translate	read	null
2024-03-09	Aligning Speech to Languages to Enhance Code-switching Speech Recognition	Hexin Liu et.al.	2403.05887	translate	read	null
2024-03-07	Classist Tools: Social Class Correlates with Performance in NLP	Amanda Cercas Curry et.al.	2403.04445	translate	read	null
2024-03-07	A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain	Qusai Abo Obaidah et.al.	2403.04280	translate	read	null
2024-03-07	A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition	Yusheng Dai et.al.	2403.04245	translate	read	link
2024-03-06	RADIA – Radio Advertisement Detection with Intelligent Analytics	Jorge Álvarez et.al.	2403.03538	translate	read	null
2024-03-06	Non-verbal information in spontaneous speech – towards a new framework of analysis	Tirza Biron et.al.	2403.03522	translate	read	null
2024-03-05	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	Zeqian Ju et.al.	2403.03100	translate	read	null
2024-03-05	AIx Speed: Playback Speed Optimization Using Listening Comprehension of Speech Recognition Models	Kazuki Kawamura et.al.	2403.02938	translate	read	null
2024-03-05	Single-Channel Robot Ego-Speech Filtering during Human-Robot Interaction	Yue Li et.al.	2403.02918	translate	read	null
2024-03-04	PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings	Joonas Kalda et.al.	2403.02288	translate	read	null
2024-03-04	What has LeBenchmark Learnt about French Syntax?	Zdravko Dugonjić et.al.	2403.02173	translate	read	null
2024-03-04	SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR	Zhiyun Fan et.al.	2403.02010	translate	read	null
2024-03-04	Language and Speech Technology for Central Kurdish Varieties	Sina Ahmadi et.al.	2403.01983	translate	read	link
2024-03-03	PAVITS: Exploring Prosody-aware VITS for End-to-End Emotional Voice Conversion	Tianhua Qi et.al.	2403.01494	translate	read	null
2024-03-03	A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement	Ravi Shankar et.al.	2403.01369	translate	read	null
2024-03-03	a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification	Hye-jin Shim et.al.	2403.01355	translate	read	link
2024-03-02	Automatic Speech Recognition using Advanced Deep Learning Approaches: A survey	Hamza Kheddar et.al.	2403.01255	translate	read	null
2024-03-02	Towards Accurate Lip-to-Speech Synthesis in-the-Wild	Sindhu Hegde et.al.	2403.01087	translate	read	null
2024-03-01	VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis	Weiwei Lin et.al.	2403.00529	translate	read	null
2024-03-01	Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview	Heyang Liu et.al.	2403.00370	translate	read	null
2024-03-01	Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification	Mufan Sang et.al.	2403.00293	translate	read	null
2024-03-01	Transcription and translation of videos using fine-tuned XLSR Wav2Vec2 on custom dataset and mBART	Aniket Tathe et.al.	2403.00212	translate	read	null