Audio Processing - 2024-08 | Paper Arxiv Daily

Publish Date	Title	Authors	arXiv ID	Translate	Read	Code
2024-08-30	Advancing Multi-talker ASR Performance with Large Language Models	Mohan Shi et.al.	2408.17431	translate	read	null
2024-08-30	AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge	Kirill Borodin et.al.	2408.17352	translate	read	null
2024-08-30	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	Zhen Ye et.al.	2408.17175	translate	read	link
2024-08-30	Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings	Shota Horiguchi et.al.	2408.17142	translate	read	null
2024-08-30	Generative Modeling Perspective for Control and Reasoning in Robotics	Takuma Yoneda et.al.	2408.17041	translate	read	null
2024-08-30	Utilizing Speaker Profiles for Impersonation Audio Detection	Hao Gu et.al.	2408.17009	translate	read	null
2024-08-30	Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	Zhifei Xie et.al.	2408.16725	translate	read	link
2024-08-29	CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions	Laurin Wagner et.al.	2408.16589	translate	read	link
2024-08-29	Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing	Qianhui Liu et.al.	2408.16564	translate	read	null
2024-08-29	RAVE for Speech: Efficient Voice Conversion at High Sampling Rates	Anders R. Bargum et.al.	2408.16546	translate	read	null
2024-08-29	Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis	Zehai Tu et.al.	2408.16373	translate	read	null
2024-08-29	Measuring the Accuracy of Automatic Speech Recognition Solutions	Korbinian Kuhn et.al.	2408.16287	translate	read	link
2024-08-29	Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation	Lun Wang et.al.	2408.16204	translate	read	null
2024-08-29	Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction	Yuka Ko et.al.	2408.16180	translate	read	null
2024-08-28	Spoofing-Robust Speaker Verification Using Parallel Embedding Fusion: BTU Speech Group’s Approach for ASVspoof5 Challenge	Oğuzhan Kurnaz et.al.	2408.15877	translate	read	null
2024-08-28	VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling	Yixuan Zhou et.al.	2408.15676	translate	read	link
2024-08-28	Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications	Korbinian Kuhn et.al.	2408.15616	translate	read	link
2024-08-28	Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models	Yiyang Zhao et.al.	2408.15585	translate	read	null
2024-08-28	EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models	Wenhan Yao et.al.	2408.15508	translate	read	null
2024-08-27	Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement	Longshen Ou et.al.	2408.15176	translate	read	null
2024-08-27	Speech Recognition Transformers: Topological-lingualism Perspective	Shruti Singh et.al.	2408.14991	translate	read	null
2024-08-27	Literary and Colloquial Dialect Identification for Tamil using Acoustic Features	M. Nanmalar et.al.	2408.14887	translate	read	null
2024-08-27	The VoxCeleb Speaker Recognition Challenge: A Retrospective	Jaesung Huh et.al.	2408.14886	translate	read	null
2024-08-27	MaskCycleGAN-based Whisper to Normal Speech Conversion	K. Rohith Gupta et.al.	2408.14797	translate	read	null
2024-08-26	MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues	Kuluhan Binici et.al.	2408.14418	translate	read	null
2024-08-26	Self-supervised Speech Representations Still Struggle with African American Vernacular English	Kalvin Chang et.al.	2408.14262	translate	read	link
2024-08-26	Automatic recognition and detection of aphasic natural speech	Mara Barberis et.al.	2408.14082	translate	read	null
2024-08-26	Research Advances and New Paradigms for Biology-inspired Spiking Neural Networks	Tianyu Zheng et.al.	2408.13996	translate	read	null
2024-08-26	Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard	Wonjune Kang et.al.	2408.13970	translate	read	null
2024-08-25	Literary and Colloquial Tamil Dialect Identification	M. Nanmalar et.al.	2408.13739	translate	read	null
2024-08-24	Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification	Aditya Dawn et.al.	2408.13644	translate	read	null
2024-08-24	As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research	Wiebke Hutiri et.al.	2408.13614	translate	read	null
2024-08-24	SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description	Zeyu Jin et.al.	2408.13608	translate	read	link
2024-08-23	Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples	Zhenyu Wang et.al.	2408.13341	translate	read	null
2024-08-23	Which Prosodic Features Matter Most for Pragmatics?	Nigel G. Ward et.al.	2408.13240	translate	read	null
2024-08-23	NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks	He Huang et.al.	2408.13106	translate	read	null
2024-08-23	Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models	Adnan Haider et.al.	2408.13008	translate	read	null
2024-08-22	Towards measuring fairness in speech recognition: Fair-Speech dataset	Irina-Elena Veliche et.al.	2408.12734	translate	read	null
2024-08-22	WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech	Hirotaka Hiraki et.al.	2408.12500	translate	read	null
2024-08-22	Positional Description for Numerical Normalization	Deepanshu Gupta et.al.	2408.12430	translate	read	null
2024-08-22	LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation	Shihao Chen et.al.	2408.12354	translate	read	null
2024-08-22	Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features	Shaoxiang Dang et.al.	2408.12279	translate	read	null
2024-08-21	The State of Commercial Automatic French Legal Speech Recognition Systems and their Impact on Court Reporters et al	Nicolad Garneau et.al.	2408.11940	translate	read	null
2024-08-21	Approaching Deep Learning through the Spectral Dynamics of Weights	David Yunis et.al.	2408.11804	translate	read	link
2024-08-22	A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification	Xujiang Xing et.al.	2408.11562	translate	read	null
2024-08-21	Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech	Anastasia Avdeeva et.al.	2408.11528	translate	read	null
2024-08-21	Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers	Prashant Serai et.al.	2408.11258	translate	read	null
2024-08-20	BUT Systems and Analyses for the ASVspoof 5 Challenge	Johan Rohdin et.al.	2408.11152	translate	read	null
2024-08-20	AI-Based IVR	Gassyrbek Kosherbay et.al.	2408.10549	translate	read	null
2024-08-20	XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition	Xucheng Wan et.al.	2408.10524	translate	read	null
2024-08-19	ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge	Juan M. Martín-Doñas et.al.	2408.10361	translate	read	null
2024-08-19	Hear Your Face: Face-based voice conversion with F0 estimation	Jaejun Lee et.al.	2408.09802	translate	read	null
2024-08-19	Unsupervised Composable Representations for Audio	Giovanni Bindi et.al.	2408.09792	translate	read	null
2024-08-19	Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts	Jiaqing Liu et.al.	2408.09688	translate	read	null
2024-08-18	A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition	Yangze Li et.al.	2408.09491	translate	read	null
2024-08-17	Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model	Massimiliano Todisco et.al.	2408.09300	translate	read	null
2024-08-17	Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition	Samuele Cornell et.al.	2408.09215	translate	read	null
2024-08-14	Supervised and Unsupervised Alignments for Spoofing Behavioral Biometrics	Thomas Thebaud et.al.	2408.08918	translate	read	null
2024-08-16	ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale	Xin Wang et.al.	2408.08739	translate	read	null
2024-08-15	Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words	Kento Nozawa et.al.	2408.08027	translate	read	null
2024-08-14	SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition	Mohamed Osman et.al.	2408.07851	translate	read	link
2024-08-14	WavLM model ensemble for audio deepfake detection	David Combei et.al.	2408.07414	translate	read	null
2024-08-14	DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement	Tao Sun et.al.	2408.07388	translate	read	null
2024-08-13	Play Me Something Icy: Practical Challenges, Explainability and the Semantic Gap in Generative AI Music	Jesse Allison et.al.	2408.07224	translate	read	null
2024-08-13	VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders	Yubing Cao et.al.	2408.06906	translate	read	null
2024-08-13	SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis	Osamu Take et.al.	2408.06858	translate	read	link
2024-08-13	PRESENT: Zero-Shot Text-to-Prosody Control	Perry Lam et.al.	2408.06827	translate	read	link
2024-08-13	Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation	Matthias Bartolo et.al.	2408.06804	translate	read	link
2024-08-12	Cross-Lingual Conversational Speech Summarization with Large Language Models	Max Nelson et.al.	2408.06484	translate	read	null
2024-08-12	Audio Enhancement for Computer Audition – An Iterative Training Paradigm Using Sample Importance	Manuel Milling et.al.	2408.06264	translate	read	null
2024-08-12	Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning	Wonjun Lee et.al.	2408.06043	translate	read	null
2024-08-12	Controlling Surprisal in Music Generation via Information Content Curve Matching	Mathias Rose Bjare et.al.	2408.06022	translate	read	link
2024-08-11	LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition	Eunseop Yoon et.al.	2408.05769	translate	read	null
2024-08-11	VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing	Chunyu Qiang et.al.	2408.05758	translate	read	null
2024-08-10	Improving Whisper’s Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text	Jinpeng Li et.al.	2408.05554	translate	read	null
2024-08-09	MooER: LLM-based Speech Recognition and Translation Models from Moore Threads	Junhao Xu et.al.	2408.05101	translate	read	null
2024-08-09	TEAdapter: Supply abundant guidance for controllable text-to-music generation	Jialing Zou et.al.	2408.04865	translate	read	null
2024-08-08	MulliVC: Multi-lingual Voice Conversion With Cycle Consistency	Jiawei Huang et.al.	2408.04708	translate	read	null
2024-08-08	NeuralMultiling: A Novel Neural Architecture Search for Smartphone based Multilingual Speaker Verification	Aravinda Reddy PN et.al.	2408.04362	translate	read	null
2024-08-08	HydraFormer: One Encoder For All Subsampling Rates	Yaoxun Xu et.al.	2408.04325	translate	read	link
2024-08-08	Preserving spoken content in voice anonymisation with character-level vocoder conditioning	Michele Panariello et.al.	2408.04306	translate	read	null
2024-08-08	wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech	Khai Le-Duc et.al.	2408.04174	translate	read	null
2024-08-07	Speaker Adaptation for Quantised End-to-End ASR Models	Qiuming Zhao et.al.	2408.03979	translate	read	null
2024-08-06	Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training	Hawraz A. Ahmad et.al.	2408.03887	translate	read	null
2024-08-07	Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation	Karn N. Watcharasupat et.al.	2408.03588	translate	read	null
2024-08-06	ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval	Ruixiang Zhao et.al.	2408.02978	translate	read	null
2024-08-06	Self-Supervised Learning for Multi-Channel Neural Transducer	Atsushi Kojima et.al.	2408.02945	translate	read	null
2024-08-05	Automatic Voice Identification after Speech Resynthesis using PPG	Thibault Gaudier et.al.	2408.02712	translate	read	null
2024-08-05	Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition	Jaeyoung Kim et.al.	2408.02582	translate	read	null
2024-08-05	The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024	He Wang et.al.	2408.02369	translate	read	null
2024-08-05	StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion	Zhichao Wang et.al.	2408.02178	translate	read	null
2024-08-04	Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model	Shipei Liu et.al.	2408.01950	translate	read	null
2024-08-03	ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features	Peng Cheng et.al.	2408.01808	translate	read	null
2024-08-03	Generating High-quality Symbolic Music Using Fine-grained Discriminators	Zhedong Zhang et.al.	2408.01696	translate	read	null
2024-08-02	EmoBack: Backdoor Attacks Against Speaker Identification Using Emotional Prosody	Coen Schoof et.al.	2408.01178	translate	read	null
2024-08-01	Expressive MIDI-format Piano Performance Generation	Jingwei Liu et.al.	2408.00900	translate	read	null
2024-08-01	SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data	Yichen Lu et.al.	2408.00624	translate	read	null
2024-08-01	Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation	Xinhan Di et.al.	2408.00284	translate	read	null
2024-08-01	Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation	Kohei Matsuura et.al.	2408.00205	translate	read	null
2024-08-01	Generative Expressive Conversational Speech Synthesis	Rui Liu et.al.	2407.21491	translate	read	null
