Audio Processing - 2024-10 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF	Translate	Read	Code
2024-10-31	IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision	Maxwell Meyer et.al.	2411.00252	translate	read	null
2024-10-31	Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?	Ioannis Tsiamas et.al.	2410.24019	translate	read	null
2024-10-31	Task-Aware Unified Source Separation	Kohei Saijo et.al.	2410.23987	translate	read	null
2024-10-30	Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis	Théodor Lemerle et.al.	2410.23320	translate	read	link
2024-10-30	Augmenting Polish Automatic Speech Recognition System With Synthetic Data	Łukasz Bondaruk et.al.	2410.22903	translate	read	null
2024-10-30	Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising	Yoto Fujita et.al.	2410.22805	translate	read	null
2024-10-29	Emotion-Guided Image to Music Generation	Souraja Kundu et.al.	2410.22299	translate	read	null
2024-10-29	Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding	Bohan Li et.al.	2410.21951	translate	read	null
2024-10-29	Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription	Can Cui et.al.	2410.21849	translate	read	null
2024-10-28	Asynchronous Tool Usage for Real-Time Agents	Antonio A. Ginart et.al.	2410.21620	translate	read	null
2024-10-28	Enhancing TTS Stability in Hebrew using Discrete Semantic Units	Ella Zeldes et.al.	2410.21502	translate	read	null
2024-10-28	Mitigating Unauthorized Speech Synthesis for Voice Protection	Zhisheng Zhang et.al.	2410.20742	translate	read	link
2024-10-27	Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors	Sadia Nowrin et.al.	2410.20564	translate	read	null
2024-10-27	Symbotunes: unified hub for symbolic music generative models	Paweł Skierś et.al.	2410.20515	translate	read	link
2024-10-27	MusicFlow: Cascaded Flow Matching for Text Guided Music Generation	K R Prajwal et.al.	2410.20478	translate	read	null
2024-10-27	Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation	Maohao Shen et.al.	2410.20336	translate	read	null
2024-10-27	Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs	Enshi Zhang et.al.	2410.20334	translate	read	null
2024-10-26	emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography	Viswanath Sivakumar et.al.	2410.20081	translate	read	link
2024-10-24	Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis	Suparna De et.al.	2410.19199	translate	read	null
2024-10-25	A Survey on Speech Large Language Models	Jing Peng et.al.	2410.18908	translate	read	null
2024-10-24	We Augmented Whisper With kNN and You Won’t Believe What Came Next	Maya K. Nachesa et.al.	2410.18850	translate	read	null
2024-10-24	STTATTS: Unified Speech-To-Text And Text-To-Speech Model	Hawau Olamide Toyin et.al.	2410.18607	translate	read	null
2024-10-24	Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts	ChaeHun Park et.al.	2410.18444	translate	read	null
2024-10-24	Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model	Vishakha Lall et.al.	2410.18363	translate	read	null
2024-10-23	Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment	Weiliang Luo et.al.	2410.18151	translate	read	link
2024-10-23	ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams	Srija Anand et.al.	2410.17901	translate	read	null
2024-10-23	OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation	Qinglin Zhang et.al.	2410.17799	translate	read	link
2024-10-23	Exploring Tokenization Methods for Multitrack Sheet Music Generation	Yashan Wang et.al.	2410.17584	translate	read	null
2024-10-23	VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning	Yifan Peng et.al.	2410.17485	translate	read	null
2024-10-22	mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar	Suryoday Basak et.al.	2410.17457	translate	read	null
2024-10-22	Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models	Alexander Polok et.al.	2410.17437	translate	read	null
2024-10-22	VoiceBench: Benchmarking LLM-Based Voice Assistants	Yiming Chen et.al.	2410.17196	translate	read	link
2024-10-22	Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification	Wen Huang et.al.	2410.17033	translate	read	null
2024-10-22	Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap	Guanrou Yang et.al.	2410.16726	translate	read	null
2024-10-22	DENOASR: Debiasing ASRs through Selective Denoising	Anand Kumar Rai et.al.	2410.16712	translate	read	null
2024-10-21	AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition	Zehua Liu et.al.	2410.16438	translate	read	link
2024-10-21	Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification	Wan Lin et.al.	2410.16428	translate	read	null
2024-10-21	Continuous Speech Synthesis using per-token Latent Diffusion	Arnon Turetzky et.al.	2410.16048	translate	read	null
2024-10-21	LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec	Yiwei Guo et.al.	2410.15764	translate	read	null
2024-10-21	Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation	Victor Junqiu Wei et.al.	2410.15620	translate	read	null
2024-10-21	Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding	Yeonjoon Jung et.al.	2410.15609	translate	read	null
2024-10-21	Moonshine: Speech Recognition for Live Transcription and Voice Commands	Nat Jeffries et.al.	2410.15608	translate	read	link
2024-10-20	Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example	Suhita Ghosh et.al.	2410.15500	translate	read	link
2024-10-20	Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses	Suhita Ghosh et.al.	2410.15499	translate	read	null
2024-10-20	Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant	Alan Dao et.al.	2410.15316	translate	read	link
2024-10-19	Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention	Yuzhe Weng et.al.	2410.15029	translate	read	link
2024-10-18	AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup	Carlos Carvalho et.al.	2410.14910	translate	read	null
2024-10-18	A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages	Sujitha Sathiyamoorthy et.al.	2410.14197	translate	read	null
2024-10-17	Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding	Tan Dat Nguyen et.al.	2410.13839	translate	read	null
2024-10-17	Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR	Abhishek Gupta et.al.	2410.13445	translate	read	null
2024-10-17	MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit	Yutian Wang et.al.	2410.13419	translate	read	null
2024-10-17	DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech	Jan Melechovsky et.al.	2410.13342	translate	read	null
2024-10-17	Computational Approaches to Arabic-English Code-Switching	Caroline Sabty et.al.	2410.13318	translate	read	null
2024-10-17	DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis	Yu Gu et.al.	2410.13288	translate	read	null
2024-10-17	Roadmap towards Superhuman Speech Understanding using Large Language Models	Fan Bu et.al.	2410.13268	translate	read	null
2024-10-17	Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation	Sreyan Ghosh et.al.	2410.13198	translate	read	null
2024-10-17	EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning	Ashish Seth et.al.	2410.13179	translate	read	link
2024-10-17	Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities	Xiangping Chen et.al.	2410.13110	translate	read	null
2024-10-16	Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR	Christoph Minixhofer et.al.	2410.12279	translate	read	null
2024-10-16	Guided Speaker Embedding	Shota Horiguchi et.al.	2410.12182	translate	read	null
2024-10-15	A Framework for Adapting Human-Robot Interaction to Diverse User Groups	Theresa Pekarek Rosin et.al.	2410.11377	translate	read	null
2024-10-15	Investigation of Speaker Representation for Target-Speaker Speech Processing	Takanori Ashihara et.al.	2410.11243	translate	read	null
2024-10-14	DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization	Yinghao Aaron Li et.al.	2410.11097	translate	read	null
2024-10-14	Character-aware audio-visual subtitling in context	Jaesung Huh et.al.	2410.11068	translate	read	null
2024-10-14	Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers	Gabriel Souza et.al.	2410.10515	translate	read	null
2024-10-14	Everyday Speech in the Indian Subcontinent	Utkarsh Pathak et.al.	2410.10508	translate	read	null
2024-10-14	In-Materia Speech Recognition	Mohamadreza Zolfagharinejad et.al.	2410.10434	translate	read	null
2024-10-13	State of NLP in Kenya: A Survey	Cynthia Jayne Amol et.al.	2410.09948	translate	read	null
2024-10-13	M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models	Megha Sharma et.al.	2410.09928	translate	read	link
2024-10-12	SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs	Wenxi Chen et.al.	2410.09503	translate	read	link
2024-10-12	Automatic Speech Recognition with BERT and CTC Transformers: A Review	Noussaiba Djeffal et.al.	2410.09456	translate	read	null
2024-10-11	UniGlyph: A Seven-Segment Script for Universal Language Representation	G. V. Bency Sherin et.al.	2410.08974	translate	read	null
2024-10-14	Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities	Aulia Adila et.al.	2410.08828	translate	read	null
2024-10-11	Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation	Yishan Lv et.al.	2410.08626	translate	read	null
2024-10-11	Symbolic Music Generation with Fine-grained Interactive Textural Guidance	Tingyu Zhu et.al.	2410.08435	translate	read	null
2024-10-10	SoundScape: A Human-AI Co-Creation System Making Your Memories Heard	Chongjun Zhong et.al.	2410.08136	translate	read	null
2024-10-10	Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models	Adriana Fernandez-Lopez et.al.	2410.07771	translate	read	null
2024-10-09	The First VoicePrivacy Attacker Challenge Evaluation Plan	Natalia Tomashenko et.al.	2410.07428	translate	read	link
2024-10-09	Advocating Character Error Rate for Multilingual ASR Evaluation	Thennal D K et.al.	2410.07400	translate	read	null
2024-10-09	Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch	Teodora Răgman et.al.	2410.06787	translate	read	null
2024-10-09	Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS	Onkar Kishor Susladkar et.al.	2410.06608	translate	read	null
2024-10-08	Diversity-Rewarded CFG Distillation	Geoffrey Cideron et.al.	2410.06084	translate	read	null
2024-10-08	The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge	Ya Jiang et.al.	2410.05986	translate	read	null
2024-10-08	Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching	Leonardo B. de M. M. Marques et.al.	2410.05620	translate	read	link
2024-10-07	Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments	Sagarika Alavilli et.al.	2410.05423	translate	read	null
2024-10-07	Presto! Distilling Steps and Layers for Accelerating Music Generation	Zachary Novack et.al.	2410.05167	translate	read	null
2024-10-07	Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer	Siyuan Hou et.al.	2410.05151	translate	read	null
2024-10-07	Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars	Navid Ashrafi et.al.	2410.05131	translate	read	null
2024-10-07	CR-CTC: Consistency regularization on CTC for improved speech recognition	Zengwei Yao et.al.	2410.05101	translate	read	null
2024-10-07	Improving Speaker Representations Using Contrastive Losses on Multi-scale Features	Satvik Dixit et.al.	2410.05037	translate	read	null
2024-10-06	Punctuation Prediction for Polish Texts using Transformers	Jakub Pokrywka et.al.	2410.04621	translate	read	null
2024-10-06	Casablanca: Data and Models for Multidialectal Arabic Speech Recognition	Bashar Talafha et.al.	2410.04527	translate	read	null
2024-10-06	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	Yuto Nishimura et.al.	2410.04380	translate	read	null
2024-10-06	SONAR: A Synthetic AI-Audio Detection Framework and Benchmark	Xiang Li et.al.	2410.04324	translate	read	link
2024-10-05	Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer	Tomoki Honda et.al.	2410.04159	translate	read	link
2024-10-04	Generative Semantic Communication for Text-to-Speech Synthesis	Jiahao Zheng et.al.	2410.03459	translate	read	null
2024-10-04	Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges	Nguyen Van Dinh et.al.	2410.03458	translate	read	null
2024-10-04	Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques	Olga Iakovenko et.al.	2410.03412	translate	read	null
2024-10-04	MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech	Taejun Bak et.al.	2410.03192	translate	read	null
2024-10-03	Disentangling Textual and Acoustic Features of Neural Speech Representations	Hosein Mohebbi et.al.	2410.03037	translate	read	null
2024-10-03	Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR	Hainan Xu et.al.	2410.02597	translate	read	null
2024-10-04	Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition	Olga Iakovenko et.al.	2410.02560	translate	read	null
2024-10-03	Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems	Olga Iakovenko et.al.	2410.02538	translate	read	null
2024-10-03	State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data	Sara Barahona et.al.	2410.02364	translate	read	null
2024-10-03	A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker’s Shadowings	Haopeng Geng et.al.	2410.02239	translate	read	null
2024-10-02	Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset	Weihan Xu et.al.	2410.02084	translate	read	null
2024-10-02	Spoken Grammar Assessment Using LLM	Sunil Kumar Kopparapu et.al.	2410.01579	translate	read	null
2024-10-02	Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling	Yuguang Yang et.al.	2410.01350	translate	read	null
2024-10-01	MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages	Marco Gaido et.al.	2410.01036	translate	read	link
2024-10-01	Automatic Speech Recognition for the Ika Language	Uchenna Nzenwata et.al.	2410.00940	translate	read	null
2024-10-01	Do Music Generation Models Encode Music Theory?	Megan Wei et.al.	2410.00872	translate	read	null
2024-10-01	VHASR: A Multimodal Speech Recognition System With Vision Hotwords	Jiliang Hu et.al.	2410.00822	translate	read	link
2024-10-01	Improving curriculum learning for target speaker extraction with synthetic speakers	Yun Liu et.al.	2410.00811	translate	read	null
2024-10-01	End-to-End Speech Recognition with Pre-trained Masked Language Model	Yosuke Higuchi et.al.	2410.00528	translate	read	null
2024-10-02	Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces	Lilac Atassi et.al.	2410.00344	translate	read	null
2024-10-01	EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control	Haozhe Chen et.al.	2410.00316	translate	read	null
