Audio Processing - 2024-10

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|:---|:---|:---|:---|:---|:---|:---|
| 2024-10-31 | IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision | Maxwell Meyer et al. | 2411.00252 | translate | read | null |
| 2024-10-31 | Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Ioannis Tsiamas et al. | 2410.24019 | translate | read | null |
| 2024-10-31 | Task-Aware Unified Source Separation | Kohei Saijo et al. | 2410.23987 | translate | read | null |
| 2024-10-30 | Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Théodor Lemerle et al. | 2410.23320 | translate | read | link |
| 2024-10-30 | Augmenting Polish Automatic Speech Recognition System With Synthetic Data | Łukasz Bondaruk et al. | 2410.22903 | translate | read | null |
| 2024-10-30 | Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising | Yoto Fujita et al. | 2410.22805 | translate | read | null |
| 2024-10-29 | Emotion-Guided Image to Music Generation | Souraja Kundu et al. | 2410.22299 | translate | read | null |
| 2024-10-29 | Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Bohan Li et al. | 2410.21951 | translate | read | null |
| 2024-10-29 | Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription | Can Cui et al. | 2410.21849 | translate | read | null |
| 2024-10-28 | Asynchronous Tool Usage for Real-Time Agents | Antonio A. Ginart et al. | 2410.21620 | translate | read | null |
| 2024-10-28 | Enhancing TTS Stability in Hebrew using Discrete Semantic Units | Ella Zeldes et al. | 2410.21502 | translate | read | null |
| 2024-10-28 | Mitigating Unauthorized Speech Synthesis for Voice Protection | Zhisheng Zhang et al. | 2410.20742 | translate | read | link |
| 2024-10-27 | Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors | Sadia Nowrin et al. | 2410.20564 | translate | read | null |
| 2024-10-27 | Symbotunes: unified hub for symbolic music generative models | Paweł Skierś et al. | 2410.20515 | translate | read | link |
| 2024-10-27 | MusicFlow: Cascaded Flow Matching for Text Guided Music Generation | K R Prajwal et al. | 2410.20478 | translate | read | null |
| 2024-10-27 | Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Maohao Shen et al. | 2410.20336 | translate | read | null |
| 2024-10-27 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs | Enshi Zhang et al. | 2410.20334 | translate | read | null |
| 2024-10-26 | emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography | Viswanath Sivakumar et al. | 2410.20081 | translate | read | link |
| 2024-10-24 | Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Suparna De et al. | 2410.19199 | translate | read | null |
| 2024-10-25 | A Survey on Speech Large Language Models | Jing Peng et al. | 2410.18908 | translate | read | null |
| 2024-10-24 | We Augmented Whisper With kNN and You Won’t Believe What Came Next | Maya K. Nachesa et al. | 2410.18850 | translate | read | null |
| 2024-10-24 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin et al. | 2410.18607 | translate | read | null |
| 2024-10-24 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | ChaeHun Park et al. | 2410.18444 | translate | read | null |
| 2024-10-24 | Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model | Vishakha Lall et al. | 2410.18363 | translate | read | null |
| 2024-10-23 | Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment | Weiliang Luo et al. | 2410.18151 | translate | read | link |
| 2024-10-23 | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Srija Anand et al. | 2410.17901 | translate | read | null |
| 2024-10-23 | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | Qinglin Zhang et al. | 2410.17799 | translate | read | link |
| 2024-10-23 | Exploring Tokenization Methods for Multitrack Sheet Music Generation | Yashan Wang et al. | 2410.17584 | translate | read | null |
| 2024-10-23 | VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning | Yifan Peng et al. | 2410.17485 | translate | read | null |
| 2024-10-22 | mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar | Suryoday Basak et al. | 2410.17457 | translate | read | null |
| 2024-10-22 | Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models | Alexander Polok et al. | 2410.17437 | translate | read | null |
| 2024-10-22 | VoiceBench: Benchmarking LLM-Based Voice Assistants | Yiming Chen et al. | 2410.17196 | translate | read | link |
| 2024-10-22 | Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification | Wen Huang et al. | 2410.17033 | translate | read | null |
| 2024-10-22 | Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Guanrou Yang et al. | 2410.16726 | translate | read | null |
| 2024-10-22 | DENOASR: Debiasing ASRs through Selective Denoising | Anand Kumar Rai et al. | 2410.16712 | translate | read | null |
| 2024-10-21 | AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Zehua Liu et al. | 2410.16438 | translate | read | link |
| 2024-10-21 | Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification | Wan Lin et al. | 2410.16428 | translate | read | null |
| 2024-10-21 | Continuous Speech Synthesis using per-token Latent Diffusion | Arnon Turetzky et al. | 2410.16048 | translate | read | null |
| 2024-10-21 | LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec | Yiwei Guo et al. | 2410.15764 | translate | read | null |
| 2024-10-21 | Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation | Victor Junqiu Wei et al. | 2410.15620 | translate | read | null |
| 2024-10-21 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | Yeonjoon Jung et al. | 2410.15609 | translate | read | null |
| 2024-10-21 | Moonshine: Speech Recognition for Live Transcription and Voice Commands | Nat Jeffries et al. | 2410.15608 | translate | read | link |
| 2024-10-20 | Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example | Suhita Ghosh et al. | 2410.15500 | translate | read | link |
| 2024-10-20 | Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses | Suhita Ghosh et al. | 2410.15499 | translate | read | null |
| 2024-10-20 | Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | Alan Dao et al. | 2410.15316 | translate | read | link |
| 2024-10-19 | Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention | Yuzhe Weng et al. | 2410.15029 | translate | read | link |
| 2024-10-18 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup | Carlos Carvalho et al. | 2410.14910 | translate | read | null |
| 2024-10-18 | A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Sujitha Sathiyamoorthy et al. | 2410.14197 | translate | read | null |
| 2024-10-17 | Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding | Tan Dat Nguyen et al. | 2410.13839 | translate | read | null |
| 2024-10-17 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR | Abhishek Gupta et al. | 2410.13445 | translate | read | null |
| 2024-10-17 | MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit | Yutian Wang et al. | 2410.13419 | translate | read | null |
| 2024-10-17 | DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Jan Melechovsky et al. | 2410.13342 | translate | read | null |
| 2024-10-17 | Computational Approaches to Arabic-English Code-Switching | Caroline Sabty et al. | 2410.13318 | translate | read | null |
| 2024-10-17 | DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Yu Gu et al. | 2410.13288 | translate | read | null |
| 2024-10-17 | Roadmap towards Superhuman Speech Understanding using Large Language Models | Fan Bu et al. | 2410.13268 | translate | read | null |
| 2024-10-17 | Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Sreyan Ghosh et al. | 2410.13198 | translate | read | null |
| 2024-10-17 | EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth et al. | 2410.13179 | translate | read | link |
| 2024-10-17 | Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities | Xiangping Chen et al. | 2410.13110 | translate | read | null |
| 2024-10-16 | Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR | Christoph Minixhofer et al. | 2410.12279 | translate | read | null |
| 2024-10-16 | Guided Speaker Embedding | Shota Horiguchi et al. | 2410.12182 | translate | read | null |
| 2024-10-15 | A Framework for Adapting Human-Robot Interaction to Diverse User Groups | Theresa Pekarek Rosin et al. | 2410.11377 | translate | read | null |
| 2024-10-15 | Investigation of Speaker Representation for Target-Speaker Speech Processing | Takanori Ashihara et al. | 2410.11243 | translate | read | null |
| 2024-10-14 | DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization | Yinghao Aaron Li et al. | 2410.11097 | translate | read | null |
| 2024-10-14 | Character-aware audio-visual subtitling in context | Jaesung Huh et al. | 2410.11068 | translate | read | null |
| 2024-10-14 | Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers | Gabriel Souza et al. | 2410.10515 | translate | read | null |
| 2024-10-14 | Everyday Speech in the Indian Subcontinent | Utkarsh Pathak et al. | 2410.10508 | translate | read | null |
| 2024-10-14 | In-Materia Speech Recognition | Mohamadreza Zolfagharinejad et al. | 2410.10434 | translate | read | null |
| 2024-10-13 | State of NLP in Kenya: A Survey | Cynthia Jayne Amol et al. | 2410.09948 | translate | read | null |
| 2024-10-13 | M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models | Megha Sharma et al. | 2410.09928 | translate | read | link |
| 2024-10-12 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Wenxi Chen et al. | 2410.09503 | translate | read | link |
| 2024-10-12 | Automatic Speech Recognition with BERT and CTC Transformers: A Review | Noussaiba Djeffal et al. | 2410.09456 | translate | read | null |
| 2024-10-11 | UniGlyph: A Seven-Segment Script for Universal Language Representation | G. V. Bency Sherin et al. | 2410.08974 | translate | read | null |
| 2024-10-14 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities | Aulia Adila et al. | 2410.08828 | translate | read | null |
| 2024-10-11 | Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation | Yishan Lv et al. | 2410.08626 | translate | read | null |
| 2024-10-11 | Symbolic Music Generation with Fine-grained Interactive Textural Guidance | Tingyu Zhu et al. | 2410.08435 | translate | read | null |
| 2024-10-10 | SoundScape: A Human-AI Co-Creation System Making Your Memories Heard | Chongjun Zhong et al. | 2410.08136 | translate | read | null |
| 2024-10-10 | Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models | Adriana Fernandez-Lopez et al. | 2410.07771 | translate | read | null |
| 2024-10-09 | The First VoicePrivacy Attacker Challenge Evaluation Plan | Natalia Tomashenko et al. | 2410.07428 | translate | read | link |
| 2024-10-09 | Advocating Character Error Rate for Multilingual ASR Evaluation | Thennal D K et al. | 2410.07400 | translate | read | null |
| 2024-10-09 | Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Teodora Răgman et al. | 2410.06787 | translate | read | null |
| 2024-10-09 | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar et al. | 2410.06608 | translate | read | null |
| 2024-10-08 | Diversity-Rewarded CFG Distillation | Geoffrey Cideron et al. | 2410.06084 | translate | read | null |
| 2024-10-08 | The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge | Ya Jiang et al. | 2410.05986 | translate | read | null |
| 2024-10-08 | Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching | Leonardo B. de M. M. Marques et al. | 2410.05620 | translate | read | link |
| 2024-10-07 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments | Sagarika Alavilli et al. | 2410.05423 | translate | read | null |
| 2024-10-07 | Presto! Distilling Steps and Layers for Accelerating Music Generation | Zachary Novack et al. | 2410.05167 | translate | read | null |
| 2024-10-07 | Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer | Siyuan Hou et al. | 2410.05151 | translate | read | null |
| 2024-10-07 | Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars | Navid Ashrafi et al. | 2410.05131 | translate | read | null |
| 2024-10-07 | CR-CTC: Consistency regularization on CTC for improved speech recognition | Zengwei Yao et al. | 2410.05101 | translate | read | null |
| 2024-10-07 | Improving Speaker Representations Using Contrastive Losses on Multi-scale Features | Satvik Dixit et al. | 2410.05037 | translate | read | null |
| 2024-10-06 | Punctuation Prediction for Polish Texts using Transformers | Jakub Pokrywka et al. | 2410.04621 | translate | read | null |
| 2024-10-06 | Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha et al. | 2410.04527 | translate | read | null |
| 2024-10-06 | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Yuto Nishimura et al. | 2410.04380 | translate | read | null |
| 2024-10-06 | SONAR: A Synthetic AI-Audio Detection Framework and Benchmark | Xiang Li et al. | 2410.04324 | translate | read | link |
| 2024-10-05 | Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer | Tomoki Honda et al. | 2410.04159 | translate | read | link |
| 2024-10-04 | Generative Semantic Communication for Text-to-Speech Synthesis | Jiahao Zheng et al. | 2410.03459 | translate | read | null |
| 2024-10-04 | Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges | Nguyen Van Dinh et al. | 2410.03458 | translate | read | null |
| 2024-10-04 | Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques | Olga Iakovenko et al. | 2410.03412 | translate | read | null |
| 2024-10-04 | MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak et al. | 2410.03192 | translate | read | null |
| 2024-10-03 | Disentangling Textual and Acoustic Features of Neural Speech Representations | Hosein Mohebbi et al. | 2410.03037 | translate | read | null |
| 2024-10-03 | Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR | Hainan Xu et al. | 2410.02597 | translate | read | null |
| 2024-10-04 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition | Olga Iakovenko et al. | 2410.02560 | translate | read | null |
| 2024-10-03 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems | Olga Iakovenko et al. | 2410.02538 | translate | read | null |
| 2024-10-03 | State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data | Sara Barahona et al. | 2410.02364 | translate | read | null |
| 2024-10-03 | A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker’s Shadowings | Haopeng Geng et al. | 2410.02239 | translate | read | null |
| 2024-10-02 | Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset | Weihan Xu et al. | 2410.02084 | translate | read | null |
| 2024-10-02 | Spoken Grammar Assessment Using LLM | Sunil Kumar Kopparapu et al. | 2410.01579 | translate | read | null |
| 2024-10-02 | Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling | Yuguang Yang et al. | 2410.01350 | translate | read | null |
| 2024-10-01 | MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages | Marco Gaido et al. | 2410.01036 | translate | read | link |
| 2024-10-01 | Automatic Speech Recognition for the Ika Language | Uchenna Nzenwata et al. | 2410.00940 | translate | read | null |
| 2024-10-01 | Do Music Generation Models Encode Music Theory? | Megan Wei et al. | 2410.00872 | translate | read | null |
| 2024-10-01 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu et al. | 2410.00822 | translate | read | link |
| 2024-10-01 | Improving curriculum learning for target speaker extraction with synthetic speakers | Yun Liu et al. | 2410.00811 | translate | read | null |
| 2024-10-01 | End-to-End Speech Recognition with Pre-trained Masked Language Model | Yosuke Higuchi et al. | 2410.00528 | translate | read | null |
| 2024-10-02 | Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces | Lilac Atassi et al. | 2410.00344 | translate | read | null |
| 2024-10-01 | EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Haozhe Chen et al. | 2410.00316 | translate | read | null |

(<a href=../Audio_Processing.md>back to Audio Processing</a>)