Audio Processing - 2024-10
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-10-31 | IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision | Maxwell Meyer et.al. | 2411.00252 | translate | read | null |
| 2024-10-31 | Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody? | Ioannis Tsiamas et.al. | 2410.24019 | translate | read | null |
| 2024-10-31 | Task-Aware Unified Source Separation | Kohei Saijo et.al. | 2410.23987 | translate | read | null |
| 2024-10-30 | Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis | Théodor Lemerle et.al. | 2410.23320 | translate | read | link |
| 2024-10-30 | Augmenting Polish Automatic Speech Recognition System With Synthetic Data | Łukasz Bondaruk et.al. | 2410.22903 | translate | read | null |
| 2024-10-30 | Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising | Yoto Fujita et.al. | 2410.22805 | translate | read | null |
| 2024-10-29 | Emotion-Guided Image to Music Generation | Souraja Kundu et.al. | 2410.22299 | translate | read | null |
| 2024-10-29 | Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding | Bohan Li et.al. | 2410.21951 | translate | read | null |
| 2024-10-29 | Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription | Can Cui et.al. | 2410.21849 | translate | read | null |
| 2024-10-28 | Asynchronous Tool Usage for Real-Time Agents | Antonio A. Ginart et.al. | 2410.21620 | translate | read | null |
| 2024-10-28 | Enhancing TTS Stability in Hebrew using Discrete Semantic Units | Ella Zeldes et.al. | 2410.21502 | translate | read | null |
| 2024-10-28 | Mitigating Unauthorized Speech Synthesis for Voice Protection | Zhisheng Zhang et.al. | 2410.20742 | translate | read | link |
| 2024-10-27 | Using Confidence Scores to Improve Eyes-free Detection of Speech Recognition Errors | Sadia Nowrin et.al. | 2410.20564 | translate | read | null |
| 2024-10-27 | Symbotunes: unified hub for symbolic music generative models | Paweł Skierś et.al. | 2410.20515 | translate | read | link |
| 2024-10-27 | MusicFlow: Cascaded Flow Matching for Text Guided Music Generation | K R Prajwal et.al. | 2410.20478 | translate | read | null |
| 2024-10-27 | Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation | Maohao Shen et.al. | 2410.20336 | translate | read | null |
| 2024-10-27 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs | Enshi Zhang et.al. | 2410.20334 | translate | read | null |
| 2024-10-26 | emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography | Viswanath Sivakumar et.al. | 2410.20081 | translate | read | link |
| 2024-10-24 | Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis | Suparna De et.al. | 2410.19199 | translate | read | null |
| 2024-10-25 | A Survey on Speech Large Language Models | Jing Peng et.al. | 2410.18908 | translate | read | null |
| 2024-10-24 | We Augmented Whisper With kNN and You Won’t Believe What Came Next | Maya K. Nachesa et.al. | 2410.18850 | translate | read | null |
| 2024-10-24 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin et.al. | 2410.18607 | translate | read | null |
| 2024-10-24 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts | ChaeHun Park et.al. | 2410.18444 | translate | read | null |
| 2024-10-24 | Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model | Vishakha Lall et.al. | 2410.18363 | translate | read | null |
| 2024-10-23 | Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment | Weiliang Luo et.al. | 2410.18151 | translate | read | link |
| 2024-10-23 | ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams | Srija Anand et.al. | 2410.17901 | translate | read | null |
| 2024-10-23 | OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation | Qinglin Zhang et.al. | 2410.17799 | translate | read | link |
| 2024-10-23 | Exploring Tokenization Methods for Multitrack Sheet Music Generation | Yashan Wang et.al. | 2410.17584 | translate | read | null |
| 2024-10-23 | VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning | Yifan Peng et.al. | 2410.17485 | translate | read | null |
| 2024-10-22 | mmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar | Suryoday Basak et.al. | 2410.17457 | translate | read | null |
| 2024-10-22 | Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models | Alexander Polok et.al. | 2410.17437 | translate | read | null |
| 2024-10-22 | VoiceBench: Benchmarking LLM-Based Voice Assistants | Yiming Chen et.al. | 2410.17196 | translate | read | link |
| 2024-10-22 | Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification | Wen Huang et.al. | 2410.17033 | translate | read | null |
| 2024-10-22 | Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap | Guanrou Yang et.al. | 2410.16726 | translate | read | null |
| 2024-10-22 | DENOASR: Debiasing ASRs through Selective Denoising | Anand Kumar Rai et.al. | 2410.16712 | translate | read | null |
| 2024-10-21 | AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition | Zehua Liu et.al. | 2410.16438 | translate | read | link |
| 2024-10-21 | Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification | Wan Lin et.al. | 2410.16428 | translate | read | null |
| 2024-10-21 | Continuous Speech Synthesis using per-token Latent Diffusion | Arnon Turetzky et.al. | 2410.16048 | translate | read | null |
| 2024-10-21 | LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec | Yiwei Guo et.al. | 2410.15764 | translate | read | null |
| 2024-10-21 | Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation | Victor Junqiu Wei et.al. | 2410.15620 | translate | read | null |
| 2024-10-21 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | Yeonjoon Jung et.al. | 2410.15609 | translate | read | null |
| 2024-10-21 | Moonshine: Speech Recognition for Live Transcription and Voice Commands | Nat Jeffries et.al. | 2410.15608 | translate | read | link |
| 2024-10-20 | Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example | Suhita Ghosh et.al. | 2410.15500 | translate | read | link |
| 2024-10-20 | Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses | Suhita Ghosh et.al. | 2410.15499 | translate | read | null |
| 2024-10-20 | Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant | Alan Dao et.al. | 2410.15316 | translate | read | link |
| 2024-10-19 | Enhancing Multimodal Sentiment Analysis for Missing Modality through Self-Distillation and Unified Modality Cross-Attention | Yuzhe Weng et.al. | 2410.15029 | translate | read | link |
| 2024-10-18 | AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup | Carlos Carvalho et.al. | 2410.14910 | translate | read | null |
| 2024-10-18 | A Unified Framework for Collecting Text-to-Speech Synthesis Datasets for 22 Indian Languages | Sujitha Sathiyamoorthy et.al. | 2410.14197 | translate | read | null |
| 2024-10-17 | Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding | Tan Dat Nguyen et.al. | 2410.13839 | translate | read | null |
| 2024-10-17 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR | Abhishek Gupta et.al. | 2410.13445 | translate | read | null |
| 2024-10-17 | MeloTrans: A Text to Symbolic Music Generation Model Following Human Composition Habit | Yutian Wang et.al. | 2410.13419 | translate | read | null |
| 2024-10-17 | DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech | Jan Melechovsky et.al. | 2410.13342 | translate | read | null |
| 2024-10-17 | Computational Approaches to Arabic-English Code-Switching | Caroline Sabty et.al. | 2410.13318 | translate | read | null |
| 2024-10-17 | DurIAN-E 2: Duration Informed Attention Network with Adaptive Variational Autoencoder and Adversarial Learning for Expressive Text-to-Speech Synthesis | Yu Gu et.al. | 2410.13288 | translate | read | null |
| 2024-10-17 | Roadmap towards Superhuman Speech Understanding using Large Language Models | Fan Bu et.al. | 2410.13268 | translate | read | null |
| 2024-10-17 | Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation | Sreyan Ghosh et.al. | 2410.13198 | translate | read | null |
| 2024-10-17 | EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth et.al. | 2410.13179 | translate | read | link |
| 2024-10-17 | Deep Learning-based Software Engineering: Progress, Challenges, and Opportunities | Xiangping Chen et.al. | 2410.13110 | translate | read | null |
| 2024-10-16 | Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR | Christoph Minixhofer et.al. | 2410.12279 | translate | read | null |
| 2024-10-16 | Guided Speaker Embedding | Shota Horiguchi et.al. | 2410.12182 | translate | read | null |
| 2024-10-15 | A Framework for Adapting Human-Robot Interaction to Diverse User Groups | Theresa Pekarek Rosin et.al. | 2410.11377 | translate | read | null |
| 2024-10-15 | Investigation of Speaker Representation for Target-Speaker Speech Processing | Takanori Ashihara et.al. | 2410.11243 | translate | read | null |
| 2024-10-14 | DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization | Yingahao Aaron Li et.al. | 2410.11097 | translate | read | null |
| 2024-10-14 | Character-aware audio-visual subtitling in context | Jaesung Huh et.al. | 2410.11068 | translate | read | null |
| 2024-10-14 | Do we need more complex representations for structure? A comparison of note duration representation for Music Transformers | Gabriel Souza et.al. | 2410.10515 | translate | read | null |
| 2024-10-14 | Everyday Speech in the Indian Subcontinent | Utkarsh Pathak et.al. | 2410.10508 | translate | read | null |
| 2024-10-14 | In-Materia Speech Recognition | Mohamadreza Zolfagharinejad et.al. | 2410.10434 | translate | read | null |
| 2024-10-13 | State of NLP in Kenya: A Survey | Cynthia Jayne Amol et.al. | 2410.09948 | translate | read | null |
| 2024-10-13 | M2M-Gen: A Multimodal Framework for Automated Background Music Generation in Japanese Manga Using Large Language Models | Megha Sharma et.al. | 2410.09928 | translate | read | link |
| 2024-10-12 | SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs | Wenxi Chen et.al. | 2410.09503 | translate | read | link |
| 2024-10-12 | Automatic Speech Recognition with BERT and CTC Transformers: A Review | Noussaiba Djeffal et.al. | 2410.09456 | translate | read | null |
| 2024-10-11 | UniGlyph: A Seven-Segment Script for Universal Language Representation | G. V. Bency Sherin et.al. | 2410.08974 | translate | read | null |
| 2024-10-14 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities | Aulia Adila et.al. | 2410.08828 | translate | read | null |
| 2024-10-11 | Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation | Yishan Lv et.al. | 2410.08626 | translate | read | null |
| 2024-10-11 | Symbolic Music Generation with Fine-grained Interactive Textural Guidance | Tingyu Zhu et.al. | 2410.08435 | translate | read | null |
| 2024-10-10 | SoundScape: A Human-AI Co-Creation System Making Your Memories Heard | Chongjun Zhong et.al. | 2410.08136 | translate | read | null |
| 2024-10-10 | Full-Rank No More: Low-Rank Weight Training for Modern Speech Recognition Models | Adriana Fernandez-Lopez et.al. | 2410.07771 | translate | read | null |
| 2024-10-09 | The First VoicePrivacy Attacker Challenge Evaluation Plan | Natalia Tomashenko et.al. | 2410.07428 | translate | read | link |
| 2024-10-09 | Advocating Character Error Rate for Multilingual ASR Evaluation | Thennal D K et.al. | 2410.07400 | translate | read | null |
| 2024-10-09 | Efficient training strategies for natural sounding speech synthesis and speaker adaptation based on FastPitch | Teodora Răgman et.al. | 2410.06787 | translate | read | null |
| 2024-10-09 | Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar et.al. | 2410.06608 | translate | read | null |
| 2024-10-08 | Diversity-Rewarded CFG Distillation | Geoffrey Cideron et.al. | 2410.06084 | translate | read | null |
| 2024-10-08 | The USTC-NERCSLIP Systems for the CHiME-8 MMCSG Challenge | Ya Jiang et.al. | 2410.05986 | translate | read | null |
| 2024-10-08 | Improving Data Augmentation-based Cross-Speaker Style Transfer for TTS with Singing Voice, Style Filtering, and F0 Matching | Leonardo B. de M. M. Marques et.al. | 2410.05620 | translate | read | link |
| 2024-10-07 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments | Sagarika Alavilli et.al. | 2410.05423 | translate | read | null |
| 2024-10-07 | Presto! Distilling Steps and Layers for Accelerating Music Generation | Zachary Novack et.al. | 2410.05167 | translate | read | null |
| 2024-10-07 | Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer | Siyuan Hou et.al. | 2410.05151 | translate | read | null |
| 2024-10-07 | Enhancing Job Interview Preparation Through Immersive Experiences Using Photorealistic, AI-powered Metahuman Avatars | Navid Ashrafi et.al. | 2410.05131 | translate | read | null |
| 2024-10-07 | CR-CTC: Consistency regularization on CTC for improved speech recognition | Zengwei Yao et.al. | 2410.05101 | translate | read | null |
| 2024-10-07 | Improving Speaker Representations Using Contrastive Losses on Multi-scale Features | Satvik Dixit et.al. | 2410.05037 | translate | read | null |
| 2024-10-06 | Punctuation Prediction for Polish Texts using Transformers | Jakub Pokrywka et.al. | 2410.04621 | translate | read | null |
| 2024-10-06 | Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha et.al. | 2410.04527 | translate | read | null |
| 2024-10-06 | HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | Yuto Nishimura et.al. | 2410.04380 | translate | read | null |
| 2024-10-06 | SONAR: A Synthetic AI-Audio Detection Framework and Benchmark | Xiang Li et.al. | 2410.04324 | translate | read | link |
| 2024-10-05 | Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer | Tomoki Honda et.al. | 2410.04159 | translate | read | link |
| 2024-10-04 | Generative Semantic Communication for Text-to-Speech Synthesis | Jiahao Zheng et.al. | 2410.03459 | translate | read | null |
| 2024-10-04 | Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges | Nguyen Van Dinh et.al. | 2410.03458 | translate | read | null |
| 2024-10-04 | Team MTS @ AutoMin 2021: An Overview of Existing Summarization Approaches and Comparison to Unsupervised Summarization Techniques | Olga Iakovenko et.al. | 2410.03412 | translate | read | null |
| 2024-10-04 | MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak et.al. | 2410.03192 | translate | read | null |
| 2024-10-03 | Disentangling Textual and Acoustic Features of Neural Speech Representations | Hosein Mohebbi et.al. | 2410.03037 | translate | read | null |
| 2024-10-03 | Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR | Hainan Xu et.al. | 2410.02597 | translate | read | null |
| 2024-10-04 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition | Olga Iakovenko et.al. | 2410.02560 | translate | read | null |
| 2024-10-03 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems | Olga Iakovenko et.al. | 2410.02538 | translate | read | null |
| 2024-10-03 | State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data | Sara Barahona et.al. | 2410.02364 | translate | read | null |
| 2024-10-03 | A Pilot Study of Applying Sequence-to-Sequence Voice Conversion to Evaluate the Intelligibility of L2 Speech Using a Native Speaker’s Shadowings | Haopeng Geng et.al. | 2410.02239 | translate | read | null |
| 2024-10-02 | Generating Symbolic Music from Natural Language Prompts using an LLM-Enhanced Dataset | Weihan Xu et.al. | 2410.02084 | translate | read | null |
| 2024-10-02 | Spoken Grammar Assessment Using LLM | Sunil Kumar Kopparapu et.al. | 2410.01579 | translate | read | null |
| 2024-10-02 | Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling | Yuguang Yang et.al. | 2410.01350 | translate | read | null |
| 2024-10-01 | MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages | Marco Gaido et.al. | 2410.01036 | translate | read | link |
| 2024-10-01 | Automatic Speech Recognition for the Ika Language | Uchenna Nzenwata et.al. | 2410.00940 | translate | read | null |
| 2024-10-01 | Do Music Generation Models Encode Music Theory? | Megan Wei et.al. | 2410.00872 | translate | read | null |
| 2024-10-01 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu et.al. | 2410.00822 | translate | read | link |
| 2024-10-01 | Improving curriculum learning for target speaker extraction with synthetic speakers | Yun Liu et.al. | 2410.00811 | translate | read | null |
| 2024-10-01 | End-to-End Speech Recognition with Pre-trained Masked Language Model | Yosuke Higuchi et.al. | 2410.00528 | translate | read | null |
| 2024-10-02 | Integrating Text-to-Music Models with Language Models: Composing Long Structured Music Pieces | Lilac Atassi et.al. | 2410.00344 | translate | read | null |
| 2024-10-01 | EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control | Haozhe Chen et.al. | 2410.00316 | translate | read | null |