Audio Processing - 2024-11
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-11-30 | From Audio Deepfake Detection to AI-Generated Music Detection – A Pathway and Overview | Yupei Li et al. | 2412.00571 | translate | read | null |
| 2024-11-30 | Sample adaptive data augmentation with progressive scheduling | Hongxuan Lu et al. | 2412.00415 | translate | read | null |
| 2024-11-30 | Empowering the Deaf and Hard of Hearing Community: Enhancing Video Captions Using Large Language Models | Nadeen Fathallah et al. | 2412.00342 | translate | read | null |
| 2024-11-30 | MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI | Jongmin Jung et al. | 2412.00325 | translate | read | null |
| 2024-11-30 | Improving speaker verification robustness with synthetic emotional utterances | Nikhil Kumar Koditala et al. | 2412.00319 | translate | read | null |
| 2024-11-29 | Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities | Haorui He et al. | 2411.19770 | translate | read | null |
| 2024-11-29 | Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency | Akshaya Rajesh et al. | 2411.19611 | translate | read | null |
| 2024-11-28 | ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words | Hazem Darwish et al. | 2411.18888 | translate | read | null |
| 2024-11-27 | EEG-Based Analysis of Brain Responses in Multi-Modal Human-Robot Interaction: Modulating Engagement | Suzanne Oliver et al. | 2411.18587 | translate | read | null |
| 2024-11-27 | AMPS: ASR with Multimodal Paraphrase Supervision | Amruta Parulekar et al. | 2411.18368 | translate | read | null |
| 2024-11-27 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory | Geoffrey Tyndall et al. | 2411.18320 | translate | read | null |
| 2024-11-27 | Aligning Pre-trained Models for Spoken Language Translation | Šimon Sedláček et al. | 2411.18294 | translate | read | null |
| 2024-11-27 | Efficient Nonlinear Function Approximation in Analog Resistive Crossbars for Recurrent Neural Networks | Junyi Yang et al. | 2411.18271 | translate | read | null |
| 2024-11-27 | How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario | Shih-Heng Wang et al. | 2411.18217 | translate | read | null |
| 2024-11-27 | Machine Unlearning reveals that the Gender-based Violence Victim Condition can be detected from Speech in a Speaker-Agnostic Setting | Emma Reyner-Fuentes et al. | 2411.18177 | translate | read | null |
| 2024-11-27 | MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models | Thai-Binh Nguyen et al. | 2411.18152 | translate | read | null |
| 2024-11-27 | SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation | Wenyi Yu et al. | 2411.18138 | translate | read | null |
| 2024-11-27 | Fusion of Discrete Representations and Self-Augmented Representations for Multilingual Automatic Speech Recognition | Shih-heng Wang et al. | 2411.18107 | translate | read | null |
| 2024-11-26 | Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis | Akshita Gupta et al. | 2411.17690 | translate | read | null |
| 2024-11-26 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data | Aohan Zeng et al. | 2411.17607 | translate | read | null |
| 2024-11-26 | Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition | Hyeonseung Lee et al. | 2411.17537 | translate | read | null |
| 2024-11-26 | Comparative Analysis of ASR Methods for Speech Deepfake Detection | Davide Salvi et al. | 2411.17349 | translate | read | null |
| 2024-11-26 | k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning | Yifan Yang et al. | 2411.17100 | translate | read | null |
| 2024-11-25 | Synthesising Handwritten Music with GANs: A Comprehensive Evaluation of CycleWGAN, ProGAN, and DCGAN | Elona Shatri et al. | 2411.16405 | translate | read | null |
| 2024-11-25 | The SVASR System for Text-dependent Speaker Verification (TdSV) AAIC Challenge 2024 | Mohammadreza Molavi et al. | 2411.16276 | translate | read | null |
| 2024-11-25 | SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations | Youngjun Sim et al. | 2411.16147 | translate | read | null |
| 2024-11-24 | A Training-Free Approach for Music Style Transfer with Latent Diffusion Models | Sooyoung Kim et al. | 2411.15913 | translate | read | null |
| 2024-11-22 | Transforming NLU with Babylon: A Case Study in Development of Real-time, Edge-Efficient, Multi-Intent Translation System for Automated Drive-Thru Ordering | Mostafa Varzaneh et al. | 2411.15372 | translate | read | null |
| 2024-11-22 | Towards Speaker Identification with Minimal Dataset and Constrained Resources using 1D-Convolution Neural Network | Irfan Nafiz Shahan et al. | 2411.15082 | translate | read | link |
| 2024-11-22 | VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space | Armani Rodriguez et al. | 2411.14642 | translate | read | null |
| 2024-11-21 | Generative AI for Music and Audio | Hao-Wen Dong et al. | 2411.14627 | translate | read | null |
| 2024-11-20 | From Statistical Methods to Pre-Trained Models; A Survey on Automatic Speech Recognition for Resource Scarce Urdu Language | Muhammad Sharif et al. | 2411.14493 | translate | read | null |
| 2024-11-21 | Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge | Ruiyang Qin et al. | 2411.13766 | translate | read | null |
| 2024-11-18 | A Novel Speech Analysis and Correction Tool for Arabic-Speaking Children | Lamia Berriche et al. | 2411.13592 | translate | read | null |
| 2024-11-20 | CAFE A Novel Code switching Dataset for Algerian Dialect French and English | Houssam Eddine-Othman Lachemat et al. | 2411.13424 | translate | read | null |
| 2024-11-20 | I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception | Jiawei Zhang et al. | 2411.13314 | translate | read | null |
| 2024-11-20 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM | Jiawei Yu et al. | 2411.13159 | translate | read | null |
| 2024-11-21 | Improving Controllability and Editability for Pretrained Text-to-Music Generation Models | Yixiao Zhang et al. | 2411.12641 | translate | read | null |
| 2024-11-19 | Whisper Finetuning on Nepali Language | Sanjay Rijal et al. | 2411.12587 | translate | read | null |
| 2024-11-18 | An Investigation of Reprogramming for Cross-Language Adaptation in Speaker Verification Systems | Jingyu Li et al. | 2411.11353 | translate | read | null |
| 2024-11-18 | Study of the Performance of CEEMDAN in Underdetermined Speech Separation | Rawad Melhem et al. | 2411.11312 | translate | read | null |
| 2024-11-18 | SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features | Yu-Fei Shi et al. | 2411.11232 | translate | read | null |
| 2024-11-17 | Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation | Jisang Park et al. | 2411.10927 | translate | read | null |
| 2024-11-16 | BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization | Md. Nazmus Sadat Samin et al. | 2411.10879 | translate | read | link |
| 2024-11-16 | Bilingual Text-dependent Speaker Verification with Pre-trained Models for TdSV Challenge 2024 | Seyed Ali Farokh et al. | 2411.10828 | translate | read | null |
| 2024-11-15 | SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers | Joseph Liu et al. | 2411.10510 | translate | read | link |
| 2024-11-15 | Interactive Cycle Model – The Linkage Combination among Automatic Speech Recognition, Large Language Models and Smart Glasses | Libo Wang et al. | 2411.10362 | translate | read | null |
| 2024-11-15 | Systolic Arrays and Structured Pruning Co-design for Efficient Transformers in Edge Systems | Pedro Palacios et al. | 2411.10285 | translate | read | null |
| 2024-11-15 | DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization | Christos Koutlis et al. | 2411.10193 | translate | read | link |
| 2024-11-15 | XLSR-Mamba: A Dual-Column Bidirectional State Space Model for Spoofing Attack Detection | Yang Xiao et al. | 2411.10027 | translate | read | link |
| 2024-11-15 | Zero-shot Voice Conversion with Diffusion Transformers | Songting Liu et al. | 2411.09943 | translate | read | null |
| 2024-11-14 | Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data | Rik Raes et al. | 2411.09431 | translate | read | null |
| 2024-11-14 | Transferable Adversarial Attacks against ASR | Xiaoxue Gao et al. | 2411.09220 | translate | read | null |
| 2024-11-14 | Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation | Kuiyuan Zhang et al. | 2411.09167 | translate | read | null |
| 2024-11-13 | Language Models for Music Medicine Generation | Emmanouil Nikolakakis et al. | 2411.09080 | translate | read | null |
| 2024-11-14 | Evaluating Synthetic Command Attacks on Smart Voice Assistants | Zhengxian He et al. | 2411.08316 | translate | read | null |
| 2024-11-13 | PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation | Yungang Yi et al. | 2411.08307 | translate | read | null |
| 2024-11-11 | Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition | Yoshiki Masuyama et al. | 2411.06968 | translate | read | link |
| 2024-11-11 | DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions | Shu-Tong Niu et al. | 2411.06667 | translate | read | null |
| 2024-11-10 | Debatts: Zero-Shot Debating Text-to-Speech Synthesis | Yiqiao Huang et al. | 2411.06540 | translate | read | null |
| 2024-11-10 | CTC-Assisted LLM-Based Contextual ASR | Guanrou Yang et al. | 2411.06437 | translate | read | link |
| 2024-11-07 | Dialectal Coverage And Generalization in Arabic Speech Recognition | Amirbek Djanibekov et al. | 2411.05872 | translate | read | null |
| 2024-11-07 | Sentiment Analysis of Spanish Political Party Tweets Using Pre-trained Language Models | Chuqiao Song et al. | 2411.04862 | translate | read | null |
| 2024-11-07 | Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages | Leena G Pillai et al. | 2411.04573 | translate | read | null |
| 2024-11-06 | Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks | Felipe Marra et al. | 2411.03948 | translate | read | null |
| 2024-11-04 | Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs | Alexandros Haliassos et al. | 2411.02256 | translate | read | link |
| 2024-11-04 | Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data | Sofiane Azzouz et al. | 2411.02037 | translate | read | null |
| 2024-11-04 | CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching | Yu Pan et al. | 2411.02026 | translate | read | null |
| 2024-11-04 | MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence | Fuming You et al. | 2411.01805 | translate | read | null |
| 2024-11-03 | SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation | Dennis Fucci et al. | 2411.01710 | translate | read | null |
| 2024-11-02 | Leveraging LLM and Text-Queried Separation for Noise-Robust Sound Event Detection | Han Yin et al. | 2411.01174 | translate | read | link |
| 2024-11-02 | Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis | Shijia Liao et al. | 2411.01156 | translate | read | link |
| 2024-11-01 | Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO | Macarious Hui et al. | 2411.00980 | translate | read | null |
| 2024-11-04 | Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval | Nikolaos Flemotomos et al. | 2411.00664 | translate | read | null |