Audio Processing - 2024-08
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2024-08-30 | Advancing Multi-talker ASR Performance with Large Language Models | Mohan Shi et.al. | 2408.17431 | translate | read | null |
| 2024-08-30 | AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge | Kirill Borodin et.al. | 2408.17352 | translate | read | null |
| 2024-08-30 | Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model | Zhen Ye et.al. | 2408.17175 | translate | read | link |
| 2024-08-30 | Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings | Shota Horiguchi et.al. | 2408.17142 | translate | read | null |
| 2024-08-30 | Generative Modeling Perspective for Control and Reasoning in Robotics | Takuma Yoneda et.al. | 2408.17041 | translate | read | null |
| 2024-08-30 | Utilizing Speaker Profiles for Impersonation Audio Detection | Hao Gu et.al. | 2408.17009 | translate | read | null |
| 2024-08-30 | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming | Zhifei Xie et.al. | 2408.16725 | translate | read | link |
| 2024-08-29 | CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions | Laurin Wagner et.al. | 2408.16589 | translate | read | link |
| 2024-08-29 | Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing | Qianhui Liu et.al. | 2408.16564 | translate | read | null |
| 2024-08-29 | RAVE for Speech: Efficient Voice Conversion at High Sampling Rates | Anders R. Bargum et.al. | 2408.16546 | translate | read | null |
| 2024-08-29 | Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis | Zehai Tu et.al. | 2408.16373 | translate | read | null |
| 2024-08-29 | Measuring the Accuracy of Automatic Speech Recognition Solutions | Korbinian Kuhn et.al. | 2408.16287 | translate | read | link |
| 2024-08-29 | Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation | Lun Wang et.al. | 2408.16204 | translate | read | null |
| 2024-08-29 | Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction | Yuka Ko et.al. | 2408.16180 | translate | read | null |
| 2024-08-28 | Spoofing-Robust Speaker Verification Using Parallel Embedding Fusion: BTU Speech Group’s Approach for ASVspoof5 Challenge | Oğuzhan Kurnaz et.al. | 2408.15877 | translate | read | null |
| 2024-08-28 | VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | Yixuan Zhou et.al. | 2408.15676 | translate | read | link |
| 2024-08-28 | Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications | Korbinian Kuhn et.al. | 2408.15616 | translate | read | link |
| 2024-08-28 | Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models | Yiyang Zhao et.al. | 2408.15585 | translate | read | null |
| 2024-08-28 | EmoAttack: Utilizing Emotional Voice Conversion for Speech Backdoor Attacks on Deep Speech Classification Models | Wenhan Yao et.al. | 2408.15508 | translate | read | null |
| 2024-08-27 | Unlocking Potential in Pre-Trained Music Language Models for Versatile Multi-Track Music Arrangement | Longshen Ou et.al. | 2408.15176 | translate | read | null |
| 2024-08-27 | Speech Recognition Transformers: Topological-lingualism Perspective | Shruti Singh et.al. | 2408.14991 | translate | read | null |
| 2024-08-27 | Literary and Colloquial Dialect Identification for Tamil using Acoustic Features | M. Nanmalar et.al. | 2408.14887 | translate | read | null |
| 2024-08-27 | The VoxCeleb Speaker Recognition Challenge: A Retrospective | Jaesung Huh et.al. | 2408.14886 | translate | read | null |
| 2024-08-27 | MaskCycleGAN-based Whisper to Normal Speech Conversion | K. Rohith Gupta et.al. | 2408.14797 | translate | read | null |
| 2024-08-26 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues | Kuluhan Binici et.al. | 2408.14418 | translate | read | null |
| 2024-08-26 | Self-supervised Speech Representations Still Struggle with African American Vernacular English | Kalvin Chang et.al. | 2408.14262 | translate | read | link |
| 2024-08-26 | Automatic recognition and detection of aphasic natural speech | Mara Barberis et.al. | 2408.14082 | translate | read | null |
| 2024-08-26 | Research Advances and New Paradigms for Biology-inspired Spiking Neural Networks | Tianyu Zheng et.al. | 2408.13996 | translate | read | null |
| 2024-08-26 | Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard | Wonjune Kang et.al. | 2408.13970 | translate | read | null |
| 2024-08-25 | Literary and Colloquial Tamil Dialect Identification | M. Nanmalar et.al. | 2408.13739 | translate | read | null |
| 2024-08-24 | Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification | Aditya Dawn et.al. | 2408.13644 | translate | read | null |
| 2024-08-24 | As Biased as You Measure: Methodological Pitfalls of Bias Evaluations in Speaker Verification Research | Wiebke Hutiri et.al. | 2408.13614 | translate | read | null |
| 2024-08-24 | SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description | Zeyu Jin et.al. | 2408.13608 | translate | read | link |
| 2024-08-23 | Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples | Zhenyu Wang et.al. | 2408.13341 | translate | read | null |
| 2024-08-23 | Which Prosodic Features Matter Most for Pragmatics? | Nigel G. Ward et.al. | 2408.13240 | translate | read | null |
| 2024-08-23 | NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks | He Huang et.al. | 2408.13106 | translate | read | null |
| 2024-08-23 | Focused Discriminative Training For Streaming CTC-Trained Automatic Speech Recognition Models | Adnan Haider et.al. | 2408.13008 | translate | read | null |
| 2024-08-22 | Towards measuring fairness in speech recognition: Fair-Speech dataset | Irina-Elena Veliche et.al. | 2408.12734 | translate | read | null |
| 2024-08-22 | WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech | Hirotaka Hiraki et.al. | 2408.12500 | translate | read | null |
| 2024-08-22 | Positional Description for Numerical Normalization | Deepanshu Gupta et.al. | 2408.12430 | translate | read | null |
| 2024-08-22 | LCM-SVC: Latent Diffusion Model Based Singing Voice Conversion with Inference Acceleration via Latent Consistency Distillation | Shihao Chen et.al. | 2408.12354 | translate | read | null |
| 2024-08-22 | Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features | Shaoxiang Dang et.al. | 2408.12279 | translate | read | null |
| 2024-08-21 | The State of Commercial Automatic French Legal Speech Recognition Systems and their Impact on Court Reporters et al | Nicolas Garneau et.al. | 2408.11940 | translate | read | null |
| 2024-08-21 | Approaching Deep Learning through the Spectral Dynamics of Weights | David Yunis et.al. | 2408.11804 | translate | read | link |
| 2024-08-22 | A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification | Xujiang Xing et.al. | 2408.11562 | translate | read | null |
| 2024-08-21 | Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech | Anastasia Avdeeva et.al. | 2408.11528 | translate | read | null |
| 2024-08-21 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers | Prashant Serai et.al. | 2408.11258 | translate | read | null |
| 2024-08-20 | BUT Systems and Analyses for the ASVspoof 5 Challenge | Johan Rohdin et.al. | 2408.11152 | translate | read | null |
| 2024-08-20 | AI-Based IVR | Gassyrbek Kosherbay et.al. | 2408.10549 | translate | read | null |
| 2024-08-20 | XCB: an effective contextual biasing approach to bias cross-lingual phrases in speech recognition | Xucheng Wan et.al. | 2408.10524 | translate | read | null |
| 2024-08-19 | ASASVIcomtech: The Vicomtech-UGR Speech Deepfake Detection and SASV Systems for the ASVspoof5 Challenge | Juan M. Martín-Doñas et.al. | 2408.10361 | translate | read | null |
| 2024-08-19 | Hear Your Face: Face-based voice conversion with F0 estimation | Jaejun Lee et.al. | 2408.09802 | translate | read | null |
| 2024-08-19 | Unsupervised Composable Representations for Audio | Giovanni Bindi et.al. | 2408.09792 | translate | read | null |
| 2024-08-19 | Recording for Eyes, Not Echoing to Ears: Contextualized Spoken-to-Written Conversion of ASR Transcripts | Jiaqing Liu et.al. | 2408.09688 | translate | read | null |
| 2024-08-18 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition | Yangze Li et.al. | 2408.09491 | translate | read | null |
| 2024-08-17 | Malacopula: adversarial automatic speaker verification attacks using a neural-based generalised Hammerstein model | Massimiliano Todisco et.al. | 2408.09300 | translate | read | null |
| 2024-08-17 | Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition | Samuele Cornell et.al. | 2408.09215 | translate | read | null |
| 2024-08-14 | Supervised and Unsupervised Alignments for Spoofing Behavioral Biometrics | Thomas Thebaud et.al. | 2408.08918 | translate | read | null |
| 2024-08-16 | ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale | Xin Wang et.al. | 2408.08739 | translate | read | null |
| 2024-08-15 | Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words | Kento Nozawa et.al. | 2408.08027 | translate | read | null |
| 2024-08-14 | SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition | Mohamed Osman et.al. | 2408.07851 | translate | read | link |
| 2024-08-14 | WavLM model ensemble for audio deepfake detection | David Combei et.al. | 2408.07414 | translate | read | null |
| 2024-08-14 | DPSNN: Spiking Neural Network for Low-Latency Streaming Speech Enhancement | Tao Sun et.al. | 2408.07388 | translate | read | null |
| 2024-08-13 | Play Me Something Icy: Practical Challenges, Explainability and the Semantic Gap in Generative AI Music | Jesse Allison et.al. | 2408.07224 | translate | read | null |
| 2024-08-13 | VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders | Yubing Cao et.al. | 2408.06906 | translate | read | null |
| 2024-08-13 | SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis | Osamu Take et.al. | 2408.06858 | translate | read | link |
| 2024-08-13 | PRESENT: Zero-Shot Text-to-Prosody Control | Perry Lam et.al. | 2408.06827 | translate | read | link |
| 2024-08-13 | Deep Learning for Speaker Identification: Architectural Insights from AB-1 Corpus Analysis and Performance Evaluation | Matthias Bartolo et.al. | 2408.06804 | translate | read | link |
| 2024-08-12 | Cross-Lingual Conversational Speech Summarization with Large Language Models | Max Nelson et.al. | 2408.06484 | translate | read | null |
| 2024-08-12 | Audio Enhancement for Computer Audition – An Iterative Training Paradigm Using Sample Importance | Manuel Milling et.al. | 2408.06264 | translate | read | null |
| 2024-08-12 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning | Wonjun Lee et.al. | 2408.06043 | translate | read | null |
| 2024-08-12 | Controlling Surprisal in Music Generation via Information Content Curve Matching | Mathias Rose Bjare et.al. | 2408.06022 | translate | read | link |
| 2024-08-11 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition | Eunseop Yoon et.al. | 2408.05769 | translate | read | null |
| 2024-08-11 | VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing | Chunyu Qiang et.al. | 2408.05758 | translate | read | null |
| 2024-08-10 | Improving Whisper’s Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text | Jinpeng Li et.al. | 2408.05554 | translate | read | null |
| 2024-08-09 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads | Junhao Xu et.al. | 2408.05101 | translate | read | null |
| 2024-08-09 | TEAdapter: Supply abundant guidance for controllable text-to-music generation | Jialing Zou et.al. | 2408.04865 | translate | read | null |
| 2024-08-08 | MulliVC: Multi-lingual Voice Conversion With Cycle Consistency | Jiawei Huang et.al. | 2408.04708 | translate | read | null |
| 2024-08-08 | NeuralMultiling: A Novel Neural Architecture Search for Smartphone based Multilingual Speaker Verification | Aravinda Reddy PN et.al. | 2408.04362 | translate | read | null |
| 2024-08-08 | HydraFormer: One Encoder For All Subsampling Rates | Yaoxun Xu et.al. | 2408.04325 | translate | read | link |
| 2024-08-08 | Preserving spoken content in voice anonymisation with character-level vocoder conditioning | Michele Panariello et.al. | 2408.04306 | translate | read | null |
| 2024-08-08 | wav2graph: A Framework for Supervised Learning Knowledge Graph from Speech | Khai Le-Duc et.al. | 2408.04174 | translate | read | null |
| 2024-08-07 | Speaker Adaptation for Quantised End-to-End ASR Models | Qiuming Zhao et.al. | 2408.03979 | translate | read | null |
| 2024-08-06 | Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training | Hawraz A. Ahmad et.al. | 2408.03887 | translate | read | null |
| 2024-08-07 | Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation | Karn N. Watcharasupat et.al. | 2408.03588 | translate | read | null |
| 2024-08-06 | ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval | Ruixiang Zhao et.al. | 2408.02978 | translate | read | null |
| 2024-08-06 | Self-Supervised Learning for Multi-Channel Neural Transducer | Atsushi Kojima et.al. | 2408.02945 | translate | read | null |
| 2024-08-05 | Automatic Voice Identification after Speech Resynthesis using PPG | Thibault Gaudier et.al. | 2408.02712 | translate | read | null |
| 2024-08-05 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition | Jaeyoung Kim et.al. | 2408.02582 | translate | read | null |
| 2024-08-05 | The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024 | He Wang et.al. | 2408.02369 | translate | read | null |
| 2024-08-05 | StreamVoice+: Evolving into End-to-end Streaming Zero-shot Voice Conversion | Zhichao Wang et.al. | 2408.02178 | translate | read | null |
| 2024-08-04 | Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model | Shipei Liu et.al. | 2408.01950 | translate | read | null |
| 2024-08-03 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic Features | Peng Cheng et.al. | 2408.01808 | translate | read | null |
| 2024-08-03 | Generating High-quality Symbolic Music Using Fine-grained Discriminators | Zhedong Zhang et.al. | 2408.01696 | translate | read | null |
| 2024-08-02 | EmoBack: Backdoor Attacks Against Speaker Identification Using Emotional Prosody | Coen Schoof et.al. | 2408.01178 | translate | read | null |
| 2024-08-01 | Expressive MIDI-format Piano Performance Generation | Jingwei Liu et.al. | 2408.00900 | translate | read | null |
| 2024-08-01 | SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data | Yichen Lu et.al. | 2408.00624 | translate | read | null |
| 2024-08-01 | Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation | Xinhan Di et.al. | 2408.00284 | translate | read | null |
| 2024-08-01 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation | Kohei Matsuura et.al. | 2408.00205 | translate | read | null |
| 2024-08-01 | Generative Expressive Conversational Speech Synthesis | Rui Liu et.al. | 2407.21491 | translate | read | null |
(<a href=../Audio_Processing.md>back to Audio Processing</a>)