Audio Processing - 2025-01
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-01-31 | Language Bias in Self-Supervised Learning For Automatic Speech Recognition | Edward Storey et.al. | 2501.19321 | translate | read | null |
| 2025-01-30 | AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment | Yuqin Cao et.al. | 2501.18314 | translate | read | null |
| 2025-01-29 | Self-Supervised Frameworks for Speaker Verification via Bootstrapped Positive Sampling | Theo Lepage et.al. | 2501.17772 | translate | read | null |
| 2025-01-29 | Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition | Zhengdong Yang et.al. | 2501.17615 | translate | read | null |
| 2025-01-29 | VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching | Ha-Yeong Choi et.al. | 2501.17612 | translate | read | null |
| 2025-01-28 | Compact Neural TTS Voices for Accessibility | Kunal Jain et.al. | 2501.17332 | translate | read | null |
| 2025-01-28 | RDMM: Fine-Tuned LLM Models for On-Device Robotic Decision Making with Enhanced Contextual Awareness in Specific Domains | Shady Nasrat et.al. | 2501.16899 | translate | read | link |
| 2025-01-28 | AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals | Dongliang Zhou et.al. | 2501.16780 | translate | read | null |
| 2025-01-28 | SCDiar: a streaming diarization system based on speaker change detection and speech recognition | Naijun Zheng et.al. | 2501.16641 | translate | read | null |
| 2025-01-27 | UniPET-SPK: A Unified Framework for Parameter-Efficient Tuning of Pre-trained Speech Models for Robust Speaker Verification | Mufan Sang et.al. | 2501.16542 | translate | read | null |
| 2025-01-27 | Optimized Self-supervised Training with BEST-RQ for Speech Recognition | Ilja Baumann et.al. | 2501.16131 | translate | read | null |
| 2025-01-27 | Classification Error Bound for Low Bayes Error Conditions in Machine Learning | Zijian Yang et.al. | 2501.15977 | translate | read | null |
| 2025-01-26 | Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning | Qian Yang et.al. | 2501.15613 | translate | read | null |
| 2025-01-26 | End-to-End Target Speaker Speech Recognition Using Context-Aware Attention Mechanisms for Challenging Enrollment Scenario | Mohsen Ghane et.al. | 2501.15466 | translate | read | null |
| 2025-01-26 | Overview of the Amphion Toolkit (v0.2) | Jiaqi Li et.al. | 2501.15442 | translate | read | link |
| 2025-01-25 | The Multicultural Medical Assistant: Can LLMs Improve Medical ASR Errors Across Borders? | Ayo Adedeji et.al. | 2501.15310 | translate | read | null |
| 2025-01-25 | Music Generation using Human-In-The-Loop Reinforcement Learning | Aju Ani Justus et.al. | 2501.15304 | translate | read | null |
| 2025-01-25 | Speech Translation Refinement using Large Language Models | Huaixia Dou et.al. | 2501.15090 | translate | read | link |
| 2025-01-25 | Robust Cross-Etiology and Speaker-Independent Dysarthric Speech Recognition | Satwinder Singh et.al. | 2501.14994 | translate | read | null |
| 2025-01-27 | Diffusion based Text-to-Music Generation with Global and Local Text based Conditioning | Jisi Zhang et.al. | 2501.14680 | translate | read | null |
| 2025-01-24 | FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration | Kai-Tuo Xu et.al. | 2501.14350 | translate | read | link |
| 2025-01-24 | Characteristic-Specific Partial Fine-Tuning for Efficient Emotion and Speaker Adaptation in Codec Language Text-to-Speech Models | Tianrui Wang et.al. | 2501.14273 | translate | read | null |
| 2025-01-24 | Generalizable Audio Deepfake Detection via Latent Space Refinement and Augmentation | Wen Huang et.al. | 2501.14240 | translate | read | null |
| 2025-01-24 | LoCoML: A Framework for Real-World ML Inference Pipelines | Kritin Maddireddy et.al. | 2501.14165 | translate | read | null |
| 2025-01-23 | Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction | Ali Farshian Abbasi et.al. | 2501.13996 | translate | read | null |
| 2025-01-23 | Predicting Compact Phrasal Rewrites with Large Language Models for ASR Post Editing | Hao Zhang et.al. | 2501.13831 | translate | read | null |
| 2025-01-23 | Learning-based A Posteriori Speech Presence Probability Estimation and Applications | Shuai Tao et.al. | 2501.13642 | translate | read | null |
| 2025-01-23 | DQ-Data2vec: Decoupling Quantization for Multilingual Speech Recognition | Qijie Shao et.al. | 2501.13497 | translate | read | null |
| 2025-01-23 | Generative Data Augmentation Challenge: Zero-Shot Speech Synthesis for Personalized Speech Enhancement | Jae-Sung Bae et.al. | 2501.13372 | translate | read | null |
| 2025-01-23 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia | Xuelong Geng et.al. | 2501.13306 | translate | read | link |
| 2025-01-22 | Let SSMs be ConvNets: State-space Modeling with Optimal Tensor Contractions | Yan Ru Pei et.al. | 2501.13230 | translate | read | link |
| 2025-01-22 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction | Moreno La Quatra et.al. | 2501.12979 | translate | read | link |
| 2025-01-21 | A Domain Adaptation Framework for Speech Recognition Systems with Only Synthetic data | Minh Tran et.al. | 2501.12501 | translate | read | null |
| 2025-01-21 | DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset | Yupei Li et.al. | 2501.12122 | translate | read | null |
| 2025-01-20 | Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio | Mateusz Barański et.al. | 2501.11378 | translate | read | null |
| 2025-01-20 | SEF-PNet: Speaker Encoder-Free Personalized Speech Enhancement with Local and Global Contexts Aggregation | Ziling Huang et.al. | 2501.11274 | translate | read | null |
| 2025-01-19 | Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets | Or Haim Anidjar et.al. | 2501.11065 | translate | read | null |
| 2025-01-18 | A Benchmark of French ASR Systems Based on Error Severity | Antoine Tholly et.al. | 2501.10879 | translate | read | null |
| 2025-01-18 | GEC-RAG: Improving Generative Error Correction via Retrieval-Augmented Generation for Automatic Speech Recognition Systems | Amin Robatian et.al. | 2501.10734 | translate | read | null |
| 2025-01-17 | Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR | Karl El Hajal et.al. | 2501.10256 | translate | read | null |
| 2025-01-17 | Automatic Speech Recognition for Sanskrit with Transfer Learning | Bidit Sadhukhan et.al. | 2501.10024 | translate | read | null |
| 2025-01-17 | GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions | Heda Zuo et.al. | 2501.09972 | translate | read | null |
| 2025-01-21 | PIER: A Novel Metric for Evaluating What Matters in Code-Switching | Enes Yavuz Ugan et.al. | 2501.09512 | translate | read | link |
| 2025-01-16 | Teaching Wav2Vec2 the Language of the Brain | Tobias Fiedler et.al. | 2501.09459 | translate | read | link |
| 2025-01-16 | Delayed Fusion: Integrating Large Language Models into First-Pass Decoding in End-to-end Speech Recognition | Takaaki Hori et.al. | 2501.09258 | translate | read | null |
| 2025-01-17 | persoDA: Personalized Data Augmentation for Personalized ASR | Pablo Peso Parada et.al. | 2501.09113 | translate | read | null |
| 2025-01-15 | A Non-autoregressive Model for Joint STT and TTS | Vishal Sunder et.al. | 2501.09104 | translate | read | null |
| 2025-01-13 | Discrimination loss vs. SRT: A model-based approach towards harmonizing speech test interpretations | Mareike Buhl et.al. | 2501.08921 | translate | read | null |
| 2025-01-15 | XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework | Sida Tian et.al. | 2501.08809 | translate | read | null |
| 2025-01-15 | Speech Synthesis along Perceptual Voice Quality Dimensions | Frederik Rautenberg et.al. | 2501.08791 | translate | read | null |
| 2025-01-15 | Adaptive Data Augmentation with NaturalSpeech3 for Far-field Speaker Verification | Li Zhang et.al. | 2501.08691 | translate | read | null |
| 2025-01-15 | Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom | Melissa Torgbi et.al. | 2501.08502 | translate | read | null |
| 2025-01-14 | Selective Attention Merging for low resource tasks: A case study of Child ASR | Natarajan Balaji Shankar et.al. | 2501.08468 | translate | read | link |
| 2025-01-14 | Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications | Dimme de Groot et.al. | 2501.08104 | translate | read | null |
| 2025-01-13 | Exploring the encoding of linguistic representations in the Fully-Connected Layer of generative CNNs for Speech | Bruno Ferenc Šegedin et.al. | 2501.07726 | translate | read | null |
| 2025-01-13 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding | Jiliang Hu et.al. | 2501.07329 | translate | read | null |
| 2025-01-13 | Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model | Ziyang Ma et.al. | 2501.07246 | translate | read | null |
| 2025-01-13 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR | The Chuong Chu et.al. | 2501.07102 | translate | read | null |
| 2025-01-11 | Discrete Speech Unit Extraction via Independent Component Analysis | Tomohiko Nakamura et.al. | 2501.06562 | translate | read | link |
| 2025-01-11 | A Survey on Spoken Italian Datasets and Corpora | Marco Giordano et.al. | 2501.06557 | translate | read | null |
| 2025-01-11 | Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives | Christiaan Jacobs et.al. | 2501.06478 | translate | read | null |
| 2025-01-11 | Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis | Rui Liu et.al. | 2501.06467 | translate | read | null |
| 2025-01-10 | TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer | Vladimir Bataev et.al. | 2501.06320 | translate | read | null |
| 2025-01-10 | Contextual ASR Error Handling with LLMs Augmentation for Goal-Oriented Conversational AI | Yuya Asano et.al. | 2501.06129 | translate | read | null |
| 2025-01-10 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding | Fabian David Schmidt et.al. | 2501.06117 | translate | read | link |
| 2025-01-10 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition | Shucong Zhang et.al. | 2501.06051 | translate | read | null |
| 2025-01-10 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing | Eklavya Sarkar et.al. | 2501.05987 | translate | read | link |
| 2025-01-10 | Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron | Kishor Kayyar Lakshminarayana et.al. | 2501.05976 | translate | read | null |
| 2025-01-10 | Universal-2-TF: Robust All-Neural Text Formatting for ASR | Yash Khare et.al. | 2501.05948 | translate | read | null |
| 2025-01-10 | ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification | Yi Ma et.al. | 2501.05729 | translate | read | link |
| 2025-01-09 | FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion | Alef Iury Siqueira Ferreira et.al. | 2501.05586 | translate | read | link |
| 2025-01-09 | Probing Speaker-specific Features in Speaker Representations | Aemon Yat Fei Chiu et.al. | 2501.05310 | translate | read | null |
| 2025-01-09 | DiffAttack: Diffusion-based Timbre-reserved Adversarial Attack in Speaker Identification | Qing Wang et.al. | 2501.05127 | translate | read | null |
| 2025-01-09 | JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis | Jun-Hyeok Cha et.al. | 2501.04904 | translate | read | null |
| 2025-01-08 | FleSpeech: Flexibly Controllable Speech Generation with Various Prompts | Hanzhao Li et.al. | 2501.04644 | translate | read | null |
| 2025-01-09 | OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis | Run Luo et.al. | 2501.04561 | translate | read | null |
| 2025-01-09 | Right Label Context in End-to-End Training of Time-Synchronous ASR Models | Tina Raissi et.al. | 2501.04521 | translate | read | null |
| 2025-01-08 | PolInterviews – A Dataset of German Politician Public Broadcast Interviews | Lukas Birkenmaier et.al. | 2501.04484 | translate | read | null |
| 2025-01-08 | ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training | Xinfa Zhu et.al. | 2501.04416 | translate | read | null |
| 2025-01-08 | Phone-purity Guided Discrete Tokens for Dysarthric Speech Recognition | Huimeng Wang et.al. | 2501.04379 | translate | read | null |
| 2025-01-08 | DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions | Weidong Chen et.al. | 2501.04256 | translate | read | null |
| 2025-01-08 | LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition | Bowen Hao et.al. | 2501.04204 | translate | read | null |
| 2025-01-07 | Spectral-Aware Low-Rank Adaptation for Speaker Verification | Zhe Li et.al. | 2501.03829 | translate | read | link |
| 2025-01-07 | NeuroIncept Decoder for High-Fidelity Speech Reconstruction from Neural Activity | Owais Mujtaba Khanday et.al. | 2501.03757 | translate | read | null |
| 2025-01-07 | Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection | Bang Zeng et.al. | 2501.03612 | translate | read | null |
| 2025-01-07 | Towards a Generalizable Speech Marker for Parkinson’s Disease Diagnosis | Maksim Siniukov et.al. | 2501.03581 | translate | read | null |
| 2025-01-07 | Deep Learning for Pathological Speech: A Survey | Shakeel A. Sheikh et.al. | 2501.03536 | translate | read | null |
| 2025-01-02 | FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles | Tian-Hao Zhang et.al. | 2501.03181 | translate | read | null |
| 2025-01-06 | SYKI-SVC: Advancing Singing Voice Conversion with Post-Processing Innovations and an Open-Source Professional Testset | Yiquan Zhou et.al. | 2501.02953 | translate | read | null |
| 2025-01-07 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models | Syed Abdul Gaffar Shakhadri et.al. | 2501.02832 | translate | read | null |
| 2025-01-05 | Reducing the Gap Between Pretrained Speech Enhancement and Recognition Models Using a Real Speech-Trained Bridging Module | Zhongjian Cui et.al. | 2501.02452 | translate | read | null |
| 2025-01-03 | Improving Transducer-Based Spoken Language Understanding with Self-Conditioned CTC and Knowledge Transfer | Vishal Sunder et.al. | 2501.01936 | translate | read | null |
| 2025-01-03 | CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation | Ziqi Liang et.al. | 2501.01861 | translate | read | null |
| 2025-01-03 | MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling | Simon Rouard et.al. | 2501.01757 | translate | read | null |
| 2025-01-03 | Controlling your Attributes in Voice | Xuyuan Li et.al. | 2501.01674 | translate | read | null |
| 2025-01-03 | AdaptVC: High Quality Voice Conversion with Adaptive Learning | Jaehun Kim et.al. | 2501.01347 | translate | read | null |
| 2025-01-02 | Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models | Bin Wang et.al. | 2501.01034 | translate | read | link |
| 2025-01-01 | Incremental Dialogue Management: Survey, Discussion, and Implications for HRI | Casey Kennington et.al. | 2501.00953 | translate | read | null |
| 2025-01-01 | Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation | Shoutao Guo et.al. | 2501.00868 | translate | read | link |
| 2025-01-01 | Automatic Text Pronunciation Correlation Generation and Application for Contextual Biasing | Gaofeng Cheng et.al. | 2501.00804 | translate | read | null |