Audio Processing - 2025-11

| Publish Date | Title | Authors | PDF | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-11-27 | Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset | Nick Rossenbach et al. | 2512.17915 | translate | read | null |
| 2025-11-04 | V-Agent: An Interactive Video Search System Using Vision-Language Models | SunYoung Park et al. | 2512.16925 | translate | read | null |
| 2025-11-30 | Benchmarking Automatic Speech Recognition Models for African Languages | Alvin Nahabwe et al. | 2512.10968 | translate | read | null |
| 2025-11-30 | ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages | Subham Kumar et al. | 2512.10967 | translate | read | null |
| 2025-11-23 | SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model | Kaidi Wang et al. | 2512.05126 | translate | read | null |
| 2025-11-18 | On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts | Kashaf Gulzar et al. | 2512.02027 | translate | read | null |
| 2025-11-29 | Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning | Arnesh Batra et al. | 2512.00621 | translate | read | null |
| 2025-11-28 | OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion | Sai Koneru et al. | 2512.00234 | translate | read | null |
| 2025-11-27 | Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment | Jiaying Hong et al. | 2512.00120 | translate | read | null |
| 2025-11-28 | Scaling HuBERT for African Languages: From Base to Large and XL | Antoine Caubrière et al. | 2511.23370 | translate | read | null |
| 2025-11-28 | HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding | Chen Li et al. | 2511.23178 | translate | read | null |
| 2025-11-28 | Group-Aware Partial Model Merging for Children’s Automatic Speech Recognition | Thomas Rolland et al. | 2511.23098 | translate | read | null |
| 2025-11-27 | Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration | Kanchon Gharami et al. | 2511.22769 | translate | read | null |
| 2025-11-27 | Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition | Maheswar Bora et al. | 2511.22443 | translate | read | null |
| 2025-11-27 | GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis | Teysir Baoueb et al. | 2511.22293 | translate | read | null |
| 2025-11-16 | On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models | Jonatas Grosman et al. | 2511.21704 | translate | read | null |
| 2025-11-26 | ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features | Ye Bhone Lin et al. | 2511.21088 | translate | read | null |
| 2025-11-26 | CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation | Jionghao Han et al. | 2511.21045 | translate | read | null |
| 2025-11-26 | Towards Audio Token Compression in Large Audio Language Models | Saurabhchand Bhati et al. | 2511.20973 | translate | read | null |
| 2025-11-26 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications | Jionghao Han et al. | 2511.20972 | translate | read | null |
| 2025-11-25 | Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition | Wesley Bian et al. | 2511.20534 | translate | read | null |
| 2025-11-25 | Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification | Akshit Pramod Anchan et al. | 2511.20474 | translate | read | null |
| 2025-11-25 | Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach | Huu Tuong Tu et al. | 2511.20107 | translate | read | null |
| 2025-11-25 | Continual Audio Deepfake Detection via Universal Adversarial Perturbation | Wangjie Li et al. | 2511.19974 | translate | read | null |
| 2025-11-24 | Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation | Maral Ebrahimzadeh et al. | 2511.19342 | translate | read | null |
| 2025-11-24 | Neural Architecture Search for Quantum Autoencoders | Hibah Agha et al. | 2511.19246 | translate | read | null |
| 2025-11-24 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties | Bashar Talafha et al. | 2511.18774 | translate | read | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et al. | 2511.18718 | translate | read | null |
| 2025-11-23 | InstructAudio: Unified speech and music generation with natural language instruction | Chunyu Qiang et al. | 2511.18487 | translate | read | null |
| 2025-11-23 | A Multimodal Conversational Agent for Tabular Data Analysis | Mohammad Nour Al Awad et al. | 2511.18405 | translate | read | null |
| 2025-11-21 | Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation | Scott Merrill et al. | 2511.17813 | translate | read | null |
| 2025-11-12 | Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward | Guansu Wang et al. | 2511.17555 | translate | read | null |
| 2025-11-21 | MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core | Callie C. Liao et al. | 2511.17323 | translate | read | null |
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et al. | 2511.16639 | translate | read | null |
| 2025-11-20 | WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue | Zachary Ellis et al. | 2511.16544 | translate | read | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et al. | 2511.16114 | translate | read | null |
| 2025-11-20 | Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio | Mohan Shi et al. | 2511.16046 | translate | read | null |
| 2025-11-19 | LargeSHS: A large-scale dataset of music adaptation | Chih-Pin Tan et al. | 2511.15270 | translate | read | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et al. | 2511.15038 | translate | read | null |
| 2025-11-06 | The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech | Julio Cesar Galdino et al. | 2511.14779 | translate | read | null |
| 2025-11-18 | A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder | Dengyun Huang et al. | 2511.14600 | translate | read | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et al. | 2511.14410 | translate | read | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et al. | 2511.14255 | translate | read | null |
| 2025-11-18 | Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation | Kumud Tripathi et al. | 2511.14219 | translate | read | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et al. | 2511.13918 | translate | read | null |
| 2025-11-05 | Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion | Xiao Li et al. | 2511.13731 | translate | read | null |
| 2025-11-17 | Alpha Divergence Losses for Biometric Verification | Dimitrios Koutsianos et al. | 2511.13621 | translate | read | null |
| 2025-11-17 | Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets | Máté Gedeon et al. | 2511.13529 | translate | read | null |
| 2025-11-17 | Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs | Zhe Sun et al. | 2511.13273 | translate | read | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et al. | 2511.13159 | translate | read | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et al. | 2511.12662 | translate | read | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et al. | 2511.12347 | translate | read | null |
| 2025-11-15 | How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer | Minu Kim et al. | 2511.12285 | translate | read | null |
| 2025-11-15 | Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets | Huy M. Le et al. | 2511.12255 | translate | read | null |
| 2025-11-12 | Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification | Xingqi Lin et al. | 2511.11699 | translate | read | null |
| 2025-11-14 | Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition | Yiming Rong et al. | 2511.11139 | translate | read | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et al. | 2511.10793 | translate | read | null |
| 2025-11-13 | TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English | Fethi Bougares et al. | 2511.10780 | translate | read | null |
| 2025-11-09 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment | Yan Gao et al. | 2511.10670 | translate | read | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et al. | 2511.10232 | translate | read | null |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et al. | 2511.10112 | translate | read | null |
| 2025-11-13 | Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS | Haoyu Li et al. | 2511.09995 | translate | read | null |
| 2025-11-12 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages | Omnilingual ASR team et al. | 2511.09690 | translate | read | null |
| 2025-11-12 | Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation | Xinyi Tong et al. | 2511.09585 | translate | read | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et al. | 2511.09282 | translate | read | null |
| 2025-11-12 | Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation | Shulei Ji et al. | 2511.09090 | translate | read | null |
| 2025-11-12 | Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition | Chao Wang et al. | 2511.09085 | translate | read | null |
| 2025-11-12 | Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask | Tianzi Wang et al. | 2511.09084 | translate | read | null |
| 2025-11-11 | HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios | Bingsong Bai et al. | 2511.08496 | translate | read | null |
| 2025-11-11 | Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models | Yi Yang et al. | 2511.08252 | translate | read | null |
| 2025-11-11 | Quantizing Whisper-small: How design choices affect ASR performance | Arthur Söhler et al. | 2511.08093 | translate | read | null |
| 2025-11-11 | SpeechJudge: Towards Human-Level Judgment for Speech Naturalness | Xueyao Zhang et al. | 2511.07931 | translate | read | null |
| 2025-11-10 | Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction | Hyeryun Park et al. | 2511.07392 | translate | read | null |
| 2025-11-10 | Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics | Jonathan Lehmkuhl et al. | 2511.07268 | translate | read | null |
| 2025-11-10 | Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | Umberto Cappellazzo et al. | 2511.07253 | translate | read | null |
| 2025-11-10 | Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition | Marc Jayson Baucas et al. | 2511.07189 | translate | read | null |
| 2025-11-10 | Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation | Matteo Pettenó et al. | 2511.07156 | translate | read | null |
| 2025-11-10 | Generating Novel and Realistic Speakers for Voice Conversion | Meiying Melissa Chen et al. | 2511.07135 | translate | read | null |
| 2025-11-10 | On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation | Matteo Pettenó et al. | 2511.07118 | translate | read | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et al. | 2511.07099 | translate | read | null |
| 2025-11-10 | Metric Analysis for Spatial Semantic Segmentation of Sound Scenes | Mayank Mishra et al. | 2511.07075 | translate | read | null |
| 2025-11-10 | CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition | Hung-Yang Sung et al. | 2511.06860 | translate | read | null |
| 2025-11-07 | Persian Musical Instruments Classification Using Polyphonic Data Augmentation | Diba Hadi Esfangereh et al. | 2511.05717 | translate | read | null |
| 2025-11-02 | Factual and Musical Evaluation Metrics for Music Language Models | Daniel Chenyu Lin et al. | 2511.05550 | translate | read | null |
| 2025-11-06 | PromptSep: Generative Audio Separation via Multimodal Prompting | Yutong Wen et al. | 2511.04623 | translate | read | null |
| 2025-11-06 | MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers | Ali Boudaghi et al. | 2511.04376 | translate | read | null |
| 2025-11-06 | Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition | Giovanni Barbarino et al. | 2511.04291 | translate | read | null |
| 2025-11-06 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese | Dazhong Chen et al. | 2511.04139 | translate | read | null |
| 2025-11-06 | Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms | Miguel E. Andres et al. | 2511.04133 | translate | read | null |
| 2025-11-06 | WST: Weakly Supervised Transducer for Automatic Speech Recognition | Dongji Gao et al. | 2511.04035 | translate | read | null |
| 2025-11-06 | Accelerating scientific discovery with the common task framework | J. Nathan Kutz et al. | 2511.04001 | translate | read | null |
| 2025-11-06 | MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation | Shih-Lun Wu et al. | 2511.03942 | translate | read | null |
| 2025-11-05 | SyMuPe: Affective and Controllable Symbolic Music Performance | Ilya Borovik et al. | 2511.03425 | translate | read | null |
| 2025-11-05 | Seeing What You Say: Expressive Image Generation from Speech | Jiyoung Lee et al. | 2511.03423 | translate | read | null |
| 2025-11-05 | Open Source State-Of-the-Art Solution for Romanian Speech Recognition | Gabriel Pirlogeanu et al. | 2511.03361 | translate | read | null |
| 2025-11-05 | TASU: Text-Only Alignment for Speech Understanding | Jing Peng et al. | 2511.03310 | translate | read | null |
| 2025-11-05 | How to Evaluate Speech Translation with Source-Aware Neural MT Metrics | Mauro Cettolo et al. | 2511.03295 | translate | read | null |
| 2025-11-04 | An unscented Kalman filter method for real time input-parameter-state estimation | Marios Impraimakis et al. | 2511.02717 | translate | read | null |
| 2025-11-04 | Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision | Kaimeng Jia et al. | 2511.02270 | translate | read | null |
| 2025-11-04 | Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA | Takuto Ando et al. | 2511.02269 | translate | read | null |
| 2025-11-03 | ADNAC: Audio Denoiser using Neural Audio Codec | Daniel Jimon et al. | 2511.01773 | translate | read | null |
| 2025-11-03 | SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia | Chaoqun Liu et al. | 2511.01670 | translate | read | null |
| 2025-11-03 | The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity | Louis Bradshaw et al. | 2511.01663 | translate | read | null |
| 2025-11-02 | WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion | Dong Liu et al. | 2511.01056 | translate | read | null |
| 2025-11-02 | MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models | Yayue Deng et al. | 2511.00850 | translate | read | null |
| 2025-11-02 | Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures | Barathi Subramanian et al. | 2511.00793 | translate | read | null |
| 2025-11-01 | More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks | Swapnil Bhosale et al. | 2511.00641 | translate | read | null |
| 2025-11-01 | On Improvisation and Open-Endedness: Insights for Experiential AI | Botao ‘Amber’ Hu et al. | 2511.00529 | translate | read | null |
| 2025-11-01 | Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study | Lucky Onyekwelu-Udoka et al. | 2511.00402 | translate | read | null |

(<a href="../Audio_Processing.md">back to Audio Processing</a>)