Audio Processing - 2025-11
| Publish Date | Title | Authors | arXiv ID | Translate | Read | Code |
|---|---|---|---|---|---|---|
| 2025-11-27 | Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset | Nick Rossenbach et.al. | 2512.17915 | translate | read | null |
| 2025-11-04 | V-Agent: An Interactive Video Search System Using Vision-Language Models | SunYoung Park et.al. | 2512.16925 | translate | read | null |
| 2025-11-30 | Benchmarking Automatic Speech Recognition Models for African Languages | Alvin Nahabwe et.al. | 2512.10968 | translate | read | null |
| 2025-11-30 | ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages | Subham Kumar et.al. | 2512.10967 | translate | read | null |
| 2025-11-23 | SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model | Kaidi Wang et.al. | 2512.05126 | translate | read | null |
| 2025-11-18 | On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts | Kashaf Gulzar et.al. | 2512.02027 | translate | read | null |
| 2025-11-29 | Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning | Arnesh Batra et.al. | 2512.00621 | translate | read | null |
| 2025-11-28 | OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion | Sai Koneru et.al. | 2512.00234 | translate | read | null |
| 2025-11-27 | Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment | Jiaying Hong et.al. | 2512.00120 | translate | read | null |
| 2025-11-28 | Scaling HuBERT for African Languages: From Base to Large and XL | Antoine Caubrière et.al. | 2511.23370 | translate | read | null |
| 2025-11-28 | HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding | Chen Li et.al. | 2511.23178 | translate | read | null |
| 2025-11-28 | Group-Aware Partial Model Merging for Children’s Automatic Speech Recognition | Thomas Rolland et.al. | 2511.23098 | translate | read | null |
| 2025-11-27 | Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration | Kanchon Gharami et.al. | 2511.22769 | translate | read | null |
| 2025-11-27 | Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition | Maheswar Bora et.al. | 2511.22443 | translate | read | null |
| 2025-11-27 | GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis | Teysir Baoueb et.al. | 2511.22293 | translate | read | null |
| 2025-11-16 | On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models | Jonatas Grosman et.al. | 2511.21704 | translate | read | null |
| 2025-11-26 | ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features | Ye Bhone Lin et.al. | 2511.21088 | translate | read | null |
| 2025-11-26 | CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation | Jionghao Han et.al. | 2511.21045 | translate | read | null |
| 2025-11-26 | Towards Audio Token Compression in Large Audio Language Models | Saurabhchand Bhati et.al. | 2511.20973 | translate | read | null |
| 2025-11-26 | SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications | Jionghao Han et.al. | 2511.20972 | translate | read | null |
| 2025-11-25 | Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition | Wesley Bian et.al. | 2511.20534 | translate | read | null |
| 2025-11-25 | Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification | Akshit Pramod Anchan et.al. | 2511.20474 | translate | read | null |
| 2025-11-25 | Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach | Huu Tuong Tu et.al. | 2511.20107 | translate | read | null |
| 2025-11-25 | Continual Audio Deepfake Detection via Universal Adversarial Perturbation | Wangjie Li et.al. | 2511.19974 | translate | read | null |
| 2025-11-24 | Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation | Maral Ebrahimzadeh et.al. | 2511.19342 | translate | read | null |
| 2025-11-24 | Neural Architecture Search for Quantum Autoencoders | Hibah Agha et.al. | 2511.19246 | translate | read | null |
| 2025-11-24 | Context-Aware Whisper for Arabic ASR Under Linguistic Varieties | Bashar Talafha et.al. | 2511.18774 | translate | read | null |
| 2025-11-24 | AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation | Omar Garib et.al. | 2511.18718 | translate | read | null |
| 2025-11-23 | InstructAudio: Unified speech and music generation with natural language instruction | Chunyu Qiang et.al. | 2511.18487 | translate | read | null |
| 2025-11-23 | A Multimodal Conversational Agent for Tabular Data Analysis | Mohammad Nour Al Awad et.al. | 2511.18405 | translate | read | null |
| 2025-11-21 | Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation | Scott Merrill et.al. | 2511.17813 | translate | read | null |
| 2025-11-12 | Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward | Guansu Wang et.al. | 2511.17555 | translate | read | null |
| 2025-11-21 | MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core | Callie C. Liao et.al. | 2511.17323 | translate | read | null |
| 2025-11-20 | Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs | Wei-Cheng Tseng et.al. | 2511.16639 | translate | read | null |
| 2025-11-20 | WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue | Zachary Ellis et.al. | 2511.16544 | translate | read | null |
| 2025-11-20 | SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise | Rui Sang et.al. | 2511.16114 | translate | read | null |
| 2025-11-20 | Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio | Mohan Shi et.al. | 2511.16046 | translate | read | null |
| 2025-11-19 | LargeSHS: A large-scale dataset of music adaptation | Chih-Pin Tan et.al. | 2511.15270 | translate | read | null |
| 2025-11-19 | Aligning Generative Music AI with Human Preferences: Methods and Challenges | Dorien Herremans et.al. | 2511.15038 | translate | read | null |
| 2025-11-06 | The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech | Julio Cesar Galdino et.al. | 2511.14779 | translate | read | null |
| 2025-11-18 | A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder | Dengyun Huang et.al. | 2511.14600 | translate | read | null |
| 2025-11-18 | TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation | Wei Liu et.al. | 2511.14410 | translate | read | null |
| 2025-11-18 | AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR | Gabrial Zencha Ashungafac et.al. | 2511.14255 | translate | read | null |
| 2025-11-18 | Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation | Kumud Tripathi et.al. | 2511.14219 | translate | read | null |
| 2025-11-17 | Human-centric Maintenance Process Through Integration of AI, Speech, and AR | Parul Khanna et.al. | 2511.13918 | translate | read | null |
| 2025-11-05 | Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion | Xiao Li et.al. | 2511.13731 | translate | read | null |
| 2025-11-17 | Alpha Divergence Losses for Biometric Verification | Dimitrios Koutsianos et.al. | 2511.13621 | translate | read | null |
| 2025-11-17 | Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets | Máté Gedeon et.al. | 2511.13529 | translate | read | null |
| 2025-11-17 | Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs | Zhe Sun et.al. | 2511.13273 | translate | read | null |
| 2025-11-17 | Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis | Zaara Zabeen Arpa et.al. | 2511.13159 | translate | read | null |
| 2025-11-16 | Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans | Hongbin Huang et.al. | 2511.12662 | translate | read | null |
| 2025-11-15 | VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing | Zhisheng Zheng et.al. | 2511.12347 | translate | read | null |
| 2025-11-15 | How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer | Minu Kim et.al. | 2511.12285 | translate | read | null |
| 2025-11-15 | Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets | Huy M. Le et.al. | 2511.12255 | translate | read | null |
| 2025-11-12 | Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification | Xingqi Lin et.al. | 2511.11699 | translate | read | null |
| 2025-11-14 | Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition | Yiming Rong et.al. | 2511.11139 | translate | read | null |
| 2025-11-13 | Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces | Farhan Sheth et.al. | 2511.10793 | translate | read | null |
| 2025-11-13 | TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English | Fethi Bougares et.al. | 2511.10780 | translate | read | null |
| 2025-11-09 | Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment | Yan Gao et.al. | 2511.10670 | translate | read | null |
| 2025-11-13 | VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction | Yuhao Wang et.al. | 2511.10232 | translate | read | null |
| 2025-11-13 | FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features | Wenyu Wang et.al. | 2511.10112 | translate | read | null |
| 2025-11-13 | Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS | Haoyu Li et.al. | 2511.09995 | translate | read | null |
| 2025-11-12 | Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages | Omnilingual ASR team et.al. | 2511.09690 | translate | read | null |
| 2025-11-12 | Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation | Xinyi Tong et.al. | 2511.09585 | translate | read | null |
| 2025-11-12 | End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering | Jiliang Hu et.al. | 2511.09282 | translate | read | null |
| 2025-11-12 | Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation | Shulei Ji et.al. | 2511.09090 | translate | read | null |
| 2025-11-12 | Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition | Chao Wang et.al. | 2511.09085 | translate | read | null |
| 2025-11-12 | Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask | Tianzi Wang et.al. | 2511.09084 | translate | read | null |
| 2025-11-11 | HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios | Bingsong Bai et.al. | 2511.08496 | translate | read | null |
| 2025-11-11 | Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models | Yi Yang et.al. | 2511.08252 | translate | read | null |
| 2025-11-11 | Quantizing Whisper-small: How design choices affect ASR performance | Arthur Söhler et.al. | 2511.08093 | translate | read | null |
| 2025-11-11 | SpeechJudge: Towards Human-Level Judgment for Speech Naturalness | Xueyao Zhang et.al. | 2511.07931 | translate | read | null |
| 2025-11-10 | Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction | Hyeryun Park et.al. | 2511.07392 | translate | read | null |
| 2025-11-10 | Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics | Jonathan Lehmkuhl et.al. | 2511.07268 | translate | read | null |
| 2025-11-10 | Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models | Umberto Cappellazzo et.al. | 2511.07253 | translate | read | null |
| 2025-11-10 | Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition | Marc Jayson Baucas et.al. | 2511.07189 | translate | read | null |
| 2025-11-10 | Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation | Matteo Pettenó et.al. | 2511.07156 | translate | read | null |
| 2025-11-10 | Generating Novel and Realistic Speakers for Voice Conversion | Meiying Melissa Chen et.al. | 2511.07135 | translate | read | null |
| 2025-11-10 | On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation | Matteo Pettenó et.al. | 2511.07118 | translate | read | null |
| 2025-11-10 | E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis | Zhisheng Zhang et.al. | 2511.07099 | translate | read | null |
| 2025-11-10 | Metric Analysis for Spatial Semantic Segmentation of Sound Scenes | Mayank Mishra et.al. | 2511.07075 | translate | read | null |
| 2025-11-10 | CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition | Hung-Yang Sung et.al. | 2511.06860 | translate | read | null |
| 2025-11-07 | Persian Musical Instruments Classification Using Polyphonic Data Augmentation | Diba Hadi Esfangereh et.al. | 2511.05717 | translate | read | null |
| 2025-11-02 | Factual and Musical Evaluation Metrics for Music Language Models | Daniel Chenyu Lin et.al. | 2511.05550 | translate | read | null |
| 2025-11-06 | PromptSep: Generative Audio Separation via Multimodal Prompting | Yutong Wen et.al. | 2511.04623 | translate | read | null |
| 2025-11-06 | MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers | Ali Boudaghi et.al. | 2511.04376 | translate | read | null |
| 2025-11-06 | Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition | Giovanni Barbarino et.al. | 2511.04291 | translate | read | null |
| 2025-11-06 | CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese | Dazhong Chen et.al. | 2511.04139 | translate | read | null |
| 2025-11-06 | Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms | Miguel E. Andres et.al. | 2511.04133 | translate | read | null |
| 2025-11-06 | WST: Weakly Supervised Transducer for Automatic Speech Recognition | Dongji Gao et.al. | 2511.04035 | translate | read | null |
| 2025-11-06 | Accelerating scientific discovery with the common task framework | J. Nathan Kutz et.al. | 2511.04001 | translate | read | null |
| 2025-11-06 | MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation | Shih-Lun Wu et.al. | 2511.03942 | translate | read | null |
| 2025-11-05 | SyMuPe: Affective and Controllable Symbolic Music Performance | Ilya Borovik et.al. | 2511.03425 | translate | read | null |
| 2025-11-05 | Seeing What You Say: Expressive Image Generation from Speech | Jiyoung Lee et.al. | 2511.03423 | translate | read | null |
| 2025-11-05 | Open Source State-Of-the-Art Solution for Romanian Speech Recognition | Gabriel Pirlogeanu et.al. | 2511.03361 | translate | read | null |
| 2025-11-05 | TASU: Text-Only Alignment for Speech Understanding | Jing Peng et.al. | 2511.03310 | translate | read | null |
| 2025-11-05 | How to Evaluate Speech Translation with Source-Aware Neural MT Metrics | Mauro Cettolo et.al. | 2511.03295 | translate | read | null |
| 2025-11-04 | An unscented Kalman filter method for real time input-parameter-state estimation | Marios Impraimakis et.al. | 2511.02717 | translate | read | null |
| 2025-11-04 | Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision | Kaimeng Jia et.al. | 2511.02270 | translate | read | null |
| 2025-11-04 | Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA | Takuto Ando et.al. | 2511.02269 | translate | read | null |
| 2025-11-03 | ADNAC: Audio Denoiser using Neural Audio Codec | Daniel Jimon et.al. | 2511.01773 | translate | read | null |
| 2025-11-03 | SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia | Chaoqun Liu et.al. | 2511.01670 | translate | read | null |
| 2025-11-03 | The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity | Louis Bradshaw et.al. | 2511.01663 | translate | read | null |
| 2025-11-02 | WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion | Dong Liu et.al. | 2511.01056 | translate | read | null |
| 2025-11-02 | MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models | Yayue Deng et.al. | 2511.00850 | translate | read | null |
| 2025-11-02 | Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures | Barathi Subramanian et.al. | 2511.00793 | translate | read | null |
| 2025-11-01 | More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks | Swapnil Bhosale et.al. | 2511.00641 | translate | read | null |
| 2025-11-01 | On Improvisation and Open-Endedness: Insights for Experiential AI | Botao ‘Amber’ Hu et.al. | 2511.00529 | translate | read | null |
| 2025-11-01 | Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study | Lucky Onyekwelu-Udoka et.al. | 2511.00402 | translate | read | null |