Audio Processing - 2025-11 | Paper Arxiv Daily

Publish Date	Title	Authors	PDF (arXiv ID)	Translate	Read	Code
2025-11-27	Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset	Nick Rossenbach et.al.	2512.17915	translate	read	null
2025-11-04	V-Agent: An Interactive Video Search System Using Vision-Language Models	SunYoung Park et.al.	2512.16925	translate	read	null
2025-11-30	Benchmarking Automatic Speech Recognition Models for African Languages	Alvin Nahabwe et.al.	2512.10968	translate	read	null
2025-11-30	ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages	Subham Kumar et.al.	2512.10967	translate	read	null
2025-11-23	SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model	Kaidi Wang et.al.	2512.05126	translate	read	null
2025-11-18	On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts	Kashaf Gulzar et.al.	2512.02027	translate	read	null
2025-11-29	Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning	Arnesh Batra et.al.	2512.00621	translate	read	null
2025-11-28	OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion	Sai Koneru et.al.	2512.00234	translate	read	null
2025-11-27	Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment	Jiaying Hong et.al.	2512.00120	translate	read	null
2025-11-28	Scaling HuBERT for African Languages: From Base to Large and XL	Antoine Caubrière et.al.	2511.23370	translate	read	null
2025-11-28	HPSU: A Benchmark for Human-Level Perception in Real-World Spoken Speech Understanding	Chen Li et.al.	2511.23178	translate	read	null
2025-11-28	Group-Aware Partial Model Merging for Children’s Automatic Speech Recognition	Thomas Rolland et.al.	2511.23098	translate	read	null
2025-11-27	Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration	Kanchon Gharami et.al.	2511.22769	translate	read	null
2025-11-27	Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition	Maheswar Bora et.al.	2511.22443	translate	read	null
2025-11-27	GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis	Teysir Baoueb et.al.	2511.22293	translate	read	null
2025-11-16	On the Cross-lingual Transferability of Pre-trained wav2vec2-based Models	Jonatas Grosman et.al.	2511.21704	translate	read	null
2025-11-26	ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features	Ye Bhone Lin et.al.	2511.21088	translate	read	null
2025-11-26	CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation	Jionghao Han et.al.	2511.21045	translate	read	null
2025-11-26	Towards Audio Token Compression in Large Audio Language Models	Saurabhchand Bhati et.al.	2511.20973	translate	read	null
2025-11-26	SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications	Jionghao Han et.al.	2511.20972	translate	read	null
2025-11-25	Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition	Wesley Bian et.al.	2511.20534	translate	read	null
2025-11-25	Modular Deep Learning Framework for Assistive Perception: Gaze, Affect, and Speaker Identification	Akshit Pramod Anchan et.al.	2511.20474	translate	read	null
2025-11-25	Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach	Huu Tuong Tu et.al.	2511.20107	translate	read	null
2025-11-25	Continual Audio Deepfake Detection via Universal Adversarial Perturbation	Wangjie Li et.al.	2511.19974	translate	read	null
2025-11-24	Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation	Maral Ebrahimzadeh et.al.	2511.19342	translate	read	null
2025-11-24	Neural Architecture Search for Quantum Autoencoders	Hibah Agha et.al.	2511.19246	translate	read	null
2025-11-24	Context-Aware Whisper for Arabic ASR Under Linguistic Varieties	Bashar Talafha et.al.	2511.18774	translate	read	null
2025-11-24	AIRHILT: A Human-in-the-Loop Testbed for Multimodal Conflict Detection in Aviation	Omar Garib et.al.	2511.18718	translate	read	null
2025-11-23	InstructAudio: Unified speech and music generation with natural language instruction	Chunyu Qiang et.al.	2511.18487	translate	read	null
2025-11-23	A Multimodal Conversational Agent for Tabular Data Analysis	Mohammad Nour Al Awad et.al.	2511.18405	translate	read	null
2025-11-21	Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation	Scott Merrill et.al.	2511.17813	translate	read	null
2025-11-12	Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward	Guansu Wang et.al.	2511.17555	translate	read	null
2025-11-21	MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core	Callie C. Liao et.al.	2511.17323	translate	read	null
2025-11-20	Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs	Wei-Cheng Tseng et.al.	2511.16639	translate	read	null
2025-11-20	WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue	Zachary Ellis et.al.	2511.16544	translate	read	null
2025-11-20	SceneGuard: Training-Time Voice Protection with Scene-Consistent Audible Background Noise	Rui Sang et.al.	2511.16114	translate	read	null
2025-11-20	Train Short, Infer Long: Speech-LLM Enables Zero-Shot Streamable Joint ASR and Diarization on Long Audio	Mohan Shi et.al.	2511.16046	translate	read	null
2025-11-19	LargeSHS: A large-scale dataset of music adaptation	Chih-Pin Tan et.al.	2511.15270	translate	read	null
2025-11-19	Aligning Generative Music AI with Human Preferences: Methods and Challenges	Dorien Herremans et.al.	2511.15038	translate	read	null
2025-11-06	The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech	Julio Cesar Galdino et.al.	2511.14779	translate	read	null
2025-11-18	A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder	Dengyun Huang et.al.	2511.14600	translate	read	null
2025-11-18	TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation	Wei Liu et.al.	2511.14410	translate	read	null
2025-11-18	AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR	Gabrial Zencha Ashungafac et.al.	2511.14255	translate	read	null
2025-11-18	Listen Like a Teacher: Mitigating Whisper Hallucinations using Adaptive Layer Attention and Knowledge Distillation	Kumud Tripathi et.al.	2511.14219	translate	read	null
2025-11-17	Human-centric Maintenance Process Through Integration of AI, Speech, and AR	Parul Khanna et.al.	2511.13918	translate	read	null
2025-11-05	Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion	Xiao Li et.al.	2511.13731	translate	read	null
2025-11-17	Alpha Divergence Losses for Biometric Verification	Dimitrios Koutsianos et.al.	2511.13621	translate	read	null
2025-11-17	Toward Conversational Hungarian Speech Recognition: Introducing the BEA-Large and BEA-Dialogue Datasets	Máté Gedeon et.al.	2511.13529	translate	read	null
2025-11-17	Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs	Zhe Sun et.al.	2511.13273	translate	read	null
2025-11-17	Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis	Zaara Zabeen Arpa et.al.	2511.13159	translate	read	null
2025-11-16	Hi-Reco: High-Fidelity Real-Time Conversational Digital Humans	Hongbin Huang et.al.	2511.12662	translate	read	null
2025-11-15	VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing	Zhisheng Zheng et.al.	2511.12347	translate	read	null
2025-11-15	How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer	Minu Kim et.al.	2511.12285	translate	read	null
2025-11-15	Fusionista2.0: Efficiency Retrieval System for Large-Scale Datasets	Huy M. Le et.al.	2511.12255	translate	read	null
2025-11-12	Tighter Truncated Rectangular Prism Approximation for RNN Robustness Verification	Xingqi Lin et.al.	2511.11699	translate	read	null
2025-11-14	Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition	Yiming Rong et.al.	2511.11139	translate	read	null
2025-11-13	Curved Worlds, Clear Boundaries: Generalizing Speech Deepfake Detection using Hyperbolic and Spherical Geometry Spaces	Farhan Sheth et.al.	2511.10793	translate	read	null
2025-11-13	TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English	Fethi Bougares et.al.	2511.10780	translate	read	null
2025-11-09	Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment	Yan Gao et.al.	2511.10670	translate	read	null
2025-11-13	VocalNet-M2: Advancing Low-Latency Spoken Language Modeling via Integrated Multi-Codebook Tokenization and Multi-Token Prediction	Yuhao Wang et.al.	2511.10232	translate	read	null
2025-11-13	FabasedVC: Enhancing Voice Conversion with Text Modality Fusion and Phoneme-Level SSL Features	Wenyu Wang et.al.	2511.10112	translate	read	null
2025-11-13	Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS	Haoyu Li et.al.	2511.09995	translate	read	null
2025-11-12	Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages	Omnilingual ASR team et.al.	2511.09690	translate	read	null
2025-11-12	Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation	Xinyi Tong et.al.	2511.09585	translate	read	null
2025-11-12	End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering	Jiliang Hu et.al.	2511.09282	translate	read	null
2025-11-12	Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation	Shulei Ji et.al.	2511.09090	translate	read	null
2025-11-12	Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition	Chao Wang et.al.	2511.09085	translate	read	null
2025-11-12	Towards Effective and Efficient Non-autoregressive decoders for Conformer and LLM-based ASR using Block-based Attention Mask	Tianzi Wang et.al.	2511.09084	translate	read	null
2025-11-11	HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios	Bingsong Bai et.al.	2511.08496	translate	read	null
2025-11-11	Melodia: Training-Free Music Editing Guided by Attention Probing in Diffusion Models	Yi Yang et.al.	2511.08252	translate	read	null
2025-11-11	Quantizing Whisper-small: How design choices affect ASR performance	Arthur Söhler et.al.	2511.08093	translate	read	null
2025-11-11	SpeechJudge: Towards Human-Level Judgment for Speech Naturalness	Xueyao Zhang et.al.	2511.07931	translate	read	null
2025-11-10	Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction	Hyeryun Park et.al.	2511.07392	translate	read	null
2025-11-10	Generating Piano Music with Transformers: A Comparative Study of Scale, Data, and Metrics	Jonathan Lehmkuhl et.al.	2511.07268	translate	read	null
2025-11-10	Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models	Umberto Cappellazzo et.al.	2511.07253	translate	read	null
2025-11-10	Improving Remote Patient Monitoring Systems Using a Fog-based IoT Platform with Speech Recognition	Marc Jayson Baucas et.al.	2511.07189	translate	read	null
2025-11-10	Conditional Diffusion as Latent Constraints for Controllable Symbolic Music Generation	Matteo Pettenó et.al.	2511.07156	translate	read	null
2025-11-10	Generating Novel and Realistic Speakers for Voice Conversion	Meiying Melissa Chen et.al.	2511.07135	translate	read	null
2025-11-10	On the Joint Minimization of Regularization Loss Functions in Deep Variational Bayesian Methods for Attribute-Controlled Symbolic Music Generation	Matteo Pettenó et.al.	2511.07118	translate	read	null
2025-11-10	E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis	Zhisheng Zhang et.al.	2511.07099	translate	read	null
2025-11-10	Metric Analysis for Spatial Semantic Segmentation of Sound Scenes	Mayank Mishra et.al.	2511.07075	translate	read	null
2025-11-10	CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition	Hung-Yang Sung et.al.	2511.06860	translate	read	null
2025-11-07	Persian Musical Instruments Classification Using Polyphonic Data Augmentation	Diba Hadi Esfangereh et.al.	2511.05717	translate	read	null
2025-11-02	Factual and Musical Evaluation Metrics for Music Language Models	Daniel Chenyu Lin et.al.	2511.05550	translate	read	null
2025-11-06	PromptSep: Generative Audio Separation via Multimodal Prompting	Yutong Wen et.al.	2511.04623	translate	read	null
2025-11-06	MusRec: Zero-Shot Text-to-Music Editing via Rectified Flow and Diffusion Transformers	Ali Boudaghi et.al.	2511.04376	translate	read	null
2025-11-06	Robustness of Minimum-Volume Nonnegative Matrix Factorization under an Expanded Sufficiently Scattered Condition	Giovanni Barbarino et.al.	2511.04291	translate	read	null
2025-11-06	CantoASR: Prosody-Aware ASR-LALM Collaboration for Low-Resource Cantonese	Dazhong Chen et.al.	2511.04139	translate	read	null
2025-11-06	Testing the Testers: Human-Driven Quality Assessment of Voice AI Testing Platforms	Miguel E. Andres et.al.	2511.04133	translate	read	null
2025-11-06	WST: Weakly Supervised Transducer for Automatic Speech Recognition	Dongji Gao et.al.	2511.04035	translate	read	null
2025-11-06	Accelerating scientific discovery with the common task framework	J. Nathan Kutz et.al.	2511.04001	translate	read	null
2025-11-06	MIDI-LLM: Adapting Large Language Models for Text-to-MIDI Music Generation	Shih-Lun Wu et.al.	2511.03942	translate	read	null
2025-11-05	SyMuPe: Affective and Controllable Symbolic Music Performance	Ilya Borovik et.al.	2511.03425	translate	read	null
2025-11-05	Seeing What You Say: Expressive Image Generation from Speech	Jiyoung Lee et.al.	2511.03423	translate	read	null
2025-11-05	Open Source State-Of-the-Art Solution for Romanian Speech Recognition	Gabriel Pirlogeanu et.al.	2511.03361	translate	read	null
2025-11-05	TASU: Text-Only Alignment for Speech Understanding	Jing Peng et.al.	2511.03310	translate	read	null
2025-11-05	How to Evaluate Speech Translation with Source-Aware Neural MT Metrics	Mauro Cettolo et.al.	2511.03295	translate	read	null
2025-11-04	An unscented Kalman filter method for real time input-parameter-state estimation	Marios Impraimakis et.al.	2511.02717	translate	read	null
2025-11-04	Augmenting Open-Vocabulary Dysarthric Speech Assessment with Human Perceptual Supervision	Kaimeng Jia et.al.	2511.02270	translate	read	null
2025-11-04	Energy-Efficient Hardware Acceleration of Whisper ASR on a CGLA	Takuto Ando et.al.	2511.02269	translate	read	null
2025-11-03	ADNAC: Audio Denoiser using Neural Audio Codec	Daniel Jimon et.al.	2511.01773	translate	read	null
2025-11-03	SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia	Chaoqun Liu et.al.	2511.01670	translate	read	null
2025-11-03	The Ghost in the Keys: A Disklavier Demo for Human-AI Musical Co-Creativity	Louis Bradshaw et.al.	2511.01663	translate	read	null
2025-11-02	WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion	Dong Liu et.al.	2511.01056	translate	read	null
2025-11-02	MULTI-Bench: A Multi-Turn Interactive Benchmark for Assessing Emotional Intelligence ability of Spoken Dialogue Models	Yayue Deng et.al.	2511.00850	translate	read	null
2025-11-02	Rhythm in the Air: Vision-based Real-Time Music Generation through Gestures	Barathi Subramanian et.al.	2511.00793	translate	read	null
2025-11-01	More Than A Shortcut: A Hyperbolic Approach To Early-Exit Networks	Swapnil Bhosale et.al.	2511.00641	translate	read	null
2025-11-01	On Improvisation and Open-Endedness: Insights for Experiential AI	Botao ‘Amber’ Hu et.al.	2511.00529	translate	read	null
2025-11-01	Emotion Detection in Speech Using Lightweight and Transformer-Based Models: A Comparative and Ablation Study	Lucky Onyekwelu-Udoka et.al.	2511.00402	translate	read	null
