
MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

Mar 24, 2026 · arXiv
Zikang Huang, Meng Ge, Tianrui Wang, Xuanchen Li, Xiaobao Wang, Longbiao Wang, +1 more

Abstract

Self-supervised learning (SSL) has advanced speech processing, but existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms at different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 kHz to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT apply directly.
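To make the shared-temporal-resolution idea concrete, here is a minimal PyTorch sketch of a multi-sampling-rate downsampling front end. HuBERT's CNN front end has a total stride of 320 at 16 kHz, i.e. a 50 Hz frame rate; the sketch gives each supported rate its own convolutional stem whose total stride equals rate / 50, so 16, 24, and 48 kHz waveforms all land on the same 50 Hz grid for a single shared Transformer encoder. The class name `MultiRateDownsampler`, the two-stage stride split, and the per-rate `ModuleDict` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiRateDownsampler(nn.Module):
    """Illustrative sketch (not the paper's exact design): map raw waveforms
    at several sampling rates to a shared 50 Hz frame rate without resampling,
    by giving each rate its own conv stem with total stride = rate / 50."""

    def __init__(self, rates=(16_000, 24_000, 48_000), dim=512, frame_hz=50):
        super().__init__()
        self.stems = nn.ModuleDict()
        for rate in rates:
            total_stride = rate // frame_hz  # 320 @ 16 kHz, 960 @ 48 kHz
            # Two-stage stem: only the first conv's stride varies with the
            # rate; the second stage uses a fixed 32x stride for every rate.
            # The exact split is an illustrative choice.
            self.stems[str(rate)] = nn.Sequential(
                nn.Conv1d(1, dim, kernel_size=10, stride=total_stride // 32),
                nn.GELU(),
                nn.Conv1d(dim, dim, kernel_size=3, stride=32),
                nn.GELU(),
            )

    def forward(self, wav: torch.Tensor, rate: int) -> torch.Tensor:
        # wav: (batch, samples) at `rate` Hz -> (batch, ~frame_hz * s, dim)
        return self.stems[str(rate)](wav.unsqueeze(1)).transpose(1, 2)

# One second of audio at any supported rate yields the same 50 frames,
# so a single masked-prediction encoder can consume mixed-rate batches.
down = MultiRateDownsampler()
print(down(torch.randn(2, 16_000), 16_000).shape)  # torch.Size([2, 50, 512])
print(down(torch.randn(2, 48_000), 48_000).shape)  # torch.Size([2, 50, 512])
```

Because the output frame rate matches standard HuBERT's 50 Hz, the masking scheme, prediction targets, and Transformer encoder could in principle be reused unchanged, which is consistent with the abstract's claim that HuBERT-specific analyses and improvements carry over.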
