SVQ-MAE: an efficient speech pre-training framework with constrained computational resources

Authors :: Xuyi Zhuang
Yukun Qian
Mingjiang Wang
Source :: EURASIP Journal on Audio, Speech, and Music Processing, Vol 2024, Iss 1, Pp 1-16 (2024)
Publication Year :: 2024
Publisher :: SpringerOpen, 2024.
Abstract: Self-supervised learning for speech pre-training models has achieved remarkable success in acquiring superior speech contextual representations by learning from unlabeled audio, excelling in numerous downstream speech tasks. However, the pre-training of these models necessitates significant computational resources and training duration, presenting a high barrier to entry into the realm of pre-training learning. In our efforts, by amalgamating the resource-efficient benefits of the generative learning model, Masked Auto Encoder, with the efficacy of the vector quantization method in discriminative learning, we introduce a novel pre-training framework: Speech Vector Quantization Masked Auto Encoder (SVQ-MAE). Distinct from the majority of SSL frameworks, which require simultaneous construction of speech contextual representations and mask reconstruction within an encoder-only module, we have exclusively designed a decoupled decoder for pre-training SVQ-MAE. This allows the additional decoupled decoder to undertake the mask reconstruction task solely, reducing the learning complexity of pretext tasks and enhancing the encoder’s efficiency in extracting speech contextual representations. Owing to this innovation, by using only 4 GPUs, SVQ-MAE can achieve high performance comparable to wav2vec 2.0, which requires 64 GPUs for training. In the Speech Processing Universal Performance Benchmark, SVQ-MAE surpasses wav2vec 2.0 in both keyword spotting and emotion recognition tasks. Furthermore, in cross-lingual ASR for Mandarin, upon fine-tuning on AISHELL-1, SVQ-MAE achieves a Character Error Rate of 4.09%, outperforming all supervised ASR models.

Subjects :: Self-supervised learning
MAE
Vector quantization
Constrained computational resources
Acoustics. Sound
QC221-246
Electronic computers. Computer science
QA75.5-76.95

Details

Language :: English
ISSN :: 16874722
Volume :: 2024
Issue :: 1
Database :: Directory of Open Access Journals
Journal :: EURASIP Journal on Audio, Speech, and Music Processing
Publication Type :: Academic Journal
Accession number :: edsdoj.22dd2237864a4756990308ecea310515
Document Type :: article
Full Text :: https://doi.org/10.1186/s13636-024-00375-1
