
Learning Music Sequence Representation from Text Supervision

Authors :
Chen, Tianyu
Xie, Yuan
Zhang, Shuai
Huang, Shaohan
Zhou, Haoyi
Li, Jianxin
Source :
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 4583-4587
Publication Year :
2023

Abstract

Music representation learning is notoriously difficult because of the complex human-related concepts embedded in sequences of numerical signals. To excavate a better MUsic SEquence Representation from labeled audio, we propose a novel text-supervised pre-training method, MUSER. MUSER adopts an audio-spectrum-text tri-modal contrastive learning framework, in which the text input can be any form of metadata, converted with the help of text templates, while the spectrum is derived from the audio sequence. Our experiments reveal that MUSER can be adapted to downstream tasks more flexibly than current data-hungry pre-training methods, requiring only 0.056% of the pre-training data to achieve state-of-the-art performance.
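To make the tri-modal contrastive objective concrete, below is a minimal NumPy sketch. It assumes a CLIP-style symmetric InfoNCE loss applied pairwise across the three modalities (audio, spectrum, text); the encoders, temperature value, and the exact loss weighting used by MUSER are not specified in the abstract, so these are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings.

    Matched (positive) pairs sit at the same batch index; all other
    rows in the batch act as negatives. Temperature 0.07 is a common
    default (e.g. in CLIP), assumed here for illustration.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(a))                 # positives on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the audio->text and text->audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def tri_modal_loss(audio, spectrum, text):
    """Sum of pairwise contrastive losses over the three modality pairs.

    A stand-in for a tri-modal objective; the true pairing scheme and
    weights in MUSER may differ.
    """
    return (info_nce(audio, spectrum)
            + info_nce(audio, text)
            + info_nce(spectrum, text))
```

With perfectly aligned embeddings (all three modalities mapping a clip to the same vector), the loss is near zero; with unrelated embeddings it sits near `3 * log(N)` for batch size `N`, which is what drives the encoders toward a shared representation space.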

Details

Database :
arXiv
Journal :
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 4583-4587
Publication Type :
Report
Accession number :
edsarx.2305.19602
Document Type :
Working Paper
Full Text :
https://doi.org/10.1109/ICASSP43922.2022.9746131