
VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features

Authors: Li, Sifei; Yang, Binxin; Yin, Chunji; Sun, Chong; Zhang, Yuxin; Dong, Weiming; Li, Chen
Publication Year: 2024

Abstract

Video-to-music generation presents significant potential in video production, requiring the generated music to be both semantically and rhythmically aligned with the video. Achieving this alignment demands advanced music generation capabilities, sophisticated video understanding, and an efficient mechanism to learn the correspondence between the two modalities. In this paper, we propose VidMusician, a parameter-efficient video-to-music generation framework built upon text-to-music models. VidMusician leverages hierarchical visual features to ensure semantic and rhythmic alignment between video and music. Specifically, our approach utilizes global visual features as semantic conditions and local visual features as rhythmic cues. These features are integrated into the generative backbone via cross-attention and in-attention mechanisms, respectively. Through a two-stage training process, we incrementally incorporate semantic and rhythmic features, utilizing zero initialization and identity initialization to maintain the inherent music-generative capabilities of the backbone. Additionally, we construct a diverse video-music dataset, DVMSet, encompassing various scenarios, such as promo videos, commercials, and compilations. Experiments demonstrate that VidMusician outperforms state-of-the-art methods across multiple evaluation metrics and exhibits robust performance on AI-generated videos. Samples are available at https://youtu.be/EPOSXwtl1jw.
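The abstract describes two conditioning paths into the text-to-music backbone: global visual features injected via cross-attention (semantics) and local, per-frame features injected via an in-attention-style additive path (rhythm), with zero initialization so the pretrained backbone's behavior is preserved at the start of training. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation; all module names, feature dimensions, and the exact placement of the injections are assumptions made for the example.

```python
# Hypothetical sketch of a visually conditioned music-decoder layer
# (assumed shapes and module names; not the VidMusician codebase).
import torch
import torch.nn as nn


class VisuallyConditionedDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_visual: int = 768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention over global (semantic) visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        # Zero-initialized projection for local (rhythmic) features, so the layer
        # initially reduces to the unmodified pretrained music backbone.
        self.rhythm_proj = nn.Linear(d_visual, d_model)
        nn.init.zeros_(self.rhythm_proj.weight)
        nn.init.zeros_(self.rhythm_proj.bias)
        self.semantic_proj = nn.Linear(d_visual, d_model)

    def forward(self, music_tokens, global_feats, local_feats):
        # music_tokens: (B, T, d_model) hidden states of the music backbone
        # global_feats: (B, Ng, d_visual) clip-level features (semantic condition)
        # local_feats:  (B, T, d_visual) frame-level features aligned to music steps (rhythmic cue)
        x = music_tokens
        # In-attention-style injection: add projected rhythmic cues to every position.
        x = x + self.rhythm_proj(local_feats)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention to global semantic features.
        mem = self.semantic_proj(global_feats)
        h = self.norm2(x)
        x = x + self.cross_attn(h, mem, mem, need_weights=False)[0]
        x = x + self.ffn(self.norm3(x))
        return x


if __name__ == "__main__":
    layer = VisuallyConditionedDecoderLayer()
    music = torch.randn(2, 100, 512)
    global_v = torch.randn(2, 4, 768)
    local_v = torch.randn(2, 100, 768)
    print(layer(music, global_v, local_v).shape)  # torch.Size([2, 100, 512])
```

Zero-initializing the rhythmic projection means the added path contributes nothing at the first training step, which mirrors the paper's stated goal of preserving the backbone's music-generative capability while new conditions are introduced incrementally.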

Details

Database: arXiv
Publication Type: Report
Accession number: edsarx.2412.06296
Document Type: Working Paper