Query by Video: Cross-modal Music Retrieval

This project is a collaboration with Spotify, mentored by Aparna Kumar.

Publication

Bochen Li and Aparna Kumar, Query by video: cross-modal music retrieval, In Proc. International Society for Music Information Retrieval (ISMIR), 2019.

Bochen Li and Aparna Kumar, Systems, methods & computer program products for associating media content having different modalities, U.S. Patent 16/439,626, June 2019.

Problem Statement

Input: A short video clip.
Output: A list of songs retrieved from an existing music database.
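
The retrieval step can be illustrated as a nearest-neighbor search in a shared embedding space. The sketch below is not the implementation from the paper: it assumes the video and every music excerpt have already been mapped to vectors in the same latent space, and it assumes Euclidean distance as the cross-modal distance. The names `retrieve`, `music_db`, and the toy 2-D vectors are hypothetical and exist only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(video_embedding, music_embeddings, top_k=5):
    """Return up to top_k (rank, track_id, distance) tuples, closest first.

    music_embeddings maps a track id to its embedding vector.
    Euclidean distance is an assumption made for this sketch.
    """
    scored = sorted(
        (euclidean(video_embedding, emb), track_id)
        for track_id, emb in music_embeddings.items()
    )
    return [
        (rank, track_id, dist)
        for rank, (dist, track_id) in enumerate(scored[:top_k], start=1)
    ]


# Toy 2-D vectors standing in for the outputs of the video and music encoders.
music_db = {
    "track_a": [0.0, 1.0],
    "track_b": [1.0, 0.0],
    "track_c": [3.0, 4.0],
}
query = [0.0, 0.5]

results = retrieve(query, music_db, top_k=2)
for rank, track_id, dist in results:
    print(f"rank {rank}: {track_id} (distance {dist:.3f})")
```

With a real model, `query` would be the embedding of the input video clip and `music_db` would hold one embedding per excerpt in the music database. Asking for more results than the database holds simply returns every track in ranked order.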

Method

The model structure.

The learned cross-modal latent emotion space.

The t-SNE visualization of the latent emotion space. Four randomly selected regions are shown in colors representing different emotion concepts (gloomy, ambient, delicate, and sweet), each with thumbnails of the paired videos.

Demo Results

Demo 1

Each demo video pairs a silent input video with the music track retrieved by the proposed model.

Query videos are from: The test split of the Cowen2017 dataset.
Music database contains: The test split of the AudioSet Music Mood Subset (354 music excerpts).

Note

Each YouTube link includes 30 query videos, each presented with its top 5 retrieved music excerpts.
The rank and cross-modal distance of each retrieved excerpt are displayed.
Query videos vary in duration, but every music excerpt is 10 seconds long, so some videos end before the audio track does.