Skeleton Plays Piano: Online Generation of Pianist Body Movements from MIDI Performance

Bochen Li, Akira Maezawa, and Zhiyao Duan

This project is in collaboration with the Yamaha Corporation. This project is partially supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Publication

Bochen Li, Akira Maezawa, and Zhiyao Duan, Skeleton plays piano: online generation of pianist body movements from MIDI performance, in Proc. International Society for Music Information Retrieval (ISMIR), 2018. <pdf> <slides>

Akira Maezawa and Bochen Li, Information processing method, U.S. Patent 16/983,341, November 2020.

What is the problem?

We aim to train a system that generates a virtual pianist animation with expressive performance motions, given symbolic music in MIDI format.

Input: a live data stream of key depression actions and the corresponding metric structure (optional)
Output: a time sequence of body joint coordinates
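To make the input and output formats concrete, here is a minimal sketch of one possible frame-based encoding. It is an illustration only, not the encoding used in the paper: the `NoteEvent` type, the `piano_roll` helper, the 88-key range, the velocity scaling, and the 2-D `Pose` type are all assumptions. It also rasterizes a finished list of events, whereas a live system would fill in frames as key presses arrive.

```python
from dataclasses import dataclass
from typing import List, Tuple

LOWEST_PITCH = 21  # MIDI number of A0, the lowest key on an 88-key piano
N_KEYS = 88        # assumed keyboard range (hypothetical choice)

# Assumed output type: one (x, y) pair per body joint, per time frame.
Pose = List[Tuple[float, float]]


@dataclass
class NoteEvent:
    """One key depression (hypothetical container, not from the paper)."""
    onset: float    # key-down time in seconds
    offset: float   # key-up time in seconds
    pitch: int      # MIDI note number
    velocity: int   # MIDI velocity, 0..127


def piano_roll(events: List[NoteEvent], frame_rate: float,
               n_frames: int) -> List[List[float]]:
    """Rasterize note events into an n_frames x N_KEYS matrix.

    Each cell holds the velocity (scaled to 0..1) of the key while it is
    held down, and 0.0 otherwise.
    """
    roll = [[0.0] * N_KEYS for _ in range(n_frames)]
    for ev in events:
        key = ev.pitch - LOWEST_PITCH
        if not 0 <= key < N_KEYS:
            continue  # ignore pitches outside the assumed keyboard range
        start = int(ev.onset * frame_rate)
        # Every note occupies at least one frame, even if it is very short.
        end = max(start + 1, int(ev.offset * frame_rate))
        for t in range(start, min(end, n_frames)):
            roll[t][key] = ev.velocity / 127.0
    return roll


# Usage: middle C (MIDI 60) held for half a second, sampled at 10 frames/s.
roll = piano_roll([NoteEvent(0.0, 0.5, 60, 127)], frame_rate=10, n_frames=10)
```

Each row of `roll` is one time frame of the input. The system's output would then be one `Pose` per frame.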

Motivation

Generating expressive body movement is important for music interactions
Most existing frameworks cannot incorporate music context information for whole-body expressive movement generation

Applications

Demonstration for music learners by replicating a musician's body interpretations of music
More immersive music enjoyment experience
Visual interactions in automatic computer accompaniment

What is our approach?

We first use two CNNs to parse the raw MIDI note stream and the metric structure, respectively. We then feed the extracted feature representations to an LSTM network, which generates the body movements as a sequence of upper-body joint coordinates forming a skeleton.
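The data flow described above can be sketched in plain Python. This is a toy, untrained illustration under stated assumptions, not the network from the paper. All layer sizes, kernel widths, the pooling step, the bias-free LSTM cell, the 2-D output, and every name (`conv1d_relu`, `LSTMCell`, `generate_online`) are hypothetical. Real training would use a deep learning framework and learned weights.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible


def rand_matrix(rows, cols, scale=0.1):
    return [[random.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def conv1d_relu(x, kernels):
    """Valid 1-D convolution + ReLU, max-pooled to one feature per kernel."""
    feats = []
    for k in kernels:
        n = len(x) - len(k) + 1
        feats.append(max(
            max(0.0, sum(k[j] * x[i + j] for j in range(len(k))))
            for i in range(n)))
    return feats


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTMCell:
    """Standard LSTM cell update, with biases omitted for brevity."""

    def __init__(self, n_in, n_hidden):
        # One weight matrix per gate, acting on [input ; previous hidden].
        self.W = {g: rand_matrix(n_hidden, n_in + n_hidden) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation
        i = [sigmoid(v) for v in matvec(self.W["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.W["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.W["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.W["g"], z)]  # candidate
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


# All sizes below are illustrative assumptions.
N_PITCHES, N_METRIC, N_JOINTS, N_HIDDEN = 88, 4, 8, 16
note_kernels = rand_matrix(6, 5)    # "CNN" 1: 6 kernels over the pitch axis
metric_kernels = rand_matrix(2, 2)  # "CNN" 2: 2 kernels over metric features
lstm = LSTMCell(6 + 2, N_HIDDEN)
W_out = rand_matrix(2 * N_JOINTS, N_HIDDEN)  # hidden state -> (x, y) per joint


def generate_online(frames):
    """Yield one skeleton per incoming (note_frame, metric_frame) pair."""
    h, c = [0.0] * N_HIDDEN, [0.0] * N_HIDDEN
    for note_frame, metric_frame in frames:
        feats = (conv1d_relu(note_frame, note_kernels)
                 + conv1d_relu(metric_frame, metric_kernels))
        h, c = lstm.step(feats, h, c)  # recurrent state carries past context
        coords = matvec(W_out, h)
        yield [(coords[2 * j], coords[2 * j + 1]) for j in range(N_JOINTS)]
```

Because `generate_online` is a generator that keeps only the LSTM state between frames, it emits a pose as soon as each frame arrives. That is the property an online system needs.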

Our Results

Subjective Evaluations

We conduct subjective evaluations to rate the expressiveness and naturalness of the generated skeleton movements compared with those extracted from real human players. Specifically, we recruited 18 subjects from Yamaha to watch 32 10-second video excerpts of "skeleton plays piano": 16 generated and 16 real. The ratings are plotted in the following figure, where tracks with significantly different ratings are marked with "*".