Audiovisual Singing Voice Separation

This project is in collaboration with the ByteDance AI Lab. This project is partially supported by the National Science Foundation under grant No. 1741472, titled "BIGDATA: F: Audio-Visual Scene Understanding".
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

 

Publications

Bochen Li, Yuxuan Wang, and Zhiyao Duan, Audiovisual Singing Voice Separation, Transactions of the International Society for Music Information Retrieval, 4(1), pp. 195–209, 2021. DOI: http://doi.org/10.5334/tismir.108

Background / Motivation

  • Separating a song into vocal and accompaniment components is an active research topic.
  • We propose to apply the visual information corresponding to the singers’ vocal activities to improve the quality of the separated vocal signals.
  • We propose the "solo voice separation" concept, where backing vocal components (if any) are regarded as part of the accompaniment.

Method

Model structure

  • (a) The audio subnetwork.
  • (b) The video subnetwork.
  • (c) The audiovisual fusion.

Results

Demo 1

Vocal separation results on the URSing dataset, which was recorded in a sound booth, in scenarios different from those of the training/validation data.


Original mixture

Ground-truth solo vocal

Result from audio-based method

Result from proposed audiovisual method

  • The vocal separated by the audio-based method still contains some backing vocals.
  • The proposed audiovisual method separates only the vocal components of the solo singer by attending to the singer's mouth movements.

Demo 2

Evaluations on a cappella songs downloaded from YouTube.

Original mixture

Separated vocal from audio-based method

Separated solo vocal from proposed method

Demo 3

Evaluations on randomly mixed samples (same scenario as the training/validation data).

Original mixture

Ground-truth solo vocal

Result from audio-based method

Result from proposed audiovisual method

Last Update: Jan 2022