Self-supervised Audio-reactive Music Video Synthesis

Measuring and optimizing audiovisual correlation

Master Thesis (2022)
Author(s)

Hans Brouwer (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Y. Chen – Mentor (TU Delft - Data-Intensive Systems)

Cynthia CS Liem – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2022
Language
English
Graduation Date
29-06-2022
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Synthesizing audio-reactive videos to accompany music is a challenging multi-domain task that requires both a visual synthesis skill set and an understanding of musical information extraction. In recent years, a new, flexible class of visual synthesis methods has gained popularity: generative adversarial networks (GANs). These deep neural networks can be trained to reproduce arbitrary images from a dataset of roughly 10,000 examples. After training, they can be harnessed to synthesize audio-reactive videos by constructing sequences of inputs based on musical information.
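
The abstract does not detail how musical information is turned into GAN inputs. Below is a minimal, hypothetical sketch of the general idea in Python, assuming librosa for audio feature extraction and a pretrained latent-to-image generator G; both the generator and the onset-driven mapping are illustrative assumptions, not the thesis's actual pipeline.

    # Hypothetical sketch: drive a pretrained GAN's latent input with audio features.
    # The generator G and the specific mapping are assumptions for illustration.
    import numpy as np
    import librosa

    audio, sr = librosa.load("track.wav")

    # Onset strength envelope, resampled to one value per video frame (24 fps).
    onsets = librosa.onset.onset_strength(y=audio, sr=sr)
    n_frames = int(len(audio) / sr * 24)
    envelope = np.interp(
        np.linspace(0, len(onsets) - 1, n_frames),
        np.arange(len(onsets)),
        onsets / onsets.max(),
    )

    # Interpolate between two random latent vectors; the envelope controls how far
    # each frame is pushed toward the second vector, so percussive onsets produce
    # visible jumps in the generated imagery.
    z_a, z_b = np.random.randn(2, 512)
    latents = [z_a + e * (z_b - z_a) for e in envelope]
    # frames = [G(z) for z in latents]  # G: a pretrained generator, e.g. a StyleGAN

Hand-designed mappings like this are exactly the kind of configuration burden the thesis sets out to replace with learned input sequences.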

Current approaches suffer from several problems that hamper the quality and usability of GAN-based audio-reactive video synthesis. Some approaches consider only a small number of possible musical inputs and ways of mapping them to the GAN's parameters. This leads to weak audio-reactivity that exhibits similar motion characteristics regardless of the musical input. Other approaches do harness the full design space, but are difficult to configure correctly for effective results.

This thesis addresses the trade-off between audio-reactive flexibility and the ease of attaining effective results. We introduce multiple algorithms that explore the design space by using machine learning to generate the sequences of inputs for the GAN.

To develop these machine learning algorithms, we first introduce a metric, the audiovisual correlation, that measures the audio-reactivity of a video. We use this metric to train models on a dataset of audio examples alone, avoiding the need for a large dataset of example audio-reactive videos. This self-supervised approach can even be extended to optimize a single audio-reactive video directly, removing the need to train a model beforehand at all.
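
The abstract names the audiovisual correlation metric without defining it. The sketch below illustrates one plausible way such a score could be computed, by correlating per-frame audio change with per-frame visual change; the function, its inputs, and the Pearson-correlation formulation are assumptions for illustration, not the thesis's actual definition.

    # Illustrative sketch only: one plausible audio-reactivity score, comparing
    # per-frame audio feature changes against per-frame visual changes.
    import numpy as np

    def audiovisual_correlation(audio_env: np.ndarray, frames: np.ndarray) -> float:
        """audio_env: (T,) audio feature per frame; frames: (T, H, W, C) video."""
        # Mean absolute pixel difference between consecutive frames.
        visual_change = np.abs(
            np.diff(frames.reshape(len(frames), -1), axis=0)
        ).mean(axis=1)
        # Change in the audio feature between consecutive frames.
        audio_change = np.abs(np.diff(audio_env))
        # Pearson correlation between the two change signals: higher means the
        # video moves when the music moves.
        return float(np.corrcoef(audio_change, visual_change)[0, 1])

A differentiable variant of such a score is what would make it usable as a training or optimization objective, which is how the self-supervised setup described above can work from audio examples alone.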

Our evaluation shows that our algorithms outperform prior work in terms of audio-reactivity. Our solutions explore a wider range of the audio-reactive design space and do so without the need for manual feature extraction or configuration.
