Compositional generative models

For generalizable scene generation and understanding

Doctoral Thesis (2026)
Author(s)

Y. Wang (TU Delft - Learning & Autonomous Control)

Contributor(s)

G.J.T. Leus – Promotor (TU Delft - Signal Processing Systems)

J.H.G. Dauwels – Promotor (TU Delft - Signal Processing Systems)

DOI
https://doi.org/10.4233/uuid:8ca44a87-30ce-4bff-87a4-a07abcebb8c8
Publication Year
2026
Language
English
Defense Date
21-05-2026
Awarding Institution
Delft University of Technology
ISBN (print)
978-94-6518-319-0
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Human intelligence is fundamentally compositional: it constructs new ideas by flexibly recombining known concepts, enabling generalization to entirely new tasks. This thesis aims to build intelligent systems with similarly robust generalization capabilities. To that end, it develops compositional generative modeling frameworks and presents three research thrusts that advance scene generation, decomposition, and understanding.

First, we introduce a hierarchical object-centric generative model that integrates latent-variable modeling with object-centric representation learning, enabling coherent multi-object scene generation and fine-grained object-level editing. This approach overcomes limitations of prior object-aware models by supporting flexible object morphology and significantly improving in-distribution generalization.
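To make the object-centric composition concrete, the sketch below (illustrative only, not the thesis architecture) shows one standard way such a model can assemble a scene: each per-object latent is decoded to an appearance map and an alpha logit, and pixels are assigned to objects by a softmax competition over the alpha channels. All module names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class ObjectCompositor(nn.Module):
    """Compose K object latents into one image via per-pixel mask competition."""

    def __init__(self, latent_dim=16, img_size=32):
        super().__init__()
        self.img_size = img_size
        # Hypothetical per-object decoder: latent -> 4 channels (RGB + alpha logit).
        self.decode = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 4 * img_size * img_size),
        )

    def forward(self, z):  # z: (batch, K objects, latent_dim)
        B, K, _ = z.shape
        out = self.decode(z).view(B, K, 4, self.img_size, self.img_size)
        rgb, alpha = out[:, :, :3], out[:, :, 3:]
        masks = torch.softmax(alpha, dim=1)  # objects compete for each pixel
        return (masks * rgb).sum(dim=1)      # composed scene: (B, 3, H, W)

scene = ObjectCompositor()(torch.randn(2, 5, 16))  # 2 scenes, 5 objects each
print(scene.shape)  # -> torch.Size([2, 3, 32, 32])
```

In this style of decoder, object-level editing amounts to modifying one latent z_k while leaving the others fixed, which is what makes the composition fine-grained.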

Second, we propose an unsupervised compositional image decomposition method that represents images as compositions of energy landscapes encoded by diffusion models. This enables the extraction of reusable global and local visual factors, such as shadows, expressions, and objects, and supports zero-shot compositional image generation by recombining these factors into novel configurations far outside the training distribution.
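The composition of energy landscapes can be sketched with the standard composable-diffusion heuristic: each factor's diffusion model supplies a noise prediction that acts as the gradient of an energy, and a conjunction of factors is sampled by summing the factors' deviations from an unconditional model. The function below is a minimal illustration under that assumption, not the thesis method; eps_uncond, factor_models, and weight are hypothetical names.

```python
import torch

def composed_eps(x_t, t, eps_uncond, factor_models, weight=1.0):
    """Sum per-factor deviations from the unconditional noise prediction."""
    base = eps_uncond(x_t, t)
    eps = base.clone()
    for eps_i in factor_models:
        # Each term is (an estimate of) the gradient of one factor's energy.
        eps = eps + weight * (eps_i(x_t, t) - base)
    return eps  # use in place of the usual eps inside a DDPM/DDIM sampling loop

# Toy usage with stand-in "models" (plain functions), just to show the shapes.
x = torch.randn(1, 3, 32, 32)
f = lambda x_t, t: 0.1 * x_t
print(composed_eps(x, torch.tensor([10]), f, [f, f]).shape)
```

Because the factors only interact through a sum, factors extracted from different images (a shadow from one, an expression from another) can be recombined at sampling time into configurations never seen during training.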

Third, we develop a compositional inverse generative modeling framework for scene understanding. By formulating inference as likelihood maximization over conditional generative model parameters, we show how composable diffusion models enable object discovery and multi-label classification in scenes substantially more complex than those seen during training, including generalization to images with more objects or new configurations. The framework also supports zero-shot category inference using pretrained generative models without additional training.
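As an illustration of inference-by-generation (a sketch in the spirit of the framework, not its exact formulation), a conditional diffusion model can act as a classifier: the concept assigned to an image is the candidate whose average denoising error, a standard proxy for the diffusion likelihood, is lowest. All names below are hypothetical.

```python
import torch

@torch.no_grad()
def infer_concept(x0, candidates, eps_model, alphas_cumprod, n_samples=32):
    """Pick the concept whose denoising error on x0 is lowest (max likelihood)."""
    errors = {}
    for c in candidates:
        total = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, len(alphas_cumprod), (1,))
            a = alphas_cumprod[t].view(1, 1, 1, 1)
            noise = torch.randn_like(x0)
            x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward diffusion
            total += ((eps_model(x_t, t, c) - noise) ** 2).mean().item()
        errors[c] = total / n_samples
    return min(errors, key=errors.get)  # maximum-likelihood concept
```

Because the scoring uses only a pretrained conditional model's denoising loss, it needs no additional training, which is what enables zero-shot category inference.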

Overall, these contributions demonstrate that incorporating compositional structure into generative modeling yields interpretable, controllable, and significantly more generalizable intelligent systems. This thesis offers a step toward building intelligent agents with the flexible, systematic compositional imagination characteristic of human cognition.

Files

Thesis_Yanbo_WANG.pdf
(pdf | 15.5 MB)
License info not available