Compositional generative models
For generalizable scene generation and understanding
Y. Wang (TU Delft - Learning & Autonomous Control)
G.J.T. Leus – Promotor (TU Delft - Signal Processing Systems)
J.H.G. Dauwels – Promotor (TU Delft - Signal Processing Systems)
Abstract
Human intelligence is fundamentally compositional: it constructs new ideas by flexibly recombining known concepts, enabling generalization to entirely new tasks. We aim to build intelligent systems with similarly robust generalization capabilities. To that end, we develop compositional generative modeling frameworks and present three research thrusts that advance scene generation, decomposition, and understanding.
First, we introduce a hierarchical object-centric generative model that integrates latent-variable modeling with object-centric representation learning, enabling coherent multi-object scene generation and fine-grained object-level editing. This approach overcomes limitations of prior object-aware models by supporting flexible object morphology and significantly improving in-distribution generalization.
Second, we propose an unsupervised compositional image decomposition method that represents images as compositions of energy landscapes encoded by diffusion models. This enables the extraction of reusable global and local visual factors, such as shadows, expressions, and objects, and supports zero-shot compositional image generation by recombining these factors into novel configurations far outside the training distribution.
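The energy-landscape composition described above can be sketched mathematically; the notation below (factors E_i, denoisers ε_i) is illustrative rather than taken verbatim from the thesis. Representing an image as a composition of K factors corresponds to a product of component distributions, and at each diffusion step the composed noise prediction is approximately the sum of the individual factors' predictions:

```latex
% Product-of-experts composition of K visual factors (illustrative notation):
p(x) \;\propto\; \prod_{i=1}^{K} p_i(x)
     \;=\; \exp\!\Big(-\sum_{i=1}^{K} E_i(x)\Big),
% so the composed denoiser at diffusion step t is (approximately)
\epsilon_{\mathrm{comp}}(x_t, t) \;\approx\; \sum_{i=1}^{K} \epsilon_i(x_t, t).
```

Under this view, recombining factors amounts to choosing which terms E_i enter the sum, which is how swapping, say, a shadow or expression factor yields zero-shot compositions outside the training distribution.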
Third, we develop a compositional inverse generative modeling framework for scene understanding. By formulating inference as likelihood maximization over conditional generative model parameters, we show how composable diffusion models enable object discovery and multi-label classification in scenes substantially more complex than those seen during training, including generalization to images with more objects or new configurations. The framework also supports zero-shot category inference using pretrained generative models without additional training.
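The inference-as-likelihood-maximization idea admits a short sketch; the symbols c_i and weights w_i here are illustrative assumptions, following the standard form of composed conditional diffusion scores. Scene understanding becomes a search over conditioning parameters for the hypothesis under which the generative model best explains the observed image:

```latex
% Inverse generative modeling: infer scene parameters c (e.g., object
% categories present) by maximizing the conditional likelihood of image x:
\hat{c} \;=\; \arg\max_{c} \; \log p_\theta(x \mid c).
% For a hypothesis that concepts c_1, \dots, c_K are present, a composable
% diffusion model scores the hypothesis via the composed denoiser:
\epsilon_\theta(x_t, t \mid c_1, \dots, c_K)
  \;=\; \epsilon_\theta(x_t, t)
  \;+\; \sum_{i=1}^{K} w_i \big(\epsilon_\theta(x_t, t \mid c_i) - \epsilon_\theta(x_t, t)\big).
```

Because the search operates over sums of per-concept terms, hypotheses with more objects or novel configurations than any training scene remain scoreable, which is what enables the reported out-of-distribution generalization.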
Overall, these contributions demonstrate that incorporating compositional structure into generative modeling yields interpretable, controllable, and significantly more generalizable intelligent systems. This thesis offers a step toward building intelligent agents with the flexible, systematic compositional imagination characteristic of human cognition.