Training-Free Spatial Control for Multi-Entity Text-to-Image Generation

Master Thesis (2026)
Author(s)

V. Petkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H. Jamali-Rad – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Hamid Palangi – Mentor

E. Isufi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Jorge Abraham Martinez Castaneda – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M. Skrodzki – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
19-06-2026
Awarding Institution
Delft University of Technology
Programme
Computer Science, Data Science and Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Recent text-to-image (T2I) diffusion models can generate highly realistic images, but they often struggle to correctly arrange multiple objects according to specified spatial relationships. This limitation reduces their usefulness as controllable design tools. The problem is particularly challenging for modern multi-modal diffusion transformers (MM-DiTs), such as Stable Diffusion 3.5 and FLUX, whose architecture prevents the direct application of earlier layout-control techniques. Existing solutions either require costly model retraining or use training-free methods that provide limited and often unreliable control. This thesis introduces FOCAL, a training-free layout controller that formulates spatial guidance as a stochastic optimal control problem during diffusion sampling. By applying a closed-form correction derived from the model’s attention maps, FOCAL simultaneously enforces object placement and attention separation without modifying model weights. The method improves compositional accuracy across multiple backbones and achieves performance competitive with much larger state-of-the-art systems.

License info not available