Training-Free Spatial Control for Multi-Entity Text-to-Image Generation
V. Petkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)
H. Jamali-Rad – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Hamid Palangi – Mentor
E. Isufi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jorge Abraham Martinez Castaneda – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M. Skrodzki – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Recent text-to-image (T2I) diffusion models can generate highly realistic images, but they often struggle to correctly arrange multiple objects according to specified spatial relationships. This limitation reduces their usefulness as controllable design tools. The problem is particularly challenging for modern multi-modal diffusion transformers (MM-DiTs), such as Stable Diffusion 3.5 and FLUX, whose architecture prevents the direct application of earlier layout-control techniques. Existing solutions either require costly model retraining or are training-free but provide limited and often unreliable control. This thesis introduces FOCAL, a training-free layout controller that formulates spatial guidance as a stochastic optimal control problem during diffusion sampling. By applying a closed-form correction derived from the model’s attention maps, FOCAL simultaneously enforces object placement and attention separation without modifying model weights. The method improves compositional accuracy across multiple backbones and achieves performance competitive with much larger state-of-the-art systems.