Training-Free Spatial Control for Multi-Entity Text-to-Image Generation

Master Thesis (2026)

Author(s)

V. Petkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H. Jamali-Rad – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Hamid Palangi – Mentor

E. Isufi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Jorge Abraham Martinez Castaneda – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M. Skrodzki – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Diffusion Models, Text-to-Image Generation, Spatial Layout Control

To reference this document use

https://resolver.tudelft.nl/uuid:327a7be1-756c-4464-b14d-3c1e6d095e15

Publication Year

2026

Language

English

Graduation Date

19-06-2026

Awarding Institution

Delft University of Technology

Programme

Computer Science, Data Science and Technology

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Recent text-to-image (T2I) diffusion models can generate highly realistic images, but they often struggle to correctly arrange multiple objects according to specified spatial relationships. This limitation reduces their usefulness as controllable design tools. The problem is particularly challenging for modern multi-modal diffusion transformers (MM-DiTs), such as Stable Diffusion 3.5 and FLUX, whose architecture prevents the direct application of earlier layout-control techniques. Existing solutions either require costly model retraining or use training-free methods that provide limited and often unreliable control. This thesis introduces FOCAL, a training-free layout controller that formulates spatial guidance as a stochastic optimal control problem during diffusion sampling. By applying a closed-form correction derived from the model’s attention maps, FOCAL simultaneously enforces object placement and attention separation without modifying model weights. The method improves compositional accuracy across multiple backbones and achieves performance competitive with much larger state-of-the-art systems.

Files

Training_Free_Spatial_Control_... (pdf)

(pdf | 45.4 MB)

License info not available