Guiding Diffusion Models for Spatially Consistent Image Generation

Master Thesis (2025)
Author(s)

V.P. Chatalbasheva (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H. Jamali-Rad – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

S. Rastegar – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Elvin Isufi – Graduation committee member (TU Delft - Multimedia Computing)

Holger Caesar – Graduation committee member (TU Delft - Intelligent Vehicles)

Hamid Palangi – Graduation committee member (Google)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
20-06-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Text-to-image (T2I) diffusion models have achieved remarkable image quality but still struggle to produce images that align with the compositional information in the input text prompt, particularly spatial cues. We attribute this limitation to two key factors: the lack of clear, fine-grained spatial supervision in common training datasets, and the inability of the CLIP text encoder, used in the pretraining of Stable Diffusion models, to represent spatial semantics. While recent work has addressed object omission and attribute mismatches, accurately placing objects at the spatial locations specified in the text prompt remains an open challenge. Prior solutions typically rely on fine-tuning, which introduces computational overhead and risks degrading the pretrained model’s generative prior on tasks unrelated to spatial reasoning. In this work, we introduce InfSplign, a simple and training-free method that improves spatial understanding in T2I diffusion models. InfSplign leverages attention maps and a centroid-based loss to guide object placement during sampling at inference time, without modifying the pretrained model. Our approach is modular, lightweight, and compatible with any pretrained diffusion model. InfSplign achieves strong performance on spatial benchmarks such as VISOR, T2I-CompBench, and GenEval, outperforming baselines in many scenarios.
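
To make the guidance idea concrete, the following is a minimal, illustrative sketch of centroid-based attention guidance at inference time. It is not the thesis' implementation: attn_fn stands in for a hypothetical differentiable hook that extracts an object token's attention map from the denoising network, and target_xy, step_size, and all function names are assumptions introduced here for illustration only.

import torch

def attention_centroid(attn_map):
    """Return the (x, y) centroid of a 2-D attention map of shape (H, W)."""
    h, w = attn_map.shape
    attn = attn_map / (attn_map.sum() + 1e-8)          # normalize to a probability map
    ys = torch.arange(h, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(w, dtype=attn.dtype, device=attn.device)
    cy = (attn.sum(dim=1) * ys).sum()                  # expected row index
    cx = (attn.sum(dim=0) * xs).sum()                  # expected column index
    return torch.stack([cx, cy])

def centroid_loss(attn_map, target_xy):
    """Squared distance between the attention centroid and the desired location."""
    return torch.sum((attention_centroid(attn_map) - target_xy) ** 2)

def guidance_step(latents, attn_fn, target_xy, step_size=0.1):
    """One inference-time update: nudge the latents down the gradient of the loss.

    attn_fn is a placeholder for a differentiable hook that returns the object
    token's attention map given the current latents (an assumption of this sketch).
    """
    latents = latents.detach().requires_grad_(True)
    loss = centroid_loss(attn_fn(latents), target_xy)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()

# Toy check: a softmax over the latents themselves stands in for an attention map.
latents = torch.randn(16, 16)
attn_fn = lambda z: torch.softmax(z.flatten(), dim=0).reshape(16, 16)
latents = guidance_step(latents, attn_fn, target_xy=torch.tensor([4.0, 12.0]))

The key design point the sketch illustrates is that guidance happens purely at sampling time: the pretrained weights are untouched, and only the intermediate latents are adjusted using a loss defined on the attention maps.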
