Guiding Diffusion Models for Spatially Consistent Image Generation

Master Thesis (2025)
Author(s)

V.P. Chatalbasheva (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H. Jamali-Rad – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

S. Rastegar – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Elvin Isufi – Graduation committee member (TU Delft - Multimedia Computing)

Holger Caesar – Graduation committee member (TU Delft - Intelligent Vehicles)

Hamid Palangi – Graduation committee member (Google)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
20-06-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Text-to-image (T2I) diffusion models have achieved remarkable image quality but still struggle to produce images that align with the compositional information in the input text prompt, particularly spatial cues. We attribute this limitation to two key factors: the lack of clear, fine-grained spatial supervision in common training datasets, and the inability of the CLIP text encoder, used in the pretraining of Stable Diffusion models, to represent spatial semantics. While recent work has addressed object omission and attribute mismatches, accurately placing objects at the spatial locations specified in the text prompt remains an open challenge. Prior solutions typically rely on fine-tuning, which introduces computational overhead and risks degrading the pretrained model’s generative prior on tasks unrelated to spatial reasoning. In this work, we introduce InfSplign, a simple and training-free method that improves spatial understanding in T2I diffusion models. InfSplign leverages attention maps and a centroid-based loss to guide object placement during sampling at inference time, without modifying the pretrained model. Our approach is modular, lightweight, and compatible with any pretrained diffusion model. InfSplign achieves strong performance on spatial benchmarks such as VISOR, T2I-CompBench, and GenEval, outperforming baselines in many scenarios.
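
To make the guidance idea concrete, the following is a minimal, illustrative sketch of centroid-based attention guidance at inference time. It is not the thesis' implementation: attn_fn stands in for a hypothetical differentiable hook that extracts an object token's attention map from the denoising network, and target_xy, step_size, and all function names are assumptions introduced here for illustration only.

import torch

def attention_centroid(attn_map):
    """Return the (x, y) centroid of a 2-D attention map of shape (H, W)."""
    h, w = attn_map.shape
    attn = attn_map / (attn_map.sum() + 1e-8)          # normalize to a probability map
    ys = torch.arange(h, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(w, dtype=attn.dtype, device=attn.device)
    cy = (attn.sum(dim=1) * ys).sum()                  # expected row index
    cx = (attn.sum(dim=0) * xs).sum()                  # expected column index
    return torch.stack([cx, cy])

def centroid_loss(attn_map, target_xy):
    """Squared distance between the attention centroid and the desired location."""
    return torch.sum((attention_centroid(attn_map) - target_xy) ** 2)

def guidance_step(latents, attn_fn, target_xy, step_size=0.1):
    """One inference-time update: nudge the latents down the gradient of the loss.

    attn_fn is a placeholder for a differentiable hook that returns the object
    token's attention map given the current latents (an assumption of this sketch).
    """
    latents = latents.detach().requires_grad_(True)
    loss = centroid_loss(attn_fn(latents), target_xy)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()

# Toy check: a softmax over the latents themselves stands in for an attention map.
latents = torch.randn(16, 16)
attn_fn = lambda z: torch.softmax(z.flatten(), dim=0).reshape(16, 16)
latents = guidance_step(latents, attn_fn, target_xy=torch.tensor([4.0, 12.0]))

The key design point the sketch illustrates is that guidance happens purely at sampling time: the pretrained weights are untouched, and only the intermediate latents are adjusted using a loss defined on the attention maps.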
