Text-to-image (T2I) diffusion models have achieved remarkable image quality but still struggle to produce images that align with the compositional information in the input text prompt, especially when it comes to spatial cues. We attribute this limitation to two key factors: the scarcity of explicit, fine-grained spatial supervision in common training datasets, and the inability of the CLIP text encoder, used in the pretraining of Stable Diffusion models, to represent spatial semantics. While recent work has addressed object omission and attribute mismatches, accurately generating objects in the spatial locations specified by the text prompt remains an open challenge. Prior solutions typically rely on fine-tuning, which introduces computational overhead and risks degrading the pretrained model’s generative prior on tasks unrelated to spatial reasoning. In this paper, we introduce InfSplign, a simple, training-free method that improves spatial understanding in T2I diffusion models. InfSplign leverages attention maps and a centroid-based loss to guide object placement during sampling at inference time, without modifying the pretrained model. Our approach is modular, lightweight, and compatible with any pretrained diffusion model. InfSplign achieves strong performance on spatial benchmarks such as VISOR, T2I-CompBench, and GenEval, outperforming existing baselines in many scenarios.
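
The abstract does not spell out the exact formulation, but the core idea of guiding object placement with a centroid-based loss on attention maps can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names (`soft_centroid`, `centroid_loss`), the toy attention map derived from a random latent, and the single gradient step used as inference-time guidance are not InfSplign's actual implementation.

```python
import torch

def soft_centroid(attn_map: torch.Tensor) -> torch.Tensor:
    """Spatial expectation (x, y) of a 2D attention map, in normalized [0, 1] coordinates."""
    h, w = attn_map.shape
    probs = attn_map / (attn_map.sum() + 1e-8)            # normalize to a spatial distribution
    ys = torch.linspace(0.0, 1.0, h, device=attn_map.device)
    xs = torch.linspace(0.0, 1.0, w, device=attn_map.device)
    cy = (probs.sum(dim=1) * ys).sum()                     # expected row (y) position
    cx = (probs.sum(dim=0) * xs).sum()                     # expected column (x) position
    return torch.stack([cx, cy])

def centroid_loss(attn_map: torch.Tensor, target_xy: torch.Tensor) -> torch.Tensor:
    """Squared distance between the attention centroid and the target object location."""
    return ((soft_centroid(attn_map) - target_xy) ** 2).sum()

# Toy usage: one guidance step that nudges the latent so the object token's attention
# mass moves toward the location implied by the prompt (hypothetical sampling step).
latent = torch.randn(1, 4, 64, 64, requires_grad=True)     # stand-in for a diffusion latent
attn = torch.softmax(latent.mean(dim=1).flatten(), dim=0).reshape(64, 64)  # fake cross-attention map
target = torch.tensor([0.25, 0.5])                          # e.g. "on the left": x = 0.25, y = 0.5

loss = centroid_loss(attn, target)
loss.backward()

guidance_scale = 0.1                                        # assumed step size
with torch.no_grad():
    latent -= guidance_scale * latent.grad                  # gradient step applied during sampling
```

In a real sampler, the attention map would come from the cross-attention layers of the pretrained U-Net for the prompt token naming the object, and a step like the one above would be repeated across (a subset of) denoising steps, leaving the model weights untouched.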