Pixels · People · Places
Computer Vision and Image Embeddings for Perception-Aware Urban Analytics
F.O. Garrido Valenzuela (TU Delft - Transport and Logistics)
O. Cats – Promotor (TU Delft - Transport and Planning)
S. van Cranenburgh – Promotor (TU Delft - Transport and Logistics)
Abstract
Artificial intelligence, especially computer vision (CV), is reshaping how cities are studied and designed. Street-level imagery (SLI) carries multiple layers of urban information: infrastructure, design, vegetation, human activity, and beyond. Moreover, when paired with human input, these images also reveal how places are perceived. Over the past decade, many methods have either extracted physical components from images or predicted perceptions from those components. What remains uncommon, however, is a theory-guided, reproducible framework that coherently integrates both layers. Without such a framework, studies tend to describe what cities contain without explaining how they feel, which limits the attribution of perceptions to specific components, the transfer of insights across cities, and the inclusion of subjective dimensions in public decision-making. Here, an integrated framework means a workflow that (i) defines what images encode in terms of components and perceptual conditions; (ii) specifies procedures to extract each layer at city scale; (iii) identifies when and how to include human-in-the-loop feedback to safeguard perceptual validity and local contextual meaning; and (iv) links both layers to interpretable behavioral models that attribute effects to concrete components. This thesis develops, operationalizes, and demonstrates such a framework connecting pixels, people, and places.
The research unfolds through six interrelated studies. First, an image typology distinguishes physical components from perceptual conditions, providing a common vocabulary and operational criteria for image-based urban research. Two subsequent studies build models for large-scale component extraction: Where Are the People? assembles a pipeline to detect people and street elements in millions of images and relates them to morphological indicators; Street Embeddings learns transferable visual representations that recover functional and morphological street typologies without intensive labeling. To connect the physical and the perceptual dimensions, PixelSurvey offers a modular, open-source platform for image-based surveys (stated choice, similarity judgments, and Likert scales), standardizing stimulus control, randomization, and data export. Using these data, From Pixels to Perceptions trains a supervised embedding model with human similarity judgments to align visual representations with perceptual structure. Finally, Computer-Vision–Enriched Discrete Choice Models (CV–DCM) integrates image embeddings into random-utility models, linking visual attributes and perceptions to choices in an interpretable manner.
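The CV–DCM integration can be sketched in random-utility terms; the notation below is an illustrative sketch rather than the thesis's exact specification. The idea is that the systematic utility of an alternative is augmented with a term computed from the image embedding of the street scene associated with that alternative, so visual attributes enter the model alongside conventional attributes:

```latex
% Hypothetical sketch of a computer-vision-enriched utility function (CV-DCM).
% x_{nj}: conventional attributes of alternative j for decision-maker n
% e_{nj}: image embedding of the street-level image shown for alternative j
% \beta, w: parameters estimated jointly; \varepsilon_{nj}: random error term
U_{nj} = \beta^{\top} x_{nj} + w^{\top} e_{nj} + \varepsilon_{nj}
```

Under the usual i.i.d. extreme-value assumption on the error term, this yields standard logit choice probabilities, so the embedding-based term can be interpreted and appraised like any other attribute in the utility function.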
Taken together, the thesis shows that (a) SLI can be turned into structured data on urban-form components and conditions; (b) learned spatial representations recover meaningful, transferable typologies; (c) locally sourced, human-in-the-loop supervision improves the perceptual relevance of spatial embeddings; and (d) behavioral models can incorporate visual information to anticipate how the built environment may influence perceptions, preferences, and choices, thereby enabling ex ante appraisal of functional and experiential impacts. The work also offers policy guidance: image-based surveys can broaden participation, governance around visual data should be strengthened, and the framework provides a practical pathway for incorporating urban perceptions into the appraisal of urban-renewal projects.
Limitations and future directions are clear. Images can introduce important biases: uneven spatial coverage and licensing rules, for instance, limit transferability. Latent urban representations (embeddings) also remain difficult to interpret, calling for more transparent models that clarify what AI captures. Regarding perceptions, two caveats are central: (i) perception measures derived from images are local and cultural in scope and should not be generalized uncritically, and (ii) quantified perceptions are proxies of lived experience, not the experience itself. Acknowledging these limits and reinforcing governance, the thesis charts a path toward perception-aware urban analytics that is scientifically rigorous and socially useful.