F.O. Garrido Valenzuela
7 records found
Pixels · People · Places
Computer Vision and Image Embeddings for Perception-Aware Urban Analytics
The research unfolds through six interrelated studies. First, an image typology distinguishes physical components from perceptual conditions, providing a common vocabulary and operational criteria for image-based urban research. Two subsequent studies build models for large-scale component extraction: Where Are the People? assembles a pipeline to detect people and street elements in millions of images and relates them to morphological indicators; Street Embeddings learns transferable visual representations that recover functional and morphological street typologies without intensive labeling. To connect the physical and the perceptual dimensions, PixelSurvey offers a modular, open-source platform for image-based surveys (stated choice, similarity judgments, and Likert scales), standardizing stimulus control, randomization, and data export. Using these data, From Pixels to Perceptions trains a supervised embedding model with human similarity judgments to align visual representations with perceptual structure. Finally, Computer-Vision–Enriched Discrete Choice Models (CV–DCM) integrates image embeddings into random-utility models, linking visual attributes and perceptions to choices in an interpretable manner.
Taken together, the thesis shows that (a) street-level imagery (SLI) can be turned into structured data on urban form, covering both components and conditions; (b) learned spatial representations recover meaningful, transferable typologies; (c) locally sourced, human-in-the-loop supervision improves the perceptual relevance of spatial embeddings; and (d) behavioral models can incorporate visual information to anticipate how the built environment may influence perceptions, preferences, and choices. This enables ex ante appraisal of functional and experiential impacts. The work also offers policy guidance: use image-based surveys to broaden participation, strengthen governance around visual data, and provide a practical pathway for incorporating urban perceptions into the appraisal of urban-renewal projects.
Limitations and future directions are clear. Images can introduce important biases. For instance, uneven spatial coverage and licensing rules limit transferability. Also, latent urban representations (embeddings) remain difficult to interpret, calling for more transparent models that clarify what AI captures. Regarding perceptions, two caveats are central: (i) perception measures derived from images are local and cultural in scope and should not be generalized uncritically, and (ii) quantifications of perceptions are proxies of lived experience, not the experience itself. Acknowledging these limits and reinforcing governance, the thesis charts a path toward perception-aware urban analytics that is scientifically rigorous and socially useful.
This study sheds light on how utility derived from street-level conditions is spatially distributed, from a residential location choice perspective, at a city-wide scale. Unlike previous studies that analyse perceptions of urban environments from street-level imagery, this work maps preferences, that is, the utility residents derive from observable street-level conditions. To this end, we first develop a residential location discrete choice model that builds on two premises: (1) street-level images effectively capture street-level conditions, and (2) state-of-the-art segmentation models can extract salient information from these images and convert it into structured (i.e. tabular) data. We then apply the model to over 200,000 geo-tagged street-level images of Rotterdam (the Netherlands) to map how utility derived from street-level conditions varies across the city. Results show strong local variation, with conditions changing rapidly even within neighbourhoods, and reveal that high real-estate prices in the city centre cannot primarily be attributed to attractive street-level conditions. As a secondary methodological contribution, the paper integrates foundation segmentation models into discrete choice analysis. Unlike conventional segmentation approaches limited to predefined object classes, our pipeline leverages prompt-based detection (GroundingDINO + SAM) to identify novel and more granular categories (e.g. transformer houses, shrubs vs. trees) overlooked in standard datasets. This integration enables a richer, fine-grained quantification of street-level conditions and demonstrates how visual information can be systematically embedded into residential location choice models. As such, this paper's findings and methodological contribution pave the way for further studies to explore integrating street-level conditions in urban planning.
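As a rough illustration of the two premises above, the sketch below turns a segmentation mask into tabular features (the share of pixels per class) and scores each geo-tagged image with a linear utility. It is a minimal sketch under stated assumptions, not the paper's pipeline: the class names, the coefficients in `BETAS`, the toy masks and coordinates, and the helper names `class_shares` and `street_utility` are all hypothetical, and no GroundingDINO or SAM calls are made.

```python
from collections import Counter

def class_shares(mask, classes):
    """Convert a 2-D mask of class labels into per-class pixel shares."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {c: counts.get(c, 0) / total for c in classes}

def street_utility(shares, betas):
    """Linear-in-parameters utility of the street-level conditions in one image."""
    return sum(betas[c] * shares[c] for c in betas)

# Hypothetical classes and taste parameters (assumptions, not estimates from the study).
CLASSES = ["tree", "shrub", "car", "other"]
BETAS = {"tree": 1.2, "shrub": 0.4, "car": -0.9}

# Toy geo-tagged images; each mask stands in for the output of a segmentation model.
images = [
    {"lat": 51.92, "lon": 4.48, "mask": [["tree", "tree"], ["shrub", "other"]]},
    {"lat": 51.91, "lon": 4.47, "mask": [["car", "car"], ["car", "other"]]},
]

# One (lat, lon, utility) row per image, ready to be plotted on a map.
utility_map = [
    (img["lat"], img["lon"], street_utility(class_shares(img["mask"], CLASSES), BETAS))
    for img in images
]
```

In this toy setup the leafy image scores higher than the car-dominated one. Mapping such per-image scores over many locations is one way to picture how utility could vary within a neighbourhood.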
From pixels to perceptions
Using human similarity judgments to enrich urban space embeddings
We present a method to classify street networks using only geo-tagged street-level imagery. By combining pre-trained image embeddings with unsupervised clustering, it produces visually coherent street typologies without supervised training or labeled data and requires only minimal data curation. The approach is lightweight, scalable, and, in principle, transferable across urban contexts. In a Delft (Netherlands) case study, we classify approximately 2,000 road sections using over 70,000 images. Our method recovers distinct street types such as residential, arterial, and historic ones. These results show that pre-trained visual embeddings alone can support effective street classification from visual inputs, offering a practical tool for urban planning, transport analysis, and mobility research.
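The idea of combining pre-trained embeddings with unsupervised clustering can be sketched in a few lines. The abstract does not name the clustering algorithm or how images are aggregated per road section, so this example assumes mean-pooling of image embeddings per section followed by plain k-means. The two-dimensional toy vectors stand in for real embeddings, and the section IDs and function names are hypothetical.

```python
import random
from statistics import fmean

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [fmean(col) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means (assumed here; the source does not specify the algorithm)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(mean_vector(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy image embeddings grouped by road section (hypothetical values).
section_images = {
    "s1": [[0.9, 0.1], [1.0, 0.0]],
    "s2": [[0.8, 0.2], [0.9, 0.2]],
    "s3": [[0.1, 0.9], [0.0, 1.0]],
    "s4": [[0.2, 0.8], [0.1, 0.8]],
}

# Mean-pool images per section, then cluster the section-level vectors.
section_ids = sorted(section_images)
section_vectors = [mean_vector(section_images[s]) for s in section_ids]
labels, _ = kmeans(section_vectors, k=2)
typology = dict(zip(section_ids, labels))
```

The resulting `typology` maps each road section to a cluster index. Interpreting clusters as street types such as residential or arterial would still require visual inspection.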
Visual imagery is indispensable to many multi-attribute decision situations. Examples of such decision situations in travel behaviour research include residential location choices, vehicle choices, tourist destination choices, and various safety-related choices. However, current discrete choice models cannot handle image data algorithmically and thus cannot incorporate information embedded in images into their representations of choice behaviour. This gap between discrete choice models’ capabilities and the real-world behaviour they seek to model leads to incomplete and, possibly, misleading outcomes. To close this gap, this study proposes “Computer Vision-enriched Discrete Choice Models” (CV-DCMs). CV-DCMs can handle choice tasks involving numeric attributes and images by integrating computer vision and traditional discrete choice models. Moreover, because CV-DCMs are grounded in random utility maximisation principles, they maintain the solid behavioural foundation of traditional discrete choice models. We demonstrate the proposed CV-DCM by applying it to data obtained through a novel stated choice experiment involving residential location choices. In this experiment, respondents faced choice tasks with trade-offs between commute time, monthly housing cost and street-level conditions, presented using images. We find that CV-DCMs can offer novel insights into preferences regarding features presented in images, such as what street-level conditions people find most and least attractive and how these preferences vary across age groups.
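To make the combination of numeric attributes and images concrete, the sketch below computes multinomial logit choice probabilities from a utility that adds a numeric part (commute time, housing cost) to a visual part derived from an image embedding. This is a simplified illustration under assumptions, not the estimated CV-DCM: the embedding is fixed and mapped linearly to utility, whereas the abstract describes integrating a computer vision model. All parameter values, attribute names, and function names are hypothetical.

```python
import math

def utility(alt, beta_time, beta_cost, image_weights):
    """Systematic utility: numeric attributes plus a linear map of the image embedding."""
    numeric = beta_time * alt["commute_min"] + beta_cost * alt["cost_eur"]
    visual = sum(w * z for w, z in zip(image_weights, alt["image_embedding"]))
    return numeric + visual

def choice_probabilities(alternatives, **params):
    """Multinomial logit probabilities, as implied by random utility maximisation
    with i.i.d. extreme-value errors."""
    v = [utility(a, **params) for a in alternatives]
    v_max = max(v)  # subtract the maximum for numerical stability
    exp_v = [math.exp(x - v_max) for x in v]
    total = sum(exp_v)
    return [e / total for e in exp_v]

# Hypothetical taste parameters (not estimates from the study).
params = {"beta_time": -0.05, "beta_cost": -0.002, "image_weights": [0.8, -0.5]}

# Two alternatives with identical commute time and cost but different street images.
alternatives = [
    {"commute_min": 20, "cost_eur": 1000, "image_embedding": [1.0, 0.0]},
    {"commute_min": 20, "cost_eur": 1000, "image_embedding": [0.0, 1.0]},
]

probs = choice_probabilities(alternatives, **params)
```

Because the numeric attributes are equal here, any difference between the two probabilities comes from the image term alone, which mirrors the kind of trade-off the stated choice experiment presents to respondents.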