F.O. Garrido Valenzuela | TU Delft Repository

Pixels · People · Places

Computer Vision and Image Embeddings for Perception-Aware Urban Analytics

Doctoral thesis (2026) - F.O. Garrido Valenzuela, O. Cats, S. van Cranenburgh

Artificial intelligence, especially computer vision (CV), is reshaping how cities are studied and designed. Street-level imagery (SLI) carries multiple layers of urban information: infrastructure, design, vegetation, human activity, and beyond. Moreover, when paired with human input, these images also reveal how places are perceived. Over the past decade, many methods have either extracted physical components from images or predicted perceptions from those components. What remains uncommon, however, is a theory-guided, reproducible framework that coherently integrates both layers. Without such a framework, studies tend to describe what cities contain without explaining how they feel, which limits attribution of perceptions to specific components, the transfer of insights across cities, and the inclusion of subjective dimensions in public decision-making. Here, an integrated framework means a workflow that (i) defines what images encode in terms of components and perceptual conditions; (ii) specifies procedures to extract each layer at city scale; (iii) identifies when and how to include human-in-the-loop feedback to safeguard perceptual validity and local context meaning; and (iv) links both layers to interpretable behavioral models that attribute effects to concrete components. This thesis develops, operationalizes, and demonstrates such a framework connecting pixels, people, and places.

The research unfolds through six interrelated studies. First, an image typology distinguishes physical components from perceptual conditions, providing a common vocabulary and operational criteria for image-based urban research. Two subsequent studies build models for large-scale component extraction: Where Are the People? assembles a pipeline to detect people and street elements in millions of images and relates them to morphological indicators; Street Embeddings learns transferable visual representations that recover functional and morphological street typologies without intensive labeling. To connect the physical and the perceptual dimensions, PixelSurvey offers a modular, open-source platform for image-based surveys (stated choice, similarity judgments, and Likert scales), standardizing stimulus control, randomization, and data export. Using these data, From Pixels to Perceptions trains a supervised embedding model with human similarity judgments to align visual representations with perceptual structure. Finally, Computer-Vision–Enriched Discrete Choice Models (CV–DCM) integrates image embeddings into random-utility models, linking visual attributes and perceptions to choices in an interpretable manner.

Taken together, the thesis shows that (a) SLI can be turned into structured data on urban form, covering both components and conditions; (b) learned spatial representations recover meaningful, transferable typologies; (c) locally sourced, human-in-the-loop supervision improves the perceptual relevance of spatial embeddings; and (d) behavioral models can incorporate visual information to anticipate how the built environment may influence perceptions, preferences, and choices, thereby enabling ex ante appraisal of functional and experiential impacts. The work also offers policy guidance: use image-based surveys to broaden participation, strengthen governance around visual data, and provide a practical pathway for incorporating urban perceptions into the appraisal of urban-renewal projects.

Limitations and future directions are clear. Images can introduce important biases. For instance, uneven spatial coverage and licensing rules limit transferability. In addition, latent urban representations (embeddings) remain difficult to interpret, calling for more transparent models that clarify what AI captures. Regarding perceptions, two caveats are central: (i) perception measures derived from images are local and cultural in scope and should not be generalized uncritically, and (ii) quantifications of perceptions are proxies of lived experience, not the experience itself. Acknowledging these limits and reinforcing governance, the thesis charts a path toward perception-aware urban analytics that is scientifically rigorous and socially useful.

A utility-based spatial analysis of residential street-level conditions: a case study of Rotterdam

Journal article (2026) - Sander van Cranenburgh, Francisco Garrido-Valenzuela

This study sheds light on how utility derived from street-level conditions is spatially distributed, from a residential location choice perspective, at a city-wide scale. Unlike previous studies that analyse perceptions of urban environments from street-level imagery, this work maps preferences, that is, the utility residents derive from observable street-level conditions. To this end, we first develop a residential location discrete choice model that builds on two premises: (1) street-level images effectively capture street-level conditions, and (2) state-of-the-art segmentation models can extract salient information from these images and convert it into structured (i.e. tabular) data. We then apply the model to over 200,000 geo-tagged street-level images of Rotterdam (the Netherlands) to map how utility derived from street-level conditions varies across the city. Results show strong local variation, with conditions changing rapidly even within neighbourhoods, and reveal that high real-estate prices in the city centre cannot primarily be attributed to attractive street-level conditions. As a secondary methodological contribution, the paper integrates foundation segmentation models into discrete choice analysis. Unlike conventional segmentation approaches limited to predefined object classes, our pipeline leverages prompt-based detection (GroundingDINO + SAM) to identify novel and more granular categories (e.g. transformer houses, shrubs vs. trees) overlooked in standard datasets. This integration enables a richer, fine-grained quantification of street-level conditions and demonstrates how visual information can be systematically embedded into residential location choice models. As such, this paper's findings and methodological contribution pave the way for further studies to explore integrating street-level conditions in urban planning.

Computer vision-enriched discrete choice models, with an application to residential location choice

Journal article (2025) - Sander van Cranenburgh, Francisco Garrido-Valenzuela

Visual imagery is indispensable to many multi-attribute decision situations. Examples of such decision situations in travel behaviour research include residential location choices, vehicle choices, tourist destination choices, and various safety-related choices. However, current discrete choice models cannot handle image data algorithmically and thus cannot incorporate information embedded in images into their representations of choice behaviour. This gap between discrete choice models’ capabilities and the real-world behaviour they seek to model leads to incomplete and, possibly, misleading outcomes. To bridge this gap, this study proposes “Computer Vision-enriched Discrete Choice Models” (CV-DCMs). CV-DCMs can handle choice tasks involving numeric attributes and images by integrating computer vision and traditional discrete choice models. Moreover, because CV-DCMs are grounded in random utility maximisation principles, they maintain the solid behavioural foundation of traditional discrete choice models. We demonstrate the proposed CV-DCM by applying it to data obtained through a novel stated choice experiment involving residential location choices. In this experiment, respondents faced choice tasks with trade-offs between commute time, monthly housing cost and street-level conditions, presented using images. We find that CV-DCMs can offer novel insights into preferences regarding features presented in images, such as what street-level conditions people find most and least attractive and how these preferences vary across age groups.
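The idea of combining numeric attributes and an image within one random-utility model can be illustrated with a minimal sketch. It assumes, purely for illustration, that each image has already been reduced to a short embedding vector and that this embedding enters utility linearly; the abstract does not describe the actual network architecture or estimation procedure, and every name and parameter value below is hypothetical.

```python
import math

def utility(cost, time, embedding, beta_cost, beta_time, w):
    """Systematic utility: linear numeric part plus a linear image part.

    `embedding` stands in for the output of a computer vision model;
    `w` maps that embedding to utility (an assumption of this sketch).
    """
    image_part = sum(wi * zi for wi, zi in zip(w, embedding))
    return beta_cost * cost + beta_time * time + image_part

def choice_probabilities(utilities):
    """Multinomial logit probabilities (softmax, shifted for numerical stability)."""
    m = max(utilities)
    exps = [math.exp(v - m) for v in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters and alternatives (not estimates from the paper).
beta_cost, beta_time = -0.004, -0.05   # per euro of monthly cost, per minute of commute
w = [0.8, -0.3]                        # weights on a 2-dimensional image embedding

alt_a = utility(1200, 30, [0.9, 0.1], beta_cost, beta_time, w)
alt_b = utility(1000, 45, [0.2, 0.7], beta_cost, beta_time, w)
probs = choice_probabilities([alt_a, alt_b])
```

Because the image part is additive in utility, its contribution can be compared directly with the cost and time terms, which is one way to read the trade-offs the abstract mentions.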

From pixels to perceptions

Using human similarity judgments to enrich urban space embeddings

Journal article (2025) - F.O. Garrido Valenzuela, O. Cats, S. van Cranenburgh

This research introduces a new method for constructing and training an Urban Space Embedding Model (USEM) by integrating human perceptions and street-level images (SLI) into its formulation. Traditional urban embedding models often overlook subjective human experiences, such as perceptions of safety or attractiveness. To address this gap, our method leverages similarity judgments from over 1,500 participants, who compared different urban spaces based on SLI. These human judgments were then used as a supervision signal in training the USEM, allowing the model to capture both visual and perceptual information about urban spaces. The method is implemented across the Netherlands, using around one million geo-tagged SLI, and demonstrated in Rotterdam. This approach represents a significant advancement in urban computing by incorporating human-centered data into urban modeling. It offers new opportunities for city planners and policymakers to better understand how urban spaces are perceived and to consider these perceptions in efforts to design more livable and inclusive environments.

An image embedding-based approach for classifying street networks

Conference paper (2025) - Francisco Garrido-Valenzuela, Max Lange, Juan C. Herrera, Sander Van Cranenburgh, Oded Cats

We present a method to classify street networks using only geo-tagged street-level imagery. By combining pre-trained image embeddings with unsupervised clustering, the method produces visually coherent street typologies without supervised training or labeled data and requires only minimal data curation. The approach is lightweight, scalable, and, in principle, transferable across urban contexts. In a Delft (Netherlands) case study, we classify approximately 2,000 road sections using over 70,000 images. Our method recovers distinct street types such as residential, arterial, and historic ones. These results show that pre-trained visual embeddings alone can support effective street classification from visual inputs, offering a practical tool for urban planning, transport analysis, and mobility research.
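The combination of image embeddings and unsupervised clustering can be sketched as follows. The sketch assumes that each road section is summarised by the mean of its image embeddings and that k-means is the clustering step; the abstract names neither choice, so both are illustrative assumptions, and the two-dimensional vectors are toy stand-ins for real pre-trained embeddings.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels

# Hypothetical image embeddings grouped by road section.
section_images = {
    "s1": [[0.9, 0.1], [1.0, 0.0]],
    "s2": [[0.1, 0.9], [0.0, 1.0]],
    "s3": [[0.8, 0.2]],
    "s4": [[0.2, 0.8]],
}
section_ids = sorted(section_images)
section_vectors = [mean_vector(section_images[s]) for s in section_ids]
labels = kmeans(section_vectors, k=2)
typology = dict(zip(section_ids, labels))  # road section -> cluster id
```

The cluster ids carry no meaning by themselves; names such as residential or arterial would be assigned afterwards by inspecting the images in each cluster.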

Where are the people? Counting people in millions of street-level images to explore associations between people's urban density and urban characteristics

Journal article (2023) - Francisco Garrido-Valenzuela, Oded Cats, Sander van Cranenburgh

A thorough understanding of how urban space characteristics, such as urban equipment or network topology, affect people's density in urban spaces is essential to well-informed urban policy making. Hitherto, studies have primarily examined how the characteristics of the urban space impact the number of people visiting different parts of the urban area (e.g., the city center). However, these studies almost without exception have used relatively small data sets, targeting specific neighborhoods or places. As a result, their findings are confined to specific areas and it is unclear to what extent they generalize to other urban areas. This study addresses this gap. We propose a new computer vision-based method to study how the urban space is associated with people's density in outdoor urban spaces. Specifically, our method uses a pre-trained object detection model to identify and count people as well as urban-related objects, such as cars and benches, in millions of street-level images collected throughout the Netherlands. Importantly, each street-level image is geo-located. Therefore, the location of each detected person and object is known. In turn, we regress the number of people identified, as a proxy for density in urban spaces, on urban space characteristics and urban-related objects. Our results show that higher numbers of people tend to be observed in places with smaller blocks, suggesting that compact urban development may be an effective way to increase people's density. Moreover, we find that the presence of food places and bicycles is associated with more people, indicating that urban planners could study the location of these amenities to attract more visitors to urban spaces and explore the causal effects in this relationship. Our methodology offers a complementary way to monitor how the urban space is used over time and to assess the effectiveness of urban interventions and policies.

Bayesian Route Choice Inference to Address Missed Bluetooth Detections

Journal article (2020) - F.O. Garrido Valenzuela, Sebastián Raveau, Juan C. Herrera

By installing wireless sensors such as Bluetooth or Wi-Fi at a specific set of intersections and/or roadways it is possible to detect the passage of vehicles equipped with these technologies. However, Wi-Fi or Bluetooth-equipped vehicles are not necessarily detected by every sensor they pass by, as the detection probability depends on several factors, such as weather, nearby infrastructure, or the vehicle’s speed. To address this lack of perfect information, we propose a methodology to infer the most likely route used by a vehicle between two successive detections. The methodology consists of three stages. The first stage entails constructing a graph of the road network and the location of the sensors. The second stage consists of using the wireless data to calibrate the distribution of dwell time at each node and travel time for each link of the graph defined in the first stage. The third and final stage consists of convolving node and link time distributions between successive detections to obtain an aggregate time distribution for each potential route. Bayesian inference is then applied, based on the travel time observed for each vehicle and the number of missed detections, to determine the probability of each alternative route. The methodology is tested through microsimulations, showing a prediction performance of over 90% in the most favorable scenarios tested. When compared to a benchmark approach to infer routes, the proposed methodology provides better results when the network’s sensor density is low and the available data are limited.
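The third stage and the Bayesian step can be sketched in a few lines. The sketch rests on several assumptions that the abstract does not state: time is discretised into integer bins (list index = minutes), every sensor has the same constant detection probability, misses are independent, and route priors are given. All names and numbers are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete times."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def route_time_pmf(component_pmfs):
    """Convolve all node and link time distributions along a route."""
    total = [1.0]  # point mass at zero time
    for pmf in component_pmfs:
        total = convolve(total, pmf)
    return total

def route_posterior(routes, observed_time, detection_prob):
    """Posterior route probabilities given travel time and missed sensors."""
    scores = {}
    for name, route in routes.items():
        pmf = route_time_pmf(route["pmfs"])
        time_lik = pmf[observed_time] if observed_time < len(pmf) else 0.0
        # Assumed: each intermediate sensor on the route was missed independently.
        miss_lik = (1.0 - detection_prob) ** route["sensors_passed"]
        scores[name] = route["prior"] * time_lik * miss_lik
    total = sum(scores.values())
    if total == 0:
        return scores  # observed time is impossible under every route
    return {name: s / total for name, s in scores.items()}

# Two hypothetical routes between successive detections.
routes = {
    "A": {"pmfs": [[0, 0.5, 0.5], [0, 0.5, 0.5]], "sensors_passed": 1, "prior": 0.5},
    "B": {"pmfs": [[0, 0, 0.5, 0.5], [0, 0.5, 0.5]], "sensors_passed": 0, "prior": 0.5},
}
posterior = route_posterior(routes, observed_time=4, detection_prob=0.8)
```

In this toy case route B is favoured for two reasons: a 4-minute trip is more typical of B, and choosing A would require an undetected pass by an intermediate sensor.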