S. Sharifi Noorian | TU Delft Repository

From recognition to understanding: enriching visual models through multi-modal semantic integration

Doctoral thesis (2025) - S. Sharifi Noorian, G.J.P.M. Houben, A. Bozzon, J. Yang

This thesis addresses the semantic gap in visual understanding, improving visual models with semantic reasoning capabilities so they can handle tasks like image captioning, question-answering, and scene understanding. The main focus is on integrating visual and textual data, leveraging human cognitive insights, and developing a robust multi-modal foundation model. The research starts with the exploration of multi-modal data integration to enhance semantic and contextual reasoning in fine-grained scene recognition. The proposed multi-modal models, which combine visual and textual inputs, outperform traditional models that rely solely on visuals. This is particularly true in complex urban environments where visual ambiguities often occur. This method emphasizes the significance of semantic enrichment through multi-modal integration, which helps resolve visual ambiguities and improve scene understanding. ...

Perspective

Leveraging Human Understanding for Identifying and Characterizing Image Atypicality

Conference paper (2023) - Shahin Sharifi Noorian, Sihang Qiu, Burcu Sayin, Agathe Balayn, Ujwal Gadiraju, Jie Yang, Alessandro Bozzon

High-quality data plays a vital role in developing reliable image classification models. Despite that, what makes an image difficult to classify remains an unstudied topic. This paper provides a first-of-its-kind, model-agnostic characterization of image atypicality based on human understanding. We consider the setting of image classification "in the wild", where a large number of unlabeled images are accessible, and introduce a scalable and effective human computation approach for proactive identification and characterization of atypical images. Our approach consists of i) an image atypicality identification and characterization task that presents to the human worker both a local view of visually similar images and a global view of images from the class of interest and ii) an automatic image sampling method that selects a diverse set of atypical images based on both visual and semantic features. We demonstrate the effectiveness and cost-efficiency of our approach through controlled crowdsourcing experiments and provide a characterization of image atypicality based on human annotations of 10K images. We showcase the utility of the identified atypical images by testing state-of-the-art image classification services against such images and provide an in-depth comparative analysis of the alignment between human- and machine-perceived image atypicality. Our findings have important implications for developing and deploying reliable image classification systems. ...

Is it safe to be attractive?

Disentangling the influence of streetscape features on the perceived safety and attractiveness of city streets

Conference paper (2023) - V. Milias, S. Sharifi Noorian, A. Bozzon, A. Psyllidis

City streets that feel safe and attractive motivate active travel behaviour and promote people’s well-being. However, determining what makes a street safe and attractive is a challenging task because subjective qualities of the streetscape are difficult to quantify. Existing evidence typically focuses on how different street features influence perceived safety or attractiveness, but little is known about what influences both. To fill this knowledge gap, we developed a crowdsourcing tool and conducted a study with 403 participants, who were asked to virtually navigate city streets in Frankfurt, Germany, through a sequence of street-level images, rate locations based on perceived safety and attractiveness, and explain their ratings. Our results contribute new insights regarding the key similarities and differences between the factors influencing perceived safety and attractiveness. We show that the presence of human activity is strongly related to perceived safety, whereas attractiveness is influenced primarily by aesthetic qualities, as well as the number and type of amenities along a street. Moreover, we demonstrate that the presence of construction sites and underpasses has a disproportionately negative impact on perceived safety and attractiveness, outweighing the influence of any other features. We use the results to make evidence-informed recommendations for designing safer and more attractive streets that encourage active travel modes and promote well-being. ...

What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition

Conference paper (2022) - Shahin Sharifi Noorian, Sihang Qiu, Ujwal Gadiraju, Jie Yang, Alessandro Bozzon

Unknown unknowns represent a major challenge in reliable image recognition. Existing methods mainly focus on unknown unknowns identification, leveraging human intelligence to gather images that are potentially difficult for the machine. To drive a deeper understanding of unknown unknowns and more effective identification and treatment, this paper focuses on unknown unknowns characterization. We introduce a human-in-the-loop, semantic analysis framework for characterizing unknown unknowns at scale. We engage humans in two tasks that specify what a machine should know and describe what it really knows, respectively, both at the conceptual level, supported by information extraction and machine learning interpretability methods. Data partitioning and sampling techniques are employed to scale out human contributions in handling large data. Through extensive experimentation on scene recognition tasks, we show that our approach provides a rich, descriptive characterization of unknown unknowns and allows for more effective and cost-efficient detection than the state of the art. ...

Detecting, classifying, and mapping retail storefronts using street-level imagery

Conference paper (2020) - Shahin Sharifi Noorian, Sihang Qiu, Achilleas Psyllidis, Alessandro Bozzon, Geert Jan Houben

Up-to-date listings of retail stores and related building functions are challenging and costly to maintain. We introduce a novel method for automatically detecting, geo-locating, and classifying retail stores and related commercial functions, on the basis of storefronts extracted from street-level imagery. Specifically, we present a deep learning approach that takes storefronts from street-level imagery as input, and directly provides the geo-location and type of commercial function as output. Our method showed a recall of 89.05% and a precision of 88.22% on a real-world dataset of street-level images, demonstrating experimentally that our approach achieves human-level accuracy with remarkable run-time efficiency compared to methods such as Faster Region-based Convolutional Neural Networks (Faster R-CNN) and Single Shot Detector (SSD). ...

ST-Sem

A Multimodal Method for Points-of-Interest Classification Using Street-Level Imagery

Conference paper (2019) - Shahin Sharifi Noorian, Achilleas Psyllidis, Alessandro Bozzon

Street-level imagery contains a variety of visual information about the facades of Points of Interest (POIs). In addition to general morphological features, signs on the facades of, primarily, business-related POIs could be a valuable source of information about the type and identity of a POI. Recent advancements in computer vision could leverage visual information from street-level imagery, and contribute to the classification of POIs. However, there is currently a gap in existing literature regarding the use of visual labels contained in street-level imagery, where their value as indicators of POI categories is assessed. This paper presents Scene-Text Semantics (ST-Sem), a novel method that leverages visual labels (e.g., texts, logos) from street-level imagery as complementary information for the categorization of business-related POIs. Contrary to existing methods that fuse visual and textual information at the feature level, we propose a late fusion approach that combines visual and textual cues after resolving issues of incorrect digitization and semantic ambiguity of the retrieved textual components. Experiments on two existing datasets and one newly created dataset show that ST-Sem can outperform visual-only approaches by 80% and related multimodal approaches by 4%. ...
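As a rough illustration of what late fusion means in general, the sketch below combines the class probabilities of two independent classifiers only after each has made its own prediction. It is not the ST-Sem implementation: the function name `late_fusion`, the weighted-average rule, the `text_weight` parameter, and the toy scores are all assumptions made for this example.

```python
def late_fusion(visual_probs, text_probs, text_weight=0.5):
    """Fuse per-class probabilities from a visual and a textual classifier.

    Hypothetical helper: a weighted average is one common late-fusion rule,
    assumed here for illustration only.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    if not text_probs:
        # No readable sign was found: fall back to the visual prediction.
        return dict(visual_probs)
    fused = {
        label: (1.0 - text_weight) * score
        + text_weight * text_probs.get(label, 0.0)
        for label, score in visual_probs.items()
    }
    total = sum(fused.values())
    if total == 0.0:
        return dict(visual_probs)
    # Renormalize so the fused scores sum to one.
    return {label: score / total for label, score in fused.items()}


# Toy scores (invented): the facade alone is ambiguous, the sign text is not.
visual = {"bakery": 0.40, "pharmacy": 0.45, "cafe": 0.15}
text = {"bakery": 0.90, "pharmacy": 0.05, "cafe": 0.05}

fused = late_fusion(visual, text)
best = max(fused, key=fused.get)
print(best, fused)
```

In this toy case the visual classifier slightly prefers "pharmacy", while the textual cue shifts the fused decision to "bakery". Feature-level (early) fusion would instead concatenate the two representations before a single classifier sees them.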

A time-varying p-median model for location-allocation analysis

Conference paper (2018) - Shahin Sharifi Noorian, Achilleas Psyllidis, Alessandro Bozzon

Location models have traditionally played an important role in suggesting sites for the placement of facilities, so that efficient service delivery is ensured. A common formulation of several location models is associated with the p-median problem, which aims to minimize the travel distance between support facilities and demand in a region. However, the influence of external conditions, such as traffic, on travel time is largely ignored. In this paper, we present a time-varying approach to the classical p-median problem, which accounts for fluctuations in travel cost at different time intervals. Using Google Traffic data to retrieve traffic information and Foursquare data to estimate demand in a region, and employing an adaptive genetic algorithm in a planning application in the Netherlands, we show that our proposed model outperforms the classical p-median formulation by serving demand nodes more travel-efficiently. Moreover, we achieve better placement of support facilities across major street arteries. The paper concludes with a discussion of the associated uncertainties that should be recognized before the modeling results are viewed as suggestions for implementation in planning and policy making. ...
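To make the difference between the classical and a time-varying p-median objective concrete, here is a minimal sketch. It assumes the time-varying objective is the demand-weighted cost to the cheapest open facility, summed over time intervals; the paper's exact formulation may differ. It also swaps the paper's adaptive genetic algorithm for an exhaustive search, which is only feasible on tiny instances. The names `total_cost` and `solve_p_median` and all numbers are invented for illustration.

```python
from itertools import combinations


def total_cost(sites, demand, cost_by_interval):
    """Demand-weighted cost of serving each node from its cheapest open site,
    summed over all time intervals (assumed objective)."""
    return sum(
        demand[i] * min(cost[i][j] for j in sites)
        for cost in cost_by_interval
        for i in range(len(demand))
    )


def solve_p_median(p, demand, cost_by_interval):
    """Exhaustive search over all p-subsets of candidate sites (toy sizes only)."""
    n_sites = len(cost_by_interval[0][0])
    return min(
        combinations(range(n_sites), p),
        key=lambda sites: total_cost(sites, demand, cost_by_interval),
    )


# Toy instance: cost[i][j] is the travel cost from demand node i to site j.
demand = [10, 5, 8]
off_peak = [[1, 4, 6], [4, 1, 3], [6, 3, 1]]
rush_hour = [[1, 9, 6], [4, 3, 3], [6, 8, 1]]  # routes to site 1 are congested

classical = solve_p_median(1, demand, [off_peak])
time_varying = solve_p_median(1, demand, [off_peak, rush_hour])
print(classical, time_varying)
```

With a single off-peak cost matrix the search selects site 1, but once the congested interval is included the preferred location moves to site 0, which is the kind of shift a time-varying formulation is meant to capture.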