G.J.P.M. Houben | TU Delft Repository

Foreword from the Program Chairs

Foreword postscript (2023) - Lora Aroyo, Carlos Castillo, Geert Jan Houben

Meaningful human control: actionable properties for AI system development

Journal article (2022) - L. Cavalcante Siebert, M.L. Lupetti, M.J. van den Hoven, D. Forster, R.L. Lagendijk, E. Aizenberg, N.W.M. Beckers, A. Zgonnikov, H.M. Veluwenkamp, D.A. Abbink, Elisa Giaccardi, G.J.P.M. Houben, C.M. Jonker

How can humans remain in control of artificial intelligence (AI)-based systems designed to perform tasks autonomously? Such systems are increasingly ubiquitous, creating benefits - but also undesirable situations where moral responsibility for their actions cannot be properly attributed to any particular person or group. The concept of meaningful human control has been proposed to address responsibility gaps and mitigate them by establishing conditions that enable a proper attribution of responsibility to humans; however, clear requirements for researchers, designers, and engineers do not yet exist, making the development of AI-based systems that remain under meaningful human control challenging. In this paper, we address the gap between philosophical theory and engineering practice by identifying, through an iterative process of abductive thinking, four actionable properties for AI-based systems under meaningful human control, which we discuss using two application scenarios: automated vehicles and AI-based hiring. First, a system in which humans and AI algorithms interact should have an explicitly defined domain of morally loaded situations within which the system ought to operate. Second, humans and AI agents within the system should have appropriate and mutually compatible representations. Third, responsibility attributed to a human should be commensurate with that human’s ability and authority to control the system. Fourth, there should be explicit links between the actions of the AI agents and actions of humans who are aware of their moral responsibility. We argue that these four properties will support practically minded professionals to take concrete steps toward designing and engineering for AI systems that facilitate meaningful human control. ...

Interactive Data Discovery in Data Lakes

Conference paper (2021) - A. Ionescu, A Katsifodimos, G.J.P.M. Houben

As data is produced at an unprecedented rate, the need and expectation to make it easily available for the end-users is growing. Dataset Discovery has become an important subject in the data management community, as it represents the means of providing the data to the user and fulfilling an information need. Since the end-user is the one that needs the information and knows what type of information to look for, little has been done to involve the user in the discovery process. This PhD project addresses the topic of interactive data discovery, where the user’s interests are modelled through interactions and used as a context for the discovery process. We aim to develop a system that addresses the problem of minimising the trade-off between efficiency and effectiveness, thus providing accurate results in an interactive fashion. The innovative part of the system consists of extracting the user’s interests and data needs through interactions and using them to enrich the data context and provide tailored results to the user. We describe the steps to create models and methods that would be used in designing the prototype and we relate to previous systems and neighbouring communities for optimising the system. ...

Integrating Massive Data Streams

Conference paper (2021) - G. Siachamis, G.J.P.M. Houben, A. van Deursen, A Katsifodimos

Data Integration has been a long-standing and challenging problem for enterprises and researchers. Data residing in multiple heterogeneous sources must be integrated and prepared such that the valuable information that it carries can be extracted and analysed. However, the volume and the velocity of the produced data in addition to the modern business needs for real-time results have pushed data analytics, and therefore data integration, towards data streams. While data integration is a hard problem in and of itself, integrating data streams becomes even more challenging. Streams are characterized by their high velocity, infinite nature and predisposition to concept drift.

The goal of this doctoral work is to design and provide scalable methods to support data integration tasks on massive data streams, i.e., support streaming data integration. The aim of this work is threefold. First, we aim at developing and proposing streaming methods to compute temporal stream data-profiles and summaries that can describe the dynamic state of a stream in the course of time. Second, we aim at developing methods and metrics of stream similarity. Those methods and metrics can serve as means to detect similar or complementary streams in a streaming data lake. Finally, we aim at optimizing distributed streaming similarity joins - a very important operation that precedes entity linking and resolution. This paper discusses exciting challenges and open problems in the field, and a research plan on tackling them. ...

Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems

Journal article (2021) - A.M.A. Balayn, C. Lofi, G.J.P.M. Houben

The increasing use of data-driven decision support systems in industry and governments is accompanied by the discovery of a plethora of bias and unfairness issues in the outputs of these systems. Multiple computer science communities, and especially machine learning, have started to tackle this problem, often developing algorithmic solutions to mitigate biases to obtain fairer outputs. However, one of the core underlying causes for unfairness is bias in training data, which is not fully covered by such approaches. In particular, bias in data is not yet a central topic in data engineering and management research. We survey research on bias and unfairness in several computer science domains, distinguishing between data management publications and other domains. This covers the creation of fairness metrics, fairness identification, and mitigation methods, software engineering approaches and biases in crowdsourcing activities. We identify relevant research gaps and show which data management activities could be repurposed to handle biases and which ones might reinforce such biases. In the second part, we argue for a novel data-centered approach overcoming the limitations of current algorithmic-centered methods. This approach focuses on eliciting and enforcing fairness requirements and constraints on data that systems are trained, validated, and used on. We argue for the need to extend database management systems to handle such constraints and mitigation methods. We discuss the associated future research directions regarding algorithms, formalization, modelling, users, and systems. ...

Complex event processing on real-time video streams

Journal article (2020) - Ziyu Li, Asterios Katsifodimos, Alessandro Bozzon, Geert Jan Houben

Cameras are ubiquitous nowadays and video analytic systems have been widely used in surveillance, traffic control, business intelligence and autonomous driving. Some applications, e.g., detecting road congestion in traffic monitoring, require continuous and timely reporting of complex patterns. However, conventional complex event processing (CEP) systems fail to support video processing, while the existing video query languages offer limited support for expressing advanced CEP queries, such as iteration and windows. In this PhD research, we aim to develop systems and methods to alleviate these issues. In this paper, we first identify the need for an expressive CEP language which allows users to define queries over video streams, and receive fast, accurate results. To evaluate CEP queries on videos in real-time and with high accuracy, we explain how a streaming query engine can be designed to provide native support of machine learning (ML) models for fast and accurate inference on video streams. In addition, we describe a set of optimization problems that arise when ML models, with trade-offs in speed, accuracy, and cost, are part of a query plan. Finally, we describe how query plans on real-time videos can be optimized and deployed on edge devices with limited computational and network capabilities. ...

Analyzing Workers Performance in Online Mapping Tasks Across Web, Mobile, and Virtual Reality Platforms

Conference paper (2020) - G.A. van Alphen, S. Qiu, A. Bozzon, G.J.P.M. Houben

In online crowd mapping, crowd workers recruited through crowdsourcing marketplaces collect geographic data. Compared to traditional mapping methods, where workers physically explore the area, the benefit of using online crowd mapping is the potential to be cost-effective and time-efficient. Previous studies have focused on mapping urban objects using street-level imagery. However, they are specifically aimed at a single type of object, and only through web platforms. To the best of our knowledge, there is still a lack of understanding of how workers perform the mapping tasks through different platforms. Aiming to fill this knowledge gap, we investigate the worker performance across web, mobile, and virtual reality platforms by designing a multi-platform system for mapping urban objects using street-level imagery with novel methods for geo-location estimation. We design a preliminary study to show the feasibility of executing online mapping tasks on three platforms. The results demonstrate that the type of task and execution platform can affect the worker performance in terms of worker accuracy, execution time, user engagement, and cognitive load. ...

Conversational crowdsourcing

Conference paper (2020) - Sihang Qiu, Ujwal Gadiraju, Alessandro Bozzon, Geert Jan Houben

The trend of remote work leads to the prosperity of crowdsourcing marketplaces. In crowdsourcing marketplaces, online workers can select their preferred tasks and then complete them to get paid, while requesters design and publish tasks to acquire their desired data. The standard user interface of the crowdsourcing task is the web page, where users provide answers using HTML-based web elements, and the task-related information (including instructions and questions) is displayed on a single web page. Although the traditional way of presenting tasks is straightforward, it could negatively affect workers’ satisfaction and performance by causing problems such as boredom and fatigue. To address this challenge, we proposed a novel concept, conversational crowdsourcing, which employs conversational interfaces to facilitate crowdsourcing task execution. With conversational crowdsourcing, workers receive task information as messages from a conversational agent, and provide answers by sending messages back to the agent. In this vision paper, we introduce our recent work in terms of using conversational crowdsourcing to improve worker performance and experience by employing novel human-computer interaction affordances. Our findings reveal that conversational crowdsourcing has important implications in improving the worker satisfaction and requester-worker relationship in crowdsourcing marketplaces. ...

Detecting, classifying, and mapping retail storefronts using street-level imagery

Conference paper (2020) - Shahin Sharifi Noorian, Sihang Qiu, Achilleas Psyllidis, Alessandro Bozzon, Geert Jan Houben

Up-to-date listings of retail stores and related building functions are challenging and costly to maintain. We introduce a novel method for automatically detecting, geo-locating, and classifying retail stores and related commercial functions, on the basis of storefronts extracted from street-level imagery. Specifically, we present a deep learning approach that takes storefronts from street-level imagery as input, and directly provides the geo-location and type of commercial function as output. Our method showed a recall of 89.05% and a precision of 88.22% on a real-world dataset of street-level images, which experimentally demonstrated that our approach achieves human-level accuracy while having a remarkable run-time efficiency compared to methods such as Faster Region-Convolutional Neural Networks (Faster R-CNN) and Single Shot Detector (SSD). ...
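For readers less familiar with the two metrics quoted in the abstract above, the following is a minimal sketch of how precision and recall are computed from detection counts. The function name and the counts in the usage note are hypothetical and chosen only for illustration; they are not taken from the paper or its dataset.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) for a set of detections.

    precision = TP / (TP + FP): share of reported detections that are correct.
    recall    = TP / (TP + FN): share of real objects that were detected.
    A metric is reported as 0.0 when its denominator is zero.
    """
    detected = true_pos + false_pos   # everything the detector reported
    actual = true_pos + false_neg     # everything that is really there
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall
```

As a usage example with made-up counts, `precision_recall(8, 2, 2)` describes a detector that reported 10 storefronts, 8 of them correct, and missed 2 real ones, so both metrics equal 8/10.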

VirtualCrowd

A Simulation Platform for Microtask Crowdsourcing Campaigns

Conference paper (2020) - Sihang Qiu, Alessandro Bozzon, Geert Jan Houben

This demo presents VirtualCrowd, a simulation platform for crowdsourcing campaigns. The platform allows the design, configuration, step-by-step execution, and analysis of customized tasks, worker profiles, and crowdsourcing strategies. The platform will be demonstrated through a crowd-mapping example in two cities, which will highlight the utility of VirtualCrowd for complex crowdsourcing tasks in real world settings. ...

The Web Conference 2019 Companion Volume Preface

Conference paper (2019) - Sihem Amer-Yahia, Geert Jan Houben, Mohammad Mahdian, Kristina Lerman, Ashish Goel, Julian McAuley

PhD symposium chairs' welcome

Foreword postscript (2019) - Geert-Jan Houben, Kristina Lerman

Supporting Self-Regulated Learning in Online Learning Environments and MOOCs

A Systematic Review

Journal article (2019) - Jacqueline Wong, Martine Baars, Dan Davis, Tim Van Der Zee, Geert Jan Houben, Fred Paas

Massive Open Online Courses (MOOCs) allow learning to take place anytime and anywhere with little external monitoring by teachers. Characteristically, highly diverse groups of learners enrolled in MOOCs are required to make decisions related to their own learning activities to achieve academic success. Therefore, it is considered important to support self-regulated learning (SRL) strategies and adapt to relevant human factors (e.g., gender, cognitive abilities, prior knowledge). SRL supports have been widely investigated in traditional classroom settings, but little is known about how SRL can be supported in MOOCs. Very few experimental studies have been conducted in MOOCs at present. To fill this gap, this paper presents a systematic review of studies on approaches to support SRL in multiple types of online learning environments and how they address human factors. The 35 studies reviewed show that human factors play an important role in the efficacy of SRL supports. Future studies can use learning analytics to understand learners at a fine-grained level to provide support that best fits individual learners. The objective of the paper is twofold: (a) to inform researchers, designers and teachers about the state of the art of SRL support in online learning environments and MOOCs; (b) to provide suggestions for adaptive self-regulated learning support. ...

UMAP 2019 theory, reflection, and opinion track

Chairs' welcome and overview

Foreword postscript (2019) - Geert-Jan Houben, Bamshad Mobasher

ACM UMAP - User Modelling, Adaptation and Personalization is the premier international conference for researchers and practitioners working on systems that adapt to individual users, to groups of users, and that collect, represent, and model user information. The Theory, Opinion and Reflection (TOR) track at UMAP is designed to highlight emerging areas of inquiry in UMAP and to promote discussion of potentially visionary ideas. ...

Crowd-Mapping Urban Objects from Street-Level Imagery

Conference paper (2019) - Sihang Qiu, Achilleas Psyllidis, Alessandro Bozzon, Geert-Jan Houben

Knowledge about the organization of the main physical elements (e.g. streets) and objects (e.g. trees) that structure cities is important in the maintenance of city infrastructure and the planning of future urban interventions. In this paper, a novel approach to crowd-mapping urban objects is proposed. Our method capitalizes on strategies for generating crowdsourced object annotations from street-level imagery, in combination with object density and geo-location estimation techniques to enable the enumeration and geo-tagging of urban objects. To address both the coverage and precision of the mapped objects within budget constraints, we design a scheduling strategy for micro-task prioritization, aggregation, and assignment to crowd workers. We experimentally demonstrate the feasibility of our approach through a use case pertaining to the mapping of street trees in New York City and Amsterdam. We show that anonymous crowds can achieve high recall (up to 80%) and precision (up to 68%), with geo-location precision of approximately 3m. We also show that similar performance could be achieved at city scale, possibly with stringent budget constraints. ...

Approaches for Dialog Management in Conversational Agents

Journal article (2019) - Jan-Gerrit Harms, Pavel Kucherbaev, Alessandro Bozzon, Geert-Jan Houben

Dialog agents, like digital assistants and automated chat interfaces (e.g., chatbots), are becoming more and more popular as users adapt to conversing with their devices as they do with humans. In this paper, we present approaches and available tools for dialog management (DM), a component of dialog agents that handles dialog context and decides the next action for the agent to take. We establish an overview of the field of DM, compare approaches and state-of-the-art tools in industry and research work on a set of dimensions, and identify directions for further research work. ...
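To make the role of a dialog manager concrete, here is a minimal sketch of one simple style of DM, a hand-written slot-filling policy. It is an illustration only and not one of the approaches or tools compared in the article; the slot names, action labels, `DialogState`, and `next_action` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical slots the agent must collect before it can act.
REQUIRED_SLOTS = ("city", "date")


@dataclass
class DialogState:
    """Dialog context: the slot values gathered so far in the conversation."""
    slots: dict[str, str] = field(default_factory=dict)


def next_action(state: DialogState, user_slots: dict[str, str]) -> str:
    """Update the context with the latest user input and pick the next action.

    Returns "ask_<slot>" for the first required slot that is still missing,
    or "confirm" once every required slot has a value.
    """
    state.slots.update(user_slots)
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return f"ask_{slot}"
    return "confirm"
```

In this sketch the context is just a dictionary and the policy is a fixed rule. A first turn that supplies only a city yields `ask_date`, and a second turn that supplies the date yields `confirm`.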

Training Data Augmentation for Detecting Adverse Drug Reactions in User-Generated Content

Conference paper (2019) - Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben

Social media provides a timely yet challenging data source for adverse drug reaction (ADR) detection. Existing dictionary-based, semi-supervised learning approaches are intrinsically limited by the coverage and maintainability of laymen health vocabularies. In this paper, we introduce a data augmentation approach that leverages variational autoencoders to learn high-quality data distributions from a large unlabeled dataset, and subsequently, to automatically generate a large labeled training set from a small set of labeled samples. This allows for efficient social-media ADR detection with low training and re-training costs to adapt to the changes and emergence of informal medical laymen terms. An extensive evaluation performed on Twitter and Reddit data shows that our approach matches the performance of fully-supervised approaches while requiring only 25% of training data. ...

Educational Theories and Learning Analytics: From Data to Knowledge

The Whole Is Greater Than the Sum of Its Parts

Book chapter (2019) - J. Wong, M. Baars, B. de Koning, T. van der Zee, D.J. Davis, M. Khalil, Geert-Jan Houben, F. Paas

The study of learning is grounded in theories and research. Since learning is complex and not directly observable, it is often inferred by collecting and analysing data based on the things learners do or say. By virtue, theories are developed from the analyses of data collected. With the proliferation of technology, large amounts of data are generated when students learn online. Therefore, researchers not only have data on students’ learning performance, but they also have data on the actions students take to achieve the desired learning outcomes. These data could help researchers to understand how students learn and the conditions needed for successful learning. In turn, the information can be translated to instructional and learning design to support students. The aim of the chapter is to discuss how learning theories and learning analytics are important components of educational research. To achieve this aim, studies employing learning analytics are qualitatively reviewed to examine which theories have been used and how the theories have been investigated. The results of the review show that self-regulated learning, motivation, and social constructivism theories were used in studies employing learning analytics. However, the studies at present are mostly correlational. Therefore, experimental studies are needed to examine how theory-informed practices can be implemented so that students can be better supported in online learning environments. The chapter concludes by proposing an iterative loop for educational research employing learning analytics in which learning theories guide data collection and analyses. To convert data into knowledge, it is important to recognize what we already know and what we want to examine. ...

Activating Learning at Scale

A Review of Innovations in Online Learning Strategies

Journal article (2018) - Dan Davis, Guanliang Chen, Claudia Hauff, Geert-Jan Houben

Taking advantage of the vast history of theoretical and empirical findings in the learning literature we have inherited, this research offers a synthesis of prior findings in the domain of empirically evaluated active learning strategies in digital learning environments. The primary concern of the present study is to evaluate these findings with an eye towards scalable learning. Massive Open Online Courses (MOOCs) have emerged as the new way to reach the masses with educational materials, but so far they have failed to maintain learners' attention over the long term. Even though we now understand how effective active learning principles are for learners, the current landscape of MOOC pedagogy too often allows for passivity — leading to the unsatisfactory performance experienced by many MOOC learners today. As a starting point to this research we took John Hattie's seminal work from 2008 on learning strategies used to facilitate active learning. We considered research published between 2009 and 2017 that presents empirical evaluations of these learning strategies. Through our systematic search we found 126 papers meeting our criteria and categorized them according to Hattie's learning strategies. We found large-scale experiments to be the most challenging environment for experimentation due to their size, heterogeneity of participants, and platform restrictions, and we identified the three most promising strategies for effectively leveraging learning at scale as Cooperative Learning, Simulations & Gaming, and Interactive Multimedia ...

Evaluating Crowdworkers as a Proxy for Online Learners in Video-Based Learning Contexts

Journal article (2018) - Daniel Davis, Claudia Hauff, Geert-Jan Houben

Crowdsourcing has emerged as an effective method of scaling up tasks previously reserved for a small set of experts. Accordingly, researchers in the large-scale online learning space have begun to employ crowdworkers to conduct research about large-scale, open online learning. We here report results from a crowdsourcing study (N=135) to evaluate the extent to which crowdworkers and MOOC learners behave comparably on lecture viewing and quiz tasks, the most utilized learning activities in MOOCs. This serves to (i) validate the assumption of previous research that crowdworkers are indeed reliable proxies of online learners and (ii) address the potential of employing crowdworkers as a means of online learning environment testing. Overall, we observe mixed results: in certain contexts (quiz performance and video watching behavior) crowdworkers appear to behave comparably to MOOC learners, and in other situations (interactions with in-video quizzes), their behaviors appear to be disparate. We conclude that future research should be cautious if employing crowdworkers to carry out learning tasks, as the two populations do not behave comparably on all learning-related activities. ...