A.M.A. Balayn

19 records found

A Human-Centered Perspective on Technological Challenges and Opportunities

Journal article (2025) - Andrea Tocchetti, Lorenzo Corti, Agathe Balayn, Mireia Yurrita, Philip Lippmann, Marco Brambilla, Jie Yang
Despite the impressive performance of Artificial Intelligence (AI) systems, their robustness remains elusive and constitutes a key issue that impedes large-scale adoption. Besides, robustness is interpreted differently across domains and contexts of AI. In this work, we systematically survey recent progress to provide a reconciled terminology of concepts around AI robustness. We introduce three taxonomies to organize and describe the literature both from a fundamental and applied point of view: (1) methods and approaches that address robustness in different phases of the machine learning pipeline; (2) methods improving robustness in specific model architectures, tasks, and systems; and in addition, (3) methodologies and insights around evaluating the robustness of AI systems, particularly the tradeoffs with other trustworthiness properties. Finally, we identify and discuss research gaps and opportunities and give an outlook on the field. We highlight the central role of humans in evaluating and enhancing AI robustness, considering the necessary knowledge they can provide, and discuss the need for better understanding practices and developing supportive tools in the future. ...

Understanding the Effect of Decision-Makers' Configuration on Decision-Subjects' Fairness Perceptions

Human intervention is claimed to safeguard decision-subjects’ rights in algorithmic decision-making and contribute to their fairness perceptions. However, how decision-subjects perceive hybrid decision-maker configurations (i.e., combining humans and algorithms) is unclear. We address this gap through a mixed-methods study in an algorithmic policy enforcement context. Through qualitative interviews (Study 1; N1 = 21), we identify three characteristics (i.e., decision-maker’s profile, model type, input data provenance) that affect how decision-subjects perceive decision-makers’ ability, benevolence, and integrity (ABI). Through a quantitative study (Study 2; N2 = 223), we then systematically evaluate the individual and combined effects of these characteristics on decision-subjects’ perceptions towards decision-makers, and fairness perceptions. We found that only decision-maker’s profile contributes to perceived ability, benevolence, and integrity. Interestingly, the effect of decision-maker’s profile on fairness perceptions was mediated by perceived ability and integrity. Our findings have design implications for ensuring effective human intervention as a protection against harmful algorithmic decisions. ...

Understanding the Evolution of Explainability Needs of Clinicians in Pulmonary Medicine

Conference paper (2024) - Lorenzo Corti, Rembrandt Oltmans, Jiwon Jung, Agathe Balayn, Marlies Wijsenbeek, Jie Yang
Clinicians increasingly pay attention to Artificial Intelligence (AI) to improve the quality and timeliness of their services. There are converging opinions on the need for Explainable AI (XAI) in healthcare. However, prior work considers explanations as stationary entities with no account for the temporal dynamics of patient care. In this work, we involve 16 Idiopathic Pulmonary Fibrosis (IPF) clinicians from a European university medical centre and investigate their evolving uses and purposes for explainability throughout patient care. By applying a patient journey map for IPF, we elucidate clinicians' informational needs, how human agency and patient-specific conditions can influence the interaction with XAI systems, and the content, delivery, and relevance of explanations over time. We discuss implications for integrating XAI in clinical contexts and more broadly how explainability is defined and evaluated. Furthermore, we reflect on the role of medical education in addressing epistemic challenges related to AI literacy. ...

Contestability Along AI Value Chains

Conference paper (2024) - Agathe Balayn, Yulu Pi, David Gray Widder, Kars Alfrink, Mireia Yurrita, Sohini Upadhyay, Naveena Karusala, Henrietta Lyons, Cagatay Turkay, Ujwal Gadiraju
This workshop will grow and consolidate a community of interdisciplinary CSCW researchers focusing on the topic of contestable AI. As an outcome of the workshop, we will synthesize the most pressing opportunities and challenges for contestability along AI value chains in the form of a research roadmap. This roadmap will help shape and inspire imminent work in this field. Considering the length and depth of AI value chains, it will especially spur discussions around the contestability of AI systems along various sites of such chains. The workshop will serve as a platform for dialogue and demonstrations of concrete, successful, and unsuccessful examples of AI systems that (could or should) have been contested, to identify requirements, obstacles, and opportunities for designing and deploying contestable AI in various contexts. This will be held primarily as an in-person workshop, with some hybrid accommodation. The day will consist of individual presentations and group activities to stimulate ideation and inspire broad reflections on the field of contestable AI. Our aim is to facilitate interdisciplinary dialogue by bringing together researchers, practitioners, and stakeholders to foster the design and deployment of contestable AI. ...
Doctoral thesis (2023) - A.M.A. Balayn, G.J.P.M. Houben, A. Bozzon
Machine learning (ML) is an artificial intelligence technology that has a great potential for being adopted in various sectors of activities. Yet, it is now also increasingly recognized as a hazardous technology. Failures in the outputs of an ML system might cause physical or social harms. Besides, the development and deployment of an ML system itself are also argued to be harmful in certain contexts.

Surprisingly, these hazards persist in applications where ML technology has been deployed, despite the increasing amount of research performed by the ML research community. In this thesis, we task ourselves with the challenges of understanding the reasons for the subsistence of hazardous system’s output failures and of hazardous development and deployment processes in practice, and of developing solutions to further diagnose these hazardous failures (especially in the system’s outputs). For that, we investigate further the nature of the potential gap between research and the practices of those developers who build and deploy the systems. To do so, we survey major related ML research directions, surface developers' practices and challenges, and search for types of (mis)alignment between theory and practices. There, among others, we find a lack of technical support for ML developers to identify the potential failures of their systems. Hence, we then tackle the development and evaluation of a human-in-the-loop, explainability-based, failure diagnosis method and user-interface for computer vision systems...

The Effects of Explanations, Human Oversight, and Contestability

Conference paper (2023) - M. Yurrita Semperena, Tim Draws, Agathe Balayn, Dave Murray-Rust, Nava Tintarev, Alessandro Bozzon
Recent research claims that information cues and system attributes of algorithmic decision-making processes affect decision subjects' fairness perceptions. However, little is still known about how these factors interact. This paper presents a user study (N = 267) investigating the individual and combined effects of explanations, human oversight, and contestability on informational and procedural fairness perceptions for high- and low-stakes decisions in a loan approval scenario. We find that explanations and contestability contribute to informational and procedural fairness perceptions, respectively, but we find no evidence for an effect of human oversight. Our results further show that both informational and procedural fairness perceptions contribute positively to overall fairness perceptions but we do not find an interaction effect between them. A qualitative analysis exposes tensions between information overload and understanding, human involvement and timely decision-making, and accounting for personal circumstances while maintaining procedural consistency. Our results have important design implications for algorithmic decision-making processes that meet decision subjects' standards of justice. ...

Leveraging Human Understanding for Identifying and Characterizing Image Atypicality

Conference paper (2023) - Shahin Sharifi Noorian, Sihang Qiu, Burcu Sayin, Agathe Balayn, Ujwal Gadiraju, Jie Yang, Alessandro Bozzon
High-quality data plays a vital role in developing reliable image classification models. Despite that, what makes an image difficult to classify remains an unstudied topic. This paper provides a first-of-its-kind, model-agnostic characterization of image atypicality based on human understanding. We consider the setting of image classification "in the wild", where a large number of unlabeled images are accessible, and introduce a scalable and effective human computation approach for proactive identification and characterization of atypical images. Our approach consists of i) an image atypicality identification and characterization task that presents to the human worker both a local view of visually similar images and a global view of images from the class of interest and ii) an automatic image sampling method that selects a diverse set of atypical images based on both visual and semantic features. We demonstrate the effectiveness and cost-efficiency of our approach through controlled crowdsourcing experiments and provide a characterization of image atypicality based on human annotations of 10K images. We showcase the utility of the identified atypical images by testing state-of-the-art image classification services against such images and provide an in-depth comparative analysis of the alignment between human- and machine-perceived image atypicality. Our findings have important implications for developing and deploying reliable image classification systems. ...
Conference paper (2023) - Agathe Balayn, Mireia Yurrita, Jie Yang, Ujwal Gadiraju
Fairness toolkits are developed to support machine learning (ML) practitioners in using algorithmic fairness metrics and mitigation methods. Past studies have investigated practical challenges for toolkit usage, which are crucial to understanding how to support practitioners. However, the extent to which fairness toolkits impact practitioners’ practices and enable reflexivity around algorithmic harms remains unclear (i.e., distributive unfairness beyond algorithmic fairness, and harms that are not related to the outputs of ML systems). Little is currently understood about the root factors that fragment practices when using fairness toolkits and how practitioners reflect on algorithmic harms. Yet, a deeper understanding of these facets is essential to enable the design of support tools for practitioners. To investigate the impact of toolkits on practices and identify factors that shape these practices, we carried out a qualitative study with 30 ML practitioners with varying backgrounds. Through a mixed within- and between-subjects design, we tasked the practitioners with developing an ML model, and analyzed their reported practices to surface potential factors that lead to differences in practices. Interestingly, we found that fairness toolkits act as double-edged swords, with potentially positive and negative impacts on practices. Our findings showcase a plethora of human and organizational factors that play a key role in the way toolkits are envisioned and employed. These results bear implications for the design of future toolkits and educational training for practitioners and call for the creation of new policies to handle the organizational constraints faced by practitioners. ...

A Study on the Use of the Voice Modality for Crowdsourced Relevance Assessments

Conference paper (2023) - Nirmal Roy, Agathe Balayn, David Maxwell, Claudia Hauff
The creation of relevance assessments by human assessors (often nowadays crowdworkers) is a vital step when building IR test collections. Prior works have investigated assessor quality & behaviour, and tooling to support assessors in their task. We have few insights though into the impact of a document's presentation modality on assessor efficiency and effectiveness. Given the rise of voice-based interfaces, we investigate whether it is feasible for assessors to judge the relevance of text documents via a voice-based interface. We ran a user study (n = 49) on a crowdsourcing platform where participants judged the relevance of short and long documents, sampled from the TREC Deep Learning corpus, presented to them either in the text or voice modality. We found that: (i) participants are equally accurate in their judgements across both the text and voice modality; (ii) with increased document length it takes participants significantly longer (for documents of length > 120 words it takes almost twice as much time) to make relevance judgements in the voice condition; and (iii) the ability of assessors to ignore stimuli that are not relevant (i.e., inhibition) impacts the assessment quality in the voice modality: assessors with higher inhibition are significantly more accurate than those with lower inhibition. Our results indicate that we can reliably leverage the voice modality as a means to effectively collect relevance labels from crowdworkers. ...

A Critical Review of Communications, Reports, Regulations, and Standards in the EU, US, and UK

Conference paper (2023) - Luca Nannini, Agathe Balayn, Adam Leon Smith
Public attention towards explainability of artificial intelligence (AI) systems has been rising in recent years to offer methodologies for human oversight. This has translated into the proliferation of research outputs, such as from Explainable AI, to enhance transparency and control for system debugging and monitoring, and intelligibility of system process and output for user services. Yet, such outputs are difficult to adopt on a practical level due to a lack of a common regulatory baseline, and the contextual nature of explanations. Governmental policies are now attempting to tackle such exigence; however, it remains unclear to what extent published communications, regulations, and standards adopt an informed perspective to support research, industry, and civil interests. In this study, we perform the first thematic and gap analysis of this plethora of policies and standards on explainability in the EU, US, and UK. Through a rigorous survey of policy documents, we first contribute an overview of governmental regulatory trajectories within AI explainability and its sociotechnical impacts. We find that policies are often informed by coarse notions and requirements for explanations. This might be due to the willingness to conciliate explanations foremost as a risk management tool for AI oversight, but also due to the lack of a consensus on what constitutes a valid algorithmic explanation, and how feasible the implementation and deployment of such explanations are across stakeholders of an organization. Informed by AI explainability research, we then conduct a gap analysis of existing policies, which leads us to formulate a set of recommendations on how to address explainability in regulations for AI systems, especially discussing the definition, feasibility, and usability of explanations, as well as allocating accountability to explanation providers. ...
Handling failures in computer vision systems that rely on deep learning models remains a challenge. While an increasing number of methods for bug identification and correction are proposed, little is known about how practitioners actually search for failures in these models. We perform an empirical study to understand the goals and needs of practitioners, the workflows and artifacts they use, and the challenges and limitations in their process. We interview 18 practitioners by probing them with a carefully crafted failure handling scenario. We observe that there is a great diversity of failure handling workflows in which cooperations are often necessary, that practitioners overlook certain types of failures and bugs, and that they generally do not rely on potentially relevant approaches and tools originally stemming from research. These insights allow us to draw up a list of research opportunities, such as creating a library of best practices and more representative formalisations of practitioners' goals, developing interfaces to exploit failure handling artifacts, as well as providing specialized training. ...

Eliciting Diverse Knowledge Using A Configurable Game

Conference paper (2022) - Agathe Balayn, Gaole He, Andrea Hu, Jie Yang, Ujwal Gadiraju
Access to commonsense knowledge is receiving renewed interest for developing neuro-symbolic AI systems, or debugging deep learning models. Little is currently understood about the types of knowledge that can be gathered using existing knowledge elicitation methods. Moreover, these methods fall short of meeting the evolving requirements of several downstream AI tasks. To this end, collecting broad and tacit knowledge, in addition to negative or discriminative knowledge, can be highly useful. Addressing this research gap, we developed a novel game with a purpose, 'FindItOut', to elicit different types of knowledge from human players through easily configurable game mechanics. We recruited 125 players from a crowdsourcing platform, who played 2430 rounds, resulting in the creation of more than 150k tuples of knowledge. Through an extensive evaluation of these tuples, we show that FindItOut can successfully result in the creation of plural knowledge with a good player experience. We evaluate the efficiency of the game (over 10 × higher than a reference baseline) and the usefulness of the resulting knowledge, through the lens of two downstream tasks - commonsense question answering and the identification of discriminative attributes. Finally, we present a rigorous qualitative analysis of the tuples' characteristics that informs the future use of FindItOut across various researcher and practitioner communities. ...
In an effort to regulate Machine Learning-driven (ML) systems, current auditing processes mostly focus on detecting harmful algorithmic biases. While these strategies have proven to be impactful, some values outlined in documents dealing with ethics in ML-driven systems are still underrepresented in auditing processes. Such unaddressed values mainly deal with contextual factors that cannot be easily quantified. In this paper, we develop a value-based assessment framework that is not limited to bias auditing and that covers prominent ethical principles for algorithmic systems. Our framework presents a circular arrangement of values with two bipolar dimensions that make common motivations and potential tensions explicit. In order to operationalize these high-level principles, values are then broken down into specific criteria and their manifestations. However, some of these value-specific criteria are mutually exclusive and require negotiation. As opposed to some other auditing frameworks that merely rely on ML researchers' and practitioners' input, we argue that it is necessary to include stakeholders that present diverse standpoints to systematically negotiate and consolidate value and criteria tensions. To that end, we map stakeholders with different insight needs, and assign tailored means for communicating value manifestations to them. We, therefore, contribute to current ML auditing practices with an assessment framework that visualizes closeness and tensions between values and we give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders. ...
Deep learning models for image classification suffer from dangerous issues often discovered after deployment. The process of identifying bugs that cause these issues remains limited and understudied. In particular, explainability methods are often presented as obvious tools for bug identification. Yet, the current practice lacks an understanding of what kind of explanations can best support the different steps of the bug identification process, and how practitioners could interact with those explanations. Through a formative study and an iterative co-creation process, we build an interactive design probe providing various potentially relevant explainability functionalities, integrated into interfaces that allow for flexible workflows. Using the probe, we perform 18 user-studies with a diverse set of machine learning practitioners. Two-thirds of the practitioners engage in successful bug identification. They use multiple types of explanations, e.g. visual and textual ones, through non-standardized sequences of interactions including queries and exploration. Our results highlight the need for interactive, guiding interfaces with diverse explanations, shedding light on future research directions. ...
With recent advances in explainable artificial intelligence (XAI), researchers have started to pay attention to concept-level explanations, which explain model predictions with a high level of abstraction. However, such explanations may be difficult to digest for laypeople due to the potential knowledge gap and the concomitant cognitive load. Inspired by recent work, we argue that analogy-based explanations composed of commonsense knowledge may be a potential solution to tackle this issue. In this paper, we propose analogical inference as a bridge to help end-users leverage their commonsense knowledge to better understand the concept-level explanations. Specifically, we design an effective analogy-based explanation generation method and collect 600 analogy-based explanations from 100 crowd workers. Furthermore, we propose a set of structured dimensions for the qualitative assessment of analogy-based explanations and conduct an empirical evaluation of the generated analogies with experts. Our findings reveal significant positive correlations between the qualitative dimensions of analogies and the perceived helpfulness of analogy-based explanations. These insights can inform the design of future methods for the generation of effective analogy-based explanations. We also find that the understanding of commonsense explanations varies with the experience of the recipient user, which points out the need for further work on personalization when leveraging commonsense explanations. ...

A Survey of Technical Biases Informed by Psychology Literature

Journal article (2021) - Agathe Balayn, Jie Yang, Zoltán Szlávik, Alessandro Bozzon
The automatic detection of conflictual languages (harmful, aggressive, abusive, and offensive languages) is essential to provide a healthy conversation environment on the Web. To design and develop detection systems that are capable of achieving satisfactory performance, a thorough understanding of the nature and properties of the targeted type of conflictual language is of great importance. The scientific communities investigating human psychology and social behavior have studied these languages in detail, but their insights have only partially reached the computer science community.

In this survey, we aim both at systematically characterizing the conceptual properties of online conflictual languages, and at investigating the extent to which they are reflected in state-of-the-art automatic detection systems. Through an analysis of psychology literature, we provide a reconciled taxonomy that denotes the ensemble of conflictual languages typically studied in computer science. We then characterize the conceptual mismatches that can be observed in the main semantic and contextual properties of these languages and their treatment in computer science works; and systematically uncover resulting technical biases in the design of machine learning classification models and the dataset created for their training. Finally, we discuss diverse research opportunities for the computer science community and reflect on broader technical and structural issues. ...
Conference paper (2021) - Agathe Balayn, Panagiotis Soilis, Christoph Lofi, Jie Yang, Alessandro Bozzon
Global interpretability is a vital requirement for image classification applications. Existing interpretability methods mainly explain a model's behavior by identifying salient image patches, which require manual efforts from users to make sense of, and also do not typically support model validation with questions that investigate multiple visual concepts. In this paper, we introduce a scalable human-in-the-loop approach for global interpretability. Salient image areas identified by local interpretability methods are annotated with semantic concepts, which are then aggregated into a tabular representation of images to facilitate automatic statistical analysis of model behavior. We show that this approach answers interpretability needs for both model validation and exploration, and provides semantically more diverse, informative, and relevant explanations while still allowing for scalable and cost-efficient execution. ...
Journal article (2021) - A.M.A. Balayn, C. Lofi, G.J.P.M. Houben
The increasing use of data-driven decision support systems in industry and governments is accompanied by the discovery of a plethora of bias and unfairness issues in the outputs of these systems. Multiple computer science communities, and especially machine learning, have started to tackle this problem, often developing algorithmic solutions to mitigate biases to obtain fairer outputs. However, one of the core underlying causes for unfairness is bias in training data which is not fully covered by such approaches. Especially, bias in data is not yet a central topic in data engineering and management research. We survey research on bias and unfairness in several computer science domains, distinguishing between data management publications and other domains. This covers the creation of fairness metrics, fairness identification, and mitigation methods, software engineering approaches and biases in crowdsourcing activities. We identify relevant research gaps and show which data management activities could be repurposed to handle biases and which ones might reinforce such biases. In the second part, we argue for a novel data-centered approach overcoming the limitations of current algorithmic-centered methods. This approach focuses on eliciting and enforcing fairness requirements and constraints on data that systems are trained, validated, and used on. We argue for the need to extend database management systems to handle such constraints and mitigation methods. We discuss the associated future research directions regarding algorithms, formalization, modelling, users, and systems. ...
Conference paper (2018) - Agathe Balayn, Panagiotis Mavridis, Alessandro Bozzon, Benjamin Timmermans, Zoltán Szlávik
Training machine learning (ML) models for natural language processing usually requires large amounts of data, often acquired through crowdsourcing. The way this data is collected and aggregated can have an effect on the outputs of the trained model such as ignoring the labels which differ from the majority. In this paper, we investigate how label aggregation can bias the ML results towards certain data samples and propose a methodology to highlight and mitigate this bias. Although our work is applicable to any kind of label aggregation for data subject to multiple interpretations, we focus on the effects of the bias introduced by majority voting on toxicity prediction over sentences. Our preliminary results point out that we can mitigate the majority-bias and get increased prediction accuracy for the minority opinions if we take into account the different labels from annotators when training adapted models, rather than rely on the aggregated labels. ...