J. Yang
Co-Data
Cultivating Effective Human-LLM Collaboration for Collaborative Data Processing
A.I. Robustness
A Human-Centered Perspective on Technological Challenges and Opportunities
Despite the impressive performance of Artificial Intelligence (AI) systems, their robustness remains elusive and constitutes a key issue that impedes large-scale adoption. Besides, robustness is interpreted differently across domains and contexts of AI. In this work, we systematically survey recent progress to provide a reconciled terminology of concepts around AI robustness. We introduce three taxonomies to organize and describe the literature both from a fundamental and applied point of view: (1) methods and approaches that address robustness in different phases of the machine learning pipeline; (2) methods improving robustness in specific model architectures, tasks, and systems; and in addition, (3) methodologies and insights around evaluating the robustness of AI systems, particularly the tradeoffs with other trustworthiness properties. Finally, we identify and discuss research gaps and opportunities and give an outlook on the field. We highlight the central role of humans in evaluating and enhancing AI robustness, considering the necessary knowledge they can provide, and discuss the need for better understanding practices and developing supportive tools in the future.
In this paper, we argue that the prevailing approach to training and evaluating machine learning models often fails to consider their real-world application within organizational or societal contexts, where they are intended to create beneficial value for people. We propose a shift in perspective, redefining model assessment and selection to emphasize integration into workflows that combine machine predictions with human expertise, particularly in scenarios requiring human intervention for low-confidence predictions. Traditional metrics like accuracy and f-score fail to capture the beneficial value of models in such hybrid settings. To address this, we introduce a simple yet theoretically sound “value” metric that incorporates task-specific costs for correct predictions, errors, and rejections, offering a practical framework for real-world evaluation. Through extensive experiments, we show that existing metrics fail to capture real-world needs, often leading to suboptimal choices in terms of value when used to rank classifiers. Furthermore, we emphasize the critical role of calibration in determining model value, showing that simple, well-calibrated models can often outperform more complex models that are challenging to calibrate.
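As a rough illustration of the kind of cost-sensitive "value" metric described in the abstract above, the following Python sketch scores a classifier that defers low-confidence predictions to a human. The abstract does not give the exact formula, so everything here is an assumption: the function name `value_with_rejection`, the parameters `v_correct`, `c_error`, and `c_reject`, and the choice to average the per-item gains and costs.

```python
from typing import Sequence


def value_with_rejection(
    confidences: Sequence[float],
    predictions: Sequence[int],
    labels: Sequence[int],
    threshold: float,
    v_correct: float = 1.0,  # assumed gain for an accepted, correct prediction
    c_error: float = 1.0,    # assumed cost for an accepted, wrong prediction
    c_reject: float = 0.2,   # assumed cost of deferring an item to a human
) -> float:
    """Average per-item value of a classifier with a reject option.

    Hypothetical formulation: predictions whose confidence falls below
    `threshold` are rejected (handed to a human) and incur `c_reject`;
    accepted predictions earn `v_correct` if right and lose `c_error` if wrong.
    """
    if not (len(confidences) == len(predictions) == len(labels)):
        raise ValueError("inputs must have the same length")
    if len(labels) == 0:
        raise ValueError("inputs must be non-empty")

    total = 0.0
    for conf, pred, label in zip(confidences, predictions, labels):
        if conf < threshold:
            total -= c_reject   # deferred to a human
        elif pred == label:
            total += v_correct  # accepted and correct
        else:
            total -= c_error    # accepted and wrong
    return total / len(labels)


if __name__ == "__main__":
    confs = [0.9, 0.6, 0.8, 0.4]
    preds = [1, 0, 1, 0]
    truth = [1, 1, 0, 0]
    # With costly errors, a stricter threshold can yield higher value
    # even though fewer items are predicted automatically.
    for t in (0.5, 0.85):
        print(t, value_with_rejection(confs, preds, truth, t, c_error=2.0))
```

Under this formulation, confidence scores decide which items are deferred, so a poorly calibrated model can lose value by accepting predictions it should have rejected. That is consistent with the abstract's emphasis on calibration, although the paper's actual metric may differ.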
FedTrans
Client-transparent utility estimation for robust federated learning
Federated Learning (FL) is a privacy-preserving learning paradigm that plays an important role in the Intelligent Internet of Things. Training a global model in FL, however, is vulnerable to data noise across the clients. In this paper, we introduce FedTrans, a novel client-transparent client utility estimation method designed to guide client selection in noisy scenarios, mitigating performance degradation. To estimate the client utility, we propose a Bayesian framework that models client utility and its relationships with the weight parameters and the performance of local models. We then introduce a variational inference algorithm to effectively infer client utility at the FL server, given only a small amount of auxiliary data. Our evaluation results demonstrate that leveraging FedTrans to select the clients can improve accuracy by up to 7.8%, ensuring the robustness of FL in noisy scenarios.
Opening the Analogical Portal to Explainability
Can Analogies Help Laypeople in AI-assisted Decision Making?
Concepts are an important construct in semantics, based on which humans understand the world with various levels of abstraction. With the recent advances in explainable artificial intelligence (XAI), concept-level explanations are receiving an increasing amount of attention from the broad research community. However, laypeople may find such explanations difficult to digest due to the potential knowledge gap and the concomitant cognitive load. Inspired by prior work that has explored analogies and sensemaking, we argue that augmenting concept-level explanations with analogical inference information from commonsense knowledge can be a potential solution to tackle this issue. To investigate the validity of our proposition, we first designed an effective analogy-based explanation generation method and collected 600 analogy-based explanations from 100 crowd workers. Next, we proposed a set of structured dimensions for the qualitative assessment of such explanations, and conducted an empirical evaluation of the generated analogies with experts. Our findings revealed significant positive correlations between the qualitative dimensions of analogies and the perceived helpfulness of analogy-based explanations, suggesting the effectiveness of the dimensions. To understand the practical utility and the effectiveness of analogy-based explanations in assisting human decision-making, we conducted a follow-up empirical study (N = 280) on a skin cancer detection task with non-expert humans and an imperfect AI system. Thus, we designed a between-subjects study spanning five different experimental conditions with varying types of explanations. The results of our study confirmed that a knowledge gap can prevent participants from understanding concept-level explanations. Consequently, when only the target domain of our designed analogy-based explanation was provided (in a specific experimental condition), participants demonstrated relatively more appropriate reliance on the AI system.
In contrast to our expectations, we found that analogies were not effective in fostering appropriate reliance. We carried out a qualitative analysis of the open-ended responses from participants in the study regarding their perceived usefulness of explanations and analogies. Our findings suggest that human intuition and the perceived plausibility of analogies may have played a role in affecting user reliance on the AI system. We also found that the understanding of commonsense explanations varied with the varying experience of the recipient user, which points out the need for further work on personalization when leveraging commonsense explanations. In summary, although we did not find quantitative support for our hypotheses around the benefits of using analogies, we found considerable qualitative evidence suggesting the potential of high-quality analogies in aiding non-expert users in their decision making with AI-assistance. These insights can inform the design of future methods for the generation and use of effective analogy-based explanations.
MRHF
Multi-stage Retrieval and Hierarchical Fusion for Textbook Question Answering
Textbook question answering (TQA) is challenging as it aims to automatically answer various questions on textbook lessons with long text and complex diagrams, requiring reasoning across modalities. In this work, we propose MRHF, a novel framework that incorporates dense passage re-ranking and the mixture-of-experts architecture for TQA. MRHF introduces a novel query augmentation method for diagram questions and then adopts multi-stage dense passage re-ranking with large pretrained retrievers for retrieving paragraph-level contexts. It then employs a unified question solver to process different types of text questions. Considering the rich blobs and relation knowledge contained in diagrams, we propose to perform multimodal feature fusion over the retrieved context and the heterogeneous diagram features. Furthermore, we introduce the mixture-of-experts architecture for solving diagram questions, so that the model learns from both the rich text context and the complex diagrams while mitigating possible negative effects between features of the two modalities. We test the framework on the CK12-TQA benchmark dataset, and the results show that MRHF outperforms the state-of-the-art results on all types of questions. The ablation and case study also demonstrate the effectiveness of each component of the framework.
Editorial
Special Issue on Human in the Loop Data Curation
XCrowd
Combining Explainability and Crowdsourcing to Diagnose Models in Relation Extraction
Large Language Models (LLMs) are expected to significantly impact various socio-technical systems, offering transformative possibilities for improved interaction between humans and technology. However, their integration poses complex challenges due to the intricate interplay between societal structures, human behaviour, and technological innovation. This research explores these multifaceted challenges, emphasising the need for a human-centered approach in integrating LLMs to ensure that technological advancements are aligned with ethical standards and societal needs. Utilizing a structured methodology comprising a workshop, literature analysis, and expert collaborations, the study uses a multi-dimensional human-centered AI framework to guide the responsible integration of LLMs. Key insights include the importance of inclusive data, considering unintended consequences, maintaining privacy, and respecting intellectual property rights. The paper identifies and advocates for principles like human-in-the-loop, continuous longitudinal studies, proactive awareness campaigns, and regular audits to develop LLMs that are ethically sound, adaptable, and effectively integrated into various socio-technical systems, thus addressing user needs and broader societal impacts. The paper also underlines the importance of collaboration among academia, industry, and policymakers to develop LLMs that are ethically aligned, socially beneficial, and adaptable to future societal needs. The findings offer valuable insights into the strategic integration of LLMs, advocating for a broader research perspective beyond industrial motivations to fully understand and leverage LLMs in socio-technical landscapes.
“It Is a Moving Process”
Understanding the Evolution of Explainability Needs of Clinicians in Pulmonary Medicine
Visible light positioning (VLP) based on the received signal strength (RSS) can leverage a dense deployment of LEDs in future lighting infrastructure to provide accurate and energy-efficient indoor positioning. However, its positioning accuracy heavily depends on the density of collected fingerprints, which is labor-intensive. In this work, we propose a data pre-processing method, including data cleaning and data augmentation, to construct reliable and dense fingerprint samples, thereby alleviating the impact of noisy samples as well as reducing labor intensity. Extensive experiments demonstrate that our proposed method achieves an average positioning error of 1.7 cm, utilizing a sparse dataset that reduces the fingerprint collection effort by 98 percent. Running a tinyML-based model for VLP on the Arduino Nano microcontroller, we also show the possibilities for deploying RSS fingerprint-based VLP systems on resource-constrained embedded devices for real-world applications.
DaisyRec 2.0
Benchmarking Recommendation for Rigorous Evaluation
One critical issue looms large in the field of recommender systems: there are no effective benchmarks for rigorous evaluation, which leads to unreproducible evaluation and unfair comparison. We therefore conduct studies from the perspectives of practical theory and experiments, aiming at benchmarking recommendation for rigorous evaluation. Regarding the theoretical study, a series of hyper-factors affecting recommendation performance throughout the whole evaluation chain are systematically summarized and analyzed via an exhaustive review of 141 papers published at eight top-tier conferences within 2017-2020. We then classify them into model-independent and model-dependent hyper-factors, and different modes of rigorous evaluation are defined and discussed in depth accordingly. For the experimental study, we release the DaisyRec 2.0 library by integrating these hyper-factors to perform rigorous evaluation, whereby a holistic empirical study is conducted to unveil the impacts of different hyper-factors on recommendation performance. Supported by the theoretical and experimental studies, we finally create benchmarks for rigorous evaluation by proposing standardized procedures and providing the performance of ten state-of-the-art methods across six evaluation metrics on six datasets as a reference for later study. Overall, our work sheds light on the issues in recommendation evaluation, provides potential solutions for rigorous evaluation, and lays the foundation for further investigation.
In this paper, we argue that the way we have been training and evaluating ML models has largely overlooked the fact that they are applied in an organizational or societal context, where they provide value to people. We show that this perspective fundamentally changes how we evaluate and select machine learning models.
Faulty or Ready? Handling Failures in Deep-Learning Computer Vision Models until Deployment
A Study of Practices, Challenges, and Needs
Handling failures in computer vision systems that rely on deep learning models remains a challenge. While an increasing number of methods for bug identification and correction are proposed, little is known about how practitioners actually search for failures in these models. We perform an empirical study to understand the goals and needs of practitioners, the workflows and artifacts they use, and the challenges and limitations in their process. We interview 18 practitioners by probing them with a carefully crafted failure handling scenario. We observe that there is a great diversity of failure handling workflows in which cooperation is often necessary, that practitioners overlook certain types of failures and bugs, and that they generally do not rely on potentially relevant approaches and tools originally stemming from research. These insights allow us to draw up a list of research opportunities, such as creating a library of best practices and more representative formalisations of practitioners' goals, developing interfaces to exploit failure handling artifacts, and providing specialized training.