A System for Model Diagnosis centered around Human Computation

Abstract

Machine learning (ML) systems for computer vision are widely deployed in decision-making contexts, including high-stakes domains such as autonomous driving and medical diagnosis. While these systems largely accelerate decision-making, they have been found to suffer from a severe reliability issue: they can easily fail on serving data that differ only slightly from the data captured during their training phase. This issue has led to undesired outcomes with safety, ethical, and societal implications across various applications, with numerous examples of semi-autonomous cars causing accidents on the road.
In this thesis, we therefore develop a system to support ML practitioners in debugging their computer vision models before deployment, i.e., before they have access to serving data.

We take inspiration from prior and ongoing work to formulate the diagnosis problem at hand, identify its challenges, and envision a human-computation-based solution. We then thoroughly analyse the requirements for a system instantiating this solution, design such a system, and implement it as a well-functioning, full-fledged, highly modular, and easily customizable system.
The solution is based on the definition of human computation operations that, together, make it possible to a) identify the mechanisms a human would expect the model to learn in an ideal world, b) identify the mechanisms the model has actually learned (via annotations of saliency maps), and c) compare these two sets of mechanisms to draw conclusions about the soundness of the model's behavior. The solution is specifically designed to account for uncertainty in the human workers' answers, and to handle ambiguous granularities in the concepts the model might have learned.
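To give a rough intuition of step c), the comparison can be viewed as contrasting the set of expected concepts with the set of concepts surfaced by the saliency-map annotations, each weighted by the workers' aggregated certainty. The sketch below is purely illustrative; the data structures and function names are our own assumptions, not the system's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str          # e.g., "stripes", "wheel"
    confidence: float  # aggregated worker certainty in [0, 1]

def compare_mechanisms(expected, learned, min_confidence=0.5):
    """Illustrative sketch: contrast the concepts the model should rely on
    (expected) with those the saliency-map annotations suggest it relies on
    (learned), keeping only concepts the workers are sufficiently sure about."""
    exp = {c.name for c in expected if c.confidence >= min_confidence}
    got = {c.name for c in learned if c.confidence >= min_confidence}
    return {
        "confirmed": sorted(exp & got),  # expected and actually used
        "missed": sorted(exp - got),     # expected but not detected
        "spurious": sorted(got - exp),   # used but unexpected (potential bias)
    }
```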
To the best of our knowledge, our work is the first system that allows an ML practitioner to first identify their own goals for debugging a model (among a large diversity of goals) while accounting for a limited monetary budget, then to configure a debugging session according to these goals, and finally to run the system fully automatically with that configuration to obtain a model debugging report.
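For intuition, such a debugging-session configuration could be expressed along the following lines. All keys, values, and the commented-out call are hypothetical assumptions made for illustration, not the system's real schema or API.

```python
# Hypothetical configuration for a debugging session (illustrative names only).
session_config = {
    "goals": ["detect_spurious_concepts", "check_expected_concepts"],
    "budget": {"max_human_operations": 500},
    "workers": {"initial_count": 5, "assume_errors": True},
    "operation_order": ["collect_expected", "annotate_saliency", "compare"],
}

# The system would then be run with this configuration to produce a report, e.g.:
# report = run_debugging_session(model, dataset, session_config)  # hypothetical call
```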

Finally, we conduct a thorough investigation of our system. First, we set out to assess the correctness and informativeness of its outputs by running the system with various configurations on different models trained on various datasets whose biases are controlled to varying degrees. This first evaluation shows, in particular, that the outputs, and hence the implementation, are correct. With these outputs, we are able to identify the biases that were injected into the models, as well as to learn about previously unknown behaviors of commonly used models that many practitioners rely on.
Second, we evaluate the cost-effectiveness of running the system. For that, we run tests in two settings: one in which the human workers may make mistakes (e.g., due to a lack of expertise, the complexity of the task, or inattention), and one in which they are fully accurate. Within both settings, we vary the system's configuration (e.g., the order in which the human operations are conducted, the number of workers allocated at the start of the debugging session) and observe how the number of human operations needed to reach correct system outputs evolves. We find that the system's output is potentially relevant, informative, and complete: it provides an in-depth analysis of the model's behavior, revealing what the model comprehends, where it falls short, and what it should ideally have grasped.

All in all, in this thesis, we build the system and thoroughly evaluate it. While we identify a number of conceptual and practical limitations (e.g., the difficulty of annotating concepts, the potentially high cost), our work constitutes a first step towards complete solutions that help practitioners debug their models. We encourage readers to build on our work, in particular to further optimize the system's cost. We make all our code publicly available so that anyone can reuse our system or reproduce our experiments.
