Studying the Machine Learning Lifecycle and Improving Code Quality of Machine Learning Applications
M.P.A. Haakman (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. van Deursen – Mentor (TU Delft - Software Technology)
Maurício Aniche – Graduation committee member (TU Delft - Software Engineering)
C.C.S. Liem – Graduation committee member (TU Delft - Multimedia Computing)
Luis Cruz – Graduation committee member (TU Delft - Software Engineering)
Abstract
As organizations start to adopt machine learning in critical business scenarios, their development processes change and the reliability of the resulting applications becomes more important. To investigate these changes and improve the reliability of such applications, we conducted two studies in this thesis. The first study aims to understand how the processes by which machine learning applications are developed have evolved and how well state-of-the-art lifecycle models fit the current needs of the fintech industry. To this end, we conducted a case study with seventeen machine learning practitioners at the fintech company ING. The results indicate that the existing lifecycle models CRISP-DM and TDSP largely reflect the current development processes of machine learning applications, but that crucial steps are missing, including a feasibility study, documentation, model evaluation, and model monitoring. Our second study aims to reduce bugs and improve the code quality of machine learning applications. We developed a static code analysis tool consisting of six checkers that find probable bugs and enforce best practices, specifically in Python code used for processing large amounts of data and for modeling in the machine learning lifecycle. An evaluation of the tool on 1,000 notebooks collected from Kaggle shows that static code analysis can detect, and thus help prevent, probable bugs in data science code. Our work shows that the real challenges of applying machine learning extend far beyond sophisticated learning algorithms: more focus is needed on the entire lifecycle.
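To illustrate the kind of check such a tool can perform, the sketch below is a minimal, hypothetical example (not one of the six checkers described in the thesis): it uses Python's ast module to flag a common pandas pitfall in data science code, calling a non-mutating method such as DataFrame.dropna() as a bare statement so that its return value, and thus the cleaned data, is silently discarded.

```python
# Hypothetical sketch of a static checker for data science code.
# It flags calls like `df.dropna()` whose result is never assigned,
# unless the call explicitly requests in-place mutation.
import ast

NON_MUTATING_METHODS = {"dropna", "fillna", "sort_values", "reset_index"}

def find_unassigned_dataframe_calls(source: str) -> list[int]:
    """Return line numbers where a non-mutating pandas-style call is used as a bare statement."""
    tree = ast.parse(source)
    warnings = []
    for node in ast.walk(tree):
        # An ast.Expr whose value is a Call means the call's result is thrown away.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            func = node.value.func
            if isinstance(func, ast.Attribute) and func.attr in NON_MUTATING_METHODS:
                # Skip calls that explicitly pass inplace=True.
                inplace = any(
                    kw.arg == "inplace" and getattr(kw.value, "value", False)
                    for kw in node.value.keywords
                )
                if not inplace:
                    warnings.append(node.lineno)
    return warnings

if __name__ == "__main__":
    snippet = (
        "import pandas as pd\n"
        "df = pd.read_csv('data.csv')\n"
        "df.dropna()\n"          # probable bug: result discarded
        "print(df.shape)\n"
    )
    print(find_unassigned_dataframe_calls(snippet))  # [3]
```

This example only sketches the general approach of AST-based checking; the actual checkers, rules, and evaluation are described in the thesis itself.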