Detecting PII in Git commits

Abstract

As technology advances, organizations find it increasingly difficult to keep their data private; it is frequently leaked to the public through their code repositories or databases. Methods exist to counter data leakage while pushing code to a repository; however, these rely heavily on regular expressions. Personal names, locations, and other Personally Identifiable Information (PII) do not follow a recurring pattern and can therefore only be caught by manual code reviews, which are themselves prone to errors. A tool that detects such PII should be designed as a first line of defence against leakage. In this paper, we propose a highly customizable tool that combines the strength of regular expressions with a state-of-the-art machine learning model to detect a variety of important PII within the code changes of Python software projects. We use CodeBERT, a RoBERTa-like Transformer model, as our PII recognizer. The recognizer is fine-tuned on git commits from the Scikit-learn library into which we injected fake sensitive data. To test and improve the quality of the model and the tool as a whole, we design an experimental methodology to find optimal values for the model's hyperparameters, compare it against another Transformer model, and run the fine-tuned model against several other code bases written in different programming languages. These experiments improve the quality of the model and allow us to build a robust tool with a well-performing machine learning model that detects a variety of entities. The tool can be tailored to any business and mitigate a significant part of potential data leaks.
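To make the combined approach concrete, the sketch below illustrates (in Python) how a regex pass for pattern-based secrets could be paired with a Transformer token-classification model applied to commit diffs. It is a minimal illustration, not the paper's implementation: the checkpoint path, the label set, and the example diff are hypothetical placeholders, and in practice the model would be the CodeBERT checkpoint fine-tuned on the injected Scikit-learn commits.

    import re
    from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

    # Placeholder for a fine-tuned CodeBERT (microsoft/codebert-base) checkpoint
    # trained for PII token classification; not a published model.
    MODEL_NAME = "path/to/codebert-pii-finetuned"

    # Regex pass: PII that does follow recurring patterns.
    PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    }

    def regex_findings(diff_text):
        """Return (label, match) pairs for pattern-based PII in a commit diff."""
        return [(label, m.group())
                for label, rx in PATTERNS.items()
                for m in rx.finditer(diff_text)]

    def model_findings(diff_text):
        """Return entities (names, locations, ...) flagged by the fine-tuned model."""
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
        ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
        return [(ent["entity_group"], ent["word"]) for ent in ner(diff_text)]

    if __name__ == "__main__":
        diff = '+ AUTHOR = "Jane Doe"  # contact: jane.doe@example.com'
        print(regex_findings(diff) + model_findings(diff))

In this sketch the regex pass catches structured identifiers (emails, IP addresses), while the model pass covers unstructured PII such as names; both sets of findings would be reported to the developer before the commit is pushed.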