Predicting software vulnerabilities with unsupervised learning techniques

Master thesis (2020)

Authors

K.W. Man (Electrical Engineering, Mathematics and Computer Science)

Contributors

S.E. Verwer, Cyber Security (supervisor 1)

A. Panichella, Software Engineering (supervisor 1)

R.L. Lagendijk, Cyber Security (supervisor 2)

Faculty

Electrical Engineering, Mathematics and Computer Science

Clustering, K-nearest neighbors, Unsupervised learning, Anomaly detection, K-means, Fortify, Software vulnerability detection, Software fault prediction

More Info

To reference this document use:

http://resolver.tudelft.nl/uuid:80c1b078-b8ca-4c29-b0ba-866fdc5f656b

Published Date

20-08-2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As more software is produced every year, more software is also exploited. This exploitation can lead to large monetary losses and other damage to companies and users. It can be reduced by automatically detecting the software vulnerabilities that lead to exploitation. Unfortunately, the state-of-the-art methods for this automated detection are imperfect, so more research is needed to address the issue.

This research was carried out partly at ING, a bank in the Netherlands, with the aim of finding a software vulnerability prediction method that is more efficient than the static code analysis tool the bank already deploys, Fortify Static Code Analyzer. This report proposes a method to predict software vulnerabilities in code using unsupervised learning. The data set consists of software metrics of code written by ING developers, together with a label for each piece of code, confirmed by a security expert, indicating whether it was vulnerable or non-vulnerable. Principal component analysis was used to reduce the dimensionality of the data set. The unsupervised learning technique k-means was then used to build a prediction model, and a distance-based anomaly detection technique was applied to find the software vulnerabilities. This produced poor results. In a final attempt to obtain better results, k-nearest neighbors was used to build a new prediction model, and another distance-based anomaly detection technique was applied. The outcome of this latter method was surprisingly good.
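The abstract names the steps of the pipeline but not their parameters or exact scoring rules. The following is a minimal NumPy sketch of the general idea, not the thesis implementation. It assumes synthetic data in place of the ING software metrics, and it assumes two common distance-based scores: distance to the nearest k-means centroid, and mean distance to the k nearest neighbors. The function names (`pca_reduce`, `kmeans_scores`, `knn_scores`, `flag_top`), the number of components, the values of k, and the flagged fraction are all illustrative choices.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project standardized features onto the top principal components."""
    X = np.asarray(X, dtype=float)
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns the fitted centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids


def kmeans_scores(X, k):
    """Assumed score: distance from each sample to its nearest centroid."""
    X = np.asarray(X, dtype=float)
    centroids = kmeans(X, k)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)


def knn_scores(X, k):
    """Assumed score: mean distance from each sample to its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    return np.sort(dists, axis=1)[:, :k].mean(axis=1)


def flag_top(scores, fraction):
    """Flag samples whose score lies above the (1 - fraction) quantile."""
    return scores > np.quantile(scores, 1.0 - fraction)


# Synthetic stand-in for the software metrics (hypothetical data).
rng = np.random.default_rng(42)
typical = rng.normal(0.0, 1.0, size=(200, 6))
unusual = rng.normal(0.0, 8.0, size=(5, 6))
metrics = np.vstack([typical, unusual])

reduced = pca_reduce(metrics, n_components=2)
kmeans_flags = flag_top(kmeans_scores(reduced, k=3), fraction=0.05)
knn_flags = flag_top(knn_scores(reduced, k=5), fraction=0.05)
print("flagged by k-means distance:", int(kmeans_flags.sum()))
print("flagged by k-NN distance:", int(knn_flags.sum()))
```

In a setting like the one described, the flagged samples would be compared against the expert-confirmed labels to evaluate each model. The sketch stops short of that step because the labels and evaluation metrics are not given here.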

Files

Kwman_MSc_thesis.pdf

(.pdf | 7.16 Mb)