Predicting software vulnerabilities with unsupervised learning techniques

Abstract

As more software is produced every year, more software is also exploited. This exploitation can lead to large monetary losses and other damage to companies and users. It can be reduced by automatically detecting the software vulnerabilities that lead to exploitation. Unfortunately, the state-of-the-art methods for this automated detection are not perfect, so more research is needed to address this issue.

This research was partly done at ING, a bank in the Netherlands, in order to find a software vulnerability prediction method that is more efficient than its already deployed static code analysis tool, Fortify Static Code Analyzer. This report proposes a method to predict software vulnerabilities in code using unsupervised learning methods. The data set consists of software metrics of code written by ING developers, together with a label for each piece of code indicating whether it is vulnerable or non-vulnerable, as confirmed by a security expert. Principal component analysis was used to reduce the dimensionality of the data set. The unsupervised learning technique k-means was then used to build a prediction model, and a distance-based anomaly detection technique was applied to find the software vulnerabilities. This produced poor results. In a final attempt to obtain better results, k-nearest neighbors was used to build a new prediction model, and a different distance-based anomaly detection technique was applied. The outcome of this latter method was surprisingly good.
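As a rough illustration of the two distance-based scoring ideas mentioned above, the following minimal sketch computes an anomaly score per sample in two ways: the distance to the nearest k-means centroid, and the mean distance to the k nearest neighbors. The abstract does not specify which distance-based techniques, distance measure, or thresholds were used, so everything here is an assumption: the function names, the Euclidean distance, the choice of k, and the toy data (which is invented and merely stands in for software metrics after dimensionality reduction). The principal component analysis step is omitted.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct samples as initial centroids
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def centroid_distance_scores(points, centroids):
    """Anomaly score: Euclidean distance to the nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


def knn_distance_scores(points, k):
    """Anomaly score: mean Euclidean distance to the k nearest other points."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Invented toy data: two tight groups plus one isolated sample (index 8).
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (5.0, 5.2),
    (9.0, 0.5),
]

centroids = kmeans(points, k=2)
centroid_scores = centroid_distance_scores(points, centroids)
knn_scores = knn_distance_scores(points, k=2)

# The sample with the highest score is the strongest anomaly candidate.
most_anomalous_knn = max(range(len(points)), key=lambda i: knn_scores[i])
```

In both variants, samples whose score exceeds some chosen threshold would be flagged as potentially vulnerable. Note that the centroid-based score depends on how k-means happens to initialize and partition the data, whereas the neighbor-based score depends only on each sample's local neighborhood. Whether this difference explains the results reported above is not stated in the abstract.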