Using endpoint process information for malicious behavior detection

Abstract

In recent years the impact of malware has become a serious problem. Each year more new malware samples are discovered [2], and the malware is becoming more sophisticated, ransomware being a prime example. Ransomware encrypts personal documents, such as photos and Word documents, and demands money to decrypt these files, hence the name. Malware is not only used for financial gain at the expense of consumers; sophisticated targeted attacks on enterprises, such as the Sony hack, are not uncommon. Although there are many security solutions that should protect endpoints, malware infections still occur. The reason for this lies in the way current security solutions work: most of them act upon known malware behavior and signatures. When new malware is released and its behavior and signature are still unknown, these security solutions cannot protect the endpoint against infection. To overcome this problem a new detection method is needed, one that can detect malicious behavior without prior knowledge. In the scientific literature this type of detection is called anomaly detection [26, 38, 57, 58]. Anomaly detection uses the gathered data to construct a model of normal behavior; any deviation from this normal behavior is seen as an anomaly.

At Fox-IT, an IT security company based in the Netherlands, a new security solution has been developed, called FoxGuard. This security solution can block and allow process activity based on a set of rules. FoxGuard can also log very detailed low-level information on all processes running on a system, including filesystem actions and registry actions. For a more detailed explanation of the data FoxGuard can gather, see section 4.1.

In this master thesis an explorative research is conducted on using anomaly detection to detect malicious processes on an endpoint, based on the detailed process information FoxGuard can collect. The main research question to be answered is: How can anomaly based detection be used for detecting unknown malicious processes based on the detailed process information gathered on a single endpoint?

To answer this question, first a literature study was conducted on the use of anomaly detection for detecting malicious processes, see chapter 2. Its main conclusion is that, by combining process information with tree-based representations, large quantities of data can be stored in a compact form. These compact representations can aid a security officer in graphically analyzing the processes on an endpoint and thereby possibly spotting deviations. In chapter 3 the design requirements of the developed system are analyzed. The conclusion of this analysis is that the amount of data used should be reduced: reducing the data not only lowers the chance of producing a detection method that overfits, it also reduces the need for large amounts of memory, storage, processing power and network traffic. The collection and preparation of the data is discussed in chapter 4. We collected four clean datasets, where a complete dataset contains one complete boot cycle, and five malware datasets. To generate the malware datasets the following malware was used: a banking malware, a Remote Access Trojan (RAT) and a sample of Zeus.
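To make the anomaly detection idea described above concrete, the following is a minimal, purely illustrative sketch (not the method developed in this thesis): per-process activity counts observed during normal operation are used to build a simple statistical model, and a process is flagged when it deviates strongly from that model. All numbers and feature names are hypothetical.

import numpy as np

# Hypothetical per-process activity counts observed during normal operation;
# columns could be e.g. filesystem, registry, process create, thread create.
normal = np.array([
    [120, 40, 1, 2],
    [ 90, 35, 0, 1],
    [110, 50, 2, 2],
    [100, 45, 1, 3],
], dtype=float)

# Model "normal" behavior with a per-feature mean and standard deviation.
mu = normal.mean(axis=0)
sigma = normal.std(axis=0) + 1e-9  # avoid division by zero

def is_anomalous(process_counts, threshold=3.0):
    """Flag a process whose activity deviates more than `threshold`
    standard deviations from normal behavior on any feature."""
    z = np.abs((process_counts - mu) / sigma)
    return bool(np.any(z > threshold))

# A process with an unusually high number of filesystem actions is flagged.
print(is_anomalous(np.array([500.0, 42, 1, 2])))  # True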
The collected data is aggregated, such that a dataframe remains containing, per process, the number of times it triggered each of the following activities: filesystem, registry, process create, thread create, object callback and module load. Furthermore, it contains the unique process id of the parent process. Because the number of times the different process activities were triggered varies widely, the data was normalized to the range 0 to 10, so that the process activities become comparable with each other. A k-means clustering algorithm was then applied to the process activities to assign every process to a cluster of similar processes.

The aggregated and processed data is used to generate process trees (section 5.1) and heatmaps (section 5.3). These two tools provide a graphical representation of the processes. In a heatmap a security officer can easily spot the processes with a high number of process activities per second compared to other processes. In analyzing the process trees, deviations were spotted in the top part of the tree, showing that an expert can use the process tree to easily spot deviations at the top level. However, due to the huge number of nodes in a tree and the difference in computer usage from day to day, finding deviations in the lower levels of the tree proved to be difficult. Analyzing the process trees from the malware sets confirmed that the process tree can help in finding deviations: the RAT malware processes were clearly visible as deviations in the process tree. Furthermore, the analysis showed that all malware samples that were run could be found in the same part of the process tree.

Chapter 6 explains the three algorithms used to calculate the distances between processes in the clean and malware sets. These calculated distances are used to mark a process as malicious or benign: a process is marked malicious if its distance is above a set threshold value. To set these threshold values we used the mean and the 75%, 80%, 85%, 90% and 95% quantiles. All threshold values and algorithms were tested, and the True Positive Rate, False Negative Rate and Accuracy were calculated. The outcomes of all experiments are shown in chapter 7. In figure 1 the True Positive Rate, False Positive Rate and Accuracy for all algorithms are shown. As can be seen in the figure, the malicious processes of the banking malware and the RAT malware could partly be detected. The highest True Positive Rate obtained is 0.917, using algorithms 1 and 3 on the banking malware; however, this is paired with a high False Positive Rate. The Zeus malware was not detected.

In chapter 8 the conclusions and recommendations of this thesis are presented. The main shortcoming of the conducted research is the way in which the data was collected: by using two different machines, differences between processes from the same executable became noticeable, because the running times of these processes differ. For future research this experiment should be repeated by collecting all data on one machine. Although this shortcoming affected the collected data, the proposed algorithms showed the ability to detect malicious processes from at least two out of the three malware types. Furthermore, the analysis of the process trees showed that, although limited, deviations can be detected.
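As an illustration of the threshold-based classification and evaluation described above, the sketch below shows how a quantile of the distances observed on clean data could be used as a threshold, and how the True Positive Rate, False Positive Rate and Accuracy would then be computed. The distance values and labels are hypothetical; the actual distance algorithms are the ones described in chapter 6 and are not reproduced here.

import numpy as np

def quantile_threshold(clean_distances, q=0.95):
    """Set the threshold at the q-quantile of the distances
    observed on the clean (benign) data."""
    return np.quantile(clean_distances, q)

def evaluate(distances, labels, threshold):
    """Mark a process malicious if its distance exceeds the threshold,
    then compute True Positive Rate, False Positive Rate and Accuracy.
    `labels` is 1 for malicious processes, 0 for benign ones."""
    predicted = distances > threshold
    actual = labels.astype(bool)
    tp = np.sum(predicted & actual)
    fp = np.sum(predicted & ~actual)
    tn = np.sum(~predicted & ~actual)
    fn = np.sum(~predicted & actual)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / len(labels)
    return tpr, fpr, accuracy

# Hypothetical example: distances produced by one of the distance algorithms.
clean = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.18, 0.22, 0.28])
test = np.array([0.20, 0.90, 0.25, 1.30, 0.10, 0.27])
labels = np.array([0, 1, 0, 1, 0, 0])

thr = quantile_threshold(clean, q=0.95)
print(evaluate(test, labels, thr))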