Anomaly detection is a cornerstone of data analysis, aimed at identifying patterns that deviate from expected behaviour. However, conventional anomaly detection methods often fail to differentiate between actionable anomalies and those that, while statistically anomalous, are irrelevant to the user’s goals. Such uninteresting anomalies, originating from distinct, unrelated distributions, contribute to false alarms and resource inefficiencies, particularly in critical domains like cybersecurity and healthcare. This thesis proposes a novel, adaptive framework that retrains anomaly detection models to exclude uninteresting anomalies from being flagged, thereby improving the relevance of detected anomalies.
Central to this framework is the use of the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC) to artificially augment datasets with labelled uninteresting anomalies, transforming them into regular data. The framework also incorporates user feedback to iteratively refine model performance during deployment. A key finding of this research is that the effectiveness of the framework is highly dependent on the degree of distinguishability between interesting and uninteresting anomalies. Specifically, when the two types of anomalies are clearly distinct in terms of their statistical and categorical features, the framework achieves a significant reduction in false positives without adversely affecting the detection rate of actionable anomalies or the accuracy of regular data.
The framework was evaluated using four state-of-the-art anomaly detection models: Isolation Forest, One-Class Support Vector Machines, Autoencoders, and Variational Autoencoders. It was tested on two datasets: one in cybersecurity, involving various attack types, and another in healthcare, where anomalies represent different diagnostic categories. The results demonstrate that the framework can effectively identify and suppress uninteresting anomalies, achieving over 90\% accuracy in classifying these cases as regular data under favourable conditions. Notably, when the distinction between interesting and uninteresting anomalies was substantial, the models retained their ability to detect actionable anomalies, with minimal degradation in overall accuracy. Furthermore, a minimum of around 50 samples was found to be necessary to adequately represent the uninteresting-anomaly class. Without the framework, many more samples are required before the algorithm stops flagging these uninteresting anomalies: the class must then constitute a substantial fraction of the training data, which, depending on the size of the training set, can mean thousands of samples. Gathering around 50 samples poses no problem in practice, as the framework is intended precisely for settings where uninteresting anomalies are abundant and constantly trigger false alerts, so a sufficient number of samples is usually available.