Locally Explainable Isolation Forest with Mixed-Attribute Data and Ternary Isolation Trees

Combatting Money Laundering with Anomaly Detection

Abstract

In the fight against money laundering, demand for data-driven Anti-Money Laundering (AML) solutions is growing. Anomaly detection algorithms in particular have proven effective at detecting suspicious customer behaviour and at revealing patterns otherwise hidden in customer transaction data. In this thesis, the Isolation Forest anomaly detection algorithm is studied in combination with the model-specific local explanation method Multiple Indicator Local Depth-based Isolation Forest Feature Importance (MI-Local-DIFFI). To extend Isolation Forest to mixed-attribute data sets, the incorporation of nominal features is explored in more detail. This analysis resulted in the introduction of Isolation Forest with Categorical Sampling (iForestCS), a method that incorporates nominal attributes directly into an isolation tree without the need to encode them on a numerical scale. The method is tested against different encoding strategies and against Isolation Forest Conditional Anomaly Detection (iForestCAD) on several synthetic data sets. It shows improved performance over the encoding strategies for different parameters of the underlying synthetic data. Furthermore, this thesis explores the potential of a ternary Isolation Forest, in which the branching strategy of an isolation tree is expanded to produce three child nodes. Experiments on synthetic data show that the performance of MI-Local-DIFFI in particular degrades when it is applied to a ternary Isolation Forest. Finally, the research considers a practical use case: the locally explainable Isolation Forest is applied to mixed-attribute customer transaction data from Triodos Bank. This provided useful insight and resulted in the detection of suspicious customer behaviour and the introduction of new rules into business practices. Although the most interesting customer behaviour did not emanate directly from the nominal attributes, the choice of method for incorporating nominal features led to differences among the anomalies with the highest anomaly scores.
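To illustrate what incorporating a nominal attribute directly into an isolation tree could look like, the following is a minimal sketch and not the thesis's iForestCS implementation. The splitting rule for nominal attributes (draw one observed category uniformly at random and split on equality) is an assumption made for illustration, and all function names (c_factor, build_tree, path_length, anomaly_scores) are hypothetical. Numeric attributes use the standard Isolation Forest rule of a uniformly sampled cut point, and scores follow the usual 2^(-E[h(x)]/c(psi)) normalisation.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup on n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Grow one isolation tree; each row is a tuple of floats and/or strings."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    # Only attributes that still take more than one value can split this node.
    candidates = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    if not candidates:
        return {"size": len(rows)}
    j = rng.choice(candidates)
    values = [r[j] for r in rows]
    nominal = isinstance(values[0], str)
    if nominal:
        # ASSUMED rule: pick one observed category uniformly, split on equality.
        cut = rng.choice(sorted(set(values)))
        left = [r for r in rows if r[j] == cut]
        right = [r for r in rows if r[j] != cut]
    else:
        # Standard rule: uniform cut point between the observed min and max.
        cut = rng.uniform(min(values), max(values))
        left = [r for r in rows if r[j] < cut]
        right = [r for r in rows if r[j] >= cut]
    return {
        "attr": j,
        "cut": cut,
        "nominal": nominal,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(row, node, depth=0):
    """Depth at which row lands, adjusted for the size of the leaf it reaches."""
    if "size" in node:
        return depth + c_factor(node["size"])
    value = row[node["attr"]]
    go_left = (value == node["cut"]) if node["nominal"] else (value < node["cut"])
    child = node["left"] if go_left else node["right"]
    return path_length(row, child, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=64, seed=0):
    """Return one score in (0, 1) per row; higher means more anomalous."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(rows, psi), 0, max_depth, rng) for _ in range(n_trees)]
    norm = n_trees * c_factor(psi)
    return [2.0 ** (-sum(path_length(r, t) for t in trees) / norm) for r in rows]
```

In this sketch a rare category can be isolated in a single split whenever it is the sampled category, without first mapping categories to numbers. Each column is assumed to be consistently numeric or consistently a string.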