A. Zarras | TU Delft Repository

Vacant Holes for Unsupervised Detection of the Outliers in Compact Latent Representation

Journal article (2023) - Misha Glazunov, Apostolis Zarras

Detection of the outliers is pivotal for any machine learning model deployed and operated in real-world. It is essential for the Deep Neural Networks that were shown to be overconfident with such inputs. Moreover, even deep generative models that allow estimation of the probability density of the input fail in achieving this task. In this work, we concentrate on the specific type of these models: Variational Autoencoders (VAEs). First, we unveil a significant theoretical flaw in the assumption of the classical VAE model. Second, we enforce an accommodating topological property to the image of the deep neural mapping to the latent space: compactness to alleviate the flaw and obtain the means to provably bound the image within the determined limits by squeezing both inliers and outliers together. We enforce compactness using two approaches: (i) Alexandroff extension and (ii) fixed Lipschitz continuity constant on the mapping of the encoder of the VAEs. Finally and most importantly, we discover that the anomalous inputs predominantly tend to land on the vacant latent holes within the compact space, enabling their successful identification. For that reason, we introduce a specifically devised score for hole detection and evaluate the solution against several baseline benchmarks achieving promising results. ...

Upside Down

Exploring the Ecosystem of Dark Web Data Markets

Conference paper (2022) - Bogdan Covrig, Enrique Barrueco Mikelarena, Constanta Rosca, Catalina Goanta, Gerasimos Spanakis, Apostolis Zarras

Large-scale dark web marketplaces have been around for more than a decade. So far, academic research has mainly focused on drug and hacking-related offers. However, data markets remain understudied, especially given their volatile nature and distinct characteristics based on shifting iterations. In this paper, we perform a large-scale study on dark web data markets. We first characterize data markets by using an innovative theoretical legal taxonomy based on the Council of Europe’s Cybercrime Convention and its implementation in Dutch law. The recent Covid-19 pandemic showed that cybercrime has become more prevalent with the increase of digitalization in society. In this context, important questions arise regarding how cybercrime harms are determined, measured, and prioritized. We propose a determination of harm based on criminal law qualifications and sanctions. We also address the empirical question of what the economic activity on data markets looks like nowadays by performing a comprehensive measurement of digital goods based on an original dataset scraped from twelve marketplaces consisting of approximately 28,000 offers from 642 vendors. The resulting analysis combines insights from the theoretical legal framework and the results of the measurement study. To our knowledge, this is the first study to combine these two elements systematically. ...

Do Bayesian Variational Autoencoders Know What They Don't Know?

Conference paper (2022) - Misha Glazunov, Apostolis Zarras

The problem of detecting the Out-of-Distribution (OoD) inputs is of paramount importance for Deep Neural Networks. It has been previously shown that even Deep Generative Models that allow estimating the density of the inputs may not be reliable and often tend to make over-confident predictions for OoDs, assigning to them a higher density than to the in-distribution data. This over-confidence in a single model can be potentially mitigated with Bayesian inference over the model parameters that take into account epistemic uncertainty. This paper investigates three approaches to Bayesian inference: stochastic gradient Markov chain Monte Carlo, Bayes by Backpropagation, and Stochastic Weight Averaging-Gaussian. The inference is implemented over the weights of the deep neural networks that parameterize the likelihood of the Variational Autoencoder. We empirically evaluate the approaches against several benchmarks that are often used for OoD detection: estimation of the marginal likelihood utilizing sampled model ensemble, typicality test, disagreement score, and Watanabe-Akaike Information Criterion. Finally, we introduce two simple scores that demonstrate the state-of-the-art performance. ...

Detecting and categorizing Android malware with graph neural networks

Conference paper (2021) - Peng Xu, Claudia Eckert, Apostolis Zarras

Android is the most dominant operating system in the mobile ecosystem. As expected, this trend did not go unnoticed by miscreants, and quickly enough, it became their favorite platform for discovering new victims through malicious apps. These apps have become so sophisticated that they can bypass anti-malware measures implemented to protect the users. Therefore, it is safe to admit that traditional anti-malware techniques have become cumbersome, sparking the urge to come up with an efficient way to detect Android malware. In this paper, we present a novel Natural Language Processing (NLP) inspired Android malware detection and categorization technique based on Function Call Graph Embedding. We design a graph neural network (graph embedding) based approach to convert the whole graph structure of an Android app to a vector. We then utilize the graphs' vectors to detect and categorize the malware families. Our results reveal that graph embedding yields better results as we get 99.6% accuracy on average for the malware detection and 98.7% accuracy for the malware categorization. ...

Hybroid

Toward Android Malware Detection and Categorization with Program Code and Network Traffic

Conference paper (2021) - Mohammad Reza Norouzian, Peng Xu, Claudia Eckert, Apostolis Zarras

Android malicious applications have become so sophisticated that they can bypass endpoint protection measures. Therefore, it is safe to admit that traditional anti-malware techniques have become cumbersome, thereby raising the need to develop efficient ways to detect Android malware. In this paper, we present Hybroid, a hybrid Android malware detection and categorization solution that utilizes program code structures as static behavioral features and network traffic as dynamic behavioral features for detection (binary classification) and categorization (multi-label classification). For static analysis, we introduce a natural-language-processing-inspired technique based on function call graph embeddings and design a graph-neural-network-based approach to convert the whole graph structure of an Android app to a vector. For dynamic analysis, we extract network flow features from the raw network traffic by capturing each application’s network flow. Finally, Hybroid utilizes the network flow features combined with the graphs’ vectors to detect and categorize the malware. Our solution demonstrates 97.0% accuracy on average for malware detection and 94.0% accuracy for malware categorization. Also, we report remarkable results in different performance metrics such as F1-score, precision, recall, and AUC. ...

HawkEye

Cross-Platform Malware Detection with Representation Learning on Graphs

Conference paper (2021) - Peng Xu, Youyi Zhang, Claudia Eckert, Apostolis Zarras

Malicious software, widely known as malware, is one of the biggest threats to our interconnected society. Cybercriminals can utilize malware to carry out their nefarious tasks. To address this issue, analysts have developed systems that can prevent malware from successfully infecting a machine. Unfortunately, these systems come with two significant limitations. First, they frequently target one specific platform/architecture, and thus, they cannot be ubiquitous. Second, code obfuscation techniques used by malware authors can negatively influence their performance. In this paper, we design and implement HawkEye, a control-flow-graph-based cross-platform malware detection system, to tackle the problems mentioned above. In more detail, HawkEye utilizes a graph neural network to convert the control flow graphs of executable to vectors with the trainable instruction embedding and then uses a machine-learning-based classifier to create a malware detection system. We evaluate HawkEye by testing real samples on different platforms and operating systems, including Linux (x86, x64, and ARM-32), Windows (x86 and x64), and Android. The results outperform most of the existing works with an accuracy of 96.82% on Linux, 93.39% on Windows, and 99.6% on Android. To the best of our knowledge, HawkEye is the first approach to consider graph neural networks in the malware detection field, utilizing natural language processing. ...

Falcon

Malware Detection and Categorization with Network Traffic Images

Conference paper (2021) - Peng Xu, Claudia Eckert, Apostolis Zarras

Android is the most popular smartphone operating system. At the same time, miscreants have already created malicious apps to find new victims and infect them. Unfortunately, existing anti-malware procedures have become obsolete, and thus novel Android malware techniques are in high demand. In this paper, we present Falcon, an Android malware detection and categorization framework. More specifically, we treat the network traffic classification task as a 2D image sequence classification and handle each network packet as a 2D image. Furthermore, we use a bidirectional LSTM network to process the converted 2D images to obtain the network vectors. We then utilize those converted vectors to detect and categorize the malware. Our results reveal that Falcon could be an accurate and viable solution as we get 97.16% accuracy on average for the malware detection and 88.32% accuracy for the malware categorization. ...