C.A. Hammerschmidt | TU Delft Repository

FlexFringe

Modeling software behavior by learning probabilistic automata

Journal article (2025) - Sicco Verwer, Christian Hammerschmidt

We present the efficient implementations of probabilistic deterministic finite automaton learning methods available in FlexFringe. These are well-known strategies for state merging, including several modifications to improve their performance in practice. We show experimentally that these algorithms obtain competitive results and significant improvements over a default implementation. We also demonstrate how to use FlexFringe to learn interpretable models from software logs and use these for anomaly detection. Although less interpretable, we show that learning smaller, more convoluted models improves the performance of FlexFringe on anomaly detection, making it competitive with an existing solution based on neural nets. ...

Beyond Labeling

Using Clustering to Build Network Behavioral Profiles of Malware Families

Book chapter (2021) - A. Nadeem, C.A. Hammerschmidt, C. Hernandez Ganan, S.E. Verwer

Malware family labels are known to be inconsistent. They are also black-box since they do not represent the capabilities of malware. The current state of the art in malware capability assessment includes mostly manual approaches, which are infeasible due to the ever-increasing volume of discovered malware samples. We propose a novel unsupervised machine learning-based method called MalPaCA, which automates capability assessment by clustering the temporal behavior in malware's network traces. MalPaCA provides meaningful behavioral clusters using only 20 packet headers. Behavioral profiles are generated based on the cluster membership of malware's network traces. A Directed Acyclic Graph shows the relationship between malwares according to their overlapping behaviors. The behavioral profiles together with the DAG provide more insightful characterization of malware than current family designations. We also propose a visualization-based evaluation method for the obtained clusters to assist practitioners in understanding the clustering results. We apply MalPaCA on a financial malware dataset collected in the wild that comprises 1.1 k malware samples resulting in 3.6 M packets. Our experiments show that (i) MalPaCA successfully identifies capabilities, such as port scans and reuse of Command and Control servers; (ii) It uncovers multiple discrepancies between behavioral clusters and malware family labels; and (iii) It demonstrates the effectiveness of clustering traces using temporal features by producing an error rate of 8.3%, compared to 57.5% obtained from statistical features. ...

The Robust Malware Detection Challenge and Greedy Random Accelerated Multi-Bit Search

Conference paper (2020) - S.E. Verwer, A. Nadeem, C.A. Hammerschmidt, L. Bliek, Abdullah Al-Dujaili, Una-May O’Reilly

Training classifiers that are robust against adversarially modified examples is becoming increasingly important in practice. In the field of malware detection, adversaries modify malicious binary files to seem benign while preserving their malicious behavior. We report on the results of a recently held robust malware detection challenge. There were two tracks in which teams could participate: the attack track asked for adversarially modified malware samples and the defend track asked for trained neural network classifiers that are robust to such modifications. The teams were unaware of the attacks/defenses they had to detect/evade. Although only 9 teams participated, this unique setting allowed us to make several interesting observations. We also present the challenge winner: GRAMS, a family of novel techniques to train adversarially robust networks that preserve the intended (malicious) functionality and yield high-quality adversarial samples. These samples are used to iteratively train a robust classifier. We show that our techniques, based on discrete optimization techniques, beat purely gradient-based methods. GRAMS obtained first place in both the attack and defend tracks of the competition. ...

Auto Semi-supervised Outlier Detection for Malicious Authentication Events

Conference paper (2020) - Georgios Kaiafas, Christian Hammerschmidt, Sofiane Lagraa, Radu State

Cyber-attacks become more sophisticated and complex especially when adversaries steal user credentials to traverse the network of an organization. Detecting a breach is extremely difficult and this is confirmed by the findings of studies related to cyber-attacks on organizations. A study conducted last year by IBM found that it takes 206 days on average to US companies to detect a data breach. As a consequence, the effectiveness of existing defensive tools is in question. In this work we deal with the detection of malicious authentication events, which are responsible for effective execution of the stealthy attack, called lateral movement. Authentication event logs produce a pure categorical feature space which creates methodological challenges for developing outlier detection algorithms. We propose an auto semi-supervised outlier ensemble detector that does not leverage the ground truth to learn the normal behavior. The automatic nature of our methodology is supported by established unsupervised outlier ensemble theory. We test the performance of our detector on a real-world cyber security dataset provided publicly by the Los Alamos National Lab. Overall, our experiments show that our proposed detector outperforms existing algorithms and produces a 0 False Negative Rate without missing any malicious login event and a False Positive Rate which improves the state-of-the-art. In addition, by detecting malicious authentication events, compared to the majority of the existing works which focus solely on detecting malicious users or computers, we are able to provide insights regarding when and at which systems malicious login events happened. Beyond the application on a public dataset we are working with our industry partner, POST Luxembourg, to employ the proposed detector on their network. ...

Federated learning for cyber security

Conference paper (2020) - Ekaterina Khramtsova, Christian Hammerschmidt, Sofian Lagraa, Radu State

Managed security service providers increasingly rely on machine-learning methods to exceed traditional, signature-based threat detection and classification methods. As machine-learning often improves with more data available, smaller organizations and clients find themselves at a disadvantage: Without the ability to share their data and others willing to collaborate, their machine-learned threat detection will perform worse than the same model in a larger organization. We show that Federated Learning, i.e. collaborative learning without data sharing, successfully helps to overcome this problem. Our experiments focus on a common task in cyber security, the detection of unwanted URLs in network traffic seen by security-as-a-service providers. Our experiments show that i) Smaller participants benefit from larger participants ii) Participants seeing different types of malicious traffic can generalize better to unseen types of attacks, increasing performance by 8% to 15% on average, and up to 27% in the extreme case. iii) Participating in Federated training never harms the performance of the locally trained model. In our experiment modeling a security-as-a service setting, Federated Learning increased detection up to 30% for some participants in the scheme. This clearly shows that Federated Learning is a viable approach to address issues of data sharing in common cyber security settings. ...

Learning behavioral fingerprints from Netflows using Timed Automata

Conference paper (2017) - Nino Pellegrino, Qin Lin, Christian Hammerschmidt, Sicco Verwer

We present a novel way to detect infected hosts and identify malware in networks by analyzing network communication statistics with state-of-the-art automata learning algorithms. The automata encode patterns of short-term interactions in known malicious hosts, and are used to obtain small but effective fingerprints of machine behavior. We showcase the effectiveness of our system, named BASTA1 (Behavioral Analytics System using Timed Automata), on a public dataset containing Netflow traces of real-world botnet malware. Compared to a deep packet inspection of communication content, Netflows are easy and cheap to collect and analyze, and preserve a greater degree of privacy. Even though the high level of abstraction in Netflow data makes it more difficult to utilize it, BASTA shows very impressive results achieving high accuracy in several settings while returning few false positives. It is also capable of detecting infections of previously unseen malware. ...

flexfringe

A Passive Automaton Learning Package

Conference paper (2017) - Sicco Verwer, Christian A. Hammerschmidt

Finite state models, such as Mealy machines or state charts, are often used to express and specify protocol and software behavior. Consequently, these models are often used in verification, testing, and for assistance in the development and maintenance process. Reverse engineering these models from execution traces and log files, in turn, can accelerate and improve the software development and inform domain experts about the processes actually executed in a system. We present name, an open-source software tool to learn variants of finite state automata from traces using a state-of-the-art evidence-driven state-merging algorithm at its core. We embrace the need for customized models and tailored learning heuristics in different application domains by providing a flexible, extensible interface. ...

Reliable Machine Learning for Networking

Key Issues and Approaches

Conference paper (2017) - Christian A. Hammerschmidt, Sebastian Garcia, Sicco Verwer, Radu State

Machine learning has become one of the go-to methods for solving problems in the field of networking. This development is driven by data availability in large-scale networks and the commodification of machine learning frameworks. While this makes it easier for researchers to implement and deploy machine learning solutions on networks quickly, there are a number of vital factors to account for when using machine learning as an approach to a problem in networking and translate testing performance to real networks deployments successfully. This paper, rather than presenting a particular technical result, discusses the necessary considerations to obtain good results when using machine learning to analyze network-related data. ...

Behavioral Clustering of Non-Stationary IP Flow Record Data

Conference paper (2016) - Christian Hammerschmidt, Samuel Marchal, Radu State, Sicco Verwer

Automated network traffic analysis using machine learning techniques plays an important role in managing networks and IT infrastructure. A key challenge to the correct and effective application of machine learning is dealing with non-stationary learning data sources and concept drift. Traffic evolves overtime due to new technology, software, services being used, changes in user behavior but also due to changes in network graphs like dynamic IP address assignment. In this paper, we present an automatic online method to detect changepointsin network traffic based on IP flow record analysis. This technique is used to segment an observed behavior into smaller consecutive behaviors differing one from another. The segmented traffic is used to learn small communication profile characterizing accurately the activities present between two observed changepoints. We validate our method using synthetic data and outlinea real-world application to botnet hosts behavior modeling. ...

Efficient Learning of Communication Profiles from IP Flow Records

Conference paper (2016) - Christian Hammerschmidt, Samuel Marchal, Radu State, Nino Pellegrino, Sicco Verwer

The task of network traffic monitoring has evolved drastically with the ever-increasing amount of data flowing in large scale networks. The automated analysis of this tremendous source of information often comes with using simpler models on aggregated data (e.g. IP flow records) due to time and space constraints. A step towards utilizing IP flow records more effectively are stream learning techniques. We propose a method to collect a limited yet relevant amount of data in order to learn a class of complex models, finite state machines, in real-time. These machines are used as communication profiles to fingerprint, identify or classify hosts and services and offer high detection rates while requiring less training data and thus being faster to compute than simple models. ...

Flexible State-Merging for learning (P)DFAs in Python

Conference paper (2016) - Christian Hammerschmidt, Benjamin Loos, Radu State, Thomas Engel, Sicco Verwer

We present a Python package for learning (non-)probabilistic deterministic nite state automata and provide heuristics in the red-blue framework. As our package is built along the API of the popular scikit-learn package, it is easy to use and new learning methods are easy to add. It provides PDFA learning as an additional tool for sequence prediction or classication to data scientists, without the need to understand the algorithm itself but rather the limitations of PDFA as a model. With applications of automata learning in diverse elds such as network trac analysis, software engineering and biology, a stratied package opens opportunities for practitioners. ...

Learning Deterministic Finite Automata from Innite Alphabets

Conference paper (2016) - Nino Pellegrino, Christian Hammerschmidt, Sicco Verwer, Qin Lin

We proposes an algorithm to learn automata innite alphabets, or at least too large to enumerate. We apply it to dene a generic model intended for regression, with transitions constrained by intervals over the alphabet. The algorithm is based on the Red & Blue framework for learning from an input sample. We show two small case studies where the alphabets are respectively the natural and real numbers, and show how nice properties of automata models like interpretability and graphical representation transfer to regression where typical models are hard to interpret. ...