LLM-Based Autonomous Agents for Dynamic Malware Analysis
T. Crull (TU Delft - Electrical Engineering, Mathematics and Computer Science)
P. Pawelczak – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S.S. Chakraborty – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. van Deursen – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Dynamic malware analysis produces large amounts of behavioural evidence that is difficult to interpret manually and often too large for a small Large Language Model (LLM) to process directly. This paper evaluates to what extent Qwen3-4B can distinguish benign from malicious Windows executables using reduced CAPEv2 dynamic-analysis reports. To test this, we built a sandbox pipeline that executes samples in a Windows 10 Pro detonation VM, collects the resulting CAPEv2 reports, filters them down to the most relevant dynamic-analysis information, and passes the reduced reports to Qwen3-4B for classification. The reduced reports retain the process tree, domains and DNS activity, behavioural signatures, and their ATT&CK TTP and Malware Behavior Catalog mappings. The dataset consisted of 1082 malware samples from MalwareBazaar and 762 benign samples collected from PortableApps, PortableApps installers, the Sysinternals Suite, and Benign-NET. Two prompts were tested: one that allowed only benign and malware as verdicts, and one that also allowed an inconclusive verdict. The first prompt achieved 73.01% recall and 60.91% precision, showing that Qwen3-4B detected many malware samples but also misclassified many benign samples as malicious. The second prompt did not solve this issue: more correct classifications than incorrect ones became inconclusive. Overall, Qwen3-4B shows some potential for dynamic malware analysis, but its high false positive rate makes it unsuitable as a standalone classifier without further improvements or fine-tuning.
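The report-reduction and prompting steps described in the abstract might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the pipeline used in this work: the function names `reduce_report` and `build_prompt`, the JSON key names (`behavior`, `processtree`, `network`, `domains`, `dns`, `signatures`, `ttps`, `mbcs`), and the prompt wording are all hypothetical, and the real CAPEv2 report layout may differ between versions.

```python
import json
from typing import Any


def reduce_report(report: dict[str, Any]) -> dict[str, Any]:
    """Keep only the fields the abstract lists as retained.

    All key names are assumptions made for this sketch; check them
    against the reports produced by your own CAPEv2 installation.
    """
    behavior = report.get("behavior", {})
    network = report.get("network", {})
    # Keep each signature's identity and its ATT&CK / MBC mappings,
    # dropping any bulky per-signature evidence.
    signatures = [
        {
            "name": sig.get("name"),
            "description": sig.get("description"),
            "ttps": sig.get("ttps", []),
            "mbcs": sig.get("mbcs", []),
        }
        for sig in report.get("signatures", [])
    ]
    return {
        "process_tree": behavior.get("processtree", []),
        "domains": network.get("domains", []),
        "dns": network.get("dns", []),
        "signatures": signatures,
    }


def build_prompt(reduced: dict[str, Any], allow_inconclusive: bool = False) -> str:
    """Build a classification prompt (hypothetical wording).

    allow_inconclusive switches between the two prompt variants:
    two verdicts, or two verdicts plus 'inconclusive'.
    """
    verdicts = ["benign", "malware"]
    if allow_inconclusive:
        verdicts.append("inconclusive")
    return (
        "Classify the following dynamic-analysis report. "
        f"Answer with exactly one of: {', '.join(verdicts)}.\n\n"
        + json.dumps(reduced, indent=2, sort_keys=True)
    )
```

Filtering before prompting keeps the input within the context budget of a small model, and building both prompt variants from the same reduced report means any difference in verdicts comes from the prompt rather than from the evidence shown.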