LLM-Based Autonomous Agents for Dynamic Malware Analysis

Bachelor Thesis (2026)

Author(s)

T. Crull (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P. Pawelczak – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S.S. Chakraborty – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. van Deursen – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

LLM, Dynamic analysis, Malware analysis, Qwen, CAPEv2

To reference this document use

https://resolver.tudelft.nl/uuid:ac3a0426-d121-4df6-a8a6-3d1bfcfb86fc

Publication Year

2026

Language

English

Graduation Date

24-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Dynamic malware analysis produces large amounts of behavioural evidence, which can be difficult to interpret manually and too large to process directly with small Large Language Models (LLMs). This paper evaluates to what extent Qwen3-4B can distinguish between benign and malicious Windows executables using reduced CAPEv2 dynamic-analysis reports. To test this, we built a sandbox pipeline that executes samples in a Windows 10 Pro detonation VM, collects CAPEv2 reports, filters them down to the most relevant dynamic-analysis information, and feeds them to Qwen3-4B for classification. The reduced reports retain the process tree, domains and DNS activity, behavioural signatures, and their ATT&CK TTP and Malware Behavior Catalog mappings. The dataset consisted of 1082 malware samples from MalwareBazaar and 762 benign samples collected from PortableApps, PortableApps installers, the Sysinternals Suite, and Benign-NET. Two prompts were tested: one with only benign and malware as possible verdicts, and one that also allowed an inconclusive verdict. The first prompt achieved 73.01% recall and 60.91% precision, showing that Qwen3-4B could detect many malware samples but also misclassified many benign samples as malicious. The second prompt did not solve this issue, since more correct classifications than incorrect ones became inconclusive. Overall, Qwen3-4B shows some potential for dynamic malware analysis, but its high false positive rate makes it unsuitable as a standalone classifier without further improvements or fine-tuning.
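To make the reported metrics concrete, the following minimal Python sketch shows how recall and precision follow from confusion counts when malware is treated as the positive class. The counts are hypothetical and are not taken from the thesis: they were chosen only because they reproduce the reported 73.01% recall and 60.91% precision, under the assumption that all 1082 malware and 762 benign samples received a benign or malware verdict. The names ConfusionCounts, recall, precision, and false_positive_rate are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion counts with malware as the positive class."""
    true_positives: int   # malware classified as malware
    false_positives: int  # benign classified as malware
    false_negatives: int  # malware classified as benign
    true_negatives: int   # benign classified as benign


def recall(c: ConfusionCounts) -> float:
    # Share of all malware samples that were detected.
    return c.true_positives / (c.true_positives + c.false_negatives)


def precision(c: ConfusionCounts) -> float:
    # Share of "malware" verdicts that were actually malware.
    return c.true_positives / (c.true_positives + c.false_positives)


def false_positive_rate(c: ConfusionCounts) -> float:
    # Share of all benign samples that were flagged as malware.
    return c.false_positives / (c.false_positives + c.true_negatives)


# Hypothetical counts (an assumption, not reported in the abstract):
# 790 + 292 = 1082 malware samples, 507 + 255 = 762 benign samples.
counts = ConfusionCounts(
    true_positives=790,
    false_positives=507,
    false_negatives=292,
    true_negatives=255,
)

if __name__ == "__main__":
    print(f"recall:    {recall(counts):.2%}")
    print(f"precision: {precision(counts):.2%}")
    print(f"illustrative false positive rate: {false_positive_rate(counts):.2%}")
```

Under these assumed counts, precision stays well below recall because a large share of benign samples is flagged as malware, which is the weakness the abstract identifies.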

Files

Thomas_Crull_Thesis.pdf

(pdf | 1.2 Mb)

License info not available