Do Agent Architectures Matter for Crypto CTFs?

An Empirical Evaluation of LLM Architectures on the AICrypto Benchmark

Bachelor Thesis (2026)
Author(s)

Q.W. Voet (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Z. Erkin – Mentor (TU Delft - Cyber Security)

M.J.G. Olsthoorn – Mentor (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
30-01-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) demonstrate gold-medal performance in pure mathematics but continue to struggle in professional Capture-The-Flag (CTF) cybersecurity competitions, where the goal is to obtain a flag string as proof of success. While models can solve textbook equations, the iterative engineering required for cryptographic challenges often exceeds their single-pass capabilities. We explore whether this performance gap is a limitation of the base models or whether it can be bridged by different agent architectures. This paper presents a systematic evaluation of five distinct control-flow architectures (i.e., how the agent plans, uses tools, and iterates) on the AICrypto CTF benchmark: Chain-of-Thought (CoT) as the single-pass baseline, ReAct as a reactive method, and three planning/search approaches (ReWOO, ADaPT, and LATS). To isolate the impact of reasoning structure, we restrict the scope to static challenges: puzzles that require no network interaction. Using a fixed base model and tool stack (SageMath execution, command execution, and flag submission), we analyze the trade-offs between success rate, time-to-solve, and token consumption to determine whether the computational overhead of more complex architectures translates into greater solving capability. Overall, agent architectures yield a gain of only 5.4 percentage points over the single-pass baseline (35.1% vs. 29.7% success rate), achieved primarily by iteratively debugging solutions rather than by discovering new cryptographic ideas. The higher success rates of complex planning/search architectures come at the expense of higher token and time costs, with ADaPT consuming on average 93% more tokens than the baseline on successfully solved challenges.
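The reactive control flow that the abstract contrasts with the single-pass baseline can be illustrated with a minimal ReAct-style loop. The model stub, the tool names (`run_python`, `submit_flag`), and the transcript format below are illustrative assumptions for the sketch, not the thesis's actual implementation or tool stack:

```python
# Minimal sketch of a ReAct-style agent loop (illustrative only; the thesis's
# real tools are SageMath execution, command execution, and flag submission).
from typing import Callable, Dict


def run_python(code: str) -> str:
    """Toy stand-in for a code-execution tool: runs code, returns `result`."""
    try:
        env: dict = {}
        exec(code, env)
        return str(env.get("result", ""))
    except Exception as e:
        return f"error: {e}"


# Hypothetical tool registry; a real agent would also expose command execution.
TOOLS: Dict[str, Callable[[str], str]] = {"run_python": run_python}


def react_loop(model: Callable[[str], str], task: str, max_steps: int = 5) -> str:
    """Alternate model action and tool observation until a flag is submitted."""
    transcript = task
    for _ in range(max_steps):
        step = model(transcript)  # model proposes the next action
        transcript += "\n" + step
        if step.startswith("submit_flag:"):
            return step.split(":", 1)[1].strip()  # candidate flag
        tool, _, arg = step.partition(":")
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg.strip())
        transcript += f"\nObservation: {observation}"  # feed result back
    return ""  # gave up within the step budget


# Scripted "model" that first computes, then submits, mimicking the
# iterate-then-debug behaviour the abstract describes.
script = iter([
    "run_python: result = 40 + 2",
    "submit_flag: CTF{42}",
])
flag = react_loop(lambda transcript: next(script), "Recover the flag.")
```

The key contrast with the CoT baseline is the feedback edge: each tool observation is appended to the transcript, so the model can revise a failed attempt instead of committing to a single pass.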
