Q.W. Voet

Please Note

This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.

Bachelor thesis (1)

1 record found

Do Agent Architectures Matter for Crypto CTFs?

An Empirical Evaluation of LLM Architectures on the AICrypto Benchmark

Bachelor thesis (2026) - Q.W. Voet, Z. Erkin, M.J.G. Olsthoorn

Large Language Models (LLMs) demonstrate gold-medal performance in pure mathematics but continue to struggle in professional Capture-The-Flag (CTF) cybersecurity competitions, where the goal is to obtain a flag string as proof of a solved challenge. While models can solve textbook equations, the iterative engineering required for cryptographic challenges often exceeds their single-pass capabilities. We explore whether this performance gap is a limitation of the base models or whether it can be bridged by different agent architectures. This paper presents a systematic evaluation of five distinct control-flow architectures (i.e., how the agent plans, uses tools, and iterates) applied to the AICrypto CTF benchmark: Chain-of-Thought (CoT) as the single-pass baseline, ReAct as a reactive method, and three planning/search approaches (ReWOO, ADaPT, and LATS). To isolate the impact of reasoning structures, we restrict the scope to static challenges, i.e., puzzles that do not require network interaction. Using a fixed base model and tool stack (SageMath execution, command execution, and flag submission), we analyze the trade-offs between success rate, time-to-solve, and token consumption to determine whether the computational overhead of the different architectures translates into greater solving capability. Overall, agent architectures yield a gain of only 5.4 percentage points in success rate over the single-pass baseline (35.1% vs. 29.7%), primarily by iteratively debugging solutions rather than by discovering new cryptographic ideas. The higher success rates of complex planning/search architectures come at the expense of higher token and time costs, with ADaPT consuming 93% more tokens on average than the baseline on successfully solved challenges. ...
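The two success rates quoted in the abstract (35.1% and 29.7%) differ by 5.4 percentage points, which is not the same as a 5.4% relative improvement. The short Python sketch below works through that arithmetic using only the figures stated in the abstract; the function and variable names are illustrative assumptions, not taken from the thesis.

```python
# Figures as quoted in the abstract; all names here are illustrative only.
BASELINE_SUCCESS_PCT = 29.7   # single-pass CoT baseline
AGENT_SUCCESS_PCT = 35.1      # success rate with agent architectures
ADAPT_TOKEN_OVERHEAD_PCT = 93 # extra tokens ADaPT uses on solved challenges


def absolute_gain_pp(baseline_pct: float, improved_pct: float) -> float:
    """Difference between two rates, in percentage points."""
    return improved_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, improved_pct: float) -> float:
    """Improvement expressed as a percentage of the baseline rate."""
    return (improved_pct - baseline_pct) / baseline_pct * 100.0


abs_gain = round(absolute_gain_pp(BASELINE_SUCCESS_PCT, AGENT_SUCCESS_PCT), 1)
rel_gain = round(relative_gain_pct(BASELINE_SUCCESS_PCT, AGENT_SUCCESS_PCT), 1)

# "93% more tokens" means ADaPT uses 1.93 times the baseline token count.
token_ratio = round(1 + ADAPT_TOKEN_OVERHEAD_PCT / 100, 2)

print(f"absolute gain: {abs_gain} percentage points")
print(f"relative gain: {rel_gain}% of the baseline success rate")
print(f"ADaPT token ratio vs. baseline: {token_ratio}x")
```

Read this way, the reported gain is about 18% relative to the baseline success rate, set against roughly 1.9 times the token consumption for ADaPT on solved challenges.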