Measuring LLM Tool-Use Efficiency in Cryptographic Capture-the-Flag Competitions

Bachelor Thesis (2026)
Author(s)

M. Iordache (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Z. Erkin – Mentor (TU Delft - Cyber Security)

M.J.G. Olsthoorn – Graduation committee member (TU Delft - Software Engineering)

More Info
Publication Year
2026
Language
English
Graduation Date
30-01-2026
Awarding Institution
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Downloads
61
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models demonstrate strong capability in mathematics, yet struggle with cryptographic tasks that require precise algorithmic implementation. We investigate this capability gap through an ablation study that quantifies the marginal utility of giving models access to multiple tools. Models interact with these tools through a Python REPL with unrestricted access to external libraries, allowing them to autonomously install and configure their own computational environments within a custom ReAct framework. Using 15 tasks from the AICrypto benchmark covering diverse cryptographic categories, we isolate the contributions of general-purpose programming versus domain-specific libraries and tools across three configurations and three models, measuring success rate, time to solve, stability, and iteration depth. Our analysis reveals three principal findings: transitioning from pure reasoning to code execution with Python alone, without additional libraries or access to external software, yields a +37.78 percentage point performance increase; expanding to unrestricted access produces diminishing returns due to tool-management and installation overhead, with effects varying by model type (reasoning vs. non-reasoning); and tool and Python-library utility is task-dependent, with large gains for complex algorithmic and computationally intensive challenges but penalties for pattern-based cryptanalysis. These findings establish the existence of a reasoning-capability threshold for LLMs, below which additional tools consume planning capacity, and demonstrate that the Python Standard Library provides optimal performance for most models and cryptographic categories in our selected dataset. This work provides a quantitative basis for integrating large language models with tool usage in cryptographic CTFs.
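The REPL-based tool loop described in the abstract can be sketched as a minimal ReAct-style cycle. This is an illustrative sketch only: the function names, the stubbed single-step policy, and the toy Caesar-cipher task are assumptions for demonstration, not the framework or tasks used in the thesis (in the actual setup an LLM produces each Thought/Action pair).

```python
# Minimal ReAct-style loop sketch: a policy emits a Thought and a Python
# Action, the REPL tool executes the Action in a persistent namespace, and
# the captured output is fed back as the next Observation.
# All names here are illustrative assumptions, not the thesis implementation.
import io
import contextlib


def python_repl(code: str, env: dict) -> str:
    """Execute code in a persistent namespace and capture its stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue()


def stub_policy(observations):
    """Stand-in for the LLM: solves a toy Caesar-shift task in one step."""
    thought = "Thought: the ciphertext looks Caesar-shifted; shift back by 3."
    action = (
        "print(''.join(chr((ord(c) - 97 - 3) % 26 + 97) "
        "for c in 'fdhvdu'))"
    )
    return thought, action


def react_loop(max_steps: int = 5):
    env, observations = {}, []
    for _ in range(max_steps):
        thought, action = stub_policy(observations)   # "reason"
        observation = python_repl(action, env)        # "act" via the tool
        observations.append(observation)
        if observation.strip():  # toy stop condition: any printed answer
            return observation.strip()
    return None


print(react_loop())  # -> caesar
```

The persistent `env` dictionary is what makes the tool a REPL rather than a stateless executor: variables defined in one Action remain available to later Actions, which is what lets an agent install, configure, and then reuse its own environment across iterations.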

https://github.com/bogdansys/Research-Project-TU-Delft-CSE3000
