Measuring LLM Tool-Use Efficiency in Cryptographic Capture-the-Flag Competitions
M. Iordache (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Erkin – Mentor (TU Delft - Cyber Security)
M.J.G. Olsthoorn – Graduation committee member (TU Delft - Software Engineering)
Abstract
Large Language Models demonstrate strong capability in mathematics, yet struggle with cryptographic tasks that require precise algorithmic implementation. We investigate this capability gap through an ablation study that quantifies the marginal utility of giving models access to multiple tools. Models interact with these tools through a Python REPL with unrestricted access to external libraries, autonomously installing and configuring their own computational environments within a custom ReAct framework. Using 15 tasks from the AICrypto benchmark covering diverse cryptographic categories, we isolate the contributions of general-purpose programming versus domain-specific libraries and tools across three configurations and three models, measuring success rate, time to solve, stability, and iteration depth. Our analysis reveals three principal findings: transitioning from pure reasoning to code execution with Python (standard library only, no external software) yields a +37.78 percentage-point performance increase; expanding to unrestricted access produces diminishing returns due to tool-management and installation overhead, with effects varying by model type (reasoning vs. non-reasoning); and tool and Python library utility is task-dependent, with large gains on complex algorithmic and computationally intensive challenges but penalties on pattern-based cryptanalysis. These findings establish the existence of a reasoning capability threshold for LLMs, below which additional tools consume planning capacity, and demonstrate that the Python Standard Library provides optimal performance for most models and cryptographic categories in our selected dataset. This work provides a quantitative basis for integrating large language models with tool usage in cryptographic CTFs.
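The tool-use setup described above can be sketched as a minimal ReAct-style loop: the model alternates between proposing a Python action and observing its output from a persistent REPL namespace. This is an illustrative sketch only, not the thesis's actual framework; the `react_loop`, `run_python`, and `scripted_model` names are hypothetical, and the scripted stand-in model simply decodes a Caesar cipher to demonstrate the loop.

```python
import io
import contextlib

def run_python(code, env):
    """Execute a code snippet in a persistent namespace and capture stdout."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue()

def react_loop(model, task, max_steps=5):
    """Minimal ReAct loop: the model proposes an action (Python code or a
    final answer); code is executed and its output fed back as an observation."""
    env = {}          # persistent REPL namespace across steps
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))
        if step["action"] == "final":
            return step["answer"]
        transcript.append(f"Observation: {run_python(step['code'], env)}")
    return None

# Scripted stand-in for an LLM: first decrypts a Caesar shift, then answers.
def scripted_model(prompt):
    if "Observation" not in prompt:
        return {"action": "python",
                "code": ("ct = 'KHOOR'\n"
                         "print(''.join(chr((ord(c) - 65 - 3) % 26 + 65)"
                         " for c in ct))")}
    return {"action": "final",
            "answer": prompt.rsplit("Observation: ", 1)[1].strip()}

print(react_loop(scripted_model, "Decrypt KHOOR (Caesar, shift 3)"))  # HELLO
```

The persistent `env` dictionary mirrors the REPL property the study relies on: state from one iteration (imports, installed packages, intermediate values) remains available to later iterations within a single task attempt.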
https://github.com/bogdansys/Research-Project-TU-Delft-CSE3000