Measuring LLM Tool-Use Efficiency in Cryptographic Capture-the-Flag Competitions
M. Iordache (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Erkin – Mentor (TU Delft - Cyber Security)
M.J.G. Olsthoorn – Graduation committee member (TU Delft - Software Engineering)
Abstract
Large Language Models demonstrate strong capability in mathematics, yet struggle with cryptographic tasks that require precise algorithmic implementation. We investigate this capability gap through an ablation study that quantifies the marginal utility of giving models access to multiple tools. Models interact with these tools through a Python REPL with unrestricted access to external libraries, autonomously installing and configuring their own computational environments within a custom ReAct framework. Using 15 tasks from the AICrypto benchmark covering diverse cryptographic categories, we isolate the contributions of general-purpose programming versus domain-specific libraries and tools across three configurations and three models, measuring success rate, time to solve, stability, and iteration depth. Our analysis reveals three principal findings: transitioning from pure reasoning to code execution with Python (standard library only, no external software) yields a +37.78 percentage-point performance increase; expanding to unrestricted access produces diminishing returns due to tool-management and installation overhead, with effects varying by model type (reasoning vs. non-reasoning); and tool and Python library utility is task-dependent, with large gains on complex algorithmic and computationally intensive challenges but penalties on pattern-based cryptanalysis. These findings establish the existence of a reasoning capability threshold for LLMs, below which additional tools consume planning capacity, and demonstrate that the Python Standard Library provides optimal performance for most models and cryptographic categories in our selected dataset. This work provides a quantitative basis for integrating large language models with tool usage in cryptographic CTFs.
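The tool-use setup described above can be sketched as a minimal ReAct-style loop: the model alternates between proposing a Python action and observing its output from a persistent REPL namespace. This is an illustrative sketch only, not the thesis's actual framework; the `react_loop`, `run_python`, and `scripted_model` names are hypothetical, and the scripted stand-in model simply decodes a Caesar cipher to demonstrate the loop.

```python
import io
import contextlib

def run_python(code, env):
    """Execute a code snippet in a persistent namespace and capture stdout."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as exc:
        return f"Error: {exc}"
    return buf.getvalue()

def react_loop(model, task, max_steps=5):
    """Minimal ReAct loop: the model proposes an action (Python code or a
    final answer); code is executed and its output fed back as an observation."""
    env = {}          # persistent REPL namespace across steps
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))
        if step["action"] == "final":
            return step["answer"]
        transcript.append(f"Observation: {run_python(step['code'], env)}")
    return None

# Scripted stand-in for an LLM: first decrypts a Caesar shift, then answers.
def scripted_model(prompt):
    if "Observation" not in prompt:
        return {"action": "python",
                "code": ("ct = 'KHOOR'\n"
                         "print(''.join(chr((ord(c) - 65 - 3) % 26 + 65)"
                         " for c in ct))")}
    return {"action": "final",
            "answer": prompt.rsplit("Observation: ", 1)[1].strip()}

print(react_loop(scripted_model, "Decrypt KHOOR (Caesar, shift 3)"))  # HELLO
```

The persistent `env` dictionary mirrors the REPL property the study relies on: state from one iteration (imports, installed packages, intermediate values) remains available to later iterations within a single task attempt.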
https://github.com/bogdansys/Research-Project-TU-Delft-CSE3000