Can Large Language Models reason? Investigating Open-Source Cryptographic Reasoning
A. Taneva (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Erkin – Mentor (TU Delft - Cyber Security)
M.J.G. Olsthoorn – Graduation committee member (TU Delft - Software Engineering)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Large language models (LLMs) have shown remarkable performance on mathematical competitions such as AIME, and recently on the AICrypto benchmark. AICrypto has tested some of the best commercially available models on capture-the-flag (CTF)-style cryptography challenges spanning multiple-choice, proof, and open-ended questions. While the LLMs do well on the first two categories, they struggle to solve open-ended questions where advanced mathematical reasoning and creativity are required. This research follows an approach similar to the AICrypto benchmark, testing open-source LLMs and their ability to reason. Instead of assigning simple pass/fail scores, we evaluate the LLM qualitatively based on the logic it follows. We aim to demystify the black-box workings of commercial LLMs and potentially contribute to the development of an open-source framework for solving CTF challenges. The tests are performed on the Qwen3 32B LLM using an agentic ReAct framework. The most common failure modes are discussed, as well as their causal factors.