Using local LLMs in constrained environments for increasing mutation score
R.R.L. van der Geest (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Software Engineering)
Mitchell Olsthoorn – Mentor (TU Delft - Software Engineering)
C.B. Bach Poulsen – Graduation committee member (TU Delft - Programming Languages)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Mutation testing is a way to measure how effective a test suite is at catching bugs in a given piece of code. Manually writing tests that achieve a high mutation score can be cumbersome and time-consuming. Automated tools can generate such tests, but their output is often very hard for humans to understand and is therefore rarely used as an actual test suite for software programs. Because LLMs have been shown to generate programs that humans can understand more easily, we ask whether LLMs can be used to improve or generate tests for the purpose of mutation testing. Some LLMs run in the cloud, while others run locally. Cloud-based LLMs such as ChatGPT or Copilot do not require the user to own hardware, but they are not always an option because of privacy concerns, speed, or regulations. Local LLMs avoid these privacy concerns, but sometimes require large amounts of hardware. This paper focuses on local LLMs that can run in a computationally restricted environment. We present an automated approach that uses a local LLM to improve the mutation score of existing test suites. We compare three models (DeepSeek Coder, Code Llama and Codestral), evaluated on publicly available datasets. Using this approach, we generated unit tests that, combined with the existing manually written tests, increased the mutation score in roughly one third to one half of the cases, depending on the model.
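To make the notion of mutation score concrete, the following is a minimal sketch, not the approach or tooling used in this work. It assumes hand-written mutants of a toy function and represents each test as a predicate over the function under test; all names (`original_max`, `MUTANTS`, `mutation_score`, `generated_test`) are hypothetical and chosen for illustration only. The mutation score is taken here as the fraction of mutants that fail at least one test.

```python
from typing import Callable, List

Func = Callable[[int, int], int]
Test = Callable[[Func], bool]  # a test returns True when the function passes


def original_max(a: int, b: int) -> int:
    """Toy function under test (hypothetical example)."""
    return a if a > b else b


# Hand-written mutants: each one injects a single small fault.
def mutant_flipped(a: int, b: int) -> int:
    return a if a < b else b  # relational operator replaced


def mutant_always_a(a: int, b: int) -> int:
    return a  # else-branch removed


def mutant_always_b(a: int, b: int) -> int:
    return b  # then-branch removed


MUTANTS: List[Func] = [mutant_flipped, mutant_always_a, mutant_always_b]


def mutation_score(mutants: List[Func], tests: List[Test]) -> float:
    """Fraction of mutants killed, i.e. failing at least one test."""
    killed = sum(1 for m in mutants if any(not t(m) for t in tests))
    return killed / len(mutants)


# Stand-in for an existing, manually written test suite.
existing_tests: List[Test] = [lambda f: f(2, 1) == 2]

# Stand-in for an additional test, such as one an LLM might propose.
generated_test: Test = lambda f: f(1, 3) == 3

# Every test must pass on the original code before mutants are scored.
assert all(t(original_max) for t in existing_tests + [generated_test])

before = mutation_score(MUTANTS, existing_tests)
after = mutation_score(MUTANTS, existing_tests + [generated_test])
print(f"mutation score before: {before:.2f}, after: {after:.2f}")
```

In this toy setup the existing test leaves `mutant_always_a` alive, and the added test kills it, so the score rises. An added test counts as an improvement only when it kills a mutant that the existing suite missed.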