Evaluating the Effectiveness of Meta Llama3 70B for Unit Test Generation
R.J.H. Schep (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Software Engineering)
Mitchell Olsthoorn – Mentor (TU Delft - Software Engineering)
C.B. Bach Poulsen – Graduation committee member (TU Delft - Programming Languages)
Abstract
The automated generation of test suites is crucial for enhancing software quality and development efficiency. Manually writing tests is time-consuming, accounting for roughly 15% of total project time, while tests generated by automated tools such as EvoSuite and Pynguin often lack readability and comprehensibility. Recent research suggests that Large Language Models (LLMs) may offer a promising alternative. This paper investigates the effectiveness of Llama3 70B in generating unit test cases for Java and Python projects. We compared Llama3 against EvoSuite and Pynguin by measuring the mutation scores of test suites generated for a corpus of 20 Java and 20 Python classes. Our findings reveal that EvoSuite significantly outperforms Llama3 on Java, achieving an average mutation score of 81.05% versus Llama3's 66.26%. Conversely, on Python, Llama3 surpasses Pynguin, with average mutation scores of 51.95% and 42.73%, respectively. These results indicate that while Llama3 does not match EvoSuite, it shows potential as a viable tool for test generation, especially for dynamically typed languages such as Python. Furthermore, empirical observations indicate that Llama3 requires significantly more time to generate tests than EvoSuite and Pynguin. This study underscores the need for continued research to optimize LLMs for software testing and to improve their efficiency and accuracy.
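As background for the results above, the mutation score of a test suite is conventionally the percentage of seeded faults (mutants) that the suite detects (kills). The abstract does not restate the definition, but a standard formulation, assuming equivalent mutants are excluded as is common practice, is:

\[
\mathit{MS} = \frac{\text{killed mutants}}{\text{total mutants} - \text{equivalent mutants}} \times 100\%
\]

Under this reading, EvoSuite's suites killed on average about 81 of every 100 non-equivalent mutants for the Java classes, versus about 66 for Llama3; the exact variant used in the thesis may differ slightly.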