Evaluating the Effectiveness of Meta Llama3 70B for Unit Test Generation
R.J.H. Schep (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Software Engineering)
Mitchell Olsthoorn – Mentor (TU Delft - Software Engineering)
C.B. Bach Poulsen – Graduation committee member (TU Delft - Programming Languages)
Abstract
The automated generation of test suites is crucial for enhancing software quality and development efficiency. Manually writing tests is time-consuming, accounting for roughly 15% of total project time, while tests generated by automated tools such as EvoSuite and Pynguin often lack readability and comprehensibility. Recent research suggests that Large Language Models (LLMs) may offer a promising alternative. This paper investigates the effectiveness of Llama3 70B in generating unit test cases for Java and Python projects. We compared Llama3 against EvoSuite and Pynguin by measuring the mutation scores of test suites generated for a corpus of 20 Java and 20 Python classes. Our findings reveal that EvoSuite significantly outperforms Llama3 on Java, achieving an average mutation score of 81.05% versus Llama3's 66.26%. Conversely, on Python, Llama3 surpasses Pynguin, with average mutation scores of 51.95% and 42.73%, respectively. These results indicate that while Llama3 does not match EvoSuite, it shows potential as a viable tool for test generation, especially for dynamically typed languages such as Python. Furthermore, empirical observations indicate that Llama3 requires significantly more time to generate tests than EvoSuite and Pynguin. This study underscores the need for continued research to optimize LLMs for software testing and to improve their efficiency and accuracy.
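As background for the results above, the mutation score of a test suite is conventionally the percentage of seeded faults (mutants) that the suite detects (kills). The abstract does not restate the definition, but a standard formulation, assuming equivalent mutants are excluded as is common practice, is:

\[
\mathit{MS} = \frac{\text{killed mutants}}{\text{total mutants} - \text{equivalent mutants}} \times 100\%
\]

Under this reading, EvoSuite's suites killed on average about 81 of every 100 non-equivalent mutants for the Java classes, versus about 66 for Llama3; the exact variant used in the thesis may differ slightly.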