The Effectiveness of GPT-4o for Generating Test Assertions

Bachelor Thesis (2024)
Author(s)

A. Bagdonas (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Mitchell Olsthoorn – Mentor (TU Delft - Software Engineering)

A. Panichella – Mentor (TU Delft - Software Engineering)

C.B. Bach Poulsen – Graduation committee member (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Over the last few years, Large Language Models (LLMs) have become remarkably popular both in research and in everyday use, with GPT-4o being OpenAI's most advanced model at the time of writing. We assessed GPT-4o's performance in unit test generation using mutation testing. Twenty Java classes were selected from the SF110 corpus, and for each, ten different test classes were generated. After build errors were resolved and failing assertions removed, evaluation with Pitest yielded an average mutation coverage of around 71% on the sample dataset. Manually fixing the failing assertions increased the overall mutation score to 75%. Nonetheless, a main drawback was the need to manually resolve problems in the GPT-4o responses, such as code hallucination and incorrect assumptions about the classes under test.
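To illustrate the kind of artifact being evaluated, the sketch below shows a small class under test and a generated-test-style assertion; the class and test names (Counter, CounterTest) are hypothetical and not taken from the SF110 corpus or the thesis itself. Under mutation testing, a tool such as Pitest mutates the class under test (e.g., turning an increment into a decrement) and counts a mutant as killed when an assertion like this fails. Plain runtime checks stand in for JUnit assertions here to keep the example self-contained.

```java
// Hypothetical class under test (illustrative, not from SF110).
class Counter {
    private int value;
    void increment() { value++; }
    void reset()     { value = 0; }
    int get()        { return value; }
}

// A test in the style an LLM might generate. Pitest would mutate
// Counter and check whether these assertions detect the change.
public class CounterTest {
    public static void main(String[] args) {
        Counter c = new Counter();
        c.increment();
        c.increment();
        // A mutant replacing ++ with -- in increment() would fail here
        // and thus be "killed" by this assertion.
        if (c.get() != 2) throw new AssertionError("expected 2, got " + c.get());
        c.reset();
        if (c.get() != 0) throw new AssertionError("expected 0 after reset");
        System.out.println("all assertions passed");
    }
}
```

In the thesis setting, such tests would use a framework like JUnit, and surviving mutants indicate assertions that are missing or too weak.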
