Readability Driven Test Selection

Using Large Language Models to Assign Readability Scores and Rank Auto-Generated Unit Tests

Bachelor Thesis (2024)
Author(s)

I. Zaidi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Deljouyi – Mentor (TU Delft - Software Engineering)

Andy Zaidman – Mentor (TU Delft - Software Technology)

A. Katsifodimos – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Writing tests enhances software quality, yet developers often deprioritize it. Existing tools for automatic test generation struggle with test understandability, primarily because they fail to consider context, producing identifiers, test names, and test data that are not contextually appropriate for the code under test. Current metrics for judging the understandability of unit tests are limited because they do not account for contextual factors such as the quality of comments. A metric for evaluating test readability is therefore essential for selecting the most comprehensible tests. This research builds on UTGen, incorporating LLMs to enhance the readability of automatically generated unit tests. We developed a readability score and used LLMs to evaluate and rank tests, comparing these rankings with human evaluations. By comparing different LLMs and techniques for assigning readability scores, we identified approaches that closely matched human evaluations, demonstrating that LLMs can successfully rate the readability of test cases. The GPT-4 Turbo Simple Prompt model performed best, with a correlation of 0.7632 with human evaluations.
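As a rough illustration of the evaluation described in the abstract, the sketch below shows one way to compare LLM-assigned readability scores against human ratings using Spearman rank correlation. The prompt wording, the 1-10 score scale, the example score lists, and the choice of Spearman (rather than another correlation measure) are assumptions made for illustration only and are not taken from the thesis.

```python
# Illustrative sketch (not the thesis implementation): compare LLM-assigned
# readability scores with human ratings via Spearman rank correlation.
from scipy.stats import spearmanr


def build_readability_prompt(test_code: str) -> str:
    """Hypothetical 'simple prompt': ask a model for a 1-10 readability score."""
    return (
        "Rate the readability of the following unit test on a scale from 1 to 10. "
        "Respond with a single number.\n\n" + test_code
    )


# Hypothetical scores for the same set of auto-generated tests.
llm_scores = [8, 5, 9, 3, 7, 6]    # e.g. parsed from LLM responses
human_scores = [7, 4, 9, 2, 8, 5]  # e.g. averaged human ratings

# Rank correlation indicates how closely the LLM ranking matches the human ranking.
rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman correlation: {rho:.4f} (p={p_value:.4f})")
```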
