Readability Driven Test Selection

Using Large Language Models to Assign Readability Scores and Rank Auto-Generated Unit Tests

Bachelor Thesis (2024)
Author(s)

I. Zaidi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Deljouyi – Mentor (TU Delft - Software Engineering)

Andy Zaidman – Mentor (TU Delft - Software Technology)

A. Katsifodimos – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Writing tests enhances software quality, yet developers often deprioritize it. Existing tools for automatic test generation struggle with test understandability, primarily because they fail to consider context, producing identifiers, test names, and test data that are not contextually appropriate for the code under test. Current metrics for judging the understandability of unit tests are limited because they do not account for contextual factors such as the quality of comments. A metric for evaluating test readability is therefore essential for selecting the most comprehensible tests. This research builds on UTGen, incorporating LLMs to enhance the readability of automatically generated unit tests. We developed a readability score and used LLMs to evaluate and rank tests, comparing these rankings with human evaluations. By comparing different LLMs and techniques for assigning readability scores, we identified approaches that closely matched human evaluations, demonstrating that LLMs can successfully rate the readability of test cases. The GPT-4 Turbo Simple Prompt model performed best, with a correlation of 0.7632 with human evaluations.
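As a rough illustration of the evaluation described in the abstract, the sketch below shows one way to compare LLM-assigned readability scores against human ratings using Spearman rank correlation. The prompt wording, the 1-10 score scale, the example score lists, and the choice of Spearman (rather than another correlation measure) are assumptions made for illustration only and are not taken from the thesis.

```python
# Illustrative sketch (not the thesis implementation): compare LLM-assigned
# readability scores with human ratings via Spearman rank correlation.
from scipy.stats import spearmanr


def build_readability_prompt(test_code: str) -> str:
    """Hypothetical 'simple prompt': ask a model for a 1-10 readability score."""
    return (
        "Rate the readability of the following unit test on a scale from 1 to 10. "
        "Respond with a single number.\n\n" + test_code
    )


# Hypothetical scores for the same set of auto-generated tests.
llm_scores = [8, 5, 9, 3, 7, 6]    # e.g. parsed from LLM responses
human_scores = [7, 4, 9, 2, 8, 5]  # e.g. averaged human ratings

# Rank correlation indicates how closely the LLM ranking matches the human ranking.
rho, p_value = spearmanr(llm_scores, human_scores)
print(f"Spearman correlation: {rho:.4f} (p={p_value:.4f})")
```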
