Empirical Study on Test Generation Using GitHub Copilot

Abstract

Writing unit tests is a crucial task in the software development lifecycle, ensuring the correctness of the software being developed. Because it is time-consuming and laborious, however, it is often neglected by software engineers. Numerous automatic test generation tools have been devised to ease unit testing efforts, but the tests they produce are typically difficult to understand. Recently, Large Language Models (LLMs) have shown promising results in generating unit tests and in supporting other software engineering tasks, as they are capable of producing natural-looking (human-like) source code and text. In this thesis, we investigate the usability of tests generated by GitHub Copilot, a proprietary closed-source code generation tool that uses an LLM and integrates into well-known IDEs. We evaluate GitHub Copilot's test generation abilities both within an existing test suite and without one, and we evaluate the impact of different code commenting strategies on the generated tests in both settings. To do so, we devise a set of usability aspects against which to assess Copilot's test generations. In total, we investigate the usability of 290 tests generated by GitHub Copilot. Our findings reveal that, within an existing test suite, approximately 45.28% of the tests generated by Copilot pass, while the majority (54.72%) are failing, broken, or empty. Tests generated without an existing test suite are less usable than those generated within one: the vast majority (92.45%) are failing, broken, or empty, and only 7.55% pass, with most of the passing tests providing less branch coverage than human-written tests.
Finally, we find that, within an existing test suite, prompting with a code usage example comment yielded the most usable generations. In contrast, when there is no existing test suite, a comment combining instructive natural language with a code usage example yielded the most usable test generations.
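To make the two commenting strategies concrete, the following is a minimal sketch of what such prompts might look like. This illustration assumes a Python/pytest setting; the `add` function, the exact comment wording, and the resulting test are hypothetical and are not taken from the study itself.

```python
def add(a, b):
    """Return the sum of a and b."""
    return a + b


# Strategy 1 (hypothetical): a code usage example comment only.
# >>> add(2, 3)
# 5

# Strategy 2 (hypothetical): instructive natural language combined
# with a code usage example.
# Write a unit test for the function `add`.
# Example: add(2, 3) should return 5.

def test_add():
    # A test of the kind a tool like Copilot might complete
    # from the comments above.
    assert add(2, 3) == 5
```

In both strategies the comment acts as the prompt: the first gives the model only an input/output example, while the second additionally states the task in natural language before showing the example.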