Title: Empirical Study on Test Generation Using GitHub Copilot
Author: El Haji, Khalid (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributors: Brandt, C.E. (mentor); Zaidman, A.E. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science | Software Technology
Date: 2023-06-21

Abstract: Writing unit tests is a crucial task in the software development lifecycle, ensuring the correctness of the software developed. Due to its time-consuming and laborious nature, however, it is often neglected by software engineers. Numerous automatic test generation tools have been devised to ease unit testing efforts, but these tools typically produce tests that are difficult to understand. Recently, Large Language Models (LLMs) have shown promising results in generating unit tests and in supporting other software engineering tasks. LLMs are capable of producing natural-looking (human-like) source code and text. In this thesis, we investigate the usability of tests generated by GitHub Copilot, a proprietary closed-source code generation tool that uses an LLM for its generations and integrates into well-known IDEs. We evaluate GitHub Copilot's test generation abilities both with and without an existing test suite. Furthermore, we evaluate the impact of different code commenting strategies on test generation, again both with and without an existing test suite. We devise a set of usability aspects along which to assess GitHub Copilot's test generations. In total, we investigate the usability of 290 tests generated by GitHub Copilot. Our findings reveal that within an existing test suite, approximately 45.28% of the tests generated by Copilot are passing tests; the majority (54.72%) are failing, broken, or empty. Furthermore, tests generated by Copilot without an existing test suite are less usable than those generated within one: the vast majority (92.45%) of these generations are failing, broken, or empty tests. Only 7.55% of tests generated without an existing test suite were passing, and most of them provided lower branch coverage than human-written tests. Finally, we find that prompting with a code usage example comment yielded the most usable generations within an existing test suite, whereas without an existing test suite, a comment combining instructive natural language with a code usage example yielded the most usable test generations.

Subject: Large Language Models; GitHub Copilot; Test Automation; Empirical Research
To reference this document use: http://resolver.tudelft.nl/uuid:57973043-efa8-4534-9afe-c64287c22d64
Bibliographical note: Replication package: https://zenodo.org/record/8025746
Part of collection: Student theses
Document type: master thesis
Rights: © 2023 Khalid El Haji
Files: thesis_khalidelhaji_final.pdf (PDF, 643.06 KB)
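
Note (illustrative, not from the thesis): the abstract refers to two code commenting strategies used to prompt Copilot for tests. The sketch below shows, under assumed conditions, what such prompts might look like; the language (Python), the function slugify, and the test names are hypothetical and do not come from the thesis or its replication package.

# Illustrative sketch only; slugify and these test stubs are hypothetical.

# Strategy 1: a code usage example comment preceding the test stub.
# >>> slugify("Hello, World!")
# 'hello-world'
def test_slugify_from_usage_example():
    ...  # Copilot would be asked to complete the test body here

# Strategy 2: instructive natural language combined with a code usage example.
# Write a test checking that slugify lowercases its input and replaces spaces
# and punctuation with hyphens, e.g. slugify("Hello, World!") == "hello-world".
def test_slugify_from_instruction_and_example():
    ...  # Copilot would be asked to complete the test body here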