AI for Software Engineering: Reviewing and Improving Benchmarking Practices

Master Thesis (2024)
Author(s)

P.M. de Bekker (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

Arie van Deursen – Mentor (TU Delft - Software Engineering)

Maria Soledad Pera – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
10-07-2024
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Artificial Intelligence (AI) has advanced rapidly and now significantly impacts software engineering through AI-driven tools such as ChatGPT and Copilot. These tools, which have attracted substantial commercial interest, rely heavily on the performance of their underlying models, which is assessed via benchmarks. However, the focus on performance scores has often overshadowed the quality and rigor of the benchmarks themselves, as evidenced by the scarcity of studies on this topic. This thesis addresses that gap by reviewing and improving benchmarking practices in the field of AI for software engineering (AI4SE).

First, a categorized overview and analysis of nearly a hundred prominent AI4SE benchmarks from the past decade are provided. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Finally, a key contribution of this work is HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings show substantial drops in pass@1 scores across various large language models, underscoring the need for well-maintained, comprehensive benchmarks.
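For context on the pass@1 metric mentioned above: pass@k is commonly computed with the unbiased estimator introduced alongside the original HumanEval benchmark (Chen et al., 2021). The sketch below is illustrative only and is not taken from the thesis; it assumes n generated samples per problem, of which c pass all tests.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
        # where n = samples generated per problem, c = samples passing all tests.
        if n - c < k:
            return 1.0  # every size-k subset contains at least one passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example (illustrative numbers): 200 samples per problem, 37 passing
    print(pass_at_k(n=200, c=37, k=1))  # pass@1 = 0.185

Under this estimator, stricter test suites such as those in HumanEvalPro reduce the number of passing samples c and therefore lower pass@1, which is the kind of drop the abstract refers to.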

This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field.
