AI for Software Engineering: Reviewing and Improving Benchmarking Practices

Master Thesis (2024)
Author(s)

P.M. de Bekker (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

Arie van Deursen – Mentor (TU Delft - Software Engineering)

Maria Soledad Pera – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
10-07-2024
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Artificial Intelligence (AI) has advanced rapidly and now significantly impacts software engineering through AI-driven tools such as ChatGPT and Copilot. These tools, which have attracted substantial commercial interest, rely heavily on the performance of their underlying models, which is assessed via benchmarks. However, the focus on performance scores has often overshadowed the quality and rigor of the benchmarks themselves, as evidenced by the scarcity of studies on this topic. This thesis addresses that gap by reviewing and improving benchmarking practices in the field of AI for software engineering (AI4SE).

First, a categorized overview and analysis of nearly a hundred prominent AI4SE benchmarks from the past decade are provided. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Finally, a key contribution of this work is HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings show substantial drops in pass@1 scores across various large language models, underscoring the need for well-maintained, comprehensive benchmarks.
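For context on the pass@1 metric mentioned above: pass@k is commonly computed with the unbiased estimator introduced alongside the original HumanEval benchmark (Chen et al., 2021). The sketch below is illustrative only and is not taken from the thesis; it assumes n generated samples per problem, of which c pass all tests.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
        # where n = samples generated per problem, c = samples passing all tests.
        if n - c < k:
            return 1.0  # every size-k subset contains at least one passing sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example (illustrative numbers): 200 samples per problem, 37 passing
    print(pass_at_k(n=200, c=37, k=1))  # pass@1 = 0.185

Under this estimator, stricter test suites such as those in HumanEvalPro reduce the number of passing samples c and therefore lower pass@1, which is the kind of drop the abstract refers to.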

This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field.
