Benchmarking AI Models in Software Engineering

A Review, Search Tool, and Unified Approach for Elevating Benchmark Quality

Review (2025)
Author(s)

Roham Koohestani (Student TU Delft)

Philippe de Bekker (Student TU Delft)

Begüm Koç (Student TU Delft)

Maliheh Izadi (TU Delft - Software Engineering)

Research Group
Software Engineering
DOI (related publication)
https://doi.org/10.1109/TSE.2025.3644183
Publication Year
2025
Language
English
Issue number
2
Volume number
52
Pages (from-to)
651-674
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Benchmarks are essential for unified evaluation and reproducibility. The rapid rise of Artificial Intelligence for Software Engineering (AI4SE) has produced numerous benchmarks for tasks such as code generation and bug repair. However, this proliferation has led to major challenges: (1) fragmented knowledge across tasks, (2) difficulty in selecting contextually relevant benchmarks, (3) lack of standardization in benchmark creation, and (4) flaws that limit utility. Addressing these requires a dual approach: systematically mapping existing benchmarks for informed selection and defining unified guidelines for robust, adaptable benchmark development. We conduct a review of 247 studies, identifying 273 AI4SE benchmarks published since 2014. We categorize them, analyze their limitations, and expose gaps in current practices. Building on these insights, we introduce BenchScout, an extensible semantic search tool for locating suitable benchmarks. BenchScout employs automated clustering over contextual embeddings of benchmark-related studies, followed by dimensionality reduction. In a user study with 22 participants, BenchScout achieved usability, effectiveness, and intuitiveness scores of 4.5, 4.0, and 4.1 out of 5. To raise benchmarking standards, we propose BenchFrame, a unified approach to improving benchmark quality. Applying BenchFrame to HumanEval yielded HumanEvalNext, which features corrected errors, improved language conversion, higher test coverage, and greater difficulty. Evaluating 10 state-of-the-art code models on HumanEval, HumanEvalPlus, and HumanEvalNext revealed that pass@1 on HumanEvalNext dropped by an average of 31.22% and 19.94% relative to HumanEval and HumanEvalPlus, respectively, underscoring the need for continuous benchmark refinement. We further examine BenchFrame's scalability through an agentic pipeline and confirm its generalizability on the MBPP dataset. Lastly, we publicly release the materials of our review, user study, and the enhanced benchmark.
https://github.com/AISE-TUDelft/AI4SE-benchmarks
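The BenchScout pipeline sketched in the abstract (contextual embeddings of benchmark-related studies, dimensionality reduction, then clustering) can be illustrated in miniature. The snippet below is a hedged sketch, not the tool's actual implementation: the toy embedding vectors are invented values, the "reduction" step simply keeps the first two coordinates in place of a real method such as PCA or UMAP, and the clustering is a minimal k-means with deterministic farthest-point initialization.

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means over 2-D points; returns one cluster label per point."""
    # Deterministic farthest-point initialization: start with the first
    # point, then repeatedly add the point farthest from all chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Toy "contextual embeddings" of five benchmark studies (hypothetical values):
# two code-generation papers and three bug-repair papers.
embeddings = [(0.1, 0.2, 0.9), (0.2, 0.1, 0.8),
              (0.9, 0.8, 0.1), (0.8, 0.9, 0.2), (0.85, 0.85, 0.15)]

# Stand-in dimensionality reduction: keep the first two coordinates.
reduced = [(x, y) for x, y, _ in embeddings]

labels = kmeans(reduced, k=2)
```

On this toy data the first two studies land in one cluster and the last three in the other, mirroring how BenchScout groups related benchmarks so users can browse them by task.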

Files

Taverne

File under embargo until 16-06-2026