Effective LLM-based automated program repair (APR) methods can lead to massive cost reductions and have improved significantly in recent years. However, the validity of many APR evaluations as currently conducted is at risk due to data leakage: prior research has shown that LLMs can memorize solutions to problems when the evaluation benchmark overlaps with the training set, leading to overinflated results.
In this study, we examine the potential of metamorphic transformations to mitigate the effects of data leakage. To this end, we create transformed variants of two popular, well-established benchmarks, Defects4J and GitBug-Java, and evaluate the APR performance of several LLMs on both the original benchmarks and their transformed counterparts. In addition, we investigate to what extent our results align with data leakage metrics from other studies.
Our results show that state-of-the-art LLMs for code repair exhibit significant performance degradation (up to 4.1% for Claude-3.7-Sonnet) on a metamorphically transformed Defects4J benchmark. Moreover, we find a significant correlation between our results and the negative log-likelihood as a metric of data leakage. These findings demonstrate the potential of metamorphic transformations to mitigate the overinflation of evaluation results due to data leakage. We recommend that researchers report results on both original and metamorphically transformed benchmarks in future evaluations.
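To illustrate the idea of a metamorphic transformation, the following is a minimal sketch of one common semantics-preserving rewrite: renaming identifiers in a buggy Java snippet. The function name `rename_identifiers`, the example snippet, and the rename mapping are illustrative assumptions, not the specific transformations used in the study (a regex-based rename like this is only safe for simple snippets; real tools would operate on the AST).

```python
import re

def rename_identifiers(java_source: str, mapping: dict) -> str:
    """Apply a semantics-preserving identifier rename: a simple
    metamorphic transformation. Assumes no mapped name occurs inside
    string literals or comments (an AST-based rename would handle those)."""
    for old, new in mapping.items():
        # \b word boundaries avoid renaming substrings of longer identifiers.
        java_source = re.sub(rf"\b{re.escape(old)}\b", new, java_source)
    return java_source

# Hypothetical buggy method (seeded bug: '-' should be '+').
buggy = "int computeSum(int a, int b) { return a - b; }"
transformed = rename_identifiers(
    buggy, {"computeSum": "addValues", "a": "left", "b": "right"}
)
print(transformed)
# The bug and its correct fix are unchanged, but a model that memorized
# the original benchmark entry no longer sees a verbatim match.
```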