LLM of Babel: Evaluation of LLMs on code for non-English use-cases

Bachelor Thesis (2024)
Authors

M. Ziemlewski (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Supervisors

Jonathan Katzy (TU Delft - Software Engineering)

Arie Van Deursen (TU Delft - Software Engineering)

Maliheh Izadi (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research evaluates the performance of Meta's Code Llama 7B model in generating comments for Java code written in Polish. Using a mixed-methods approach, we combine quantitative and qualitative analyses to assess the model's accuracy and limitations. We preprocess a dataset of Polish Java code from GitHub, apply a Fill-in-the-Middle objective for code comment completion, and evaluate the results using the BLEU and ROUGE-L metrics. Additionally, we manually evaluate approximately 1150 generated comments and document the errors encountered. Based on these findings, we iteratively develop a taxonomy of errors using an open coding approach.

Through expert evaluation, we find that the BLEU metric is limited in assessing comment quality for non-English languages: its scores differ substantially from human judgments. Our research identifies the most frequent errors in code comment completion in Polish: generation of code snippets, copying of context, late termination, hallucinations, and repetitions. Only 25.2% of the generated comments were classified as correct. This study is part of broader research covering multiple models and various non-English languages. We aim to raise awareness of the accessibility of large language models for code in non-English environments, thereby improving their inclusivity.
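
To make the overlap-based metrics mentioned above more concrete, the following is a minimal sketch of a sentence-level ROUGE-L score, computed from the longest common subsequence (LCS) of tokens. It is an illustration only: the function names, the lowercasing, the whitespace tokenization, and the use of a plain F1 (equal weight on precision and recall) are assumptions made here, not the tokenization or metric implementation used in the thesis. The Polish example strings are likewise invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; prev holds the previous row of the table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Sentence-level ROUGE-L as an F1 score over whitespace tokens.

    Assumption: lowercased whitespace tokenization and equal weighting of
    precision and recall; real evaluations may tokenize and weight differently.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical reference comment and model completion.
    reference = "zwraca liczbę elementów w liście"
    candidate = "zwraca liczbę elementów listy"
    print(round(rouge_l_f1(reference, candidate), 4))
```

In this example the two comments share a three-token subsequence, so precision is 3/4, recall is 3/5, and the F1 score is 2/3. The candidate is penalized for the inflected form "listy" even though a human reader might judge it acceptable, which hints at why purely lexical overlap metrics can diverge from human evaluation in a highly inflected language such as Polish.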

Files

License info not available