LLM of Babel: Evaluation of LLMs on code for non-English use-cases

Bachelor Thesis (2024)

Authors

M. Ziemlewski (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Supervisors

Jonathan Katzy (TU Delft - Software Engineering)

Arie Van Deursen (TU Delft - Software Engineering)

Maliheh Izadi (TU Delft - Software Engineering)

Faculty

Electrical Engineering, Mathematics and Computer Science

Evaluation, Language Models, Error Taxonomy, Open Coding, Error Classification, Non-English, Code Llama, Code Comment Completion

To reference this document use:

https://resolver.tudelft.nl/9306d5b2-bbc7-487a-bb56-46ad067d0da6

More Info

Publication Year

2024

Language

English

Graduation Date

25-06-2024

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research evaluates the performance of Meta's Code Llama 7B model in generating comments for Java code written in Polish. Using a mixed-methods approach, we combine quantitative and qualitative analyses to assess the model's accuracy and limitations. We preprocess a dataset of Polish Java code from GitHub, apply a Fill-in-the-Middle objective for code comment completion, and evaluate the results using the BLEU and ROUGE-L metrics. Additionally, we manually evaluate approximately 1150 generated comments and document the errors encountered. Based on the findings, we iteratively develop a taxonomy of errors using an open coding approach.
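As a rough illustration of the pipeline described above, the following Python sketch cuts a comment out of a Java file to form a Fill-in-the-Middle example and scores a generated comment against the original with a simple ROUGE-L F1. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names are hypothetical, the `<PRE>`/`<SUF>`/`<MID>` prompt layout is assumed to match Code Llama's infilling format, tokenization is plain whitespace splitting, and published ROUGE-L implementations may differ in tokenization and weighting.

```python
from typing import List, Tuple

# Assumption: sentinel strings for Code Llama's infilling prompt layout.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def split_for_infilling(source: str, start: int, end: int) -> Tuple[str, str, str]:
    """Cut source[start:end] (the comment body) out of the file.

    Returns (prefix, suffix, target); the target is what the model should fill in.
    """
    return source[:start], source[end:], source[start:end]


def build_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model generates the missing middle."""
    return f"{PRE} {prefix} {SUF}{suffix} {MID}"


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy example (invented for illustration, not taken from the thesis dataset).
source = "// Zwraca sumę dwóch liczb\nint suma(int a, int b) { return a + b; }\n"
start = source.index("Zwraca")
end = source.index("\n")
prefix, suffix, target = split_for_infilling(source, start, end)
prompt = build_prompt(prefix, suffix)
score = rouge_l_f1(target, "Zwraca sumę liczb")
```

Because the score depends only on token overlap, a comment that is correct but worded differently from the original can still score low, which is one reason a manual evaluation is useful alongside such metrics.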

Through an expert evaluation, we find that the BLEU metric is limited in assessing comment quality for non-English languages: its scores differ substantially from human evaluation. Our research identifies the most frequent errors in code comment completion in Polish: generation of code snippets, copying of context, late termination, hallucinations, and repetitions. Only 25.2% of the generated comments were classified as correct. This study is part of a broader research effort covering multiple models across various non-English languages. We aim to raise awareness of how accessible large language models for code are in non-English environments, and thereby to improve their inclusivity.

Files

LLM_of_Babel_Maksym_Ziemlewski... (pdf)

(pdf | 1.18 MB)

License info not available