LLM of Babel: Evaluation of LLMs on code for non-English use-cases

Bachelor Thesis (2024)
Author(s)

Y. Huang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Arie van Deursen – Mentor (TU Delft - Software Engineering)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

J.B. Katzy – Mentor (TU Delft - Software Engineering)

Gosia Migut – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Since the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have seen widespread adoption globally, particularly in the field of programming. However, current evaluations and benchmarks of LLMs on code focus primarily on English use cases. In this study, we assess the performance of LLMs in generating Chinese comments for Java code, analyzing their output through open coding. Our experiments show that model-specific and semantic errors are prevalent in the generated Chinese comments, while grammatical issues are comparatively rare, which we attribute to characteristics of the Chinese language. Additionally, we validate the potential for quantitatively analyzing semantic errors, especially hallucinations, by examining the cosine similarity of word embeddings. Based on our findings, we propose an error taxonomy for evaluating LLMs on code in non-English scenarios and demonstrate the feasibility of using cosine similarity of word embeddings to judge the quality of generated code comments.
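The abstract describes quantifying semantic errors, notably hallucinations, via the cosine similarity of word embeddings between generated and reference comments. The following is a minimal sketch of that idea; the thesis does not specify an embedding model here, so the multilingual sentence-embedding model, the example comments, and any threshold for flagging errors are illustrative assumptions.

```python
# Minimal sketch: scoring an LLM-generated Chinese code comment against a
# human-written reference by cosine similarity of sentence embeddings.
# The model name below is an illustrative choice, not prescribed by the thesis.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "计算两个整数的最大公约数并返回结果。"  # human-written reference comment
generated = "返回两个整数的最大公约数。"            # LLM-generated comment

# Encode both comments and compute their cosine similarity in [-1, 1];
# a low score can flag semantic drift or hallucinated content for manual review.
embeddings = model.encode([reference, generated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")
```

In such a setup, pairs scoring below a chosen similarity threshold would be candidates for closer inspection under the proposed error taxonomy.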
