LLM of Babel: Evaluation of LLMs on code for non-English use-cases

Bachelor Thesis (2024)
Author(s)

Y. Huang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Arie van Deursen – Mentor (TU Delft - Software Engineering)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

J.B. Katzy – Mentor (TU Delft - Software Engineering)

Gosia Migut – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Since the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have seen widespread adoption globally, particularly in the field of programming. However, current evaluations and benchmarks of LLMs on code focus primarily on English use cases. In this study, we assess the performance of LLMs in generating Chinese comments for Java code, analyzing their output through open coding. Our experiments show that model-specific and semantic errors are prevalent in the generated Chinese comments, while grammatical issues are comparatively rare, which we attribute to characteristics of the Chinese language. Additionally, we validate the potential for quantitatively analyzing semantic errors, especially hallucinations, by examining the cosine similarity of word embeddings. Based on our findings, we propose an error taxonomy for evaluating LLMs on code in non-English scenarios and demonstrate the feasibility of using cosine similarity of word embeddings to judge the quality of generated code comments.
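The abstract describes quantifying semantic errors, notably hallucinations, via the cosine similarity of word embeddings between generated and reference comments. The following is a minimal sketch of that idea; the thesis does not specify an embedding model here, so the multilingual sentence-embedding model, the example comments, and any threshold for flagging errors are illustrative assumptions.

```python
# Minimal sketch: scoring an LLM-generated Chinese code comment against a
# human-written reference by cosine similarity of sentence embeddings.
# The model name below is an illustrative choice, not prescribed by the thesis.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "计算两个整数的最大公约数并返回结果。"  # human-written reference comment
generated = "返回两个整数的最大公约数。"            # LLM-generated comment

# Encode both comments and compute their cosine similarity in [-1, 1];
# a low score can flag semantic drift or hallucinated content for manual review.
embeddings = model.encode([reference, generated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")
```

In such a setup, pairs scoring below a chosen similarity threshold would be candidates for closer inspection under the proposed error taxonomy.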
