LLM of Babel: Evaluation of LLMs on code for non-English use-cases
P. Loizides (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jonathan Katzy (TU Delft - Software Engineering)
Maliheh Izadi (TU Delft - Software Engineering)
Arie van Deursen (TU Delft - Software Engineering)
Abstract
This paper evaluates the performance of Large Language Models (LLMs), specifically StarCoder 2, on non-English code summarization, with a focus on the Greek language. We establish a hierarchical error taxonomy through an open-coding approach to deepen the understanding and improvement of LLMs in multilingual settings, and to identify the challenges associated with tokenization and the influence of mathematics-heavy training data. Our study includes a comprehensive analysis of error types, tokenization efficiency, and quantitative metrics such as BLEU, ROUGE, and semantic similarity. The findings highlight the importance of semantic similarity as a reliable performance metric and suggest the need for more inclusive tokenizers and training datasets to address the limitations and errors observed in non-English contexts.
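To make the notion of tokenization efficiency concrete, the sketch below compares how the StarCoder 2 tokenizer fragments the same comment written in English and in Greek. This is a minimal illustration, not the paper's evaluation harness; the `bigcode/starcoder2-3b` checkpoint name and the example sentences are assumptions for demonstration purposes.

```python
# Minimal sketch: compare tokenization efficiency of English vs. Greek text.
# Assumes the Hugging Face `transformers` library and the (assumed)
# bigcode/starcoder2-3b checkpoint for the StarCoder 2 tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

english = "Returns the sum of two integers."
greek = "Επιστρέφει το άθροισμα δύο ακεραίων."  # the same comment in Greek

for label, text in [("English", english), ("Greek", greek)]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Fertility: tokens produced per character. Higher values indicate the
    # tokenizer splits the language into smaller, less meaningful fragments,
    # which typically correlates with weaker model performance.
    print(f"{label}: {len(ids)} tokens, fertility = {len(ids) / len(text):.2f}")
```

Under this kind of measurement, a tokenizer trained predominantly on English and source code will generally show markedly higher fertility on Greek text, which is one concrete way the limitations discussed in the abstract can be quantified.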