A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Conference Paper (2025)
Author(s)

J.B. Katzy (TU Delft - Software Engineering)

Yongcheng Huang (Student TU Delft)

Gopal Raj Panchu (Student TU Delft)

Maksym Ziemlewski (Student TU Delft)

Paris Loizides (Student TU Delft)

Sander Vermeulen (Student TU Delft)

A. van Deursen (TU Delft - Software Engineering)

M. Izadi (TU Delft - Software Engineering)

Research Group
Software Engineering
DOI related publication
https://doi.org/10.1145/3727582.3728683
Publication Year
2025
Language
English
Pages (from-to)
31-40
ISBN (electronic)
9798400715945

Abstract

Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess the reliability of standard metrics in capturing comment correctness across languages and evaluate their trustworthiness as judgment criteria. Through our open-coding investigation, we identify a taxonomy of 26 distinct error categories in model-generated code comments. These categories highlight variations in language cohesion, informativeness, and syntax adherence across natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant score overlap between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.