A Qualitative Investigation into LLM-Generated Multilingual Code Comments and Automatic Evaluation Metrics

Conference Paper (2025)

Author(s)

Jonathan Katzy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Yongcheng Huang (Student TU Delft)

Gopal Raj Panchu (Student TU Delft)

Maksym Ziemlewski (Student TU Delft)

Paris Loizides (Student TU Delft)

Sander Vermeulen (Student TU Delft)

Arie van Deursen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Maliheh Izadi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Software Engineering

Large Language Models, Multilingual, Open Coding, Comment Generation, Qualitative Evaluation

DOI related publication

https://doi.org/10.1145/3727582.3728683 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:78cf59de-8083-4195-9bff-92fc7844a13b

Publication Year

2025

Language

English

Pages (from-to)

31-40

Publisher

ACM

ISBN (electronic)

9798400715945

Event

21st International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE 2025, co-located with the International Conference on the Foundations of Software Engineering, FSE 2025 (2025-06-26), Trondheim, Norway

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models are essential coding assistants, yet their training is predominantly English-centric. In this study, we evaluate the performance of code language models in non-English contexts, identifying challenges in their adoption and integration into multilingual workflows. We conduct an open-coding study to analyze errors in code comments generated by five state-of-the-art code models (CodeGemma, CodeLlama, CodeQwen1.5, GraniteCode, and StarCoder2) across five natural languages: Chinese, Dutch, English, Greek, and Polish. Our study yields a dataset of 12,500 labeled generations, which we publicly release. We then assess how reliably standard metrics capture comment correctness across languages and whether they can be trusted as judgment criteria. Through our open-coding investigation, we identify a taxonomy of 26 distinct error categories in model-generated code comments. These categories highlight variations in language cohesion, informativeness, and syntax adherence across different natural languages. Our analysis shows that, while these models frequently produce partially correct comments, modern neural metrics fail to reliably differentiate meaningful completions from random noise. Notably, the significant overlap in scores between expert-rated correct and incorrect comments calls into question the effectiveness of these metrics in assessing generated comments.
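
The score overlap mentioned in the abstract can be made concrete with a small sketch. The snippet below is not the authors' analysis; it is one common way to quantify how well a metric separates two groups, assuming each comment has a numeric metric score and an expert label. The function name `separation_auc` and the toy scores are hypothetical and chosen only for illustration. A value near 0.5 would mean the metric barely distinguishes correct from incorrect comments, while a value near 1.0 would mean clean separation.

```python
from itertools import product


def separation_auc(correct_scores, incorrect_scores):
    """Probability that a randomly chosen correct comment receives a higher
    metric score than a randomly chosen incorrect one (ties count as half).

    This is the pairwise (Mann-Whitney) formulation of ROC AUC.
    """
    if not correct_scores or not incorrect_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for c, i in product(correct_scores, incorrect_scores):
        if c > i:
            wins += 1.0
        elif c == i:
            wins += 0.5
    return wins / (len(correct_scores) * len(incorrect_scores))


# Hypothetical toy scores for illustration only; NOT data from the paper.
toy_correct = [0.62, 0.55, 0.71, 0.48]
toy_incorrect = [0.60, 0.52, 0.69, 0.50]

print(separation_auc(toy_correct, toy_incorrect))
```

The pairwise form is quadratic in the number of comments, which is acceptable for a sketch; for a dataset of thousands of generations a rank-based implementation would be the more practical choice.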

Files

3727582.3728683.pdf

(pdf | 2.58 Mb)