Beyond the Traceback: Using LLMs for Adaptive Explanations of Programming Errors

Master Thesis (2025)
Author(s)

A.R. Moraru (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Ujwal Gadiraju – Mentor (TU Delft - Web Information Systems)

Shreyan Biswas – Mentor (TU Delft - Web Information Systems)

Przemysław Pawełczak – Graduation committee member (TU Delft - Embedded Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
29-08-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Error messages are a primary feedback channel in programming environments, yet they often obstruct progress, especially for novices. Although large language models (LLMs) are widely used for code generation and debugging assistance, there is limited empirical evidence that LLM-rephrased error messages consistently improve code-correction ability, and skill-adaptive designs remain largely unexplored. We introduce a framework that uses an LLM to rewrite standard Python interpreter errors in two styles tailored to user expertise: a pragmatic style, which is concise and action-oriented, and a contingent style, which provides scaffolded, actionable guidance organized by a clear argumentation model. To measure Python skill reliably, we first ran a pilot study that informed the design of a short eight-item multiple-choice assessment focused on debugging and error-message interpretation. We then used this instrument in the main study to classify participants as either novices or experts.
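The abstract does not specify the implementation, but the two-style rewriting idea can be sketched as follows. This is a minimal illustration assuming a generic LLM backend: the names (rewrite_error, call_llm, STYLE_PROMPTS), the prompt wording, and the prompt layout are placeholders, not the prompts or pipeline used in the thesis.

```python
import traceback

# Illustrative style templates (assumptions, not the thesis' actual prompts).
STYLE_PROMPTS = {
    "pragmatic": (
        "Rewrite the following Python error message so it is concise and "
        "action-oriented: state what went wrong and the single most likely fix."
    ),
    "contingent": (
        "Rewrite the following Python error message as scaffolded guidance: "
        "explain what the error means, why it likely occurred in this code, and "
        "walk the reader through the steps to diagnose and fix it."
    ),
}


def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM backend is used; swap in a real client here."""
    return "[LLM-rewritten error message would appear here]"


def rewrite_error(exc: BaseException, code: str, style: str = "pragmatic") -> str:
    """Build a style-specific prompt from the interpreter traceback and the user's code."""
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    prompt = (
        f"{STYLE_PROMPTS[style]}\n\n"
        f"--- user code ---\n{code}\n\n"
        f"--- interpreter output ---\n{tb}"
    )
    return call_llm(prompt)


if __name__ == "__main__":
    user_code = "total = '3' + 4"
    try:
        exec(user_code)
    except Exception as exc:  # capture the standard interpreter error
        print(rewrite_error(exc, user_code, style="contingent"))
```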

To gauge the effectiveness of our LLM-enhanced programming error messages (PEMs), we evaluated the framework in a crowdsourced Prolific study with 103 participants. We measured objective outcomes such as fix rate, time to fix, and number of fix attempts, while also capturing subjective perceptions of the PEMs, including readability, cognitive load, and authoritativeness. Objectively, LLM-enhanced PEMs showed favorable trends but did not produce statistically significant improvements over the standard interpreter. Subjectively, novices and experts alike rated the pragmatic messages as significantly more readable and helpful, lower in intrinsic and extraneous cognitive load, and considerably less authoritative. Contingent messages exceeded the baseline on average but did not consistently reach statistical significance across all of our measures, which points to a need for tighter control of error-message verbosity and granularity, particularly for beginners.

These results show that LLMs, even relatively small ones, are already capable of delivering targeted text-rewriting interventions that improve the perceived quality of error feedback. Future work should validate these effects at larger scale and across programming languages, expand coverage of real-world error contexts, and pursue true adaptivity, in which error-message style and level of detail adjust dynamically to user skill and task state.
