Beyond the Traceback: Using LLMs for Adaptive Explanations of Programming Errors

Master Thesis (2025)
Author(s)

A.R. Moraru (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Ujwal Gadiraju – Mentor (TU Delft - Web Information Systems)

Shreyan Biswas – Mentor (TU Delft - Web Information Systems)

Przemysław Pawełczak – Graduation committee member (TU Delft - Embedded Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
29-08-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Error messages are a primary feedback channel in programming environments, yet they often obstruct progress, especially for novices. Although large language models (LLMs) are widely used for code generation and debugging assistance, there is limited empirical evidence that LLM-rephrased error messages consistently improve code-correction ability, and skill-adaptive designs remain largely unexplored. We introduce a framework that uses an LLM to rewrite standard Python interpreter errors in two styles tailored to user expertise: a pragmatic style, which is concise and action-oriented, and a contingent style, which provides scaffolded, actionable guidance organized by a clear argumentation model. To measure Python skill reliably, we first ran a pilot study that informed the design of a short eight-item multiple-choice assessment focused on debugging and error-message interpretation. We then used this instrument in the main study to classify participants as either novices or experts.
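The abstract does not specify the implementation, but the two-style rewriting idea can be sketched as follows. This is a minimal illustration assuming a generic LLM backend: the names (rewrite_error, call_llm, STYLE_PROMPTS), the prompt wording, and the prompt layout are placeholders, not the prompts or pipeline used in the thesis.

```python
import traceback

# Illustrative style templates (assumptions, not the thesis' actual prompts).
STYLE_PROMPTS = {
    "pragmatic": (
        "Rewrite the following Python error message so it is concise and "
        "action-oriented: state what went wrong and the single most likely fix."
    ),
    "contingent": (
        "Rewrite the following Python error message as scaffolded guidance: "
        "explain what the error means, why it likely occurred in this code, and "
        "walk the reader through the steps to diagnose and fix it."
    ),
}


def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM backend is used; swap in a real client here."""
    return "[LLM-rewritten error message would appear here]"


def rewrite_error(exc: BaseException, code: str, style: str = "pragmatic") -> str:
    """Build a style-specific prompt from the interpreter traceback and the user's code."""
    tb = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    prompt = (
        f"{STYLE_PROMPTS[style]}\n\n"
        f"--- user code ---\n{code}\n\n"
        f"--- interpreter output ---\n{tb}"
    )
    return call_llm(prompt)


if __name__ == "__main__":
    user_code = "total = '3' + 4"
    try:
        exec(user_code)
    except Exception as exc:  # capture the standard interpreter error
        print(rewrite_error(exc, user_code, style="contingent"))
```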

To gauge the effectiveness of our LLM-enhanced programming error messages (PEMs), we evaluated the framework in a crowdsourced Prolific study with 103 participants. We measured objective outcomes such as fix rate, time to fix, and number of fix attempts, while also capturing subjective perceptions of the PEMs, including readability, cognitive load, and authoritativeness. Objectively, LLM-enhanced PEMs showed favorable trends but did not produce statistically significant improvements over the standard interpreter. Subjectively, novices and experts alike rated the pragmatic messages as significantly more readable and helpful, lower in intrinsic and extraneous cognitive load, and considerably less authoritative. Contingent messages exceeded the baseline on average but did not consistently reach statistical significance across all of our measures, which points to a need for tighter control of error-message verbosity and granularity, particularly for beginners.

These results show that LLMs, even relatively small ones, are already capable of delivering targeted text-rewriting interventions that improve the perceived quality of error feedback. Future work should validate these effects at larger scale and across programming languages, expand coverage of real-world error contexts, and pursue true adaptivity, in which error-message style and level of detail adjust dynamically to user skill and task state.
