Error messages are a primary feedback channel in programming environments, yet they often obstruct progress, especially for novices. Although large language models (LLMs) are widely used for code generation and debugging assistance, there is limited empirical evidence that LLM-rephrased error messages consistently improve code correction outcomes, and skill-adaptive designs remain largely unexplored. We introduce a framework that uses an LLM to rewrite standard Python interpreter errors in two styles tailored to user expertise: a pragmatic style, which is concise and action oriented, and a contingent style, which provides scaffolded, actionable guidance organized by a clear argumentation model. To measure Python skill reliably, we first ran a pilot study that informed the design of a short eight-item multiple-choice assessment focused on debugging and error-message interpretation. We then used this instrument in the main study to classify participants as novices or experts.
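To make the rewriting step concrete, the sketch below illustrates the kind of pipeline the framework describes: a raw interpreter error is captured and combined with a style-specific instruction before being sent to an LLM. This is a minimal illustration, not the authors' implementation; the prompt wording and the call_llm placeholder are assumptions for exposition only.

```python
# Minimal sketch (assumed, not the paper's code) of rewriting a Python error
# in a "pragmatic" or "contingent" style via an LLM.
import traceback

STYLE_INSTRUCTIONS = {
    "pragmatic": (
        "Rewrite the error concisely. State what went wrong and the single "
        "most likely fix, in at most two sentences."
    ),
    "contingent": (
        "Rewrite the error as scaffolded guidance: a claim (what went wrong), "
        "evidence (the relevant line), and suggested next steps."
    ),
}

def build_rewrite_prompt(exc: BaseException, style: str) -> str:
    """Combine the raw interpreter error with a style-specific instruction."""
    raw = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return f"{STYLE_INSTRUCTIONS[style]}\n\nOriginal Python error:\n{raw}"

def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM endpoint is used; echoes the prompt here."""
    return prompt

if __name__ == "__main__":
    try:
        int("not a number")  # deliberately raise a ValueError
    except ValueError as exc:
        print(call_llm(build_rewrite_prompt(exc, "pragmatic")))
```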
To gauge the effectiveness of our LLM-enhanced programming error messages (PEMs), we evaluated the framework in a crowdsourced Prolific study with 103 participants. We measured objective outcomes, including fix rate, time to fix, and number of fix attempts, and captured subjective perceptions of the PEMs, including readability, cognitive load, and authoritativeness. Objectively, LLM-enhanced PEMs showed favorable trends but did not produce statistically significant improvements over the standard interpreter messages. Subjectively, novices and experts alike rated the pragmatic messages as significantly more readable and helpful, lower in intrinsic and extraneous cognitive load, and considerably less authoritative. Contingent messages exceeded the baseline on average but did not consistently reach statistical significance across our measures, pointing to a need for tighter control of error-message verbosity and granularity, particularly for beginners.
These results show that LLMs, even relatively small ones, can already deliver targeted text-rewriting interventions that improve the perceived quality of error feedback. Future work should validate these effects at larger scale and across programming languages, expand coverage of real-world error contexts, and pursue true adaptivity, in which error-message style and level of detail adjust dynamically to user skill and task state.