This paper investigates the relationship between the educational value of input code and the subsequent inference performance of code large language models (LLMs) on completion tasks. Results were obtained on The Heap dataset using the SmolLM2, StarCoder 2, and Mellum models. Performance was measured by comparing generated outputs with the ground truth, where high similarity indicates high performance. We analyse how factors such as programming language, model size, task type, and the granularity of educational value affect performance across the range of educational values. We find that most factors show no clear relationship with educational value, as most metrics plateau; the exception is exact match, which exhibits a consistent negative correlation with educational value. Additionally, a consistent turning point appears around an educational value of 1.75, before which performance tends to relate more positively to educational value. These results highlight the influence of input quality on LLM behaviour and offer insights for more effective training and evaluation strategies.
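To make the evaluation setup concrete, the following is a minimal sketch of scoring a single completion against its ground truth. It assumes a binary exact-match score and a character-level sequence-matcher ratio as the similarity metric; the abstract names only exact match explicitly, so the specific similarity function shown here is an illustrative assumption, not the paper's metric suite.

```python
import difflib

def exact_match(generated: str, ground_truth: str) -> float:
    """Binary exact-match: 1.0 if the completion reproduces the reference verbatim."""
    return float(generated.strip() == ground_truth.strip())

def similarity(generated: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1]; higher indicates better performance.
    Assumed metric for illustration (difflib ratio), not the paper's exact choice."""
    return difflib.SequenceMatcher(None, generated, ground_truth).ratio()

# Example: a near-miss completion scores 0 on exact match but high on similarity.
gen = "return a + b"
ref = "return a + b  # sum"
print(exact_match(gen, ref))  # 0.0
print(similarity(gen, ref))   # ~0.77
```

Under this kind of setup, exact match is the strictest metric, which is consistent with it behaving differently from the plateauing similarity-based metrics reported above.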