How well does GPT-3.5 perform on course assignments from the TU Delft Computer science and engineering Bachelor?

Finding themes in course assignments GPT-3.5 performs well on and does not perform well on

Abstract

Since the emergence of large language models (LLMs), they have taken a prominent role in today’s society. From society, they have also found their way into the field of education. In this research paper, we therefore looked into assignments and exams from the TU Delft Computer Science and Engineering bachelor and assessed on which problems Generative Pre-trained Transformer (GPT) version 3.5, the current version used by ChatGPT, performs well (i.e. at or above the pass rate) and on which problems it performs poorly (i.e. below the pass rate). For our research, we collected assignments by asking professors for consent, to make sure our research was ethically sound. Having given consent, professors could either send us material, which allowed a deeper analysis, or allow us to scrape the course page on Brightspace (the site where TU Delft courses are hosted). Once all the questions were gathered, we processed them by entering them as prompts into ChatGPT. We gathered the results and categorized each answer as right or wrong. We did all this with as few modifications to the questions as possible. The only modifications we made were corrections of errors introduced when copying from a PDF, for example, C becoming e after copying. From the results, we found that ChatGPT has its limitations, particularly in understanding large pieces of code and in complex mathematical reasoning. However, the model performed well in defining concepts and connecting different ideas. We suggest that GPT lacks a comprehensive understanding of coding principles, which hinders its ability to comprehend code. Future work could include exploring other LLMs, such as GPT-4, and comparing their performance. Further work could also look at assignments from other universities, possibly in different educational fields. Additionally, different prompting techniques could be investigated to enhance the model’s accuracy and reliability.