Distil-CodeGPT: Distilling Code-Generation Models for Local Use
E.L. Malmsten (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Al-Kaswan – Mentor (TU Delft - Software Engineering)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
A. van Deursen – Mentor (TU Delft - Software Technology)
Avishek Anand – Graduation committee member (TU Delft - Web Information Systems)
The GitHub repository of the project
https://github.com/AISE-TUDelft/LLM4CodeCompression/tree/main/distill-CodeGPT
Abstract
The application of large language models (LLMs) to programming tasks, such as automatic code completion, has seen a significant upswing in recent years. However, due to their computational demands, these models have to run on servers, which both requires users to have a steady internet connection and raises potential privacy concerns. This study therefore explores the feasibility of compressing LLMs for code using knowledge distillation (KD), thereby enabling local use of these models. Existing research has primarily focused on the efficacy of KD for compressing BERT models on natural-language tasks; its application to GPT models for coding tasks, and the impact of applying KD during training rather than during pre-training, remain largely unexplored. To address these gaps, we adapted DistilBERT, a pre-training KD algorithm for distilling BERT models on language tasks. Our adapted model, Distil-CodeGPT, uses in-training KD to compress LLMs for Python code. The findings suggest that a substantial reduction in model size is achievable, albeit at a cost in predictive accuracy. Specifically, using 8 layers instead of the original 12 resulted in a 24% reduction in disk size and a 28% speed increase, with an accompanying 11% decrease in accuracy. These results show that the approach has potential and is a solid first step toward smaller code models.
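As a brief illustration of the technique, the sketch below shows a DistilBERT-style soft-target distillation loss applied to next-token prediction, as it might be used during in-training KD. The function name, temperature, and loss weights are illustrative assumptions rather than the exact Distil-CodeGPT implementation; see the repository linked above for the actual code.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Illustrative sketch of a DistilBERT-style distillation objective for
    # next-token prediction; hyperparameters are assumptions, not the
    # settings used by Distil-CodeGPT.
    # Soft-target term: KL divergence between the temperature-softened
    # teacher and student token distributions.
    soft_kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: standard language-modeling cross-entropy
    # against the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Weighted combination of the soft- and hard-target terms.
    return alpha * soft_kl + (1.0 - alpha) * ce

During training, the frozen teacher (e.g., the original 12-layer CodeGPT) and the smaller student both process the same batch, and the student is updated to minimize this combined loss.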