Distil-CodeGPT: Distilling Code-Generation Models for Local Use
E.L. Malmsten (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Al-Kaswan – Mentor (TU Delft - Software Engineering)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
A. van Deursen – Mentor (TU Delft - Software Technology)
Avishek Anand – Graduation committee member (TU Delft - Web Information Systems)
The GitHub repository of the project
https://github.com/AISE-TUDelft/LLM4CodeCompression/tree/main/distill-CodeGPT
Abstract
The application of large language models (LLMs) to programming tasks, such as automatic code completion, has seen a significant upswing in recent years. However, due to their computational demands, these models have to run on servers, which both requires users to have a steady internet connection and raises potential privacy concerns. This study therefore explores the feasibility of compressing LLMs for code using knowledge distillation (KD), thereby enabling local use of these models. Existing research has primarily focused on the efficacy of KD for compressing BERT models on natural-language tasks; its application to GPT models for coding tasks, and the impact of applying KD during training rather than during pre-training, remain largely unexplored. To address these gaps, we adapted DistilBERT, a pre-training KD algorithm for distilling BERT models on language tasks. Our adapted model, Distil-CodeGPT, uses in-training KD to compress LLMs for Python code. The findings suggest that a substantial reduction in model size is achievable, albeit at a cost in predictive accuracy. Specifically, using 8 layers instead of the original 12 resulted in a 24% reduction in disk size and a 28% speed increase, with an accompanying 11% decrease in accuracy. These results show that the approach has potential and is a solid first step toward smaller code models.
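As a brief illustration of the technique, the sketch below shows a DistilBERT-style soft-target distillation loss applied to next-token prediction, as it might be used during in-training KD. The function name, temperature, and loss weights are illustrative assumptions rather than the exact Distil-CodeGPT implementation; see the repository linked above for the actual code.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Illustrative sketch of a DistilBERT-style distillation objective for
    # next-token prediction; hyperparameters are assumptions, not the
    # settings used by Distil-CodeGPT.
    # Soft-target term: KL divergence between the temperature-softened
    # teacher and student token distributions.
    soft_kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: standard language-modeling cross-entropy
    # against the ground-truth next tokens.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Weighted combination of the soft- and hard-target terms.
    return alpha * soft_kl + (1.0 - alpha) * ce

During training, the frozen teacher (e.g., the original 12-layer CodeGPT) and the smaller student both process the same batch, and the student is updated to minimize this combined loss.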