CodeGPT on XTC

Compressing a CodeGPT Model Using Hybrid Layer Reduction and Extreme Quantisation through Knowledge Distillation

Bachelor Thesis (2023)
Author(s)

Aral de Moor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

Ali Al-Kaswan – Mentor (TU Delft - Software Engineering)

A. van Deursen – Mentor (TU Delft - Software Technology)

Avishek Anand – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Aral de Moor
Publication Year
2023
Language
English
Graduation Date
27-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Related content

Replication Code Repository

https://github.com/AISE-TUDelft/LLM4CodeCompression
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large language models are powerful because of their state-of-the-art language processing abilities, but they are extremely resource-intensive and steadily growing in size. Compressing such models for resource-constrained devices is therefore an active and promising research area. Despite their popularity, many novel compression techniques lack implementations for GPT models. We apply the XTC pipeline, consisting of layer reduction and quantisation through knowledge distillation, to a CodeGPT generative model, and evaluate the resulting models on the CodeXGLUE line-level code-completion benchmark. Based on this, we demonstrate that (1) XTC can be adapted to GPT-like models, with many of the findings of the original study carrying over; and (2) a 6-layer reduction with 1-bit weight and 8-bit activation quantisation reduces model size 15x and almost doubles inference speed, with minimal performance degradation. The resulting compressed models show promise for local code generation. By showing that a novel compression technique can be adapted to GPT-like models, we hope to inspire further research in this field.
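For readers unfamiliar with the technique, below is a minimal, hypothetical PyTorch sketch of the two steps the abstract combines: initialising a shallower student from a subset of teacher layers (layer reduction) and training it to match the teacher's softened output distribution (knowledge distillation). All names here (TinyGPT, make_student, distill_step) are illustrative, not from the thesis, and the extreme 1-bit weight / 8-bit activation quantisation stage of XTC is omitted; see the replication repository linked above for the actual implementation.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGPT(nn.Module):
    # Stand-in decoder-only model; the thesis compresses CodeGPT instead.
    def __init__(self, vocab=100, dim=64, layers=12, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            for _ in range(layers)
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        h = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        for blk in self.blocks:
            h = blk(h, src_mask=mask)  # causal self-attention
        return self.head(h)


def make_student(teacher, keep):
    # Layer reduction: copy the teacher, keep only the listed blocks.
    student = copy.deepcopy(teacher)
    student.blocks = nn.ModuleList(student.blocks[i] for i in keep)
    return student


def distill_step(student, teacher, ids, temperature=2.0):
    # Knowledge distillation: KL divergence between the student's and the
    # teacher's temperature-softened next-token distributions.
    with torch.no_grad():
        t_logits = teacher(ids)
    s_logits = student(ids)
    loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    return loss * temperature ** 2


teacher = TinyGPT(layers=12).eval()
student = make_student(teacher, keep=[0, 2, 4, 6, 8, 10])  # 12 -> 6 layers
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

ids = torch.randint(0, 100, (2, 16))  # dummy token batch
loss = distill_step(student, teacher, ids)
loss.backward()
optimizer.step()

The 6-of-12 layer selection mirrors the 6-layer reduction reported in the abstract; in practice the kept layers and the distillation objective (logit-level vs. intermediate-representation matching) are design choices explored in the thesis.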

Files

CSE3000_XTC_Aral_Final_.pdf
(pdf | 0.263 MB)
License info not available