CodeGPT on XTC
Compressing a CodeGPT Model Using Hybrid Layer Reduction and Extreme Quantisation through Knowledge Distillation
Aral de Moor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
Ali Al-Kaswan – Mentor (TU Delft - Software Engineering)
A. van Deursen – Mentor (TU Delft - Software Technology)
Avishek Anand – Graduation committee member (TU Delft - Web Information Systems)
Abstract
Large language models are powerful because of their state-of-the-art language processing abilities, but they are extremely resource-intensive and steadily growing in size. As a result, compressing such models for resource-constrained devices is an active and promising research area. Despite the current popularity of GPT models, many novel compression techniques have not been implemented for them. We apply the XTC pipeline, consisting of layer reduction and quantisation through knowledge distillation, to a CodeGPT generative model. The resulting models are evaluated on the CodeXGLUE line-level code-completion benchmark. Based on this, we demonstrate that (1) XTC can be adapted to GPT-like models, translating many of the findings of the original study; and (2) a 6-layer reduction with 1-bit weight and 8-bit activation quantisation reduces model size by 15× and almost doubles inference speed, with minimal performance degradation. The resulting compressed models show promise for local code generation. By showing that a novel compression technique can be adapted to GPT-like models, we hope to inspire further research in this field.
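For readers unfamiliar with the compression steps named above, the sketch below illustrates the general idea of 1-bit weight / 8-bit activation ("W1A8") quantisation with a knowledge-distillation loss, written in PyTorch. This is a minimal, assumption-laden sketch of the technique, not the thesis implementation (the original XTC pipeline is distributed as part of Microsoft's DeepSpeed Compression library); all class and function names here are hypothetical.

```python
# Minimal sketch of XTC-style extreme quantisation (illustrative only):
# 1-bit weights via sign() scaled by the mean absolute value, with a
# straight-through estimator (STE) so gradients flow during distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeWeightSTE(torch.autograd.Function):
    """Binarise weights to {-alpha, +alpha}; pass gradients straight through."""

    @staticmethod
    def forward(ctx, w):
        alpha = w.abs().mean()          # per-tensor scaling factor
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output              # STE: identity gradient


def quantize_activations_8bit(x):
    """Symmetric 8-bit fake quantisation of activations (per-tensor)."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    return torch.round(x / scale).clamp(-127, 127) * scale


class QuantLinear(nn.Linear):
    """Linear layer with 1-bit weights and 8-bit activations (W1A8)."""

    def forward(self, x):
        w_q = BinarizeWeightSTE.apply(self.weight)
        x_q = quantize_activations_8bit(x)
        return F.linear(x_q, w_q, self.bias)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In a pipeline like the one described above, the student would be a layer-reduced copy of the teacher with its linear layers swapped for quantised ones, trained to minimise the distillation loss against the full-precision teacher's outputs.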