CodeGPT on XTC

Compressing a CodeGPT Model Using Hybrid Layer Reduction and Extreme Quantisation through Knowledge Distillation


Abstract

Large language models are powerful because of their state-of-the-art language processing abilities, but they come at the cost of being extremely resource-intensive and are steadily growing in size. As a result, compressing such models for resource-constrained devices is an active and promising research area. In spite of their current popularity, many novel compression techniques lack implementations for GPT models. We apply the XTC pipeline, consisting of layer reduction and quantisation through knowledge distillation, to a CodeGPT generative model. The resulting models are evaluated on the CodeXGLUE line-level code-completion benchmark. Based on this, we demonstrate that (1) XTC can be adapted to GPT-like models, translating many of the findings of the original study; and (2) a 6-layer reduction with 1-bit weight and 8-bit activation quantisation reduces model size by 15× and almost doubles inference speed, with minimal performance degradation. The resulting compressed models show promise for local code generation. By showing that a novel compression technique can be adapted to GPT-like models, we hope to inspire further research in this field.
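To give a concrete feel for the two ingredients named above, the following is a minimal PyTorch sketch of (a) 1-bit weight quantisation with a straight-through estimator and (b) a knowledge-distillation loss between a teacher and a reduced-depth student. It is an illustration of the general technique only, not the XTC/DeepSpeed implementation used in this work; the names `BinaryLinear`, `distillation_loss`, and the hyperparameters `T` and `alpha` are hypothetical choices for the example.

```python
import torch
import torch.nn.functional as F


class BinaryLinear(torch.nn.Linear):
    """Illustrative linear layer with 1-bit weights, binarised on the fly.

    Uses scaled sign() binarisation with a straight-through estimator so
    gradients still flow to the latent full-precision weights.
    """

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean(dim=1, keepdim=True)   # per-output-row scaling factor
        w_bin = scale * torch.sign(w)               # 1-bit quantised weights
        w_q = w + (w_bin - w).detach()              # straight-through estimator
        return F.linear(x, w_q, self.bias)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft-label KD term with the usual language-modelling loss."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * kd + (1 - alpha) * ce
```

Roughly speaking, layer reduction in such a pipeline amounts to initialising the shallower student from a subset of the teacher's transformer blocks before this distillation stage, with 8-bit activation quantisation applied analogously to the layer inputs.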