Compressing code generation language models on CPUs

Using Group Lasso pruning and post-training quantization

Bachelor Thesis (2023)
Author(s)

D. Sochirca (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Al-Kaswan – Mentor (TU Delft - Software Engineering)

M. Izadi – Mentor (TU Delft - Software Engineering)

A. van Deursen – Mentor (TU Delft - Software Technology)

A. Anand – Graduation committee member (TU Delft - Web Information Systems)

Publication Year
2023
Language
English
Copyright
© 2023 Dan Sochirca
Graduation Date
28-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code generation models have recently gained popularity because they help developers write code more productively. While these large models deliver impressive performance, they require significant computational resources and memory, which makes them difficult to deploy and expensive to train. Their large carbon footprint also raises environmental concerns. To address these challenges, techniques are needed to compress these models while maintaining their performance.
In this work, we study the effectiveness of Group Lasso pruning and post-training quantization on CPUs, applied to the code generation model CodeGPT. We evaluate the compressed model using the Exact Match (EM) and Edit Similarity (ES) metrics, and we measure its size on disk, memory footprint, and CPU inference speed. Compared with the original CodeGPT model, our solution offers a 48% relative reduction in disk size, with only a mild drop in the accuracy metrics: an 8.51% absolute drop in ES and a 5.5% absolute drop in EM. Using the ONNX Runtime on a regular laptop, we deliver a 2x inference speedup at a 32.6% reduction in size. Our code is publicly available at https://github.com/AISE-TUDelft/LLM4CodeCompression/tree/main/CodeGPT-on-Intel.
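
The sketch below illustrates, under simplifying assumptions, the two techniques named in the abstract: a group-lasso penalty over the rows of a weight matrix (added to the training loss so that entire rows shrink toward zero and can be pruned), and weight-only post-training dynamic quantization of an exported ONNX model with ONNX Runtime. The file names, the input name "input_ids", and the choice of groups are placeholders for illustration, not the exact pipeline used in the thesis.

# Illustrative sketch only; not the thesis implementation.
import numpy as np
import torch
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

def group_lasso_penalty(weight: torch.Tensor) -> torch.Tensor:
    # Group Lasso regularizer: sum of the L2 norms of weight groups.
    # Here each output row of a linear layer is treated as one group,
    # so the penalty pushes whole rows to zero, making them prunable.
    return weight.norm(p=2, dim=1).sum()

# During fine-tuning, the penalty would be added to the task loss, e.g.:
#   loss = task_loss + lam * group_lasso_penalty(layer.weight)

# Post-training dynamic quantization of an exported ONNX model:
# FP32 weights are converted to INT8 offline, activations are quantized
# dynamically at run time. Paths below are hypothetical.
quantize_dynamic(
    model_input="codegpt.onnx",
    model_output="codegpt-int8.onnx",
    weight_type=QuantType.QInt8,
)

# CPU inference with the quantized model via ONNX Runtime.
# Input names depend on how the model was exported.
session = ort.InferenceSession("codegpt-int8.onnx",
                               providers=["CPUExecutionProvider"])
dummy_ids = np.array([[1, 2, 3]], dtype=np.int64)
logits = session.run(None, {"input_ids": dummy_ids})[0]
print(logits.shape)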

Files

Final_paper_Dan.pdf
(pdf | 1.66 MB)
License info not available