Compressing code generation language models on CPUs

Using Group Lasso pruning and post-training quantization

Bachelor Thesis (2023)
Author(s)

D. Sochirca (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Al-Kaswan – Mentor (TU Delft - Software Engineering)

M. Izadi – Mentor (TU Delft - Software Engineering)

A. van Deursen – Mentor (TU Delft - Software Technology)

A. Anand – Graduation committee member (TU Delft - Web Information Systems)

Publication Year
2023
Language
English
Copyright
© 2023 Dan Sochirca
Graduation Date
28-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code generation models have recently gained popularity because they help developers write code more productively. While these large models deliver impressive performance, they require significant computational resources and memory, which makes them difficult to deploy and expensive to train. Their large carbon footprint also raises environmental concerns. To address these challenges, techniques are needed to compress these models while maintaining their performance.
In this work, we study the effectiveness of Group Lasso pruning and post-training quantization on CPUs, applied to the code generation model CodeGPT. We evaluate the compressed model using the Exact Match (EM) and Edit Similarity (ES) metrics, and we measure its size on disk, memory footprint, and CPU inference speed. Compared with the original CodeGPT model, our solution offers a 48% relative reduction in disk size, with only a mild drop in the accuracy metrics: an 8.51% absolute drop in ES and a 5.5% absolute drop in EM. Using the ONNX Runtime on a regular laptop, we deliver a 2x inference speedup at a 32.6% reduction in size. Our code is publicly available at https://github.com/AISE-TUDelft/LLM4CodeCompression/tree/main/CodeGPT-on-Intel.
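
The sketch below illustrates, under simplifying assumptions, the two techniques named in the abstract: a group-lasso penalty over the rows of a weight matrix (added to the training loss so that entire rows shrink toward zero and can be pruned), and weight-only post-training dynamic quantization of an exported ONNX model with ONNX Runtime. The file names, the input name "input_ids", and the choice of groups are placeholders for illustration, not the exact pipeline used in the thesis.

# Illustrative sketch only; not the thesis implementation.
import numpy as np
import torch
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

def group_lasso_penalty(weight: torch.Tensor) -> torch.Tensor:
    # Group Lasso regularizer: sum of the L2 norms of weight groups.
    # Here each output row of a linear layer is treated as one group,
    # so the penalty pushes whole rows to zero, making them prunable.
    return weight.norm(p=2, dim=1).sum()

# During fine-tuning, the penalty would be added to the task loss, e.g.:
#   loss = task_loss + lam * group_lasso_penalty(layer.weight)

# Post-training dynamic quantization of an exported ONNX model:
# FP32 weights are converted to INT8 offline, activations are quantized
# dynamically at run time. Paths below are hypothetical.
quantize_dynamic(
    model_input="codegpt.onnx",
    model_output="codegpt-int8.onnx",
    weight_type=QuantType.QInt8,
)

# CPU inference with the quantized model via ONNX Runtime.
# Input names depend on how the model was exported.
session = ort.InferenceSession("codegpt-int8.onnx",
                               providers=["CPUExecutionProvider"])
dummy_ids = np.array([[1, 2, 3]], dtype=np.int64)
logits = session.run(None, {"input_ids": dummy_ids})[0]
print(logits.shape)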

Files

Final_paper_Dan.pdf
(pdf | 1.66 MB)
License info not available