Automatic Depth Pruning and Post-Processing for Efficient Deep Learning

Master Thesis (2025)
Author(s)

M. Kuchar (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Gaydadjiev – Mentor (TU Delft - Computer Engineering)

M. Naderan-Tahan – Mentor (TU Delft - Computer Engineering)

Y. Li – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Henk Corporaal – Mentor (Eindhoven University of Technology)

Manil Dev Gomony – Mentor (Eindhoven University of Technology)

T.E.P.M.F. Abeel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
28-11-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Model compression techniques are crucial for reducing the deployment cost of large neural networks. Among these, depth pruning (removing entire layers or blocks) and width pruning (removing sections within layers) are essential for reducing memory footprint and inference latency. While various depth pruning methods exist, no unified approach currently works consistently across different architectures. This work proposes a novel methodology that enables effective depth pruning across diverse networks, spanning from Large Language Models (LLMs) to Convolutional Neural Networks (CNNs). To achieve this unified compression, we introduce a hybrid technique that combines a primary depth pruning step with an auxiliary width pruning step. This combined approach targets the elementary linear and convolutional layers that constitute the foundational building blocks of diverse architectures. Our method performs strongly across model types. First, on the Llama 2 LLM, it achieves competitive results against state-of-the-art methods, offering a unique set of favorable trade-offs at varying compression levels. Second, on YOLOv5 CNNs, it removes up to 46% of parameters while preserving 80% of the original accuracy on the COCO dataset. Furthermore, our experiments reveal previously undiscovered pruning dependencies between network layers, providing new insights into the model compression landscape.
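To make the idea of depth pruning concrete, the following is a minimal, hypothetical sketch (not the thesis's actual algorithm): blocks are scored by how much they change their input, using a cosine-similarity proxy that is common in layer-pruning work, and the least important blocks are removed. The scoring function, block representation, and `keep` parameter are all illustrative assumptions.

```python
# Illustrative depth-pruning sketch; NOT the thesis's exact method.
# Assumption: a block whose output closely resembles its input (high
# cosine similarity) contributes little and can be pruned first.
import numpy as np


def block_importance(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Importance = 1 - cosine similarity between block input and output.
    A block that barely changes its input scores near 0."""
    num = float(np.dot(x_in.ravel(), x_out.ravel()))
    den = float(np.linalg.norm(x_in) * np.linalg.norm(x_out)) + 1e-12
    return 1.0 - num / den


def depth_prune(blocks, x, keep: int):
    """Run the blocks sequentially, score each one, and return the
    `keep` most important blocks in their original order."""
    scores = []
    for block in blocks:
        y = block(x)
        scores.append(block_importance(x, y))
        x = y
    kept_indices = sorted(np.argsort(scores)[::-1][:keep])
    return [blocks[i] for i in kept_indices]


# Toy example with three "blocks": the identity block changes nothing
# and should be the one pruned away.
blocks = [
    lambda x: x + np.array([0.0, 1.0, 0.0, 0.0]),  # shifts direction
    lambda x: x,                                   # identity (useless)
    lambda x: x[::-1],                             # reverses the vector
]
kept = depth_prune(blocks, np.array([1.0, 0.0, 0.0, 0.0]), keep=2)
```

In a real network, the width-pruning step described in the abstract would then further shrink the linear and convolutional layers inside the surviving blocks; this sketch covers only the depth dimension.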

Files

Thesis_report_Kuchar.pdf
(pdf | 1.87 MB)
License info not available