Automatic Depth Pruning and Post-Processing for Efficient Deep Learning

Master Thesis (2025)
Author(s)

M. Kuchar (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Gaydadjiev – Mentor (TU Delft - Computer Engineering)

M. Naderan-Tahan – Mentor (TU Delft - Computer Engineering)

Y. Li – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Henk Corporaal – Mentor (Eindhoven University of Technology)

Manil Dev Gomony – Mentor (Eindhoven University of Technology)

T.E.P.M.F. Abeel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
28-11-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Model compression techniques are crucial for reducing the deployment cost of large neural networks. Among these, depth pruning (removing entire layers or blocks) and width pruning (removing sections within layers) are essential for reducing memory footprint and inference latency. While various depth pruning methods exist, no unified approach currently works consistently across different architectures. This work proposes a novel methodology that enables effective depth pruning across diverse networks, spanning from Large Language Models (LLMs) to Convolutional Neural Networks (CNNs). To achieve this unified compression, we introduce a hybrid technique that combines a primary depth pruning step with an auxiliary width pruning step. This combined approach targets the elementary linear and convolutional layers that constitute the foundational building blocks of diverse architectures. Our method performs strongly across model types. First, on the Llama 2 LLM, it achieves competitive results against state-of-the-art methods, offering a unique set of favorable trade-offs at varying compression levels. Second, on YOLOv5 CNNs, it removes up to 46% of parameters while preserving 80% of the original accuracy on the COCO dataset. Furthermore, our experiments reveal previously undiscovered pruning dependencies between network layers, providing new insights into the model compression landscape.
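To make the idea of depth pruning concrete, the following is a minimal, hypothetical sketch (not the thesis's actual algorithm): blocks are scored by how much they change their input, using a cosine-similarity proxy that is common in layer-pruning work, and the least important blocks are removed. The scoring function, block representation, and `keep` parameter are all illustrative assumptions.

```python
# Illustrative depth-pruning sketch; NOT the thesis's exact method.
# Assumption: a block whose output closely resembles its input (high
# cosine similarity) contributes little and can be pruned first.
import numpy as np


def block_importance(x_in: np.ndarray, x_out: np.ndarray) -> float:
    """Importance = 1 - cosine similarity between block input and output.
    A block that barely changes its input scores near 0."""
    num = float(np.dot(x_in.ravel(), x_out.ravel()))
    den = float(np.linalg.norm(x_in) * np.linalg.norm(x_out)) + 1e-12
    return 1.0 - num / den


def depth_prune(blocks, x, keep: int):
    """Run the blocks sequentially, score each one, and return the
    `keep` most important blocks in their original order."""
    scores = []
    for block in blocks:
        y = block(x)
        scores.append(block_importance(x, y))
        x = y
    kept_indices = sorted(np.argsort(scores)[::-1][:keep])
    return [blocks[i] for i in kept_indices]


# Toy example with three "blocks": the identity block changes nothing
# and should be the one pruned away.
blocks = [
    lambda x: x + np.array([0.0, 1.0, 0.0, 0.0]),  # shifts direction
    lambda x: x,                                   # identity (useless)
    lambda x: x[::-1],                             # reverses the vector
]
kept = depth_prune(blocks, np.array([1.0, 0.0, 0.0, 0.0]), keep=2)
```

In a real network, the width-pruning step described in the abstract would then further shrink the linear and convolutional layers inside the surviving blocks; this sketch covers only the depth dimension.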

Files

Thesis_report_Kuchar.pdf
(pdf | 1.87 MB)
License info not available