Model compression techniques are crucial for reducing the deployment cost of large neural networks. Among these, depth pruning (removing layers or blocks) and width pruning (removing channels or neurons within layers) are essential for reducing memory footprint and inference latency. While various depth pruning methods exist, no unified approach currently works consistently across different architectures. This work proposes a novel methodology that enables effective depth pruning across diverse networks, spanning from Large Language Models (LLMs) to Convolutional Neural Networks (CNNs). To achieve this unified compression, we introduce a hybrid technique that combines a primary depth pruning step with an auxiliary width pruning step. The combined approach specifically targets the elementary linear and convolutional layers that constitute the foundational building blocks of diverse architectures. Our method demonstrates strong performance across model types. First, on the Llama2 LLM, it achieves competitive results against state-of-the-art methods, offering a unique set of favorable trade-offs at varying compression levels. Second, on YOLOv5 CNNs, it removes up to 46% of parameters while preserving 80% of the model's accuracy on the COCO dataset. Furthermore, our experiments reveal previously undiscovered pruning dependencies between network layers, providing new insights into the model compression landscape.
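The two pruning modes combined above can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm: the function names (`width_prune`, `depth_prune`), the representation of a network as plain weight matrices, and the magnitude-based importance criterion are all illustrative assumptions; the abstract does not specify how importance is scored.

```python
import numpy as np


def width_prune(W_in, W_out, keep_ratio):
    """Width pruning (illustrative): drop low-importance hidden units
    of a two-layer linear block.

    W_in:  (hidden, in_dim)  weight of the first layer
    W_out: (out_dim, hidden) weight of the second layer
    Importance here is a hypothetical L2-magnitude score, not the
    criterion used in the paper.
    """
    hidden = W_in.shape[0]
    keep = max(1, int(hidden * keep_ratio))
    # Score each hidden unit by the norms of its incoming/outgoing weights.
    scores = np.linalg.norm(W_in, axis=1) * np.linalg.norm(W_out, axis=0)
    idx = np.sort(np.argsort(scores)[-keep:])  # keep the top-scoring units
    return W_in[idx, :], W_out[:, idx]


def depth_prune(blocks, scores, n_remove):
    """Depth pruning (illustrative): remove the n_remove blocks with the
    lowest importance scores, keeping the rest in their original order."""
    drop = set(np.argsort(scores)[:n_remove].tolist())
    return [b for i, b in enumerate(blocks) if i not in drop]
```

As a usage sketch, `width_prune(W_in, W_out, 0.5)` on an 8-unit hidden layer returns matrices with 4 hidden units, and `depth_prune(blocks, scores, 2)` returns the block list with the two lowest-scoring blocks removed; a real pipeline would follow either step with fine-tuning to recover accuracy.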