DNN-porting for maximizing inference hardware utilisation at the Edge


Abstract

Industry 4.0 and the growth of the Industrial Internet of Things (IIoT) will result in an explosion of data generated by connected devices. Adopting 5G and 6G technology could be the leading enabler of connecting IIoT devices at scale. However, edge solutions have disadvantages, such as the loss of resource elasticity compared to cloud solutions. The research questions of this thesis are whether Deep Neural Network (DNN) porting can resolve the accuracy-performance trade-off of edge computing solutions, and how to implement an edge computing system based on open-source, container-orchestrated DNN model inference platforms with vertical model autoscaling capabilities.
The thesis shows how porting techniques such as structured pruning enable navigating the accuracy-performance trade-off of DNNs in hardware-constrained settings, producing models of reduced complexity and size while degrading accuracy only minimally. By serving these ported models in the proposed inference platform, the thesis demonstrates how an edge computing system can achieve vertical model autoscaling, enabling efficient use of computational resources. This research focuses on CPU hardware and Real-Time (RT) request scenarios, where the latency Service Level Objective (SLO) combined with current demand are the crucial factors. When an inference system's resources are depleted, the latency of individual requests can increase significantly due to queuing. The results show how an orchestrator can make live model-version selections based on the available model versions and current demand. The proposed system increases the maximum achievable throughput compared to the state of the art while avoiding queue build-up in the RT scenario, and it improves system accuracy when CPU resources are available. Additionally, this work proposes a design for implementing these benefits in industry-adopted open-source DNN inference platforms.
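To make the porting step concrete, the following is a minimal sketch of structured pruning using PyTorch's built-in pruning utilities. The framework, the toy model, and the 50% pruning ratio are illustrative assumptions; the abstract does not prescribe a specific toolchain or configuration.

```python
# Minimal sketch of structured pruning (illustrative; not the thesis's exact setup).
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
)

# Zero out 50% of the first layer's output filters (dim=0), ranked by L2 norm.
# Because whole filters are pruned rather than scattered individual weights,
# the zeroed channels can later be removed entirely in an export step,
# yielding a ported model version with lower complexity and size at a small
# cost in accuracy.
prune.ln_structured(model[0], name="weight", amount=0.5, n=2, dim=0)
prune.remove(model[0], "weight")  # bake the pruning mask into the weights
```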
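The orchestrator's live model-version selection can be illustrated by the hypothetical sketch below. The names (ModelVersion, pick_version), the single-worker utilisation check, and the selection rule are assumptions for illustration only, not the thesis's actual algorithm: the idea is to serve the most accurate ported variant that still meets the latency SLO at the current demand, falling back to a faster variant before a queue can build up.

```python
# Hypothetical SLO- and demand-aware model-version selection (illustrative only).
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    accuracy: float    # validation accuracy of this ported variant
    latency_ms: float  # measured per-request CPU latency

def pick_version(versions: list[ModelVersion],
                 slo_ms: float, demand_rps: float) -> ModelVersion:
    """Pick the most accurate version that meets the latency SLO and keeps
    utilisation below 1.0 (assumes one serving worker), so no queue forms."""
    feasible = [
        v for v in versions
        if v.latency_ms <= slo_ms
        and demand_rps * v.latency_ms / 1000.0 <= 1.0
    ]
    if not feasible:  # overloaded: degrade gracefully to the fastest variant
        return min(versions, key=lambda v: v.latency_ms)
    return max(feasible, key=lambda v: v.accuracy)

versions = [
    ModelVersion("model-full",   accuracy=0.92, latency_ms=80.0),
    ModelVersion("model-pruned", accuracy=0.89, latency_ms=35.0),
]
# At 20 req/s the full model would saturate the CPU (20 * 0.08 s = 1.6 > 1.0),
# so the pruned variant is selected; at low demand the full model wins back.
print(pick_version(versions, slo_ms=100.0, demand_rps=20.0).name)  # model-pruned
```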