Multi-model inference on the edge

Scheduling for multi-model execution on resource-constrained devices

Abstract

Deep neural networks (DNNs) are becoming core components of many applications running on edge devices, especially for image-based analysis, e.g., identifying objects, faces, and genders. While very successful in resource-rich environments such as the cloud, deep learning is rarely used for inference on edge devices: their resource constraints present bottlenecks that prevent all but small networks from on-device inference. Existing solutions such as compression or offloading to the cloud sacrifice either model accuracy or inference latency. To counter these shortcomings, Masa and EdgeCaffe are proposed. Existing frameworks perform DNN inference by loading and executing a model in its entirety, ignoring the resource limitations inherent to edge devices. The new framework EdgeCaffe instead executes models partially, thereby reducing their memory footprint, and additionally coordinates between models during multi-model DNN inference. Masa is a memory-aware multi-DNN scheduling algorithm that models inter- and intra-network dependencies and leverages the complementary memory usage of individual layers. Masa consistently keeps average response times low when executing multiple DNN-based image analyses under both deterministic and stochastic arrivals. We extensively evaluate Masa on three configurations of Raspberry Pi and a large set of popular DNN models triggered by different arrival patterns of images. Our results show that, compared to state-of-the-art multi-DNN scheduling solutions, Masa achieves average response times lower by up to 8× and 16× for deterministic and stochastic arrivals, respectively, on devices with small memory, i.e., 512 MB to 1 GB.
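
As a rough illustration of the partial-execution idea, the Python sketch below loads one layer's weights at a time, applies the layer, and frees the weights before touching the next one, so peak memory is bounded by the largest single layer rather than the whole model. All names here (run_model_partially, per-layer .npy weight files, the toy dense-plus-ReLU layer) are hypothetical assumptions for illustration, not EdgeCaffe's actual API.

    import numpy as np

    def run_model_partially(layer_files, x):
        """Execute a model one layer at a time to cap peak memory."""
        for path in layer_files:
            weights = np.load(path)           # load only this layer's parameters
            x = np.maximum(x @ weights, 0.0)  # toy dense layer with ReLU
            del weights                       # release before the next load
        return x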
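
The abstract does not spell out Masa's policy, so the sketch below only gestures at memory-aware interleaving. Among the next runnable layer of each model (intra-network dependency means only the head of each queue is runnable), it greedily picks the most memory-hungry layer that still fits a fixed budget, so heavy and light layers of different models interleave. The function name, the greedy rule, and the example numbers are all illustrative assumptions, not Masa's actual algorithm.

    from collections import deque

    def schedule_layers(networks, mem_budget_mb):
        """Interleave layers of several DNNs under a memory budget."""
        queues = {name: deque(layers) for name, layers in networks.items()}
        order = []
        while any(queues.values()):
            # runnable candidates: the head layer of every unfinished model
            heads = [(q[0][1], name) for name, q in queues.items() if q]
            fits = [h for h in heads if h[0] <= mem_budget_mb]
            # biggest runnable layer that fits; smallest overall as a
            # fallback so the schedule always makes progress
            _, pick = max(fits) if fits else min(heads)
            layer, mem = queues[pick].popleft()
            order.append((pick, layer, mem))
        return order

    # e.g., two models sharing a hypothetical 512 MB budget
    print(schedule_layers(
        {"agenet": [("conv1", 300), ("fc", 80)],
         "gendernet": [("conv1", 200), ("fc", 60)]},
        mem_budget_mb=512))

In this toy run both memory-heavy conv layers execute before either model's lightweight fc layer, interleaved across the two models, illustrating how complementary per-layer memory demands can be spread over time.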