Multi-model inference on the edge

Scheduling for multi-model execution on resource constrained devices

Master Thesis (2020)
Author(s)

B.A. Cox (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Lydia Chen – Mentor (TU Delft - Data-Intensive Systems)

D.H.J. Epema – Graduation committee member (TU Delft - Data-Intensive Systems)

Marco A. Zuñiga Zamalloa – Graduation committee member (TU Delft - Embedded Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 Bart Cox
Publication Year
2020
Language
English
Graduation Date
09-11-2020
Awarding Institution
Delft University of Technology
Programme
Computer Science | Data Science and Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deep neural networks (DNNs) are becoming core components of many applications running on edge devices, especially for image-based analysis, e.g., identifying objects, faces, and genders. While deep learning is very successful in resource-rich environments such as the cloud, it is rarely used for inference on edge devices. The resource constraints of edge devices present bottlenecks that prevent all but small networks from on-device inference. Existing solutions, such as compression or off-loading to the cloud, sacrifice either model accuracy or inference latency. To counter these shortcomings, Masa and EdgeCaffe are proposed. Existing frameworks perform DNN inference by loading and executing a model in its entirety. As this ignores the resource limitations inherent to edge devices, a new framework, EdgeCaffe, is proposed. EdgeCaffe can execute models partially, thereby reducing the memory footprint of the model. Additionally, EdgeCaffe allows for coordination between models when multi-model DNN inference is executed. Masa is a memory-aware multi-DNN scheduling algorithm that models inter- and intra-network dependencies and leverages the complementary memory usage of each layer. With extensive evaluations, we show that Masa can consistently ensure the average response time when deterministically and stochastically executing multiple DNN-based image analyses. We evaluate Masa on three configurations of Raspberry Pi and a large set of popular DNN models triggered by different arrival patterns of images.
Our evaluation results show that Masa can achieve average response times lower by up to 8× and 16× for deterministic and stochastic arrivals, respectively, on devices with small memory, i.e., 512 MB to 1 GB, compared to state-of-the-art multi-DNN scheduling solutions.
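The memory benefit of the partial execution described in the abstract can be illustrated with a minimal sketch (the layer sizes and function names below are hypothetical illustrations, not the actual EdgeCaffe API): loading and executing a model layer by layer bounds resident weight memory by the largest layer rather than by the whole network.

```python
# Minimal sketch of the partial-execution idea: instead of keeping all
# layer weights resident at once, load one layer, run it, then free it,
# so peak weight memory is bounded by the largest single layer.

def peak_memory_full_load(layer_sizes_mb):
    """Conventional inference: every layer's weights resident at once."""
    return sum(layer_sizes_mb)

def peak_memory_partial_load(layer_sizes_mb):
    """Layer-wise execution: only one layer's weights resident at a time."""
    return max(layer_sizes_mb)

# Hypothetical per-layer weight sizes (MB) for a small CNN.
layers = [4, 96, 128, 64, 8]
print(peak_memory_full_load(layers))     # 300
print(peak_memory_partial_load(layers))  # 128
```

A memory-aware scheduler such as Masa can then interleave layers of several such models so that memory-heavy layers of different networks do not reside in memory simultaneously.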

Files

MScThesis_BartCox.pdf
(pdf | 5.18 Mb)
License info not available