Title: Fast DRL-based scheduler configuration tuning for reducing tail latency in edge-cloud jobs
Authors: Wen, Shilin (Beijing Institute of Technology); Han, Rui (Beijing Institute of Technology); Liu, Chi Harold (Beijing Institute of Technology); Chen, Lydia Y. (TU Delft, Data-Intensive Systems)
Date: 2023
Abstract: Edge-cloud applications have become prevalent in recent years and pose the challenge of using both resource-constrained edge devices and elastic cloud resources under dynamic workloads. Efficient resource allocation for edge-cloud jobs via cluster schedulers (e.g. the Kubernetes/Volcano scheduler) is essential to guarantee their performance, e.g. tail latency, and such allocation is sensitive to scheduler configurations such as the applied scheduling algorithm and the task restart/discard policy. Deep reinforcement learning (DRL) is increasingly applied to optimize scheduling decisions. However, DRL faces the conundrum of achieving high rewards only after a dauntingly long training time (e.g. hours or days), making it difficult to tune scheduler configurations online in accordance with dynamically changing edge-cloud workloads and resources. To address this issue, this paper proposes EdgeTuner, a fast scheduler configuration tuning approach that efficiently leverages DRL to reduce the tail latency of edge-cloud jobs. The enabling feature of EdgeTuner is that it effectively simulates the execution of edge-cloud jobs under different scheduler configurations and thus quickly estimates each configuration's influence on job performance. The simulation results allow EdgeTuner to train a DRL agent quickly enough to properly tune scheduler configurations in dynamic edge-cloud environments. We implement EdgeTuner in both the Kubernetes and Volcano schedulers and extensively evaluate it on real workloads driven by Alibaba production traces.
Our results show that EdgeTuner outperforms prevailing scheduling algorithms by achieving much lower tail latency while accelerating DRL training by an average of 151.63x.
Subject: DRL; Edge-cloud jobs; Kubernetes and Volcano; Scheduler configurations; Tail latency
To reference this document use: http://resolver.tudelft.nl/uuid:88d92b03-24c7-46c8-a31a-072c75245515
DOI: https://doi.org/10.1186/s13677-023-00465-z
Source: Journal of Cloud Computing, 12 (1)
Part of collection: Institutional Repository
Document type: journal article
Rights: © 2023 Shilin Wen, Rui Han, Chi Harold Liu, Lydia Y. Chen
Files: PDF, s13677_023_00465_z.pdf (17.53 MB)
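The simulate-then-tune loop the abstract describes, i.e. estimating each scheduler configuration's effect on tail latency via simulation and letting a learning agent pick configurations, can be sketched minimally as below. This is not EdgeTuner's implementation: the configuration names, the toy latency simulator, and the simple epsilon-greedy value estimator (standing in for the paper's DRL agent) are all illustrative assumptions.

```python
import random

# Hypothetical discrete configuration space: (scheduling algorithm, task policy).
# Illustrative only; the paper's space covers Kubernetes/Volcano scheduler
# configurations such as scheduling algorithms and restart/discard policies.
CONFIGS = [
    ("fifo", "restart"), ("fifo", "discard"),
    ("binpack", "restart"), ("binpack", "discard"),
]

def simulate_tail_latency(config, workload_seed):
    """Toy stand-in for a job-execution simulator: returns a synthetic
    95th-percentile job latency (seconds) under the given configuration."""
    rng = random.Random(CONFIGS.index(config) * 1000 + workload_seed)
    base = 12.0 if config[0] == "fifo" else 8.0       # invented base latencies
    penalty = 2.0 if config[1] == "restart" else 3.5  # invented policy cost
    return base + penalty + rng.uniform(-1.0, 1.0)    # workload noise

def tune(episodes=500, eps=0.1, lr=0.2, seed=42):
    """Epsilon-greedy tuner: reward is negative simulated tail latency, so
    the agent converges on the configuration with the lowest tail latency."""
    rng = random.Random(seed)
    q = {c: 0.0 for c in CONFIGS}                     # per-config value estimates
    for ep in range(episodes):
        cfg = rng.choice(CONFIGS) if rng.random() < eps else max(q, key=q.get)
        reward = -simulate_tail_latency(cfg, workload_seed=ep)
        q[cfg] += lr * (reward - q[cfg])              # incremental value update
    return max(q, key=q.get)                          # best configuration found

print("selected configuration:", tune())
```

Because every episode queries the cheap simulator rather than a real cluster, many configurations can be evaluated quickly, which is the essence of the speedup the paper claims for training its (much richer) DRL agent.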