Title: Fast DRL-based scheduler configuration tuning for reducing tail latency in edge-cloud jobs
Authors: Wen, Shilin (Beijing Institute of Technology); Han, Rui (Beijing Institute of Technology); Liu, Chi Harold (Beijing Institute of Technology); Chen, Lydia Y. (TU Delft, Data-Intensive Systems)
Date: 2023
Abstract: Edge-cloud applications have become prevalent in recent years and pose the challenge of using both resource-constrained edge devices and elastic cloud resources under dynamic workloads. Efficient resource allocation for edge-cloud jobs via cluster schedulers (e.g. the Kubernetes/Volcano scheduler) is essential to guarantee their performance, e.g. tail latency, and such allocation is sensitive to scheduler configurations such as the applied scheduling algorithm and the task restart/discard policy. Deep reinforcement learning (DRL) is increasingly applied to optimize scheduling decisions. However, DRL faces the conundrum of achieving high rewards only after a dauntingly long training time (e.g. hours or days), making it difficult to tune scheduler configurations online in accordance with dynamically changing edge-cloud workloads and resources. To address this issue, this paper proposes EdgeTuner, a fast scheduler configuration tuning approach that efficiently leverages DRL to reduce the tail latency of edge-cloud jobs. The enabling feature of EdgeTuner is that it effectively simulates the execution of edge-cloud jobs under different scheduler configurations and thus quickly estimates each configuration's influence on job performance. The simulation results allow EdgeTuner to train a DRL agent quickly enough to properly tune scheduler configurations in dynamic edge-cloud environments. We implement EdgeTuner in both the Kubernetes and Volcano schedulers and extensively evaluate it on real workloads driven by Alibaba production traces.
Our results show that EdgeTuner outperforms prevailing scheduling algorithms by achieving much lower tail latency while accelerating DRL training by an average of 151.63x.
Subject: DRL; Edge-cloud jobs; Kubernetes and Volcano; Scheduler configurations; Tail latency
To reference this document use: http://resolver.tudelft.nl/uuid:88d92b03-24c7-46c8-a31a-072c75245515
DOI: https://doi.org/10.1186/s13677-023-00465-z
Source: Journal of Cloud Computing, 12 (1)
Part of collection: Institutional Repository
Document type: journal article
Rights: © 2023 Shilin Wen, Rui Han, Chi Harold Liu, Lydia Y. Chen
Files: PDF, s13677_023_00465_z.pdf (17.53 MB)
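The simulate-then-tune loop the abstract describes, i.e. estimating each scheduler configuration's effect on tail latency via simulation and letting a learning agent pick configurations, can be sketched minimally as below. This is not EdgeTuner's implementation: the configuration names, the toy latency simulator, and the simple epsilon-greedy value estimator (standing in for the paper's DRL agent) are all illustrative assumptions.

```python
import random

# Hypothetical discrete configuration space: (scheduling algorithm, task policy).
# Illustrative only; the paper's space covers Kubernetes/Volcano scheduler
# configurations such as scheduling algorithms and restart/discard policies.
CONFIGS = [
    ("fifo", "restart"), ("fifo", "discard"),
    ("binpack", "restart"), ("binpack", "discard"),
]

def simulate_tail_latency(config, workload_seed):
    """Toy stand-in for a job-execution simulator: returns a synthetic
    95th-percentile job latency (seconds) under the given configuration."""
    rng = random.Random(CONFIGS.index(config) * 1000 + workload_seed)
    base = 12.0 if config[0] == "fifo" else 8.0       # invented base latencies
    penalty = 2.0 if config[1] == "restart" else 3.5  # invented policy cost
    return base + penalty + rng.uniform(-1.0, 1.0)    # workload noise

def tune(episodes=500, eps=0.1, lr=0.2, seed=42):
    """Epsilon-greedy tuner: reward is negative simulated tail latency, so
    the agent converges on the configuration with the lowest tail latency."""
    rng = random.Random(seed)
    q = {c: 0.0 for c in CONFIGS}                     # per-config value estimates
    for ep in range(episodes):
        cfg = rng.choice(CONFIGS) if rng.random() < eps else max(q, key=q.get)
        reward = -simulate_tail_latency(cfg, workload_seed=ep)
        q[cfg] += lr * (reward - q[cfg])              # incremental value update
    return max(q, key=q.get)                          # best configuration found

print("selected configuration:", tune())
```

Because every episode queries the cheap simulator rather than a real cluster, many configurations can be evaluated quickly, which is the essence of the speedup the paper claims for training its (much richer) DRL agent.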