Oikonomos-II: A Reinforcement-Learning, Resource-Recommendation System for Cloud HPC
Jan Harm Betting (Erasmus MC)
C. I. De Zeeuw (Erasmus MC, Netherlands Institute for Neuroscience)
C. Strydis (TU Delft - Computer Engineering, Erasmus MC)
Abstract
The cloud has become a powerful environment for deploying High-Performance Computing (HPC) applications, but the large number of available instance types makes selecting the optimal platform difficult. Users often lack the time or expertise required to make an optimal choice. Recommender systems have been developed for this purpose, but current state-of-the-art systems either demand large amounts of training data or require running the application multiple times, both of which are costly. In this work, we propose Oikonomos-II, a resource-recommendation system based on reinforcement learning for HPC applications in the cloud. Oikonomos-II models the relationship between input parameters, instance types, and execution times. The system requires no preexisting training data and no repeated job executions: it gathers its own training data opportunistically from user-submitted jobs, employing a variant of the Neural-LinUCB algorithm. When deployed on a mix of HPC applications, Oikonomos-II quickly converged towards an optimal policy. By eliminating the need for preexisting training data or auxiliary runs, Oikonomos-II provides an economical, general-purpose resource-recommendation system for cloud HPC.
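To make the bandit-style recommendation concrete, the following is a minimal sketch of Neural-LinUCB-style arm selection, where arms stand for instance types and the context is a job's input parameters. All names, dimensions, and the fixed random feature map are illustrative assumptions, not the paper's actual implementation; in Neural-LinUCB proper, the feature map would be the learned hidden layers of a neural network, with a linear UCB head per arm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: job-feature dim, embedding dim, number of instance types.
D_IN, D_EMB, N_ARMS, ALPHA = 4, 8, 3, 1.0

# Stand-in for the learned neural feature map phi(x); a real Neural-LinUCB
# system would train this network on observed (job, runtime) data.
W = rng.normal(size=(D_EMB, D_IN))

def phi(x):
    return np.tanh(W @ x)

# Per-arm LinUCB statistics: A_a = I + sum(phi phi^T), b_a = sum(reward * phi).
A = [np.eye(D_EMB) for _ in range(N_ARMS)]
b = [np.zeros(D_EMB) for _ in range(N_ARMS)]

def select_arm(x):
    """Recommend the instance type with the highest upper confidence bound."""
    z = phi(x)
    scores = []
    for a in range(N_ARMS):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                      # ridge-regression estimate
        ucb = theta @ z + ALPHA * np.sqrt(z @ A_inv @ z)  # mean + exploration bonus
        scores.append(ucb)
    return int(np.argmax(scores))

def update(arm, x, reward):
    """Fold an observed reward (e.g. negative cost or runtime) into the stats."""
    z = phi(x)
    A[arm] += np.outer(z, z)
    b[arm] += reward * z

# One simulated round: job features in, recommendation out, reward fed back.
x = rng.normal(size=D_IN)
arm = select_arm(x)
update(arm, x, 1.0)
```

The exploration bonus shrinks for arms whose statistics are well-populated, which is how such a system can learn from user-submitted jobs alone, without auxiliary benchmarking runs.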