Oikonomos-II+
a Reinforcement-Learning, Cloud Resource Recommender for HPC & AI Workloads
R. E.V. Betting (Erasmus MC)
Q. Chen (Erasmus MC)
C. I. De Zeeuw (Erasmus MC, Netherlands Institute for Neuroscience)
C. Strydis (Erasmus MC, TU Delft - Computer Engineering)
Abstract
Oikonomos-II+ is a hybrid reinforcement-learning system for recommending optimal cloud-instance types for High-Performance Computing (HPC) and Artificial-Intelligence (AI) applications. Unlike existing approaches that require historical data or repeated job executions, Oikonomos-II+ learns online from user-submitted jobs. It combines a modified Neural-LinUCB algorithm with Gaussian-Process regression to model the relationship between job parameters, instance types, and execution time. This allows it to balance exploration and exploitation efficiently, even in the absence of prior data. We evaluated six configurations of Oikonomos-II+ on a diverse set of HPC and AI workloads, optimizing for cost and speed. Results show that the complete system converges to optimal resource choices, outperforming purely predictive or search-based approaches. By treating deployed applications as a black box and by eliminating the need for preexisting training data or auxiliary runs, Oikonomos-II+ provides a general-purpose, low-overhead solution for dynamic resource selection in heterogeneous cloud environments.
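To make the exploration-exploitation idea concrete, the following is a minimal sketch of a plain, disjoint LinUCB contextual bandit that picks an instance type for each incoming job and learns online from the observed reward. It is not the modified Neural-LinUCB of Oikonomos-II+ and it omits the Gaussian-Process component entirely. The three instance types, the two-element job feature vector, the reward weights in `true_theta`, and the names `LinUCB` and `simulate` are all hypothetical and chosen only for illustration.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (instance type)."""

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha  # width of the exploration bonus
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(n_features) for _ in range(n_arms)]  # X^T r per arm

    def select(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)      # uncertainty term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold one observed (context, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def simulate(n_jobs=500, seed=0):
    """Run the bandit on synthetic jobs; returns the arm chosen for each job."""
    rng = np.random.default_rng(seed)
    # Hypothetical reward weights per instance type (assumption, not from the paper).
    # Arm 1 yields the highest expected reward for every job in this toy setup.
    true_theta = np.array([[0.2, 0.1], [0.8, 0.3], [0.4, 0.2]])
    bandit = LinUCB(n_arms=3, n_features=2, alpha=0.5)
    picks = []
    for _ in range(n_jobs):
        # Context: a bias term plus one normalized job parameter (hypothetical).
        x = np.array([1.0, rng.uniform(0.0, 1.0)])
        arm = bandit.select(x)
        # Reward stands in for a cost or speed score observed after the job runs.
        reward = true_theta[arm] @ x + rng.normal(0.0, 0.05)
        bandit.update(arm, x, reward)
        picks.append(arm)
    return picks
```

Calling `simulate()` returns the sequence of chosen arms. In this toy setup no prior data is supplied: each model starts from an identity matrix and a zero vector, and every update comes from a job that was actually "submitted", which mirrors the online, no-auxiliary-runs setting described above.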