FM

F.S. Mastenbroek

info

Please Note

3 records found

Cloud datacenters underpin our increasingly digital society, serving stakeholders across industry, government, and academia. These stakeholders have come to expect reliable operation and high quality of service, yet demand low cost, high scalability, and corporate (environmental) responsibility. Datacenter operators are confronted frequently with highly complex decisions that involve numerous aspects of risk. The consequence of bad decisions can be financial penalties or even loss of customers on the one hand, or a competitive disadvantage or unsustainable environmental impact on the other hand. Despite risk analysis being an integral part of the design and operation of cloud infrastructure, relatively few comprehensive approaches and tools exist, leaving many datacenter operators ill-equipped to make informed decisions with confidence.

We propose Radice, an instrument for data-driven analysis of IT-related operational risks in sustainable cloud datacenters. Unlike most state-of-the-art approaches used by the industry, Radice automates the process of risk analysis in datacenters and utilizes the large and diverse volume of data reported by the monitoring systems in datacenters, including environmental data. Underpinning this system is the trace-based, discrete-event simulator OpenDC, which enables the exploration of many risk scenarios through its support for diverse workloads, datacenter topologies, and operational phenomena. Radice’s interactive and explorative user interface assists datacenter operators in addressing complex decisions involving risks, providing them with actionable insights, automated visualizations, and suggestions to reduce risk.

We implement Radice and conduct a comprehensive evaluation of the system to demonstrate how it can aid datacenter operators when confronted with fundamental risk trade-offs. Although Radice is designed to work across many kinds of datacenters, in this work, we focus on private-cloud, business-critical workloads, and on public-cloud operations, representing the majority of workloads in Dutch datacenters. Our experiments show many interesting findings, supporting our claim for a need for data-driven risk analysis in datacenters. We highlight the increasing risk faced by datacenter operators due to price surges in the electricity and CO2 bond markets, and demonstrate how Radice can be used to control such risks. We further show that Radice can automatically optimize topology and operational settings in datacenters for risk, revealing configurations that reduce the overall risk by 10%–30%. Following extensive performance engineering, Radice is able to evaluate risk scenarios by a factor 70x–330x faster than others, opening possibilities for interactive risk exploration. We release Radice as free and open-source software for the community to inspect and re-use. ...
Datacenter infrastructure has become vital for stakeholders across industry, academia and government. To operate efficiently, datacenter operators rely on a variety of complex scheduling techniques, to distribute user workloads across resources. In this work, we leverage a reference architecture for datacenter scheduling to design and implement an instrument for systematic design space exploration of datacenter schedulers. We construct a formal representation of the design space for datacenter schedulers, using scheduling policies collected from real-world schedulers. We then use a genetic algorithm in combination with trace-based simulation to explore the space, optimizing for workload metrics. Through several experiments, we assess the viability of the instrument. We find that our instrument is able to identify patterns in the workloads and adapt the scheduling policies appropriately. Overall, our work leads to numerous findings, which can become valuable for future comprehension and development of schedulers. ...

Design, Validation, and Experiments

Conference paper (2019) - Georgios Andreadis, Laurens Versluis, Fabian Mastenbroek, Alexandru Iosup
Datacenters act as cloud-infrastructure to stakeholders across industry, government, and academia. To meet growing demand yet operate efficiently, datacenter operators employ increasingly more sophisticated scheduling systems, mechanisms, and policies. Although many scheduling techniques already exist, relatively little research has gone into the abstraction of the scheduling process itself, hampering design, tuning, and comparison of existing techniques. In this work, we propose a reference architecture for datacenter schedulers. The architecture follows five design principles: components with clearly distinct responsibilities, grouping of related components where possible, separation of mechanism from policy, scheduling as complex workflow, and hierarchical multi-scheduler structure. To demonstrate the validity of the reference architecture, we map to it state-of-the-art datacenter schedulers. We find scheduler-stages are commonly underspecified in peer-reviewed publications. Through trace-based simulation and real-world experiments, we show underspecification of scheduler-stages can lead to significant variations in performance. ...