A. Iosup | TU Delft Repository

The future is big graphs

A Community View on Graph Processing Systems

Journal article (2021) - Sherif Sakr, Angela Bonifati, Hannes Voigt, Alexandru Iosup, Khaled Ammar, Renzo Angles, Walid Aref, Marcelo Arenas, Maciej Besta, More authors...

Graphs are ubiquitous abstractions enabling reusable computing tools for graph processing with applications in every domain. Diverse workloads, standard models and languages, algebraic frameworks, and suitable and reproducible performance metrics will be at the core of graph processing ecosystems in the future. Academics, start-ups, and big tech companies such as Google, Facebook, and Microsoft have introduced various systems for managing and processing the growing presence of big graphs. An increasing number of use cases revealed RDBMS performance problems in managing highly connected data, motivating various startups and innovative products, such as Neo4j, Sparksee, and the current Amazon Neptune. Microsoft Trinity along with Azure SQL DB have provided an early distributed database-oriented approach to big graph management. ...

Capelin

Data-Driven Compute Capacity Procurement for Cloud Datacenters using Portfolios of Scenarios

Journal article (2021) - Georgios Andreadis, Fabian Mastenbroek, Vincent van Beek, Alexandru Iosup

Cloud datacenters provide a backbone to our digital society. Inaccurate capacity procurement for cloud datacenters can lead to significant performance degradation, denser targets for failure, and unsustainable energy consumption. Although this activity is core to improving cloud infrastructure, relatively few comprehensive approaches and support tools exist for mid-tier operators, leaving many planners with merely rule-of-thumb judgement. We derive requirements from a unique survey of experts in charge of diverse datacenters in several countries. We propose Capelin, a data-driven, scenario-based capacity planning system for mid-tier cloud datacenters. Capelin introduces the notion of portfolios of scenarios, which it leverages in its probing for alternative capacity-plans. At the core of the system, a trace-based, discrete-event simulator enables the exploration of different possible topologies, with support for scaling the volume, variety, and velocity of resources, and for horizontal (scale-out) and vertical (scale-up) scaling. Capelin compares alternative topologies and for each gives detailed quantitative operational information, which could facilitate human decisions of capacity planning. We implement and open-source Capelin, and show through comprehensive trace-based experiments it can aid practitioners. The results give evidence that reasonable choices can be worse by a factor of 1.5-2.0 than the best, in terms of performance degradation or energy consumption. ...

The AtLarge vision on the design of distributed systems and ecosystems

Conference paper (2019) - Alexandru Iosup, Laurens Versluis, Animesh Trivedi, Erwin Van Eyk, Lucian Toader, Vincent Van Beek, Giulia Frascaria, Ahmed Musaafir, Sacheendra Talluri

High-quality designs of distributed systems and services are essential for our digital economy and society. Threatening to slow down the stream of working designs, we identify the mounting pressure of scale and complexity of (eco-)systems, of ill-defined and wicked problems, and of unclear processes, methods, and tools. We envision design itself as a core research topic in distributed systems, to understand and improve the science and practice of distributed (eco-)system design. Toward this vision, we propose the AtLarge design framework, accompanied by a set of 8 core design principles. We also propose 10 key challenges, which we hope the community can address in the following 5 years. In our experience so far, the proposed framework and principles are practical, and lead to pragmatic and innovative designs for large-scale distributed systems. ...

Efficient Estimation of Read Density when Caching for Big Data Processing

Conference paper (2019) - Sacheendra Talluri, Alexandru Iosup

Big data processing systems are becoming increasingly present in cloud workloads. Consequently, they are starting to incorporate more sophisticated mechanisms from traditional database and distributed systems. We focus in this work on the use of caching policies, which for big data raise important new challenges. Not only must they respond to new variants of the trade-off between hit rate, response time, and the space consumed by the cache, but they must do so at possibly higher volume and velocity than web and database workloads. Previous caching policies have not been tested experimentally with big data workloads. We address these challenges in this work. We propose the Read Density family of policies, which is a principled approach to quantify the utility of cached objects through a family of utility functions that depend on the frequency of reads of an object. We further design the Approximate Histogram, which is a policy-based technique based on an array of counters. This technique promises to achieve runtime-space efficient computation of the metric required by the cache policy. We evaluate through trace-based simulation the caching policies from the Read Density family, and compare them with over ten state-of-the-art alternatives. We use two workload traces representative of big data processing, collected from commercial Spark and MapReduce deployments. While we achieve comparable performance to the state-of-the-art with fewer parameters, meaningful performance improvements for big data workloads remain elusive. ...

A CPU Contention Predictor for Business-Critical Workloads in Cloud Datacenters

Conference paper (2019) - Vincent Van Beek, Giorgos Oikonomou, Alexandru Iosup

Resource contention is one of the major problems in cloud datacenters. Many types of resource contention occur, with important impact on the performance and sometimes even the reliability of applications running in cloud datacenters. Cloud applications run together on the same physical machines with different workloads resulting in non-synchronized accesses to the shared resources. This leads to cases where co-hosted applications are contending for the common resources and not receiving the demanded resource amounts. In this work, we investigate the contention in CPU resources, as CPU is allowed to be over-committed by typical SLAs. We propose a CPU-contention predictor for the demanding business-critical workloads, which require low resource contention to deliver the required performance to customers. Our predictor is based on a set of regression models and metrics which we evaluate extensively. We tune the predictor with data collected from a real-world cloud operation spanning multiple datacenters and servicing business-critical workloads. ...

A Reference Architecture for Datacenter Scheduling

Design, Validation, and Experiments

Conference paper (2019) - Georgios Andreadis, Laurens Versluis, Fabian Mastenbroek, Alexandru Iosup

Datacenters act as cloud-infrastructure to stakeholders across industry, government, and academia. To meet growing demand yet operate efficiently, datacenter operators employ increasingly more sophisticated scheduling systems, mechanisms, and policies. Although many scheduling techniques already exist, relatively little research has gone into the abstraction of the scheduling process itself, hampering design, tuning, and comparison of existing techniques. In this work, we propose a reference architecture for datacenter schedulers. The architecture follows five design principles: components with clearly distinct responsibilities, grouping of related components where possible, separation of mechanism from policy, scheduling as complex workflow, and hierarchical multi-scheduler structure. To demonstrate the validity of the reference architecture, we map to it state-of-the-art datacenter schedulers. We find scheduler-stages are commonly underspecified in peer-reviewed publications. Through trace-based simulation and real-world experiments, we show underspecification of scheduler-stages can lead to significant variations in performance. ...

Characterization of a Big Data Storage Workload in the Cloud

Conference paper (2019) - Sacheendra Talluri, Cristina L. Abad, Alicja Łuszczak, Alexandru Iosup

The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats. ...

Yardstick

A benchmark for Minecraft-like services

Conference paper (2019) - Jerom Van Der Sar, Jesse Donkervliet, Alexandru Iosup

Online gaming applications entertain hundreds of millions of daily active players and often feature vastly complex architecture. Among online games, Minecraft-like games simulate unique (e.g., modifiable) environments, are virally popular, and are increasingly provided as a service. However, the performance of Minecraft-like services, and in particular their scalability, is not well understood. Moreover, currently no benchmark exists for Minecraft-like games. Addressing this knowledge gap, in this work we design and use the Yardstick benchmark to analyze the performance of Minecraft-like services. Yardstick is based on an operational model that captures salient characteristics of Minecraft-like services. As input workload, Yardstick captures important features, such as the most-popular maps used within the Minecraft community. Yardstick captures system- and application-level metrics, and derives from them service-level metrics such as frequency of game-updates under scalable workload. We implement Yardstick, and, through real-world experiments in our clusters, we explore the performance and scalability of popular Minecraft-like servers, including the official vanilla server, and the community-developed servers Spigot and Glowstone. Our findings indicate the scalability limits of these servers, that Minecraft-like services are poorly parallelized, and that Glowstone is the least viable option among those tested. ...

A mirroring architecture for sophisticated mobile games using computation‐offloading

Journal article (2018) - M.H. Jiang, Otto W. Visser, I.S.W.B. Prasetya, Alexandru Iosup

Mobile gaming is already a popular and lucrative market. However, the low performance and reduced power capacity of mobile devices severely limit the complexity of mobile games and the duration of their game sessions. To mitigate these issues, in this article, we explore using computation‐offloading, that is, allowing the compute‐intensive parts of mobile games to execute on remote infrastructure. Computation‐offloading raises the combined challenge of addressing the trade‐offs between performance and power‐consumption while also keeping the game playable. We propose Mirror, a system for computation‐offloading that supports the demanding performance requirements of sophisticated mobile games. Mirror proposes several conceptual contributions: support for fine‐grained partitioning, both offline (set by developers) and dynamic (policy‐based), and real‐time asynchronous offloading and user‐input synchronization protocols that enable Mirror‐based systems to bound the delays introduced by offloading and thus to achieve adequate performance. Mirror is compatible with all games that are tick‐based and user‐input deterministic. We implement a real‐world prototype of Mirror and apply it to the real‐world, complex, popular game OpenTTD. The experimental results show that, in comparison with the non‐offloaded OpenTTD, Mirror‐ed OpenTTD can significantly improve performance and power consumption while also delivering smooth gameplay. As a trade‐off, Mirror introduces acceptable delay on user inputs. ...

An Experimental Performance Evaluation of Autoscalers for Complex Workflows

Journal article (2018) - Alexey Ilyushkin, Ahmed Ali-Eldin, Nikolas Herbst, André Bauer, Alessandro Papadopoulos, Dick Epema, Alexandru Iosup

Elasticity is one of the main features of cloud computing allowing customers to scale their resources based on the workload. Many autoscalers have been proposed in the past decade to decide on behalf of cloud customers when and how to provision resources to a cloud application based on the workload utilizing cloud elasticity features. However, in prior work, when a new policy is proposed, it is seldom compared to the state-of-the-art, and is often compared only to static provisioning using a predefined quality of service target. This reduces the ability of cloud customers and of cloud operators to choose and deploy an autoscaling policy, as there is seldom enough analysis on the performance of the autoscalers in different operating conditions and with different applications. In our work, we conduct an experimental performance evaluation of autoscaling policies, using as application model workflows, a popular formalism for automating resource management for applications with well-defined yet complex structures. We present a detailed comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the seven policies, we conduct various experiments and compare their performance in both pairwise and group comparisons. We report both individual and aggregated metrics. As many workflows have deadline requirements on the tasks, we study the effect of autoscaling on workflow deadlines. Additionally, we look into the effect of autoscaling on the accounted and hourly based charged costs, and we evaluate performance variability caused by the autoscaler selection for each group of workflow sizes. Our results highlight the trade-offs between the suggested policies, how they can impact meeting the deadlines, and how they perform in different operating conditions, thus enabling a better understanding of the current state-of-the-art. ...

Exploring HPC and Big Data Convergence

A Graph Processing Study on Intel Knights Landing

Conference paper (2018) - Alexandru Uta, Ana Lucia Varbanescu, Ahmed Musaafir, Chris Lemaire, Alexandru Iosup

The question 'Can big data and HPC infrastructure converge?' has important implications for many operators and clients of modern computing. However, answering it is challenging. The hardware is currently different, and fast evolving: big data uses machines with modest numbers of fat cores per socket, large caches, and much memory, whereas HPC uses machines with larger numbers of (thinner) cores, non-trivial NUMA architectures, and fast interconnects. In this work, we investigate the convergence of big data and HPC infrastructure for one of the most challenging application domains, the highly irregular graph processing. We contrast through a systematic, experimental study of over 300,000 core-hours the performance of a modern multicore, Intel Knights Landing (KNL) and of traditional big data hardware, in processing representative graph workloads using state-of-the-art graph analytics platforms. The experimental results indicate KNL is convergence-ready, performance-wise, but only after extensive and expert-level tuning of software and hardware parameters. ...

An Elasticity Study of Distributed Graph Processing

Conference paper (2018) - Sietse Au, Alexandru Uta, Alexey Ilyushkin, Alexandru Iosup

Graphs are a natural fit for modeling concepts used in solving diverse problems in science, commerce, engineering, and governance. Responding to the variety of graph data and algorithms, many parallel and distributed graph processing systems exist. However, until now these platforms use a static model of deployment: they only run on a pre-defined set of machines. This raises many conceptual and pragmatic issues, including misfit with the highly dynamic nature of graph processing, and could lead to resource waste and high operational costs. In contrast, in this work we explore a dynamic model of deployment. We first characterize workload dynamicity, beyond mere active-vertex variability. Then, to conduct an in-depth elasticity study of distributed graph processing, we build a prototype, JoyGraph, which is the first such system that implements complex, policy-based, and fine-grained elasticity. Using the state-of-the-art LDBC Graphalytics benchmark and the SPEC Cloud Group's elasticity metrics, we show the benefits of elasticity in graph processing: (i) improved resource utilization, (ii) reduced operational costs, and (iii) aligned operation-workload dynamicity. Furthermore, we explore the cost of elasticity in graph processing. We identify a key drawback: although elasticity does not degrade application throughput, graph-processing workloads are sensitive to data movement while leasing or releasing resources. ...

Serverless is More

From PaaS to Present Cloud Computing

Journal article (2018) - Erwin Van Eyk, Lucian Toader, Sacheendra Talluri, Laurens Versluis, Alexandru Uta, Alexandru Iosup

In the late-1950s, leasing time on an IBM 704 cost hundreds of dollars per minute. Today, cloud computing, that is, using IT as a service, on-demand and pay-per-use, is a widely used computing paradigm that offers large economies of scale. Born from a need to make platform as a service (PaaS) more accessible, fine-grained, and affordable, serverless computing has garnered interest from both industry and academia. This article aims to give an understanding of these early days of serverless computing: what it is, where it comes from, what is the current status of serverless technology, and what are its main obstacles and opportunities. ...

Elasticity in Graph Analytics?

A Benchmarking Framework for Elastic Graph Processing

Conference paper (2018) - Alexandru Uta, Sietse Au, Alexey Ilyushkin, Alexandru Iosup

Graphs are a natural fit for modeling concepts used in solving diverse problems in science, commerce, engineering, and governance. Responding to the diversity of graph data and algorithms, many parallel and distributed graph-processing systems exist. However, until now these platforms use a static model of deployment: they only run on a pre-defined set of machines. This raises many conceptual and pragmatic issues, including misfit with the highly dynamic nature of graph processing, and could lead to resource waste and high operational costs. In contrast, in this work we explore the benefits and drawbacks of the dynamic model of deployment. Building a three-layer benchmarking framework for assessing elasticity in graph analytics, we conduct an in-depth elasticity study of distributed graph processing. Our framework is composed of state-of-the-art workloads, autoscalers, and metrics, derived from the LDBC Graphalytics benchmark and SPEC RG Cloud Group’s elasticity metrics. We uncover the benefits and cost of elasticity in graph processing: while elasticity allows for fine-grained resource management, and does not degrade application performance, we find that graph workloads are sensitive to data migration while leasing or releasing resources. Moreover, we identify non-trivial interactions between scaling policies and graph workloads, which add an extra level of complexity to resource management and scheduling for graph processing. ...

The Power of Social Features in Online Gaming

Book chapter (2018) - Fernando Kuipers, Marcus Märtens, Ernst van der Hoeven, Alexandru Iosup

Within the vast and rich field of online gaming, a new generation of Online Social Games (OSGs) is emerging that have in common a core of social interaction, sometimes explicit, other times implicit. This common core of social experience promises to become at least as important as the experience derived from the game-world itself. In this chapter, we consider the social side of OSGs and provide the following contributions: 1. We motivate the importance of taking social features into account to improve the quality of experience in online gaming. 2. We discuss the various dimensions of (player experience in) OSGs. 3. We describe a social network analysis methodology for identifying relations in OSGs and indicate how this methodology could be used to improve the game-play experience. 4. We also consider and illustrate how certain “social” behaviour, like toxicity, is negative and may harm the game-play experience, if not adequately addressed. 5. We mention several directions for future research to put the power of social features in OSGs to good use. ...

Massivizing computer systems

A vision to understand, design, and engineer computer ecosystems through and beyond modern distributed systems

Conference paper (2018) - Alexandru Iosup, Alexandru Uta, Laurens Versluis, Georgios Andreadis, Erwin Van Eyk, Tim Hegeman, Sacheendra Talluri, Vincent Van Beek, Lucian Toader

Our society is digital: industry, science, governance, and individuals depend, often transparently, on the inter-operation of large numbers of distributed computer systems. Although the society takes them almost for granted, these computer ecosystems are not available for all, may not be affordable for long, and raise numerous other research challenges. Inspired by these challenges and by our experience with distributed computer systems, we envision Massivizing Computer Systems, a domain of computer science focusing on understanding, controlling, and evolving successfully such ecosystems. Beyond establishing and growing a body of knowledge about computer ecosystems and their constituent systems, the community in this domain should also aim to educate many about design and engineering for this domain, and all people about its principles. This is a call to the entire community: there is much to discover and achieve. ...

Granula: Toward Fine-grained Performance Analysis of Large-scale Graph Processing Platforms

Conference paper (2017) - Wing Ngai, Tim Hegeman, Stijn Heldens, Alexandru Iosup

Big Data processing has become an integral part of many applications that are vital to our industry, academic endeavors, and society at large. To cope with the data deluge, existing Big Data platforms require significant conceptual and engineering advances. In particular, Big Data platforms for large-scale graph processing require in-depth performance analysis to continue to support the broad applicability of linked data processing. However, in-depth performance analysis of such platforms remains challenging due to many factors, among which are the inherent complexity of the platforms, the limited insight provided by coarse-grained "black-box" analysis, the inefficiency of fine-grained analysis, and the lack of reusability of results. In this work, we propose Granula, a performance analysis system for Big Data platforms that focuses on graph processing. Granula facilitates the complex, end-to-end processes of fine-grained performance modeling, monitoring, archiving, and visualization. It offers a comprehensive evaluation process that can be iteratively tuned to deliver more fine-grained performance information. We showcase with a prototype of Granula how it can provide meaningful insights into the operation of two large-scale graph processing platforms, Giraph and PowerGraph. ...

An Analysis on a YouTube-like UGC site with Enhanced Social Features

Conference paper (2017) - Adele Jia, Siqi Shen, Shengling Chen, Dongsheng Li, Alexandru Iosup

YouTube-like User Generated Content (UGC) sites are nowadays entertaining over a billion people. Resource provision is essential for these giant UGC sites as they allow users to request videos from a potentially unlimited selection in an asynchronous fashion. Still, the UGC sites are seeking to create new viewing patterns and social interactions that would engage and attract more users and complicate the already rigorous resource provision problem. In this paper, we seek to combine these two tasks by leveraging social features to provide the reference for resource provision. To this end, we conduct an extensive measurement and analysis of BiliBili, a YouTube-like UGC site with enhanced social features including user following, chat replay, and virtual money donation. Based on datasets that capture the complete view of BiliBili, containing over 2 million videos and over 28 million users, we characterize its video repository and user activities, we demonstrate the positive reinforcement between on-line social behavior and upload behavior, we propose graph models that reveal user relationships and high-level social structures, and we successfully apply our findings to build machine-learnt classifiers to identify videos that will need priority in resource provision. ...

ANANKE: a Q-Learning-Based Portfolio Scheduler for Complex Industrial Workflows

Technical Report DS-2017-001

Report (2017) - Shenjun Ma, Alexey Ilyushkin, Alexander Stegehuis, Alexandru Iosup

Complex workflows that process sensor data are useful for industrial infrastructure management and diagnosis. Although running such workflows in clouds promises reduced operational costs, there are still numerous scheduling challenges to overcome. Such complex workflows are dynamic, exhibit periodic patterns, and combine diverse task groupings and requirements. In this work, we propose ANANKE, a scheduling system addressing these challenges. Our approach extends the state-of-the-art in portfolio scheduling for datacenters with a reinforcement-learning technique, and proposes various scheduling policies for managing complex workflows. Portfolio scheduling addresses the dynamic aspect of the workload. Reinforcement learning, based in this work on Q-learning, allows our approach to adapt to the periodic patterns of the workload, and to tune the other configuration parameters. The proposed policies are heuristics that guide the provisioning process, and map workflow tasks to the provisioned cloud resources. Through real-world experiments based on real and synthetic industrial workloads, we analyze and compare our prototype implementation of ANANKE with a system without portfolio scheduling (baseline) and with a system equipped with a standard portfolio scheduler. Overall, our experimental results give evidence that a learning-based portfolio scheduler can perform better (5–20%) and cost less (20–35%) than the considered alternatives. ...

The OpenDC Vision: Towards Collaborative Datacenter Simulation and Exploration for Everybody

Conference paper (2017) - Alexandru Iosup, Georgios Andreadis, Vincent van Beek, Matthijs Bijman, Erwin van Eyk, Mihai Neacsu, Leon Overweel, Sacheendra Talluri, Laurens Versluis, Maaike Visser

In the new Digital Economy, massive computer systems, often grouped in datacenters, serve as factories "producing" cloud services with massive consumption. However, to afford cloud services globally, we must address new research challenges in designing, operating, and using modern datacenters. We must also address challenges in educating and training the next generation of datacenter engineers. Addressing such challenges, in this work we present our vision on OpenDC: we envision the exploration of various datacenter concepts and technologies, using existing and new scientific methods, enabling new education practices and topics, and leading to the creation of new software and data artifacts. We present the datacenter concepts and technologies we are currently planning to explore using OpenDC. We identify the scientific methods we want to use, and explain our vision of education practices. We present the architecture and open-source program underlying the OpenDC software, and the format and open-access data we use for datacenter experiments. We conclude with an open invitation for the community to join our effort. ...