| 1 |
|
Understanding and Improving the Performance Consistency of Distributed Computing Systems
With the increasing adoption of distributed systems in both academia and industry, and with the increasing computational and storage requirements of distributed applications, users inevitably demand more from these systems. Moreover, users also depend on these systems for latency and throughput sensitive applications, such as interactive perception applications and MapReduce applications, which make the performance of these systems even more important. Therefore, for the users it is very important that distributed systems provide consistent performance, that is, the system provides a similar level of performance at all times. In this thesis we address the problem of understanding and improving the performance consistency of state-of-the-art distributed computing systems. Towards this end, we take an empirical approach and we investigate various resource management, scheduling, and statistical modeling techniques with real system experiments in diverse distributed systems, such as clusters, multi-cluster grids, and clouds, using various types of workloads, such as Bags-of-tasks (BoTs), interactive perception applications, and scientific workloads.
In addition, as failures are known to be an important source of significant performance inconsistency, we also provide fundamental insights into the characteristics of failures in distributed systems, which is required to design systems that can mitigate the impact of failures on performance consistency.
|
[PDF]
[Abstract]
|
| 2 |
|
Performance Analysis of Chainsaw-based Live P2P Video Streaming
Due to the growing popularity of viewing media over the Internet, content servers are suffering from more and more stress every day. This problem is traditionally solved by enhancing the server infrastructure at the content provider, which is
effective but also costly. A more cost effective solution would be to use P2P technology to distribute the media stream in real-time. For this purpose, the Chainsaw algorithm has been proposed, which performs very well in simulations. However,
Chainsaw has not been implemented in a real video player yet. We have built our own version of Chainsaw called Kettingzaag, and we have added some improvements and features which make it more resillient to errors, such as multiple description coding. Kettingzaag is put to the test in our own video player called Lumberjack, on the DAS-3 supercomputer in Delft. Our experiments show that the Kettingzaag algorithm performs well for network sizes up to a hundred nodes, and is likely to perform just as well for larger network sizes.
|
[PDF]
[Abstract]
|
| 3 |
|
A workload model for MapReduce
MapReduce is a parallel programming model used by Cloud service providers for data mining. To be able to enhance existing and to develop new MapReduce sys- tems, we need to evaluate the performance of these systems. To this end we intro- duce in this work the Cloud Workloads Archive Toolbox. This toolbox facilitates the analysis of MapReduce workload traces, generation of realistic synthetic work- loads, and the evaluation of MapReduce systems in simulation. We present an overview and analysis of real world MapReduce workload traces, we propose a model for MapReduce workloads, we describe the development of the toolbox, and we present an experiment in which we use our toolbox to evaluate two MapReduce schedulers.
|
[PDF]
[Abstract]
|
| 4 |
|
Adding Cycle Scavenging Support to the Koala Grid Resource Manager
Cycle scavenging (CS) is the process of using otherwise idle computational resources to provide large, aggregate, amounts of computational power. It is the core principle of so-called desktop grids and volunteer computing, which use the idle cycles of desktop computers to do computations. However, resources in a multicluster grid likewise may have idle time, and so, multi-cluster grids go through periods of low efficiency. In addition, many practical grid applications are of the Bag-of-Tasks (BoT) type, which are large collections of embarrassingly parallel tasks. These require a vast amount of computational power, which multi-cluster grids can provide. Claiming an entire grid for one application would however not be fair to other grid users. In this thesis, we design a system for multi-cluster grids that detects and uses otherwise idle resources, specifically to execute BoTs. By combining CS with the computational power of the grid, we increase grid efficiency, we are able to provide the resources needed by BoTs, and, at the same time,we can guarantee unobtrusiveness to all grid users. The system we design, KOALACS,
is an extension of the KOALA grid resource manager. We create a framework for submitting CS jobs through KOALA, and implement several policies for fair sharing among such CS jobs. We then evaluate the performance of the complete
system, and demonstrate its capabilities, through a series of experiments on a real grid system, the DAS-3. With these experiments, we show that KOALA-CS does not hinder non-CS grid users, that it ensures fair sharing of resources among CS jobs, and that it is robust and fault tolerant.
|
[PDF]
[Abstract]
|
| 5 |
|
The design and implementation of the KOALA grid resource management system
Grid computing is an emerging form of distributed computing, distinguished from traditional forms by its focus on large-scale,
multi-organizational resource sharing and innovative applications. To harness shared resources across multiple organizations, and at the same time to deal with the challenges of grid resource management and of deploying grid applications, we have developed KOALA, a Grid Resource Management System. KOALA supports co-allocation, i.e., the allocation of both processors and data to single applications in multiple sites and the simultaneous access to these resources by the applications. By supporting co-allocation, KOALA addresses new challenges presented to resource management in grids by co-allocation of allocating resources in multiple sites, guaranteeing the simultaneous availability of the co-allocated resources, and managing sets of highly dynamic grid resources. KOALA also has mechanisms that simplify the deployment of different application types on the grid. Without these mechanisms, deployment of different application types on grids is difficult because of the characteristics of the grid applications and of the grid infrastructure. KOALA has been deployed on the Distributed ASCI Supercomputer (DAS), which is an experimental computer testbed in the Netherlands that is exclusively used for research on parallel, distributed, and grid computing. KOALA has proven to be working reliably on the DAS testbed with over 500,000 jobs already submitted to it.
|
[PDF]
[Abstract]
|
| 6 |
|
A framework for the study of grid inter-operation mechanisms
The study of the history of computing infrastructures reveals an integration trend. For example, the explosive growth of the Internet in the 1990s was the result of an integration process started in the 1960s with the emerging networks of computers. By using the Internet, millions of users were capable of accessing information anytime and anywhere, much like other daily utilities such as water, electricity, and telephone. However, an important category of users remained under-served: the users with large computational and storage requirements, e.g., the scientists, the companies that focus on data analysis, and the governmental departments that manage the interaction between the state and the population (such as census, tax, and public health). Thus, in the mid-1990s, the vision of the (Computing) Grid as a universal computing utility was formulated. The main benefits promised by the Grid are similar to those of other integration efforts: extended and optimized service of the integrated network, and significant reductions of maintenance and operation costs through sharing and better scheduling. While the universal Grid has yet to be developed, large-scale distributed computing infrastructures that provide their users with seamless and secure access to computing resources, individually called Grid parts or simply grids, have been built throughout the world---in different countries, for different sciences, and both for production work and for computer-science research. At the same time, the main technological alternatives to grids, that is, supercomputers and large clusters, have evolved into much larger, scalable, and reliable systems. Thus, the integration of existing grids into larger infrastructures and finally into The Grid is key in keeping the grid vision attractive for its potential users. The integration of grids raises a double challenge, one related with the efficient scaling of a distributed computing system, the second associated with the operation of a system across different ownership and administrative domains. Thus, many of the traditional approaches for inter-operating computer systems, such as those based on completely centralized or purely decentralized system architectures, are eliminated from the start. To mark the distinction between the typical problem of integrating smaller components into a larger system and the double challenge of grid integration, we call the latter the problem of grid inter-operation. In this thesis we approach the problem of grid inter-operation with two main objectives: to design a comprehensive framework for the study of grid inter-operation mechanisms, and to provide an initial but good solution for this problem. We design a framework for the study of grid inter-operation that includes a toolbox for grid inter-operation research and a method for the study of grid inter-operation mechanisms. In the research toolbox we include the Grid Workloads Archive (GWA), a comprehensive model for grid resources and workloads, the GrenchMark performance evaluation framework, and the Delft Grid Simulation (DGSim) framework for repeated and realistic simulations of multi-cluster and multi-grid environments. The GWA and our comprehensive model show that grid computing is mostly used in practice for single-processor jobs and not for parallel computing, which raises previously ignored challenges related to the volume of jobs to be managed. We also devise in this thesis a method for studying grid inter-operation mechanisms. We answer using our framework important questions regarding existing grid operation mechanisms, and in particular show that these mechanisms are too limited to cope with real and realistic conditions. We further demonstrate the usefulness of our framework by designing Delegated MatchMaking, a novel mechanism for inter-operating grids. This mechanism is used to operate an architecture that is a hybrid between hierarchical and purely decentralized architectures. The Delegated Matchmaking mechanism attempts to use the local resources of a grid as much as possible and also transparently extends the local environment with resources obtained (delegated) from other sites when resources are not available locally. Our approach is compared with five alternatives through trace-based simulations, and is found to deliver the best performance, especially when the system is heavily loaded. While many other mechanisms can be designed in the future, our experiments prove that the Delegated MatchMaking approach already is a good solution for the problem of grid inter-operation. Our experiments also demonstrate that having grids inter-operate leads to better performance than having the same grids operate independently.
|
[PDF]
[Abstract]
|
| 7 |
|
Reducing Free Riding in Peer-to-Peer Systems with Supervised Teaming
Peer-to-peer technology has produced thriving communities in which peers contribute bandwidth to each other. However, when free riding occurs, these communities will not be able to sustain themselves without incentives to enforce this contribution. This thesis presents Supervised Teaming, a peer-to-peer transfer protocol that gives uploading peers, called supervisors, control over a team of downloading peers, called team members. The team members are given an incentive to transfer data among each other, effectively reducing the upload cost of the supervisors to one piece of data while still duplicating this piece to every team member. Using transport efficiency, time efficiency, and sharing ratio as performance metrics, we prove that Supervised Teaming performs equally to BitTorrent under best-case scenarios and several factors better, depending on the chosen team size, under flash crowds and free riding scenarios. Furthermore, we have implemented a peer-to-peer client that can use both the BitTorrent protocol and a simplified version of the Supervised Teaming protocol. The experiments we have performed with this client verify that Supervised Teaming performs as expected.
|
[PDF]
[Abstract]
|
| 8 |
|
Trace-based performance analysis of scheduling bags of tasks in grids
Grid computing promises large scale computing facilities based on distributed systems. Much research has been done on the subject of increasing the performance of grids. We believe that an adequate performance analysis of grids requires knowledge
of the workload and the architecture of the grid. Currently, researchers assume that grids are similar to other distributed systems, such as massively parallel computers. However, workloads in grids differ from other distributed systems, because
they consist for a significant part of bags-of-tasks. This research presents a method to model the workload of grids realistically, which enables us to analyze the performance of those systems. We have created a flexible workload model that is specifically tailed for grids. The model explicitly handles bag-of-tasks, which comprise the majority of grid workloads. This workload model has been built using a vast amount of workload trace data from seven real-world grids. The workload model enables us to conduct a performance analysis in which we analyze the impact of several workload characteristics, task selection and scheduling policies, and resource management architectures on system performance. We use simulations to systematically and realistically investigate the system performance in various scenarios. This research has resulted in a grid performance analysis toolbox, a software package that allows researchers to analyze, model, and generate workloads of grids. In addition, we have contributed trace data and analysis to the community by means of the Grid Workloads Archive, an on-line archive of trace data analyzed with our toolbox.
|
[PDF]
[Abstract]
|