1 

Application-Oriented Scheduling in Multicluster Grids
Grid computing appeared in the mid-1990s with the vision of sharing large, geographically dispersed computational resources for executing computation-intensive scientific applications. Today, numerous grid projects run successfully to solve challenging scientific problems, such as the grid project of the European Organization for Nuclear Research (CERN), which combines thousands of computers worldwide (over 200 sites in about 30 countries) to store and analyze the huge amounts of data produced by the Large Hadron Collider (LHC) at CERN.
The resources in a grid system are typically heterogeneous since they belong to different administrative domains, and they are managed by proprietary policies. To cope with this heterogeneity, a grid relies on a layer of middleware, which offers transparent access to the distributed resources and simplifies the collaboration between organizations.
Grids also need high-level scheduling systems that use grid middleware to map application tasks to resources and then manage their execution on behalf of users. However, scheduling in grids is challenging due to the dynamic nature of grid resources as well as the lack of control over those resources. The wide variety in the structural and communication characteristics of the applications submitted to grids further complicates grid scheduling, and may lead to poor or unpredictable performance unless these characteristics are taken into account.
In this thesis we address the challenge of designing and analyzing realistic and practical application-oriented scheduling mechanisms in multicluster grid systems. Application-oriented scheduling focuses on the optimization of user-centric performance criteria, such as application execution time, with methods that are specialized for different types of applications. In this thesis we cover a wide range of grid application types, including parallel applications that may need co-allocation or malleability, bags-of-tasks that can benefit from cycle scavenging, and workflow applications that may push the system to its limits with their computation and data requirements. We investigate the performance of our scheduling mechanisms and policies in a real multicluster grid system, the DAS, using our KOALA multicluster grid scheduler, as well as with simulations using realistic scenarios.


2 

Trace-based performance analysis of scheduling bags of tasks in grids
Grid computing promises large-scale computing facilities based on distributed systems. Much research has been done on increasing the performance of grids. We believe that an adequate performance analysis of grids requires knowledge of both the workload and the architecture of the grid. Currently, researchers assume that grids are similar to other distributed systems, such as massively parallel computers. However, workloads in grids differ from those of other distributed systems, because they consist for a significant part of bags-of-tasks. This research presents a method to model the workload of grids realistically, which enables us to analyze the performance of those systems. We have created a flexible workload model that is specifically tailored for grids. The model explicitly handles bags-of-tasks, which comprise the majority of grid workloads. This workload model has been built using a vast amount of workload trace data from seven real-world grids. The workload model enables us to conduct a performance analysis in which we analyze the impact of several workload characteristics, task selection and scheduling policies, and resource management architectures on system performance. We use simulations to systematically and realistically investigate system performance in various scenarios. This research has resulted in a grid performance analysis toolbox, a software package that allows researchers to analyze, model, and generate workloads of grids. In addition, we have contributed trace data and analysis to the community by means of the Grid Workloads Archive, an online archive of trace data analyzed with our toolbox.
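As an illustration of what such a bags-of-tasks workload model produces, the sketch below generates synthetic bags with bag-level interarrival times, batch sizes, and runtimes. The distributions and parameters here are illustrative placeholders, not the fitted values obtained from the traces.

```python
import random

def generate_bot_workload(n_bots, seed=0):
    """Toy bag-of-tasks workload: each bag is a batch of independent tasks
    submitted together. Distributions and parameters are illustrative only."""
    rng = random.Random(seed)
    jobs, t = [], 0.0
    for bot_id in range(n_bots):
        t += rng.expovariate(1.0 / 60.0)                  # bag interarrival, mean 60 s
        size = max(1, int(rng.lognormvariate(1.5, 1.0)))  # tasks per bag
        base = rng.lognormvariate(4.0, 1.2)               # bag-level mean runtime (s)
        for task_id in range(size):
            # tasks within a bag have similar (but not identical) runtimes
            jobs.append({"bot": bot_id, "task": task_id,
                         "submit": t, "runtime": base * rng.uniform(0.8, 1.2)})
    return jobs

jobs = generate_bot_workload(100)
```

A trace in this per-task dictionary form can be fed directly to a simulator that replays submit times against a scheduling policy.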


3 

High-performance Processing in Networked and Grid Environments
In this dissertation, we present several techniques to achieve high-performance processing in networked and grid environments. Many applications need a high-performance processing system to execute efficiently.
High-performance processing mainly stems from parallelism. Grid computing is an attractive way to exploit this parallelism by executing either different parts of an application or several applications in parallel. In a grid system, the most important resources are computing and communication resources. The computing resources are the processors in the nodes of the grid. Communication within the grid is important for distributing tasks and their required data to the nodes of the grid.
We propose an innovative high-performance platform to utilize reconfigurable processors in grid environments. Furthermore, we focus on communication infrastructures and network processing (the processing required for packets) platforms in order to utilize them in grid environments. The collaboration of reconfigurable processors in a grid environment is presented, and several compute-intensive multimedia kernels are simulated. Subsequently, we introduce three approaches to accelerate network processing tasks using Bloom filters in networked and grid environments. The first and second techniques present a cache architecture for a counting Bloom filter (CCBF) and a memory optimization approach for Bloom filters using an additional hashing function (BFAH). The third technique proposes a power-efficient pipelined Bloom filter.
We present the results of our proposed approaches for the collaboration of reconfigurable processors in grid computing (CRGC) and for Bloom filters in network processing applications, e.g., packet classification. The experimental results show that the CRGC approach improves performance by up to 7.2x and 2.5x compared to a single general-purpose processor (GPP) and to the collaboration of GPPs, respectively. The results of the CCBF and BFAH for packet classification show that the proposed techniques decrease the number of memory accesses compared to a standard Bloom filter.
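As a reference point for the Bloom filter variants discussed above, the baseline counting Bloom filter can be sketched as follows. The sizing, hash construction, and API are illustrative, not the dissertation's hardware designs.

```python
import hashlib

class CountingBloomFilter:
    """Minimal counting Bloom filter: per-slot counters instead of bits,
    so items (e.g., network flows) can be deleted as well as inserted.
    Parameters m (slots) and k (hash functions) are illustrative."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, item):
        # Derive k indexes from one digest (double-hashing construction).
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def remove(self, item):
        for i in self._indexes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

Every lookup touches k counter slots; the CCBF and BFAH described above target exactly this memory-access cost.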


4 

Quantifying Pathology in Diffusion Weighted MRI
In this thesis, algorithms are proposed for the quantification of pathology in Diffusion Weighted Magnetic Resonance Imaging (DWMRI) data. Functional evidence for brain diseases can be explained by specific structural loss in the white matter of the brain. That is, certain biomarkers may exist where the disease inhibits proper functioning. Axonal and myelin sheath damage hamper neural connectivity. This can be assessed in vivo by measuring a change in the diffusion of water molecules. DWMRI may deliver such biomarkers by capturing subtle changes in the diffusion process at an early disease stage. Diffusion-tensor-derived scalar measures such as the Mean Diffusivity and Fractional Anisotropy (FA) quantify this process. When comparing such measures at the group level, patients may be found to significantly differ from healthy controls.
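The two tensor-derived measures mentioned above have standard closed forms in the eigenvalues of the 3x3 diffusion tensor; a minimal computation, with illustrative tensor values, is:

```python
import numpy as np

def md_and_fa(D):
    """Mean Diffusivity and Fractional Anisotropy of a symmetric 3x3
    diffusion tensor D (standard tensor-derived scalar measures)."""
    lam = np.linalg.eigvalsh(D)            # eigenvalues l1, l2, l3
    md = lam.mean()                        # MD = (l1 + l2 + l3) / 3
    num = np.sqrt(((lam - md) ** 2).sum())
    den = np.sqrt((lam ** 2).sum())
    fa = np.sqrt(1.5) * num / den if den > 0 else 0.0
    return md, fa

# Isotropic diffusion gives FA = 0; a strongly anisotropic (fiber-like)
# tensor gives FA close to 1. Values are illustrative, in mm^2/s.
md_iso, fa_iso = md_and_fa(np.diag([1e-3, 1e-3, 1e-3]))
md_ani, fa_ani = md_and_fa(np.diag([1.7e-3, 0.2e-3, 0.2e-3]))
```

FA is dimensionless and normalised to [0, 1], which is what makes it comparable across subjects in group studies.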
Conventional analysis treats the measurements in each voxel independently. Due to the high connectivity between brain regions, we hypothesize that multiple brain regions are involved in complex brain diseases such as schizophrenia. We introduce a new machine learning framework for performing such analyses in comparative studies on volumetric data. By 'shaving' the mapping computed by discriminant analysis, a characteristic set of regions is automatically extracted. We opt to 'prune' the dataset by iteratively discarding misclassified subjects from the cohort, such that the mapping is based merely on representative prototypes. Then a comparison is made between a linear and a nonlinear kernel discriminant analysis, to identify the dimensionality of the problem.
Methods are proposed for modeling the 30% of white matter that cannot be adequately described by a single tensor. These regions are currently disregarded in comparative studies. The more complex diffusion models that are introduced need to be adequately evaluated. We propose a new method for creating experimental phantom data of fiber crossings by mixing the DWI signals from high-FA regions. These models demand HARDI (High Angular Resolution Diffusion Imaging) data acquired with higher SNR, diffusion weighting, and angular resolution. In comparative studies, scanning time may be insufficient for meeting these requirements. We propose a method to create a dual tensor atlas from multiple coregistered non-HARDI datasets. The random fluctuations in the pose of subjects in the scanner, as well as anatomical heterogeneity, contribute to an increased angular resolution. Finally, we build an optimization framework for estimating both diffusion shape and orientation in fiber crossings. This work sets fundamental limits for comparative studies to correctly analyze crossing white matter structures. Firstly, it assesses the precision and accuracy with which parameters may be estimated. Secondly, the optimal acquisition parameters are selected in order to do so. This model allows for estimating consistent FA profiles along crossing tracts.
Fiber tracts provide a specific frame of reference for computing statistics. We perform 3D tractbased comparison of tensorderived indices between groups. Intersubject correspondence is achieved by nonrigid registration based on a joint clustering and matching. The clustering delivers atlas points that serve as a frame of reference for performing the analysis. Spatial consistency is taken to reflect a significant difference between groups.
The potential of our methods is demonstrated in two comparative studies: one on childhood cancer survivors and one on amyotrophic lateral sclerosis patients.
High-throughput analysis of our data is realized by adopting a grid computing approach. The grid provides fast and easy access to shared resources. We present our progress over the past four years in the development and porting of the DWMRI analysis pipeline to grids. By doing so, our algorithms and results can also be accessed by fellow researchers.


5 

Fast solution of nonsymmetric linear systems on Grid computers using parallel variants of IDR(s)
IDR(s) is a family of fast algorithms for iteratively solving large nonsymmetric linear systems [14]. With cluster computing, and in particular with Grid computing, the inner product is a bottleneck operation. In this paper, three techniques are combined to alleviate this bottleneck. Firstly, the efficient and stable IDR(s) algorithm from [16] is reformulated so that it has a single global synchronisation point per iteration step. Secondly, the so-called test matrix is chosen so that the work, communication, and storage involving this matrix are minimised in multi-cluster environments. Finally, a methodology is presented for a priori estimation of the optimal value of s using only problem- and machine-based parameters. Numerical experiments applied to a 3D convection-diffusion problem are performed on the DAS-3 Grid computer, demonstrating the effectiveness of these three techniques.
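The "single global synchronisation point" idea can be illustrated in a small shared-memory sketch: the s inner products against the columns of the test matrix are grouped into one matrix-vector product, so a distributed implementation needs one reduction of an s-vector instead of s separate reductions. This shows only the communication-aggregation idea, not the IDR(s) recurrence itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, s = 10000, 4
P = rng.standard_normal((n, s))   # the s test vectors, as columns
r = rng.standard_normal(n)        # a residual-like vector

# s separate dot products: on a distributed machine, s global reductions
separate = np.array([P[:, j] @ r for j in range(s)])

# one matrix-vector product: a single reduction of an s-vector
batched = P.T @ r

assert np.allclose(separate, batched)
```

On a multi-cluster, where a global reduction costs a wide-area round trip, `batched` corresponds to one message per iteration step instead of s.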


6 

Exploiting the flexibility of IDR(s) for grid computing
The IDR(s) method proposed in [26] is an efficient limited-memory method for solving large nonsymmetric systems of linear equations. In [11] an IDR(s) variant is described that has a single synchronisation point per iteration step, which makes this variant well-suited for parallel and grid computing. In this paper, we combine this IDR(s) variant with an asynchronous preconditioning iteration to further improve the performance of IDR(s) on a grid computer. Asynchronous preconditioners do not require expensive synchronisation and adapt to volatile computational resources, and are therefore well-suited for such a computational environment. However, an asynchronous preconditioning operation is also non-constant by nature: the preconditioner changes in every iteration. The success of combining IDR(s) with an asynchronous preconditioner therefore depends on the flexibility of IDR(s). We explain why IDR(s) can be used as a flexible method, and we successfully use the combination of IDR(s) with an asynchronous preconditioner for solving large convection-diffusion problems. The numerical experiments are performed on the DAS-3 grid computer, which is composed of five geographically separated parallel clusters.
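The flexibility requirement can be illustrated with a deliberately simple stand-in: a flexible outer iteration (plain preconditioned Richardson here, rather than IDR(s)) whose preconditioner is a varying number of Jacobi sweeps, mimicking an inner iteration whose progress differs from step to step. All sizes and coefficients are illustrative.

```python
import numpy as np

def jacobi_sweeps(A, r, sweeps):
    # Approximate A^{-1} r with a few Jacobi sweeps; the sweep count varies
    # per call, mimicking an asynchronous (non-constant) preconditioner.
    D = np.diag(A)
    z = np.zeros_like(r)
    for _ in range(sweeps):
        z = z + (r - A @ z) / D
    return z

def flexible_richardson(A, b, max_iter=2000, tol=1e-10):
    # Outer iteration that tolerates a different preconditioner in each step.
    x = np.zeros_like(b)
    rng = np.random.default_rng(0)
    for _ in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        x = x + jacobi_sweeps(A, r, sweeps=int(rng.integers(1, 5)))
    return x

# 1D convection-diffusion model problem (illustrative size and coefficients)
n = 20
h = 1.0 / (n + 1)
diff = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
conv = (np.eye(n, k=1) - np.eye(n, k=-1)) / (2.0 * h)
A = diff + conv
b = np.ones(n)
x = flexible_richardson(A, b)
```

A Krylov method without flexibility would build its recurrences on the assumption of one fixed preconditioning operator; the outer loop above makes no such assumption, which is the property the abstract attributes to flexible IDR(s).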


7 

Understanding and Improving the Performance Consistency of Distributed Computing Systems
With the increasing adoption of distributed systems in both academia and industry, and with the increasing computational and storage requirements of distributed applications, users inevitably demand more from these systems. Moreover, users also depend on these systems for latency- and throughput-sensitive applications, such as interactive perception applications and MapReduce applications, which makes the performance of these systems even more important. Therefore, it is very important for users that distributed systems provide consistent performance, that is, that the system provides a similar level of performance at all times. In this thesis we address the problem of understanding and improving the performance consistency of state-of-the-art distributed computing systems. To this end, we take an empirical approach and investigate various resource management, scheduling, and statistical modeling techniques with real system experiments in diverse distributed systems, such as clusters, multicluster grids, and clouds, using various types of workloads, such as bags-of-tasks (BoTs), interactive perception applications, and scientific workloads.
In addition, as failures are known to be an important source of significant performance inconsistency, we also provide fundamental insights into the characteristics of failures in distributed systems; such insight is required to design systems that can mitigate the impact of failures on performance consistency.


8 

A framework for the study of grid interoperation mechanisms
The study of the history of computing infrastructures reveals an integration trend. For example, the explosive growth of the Internet in the 1990s was the result of an integration process started in the 1960s with the emerging networks of computers. By using the Internet, millions of users were able to access information anytime and anywhere, much like other daily utilities such as water, electricity, and telephone. However, an important category of users remained underserved: users with large computational and storage requirements, e.g., scientists, companies that focus on data analysis, and governmental departments that manage the interaction between the state and the population (such as census, tax, and public health). Thus, in the mid-1990s, the vision of the (Computing) Grid as a universal computing utility was formulated. The main benefits promised by the Grid are similar to those of other integration efforts: extended and optimized service of the integrated network, and significant reductions of maintenance and operation costs through sharing and better scheduling. While the universal Grid has yet to be developed, large-scale distributed computing infrastructures that provide their users with seamless and secure access to computing resources, individually called Grid parts or simply grids, have been built throughout the world, in different countries, for different sciences, and both for production work and for computer-science research. At the same time, the main technological alternatives to grids, that is, supercomputers and large clusters, have evolved into much larger, scalable, and reliable systems. Thus, the integration of existing grids into larger infrastructures, and finally into The Grid, is key to keeping the grid vision attractive for its potential users.
The integration of grids raises a double challenge, one part related to the efficient scaling of a distributed computing system, the other associated with operating a system across different ownership and administrative domains. Thus, many of the traditional approaches for interoperating computer systems, such as those based on completely centralized or purely decentralized system architectures, are eliminated from the start. To mark the distinction between the typical problem of integrating smaller components into a larger system and the double challenge of grid integration, we call the latter the problem of grid interoperation. In this thesis we approach the problem of grid interoperation with two main objectives: to design a comprehensive framework for the study of grid interoperation mechanisms, and to provide an initial but good solution for this problem. We design a framework for the study of grid interoperation that includes a toolbox for grid interoperation research and a method for the study of grid interoperation mechanisms. In the research toolbox we include the Grid Workloads Archive (GWA), a comprehensive model for grid resources and workloads, the GrenchMark performance evaluation framework, and the Delft Grid Simulation (DGSim) framework for repeated and realistic simulations of multicluster and multigrid environments. The GWA and our comprehensive model show that grid computing is mostly used in practice for single-processor jobs and not for parallel computing, which raises previously ignored challenges related to the volume of jobs to be managed. We also devise in this thesis a method for studying grid interoperation mechanisms. Using our framework, we answer important questions regarding existing grid interoperation mechanisms, and in particular show that these mechanisms are too limited to cope with real and realistic conditions.
We further demonstrate the usefulness of our framework by designing Delegated MatchMaking, a novel mechanism for interoperating grids. This mechanism operates an architecture that is a hybrid between hierarchical and purely decentralized architectures. The Delegated MatchMaking mechanism attempts to use the local resources of a grid as much as possible, and transparently extends the local environment with resources obtained (delegated) from other sites when resources are not available locally. Our approach is compared with five alternatives through trace-based simulations, and is found to deliver the best performance, especially when the system is heavily loaded. While many other mechanisms can be designed in the future, our experiments show that the Delegated MatchMaking approach is already a good solution to the problem of grid interoperation. Our experiments also demonstrate that having grids interoperate leads to better performance than having the same grids operate independently.
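The delegation idea (serve a request locally when possible, otherwise transparently obtain resources from other sites via the hierarchy) can be sketched with a toy two-level topology. The classes, policy, and numbers below are hypothetical simplifications for illustration, not the mechanism's actual protocol.

```python
class Site:
    """A site in a toy resource hierarchy (hypothetical simplification)."""
    def __init__(self, name, free_nodes, parent=None):
        self.name, self.free_nodes, self.parent = name, free_nodes, parent
        self.children = []
        if parent:
            parent.children.append(self)

    def claim_local(self, n):
        # Take up to n of this site's free nodes.
        taken = min(self.free_nodes, n)
        self.free_nodes -= taken
        return taken

def delegated_matchmaking(site, n):
    """Satisfy a request for n nodes: prefer local resources, then walk up
    the hierarchy and have sibling sites delegate their spare nodes."""
    alloc = []
    got = site.claim_local(n)
    if got:
        alloc.append((site.name, got))
    n -= got
    node = site
    while n > 0 and node.parent is not None:
        node = node.parent
        for sib in node.children:
            if sib is site or n == 0:
                continue
            got = sib.claim_local(n)
            if got:
                alloc.append((sib.name, got))
                n -= got
    return alloc  # may be partial if the whole system lacks capacity

root = Site("root", 0)
a = Site("A", 4, root)
b = Site("B", 8, root)
alloc = delegated_matchmaking(a, 10)
print(alloc)  # → [('A', 4), ('B', 6)]
```

From the requesting site's point of view the delegated nodes look local, which is the transparency property described above.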


9 

Efficient Iterative Solution of Large Linear Systems on Heterogeneous Computing Systems
This dissertation deals mainly with the design, implementation, and analysis of efficient iterative solution methods for large sparse linear systems on distributed and heterogeneous computing systems as found in Grid computing.
First, a case study is performed on iteratively solving large symmetric linear systems on both a multi-cluster and a local heterogeneous cluster, using standard block Jacobi preconditioning within the software constraints of standard Grid middleware and within the algorithmic constraints of preconditioned Conjugate Gradient-type methods. This shows that global synchronisation is a key bottleneck operation, and in order to alleviate this bottleneck, three main strategies are proposed: exploiting the hierarchical structure of multi-clusters, using asynchronous iterative methods as preconditioners, and minimising the number of inner products in Krylov subspace methods.
Asynchronous iterative methods have never really been successfully applied to the solution of extremely large sparse linear systems. The main reason is that slow convergence rates limit the applicability of these methods. Nevertheless, the lack of global synchronisation points in these methods is a highly favourable property in heterogeneous computing systems. Krylov subspace methods offer significantly improved convergence rates, but the global synchronisation points induced by the inner product operations in each iteration step limit their applicability. By using an asynchronous iterative method as a preconditioner in a flexible Krylov subspace method, the best of both worlds is combined. It is shown that this hybrid combination of a slow but asynchronous inner iteration and a fast but synchronous outer iteration results in high convergence rates on heterogeneous networks of computers. Since the preconditioning iteration is performed on heterogeneous computing hardware, it varies in each iteration step. Therefore, a flexible iterative method that can handle a varying preconditioner has to be employed. This partially asynchronous algorithm is implemented on two different types of Grid hardware, applied to two different applications, using two different types of Grid middleware.
The IDR(s) method and its variants are new and powerful algorithms for iteratively solving large nonsymmetric linear systems. Four techniques are used to construct an efficient IDR(s) variant for parallel computing and in particular for Grid computing. Firstly, an efficient and robust IDR(s) variant is presented that has a single global synchronisation point per matrix-vector multiplication step. Secondly, the so-called IDR test matrix in IDR(s) can be chosen freely, and this matrix is constructed such that the work, communication, and storage involving this matrix are minimised in the context of multi-clusters. Thirdly, a methodology is presented for a priori estimation of the optimal value of s in IDR(s). Finally, the proposed IDR(s) variant is combined with an asynchronous preconditioning iteration.
By using an asynchronous preconditioner in IDR(s), the IDR(s) method is treated as a flexible method, where the preconditioner changes in each iteration step. To begin analysing mathematically the effect of a varying preconditioning operator on the convergence properties of IDR(s), the IDR(s) method is interpreted as a special type of deflation method. This leads to a better understanding of the core structure of IDR(s) methods. In particular, it provides an intuitive explanation for the excellent convergence properties of IDR(s).
Two applications from computational fluid dynamics are considered: large bubbly flow problems and large (more general) convection-diffusion problems, both in 2D and 3D. However, the techniques presented can be applied to a wide range of scientific applications.
Large numerical experiments are performed on two heterogeneous computing platforms: (i) local networks of non-dedicated computers and (ii) a dedicated cluster of clusters linked by a high-speed network. The numerical experiments not only demonstrate the effectiveness of the techniques, but also illustrate the theoretical results.
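The case study's starting point, preconditioned Conjugate Gradients with block Jacobi preconditioning, in which every iteration contains inner products that become global synchronisations on a multi-cluster, can be sketched as follows; matrix, block partition, and sizes are illustrative.

```python
import numpy as np

def block_jacobi_solve(A, r, blocks):
    """Apply a block Jacobi preconditioner: solve each diagonal block
    locally. Each block stands in for one cluster's subdomain."""
    z = np.zeros_like(r)
    for lo, hi in blocks:
        z[lo:hi] = np.linalg.solve(A[lo:hi, lo:hi], r[lo:hi])
    return z

def pcg(A, b, blocks, tol=1e-10, max_iter=500):
    """Preconditioned Conjugate Gradients for SPD A. The two inner
    products per iteration are the operations that become global
    synchronisation points in a distributed setting."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = block_jacobi_solve(A, r, blocks)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)        # global reduction 1
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = block_jacobi_solve(A, r, blocks)
        rz_new = r @ z               # global reduction 2
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD model problem (1D Laplacian-like) split over two "clusters"
n = 40
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = pcg(A, b, blocks=[(0, n // 2), (n // 2, n)])
```

The block solves need no communication at all, which is why block Jacobi fits the Grid middleware constraints; the dissertation's strategies then attack the remaining reductions.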

