DE

D.H.J. Epema

info

Please Note

52 records found

Conference paper (2026) - Chi Hong, Jiyue Huang, Robert Birke, Dick Epema, Stefanie Roos, Lydia Y. Chen
While diffusion models effectively generate remarkable synthetic images, a key limitation is the inference inefficiency, requiring numerous sampling steps. To accelerate inference and maintain high-quality synthesis, teacher-student distillation is applied to compress the diffusion models in a progressive and binary manner by retraining, e.g., reducing the 1024-step model to a 128-step model in 3 folds. In this paper, we propose a single-fold distillation algorithm, SFDDM, which can flexibly compress the teacher diffusion model into a student model of any desired step, based on reparameterization of the intermediate inputs from the teacher model. To train the student diffusion, we minimize not only the output distance but also the distribution of the hidden variables between the teacher and student model. Extensive experiments on four datasets demonstrate that our student model trained by the proposed SFDDM is able to sample high-quality data with steps reduced to less than 1%, thus, trading off inference time. Our remarkable performance highlights that SFDDM effectively transfers knowledge in single-fold distillation, achieving semantic consistency and meaningful image interpolation. ...
Web3 is emerging as the new Internet-interaction model that facilitates direct collaboration between strangers without a need for prior trust between network participants and without central authorities. However, one of its shortcomings is the lack of a defense mechanism against the ability of a single user to generate a surplus of identities, known as the Sybil attack. Web3 has a Sybil attack problem because it uses peer sampling to establish connections between users. We evaluate the promising but under-explored direction of Sybil avoidance using network latency measurements, according to which two identities with equal latencies are suspected to be operated from the same node, and thus are likely Sybils. Network latency measurements have two desirable properties: they are only malleable by attackers by adding latency, and they do not require any trust between network participants. Our basic SybilSys mechanism avoids Sybil attackers using only network latency measurements if attackers do not actively exploit their malleability. We present an enhanced version of SybilSys that protects against targeted attacks using a variant of the flow correlation attack, which we name TrafficJamTrigger. We show how the message flows of Round-Trip Time measurements can be used to expose attack patterns and we propose and evaluate six classifiers to recognize these patterns. Our experiments show, through both emulation and real-world deployment, that enhanced SybilSys can serve a fundamental role for Web3, effectively establishing connections to real users even in the face of networks consisting of 99% Sybils. ...
The solar industry in residential areas has been witnessing an astonishing growth worldwide. At the heart of this transformation, affecting the edge of the electricity grid, reside smart inverters (SIs). These IoT-enabled devices aim to introduce a certain degree of intelligence to conventional inverters by integrating various grid support capabilities (e.g., voltage and frequency control). However, with the remarkable automation of these devices come enormous security risks. Thus, rising rates of vulnerabilities have increased the necessity for designing resilient, auditable, and secure SIs' firmware over the air (FOTA) amendment schemes suitable for this heterogeneous SIs-based ecosystem. In this regard, we propose leveraging blockchain as an innovative technology to guarantee these cybersecurity requirements. In this article, we present the design of a distributed FOTA scheme, namely, RASSIFAB, governing the process of amending SIs' firmware within residential areas in an immutable and scalable manner. The scheme was implemented on a blockchain test network to assess its functionalities and performance. We also carried out a security evaluation to determine whether RASSIFAB is resistant to various identified threats. The obtained results confirm that the scheme is efficient and sound. They also indicate that RASSIFAB ensures reliable and authentic firmware amendments even with malicious insiders, differentiating our framework from the existing ones. ...
The concept of the internet of energy (IoE) emerged as an innovative paradigm to encompass all the complex and intertwined notions relevant to the transition of current smart grids towards more decarbonization, digitalization and decentralization. With a focus on the two last aspects, the amount of intelligent devices being connected in a scattered way to the existing power grid is ever-growing. Nevertheless, guaranteeing a cyber-secure and resilient control of these IoE components as well as a seamless and reliable delivery of electricity services, such as renewable energy exchange, electric vehicles charging, demand response, and so forth; might be the bottleneck of current power systems that are largely still functioning following a centralized approach. Thus, the future power grid would gradually incorporate a growing number of distributed-based control schemes to deal with this challenge. And many believe that blockchain could be a key-enabler in this transition, due to its consistent characteristics with multiple requirements of future power systems. In this paper, we provide an extensive state-of-the-art of blockchain-based additions to the IoE. Where, we first introduce various concepts related to blockchain and discuss the rationale behind its adoption in the context of IoE. Then, differently from the existing body of literature surveys, we do not only provide a taxonomy and evaluate a wide range of recent research outputs that integrated blockchain within modern power systems. But we also draw some valuable lessons learned for each studied category and discuss the intersection of blockchain with various emerging paradigms that have the potential of radically impacting the smart grid. In addition, we present some real-world industrial initiatives and ongoing projects built on top of blockchain, dedicated for offering diverse electricity services with a case study of a pilot project on energy trading in Amsterdam. Finally, we discuss the remaining challenges and worthwhile opportunities of deploying blockchain in this particular area, with a focus on the aspect of operational cyber-security. ...
Journal article (2021) - Long Cheng, Ying Wang, Qingzhi Liu, Dick H.J. Epema, Cheng Liu, Ying Mao, John Murphy
Large data centers are currently the mainstream infrastructures for big data processing. As one of the most fundamental tasks in these environments, the efficient execution of distributed data operators (e.g., join and aggregation) are still challenging current data systems, and one of the key performance issues is network communication time. State-of-the-art methods trying to improve that problem focus on either application-layer data locality optimization to reduce network traffic or on network-layer data flow optimization to increase bandwidth utilization. However, the techniques in the two layers are totally independent from each other, and performance gains from a joint optimization perspective have not yet been explored. In this article, we propose a novel approach called NEAL (NEtwork-Aware Locality scheduling) to bridge this gap, and consequently to further reduce communication time for distributed big data operators. We present the detailed design and implementation of NEAL, and our experimental results demonstrate that NEAL always performs better than current approaches for different workloads and network bandwidth configurations. ...
Journal article (2021) - Ayman Esmat, Martijn de Vos, Yashar Ghiassi-Farrokhfal, Peter Palensky, Dick Epema
Peer-to-Peer (P2P) energy trading, which allows energy consumers/producers to directly trade with each other, is one of the new paradigms driven by the decarbonization, decentralization, and digitalization of the energy supply chain. Additionally, the rise of blockchain technology suggests unprecedented socio-economic benefits for energy systems, especially when coupled with P2P energy trading. Despite such future prospects in energy systems, three key challenges might hinder the full integration of P2P energy trading and blockchain. First, it is quite complicated to design a decentralized P2P market that keeps a fair balance between economic efficiency and information privacy. Secondly, with the proliferation of storage devices, new P2P market designs are needed to account for their inter-temporal dependencies. Thirdly, a practical implementation of blockchain technology for P2P trading is required, which can facilitate efficient trading in a secured and fraud-resilient way, while eliminating any intermediaries’ costs. In this paper, we develop a new decentralized P2P energy trading platform to address all the aforementioned challenges. Our platform consists of two key layers: market and blockchain. The market layer features a parallel and short-term pool-structured auction and is cleared using a novel decentralized Ant-Colony Optimization method. This market arrangement guarantees a near-optimally efficient market solution, preserves players’ privacy, and allows inter-temporal market products trading. The blockchain layer offers a high level of automation, security, and fast real-time settlements through smart contract implementation. Finally, using real-world data, we simulate the functionality of the platform regarding energy trading, market clearing, smart contract operations, and blockchain-based settlements. ...
Conference paper (2021) - Satwik Prabhu Kumble, Dick Epema, Stefanie Roos
Lightning, the prevailing solution to Bitcoin's scalability issue, uses onion routing to hide senders and recipients of payments. Yet, the path between the sender and the recipient along which payments are routed is selected such that it is short, cost efficient, and fast. The low degree of randomness in the path selection entails that anonymity sets are small. However, quantifying the anonymity provided by Lightning is challenging due to the existence of multiple implementations that differ with regard to the path selection algorithm and exist in parallel within the network. In this paper, we propose a general method allowing a local internal attacker to determine sender and recipient anonymity sets. Based on an in-depth code review of three Lightning implementations, we analyze how an adversary can predict the sender and the recipient of a multi-hop transaction. Our simulations indicate that only one adversarial node on a payment path uniquely identifies at least one of sender and recipient for around 70% of the transactions observed by the adversary. Moreover, multiple colluding attackers can almost always identify sender and receiver uniquely. ...
Existing digital identity management systems fail to deliver the desirable properties of control by the users of their own identity data, credibility of disclosed identity data, and network-level anonymity. The recently proposed Self-Sovereign Identity (SSI) approach promises to give users these properties. However, we argue that without addressing privacy at the network level, SSI systems cannot deliver on this promise. In this paper we present the design and analysis of our solution TCID, created in collaboration with the Dutch government. TCID is a system consisting of a set of components that together satisfy seven functional requirements to guarantee the desirable system properties. We show that the latency incurred by network-level anonymization in TCID is significantly larger than that of identity data disclosure protocols but is still low enough for practical situations. We conclude that current research on SSI is too narrowly focused on these data disclosure protocols. ...
The demand for additional performance due to the rapid increase in the size and importance of data-intensive applications has considerably elevated the complexity of computer architecture. In response, systems offer pre-determined behaviors based on heuristics and then expose a large number of configuration parameters for operators to adjust them to their particular infrastructure. Unfortunately, in practice this leads to a substantial manual tuning effort. In this work, we focus on one of the most impactful tuning decisions in big data systems: the number of executor threads. We first show the impact of I/O contention on the runtime of workloads and a simple static solution to reduce the number of threads for I/O-bound phases. We then present a more elaborate solution in the form of self-adaptive executors which are able to continuously monitor the underlying system resources and detect contentions. This enables the executors to tune their thread pool size dynamically at runtime in order to achieve the best performance. Our experimental results show that being adaptive can significantly reduce the execution time especially in I/O intensive applications such as Terasort and PageRank which see a 34% and 54% reduction in runtime. ...
Journal article (2019) - Amin Rezaeian, Mahmoud Naghibzadeh, Dick H.J. Epema
Cloud schedulers that allocate resources exclusively to single workflows are not work-conserving as they may be forced to leave gaps in their schedules because of the precedence constraints in the workflows. Thus, they may lead to a waste of financial resources. This problem can be mitigated by multiple-workflow schedulers that share the leased cloud resources among multiple workflows or users by filling the gaps left by one workflow with the tasks of other workflows. This solution may even work when users have different performance objectives for their workflows, such as budgets and deadlines. As an additional requirement, we want the scheduler to be fair to all workflows regardless of their performance objectives. In this paper, we propose a multiple-workflow scheduler that is able to target different quality of service goals for different workflows and that considers fairness among different users. To this aim, we propose an unfairness metric and four workflow selection policies. We prove that the resource selection that decides based on a task’s sub-budget, sub-deadline, finish time, and cost on different resources is selecting the best resource based on the given information, while using the smallest number of calculations. Simulations show that there is a trade-off between overall cost, makespan, and fairness. We conclude that the best workflow selection policy to reduce unfairness is the direct policy, which explicitly selects the workflow that minimizes the value of the proposed unfairness metric in each round. ...
Journal article (2018) - Alexey Ilyushkin, Ahmed Ali-Eldin, Nikolas Herbst, André Bauer, Alessandro Papadopoulos, Dick Epema, Alexandru Iosup
Elasticity is one of the main features of cloud computing allowing customers to scale their resources based on the workload. Many autoscalers have been proposed in the past decade to decide on behalf of cloud customers when and how to provision resources to a cloud application based on the workload utilizing cloud elasticity features. However, in prior work, when a new policy is proposed, it is seldom compared to the state-of-the-art, and is often compared only to static provisioning using a predefined quality of service target. This reduces the ability of cloud customers and of cloud operators to choose and deploy an autoscaling policy, as there is seldom enough analysis on the performance of the autoscalers in different operating conditions and with different applications. In our work, we conduct an experimental performance evaluation of autoscaling policies, using as application model workflows, a popular formalism for automating resource management for applications with well-defined yet complex structures. We present a detailed comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the seven policies, we conduct various experiments and compare their performance in both pairwise and group comparisons. We report both individual and aggregated metrics. As many workflows have deadline requirements on the tasks, we study the effect of autoscaling on workflow deadlines. Additionally, we look into the effect of autoscaling on the accounted and hourly based charged costs, and we evaluate performance variability caused by the autoscaler selection for each group of workflow sizes. Our results highlight the trade-offs between the suggested policies, how they can impact meeting the deadlines, and how they perform in different operating conditions, thus enabling a better understanding of the current state-of-the-art. ...
Conference paper (2018) - Aleksandra Kuzmanovska, Hans van den Bogert, Rudolf Mak, Dick Epema
When multiple data-processing frameworks with time-varying workloads are simultaneously present in a single cluster or data-center, an apparent goal is to have them experience equal performance, expressed in whatever performance metrics are applicable. In modern data-center environments, Two-Level Schedulers (TLSs) that leave the scheduling of individual jobs to the schedulers within the data-processing frameworks are typically used for managing the resources of data-processing frameworks. Two such TLSs with opposite designs are Mesos and Koala-F. Mesos employs fine-grained resource allocation and aims at Dominant Resource Fairness (DRF) among framework instances by offering resources to them for the duration of a single task. In contrast, Koala-F aims at performance fairness among framework instances by employing dynamic coarse-grained resource allocation of sets of complete nodes based on performance feedback from individual instances. The goal of this paper is to explore the trade-offs between these two TLS designs when trying to achieve performance balance among frameworks. We select Apache Spark as a representative of data-processing frameworks, and perform experiments on a modest-sized cluster, using jobs chosen from commonly used data-processing benchmarks. Our results reveal that achieving performance balance among framework instances is a challenge for both TLS designs, despite their opposite design choices. Moreover, we exhibit design flaws in the DRF allocation policy that prevent Mesos from achieving performance balance. Finally, to remedy these flaws, we propose a feedback controller for Mesos that dynamically adapts framework weights, as used in Weighted DRF (W-DRF), based on their performance. ...
Conference paper (2018) - Alexey Ilyushkin, Dick Epema
Workflow schedulers often rely on task runtime estimates when making scheduling decisions, and they usually target the scheduling of a single workflow or batches of workflows. In contrast, in this paper, we evaluate the impact of the absence or limited accuracy of task runtime estimates on slowdown when scheduling complete workloads of workflows that arrive over time. We study a total of seven scheduling policies: four of these are popular existing policies for (batches of) workloads from the literature, including a simple backfilling policy which is not aware of task runtime estimates, two are novel workloadoriented policies, including one which targets fairness, and one is the well-known HEFT policy for a single workflow adapted to the online workload scenario. We simulate homogeneous and heterogeneous distributed systems to evaluate the performance of these policies under varying accuracy of task runtime estimates. Our results show that for high utilizations, the order in which workflows are processed is more important than the knowledge of correct task runtime estimates. Under low utilizations, all policies considered show good results, even a policy which does not use task runtime estimates. We also show that our Fair Workflow Prioritization (FWP) policy effectively decreases the variance of workflow slowdown and thus achieves fairness, and that the plan-based scheduling policy derived from HEFT does not show much performance improvement while bringing extra complexity to the scheduling process. ...

Grappling with Failures of In-Memory Data Analytics Frameworks

Conference paper (2017) - Bogdan Ghit, Dick Epema
Providing fault-tolerance is of major importance for data analytics frameworks such as Hadoop and Spark, which are typically deployed in large clusters that are known to experience high failures rates. Unexpected events such as compute node failures are in particular an important challenge for in-memory data analytics frameworks, as the widely adopted approach to deal with them is to recompute work already done. Recomputing lost work, however, requires allocation of extra resource to re-execute tasks, thus increasing the job runtimes. To address this problem, we design a checkpointing system called panda that is tailored to the intrinsic characteristics of data analytics frameworks. In particular, panda employs fine-grained checkpointing at the level of task outputs and dynamically identifies tasks that are worthwhile to be checkpointed
rather than be recomputed. As has been abundantly shown, tasks of data analytics jobs may have very variable runtimes and output sizes. These properties form the basis of three checkpointing policies which we incorporate into panda. We first empirically evaluate panda on a multicluster system with single data analytics applications under space-correlated failures, and find that panda is close to the performance of a fail-free execution in unmodified Spark for a large range of concurrent failures. Then we perform simulations of complete workloads, mimicking the size and operation of a Google cluster, and show that panda provides significant improvements in the average job runtime for wide ranges of the failure rate and system load. ...
Conference paper (2017) - Long Cheng, Ying Wang, Yulong Pei, Dick Epema
Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network
communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain.
However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored.
In this paper, based on current research in coflow scheduling,
we propose a novel Coflow-based Co-optimization Framework
(CCF), which can co-optimize application-level data movement
and network-level data communications for distributed operators,
and consequently contribute to their performance in
large distributed environments. We present the detailed design
and implementation of CCF, and conduct an experimental
evaluation of CCF using large-scale simulations on large data
joins. Our results demonstrate that CCF can always perform
faster than current approaches on network communications in
large-scale distributed scenarios. ...
Journal article (2017) - Yong Guo, Sungpack Hong, Hassan Chafi, Alexandru Iosup, Dick Epema
In recent years, many distributed graph-processing systems have been designed and developed to analyze large-scale graphs. For all distributed graph-processing systems, partitioning graphs is a key part of processing and an important aspect to achieve good processing performance. To keep low the overhead of partitioning graphs, even when processing the ever-increasing modern graphs ,many previous studies use lightweight streaming graph-partitioning policies. Although many such policies exist, currently there is no comprehensive study of their impact on load balancing and communication overheads, and on the overall performance of graph-processing systems. This relative lack of understanding hampers the development and tuning of new streaming policies, and could limit the entire research community to the existing classes of policies. We address these issues in this work. We begin by modeling the execution time of distributed graph-processing systems. By analyzing this model under the load of realistic graph-data characteristics, we propose a method to identify important performance issues and then design new streaming graph-partitioning policies to address them. By using three typical large-scale graphs and three popular graph-processing algorithms, we conduct comprehensive experiments to study the performance of our and of many alternative streaming policies on a real distributed graph-processing system. We also explore the impact on performance of using different real-world networks and of other real-world technical details. We further discuss how to use our results, the coverage of our model and method, and
the design of future partitioning policies. ...
Conference paper (2017) - Alexey Ilyushkin, Ahmed Ali-Eldin, Nikolas Herbst, Alessandro Papadopoulos, Bogdan Ghit, Dick Epema, Alexandru Iosup
Simplifying the task of resource management and scheduling for customers, while still delivering complex Quality-of-Service (QoS), is key to cloud computing. Many autoscaling policies have been proposed in the past decade to decide on behalf of cloud customers when and how to provision resources to a cloud application utilizing cloud elasticity features. However, in prior work, when a new policy is proposed, it is seldom compared to the state-of-the-art, and is
often compared only to static provisioning using a predefined QoS target. This reduces the ability of cloud customers and of cloud operators to choose and deploy an autoscaling policy. In our work, we conduct an experimental performance evaluation of autoscaling policies, using as application model workflows, a commonly used formalism for automating resource management for applications with well-defined yet complex structure. We present a detailed
comparative study of general state-of-the-art autoscaling policies, along with two new workflow-specific policies. To understand the performance differences between the 7 policies, we conduct various forms of pairwise and group comparisons. We report both individual and aggregated metrics. Our results highlight the trade-offs between the suggested policies, and thus enable a better understanding of the current state-of-the-art. ...

The Creators and Spectators of Online Game Replays and Live Streaming

Journal article (2016) - Adele Lu Jia, Siqi Shen, Dick H J Epema, A. Iosup
Online gaming franchises such as World of Tanks, Defense of the Ancients, and StarCraft have attracted hundreds of millions of users who, apart from playing the game, also socialize with each other through gaming and viewing gamecasts. As a form of User Generated Content (UGC), gamecasts play an important role in user entertainment and gamer education. They deserve the attention of both industrial partners and the academic communities, corresponding to the large amount of revenue involved and the interesting research problems associated with UGC sites and social networks. Although previous work has put much effort into analyzing general UGC sites such as YouTube, relatively little is known about the gamecast sharing sites. In this work, we provide the first comprehensive study of gamecast sharing sites, including commercial streaming-based sites such as Amazon's Twitch.tv and community-maintained replay-based sites such as WoTreplays. We collect and share a novel dataset on WoTreplays that includes more than 380,000 game replays, shared by more than 60,000 creators with more than 1.9 million gamers. Together with an earlier published dataset on Twitch.tv, we investigate basic characteristics of gamecast sharing sites, and we analyze the activities of their creators and spectators. Among our results, we find that (i) WoTreplays and Twitch.tv are both fast-consumed repositories, with millions of gamecasts being uploaded, viewed, and soon forgotten; (ii) both the gamecasts and the creators exhibit highly skewed popularity, with a significant heavy tail phenomenon; and (iii) the upload and download preferences of creators and spectators are different: while the creators emphasize their individual skills, the spectators appreciate team-wise tactics. Our findings provide important knowledge for infrastructure and service improvement, for example, in the design of proper resource allocation mechanisms that consider future gamecasting and in the tuning of incentive policies that further help player retention. ...
Conference paper (2016) - Yong Guo, Ana Lucia Varbanescu, Dick Epema, Alex Iosup
Graph processing is increasingly used in a variety of domains, from engineering to logistics and from scientific computing to online gaming. To process graphs efficiently, GPU-enabled graph-processing systems such as TOTEM and Medusa exploit the GPU or the combined CPU+GPU capabilities of a single machine. Unlike scalable distributed CPU-based systems such as Pregel and GraphX, existing GPU-enabled systems are restricted to the resources of a single machine, including the limited amount of GPU memory, and thus cannot analyze the increasingly large-scale graphs we see in practice. To address this problem, we design and implement three families of distributed heterogeneous graph-processing systems that can use both the CPUs and GPUs of multiple machines. We further focus on graph partitioning, for which we compare existing graph-partitioning policies and a new policy specifically targeted at heterogeneity. We implement all our distributed heterogeneous systems based on the programming model of the single-machine TOTEM, to which we add (1) a new communication layer for CPUs and GPUs across multiple machines to support distributed graphs, and (2) a workload partitioning method that uses offline profiling to distribute the work on the CPUs and the GPUs. We conduct a comprehensive real-world performance evaluation for all three families. To ensure representative results, we select 3 typical algorithms and 5 datasets with different characteristics. Our results include algorithm run time, performance breakdown, scalability, graph partitioning time, and comparison with other graph-processing systems. They demonstrate the feasibility of distributed heterogeneous graph processing and show evidence of the high performance that can be achieved by combining CPUs and GPUs in a distributed environment. ...

Size-Based Resource Allocation in MapReduce Frameworks

Conference paper (2016) - Bogdan Ghit, Dick Epema
Many large-scale data analytics infrastructures are employed for a wide variety of jobs, ranging from short interactive queries to large data analysis jobs that may take hours or even days to complete. As a consequence, data-processing frameworks like MapReduce may have workloads consisting of jobs with heavy-tailed processing requirements. With such workloads, short jobs may experience slowdowns that are an order of magnitude larger than large jobs do, while the users may expect slowdowns that are more in proportion with the job sizes. To address this problem of large job slowdown variability in MapReduce frameworks, we design a scheduling system called TYREX that is inspired by the well-known TAGS task assignment policy in distributed-server systems. In particular, TYREX partitions the resources of a MapReduce framework, allowing any job running in any partition to read data stored on any machine, imposes runtime limits in the partitions, and successively executes parts of jobs in a work-conserving way in these partitions until they can run to completion. We develop a statistical model for dynamically setting the runtime limits that achieves near optimal job slowdown performance, and we empirically evaluate TYREX on a cluster system with workloads consisting of both synthetic and real-world benchmarks. We find that TYREX cuts in half the job slowdown variability while preserving the median job slowdown when compared to state-of-the-art MapReduce schedulers such as FIFO and FAIR. Furthermore, TYREX reduces the job slowdown at the 95th percentile by more than 50% when compared to FIFO and by 20-40% when compared to FAIR. ...