J. Oostenbrink | TU Delft Repository

A Data-Driven Approach to Disaster Resilience in Communication Networks

Doctoral thesis (2023) - J. Oostenbrink

Communication networks are critical in business, government, and even our day-to-day life. A prolonged communication outage can have devastating effects, particularly during and after a disaster. Unfortunately, our communication infrastructure is still vulnerable to natural disasters and other events that damage multiple network components within a confined area. In this thesis, we study the disaster resilience of communication networks. We propose scalable, data-driven methods to help stakeholders both assess and improve the resilience of networks to disasters. We first study the global risk of earthquakes to Internet Exchange Points (IXPs). We find that many facilities are at risk of earthquakes and that, when an earthquake occurs, it is not unlikely that multiple facilities will fail simultaneously. Fortunately, our analysis also shows that larger IXPs tend to be located in less earthquake-prone areas, and that peering at multiple facilities significantly reduces the impact of earthquakes to IXPs and autonomous systems. To help network operators in reducing the impact of earthquakes on their autonomous systems, we propose a novel metric for selecting peering facilities, based on the probability of simultaneous facility failures. We show that applying our metric can significantly increase the resilience of individual autonomous systems, as well as that of the Internet as a whole. To effectively improve the resilience of communication networks to natural disasters, stakeholders need to make well-informed trade-offs between costs, network performance, and network resilience. To help stakeholders make these decisions, we propose a single-disaster and a successive-disaster framework for assessing the resilience of a network to natural disasters. These frameworks can help stakeholders anticipate potential disasters, and compare the effects of any trade-off on the resilience of their networks. 
The main principle behind both frameworks is to assess the disaster resilience of a network based on a large set of representative disaster scenarios (called the disaster set). This approach is flexible with respect to the underlying disaster dataset, and can be applied to datasets of widely varying sizes and properties. Our single-disaster framework allows one to efficiently compute the distribution of a network performance metric, assuming that a single, random disaster strikes the network and damages one or more network components in a confined area. Our method speeds up computation by first computing the distribution of the state of the network after a random disaster (the number of possible states tends to be much smaller than the disaster set itself), and only then computing the performance of the network in each of these states. In addition to studying the impact of a single disaster on a network, we also address the issue of successive disasters. We first define the concept of successive disasters: a subsequent disaster that strikes the network while the damage due to a previous disaster is still being repaired. We then propose a framework capable of modeling a sequence of disasters in time, while taking into account recovery operations. We develop both an exact and a Monte Carlo method to compute the vulnerability of a network to successive disasters and find that the probability of a second disaster striking the network during recovery can be significant even for short repair times. Our successive disaster framework can not only be applied to subsequent disasters, but also to potential follow-up attacks. Experiments on two network topologies show that even small targeted attacks can greatly aggravate the network disruption caused by a natural disaster. Fortunately, we find that this effect can be mitigated - at almost no cost to network performance - by adopting a calculated repair strategy that takes into account the possibility of follow-up attacks. 
In addition to providing methods for assessing the resilience of networks, we also provide algorithms for improving the resilience of networks to natural disasters. These algorithms can help stakeholders (1) recover network functionality more effectively in the initial period after a disaster, and (2) reduce the initial impact of a disaster on network performance. After a disaster, a network operator can quickly restore some functionality by replacing nodes with temporary emergency nodes. These emergency nodes should be deployed as soon as possible. However, selecting an optimal set of replacement nodes is computationally intensive, and the complete state of the network might still be unknown after the disaster. Thus, we propose selecting a disaster strategy a priori - before the occurrence of the disaster. We give an algorithm for evaluating such strategies, by extending our single-disaster assessment framework. An effective, but costly, method of improving the disaster resilience of a network is to add new, geographically redundant, cable connections. These redundant connections ensure that more areas remain connected after a disaster strikes the network, and thus reduce the initial impact of the disaster on the network. We provide algorithms for finding cable routes that minimize a function of disaster impact and cable cost under any disaster set. Since this problem is NP-hard, we give an exact algorithm, as well as a heuristic, for solving it.
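The speed-up described for the single-disaster framework can be illustrated with a minimal sketch: disasters that damage the same components are first merged into one network state, and the performance metric is then evaluated once per state rather than once per disaster. The function names, the toy disaster set, and the "surviving links" metric below are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
from collections import defaultdict


def state_distribution(disasters):
    """Merge disasters that leave the network in the same state.

    `disasters` is an iterable of (probability, failed_components) pairs.
    Returns a dict mapping frozenset(failed_components) -> total probability.
    """
    states = defaultdict(float)
    for prob, failed in disasters:
        states[frozenset(failed)] += prob
    return dict(states)


def metric_distribution(disasters, metric):
    """Distribution of `metric` after one random disaster.

    The (possibly expensive) metric is evaluated once per distinct state.
    """
    dist = defaultdict(float)
    for state, prob in state_distribution(disasters).items():
        dist[metric(state)] += prob
    return dict(dist)


# Hypothetical toy network and disaster set (illustration only).
LINKS = {"a-b", "b-c", "c-d", "a-d"}
DISASTERS = [
    (0.25, {"a-b"}),
    (0.25, {"a-b"}),          # different disaster, same resulting state
    (0.25, {"b-c", "c-d"}),
    (0.25, set()),            # disaster that misses the network
]


def surviving_links(failed):
    """Toy performance metric: number of links still working."""
    return len(LINKS - failed)


result = metric_distribution(DISASTERS, surviving_links)
```

In this toy example four disasters collapse into three states, so the metric is computed three times; with a realistic disaster set the reduction would typically be much larger, which is the effect the abstract points to.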

A Global Study of the Risk of Earthquakes to IXPs

Conference paper (2022) - Jorik Oostenbrink, Fernando Kuipers

In this paper, we study the risk of earthquakes to global Internet infrastructure, namely Internet eXchange Point (IXP) facilities. Leveraging the CAIDA IXPs dataset and publicly available earthquake models and hazard computation tools, we find that more than 50% of the facilities have at least a 2% probability of experiencing potentially damaging levels of shaking, due to earthquakes, within a period of 50 years. Furthermore, we estimate that there is a 10% probability that at least 20 facilities will simultaneously experience potentially damaging levels of shaking within a period of 50 years. Fortunately, our analysis shows that IXPs that host many Autonomous Systems (ASes) tend to be located in less earthquake-prone areas, and that spreading out over multiple facilities significantly reduces the impact of earthquakes to IXPs. Following this observation, we propose a novel metric to help AS operators select peering facilities based on the probability of simultaneous facility failures. We show that applying our metric can significantly increase the resilience of individual ASes, as well as that of the Internet as a whole. ...
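To put the "2% probability within 50 years" figure in perspective, it can be converted to an annual rate and a return period. The sketch below assumes a time-independent (Poisson) occurrence model; that assumption is common in seismic hazard work but is not stated in the abstract, and the function names are hypothetical.

```python
import math


def annual_rate(prob, years):
    """Annual exceedance rate implied by `prob` over `years` (Poisson assumption)."""
    return -math.log1p(-prob) / years


def prob_within(rate, years):
    """Probability of at least one exceedance within `years` at the given rate."""
    return -math.expm1(-rate * years)


rate = annual_rate(0.02, 50)   # about 4.04e-4 per year
return_period = 1 / rate       # about 2475 years
```

Under this assumption, a 2% probability in 50 years corresponds to shaking that is exceeded on average roughly once every 2475 years at a given site.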

Probabilistic Shared Risk Link Groups Modeling Correlated Resource Failures Caused by Disasters

Journal article (2021) - Balazs Vass, János Tapolcai, Zalan Heszberger, Jozsef Biro, David Hay, Fernando A. Kuipers, Jorik Oostenbrink, Alessandro Valentini, Lajos Ronyai

To evaluate the expected availability of a backbone network service, the administrator should consider all possible failure scenarios under the specific service availability model stipulated in the corresponding service-level agreement. Given the increase in natural disasters and malicious attacks with geographically extensive impact, considering only independent single component failures is often insufficient. This paper builds a stochastic model of geographically correlated link failures caused by disasters to estimate the hazards an optical backbone network may be prone to and to understand the complex correlation between possible link failures. We first consider link failures only and later extend our model also to capture node failures. With such a model, one can quickly extract essential information such as the probability that an arbitrary set of network resources fails simultaneously, the probability that two nodes become disconnected, and the probability that a path survives a disaster. Furthermore, we introduce standard data structures and a unified terminology on Probabilistic Shared Risk Link Groups (PSRLGs), along with a pre-computation process, which represents the failure probability of a set of resources succinctly. In particular, we generate a quasilinear-sized data structure in polynomial time, which allows the efficient computation of the cumulative failure probability of any set of network elements. Our evaluation is based on carefully pre-processed seismic hazard data matched to real-world optical backbone network topologies. ...
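The kind of query mentioned in this abstract, the probability that a given set of resources fails simultaneously, can be sketched with a brute-force scan over a scenario list. This is not the quasilinear-sized data structure from the paper; it is a naive reference under the assumption that the scenarios are mutually exclusive (at most one disaster occurs). All names and numbers are hypothetical.

```python
def joint_failure_probability(scenarios, resources):
    """Probability that every element of `resources` fails together.

    `scenarios` is a list of (probability, failed_resources) pairs that are
    assumed to be mutually exclusive. A scenario contributes when it takes
    down at least the queried resources.
    """
    target = frozenset(resources)
    return sum(prob for prob, failed in scenarios if target <= frozenset(failed))


# Hypothetical scenario list (illustration only).
SCENARIOS = [
    (0.125, {"e1", "e2"}),
    (0.25, {"e2"}),
    (0.0625, {"e1", "e2", "e3"}),
]

p_e1_e2 = joint_failure_probability(SCENARIOS, {"e1", "e2"})
p_e2 = joint_failure_probability(SCENARIOS, {"e2"})
```

Each query here costs time linear in the number of scenarios, which is the cost a precomputed structure like the one described in the abstract is meant to avoid.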

Going the Extra Mile with Disaster-Aware Network Augmentation

Conference paper (2021) - J. Oostenbrink, F.A. Kuipers

Network outages have significant economic and societal costs. While network operators have become adept at managing smaller failures, this is not the case for larger, regional failures such as natural disasters. Although it is not possible, and certainly not economic, to prevent all potential disaster damage and impact, we can reduce their impact by adding cost-efficient, geographically redundant, cable connections to the network.
In this paper, we provide algorithms for finding cost-efficient, disaster-aware cable routes based on empirical hazard data. In contrast to previous work, our approach finds disaster-aware routes by considering the impact of a large set of input disasters on the network as a whole, as well as on the individual cable. For this, we propose the Disaster-Aware Network Augmentation Problem of finding a new cable connection that minimizes a function of disaster impact and cable cost. We prove that this problem is NP-hard and give an exact algorithm, as well as a heuristic, for solving it. Our algorithms are applicable to both planar and geographical coordinates. Using actual seismic hazard data, we demonstrate that by applying our algorithms, network operators can cost-efficiently raise the resilience of their network and future cable connections. ...

Sequential Zeroing: Online Heavy-Hitter Detection on Programmable Hardware

Conference paper (2020) - Belma Turkovic, Jorik Oostenbrink, Fernando Kuipers, Isaac Keslassy, Ariel Orda

Flows that have exceeded a given percentage of the last sliding window of N packets, denoted as heavy-hitter flows, require special handling, since they may disrupt the service of other flows or may be indicative of malicious traffic. However, even when equipped with a programmable switch, it is unclear how to detect heavy hitters on a per-packet basis, while obeying the stringent switch memory access rates. For instance, existing solutions, such as HashPipe, cannot detect heavy hitters without halving the line rate and do not support sliding windows. To the best of our knowledge, this paper is the first to present heavy-hitter detection solutions that provide per-packet granularity at line-rate performance. We realize this by introducing (1) Modulo sketching, a novel counting algorithm that reuses counters and limits the impact of smaller flows beyond early processing stages; and (2) Sequential Zeroing, a new approach to extending interval-based schemes to sliding window measurements. Our solutions are extensively evaluated, both via simulations and experiments on a Netronome SmartNIC, and demonstrate significant performance gains over the state-of-the-art. ...
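The heavy-hitter definition used in this abstract (a flow that exceeds a given fraction of the last N packets) can be made concrete with an exact software baseline. The sketch below is not Modulo sketching or Sequential Zeroing, whose details the abstract does not give; it simply stores the whole window, which is likely impractical under the switch memory constraints the abstract mentions. The class name and parameters are hypothetical.

```python
from collections import Counter, deque


class SlidingWindowHeavyHitters:
    """Exact per-packet heavy-hitter check over the last `window_size` packets."""

    def __init__(self, window_size, threshold):
        self.window_size = window_size
        self.threshold = threshold      # fraction of the window, e.g. 0.5
        self.window = deque()           # flow IDs of the packets in the window
        self.counts = Counter()         # per-flow packet count inside the window

    def process(self, flow_id):
        """Record one packet; return True if its flow is now a heavy hitter."""
        self.window.append(flow_id)
        self.counts[flow_id] += 1
        if len(self.window) > self.window_size:
            # Slide the window: forget the oldest packet.
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return self.counts[flow_id] > self.threshold * self.window_size


detector = SlidingWindowHeavyHitters(window_size=4, threshold=0.5)
flags = [detector.process(f) for f in ["a", "b", "a", "a", "c", "c", "c"]]
```

With a window of four packets and a 50% threshold, a flow is flagged once it accounts for three of the last four packets, so flow "a" is flagged on the fourth packet and flow "c" on the seventh.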

A Moment of Weakness: Protecting Against Targeted Attacks Following a Natural Disaster

Journal article (2020) - Jorik Oostenbrink, Fernando Kuipers

By targeting communication and power networks, malicious actors can significantly disrupt our society. As networks are more vulnerable after a natural disaster, this moment of weakness may be exploited to disrupt the network even further. However, the potential impact and mitigation of such a follow-up attack has yet to be studied.

In this paper, we propose a framework to analyze the impact of a combination of a natural disaster followed by a targeted single node failure. We apply this framework on empirical disaster data and two network topologies. Our experiments show that even small targeted attacks can significantly augment the already grave network disruption caused by a natural disaster. We further show that this effect can be mitigated by adopting a calculated repair strategy. ...

How to Model and Enumerate Geographically Correlated Failure Events in Communication Networks

Book chapter (2020) - Balazs Vass, János Tapolcai, David Hay, Jorik Oostenbrink, Fernando Kuipers

Several works shed light on the vulnerability of networks against regional failures, which are failures of multiple pieces of equipment in a geographical region as a result of a natural or human-made disaster. This chapter overviews how this information can be added to existing network protocols through defining Shared Risk Link Groups (SRLGs) and Probabilistic SRLGs (PSRLGs). The output of this chapter can be the inputs of later chapters to design and operate the networks to enhance the preparedness against disasters and regional failures in general. In particular, we are focusing on the state-of-the-art algorithmic approaches for generating lists of (P)SRLGs of the communication networks protecting different sets of disasters. ...

Network Resiliency Against Earthquakes

Conference paper (2019) - Alessandro Valentini, Balazs Vass, Jorik Oostenbrink, Levente Csak, Fernando Kuipers, Bruno Pace, David Hay, Janos Tapolcai

Guaranteeing a high availability of network services is a crucial part of network management. In this study, we show how to compute the availability of network services under earthquakes, by using empirical data. We take a multi-disciplinary approach and create an earthquake model based on seismological research and historical data. We then show how to integrate this empirical disaster model into existing network resiliency models to obtain the vulnerability and availability of a network under earthquakes. While previous studies have applied their models to ground shaking hazard models or earthquake scenarios, we compute (1) earthquake activity rates and (2) a relation between magnitude and disaster area, and use both as input data for our modeling. This approach is more in line with existing network resiliency models: it provides better information on the correlation between link failures than ground shaking hazard models and a more comprehensive view than a fixed set of scenarios. ...

Evaluating local disaster recovery strategies

Journal article (2019) - Jorik Oostenbrink, Fernando A. Kuipers, Bjarne E. Helvik, Poul E. Heegaard

It is of vital importance to maintain at least some network functionality after a disaster, for example by temporarily replacing damaged nodes by emergency nodes. We propose a framework to evaluate different node replacement strategies, based on a large set of representative disasters. We prove that computing the optimal choice of nodes to replace is an NP-hard problem and propose several simple strategies. We evaluate these strategies on two U.S. topologies and show that a simple greedy strategy can perform close to optimal. ...
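One plausible reading of a "simple greedy strategy" is sketched below: for each disaster, repeatedly replace the failed node that most increases a performance metric, then average the outcome over the disaster set. The metric (size of the largest connected component), the toy topology, and all function names are assumptions made for illustration; the abstract does not specify the paper's actual strategies or metric.

```python
def largest_component(adj, working):
    """Size of the largest connected component among the `working` nodes."""
    seen, best = set(), 0
    for start in working:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb in working and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best


def greedy_replacement(adj, failed, budget):
    """Greedily pick up to `budget` failed nodes to replace.

    Returns (chosen nodes in order, metric after replacement).
    Ties are broken in favour of the smallest node label.
    """
    working = set(adj) - set(failed)
    remaining = set(failed)
    chosen = []
    for _ in range(min(budget, len(remaining))):
        best = max(sorted(remaining),
                   key=lambda n: largest_component(adj, working | {n}))
        chosen.append(best)
        working.add(best)
        remaining.remove(best)
    return chosen, largest_component(adj, working)


def expected_performance(adj, disasters, budget):
    """Average the greedy outcome over (probability, failed_nodes) pairs."""
    return sum(p * greedy_replacement(adj, failed, budget)[1]
               for p, failed in disasters)


# Hypothetical six-node path topology 1-2-3-4-5-6 (illustration only).
ADJ = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
DISASTERS = [(0.5, {2, 4}), (0.5, {3})]

choice, size = greedy_replacement(ADJ, {2, 4}, budget=1)
expected = expected_performance(ADJ, DISASTERS, budget=1)
```

In the toy example, replacing node 4 reconnects four nodes while replacing node 2 reconnects only three, so the greedy step picks node 4 first.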

The Risk of Successive Disasters: A Blow-by-Blow Network Vulnerability Analysis

Conference paper (2019) - Jorik Oostenbrink, Fernando Kuipers

It is often assumed that a network will not be struck by multiple disasters in a relatively short period of time; that is, a subsequent disaster will not strike within the recovery phase of a previous disaster. However, recent events have shown that combinations of disasters are not implausible. This realization calls for a new perspective on how we assess the vulnerability of our networks and shows a need for a framework to assess the vulnerability of networks to successive independent disasters. We propose a network and disaster model capable of modeling a sequence of disasters in time, while taking into account recovery operations. Based on that model, we develop both an exact and a Monte Carlo method to compute the vulnerability of a network to successive disasters. By applying our approach to real empirical disaster data, we show that the probability of a second disaster striking the network during recovery can be significant even for short repair times. Our framework is a first step towards determining the vulnerability of networks to such successive disasters. ...
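The relationship between an exact and a Monte Carlo computation can be illustrated on a deliberately simplified model: disasters arrive as a Poisson process with a fixed rate, and every repair takes a fixed time. Both are assumptions made here for illustration; the paper's model accounts for recovery operations in more detail, and the function names and parameter values are hypothetical.

```python
import math
import random


def exact_probability(rate, repair_time):
    """P(next disaster arrives before repairs finish) under Poisson arrivals."""
    return 1.0 - math.exp(-rate * repair_time)


def monte_carlo_probability(rate, repair_time, trials, seed=0):
    """Estimate the same probability by sampling inter-arrival times."""
    rng = random.Random(seed)
    hits = sum(rng.expovariate(rate) < repair_time for _ in range(trials))
    return hits / trials


# Hypothetical parameters: 0.2 disasters per year, six-month repair time.
exact = exact_probability(rate=0.2, repair_time=0.5)
estimate = monte_carlo_probability(rate=0.2, repair_time=0.5, trials=100_000)
```

Even in this toy setting, a repair time of half a year at one disaster every five years gives a probability of about 9.5% that a second disaster strikes during recovery, which is consistent in spirit with the abstract's observation that this probability can be significant for short repair times.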

Computing the Impact of Disasters on Networks

Journal article (2017) - Jorik Oostenbrink, Fernando Kuipers

In this paper, we consider the vulnerability of a network to disasters, in particular earthquakes, and we propose an efficient method to compute the distribution of a network performance measure, based on a finite set of disaster areas and occurrence probabilities. Our approach has been implemented as a tool to help visualize the vulnerability of a network to disasters. With that tool, we demonstrate our methods on an official set of Japanese earthquake scenarios. ...