Communication networks are critical in business, government, and even our day-to-day life. A prolonged communication outage can have devastating effects, particularly during and after a disaster. Unfortunately, our communication infrastructure is still vulnerable to natural disasters and other events that damage multiple network components within a confined area. In this thesis, we study the disaster resilience of communication networks. We propose scalable, data-driven methods to help stakeholders both assess and improve the resilience of networks to disasters.
We first study the global risk of earthquakes to Internet Exchange Points (IXPs). We find that many facilities are at risk of earthquakes and that, when an earthquake occurs, multiple facilities may well fail simultaneously. Fortunately, our analysis also shows that larger IXPs tend to be located in less earthquake-prone areas, and that peering at multiple facilities significantly reduces the impact of earthquakes on IXPs and autonomous systems. To help network operators reduce the impact of earthquakes on their autonomous systems, we propose a novel metric for selecting peering facilities, based on the probability of simultaneous facility failures. We show that applying our metric can significantly increase the resilience of individual autonomous systems, as well as that of the Internet as a whole.
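The idea of ranking facility combinations by their probability of simultaneous failure can be illustrated with a small sketch. This is not the thesis's actual metric; the disaster set, facility names, and probabilities below are all illustrative, and we simply pick the pair of facilities least likely to be destroyed by the same earthquake.

```python
from itertools import combinations

# Hypothetical disaster set: each scenario has a probability and the set of
# peering facilities it would knock out (names and numbers are illustrative).
scenarios = [
    (0.02, {"FAC-A", "FAC-B"}),
    (0.01, {"FAC-B", "FAC-C"}),
    (0.005, {"FAC-A", "FAC-C", "FAC-D"}),
]

def p_simultaneous_failure(facilities, scenarios):
    """Probability that a single random disaster fails *all* given facilities."""
    return sum(p for p, failed in scenarios if facilities <= failed)

def best_pair(candidates, scenarios):
    """Pick the facility pair least likely to fail together."""
    return min(combinations(sorted(candidates), 2),
               key=lambda pair: p_simultaneous_failure(set(pair), scenarios))

print(best_pair({"FAC-A", "FAC-B", "FAC-C", "FAC-D"}, scenarios))
# In this toy data, FAC-B and FAC-D never appear in the same scenario.
```

In this toy instance, peering at the two facilities that share no disaster scenario drives the joint-failure probability to zero, which is exactly the kind of diversity the metric rewards.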
To effectively improve the resilience of communication networks to natural disasters, stakeholders need to make well-informed trade-offs between costs, network performance, and network resilience. To help stakeholders make these decisions, we propose a single-disaster and a successive-disaster framework for assessing the resilience of a network to natural disasters. These frameworks can help stakeholders anticipate potential disasters, and compare the effects of any trade-off on the resilience of their networks.
The main principle behind both frameworks is to assess the disaster resilience of a network based on a large set of representative disaster scenarios (called the disaster set). This approach is flexible with respect to the underlying disaster dataset, and can be applied to datasets of widely varying sizes and properties. Our single-disaster framework allows one to efficiently compute the distribution of a network performance metric, assuming that a single, random disaster strikes the network and damages one or more network components in a confined area. Our method speeds up computation by first computing the distribution of the state of the network after a random disaster (the number of possible states tends to be much smaller than the disaster set itself), and only then computing the performance of the network in each of these states.
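The two-step computation described above can be sketched in a few lines. This is a toy version, not the framework itself: disasters, link identifiers, and the performance metric are all illustrative. The point is that many disasters induce the same post-disaster state, so we aggregate probabilities per state first and evaluate the (potentially expensive) performance metric only once per state.

```python
from collections import defaultdict

# Toy disaster set: each disaster has a probability and the set of links it
# destroys (identifiers are illustrative).
disasters = [
    (0.4, frozenset({"e1"})),
    (0.3, frozenset({"e1"})),          # induces the same state as the first
    (0.2, frozenset({"e2", "e3"})),
    (0.1, frozenset()),                # disaster misses the network
]

def state_distribution(disasters):
    """Step 1: probability distribution over post-disaster network states."""
    dist = defaultdict(float)
    for p, failed in disasters:
        dist[failed] += p
    return dist

def performance_distribution(disasters, perf):
    """Step 2: evaluate the metric once per state, not once per disaster."""
    out = defaultdict(float)
    for state, p in state_distribution(disasters).items():
        out[perf(state)] += p
    return dict(out)

# Stand-in metric: fraction of the three links still operational.
links = {"e1", "e2", "e3"}
frac_up = lambda failed: round(len(links - failed) / len(links), 2)
print(performance_distribution(disasters, frac_up))
```

Here four disasters collapse into three states, so the metric is evaluated three times instead of four; on realistic disaster sets the reduction is far larger.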
In addition to studying the impact of a single disaster on a network, we also address the issue of successive disasters. We first define the concept of a successive disaster: a disaster that strikes the network while the damage from a previous disaster is still being repaired. We then propose a framework capable of modeling a sequence of disasters in time, while taking into account recovery operations. We develop both an exact method and a Monte Carlo method to compute the vulnerability of a network to successive disasters, and find that the probability of a second disaster striking the network during recovery can be significant even for short repair times.
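A minimal Monte Carlo sketch makes the last observation concrete. Assume, purely for illustration, that disasters arrive as a Poisson process and that repairs take a fixed time; under those assumptions the probability that a successive disaster strikes during repair has a closed form, which the simulation recovers. This is a simplification, not the thesis's model, and the rate and repair time below are arbitrary.

```python
import math
import random

random.seed(0)

# Assumed toy model: disasters arrive as a Poisson process with rate LAM per
# year; repairing the damage takes REPAIR years. A "successive disaster"
# occurs if the next disaster arrives before repair completes.
LAM, REPAIR, TRIALS = 0.5, 1.0, 100_000

# Monte Carlo: sample the inter-arrival time to the next disaster and check
# whether it falls within the repair window.
hits = sum(random.expovariate(LAM) < REPAIR for _ in range(TRIALS))
estimate = hits / TRIALS

# Closed form for this simple model: P(next arrival < REPAIR) = 1 - e^(-LAM*REPAIR).
exact = 1 - math.exp(-LAM * REPAIR)
print(f"Monte Carlo: {estimate:.3f}, exact: {exact:.3f}")
```

Even with a modest rate of one disaster every two years and a one-year repair, the chance of a second strike during recovery is already close to 40% in this toy model, illustrating why short repair times do not make successive disasters negligible.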
Our successive disaster framework can be applied not only to subsequent natural disasters, but also to potential follow-up attacks. Experiments on two network topologies show that even small targeted attacks can greatly aggravate the network disruption caused by a natural disaster. Fortunately, we find that this effect can be mitigated, at almost no cost to network performance, by adopting a calculated repair strategy that takes into account the possibility of follow-up attacks.
In addition to providing methods for assessing the resilience of networks, we also provide algorithms for improving the resilience of networks to natural disasters. These algorithms can help stakeholders (1) recover network functionality more effectively in the initial period after a disaster, and (2) reduce the initial impact of a disaster on network performance.
After a disaster, a network operator can quickly restore some functionality by replacing failed nodes with temporary emergency nodes. These emergency nodes should be deployed as soon as possible. However, selecting an optimal set of replacement nodes is computationally intensive, and the complete state of the network might still be unknown shortly after the disaster. We therefore propose selecting a recovery strategy a priori, before any disaster occurs. We give an algorithm for evaluating such strategies by extending our single-disaster assessment framework.
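Evaluating an a-priori strategy over a disaster set can be sketched as follows. This is an illustrative simplification, not the thesis's algorithm: a strategy is modeled as a function from a post-disaster state to the nodes it replaces with emergency nodes, and we score it by its expected performance over the state distribution. All names, the budget, and the toy strategy are assumptions.

```python
# Post-disaster state (set of failed nodes) -> probability; values illustrative.
states = {
    frozenset({"n1"}): 0.6,
    frozenset({"n1", "n2"}): 0.3,
    frozenset(): 0.1,
}
BUDGET = 1  # emergency nodes available for deployment

def strategy(failed):
    """Toy a-priori rule: replace the first failed nodes, up to the budget."""
    return set(sorted(failed)[:BUDGET])

def expected_performance(states, perf):
    """Expected metric after applying the strategy in each possible state."""
    return sum(p * perf(failed - strategy(failed)) for failed, p in states.items())

# Stand-in metric: fraction of the three nodes that are operational.
nodes = {"n1", "n2", "n3"}
frac_alive = lambda still_failed: len(nodes - still_failed) / len(nodes)
print(round(expected_performance(states, frac_alive), 3))
```

Because the strategy is fixed in advance, comparing candidate strategies reduces to re-running this expectation over the same precomputed state distribution, which is what makes the a-priori approach tractable.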
An effective, but costly, method of improving the disaster resilience of a network is to add new, geographically redundant, cable connections. These redundant connections ensure that more areas remain connected after a disaster strikes the network, and thus reduce the initial impact of the disaster on the network. We provide algorithms for finding cable routes that minimize a function of disaster impact and cable cost under any disaster set. Since this problem is NP-hard, we give an exact algorithm, as well as a heuristic, for solving it.
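One simple heuristic in this spirit, shown below purely as a sketch and not as the thesis's algorithm, is a shortest-path search in which each candidate cable segment is charged its cost plus a penalty proportional to the probability that a disaster in the set cuts it. The graph, costs, cut probabilities, and trade-off weight are all illustrative; note that an additive per-edge penalty ignores correlated failures along a route, which is part of what makes the real problem hard.

```python
import heapq

# Toy graph: node -> list of (neighbor, cable cost, probability that some
# disaster in the set cuts this segment). All numbers are illustrative.
graph = {
    "A": [("B", 1.0, 0.30), ("C", 2.0, 0.01)],
    "B": [("D", 1.0, 0.30)],
    "C": [("D", 2.0, 0.01)],
    "D": [],
}
ALPHA = 10.0  # trade-off weight between cable cost and disaster risk

def safest_route(src, dst):
    """Dijkstra on weights cost + ALPHA * p_cut; returns (weight, path)."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        w, node, path = heapq.heappop(pq)
        if node == dst:
            return w, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, cost, p_cut in graph[node]:
            if nxt not in seen:
                heapq.heappush(pq, (w + cost + ALPHA * p_cut, nxt, path + [nxt]))
    return None

print(safest_route("A", "D"))
```

In this toy instance the cheaper route through B runs through disaster-prone segments, so the search prefers the longer but safer route through C, illustrating the cost-versus-impact trade-off the exact algorithm and heuristic navigate.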