Hawkes Processes in Large-Scale Service Systems

Improving service management at ING

Abstract

With the expansion of large-scale service systems and the exponential growth of data generated by complex IT infrastructure components, gaining a comprehensive overview of the different levels of service within an IT system has become increasingly challenging. This has prompted a large commercial bank to ask how the IT monitoring data streams generated by its complex IT infrastructure can be associated with one another.

Each record in the monitoring stream contains, among other things, a message and a time stamp. The bank's monitoring data stream carries information of two natures: automatically generated warnings, referred to as events, and unplanned outages, referred to as incidents. Events and incidents are jointly referred to as arrivals. As a first step toward finer granularity, event and incident messages with similar semantics are grouped together. To this end, the message component of each arrival is transformed into a numerical vector, the dimension of the resulting vector is reduced, and the collection of vectors is clustered. Once each arrival from the IT monitoring data stream has been assigned to a cluster based on its message component, it is given a mark. This mark combines the assigned cluster, the nature, and three different levels of service from the IT architecture on which the arrival occurred.
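
The three steps (vectorize, reduce, cluster) and the construction of a mark can be sketched as follows. This is only a minimal illustration using the Python standard library, not the method used in the thesis: the abstract does not name the techniques, so bag-of-words counts, feature selection by document frequency, and k-means are stand-ins chosen for brevity. The example messages, the `nature` values, and the service-level labels are hypothetical.

```python
from collections import Counter

def vectorize(messages):
    """Bag-of-words counts over a shared, alphabetically sorted vocabulary."""
    docs = [Counter(m.lower().split()) for m in messages]
    vocab = sorted({w for d in docs for w in d})
    return [[float(d[w]) for w in vocab] for d in docs], vocab

def reduce_dimension(vectors, vocab, n_keep):
    """Keep the n_keep terms found in the most messages (ties: alphabetical)."""
    doc_freq = [sum(1 for v in vectors if v[j] > 0) for j in range(len(vocab))]
    keep = sorted(range(len(vocab)), key=lambda j: (-doc_freq[j], vocab[j]))[:n_keep]
    return [[v[j] for j in keep] for v in vectors]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from every centroid chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical arrivals: message, nature, and three service levels.
arrivals = [
    ("disk usage high on server01", "event", ("level1-a", "level2-a", "level3-a")),
    ("disk usage critical on server02", "incident", ("level1-a", "level2-a", "level3-b")),
    ("login failed for user alice", "event", ("level1-b", "level2-c", "level3-d")),
    ("login failed for user bob", "event", ("level1-b", "level2-c", "level3-d")),
]

vectors, vocab = vectorize([message for message, _, _ in arrivals])
reduced = reduce_dimension(vectors, vocab, n_keep=4)
labels = kmeans(reduced, k=2)

# A mark combines the cluster, the nature, and the three service levels.
marks = [(lab, nature) + levels for lab, (_, nature, levels) in zip(labels, arrivals)]
```

In this toy input the two disk messages share one cluster and the two login messages share the other, so arrivals that differ only in host or user name receive marks with the same cluster component.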

From a mathematical point of view, the monitoring data stream across the different levels of service can now be viewed as a marked point process. We focus on a specific class of marked point processes, the marked Hawkes processes. Under this model, each arrival from the IT monitoring data stream instantaneously increases the likelihood of certain other arrivals in the near future. We then estimate the excitation matrix, which represents these instantaneous increases among all assigned marks. Once the excitation matrix has been estimated, we decompose it into the different levels of service defined within the mark, using hierarchical linear models. The decomposition yields a comprehensive overview of the excitation behavior in large-scale service systems. This overview can be incorporated directly into the field of Software Architecture to uncover associations within complex IT infrastructures.
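
To make the role of the excitation matrix concrete, the sketch below evaluates the conditional intensity of a marked Hawkes process. It assumes an exponentially decaying kernel, which the abstract does not specify: the rate of mark j at time t is taken to be its baseline rate plus, for every earlier arrival with mark i at time t_i, the term excitation[i][j] * exp(-decay * (t - t_i)). The function name, parameter names, and all numbers are hypothetical.

```python
import math

def intensity(t, history, baseline, excitation, decay):
    """Arrival rate of every mark at time t, given the earlier arrivals.

    history    : list of (time, mark_index) pairs
    baseline   : background rate per mark
    excitation : excitation[i][j] is the instantaneous jump in the rate of
                 mark j caused by an arrival with mark i
    decay      : rate at which a jump fades (exponential kernel, an assumption)
    """
    rates = list(baseline)
    for t_i, m_i in history:
        if t_i < t:  # only strictly earlier arrivals contribute
            fade = math.exp(-decay * (t - t_i))
            for j in range(len(rates)):
                rates[j] += excitation[m_i][j] * fade
    return rates

# Two hypothetical marks: 0 = a warning event, 1 = an incident.
baseline = [0.1, 0.05]
excitation = [[0.0, 0.5],   # a mark-0 arrival raises the rate of mark 1
              [0.0, 0.0]]   # a mark-1 arrival excites nothing
history = [(1.0, 0)]        # one mark-0 arrival at time 1.0

before = intensity(0.5, history, baseline, excitation, decay=1.0)
after = intensity(2.0, history, baseline, excitation, decay=1.0)
```

Before the arrival the rates equal the baseline. One time unit after it, the rate of mark 1 is raised by 0.5 * exp(-1) while the rate of mark 0 is unchanged, which is the kind of directed association the estimated excitation matrix is meant to capture.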