The State of Data Streaming Practices at ING

Abstract

Data stream processing has become one of the key themes in the database and distributed systems communities worldwide, as data volumes have grown substantially across a range of industries over the last several years. Because data stream processing is a relatively recent development in data-driven approaches, several teams at ING are investigating its possibilities. This thesis therefore aims to provide insight into data stream processing practices at ING using survey research methodology. We conducted an extensive study that included a review of academic publications on data streaming, an online questionnaire distributed to 45 practitioners at ING, and in-depth interviews with 5 streaming practitioners. Our survey aimed at understanding: (i) the use cases of data streaming; (ii) the types of streamed data users have; (iii) the streaming tasks and computations users run on their streams; (iv) the machine learning tasks users perform on their streams; and (v) the streaming software and tools used to process their streams. The results of the academic review formed the basis for designing the questionnaire. We discuss the participants' answers to the questionnaire, highlighting common trends and the challenges they faced. Through the interviews, we obtained more detailed answers to some of our questions. Our research yielded several observations about data stream processing in practice. In particular: real-time monitoring and event categorization are popular use cases for data streaming; data contained in streams represent a diverse range of entities and are homogeneous in format, type and category; machine learning in streaming environments is prevalent; Apache Kafka is a commonly used stream processing engine; and the complexity of implementing data streaming is the challenge most often expressed by our participants.