DatApollo
Orchestration of Serverless Functions for Scalable Data Mining
Mahtab Shahin (Tallinn University of Technology)
Markus Bertl (Vienna University of Economics and Business)
Nasim Janatian (TU Delft - Civil Engineering & Geosciences)
Juan Aznar-Poveda (University of Innsbruck)
Syed Attique Shah (Birmingham City University)
Thomas Fahringer (University of Innsbruck)
Sijo Arakkal Peious (Tallinn University of Technology)
Dirk Draheim (Tallinn University of Technology)
More Info
expand_more
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
With the exponential growth of data generated from enterprise systems, social networks, and the Internet of Things, traditional data mining techniques face major challenges in terms of scalability and efficiency. As a foundational unsupervised learning method for detecting patterns in transactional datasets, Association Rule Mining (ARM) is commonly encountered in distributed environments with performance bottlenecks due to excessive memory consumption, static resource provisioning, and costly data shuffle. The present paper presents DatApollo, a novel serverless orchestration framework that enables the execution of distributed ARM workflows in a scalable and efficient manner. DatApollo, based on the Apollo orchestration engine, offers stateless cloud functions, dynamic scheduling, intermediate state persistence, and fault-tolerant coordination in order to address the limitations of both traditional cluster-based architectures and existing Function-as-a-Service models. By decomposing ARM pipelines into orchestrated microfunctions, the framework enables elastic, cloud-native execution with minimal idle time. Using real-world healthcare and meteorological datasets, we describe the architectural design, algorithmic components, and computational complexity of DatApollo and perform a comprehensive experimental evaluation. DatApollo provides up to five times faster execution time compared to Apache Spark and lowers infrastructure costs by utilizing elastic scaling and event-driven function invocations. The results demonstrate that DatApollo is a robust, cost-effective and high-performance alternative to ARM in dynamic, large-scale data environments.