The workflow of a data science practitioner includes gathering information from different sources and applying machine learning (ML) models to it. Such dispersed information can be combined through a process known as Data Integration (DI), which defines relations between entities and attributes. When all information is combined into a single source suited for ML purposes, that source often contains duplicate data, resulting in longer runtimes. Recent work has derived algebraic rewrite rules for ML so that computations can be pushed down to the individual sources, a technique referred to as factorized learning. In this work, we present an implementation of Amalur, a system for automated factorized learning that is novel in its applicability to any DI scenario. Amalur shows significant speedups for some datasets and models, while for others our implementation of factorized learning results in slowdowns. Whether factorized learning is efficient depends on the operations involved in the applied ML model and on the data supplied to it. The process of estimating the efficiency of factorization is known as cost estimation. Previous efforts on cost estimation do not generalize to the DI scenarios covered by Amalur, are not tailored to different ML models, and have not been properly evaluated. In this thesis, we create a cost estimation procedure for Amalur that is suitable for any DI scenario and adapts to individual ML models. To evaluate this procedure, we create a data generator capable of producing customizable DI scenarios. Our method outperforms its competitor on the DI scenarios covered by the state of the art, and shows comparable performance on newly covered scenarios.
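The push-down idea behind factorized learning can be sketched in a few lines of NumPy. The example below is a generic illustration, not Amalur's actual interface: it assumes a materialized table T that duplicates rows of a source table S via a one-hot indicator matrix I, and shows that T @ w can be rewritten as I @ (S @ w), so the expensive multiplication happens once per distinct source row rather than once per redundant row.

```python
import numpy as np

# Toy sketch of a factorized rewrite (hypothetical setup, not Amalur's API).
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 3))      # source table with 4 distinct rows
rows = np.array([0, 1, 1, 2, 3, 1])  # a join duplicates some of those rows
I = np.eye(4)[rows]                  # 6x4 one-hot indicator matrix
T = I @ S                            # materialized table, containing redundancy

w = rng.standard_normal(3)

# Naive: multiply over the redundant materialized table.
naive = T @ w

# Factorized: push the multiplication down to the source table,
# then scatter the per-row results through the indicator matrix.
factorized = I @ (S @ w)

assert np.allclose(naive, factorized)
```

Whether the rewrite pays off depends on how much redundancy I introduces relative to the cost of the extra scatter step, which is exactly the trade-off the cost estimation procedure in this thesis aims to predict.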