Feature Discovery for Data-Centric AI
Abstract
We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI): from the model-centric paradigm, which focuses primarily on innovating ML models, to the data-centric paradigm, which prioritises high-quality, reliable data for AI/ML applications. This emphasis on data has given rise to an economy around data, including data marketplace platforms where data is traded as a commodity. However, trading data involves constraints that reflect the specific needs of users, such as enriching or augmenting their datasets or creating datasets with particular properties. These constraints pose challenges that the data management community has already addressed, but independently of the marketplace platform context. As a first step in this thesis, we therefore survey industry professionals who produce, trade, and purchase data assets, and then integrate approaches and practices from the data management community into an open-source data marketplace platform.
In line with the data-centric AI objective of creating high-quality training datasets, our research focuses on developing automated methods to identify relevant and related features (e.g., columns) with which a given dataset can be augmented. This effort has led to the research and design of feature discovery, which sits at the intersection of three areas: dataset discovery (finding related datasets), data integration (joining datasets), and feature selection (selecting highly predictive features for ML models). We have developed an automated approach for feature discovery that improves on existing automated data augmentation techniques in both the effectiveness and the efficiency of finding the most relevant features.
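To make the three ingredients of feature discovery concrete, the following is a minimal sketch and not the approach developed in this thesis. It assumes that tables are small lists of dictionaries sharing a single join key, that relatedness is measured by key-value containment, and that a feature's predictive value is approximated by its absolute Pearson correlation with the target. All function names, the `min_overlap` threshold, and the toy data are hypothetical.

```python
from math import sqrt

def containment(base_values, candidate_values):
    """Dataset discovery: share of distinct base keys also present in the candidate."""
    base = set(base_values)
    return len(base & set(candidate_values)) / len(base) if base else 0.0

def left_join(base_rows, cand_rows, key):
    """Data integration: left-join on `key` (assumes keys are unique in cand_rows)."""
    lookup = {row[key]: row for row in cand_rows}
    joined = []
    for row in base_rows:
        match = lookup.get(row[key], {})
        extra = {k: v for k, v in match.items() if k not in row}
        joined.append({**row, **extra})
    return joined

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def discover_features(base_rows, candidates, key, target, min_overlap=0.5):
    """Rank candidate columns by |correlation| with the target after joining."""
    base_cols = set(base_rows[0]) if base_rows else set()
    ranked = []
    for name, cand_rows in candidates.items():
        overlap = containment([r[key] for r in base_rows],
                              [r[key] for r in cand_rows])
        if overlap < min_overlap:  # not joinable enough, skip this table
            continue
        joined = left_join(base_rows, cand_rows, key)
        new_cols = {c for r in cand_rows for c in r} - base_cols
        for col in sorted(new_cols):
            # Feature selection: score only rows where the join found a value.
            pairs = [(r[col], r[target]) for r in joined if r.get(col) is not None]
            score = abs(pearson([p[0] for p in pairs], [p[1] for p in pairs]))
            ranked.append((name, col, score))
    return sorted(ranked, key=lambda item: item[2], reverse=True)

# Toy data (hypothetical): a base table with a target and three candidate tables.
base = [{"id": i, "price": 10.0 * i} for i in range(1, 5)]
candidates = {
    "sizes": [{"id": i, "area": float(i)} for i in range(1, 5)],
    "noise": [{"id": 1, "code": 3.0}, {"id": 2, "code": 1.0},
              {"id": 3, "code": 4.0}, {"id": 4, "code": 2.0}],
    "other": [{"id": 9, "x": 1.0}],  # no key overlap with the base table
}
ranked = discover_features(base, candidates, key="id", target="price")
```

In this toy setting, `ranked` places `area` from the `sizes` table first and drops the `other` table because it shares no keys with the base table. A realistic pipeline would have to handle multi-hop joins, non-numeric columns, and stronger relevance measures, none of which this sketch attempts.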
However, in adopting automated approaches, we found that moving towards data-centric AI risks detaching us not only from model-centric AI but also from user-centric AI. To assess the extent to which users (e.g., data scientists, data engineers, ML engineers) rely on and trust automated approaches, and to understand their feature discovery workflows, we conducted 19 interviews based on a use-case study. The results revealed that users doubt the automated methods and would rather be involved in the process. Consequently, we decided to incorporate users into the feature discovery process and to explore whether their involvement (e.g., by adding domain and business knowledge) improves both the process and the quality of the resulting dataset.
We therefore created a human-in-the-loop approach for feature discovery and evaluated it through interviews with a subset of our initial participant pool. The results confirmed that a human-in-the-loop method is more approachable for users: it gives them control over and insight into the process, as well as the opportunity to inject their own knowledge, which helps ensure that the resulting dataset is relevant to their data tasks.
With this thesis, we make scientific contributions to the field of data management by offering novel insights into users' workflows and designing and developing resources that enhance feature discovery. We hope our contributions will serve as a valuable resource for future work in user-centric and data-centric feature discovery.