SlimML: Removing Non-critical Input Data in Large-scale Iterative Machine Learning

Journal Article (2019)
Author(s)

Rui Han (Beijing Institute of Technology)

Chi Harold Liu (Beijing Institute of Technology)

Shilin Li (Beijing Institute of Technology)

Lydia Y. Chen (TU Delft - Data-Intensive Systems)

Guoren Wang (Beijing Institute of Technology)

Jian Tang (DiDi AI Labs)

Jieping Ye (DiDi AI Labs)

Research Group
Data-Intensive Systems
Copyright
© 2019 Rui Han, Chi Harold Liu, Shilin Li, Lydia Y. Chen, Guoren Wang, Jian Tang, Jieping Ye
DOI
https://doi.org/10.1109/TKDE.2019.2951388
Publication Year
2019
Language
English
Issue number
5
Volume number
33
Pages (from-to)
2223-2236
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The core of many large-scale machine learning (ML) applications, such as neural networks (NN), support vector machines (SVM), and convolutional neural networks (CNN), is the training algorithm that iteratively updates model parameters by processing massive datasets. Among the plethora of studies aiming at accelerating ML, such as data parallelization and parameter servers, the prevalent assumption is that all data points are equally relevant to model parameter updating. In this article, we challenge this assumption by proposing a criterion to measure a data point's effect on model parameter updating, and experimentally demonstrate that the majority of data points are non-critical in the training process. We develop a slim learning framework, termed SlimML, which trains ML models only on the critical data and thus significantly improves training performance. To this end, SlimML efficiently leverages a small number of aggregated data points per iteration to approximate the criticalness of the original input data instances. The proposed approach can be adopted by changing a few lines of code in a standard stochastic gradient descent (SGD) procedure. We demonstrate experimentally, on NN regression, SVM classification, and CNN training, that for large datasets it accelerates the model training process by an average of 3.61 times while incurring accuracy losses of only 0.37 percent.
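
The abstract's claim that the approach amounts to changing a few lines of a standard SGD procedure can be illustrated with a small sketch. The Python example below is a hypothetical illustration under stated assumptions, not the authors' exact SlimML algorithm: it approximates the criticalness of each partition of the input via the gradient magnitude at an aggregated (mean) data point, then runs SGD only on points from the most critical partitions. All names (grad, slim_sgd) and parameters (n_clusters, keep_ratio) are illustrative.

import numpy as np

# Hypothetical sketch: a plain SGD loop modified in a few lines to skip
# "non-critical" input data. The criticalness proxy (gradient magnitude
# at aggregated, i.e. averaged, data points) is an assumption made for
# illustration only.

def grad(w, X, y):
    # Gradient of mean squared error for a linear model y ~ X @ w.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def slim_sgd(X, y, n_clusters=32, keep_ratio=0.3, lr=0.01, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    # Summarize the input with a small number of aggregated points:
    # randomly partition the data; each partition is represented by its mean.
    cluster = rng.integers(0, n_clusters, size=n)
    for _ in range(epochs):
        scores = np.zeros(n_clusters)
        for c in range(n_clusters):
            mask = cluster == c
            if not mask.any():
                continue
            xc = X[mask].mean(axis=0, keepdims=True)  # aggregated point
            yc = np.array([y[mask].mean()])
            # Approximate the partition's effect on the parameter update.
            scores[c] = np.linalg.norm(grad(w, xc, yc))
        # Keep only points belonging to the most critical partitions.
        n_keep = max(1, int(keep_ratio * n_clusters))
        critical = np.argsort(scores)[-n_keep:]
        idx = np.flatnonzero(np.isin(cluster, critical))
        rng.shuffle(idx)
        for i in idx:  # standard SGD, restricted to the critical subset
            w -= lr * grad(w, X[i:i+1], y[i:i+1])
    return w

# Example usage on synthetic regression data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    w_true = np.arange(1.0, 6.0)
    y = X @ w_true + 0.1 * rng.normal(size=1000)
    print(slim_sgd(X, y))

Compared with plain SGD over all n points, the only changes are the per-iteration scoring of aggregated points and the restriction of the update loop to the critical subset, which mirrors the "few lines of code" modification described in the abstract.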

Files

08890886.pdf
(PDF, 2.46 MB)
License info not available