Also for k-means: more data does not imply better performance

Journal Article (2023)
Author(s)

M. Loog (TU Delft - Pattern Recognition and Bioinformatics, Radboud Universiteit Nijmegen)

Jesse H. Krijthe (Radboud Universiteit Nijmegen, TU Delft - Pattern Recognition and Bioinformatics)

Manuele Bicego (University of Verona)

Research Group
Pattern Recognition and Bioinformatics
Copyright
© 2023 M. Loog, J.H. Krijthe, Manuele Bicego
DOI
https://doi.org/10.1007/s10994-023-06361-6
Publication Year
2023
Language
English
Issue number
8
Volume number
112
Pages (from-to)
3033-3050
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Arguably, a desirable property of a learner is that its performance improves, at least in expectation, as the amount of training data increases. This issue has received renewed attention in recent years, and some curious and surprising findings have been reported. In essence, these results show that more data does not necessarily lead to improved performance; worse, performance can even deteriorate. Clustering, however, has not been subjected to such a study up to now. This paper shows that k-means clustering, a ubiquitous technique in machine learning and data mining, suffers from the same lack of so-called monotonicity and can display a deterioration in expected performance with increasing training set size. Our main, theoretical contributions prove that 1-means clustering is monotonic, while 2-means is not even weakly monotonic, i.e., the occurrence of nonmonotonic behavior persists indefinitely, beyond any training sample size. For larger k, the question remains open.
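The learning-curve behavior described in the abstract can be probed empirically. The sketch below is not the authors' experimental setup; it is a minimal, self-contained illustration of the quantity under study: fit 2-means (plain Lloyd's algorithm) on training samples of increasing size drawn from a standard normal distribution, estimate the expected clustering cost on a large held-out sample, and average over many repetitions to approximate the learning curve. The choice of distribution, number of repetitions, and sample sizes are all illustrative assumptions.

```python
import numpy as np

def kmeans_fit(X, k=2, iters=50, rng=None):
    # Lloyd's algorithm on 1-D data; returns the k cluster centers.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = rng.choice(X, size=k, replace=False)  # init from distinct data points
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the means.
        labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean()
    return centers

def expected_cost(centers, X_eval):
    # Mean squared distance of evaluation points to their nearest center:
    # a Monte Carlo estimate of the population k-means cost.
    return np.mean(np.min((X_eval[:, None] - centers[None, :]) ** 2, axis=1))

rng = np.random.default_rng(42)
X_eval = rng.standard_normal(50_000)     # large sample as a proxy for the true distribution
sizes = [2, 4, 8, 16, 32, 64]            # training set sizes n
curve = []
for n in sizes:
    costs = [expected_cost(kmeans_fit(rng.standard_normal(n), rng=rng), X_eval)
             for _ in range(200)]        # average over 200 training samples per n
    curve.append(float(np.mean(costs)))
print(dict(zip(sizes, np.round(curve, 4))))
```

A monotonic learner would produce a non-increasing `curve`; the paper's point is that for 2-means this need not hold, at any sample size. A plot of `curve` against `sizes` (and variations of the sampling distribution) is the natural way to inspect this.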