T.J. Viering | TU Delft Repository

Learning Learning Curves

Journal article (2025) - O. Taylan Turan, David M.J. Tax, Tom J. Viering, Marco Loog

Learning curves depict how a model’s expected performance changes with varying training set sizes, unlike training curves, which show a gradient-based model’s performance with respect to training epochs. Extrapolating learning curves can be useful for determining the performance gain with additional data. Parametric functions that assume monotone behaviour of the curves are a prevalent methodology to model and extrapolate learning curves. However, learning curves do not necessarily follow a specific parametric shape: they can have peaks, dips, and zigzag patterns. These unconventional shapes can hinder the extrapolation performance of commonly used parametric curve-fitting models. In addition, the objective functions for fitting such parametric models are non-convex, making them initialization-dependent and brittle. In response to these challenges, we propose a convex, data-driven approach that extracts information from available learning curves to guide the extrapolation of another targeted learning curve. Our method achieves this by using a learning curve database. Using the initial segment of the observed curve, we determine a group of similar curves from the database and reduce the dimensionality via Functional Principal Component Analysis (FPCA). These principal components are used in a semi-parametric kernel ridge regression (SPKR) model to extrapolate targeted curves. The solution of the SPKR can be obtained analytically and does not suffer from initialization issues. To evaluate our method, we create a new database of diverse learning curves that do not always adhere to typical parametric shapes. Our method performs better than parametric and non-parametric learning curve-fitting methods on this database for the learning curve extrapolation task. ...

Global patterns in vegetation accessible subsurface water storage emerge from spatially varying importance of individual drivers

Journal article (2024) - Fransje van Oorschot, Markus Hrachowitz, Tom Viering, Andrea Alessandri, Ruud J van der Ent

Vegetation roots play an essential role in regulating the hydrological cycle by removing water from the subsurface and releasing it to the atmosphere. However, the present understanding of the drivers of ecosystem-scale root development and their spatial variability globally is limited. This study investigates the varying roles of climate, landscape, and vegetation on the magnitude of root zone storage capacity (Sr) worldwide, which is defined as the maximum volume of subsurface moisture accessible to vegetation roots. To this aim, we quantified Sr and evaluated 21 possible climate, landscape, and vegetation controls for 3612 river catchments worldwide using a random forest machine learning model. Our findings reveal climate as primary, but spatially varying, driver of ecosystem scale Sr with landscape and vegetation characteristics playing a minor role. More specifically, we found the mean inter-storm duration as most dominant control of Sr globally, followed by mean temperature, mean precipitation, and mean topographic slope. While the inter-storm duration, temperature, and slope exhibit a consistent relation with Sr globally, the relation between precipitation and Sr varies spatially. Based on this spatial variability, we classified two different regimes: precipitation driven and energy limited. The precipitation-driven regime exhibits a positive relation between precipitation and Sr for precipitation of up to 3 mm d−1, above which the relation flattens and eventually becomes negative. The energy-limited regime exhibits a strictly negative relation between precipitation and Sr. Using the random forest model based on these three dominant climate variables and the landscape variable slope, we generated a global gridded dataset of Sr, which closely resembles other global datasets of root characteristics. 
This suggests that our parsimonious approach based on four globally available variables to estimate Sr on a global scale has the potential to be readily and easily integrated into the parameterization of Sr in global hydrological and land surface models. This may enhance the accuracy of global predictions of land–atmosphere exchange fluxes and hydrological extremes by providing a robust representation of both spatial and temporal variability in vegetation root characteristics. ...

The unreasonable effectiveness of early discarding after one epoch in neural network hyperparameter optimization

Journal article (2024) - Romain Egele, Felix Mohr, Tom Viering, Prasanna Balaprakash

To reach high performance with deep learning, hyperparameter optimization (HPO) is essential. This process is usually time-consuming due to costly evaluations of neural networks. Early discarding techniques limit the resources granted to unpromising candidates by observing the empirical learning curves and canceling neural network training as soon as the lack of competitiveness of a candidate becomes evident. Despite two decades of research, little is understood about the trade-off between the aggressiveness of discarding and the loss of predictive performance. Our paper studies this trade-off for several commonly used discarding techniques such as successive halving and learning curve extrapolation. Our surprising finding is that these commonly used techniques offer minimal to no added value compared to the simple strategy of discarding after a constant number of epochs of training. The chosen number of epochs mostly depends on the available compute budget. We call this approach i-Epoch (i being the constant number of epochs with which neural networks are trained) and suggest assessing the quality of early discarding techniques by comparing how their Pareto-Front (in consumed training epochs and predictive performance) complements the Pareto-Front of i-Epoch. ...
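
The constant-epoch strategy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `i_epoch_ranking`, the `train_one_epoch` callback, and the toy scoring function are hypothetical names introduced here, and the sketch assumes that candidates are simply ranked by their validation score after exactly i epochs.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def i_epoch_ranking(
    candidates: List[Config],
    train_one_epoch: Callable[[Config, int], float],
    i: int = 1,
) -> List[Tuple[float, Config]]:
    """Train every candidate for exactly i epochs and rank by validation score.

    train_one_epoch(config, epoch) is a hypothetical callback that runs one
    more epoch of training and returns the current validation score.
    """
    if i < 1:
        raise ValueError("i must be at least 1")
    scored = []
    for config in candidates:
        score = float("-inf")
        for epoch in range(i):
            score = train_one_epoch(config, epoch)
        # Training stops here for every candidate, regardless of its curve.
        scored.append((score, config))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_train_one_epoch(config: Config, epoch: int) -> float:
    # Stand-in for real training (an assumption for illustration only):
    # the score improves with epochs and peaks at learning rate 0.1.
    return 1.0 - abs(config["lr"] - 0.1) - 0.5 / (epoch + 1)


candidates = [{"lr": 0.01}, {"lr": 0.1}, {"lr": 1.0}]
ranking = i_epoch_ranking(candidates, toy_train_one_epoch, i=1)
best_score, best_config = ranking[0]
```

The total cost is `len(candidates) * i` epochs, which is why the choice of i can be tied directly to the available compute budget.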

The Shape of Learning Curves

A Review

Review (2023) - Tom Viering, Marco Loog

Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified. ...

On Safety in Machine Learning

Doctoral thesis (2023) - T.J. Viering

This dissertation focuses on safety in machine learning. Our adopted safety notion is related to robustness of learning algorithms. Related to this concept, we touch upon three topics: explainability, active learning and learning curves. Complex models can often achieve better performance compared to simpler ones. Such larger models are more like black boxes, whose inner workings are much harder to understand. However, explanations for their decisions may be required by law when these models are used, and may help us further improve them. For image data and CNNs, Grad-CAM produces explanations in the form of a heatmap. We construct CNNs whose heatmaps are manipulated, but whose predictions remain accurate, illustrating that Grad-CAM may not be robust enough for high-stakes tasks such as self-driving cars. Machine learning often requires large amounts of data for learning. Data annotation is often expensive or difficult. Active learning aims to reduce labeling costs by selecting data in a smart way, instead of the default, random sampling. Active learning algorithms aim to find the most useful samples. Surprisingly, we find that active learning algorithms with strictly better performance guarantees perform worse empirically. The cause: their worst-case analysis is unrealistic. A more optimistic average-case analysis does explain our empirical results. Thus better guarantees do not always translate to better performance. A learning curve visualizes the expected performance versus the sample size a learning algorithm is trained on. These curves are important for various applications, such as estimating the amount of data needed for learning. The conventional wisdom is that more data equals better performance. This means a learning curve strictly improves with more data, or in other words, is monotone. Deviations can surely be explained away by noise, chance, or a faulty experimental setup? 
To many in our field this may come as a surprise, but this behavior cannot be explained away. We survey the literature and highlight various non-monotone behaviors, even in cases where the learner uses a correct model. Our survey finds that learning curves can have a variety of shapes, such as power laws or exponentials, but there is no consensus and a complete characterization remains an open problem. We also find simple learning problems in classification and regression that show new non-monotone behaviors. Our problems can be tuned so non-monotonicity occurs for any sample size. Is there a universal solution to make learners monotone? We design a wrapper algorithm that only adopts a new model if its performance is significantly better on validation data. We prove that the learning curve of the wrapper is monotone with a certain probability. This provides a first step towards safe learners that are guaranteed to improve with more data. Many questions regarding safety remain; however, this thesis may provide inspiration to develop more robust learning algorithms. The main take-aways are (TLDR): • Strictly tighter generalization bounds do not imply better performance. • Explanations provided by Grad-CAM can be misleading. • Even in simple settings more data can lead to worse performance. • We provide ideas to construct learners that always improve with more data. ...

LCDB 1.0

An Extensive Learning Curves Database for Classification Tasks

Conference paper (2023) - Felix Mohr, Tom J. Viering, Marco Loog, Jan N. van Rijn

The use of learning curves for decision making in supervised machine learning is standard practice, yet understanding of their behavior is rather limited. To facilitate a deepening of our knowledge, we introduce the Learning Curve Database (LCDB), which contains empirical learning curves of 20 classification algorithms on 246 datasets. One of the LCDB’s unique strengths is that it contains all (probabilistic) predictions, which allows for building learning curves of arbitrary metrics. Moreover, it unifies the properties of similar high quality databases in that it (i) defines clean splits between training, validation, and test data, (ii) provides training times, and (iii) provides an API for convenient access (pip install lcdb). We demonstrate the utility of LCDB by analyzing some learning curve phenomena, such as convexity, monotonicity, peaking, and curve shapes. Improving our understanding of these matters is essential for efficient use of learning curves for model selection, speeding up model training, and determining the value of more training data. ...

Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings

Conference paper (2020) - Katja Geertruida Schmahl, Tom Julian Viering, Stavros Makrodimitris, Arman Naseri Jahfari, David Tax, Marco Loog

Large text corpora used for creating word embeddings (vectors which represent word meanings) often contain stereotypical gender biases. As a result, such unwanted biases will typically also be present in word embeddings derived from such corpora and downstream applications in the field of natural language processing (NLP). To minimize the effect of gender bias in these settings, more insight is needed when it comes to where and how biases manifest themselves in the text corpora employed. This paper contributes by showing how gender bias in word embeddings from Wikipedia has developed over time. Quantifying the gender bias over time shows that art-related words have become more female biased. Family and science words have stereotypical biases towards female and male words, respectively. These biases seem to have decreased since 2006, but these changes are not more extreme than those seen in random sets of words. Career-related words are more strongly associated with male than with female; this difference has only become smaller in recently written articles. These developments provide additional understanding of what can be done to make Wikipedia more gender neutral and how important time of writing can be when considering biases in word embeddings trained from Wikipedia or from other text corpora. ...

Making Learners (More) Monotone

Conference paper (2020) - Tom Julian Viering, Alexander Mey, Marco Loog

Learning performance can show non-monotonic behavior. That is, more data does not necessarily lead to better models, even on average. We propose three algorithms that take a supervised learning model and make it perform more monotone. We prove consistency and monotonicity with high probability, and evaluate the algorithms on scenarios where non-monotone behaviour occurs. Our proposed algorithm MT_HT makes less than 1% non-monotone decisions on MNIST while staying competitive in terms of error rate compared to several baselines. Our code is available at https://github.com/tomviering/monotone. ...
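
The idea of adopting a newly trained model only when it is significantly better on validation data can be sketched as follows. This is a simplified illustration, not the MT_HT algorithm from the paper: the function names are hypothetical, and the sketch assumes a one-sided exact McNemar (sign) test on per-example validation correctness as the significance check.

```python
from math import comb
from typing import Sequence


def one_sided_mcnemar_p(
    current_correct: Sequence[bool], candidate_correct: Sequence[bool]
) -> float:
    """p-value for 'candidate is not better than current' on paired predictions."""
    # b: examples only the candidate gets right; c: only the current model does.
    b = sum(1 for cur, cand in zip(current_correct, candidate_correct) if cand and not cur)
    c = sum(1 for cur, cand in zip(current_correct, candidate_correct) if cur and not cand)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of improvement
    # Under the null hypothesis, b follows Binomial(n, 0.5); compute P(X >= b).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


def should_adopt(
    current_correct: Sequence[bool],
    candidate_correct: Sequence[bool],
    alpha: float = 0.05,
) -> bool:
    """Switch to the candidate only if it is significantly better at level alpha."""
    return one_sided_mcnemar_p(current_correct, candidate_correct) < alpha


# Hypothetical validation outcomes: the current model is right on half the examples.
current = [False] * 10 + [True] * 10
clearly_better = [True] * 20
adopt_better = should_adopt(current, clearly_better)
adopt_same = should_adopt(current, current)
```

A wrapper built this way keeps the previous model whenever the evidence is inconclusive, which trades some speed of improvement for fewer non-monotone updates.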

A brief prehistory of double descent

Journal article (2020) - Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, David M. J. Tax

A Distribution Dependent and Independent Complexity Analysis of Manifold Regularization

Conference paper (2020) - Alexander Mey, Tom Julian Viering, Marco Loog

Manifold regularization is a commonly used technique in semi-supervised learning. It enforces the classification rule to be smooth with respect to the data-manifold. Here, we derive sample complexity bounds based on pseudo-dimension for models that add a convex data dependent regularization term to a supervised learning process, as is in particular done in Manifold regularization. We then compare the bound for those semi-supervised methods to purely supervised methods, and discuss a setting in which the semi-supervised method can only have a constant improvement, ignoring logarithmic terms. By viewing Manifold regularization as a kernel method we then derive Rademacher bounds which allow for a distribution dependent analysis. Finally we illustrate that these bounds may be useful for choosing an appropriate manifold regularization parameter in situations with very sparsely labeled data. ...

Nuclear discrepancy for single-shot batch active learning

Journal article (2019) - Tom J. Viering, Jesse H. Krijthe, Marco Loog

Active learning algorithms propose what data should be labeled given a pool of unlabeled data. Instead of selecting randomly what data to annotate, active learning strategies aim to select data so as to get a good predictive model with as few labeled samples as possible. Single-shot batch active learners select all samples to be labeled in a single step, before any labels are observed. We study single-shot active learners that minimize generalization bounds to select a representative sample, such as the maximum mean discrepancy (MMD) active learner. We prove that a related bound, the discrepancy, provides a tighter worst-case bound. We study these bounds probabilistically, which inspires us to introduce a novel bound, the nuclear discrepancy (ND). The ND bound is tighter for the expected loss under optimistic probabilistic assumptions. Our experiments show that the MMD active learner performs better than the discrepancy in terms of the mean squared error, indicating that tighter worst-case bounds do not imply better active learning performance. The proposed active learner improves significantly upon the MMD and discrepancy in the realizable setting and a similar trend is observed in the agnostic setting, showing the benefits of a probabilistic approach to active learning. Our study highlights that assumptions underlying generalization bounds can be equally important as bound-tightness, when it comes to active learning performance. Code for reproducing our experimental results can be found at https://github.com/tomviering/NuclearDiscrepancy. ...
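
To make the notion of a "representative sample" concrete, the sketch below computes the standard biased empirical estimate of the squared MMD between a candidate batch and the unlabeled pool. It is an illustration under assumptions, not the code from the linked repository: the Gaussian kernel, the bandwidth of 1.0, and all function names are choices made here.

```python
from math import exp
from typing import Sequence

Point = Sequence[float]


def gaussian_kernel(x: Point, y: Point, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two points of equal dimension."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2 * bandwidth ** 2))


def mean_kernel(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point], bandwidth: float = 1.0) -> float:
    """Biased empirical estimate of the squared maximum mean discrepancy."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2 * mean_kernel(xs, ys, bandwidth)
    )


# Hypothetical one-dimensional pool and two candidate batches of size two.
pool = [(0.0,), (1.0,), (2.0,), (3.0,)]
spread_batch = [(0.0,), (3.0,)]
clustered_batch = [(0.0,), (1.0,)]
mmd_spread = mmd_squared(spread_batch, pool)
mmd_clustered = mmd_squared(clustered_batch, pool)
```

In this toy setting the batch that covers both ends of the pool has a smaller MMD than the batch clustered at one end, so an MMD-minimizing selector would prefer it.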

Minimizers of the empirical risk and risk monotonicity

Conference paper (2019) - M. Loog, T.J. Viering, A. Mey

Plotting a learner’s average performance against the number of training samples results in a learning curve. Studying such curves on one or more data sets is a way to get to a better understanding of the generalization properties of this learner. The behavior of learning curves is, however, not very well understood and can display (for most researchers) quite unexpected behavior. Our work introduces the formal notion of risk monotonicity, which asks the risk to not deteriorate with increasing training set sizes in expectation over the training samples. We then present the surprising result that various standard learners, specifically those that minimize the empirical risk, can act nonmonotonically irrespective of the training sample size. We provide a theoretical underpinning for specific instantiations from classification, regression, and density estimation. Altogether, the proposed monotonicity notion opens up a whole new direction of research. ...
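
The notion of risk monotonicity described above can be written out as follows. The notation is an assumption introduced for illustration (the abstract itself fixes no symbols): A denotes the learner, S_n an i.i.d. training sample of size n drawn from a distribution D, and R_D(h) the risk of hypothesis h under D.

```latex
% Expected risk after training on n samples; the learning curve is the
% map n -> \bar{R}_n(A, D).
\[
  \bar{R}_n(A, D) \;=\; \mathbb{E}_{S_n \sim D^n}\!\left[ R_D\big(A(S_n)\big) \right]
\]
% Risk monotonicity at sample size n: one more training sample does not
% increase the expected risk.
\[
  \bar{R}_{n+1}(A, D) \;\le\; \bar{R}_n(A, D)
\]
```

A learner acts non-monotonically at size n when this inequality is violated, that is, when the expected risk increases after adding a training sample.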

Generalization Bound Minimization for Active Learning

Abstract (2017) - Tom Viering, Jesse Krijthe, Marco Loog