O.T. Turan | TU Delft Repository

On Sample-Wise Strict Monotonicity with a Gradient Update

Conference paper (2026) - O. Taylan Turan, Marco Loog, David M.J. Tax

Learning curves describe how the performance of a model evolves with increasing training data. Although more data is generally expected to improve model performance, in practice models can exhibit non-monotonic behavior where additional data leads to performance degradation. Sample-wise double descent is one particular example. We address the question of how a learner can have a provably monotone learning curve. For isotropic Gaussian covariates under a Gaussian noise model and a linear predictor, we prove that a single step of steepest descent guarantees sample-wise monotonicity in the learning curve, if the step size does not exceed an upper bound. Furthermore, we present a practical procedure that ensures monotonicity without explicit regularization or cross-validation, using initialization from the previous training set size. Experiments on real-world datasets demonstrate that this method achieves monotone behavior and improved sample efficiency compared to ordinary least squares and optimally regularized ridge regression. We also explore extensions to binary classification, where monotonicity depends on the chosen performance metric. While our guarantees are derived under simplifying assumptions, they provide both theoretical and practical insights for constructing monotone learners and for understanding and mitigating sample-wise double descent behavior. ...
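
The setting described in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the paper's procedure: it assumes a zero initialization, a fixed step size `eta = 0.5` chosen arbitrarily (the paper's upper bound is not stated here), nested training sets, and a unit-norm ground-truth vector. The function name `learning_curves` and all parameter values are hypothetical. It compares minimum-norm least squares with a single gradient step on the mean squared loss.

```python
import numpy as np

def excess_risk(w, w_star):
    # For isotropic Gaussian covariates, E[(x^T w - x^T w_star)^2] = ||w - w_star||^2.
    return float(np.sum((w - w_star) ** 2))

def learning_curves(d=20, sizes=(5, 10, 15, 20, 25, 30, 40), sigma=0.5,
                    eta=0.5, reps=300, seed=0):
    """Average excess risk per training set size for two learners.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    w_star = np.ones(d) / np.sqrt(d)        # unit-norm ground truth (assumption)
    ols = np.zeros(len(sizes))
    one_step = np.zeros(len(sizes))
    n_max = max(sizes)
    for _ in range(reps):
        # Draw the largest training set once; smaller sets are nested prefixes.
        X_full = rng.standard_normal((n_max, d))
        y_full = X_full @ w_star + sigma * rng.standard_normal(n_max)
        for i, n in enumerate(sizes):
            X, y = X_full[:n], y_full[:n]
            w_ols = np.linalg.pinv(X) @ y   # minimum-norm least squares
            w0 = np.zeros(d)                # zero initialization (assumption)
            grad = X.T @ (X @ w0 - y) / n   # gradient of (1/2n) * ||Xw - y||^2
            w_gd = w0 - eta * grad          # single gradient step
            ols[i] += excess_risk(w_ols, w_star)
            one_step[i] += excess_risk(w_gd, w_star)
    return np.array(sizes), ols / reps, one_step / reps
```

Calling `learning_curves()` returns the training set sizes and the two averaged curves. Under these assumptions the minimum-norm curve typically peaks near n = d, while the one-step curve tends to fall as n grows.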

Generalization performance distributions along learning curves

Journal article (2026) - O. Taylan Turan, Marco Loog, David M.J. Tax

Learning curves show the expected performance with respect to training set size. They are often used to evaluate and compare models, tune hyper-parameters, and determine how much data is needed for a specific performance. However, the distributional properties of performance along learning curves are frequently overlooked. Generally, only an average with standard error or standard deviation is reported. In this paper, we analyze the distributions of generalization performance along learning curves. We compile a high-fidelity learning curve database, both with respect to training set size and repetitions of the sampling for a fixed training set size. Our investigation reveals that generalization performance rarely follows a Gaussian distribution for classical classifiers, regardless of dataset balance, loss function, sampling method, or hyper-parameter tuning along learning curves. Furthermore, we show that the choice of statistical summary (the mean versus measures such as quantiles) affects the top model rankings. Our findings highlight the importance of considering different statistical measures and using non-parametric approaches when evaluating and selecting machine learning models with learning curves. ...
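
The effect of the summary statistic on model ranking can be shown with a toy example. This is a sketch with hand-picked synthetic error rates, not data from the paper's database; the helper `rank_models` and the model names "A" and "B" are hypothetical. Model A has mostly low errors with two large outliers, and model B has tightly clustered moderate errors.

```python
import numpy as np

def rank_models(errors, summary):
    """Rank model names by a summary of their per-repetition errors (lower is better)."""
    return sorted(errors, key=lambda name: summary(errors[name]))

# Synthetic, hand-picked error rates at one training set size (illustration only).
errors = {
    "A": np.array([0.10, 0.11, 0.12, 0.10, 0.11, 0.12, 0.10, 0.11, 0.60, 0.70]),
    "B": np.array([0.15, 0.16, 0.17, 0.15, 0.16, 0.17, 0.15, 0.16, 0.17, 0.16]),
}

by_mean = rank_models(errors, np.mean)                        # B ranks first
by_median = rank_models(errors, np.median)                    # A ranks first
by_q90 = rank_models(errors, lambda e: np.quantile(e, 0.9))   # B ranks first
```

With a skewed distribution such as A's, the mean and the upper quantile are dominated by the tail while the median is not, so the preferred model depends on the summary used.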

Learning Curves with Little Data

Doctoral thesis (2026) - O.T. Turan, M.J.T. Reinders, M. Loog, D.M.J. Tax

Obtaining data is often costly, making it important to assess whether collecting additional data is justified by the expected improvement in performance. Learning curves, which describe the expected performance of a learner as a function of dataset size, provide a useful tool for this purpose. They can help practitioners assess whether further data collection is justified by the anticipated gains. However, additional data does not always lead to improved performance, which makes estimating the potential benefit challenging. Under such conditions, model selection also becomes more difficult, as it is unclear how to compare models when no data is available to evaluate their performance at a hypothetical training set size.

In this context, the thesis takes a step back and asks a more fundamental question: how can we reliably reason about generalization when data is scarce and the behavior of learning curves is itself uncertain? Rather than treating learning curves as simple, monotonic functions, we study their full statistical structure. We show that variability across training subsets can influence model comparison, decision making, and performance extrapolation. In addition, we investigate conditions under which monotonic improvement can be guaranteed or encouraged. Beyond single-task learning, we also examine meta-learning, where information from multiple related tasks is leveraged to improve generalization performance while reducing the amount of data required from any individual task.

We begin by showing that the mean, as a statistical summary of learning curves, may not provide a reliable estimate of performance. We demonstrate that generalization performance distributions are often skewed and heavy-tailed, regardless of how they are obtained. As a result, relying solely on the mean for model selection can be suboptimal for some problems.

Next, we propose a semi-parametric extrapolation method that adapts its inductive bias to capture complex and potentially non-monotonic patterns. This approach improves predictive reliability in settings where additional data collection is costly or infeasible and where learning curves may not exhibit monotonic behavior.

We then study the monotonicity of learning curves under specific conditions. For linear regression, we show that a single gradient update is sufficient to ensure monotonic improvement, provided that the learning rate does not exceed a certain threshold. To construct similarly monotonic learners in practice, we propose a data-driven approach for selecting both the learning rate and the initial parameter estimates.

Finally, we investigate the learning curves of a meta-learning algorithm. Through controlled synthetic experiments, we analyze the generalization performance of both meta-learners and task-specific learners, providing insights into how properties of the task distribution influence generalization under a limited adaptation stage consisting of a single gradient update.

Learning Learning Curves

Journal article (2025) - O. Taylan Turan, David M.J. Tax, Tom J. Viering, Marco Loog

Learning curves depict how a model’s expected performance changes with varying training set sizes, unlike training curves, which show a gradient-based model’s performance with respect to training epochs. Extrapolating learning curves can be useful for determining the performance gain with additional data. Parametric functions that assume monotone behaviour of the curves are a prevalent methodology to model and extrapolate learning curves. However, learning curves do not necessarily follow a specific parametric shape: they can have peaks, dips, and zigzag patterns. These unconventional shapes can hinder the extrapolation performance of commonly used parametric curve-fitting models. In addition, the objective functions for fitting such parametric models are non-convex, making them initialization-dependent and brittle. In response to these challenges, we propose a convex, data-driven approach that extracts information from available learning curves to guide the extrapolation of another targeted learning curve. Our method achieves this by using a learning curve database. Using the initial segment of the observed curve, we determine a group of similar curves from the database and reduce the dimensionality via Functional Principal Component Analysis (FPCA). These principal components are used in a semi-parametric kernel ridge regression (SPKR) model to extrapolate targeted curves. The solution of the SPKR can be obtained analytically and does not suffer from initialization issues. To evaluate our method, we create a new database of diverse learning curves that do not always adhere to typical parametric shapes. Our method performs better than parametric and non-parametric learning curve-fitting methods on this database for the learning curve extrapolation task. ...
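
The pipeline outlined in this abstract (select similar curves, reduce dimensionality, fit, extrapolate) can be sketched in simplified form. The code below is a stand-in under stated assumptions, not the paper's method: it uses ordinary PCA via SVD on curves sampled on a shared grid in place of FPCA, and plain ridge regression on the component scores in place of SPKR. The function name `extrapolate_curve` and its parameters are hypothetical.

```python
import numpy as np

def extrapolate_curve(database, observed, n_neighbors=10, n_components=2, lam=1e-6):
    """Extrapolate a partially observed learning curve from similar database curves.

    database: (N, m) array of full curves on a shared grid of m training set sizes.
    observed: first k points of the target curve, with k < m.
    Returns an (m,) estimate of the full target curve.
    """
    database = np.asarray(database, dtype=float)
    observed = np.asarray(observed, dtype=float)
    k = len(observed)

    # 1. Neighbors: curves whose initial segment is closest to the observed one.
    dist = np.linalg.norm(database[:, :k] - observed, axis=1)
    neighbors = database[np.argsort(dist)[:n_neighbors]]

    # 2. PCA of the neighbors via SVD of the centered curves (simplification of FPCA).
    mean_curve = neighbors.mean(axis=0)
    _, _, vt = np.linalg.svd(neighbors - mean_curve, full_matrices=False)
    components = vt[:n_components]                  # shape (r, m)

    # 3. Ridge fit of the component scores using the observed segment only.
    A = components[:, :k].T                         # shape (k, r)
    resid = observed - mean_curve[:k]
    scores = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ resid)

    # 4. Reconstruct the whole curve, including the unobserved tail.
    return mean_curve + scores @ components
```

Because step 3 is a regularized linear least-squares problem, its solution is available in closed form and does not depend on an initialization, which mirrors the motivation given in the abstract.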

Cooperative data-driven modeling

Journal article (2023) - Aleksandr Dekhovich, O. Taylan Turan, Jiaxiang Yi, Miguel A. Bessa

Data-driven modeling in mechanics is evolving rapidly, driven by recent machine learning advances, especially in artificial neural networks. As the field matures, new data and models created by different groups become available, opening possibilities for cooperative modeling. However, artificial neural networks suffer from catastrophic forgetting, i.e. they forget how to perform an old task when trained on a new one. This hinders cooperation because adapting an existing model for a new task affects the performance on a previous task trained by someone else. The authors developed a continual learning method that addresses this issue, applying it here for the first time to solid mechanics. In particular, the method is applied to recurrent neural networks to predict history-dependent plasticity behavior, although it can be used on any other architecture (feedforward, convolutional, etc.) and to predict other phenomena. This work intends to spawn future developments in continual learning that will foster cooperative strategies among the mechanics community to solve increasingly challenging problems. We show that the chosen continual learning strategy can sequentially learn several constitutive laws without forgetting them, using less data to achieve the same error as standard (non-cooperative) training of one law per model. ...