J. Söhl | TU Delft Repository

Competing in a prediction tournament

Bachelor thesis (2026) - T.T. den Rooijen, J. Söhl, R. Versendaal

In recent years, prediction tournaments have been organized more frequently. Organizers of these tournaments aim to identify statistical models that perform best in predicting future events. In most cases, the winner of a prediction tournament receives a reward.

In a prediction tournament, each contestant is asked a number of questions about the probability that an event will occur before a specific date. Simulations indicate that contestants who perfectly predict these probabilities almost never win the tournament. This effect suggests that an accurate forecaster could increase her chance of winning by introducing some noise into her predictions. The aim of this report is to identify strategies that contestants can use to increase their chance of winning.
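The effect described above can be reproduced with a small simulation. The sketch below is only an illustration under assumed settings, not the thesis's simulation: the Brier (quadratic) loss, the Gaussian noise level `sigma`, the uniform true probabilities, and all function names are hypothetical choices.

```python
import random

def brier(pred, outcome):
    """Quadratic (Brier) loss for one binary event; lower is better."""
    return (pred - outcome) ** 2

def clip(p, lo=0.0, hi=1.0):
    """Keep a noisy prediction inside [0, 1]."""
    return max(lo, min(hi, p))

def simulate(n_tournaments=2000, n_events=20, n_noisy=50, sigma=0.1, seed=0):
    """Return the fraction of tournaments won outright by the perfect forecaster."""
    rng = random.Random(seed)
    perfect_wins = 0
    for _ in range(n_tournaments):
        # Assumed setup: true probabilities are uniform on [0, 1].
        truth = [rng.random() for _ in range(n_events)]
        outcomes = [1 if rng.random() < p else 0 for p in truth]
        # The perfect forecaster reports the true probabilities exactly.
        perfect_score = sum(brier(p, o) for p, o in zip(truth, outcomes))
        perfect_is_best = True
        for _ in range(n_noisy):
            # Each opponent reports the truth plus independent Gaussian noise.
            preds = [clip(p + rng.gauss(0.0, sigma)) for p in truth]
            score = sum(brier(q, o) for q, o in zip(preds, outcomes))
            if score < perfect_score:
                perfect_is_best = False
                break
        perfect_wins += perfect_is_best
    return perfect_wins / n_tournaments

if __name__ == "__main__":
    print("perfect forecaster win rate:", simulate())
```

With many noisy opponents, at least one of them usually gets lucky on the realized outcomes, so the win rate of the perfect forecaster stays well below that of the field as a whole.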

In this report, five strategies are introduced: hard-thresholding, soft-thresholding, polynomial strategy, exponential strategy, and random exponential strategy. Each strategy depends on a single parameter. For each strategy, simulations are performed under different settings to determine which strategy results in the most victories. To determine the best parameter for each strategy, polynomial regression is applied to the simulation data.
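As an illustration of the last step, the following sketch fits a polynomial to simulated win counts and reads off the parameter at the maximum of the fit. The quadratic degree, the function names, and the example data are assumptions made for this sketch; the abstract does not state which degree was used.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Power sums S_m = sum(x**m) and moments T_m = sum(y * x**m).
    s = [sum(x ** m for x in xs) for m in range(5)]
    t = [sum(y * x ** m for x, y in zip(xs, ys)) for m in range(3)]
    # Augmented 3x3 system, solved by Gauss-Jordan elimination with pivoting.
    rows = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return tuple(rows[i][3] / rows[i][i] for i in range(3))

def best_parameter(xs, ys):
    """Parameter value at the vertex of the fitted parabola."""
    a, b, c = fit_quadratic(xs, ys)
    if c >= 0:
        raise ValueError("fitted parabola has no interior maximum")
    return -b / (2 * c)

if __name__ == "__main__":
    # Hypothetical data: strategy parameter versus number of wins.
    params = [0.0, 0.5, 1.0, 1.5, 2.0]
    wins = [80, 95, 101, 97, 84]
    print("estimated best parameter:", best_parameter(params, wins))
```

Fitting a smooth curve rather than picking the best simulated grid point reduces the influence of simulation noise on the chosen parameter.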

The simulations suggest that the exponential strategy has the largest positive impact on the number of wins for accurate contestants when all opponents use no additional strategies. If half of the opponents use an exponential or random exponential strategy, then the most accurate contestants are recommended to use a random exponential strategy. Using a strategy appears to have only a negative impact on a contestant’s chance of winning if all opponents use an exponential or random exponential strategy.

The best parameter for an exponential strategy appears to be smaller for less accurate forecasters. The least accurate contestants competing in a prediction tournament are recommended to use no strategy in all previously described situations.

The randomness in prediction tournaments

Bachelor thesis (2026) - E.T. de Vries, J. Söhl, E. Emsiz

Prediction tournaments are competitions in which participants report probabilistic forecasts about uncertain future events and are then ranked by the scores that a scoring rule assigns to those forecasts. The main objective of such a tournament is to select the most accurate forecaster as the winner. However, a fundamental problem known as the prediction tournament paradox shows that in standard winner-take-all competitions, the most accurate forecaster does not have the highest probability of winning. The reasoning behind this paradox is that extreme predictions introduce higher variance in realized scores, which can lead to a winning score despite being less accurate on average.

This thesis analyzes and compares four forecasting competition mechanisms: the standard deterministic mechanism, the Event Lotteries Forecasting Competition mechanism (ELF), the Independent Event Lotteries Forecasting mechanism (I-ELF), and the Wisdom of the Most Accurate Crowd mechanism (WOMAC). ELF and I-ELF add an amount of randomness to the choice of the winner, which makes these mechanisms incentive compatible, although the forecaster with the highest score does not always win. WOMAC scores forecasters against a reference prediction constructed from the other forecasters' predictions rather than against the realized outcome; the forecaster with the highest score wins, and the mechanism is Bayes-Nash incentive compatible. Its disadvantage is that scoring forecasters against a reference prediction, and not against the true probabilities, also adds an amount of randomness.

To select the best mechanism for a prediction tournament, the mechanisms are compared in simulations that use the point mass noise model for realistic forecasting errors. They are evaluated on two criteria: the probability of selecting the most accurate forecaster, and the degree of randomness introduced in winner selection, quantified by the expected rank of the winner.

The results show that while ELF and I-ELF achieve strict dominant strategy incentive compatibility, both mechanisms introduce substantial randomness into winner selection, particularly when the accuracy gap between forecasters is small. The I-ELF mechanism was designed by Witkowski et al. (2021) to reduce this randomness as the number of events grows, and a lower bound on the required number of events is derived using Hoeffding's inequality. The simulations confirm that an unrealistically large number of events would be required to reduce the randomness enough to guarantee a desired probability of the best forecaster winning. The WOMAC mechanism consistently selects the best forecaster with higher probability and less randomness than ELF and I-ELF across all simulated settings.
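To make the idea of a randomized winner concrete, here is a minimal sketch of a per-event lottery in the spirit of ELF. The probability formula (a uniform share 1/n, adjusted by each forecaster's quadratic score relative to the average score of the others) is an assumption based on the lottery described by Witkowski et al. and should be checked against that paper; it is not taken from the thesis, and all names are hypothetical.

```python
import random

def quadratic_score(pred, outcome):
    """Quadratic score scaled to [0, 1]; higher is better."""
    return 1.0 - (pred - outcome) ** 2

def lottery_probs(preds, outcome):
    """Assumed per-event winning probabilities: 1/n plus a share of each
    forecaster's score advantage over the average of the others."""
    n = len(preds)
    scores = [quadratic_score(p, outcome) for p in preds]
    total = sum(scores)
    return [(1.0 + s - (total - s) / (n - 1)) / n for s in scores]

def pick_tournament_winner(all_preds, outcomes, rng):
    """all_preds[i][k] is forecaster i's prediction for event k.
    Each event is raffled separately; most event wins takes the tournament."""
    n = len(all_preds)
    wins = [0] * n
    for k, outcome in enumerate(outcomes):
        probs = lottery_probs([preds[k] for preds in all_preds], outcome)
        event_winner = rng.choices(range(n), weights=probs)[0]
        wins[event_winner] += 1
    top = max(wins)
    # Ties are broken uniformly at random (an assumption of this sketch).
    return rng.choice([i for i, w in enumerate(wins) if w == top])

if __name__ == "__main__":
    rng = random.Random(1)
    forecasts = [[0.9, 0.2, 0.7], [0.5, 0.5, 0.5], [0.1, 0.8, 0.3]]
    print("winner index:", pick_tournament_winner(forecasts, [1, 0, 1], rng))
```

Because every forecaster keeps a winning probability close to 1/n on each event, many events are needed before the best forecaster reliably collects the most event wins, which is the source of the randomness discussed above.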

The findings suggest that for organizations designing prediction tournaments under the given conditions, WOMAC represents the most practical choice, offering the best trade-off between incentive compatibility and reliable identification of the most accurate forecaster.

Statistical analysis of replicate measurements of DNA mixtures

Master thesis (2026) - J.F. Koks, J. Söhl, R.J.F. Ypma, D. Kurowicka

At the Netherlands Forensic Institute, additional replicate measurements of the same DNA trace, referred to as rework, can be performed to obtain more information from a DNA mixture profile. Rework may increase the evidential value, expressed by the likelihood ratio (LR), but it also costs laboratory time, resources, and DNA sample material. This thesis investigates whether the LR after rework can be predicted from the original DNA mixture profile.

Two main contributions were made. First, a simulation framework was developed to construct predictive distributions for the rework LR. Starting from the deconvolution of the original profile, plausible contributor genotypes are sampled, additional replicate profiles are simulated, and the LR of the combined profile is calculated. Second, a Bayesian MCMC implementation was developed for the EuroForMix/DNAStatistX peak-height model, making it possible to propagate uncertainty in the nuisance parameters when computing LR values.

The framework was evaluated on cleaned two-person NFI research data, focusing on minor contributors. The frequentist plug-in simulation was not sufficiently calibrated: nominal 95% prediction intervals covered only 69.0% of the observed minor true-donor rework LRs. Including Bayesian parameter uncertainty improved the empirical coverage to 81.6% and reduced the mean interval score from 50.5 to 21.6. However, the predicted distributions remained insufficiently calibrated for casework use.
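The two evaluation quantities mentioned here, empirical coverage and mean interval score, can be computed as in the sketch below. It uses the standard interval score for a central (1 − α) prediction interval: the interval width plus a penalty of 2/α times the distance by which an observation falls outside. That this is the exact variant used in the thesis, and the scale on which the LRs are evaluated, are assumptions; the function names are hypothetical.

```python
def interval_score(lower, upper, observed, alpha=0.05):
    """Interval score for a central (1 - alpha) prediction interval; lower is better."""
    width = upper - lower
    below = (2.0 / alpha) * max(0.0, lower - observed)
    above = (2.0 / alpha) * max(0.0, observed - upper)
    return width + below + above

def evaluate_intervals(intervals, observations, alpha=0.05):
    """Return (empirical coverage, mean interval score)."""
    pairs = list(zip(intervals, observations))
    hits = sum(lo <= x <= hi for (lo, hi), x in pairs)
    scores = [interval_score(lo, hi, x, alpha) for (lo, hi), x in pairs]
    return hits / len(pairs), sum(scores) / len(scores)

if __name__ == "__main__":
    # Hypothetical intervals and observed values.
    intervals = [(0.0, 10.0), (2.0, 6.0), (1.0, 4.0)]
    observed = [5.0, 7.0, 3.5]
    print(evaluate_intervals(intervals, observed))
```

A well-calibrated nominal 95% interval should cover roughly 95% of observations, and among calibrated methods a lower mean interval score indicates narrower, more informative intervals.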

Overall, this thesis shows that predicting rework LRs is possible in principle and that parameter uncertainty is important for such predictions. The current framework should be viewed as a mathematical proof of concept rather than an operational tool. Further work is needed on artefact modelling, computational scaling, full MCMC validation, extension to more complex mixtures, and validation on casework-like data. ...

Financial Applications of Calibrated Lévy Models

Master thesis (2026) - J. Duro Garijo, J. Söhl, F. Yu

Exponential Lévy models are popular for option pricing, as their jump component captures empirical features of financial data that the Black-Scholes model cannot. Belomestny and Reiß introduced a spectral calibration approach for a single maturity, assuming a constant Lévy triplet on the whole interval. Tendijck and Koorevaar extended this to a time-inhomogeneous approach estimating a distinct triplet on each maturity interval from European put and call prices. Koorevaar further established asymptotic normality and confidence intervals for each of the estimated triplets.
This thesis extends that pointwise normality result to a functional CLT in L²(K) (where K ⊂ ℝ is compact) for the estimated Lévy density. We first derive a candidate covariance kernel of the exponentially-tilted estimation error, and identify a central structural obstruction: the rescaled kernel converges to an oscillatory kernel that is not integrable in ℝ². Hence, the associated covariance operator is not nuclear in L²(ℝ), and a Giné-León Hilbert space CLT cannot be obtained. We therefore restrict the domain to a compact set and modulate the error by a cosine factor to remove the oscillations, which yields a Giné-León CLT for the linear part of the error. The bias and remainder terms vanish under appropriate scaling, yielding a functional Central Limit Theorem for the full cosine-modulated error.
As an application, this result is transferred through the Gil-Pelaez formula to obtain a convergence-in-distribution result for the pricing error of a digital call option where the error enters through estimation of the Lévy density. This enables the computation of finite-sample confidence intervals that bridge the gap between the theory and practice. Finally, possible extensions are discussed. ...
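For reference, the Gil-Pelaez formula recovers a distribution function from a characteristic function, which is what links an estimated Lévy model to a digital option price. The notation below is assumed for illustration and may differ from the thesis: X_T is the log-price at maturity T under the pricing measure with characteristic function φ_T, K is the strike, and r is a constant interest rate.

```latex
\[
  \mathbb{P}(X_T \le x)
  = \frac{1}{2} - \frac{1}{\pi} \int_0^\infty
    \frac{\operatorname{Im}\!\left[ e^{-iux}\, \varphi_T(u) \right]}{u} \, du ,
\]
\[
  D(K) = e^{-rT}\, \mathbb{P}(X_T > \log K)
  = e^{-rT} \left( \frac{1}{2} + \frac{1}{\pi} \int_0^\infty
    \frac{\operatorname{Im}\!\left[ e^{-iu \log K}\, \varphi_T(u) \right]}{u} \, du \right).
\]
```

An estimation error in the Lévy density changes φ_T and therefore enters the digital call price D(K) through this integral.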

Forensic Evidence Interpretation Using Likelihood Ratios

A Study on Prior Probabilities and LR Distributions for DNA Donors

Master thesis (2025) - R.M. Hallema, J. Söhl, R.J.F. Ypma, R.C. Kraaij

This thesis investigates the interpretation of forensic evidence through the use of likelihood ratios (LRs), with a particular focus on the role of prior probabilities and LR distributions in forensic DNA analysis. In forensic science, LRs are commonly used to quantify the strength of evidence in favor of one hypothesis over another. However, challenges arise in practice due to the complexity of DNA mixtures and the necessity of integrating prior information in certain scenarios. The first part of this work explores when and how prior probabilities must be incorporated into LR calculations, demonstrating through theoretical exposition and case studies that neglecting priors or assuming equal priors can lead to misleading conclusions.
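The role of the prior can be seen from the odds form of Bayes' rule, posterior odds = LR × prior odds. The sketch below is a generic illustration with hypothetical numbers, not one of the thesis's case studies.

```python
def posterior_probability(lr, prior):
    """Combine a likelihood ratio with a prior probability for the first hypothesis."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = lr * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

if __name__ == "__main__":
    lr = 1000.0
    # The same evidence under an equal prior and under a small prior.
    for prior in (0.5, 1.0 / 10001.0):
        print(f"prior={prior:.6f} -> posterior={posterior_probability(lr, prior):.4f}")
```

The same LR of 1000 gives a posterior close to 1 under equal priors but only about 0.09 when the prior odds are 1 to 10,000, which is why an unstated or default prior can mislead.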

Two detailed case studies illustrate the impact of introducing new persons of interest (PoIs) and how prior knowledge about associations between individuals can alter posterior probabilities. A comparison is also drawn between categorical and probabilistic approaches in body fluid analysis, with the latter offering a more nuanced interpretation of mRNA profiling data.

In the second part, the thesis introduces methods to estimate LR distributions for DNA contributors. These include threshold-based and genotype sampling techniques, which are tested across synthetic mixtures with varying contributor ratios. Furthermore, the behavior of LRs is studied for relatives of the true donor.

The findings underscore the importance of transparently reporting assumptions about priors and the value of presenting LR tables to facilitate Bayesian reasoning by decision makers. Overall, the thesis contributes to a more robust and interpretable application of statistical reasoning in forensic science. ...

Confidence Sets Based on Shrinkage

Bachelor thesis (2025) - A.C. Bor, J. Söhl, O. Jaibi

We study the problem of constructing confidence sets for the mean vector θ of a k-variate spherically symmetric distribution by centring them at the positive-part James–Stein estimator. Exploiting its superior risk properties whenever k ≥ 3, we first derive in Chapter 2 the exact sampling law of the test statistic T(X) = ∥θ̂⁺_JS − θ∥², showing it consists of a point mass at ∥θ∥² and a continuous density component valid for any spherically symmetric model. In Chapter 3, we develop level-α procedures for both simple and composite hypotheses about θ, illustrated by a worked example with k = 4, α = 0.05. Chapter 4 then inverts these tests to form (1 − α)-confidence sets via two approaches: (i) a plug-in method using various norm estimators (including the James–Stein shrinkage itself) and (ii) a test-inversion principle guaranteeing exact coverage. Numerical comparisons confirm that the plug-in method produces smaller radii than classical sample-mean–centred sets. Our work thus extends classical multivariate inference by integrating shrinkage estimation into confidence-set theory for spherical distributions. ...
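A minimal sketch of the positive-part James–Stein estimator follows. It assumes a single observation X with known scale σ² and the usual shrinkage constant k − 2 from the normal case; the constant appropriate for a general spherically symmetric model may differ, and the function name is hypothetical.

```python
def positive_part_james_stein(x, sigma2=1.0):
    """Shrink the observation x towards the origin, truncating the factor at zero."""
    k = len(x)
    if k < 3:
        raise ValueError("the James-Stein estimator requires k >= 3")
    norm2 = sum(v * v for v in x)
    if norm2 == 0.0:
        return [0.0] * k
    # Positive part: never shrink past the origin.
    shrink = max(0.0, 1.0 - (k - 2) * sigma2 / norm2)
    return [shrink * v for v in x]

if __name__ == "__main__":
    print(positive_part_james_stein([2.0, 0.0, 0.0, 0.0]))
    print(positive_part_james_stein([1.0, 0.0, 0.0, 0.0]))
```

Whenever ∥X∥² ≤ (k − 2)σ² the estimator is exactly zero, so T(X) equals ∥θ∥² with positive probability; this is the point mass in the sampling law described above.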

Analysis of the Prediction Tournament Paradox

Bachelor thesis (2025) - V.M. van der Eng, J. Söhl, H.M. Schuttelaars

In a prediction tournament, contestants are tasked with predicting the distribution of a random variable. To determine which contestant makes the most accurate predictions, scores are assigned based on the outcomes of the random variables. The scoring rules are designed such that a contestant’s expected score decreases as their predicted values approach the true distribution. This implies that the contestant with the lowest score should be the most accurate predictor. However, simulation results show that this is not the case.

For the common case of Bernoulli random variables, we find that the true success probabilities affect the distribution of winners: the effect is positive when the probability is close to 0 or 1 and negative when it is near 0.5. This distribution is not affected by whether contestant errors are drawn from a continuous distribution with fixed variance σ² or are simply +σ or −σ. Furthermore, contestants who make extreme predictions (always predicting 0 or 1) do not outperform those who predict values close to the true success probability. While the choice of scoring rule does influence the distribution of winners, it does not eliminate the paradox. The Pseudospherical and Power scores with parameter β close to 1 and the Logarithmic score perform best.

We extend our analysis to random variables with multiple categories. To support this extension, we introduce a new sampling method that builds on the one used in the earlier simulations. In the binary model, one success probability per random variable suffices; now several are needed per random variable, and they must sum to exactly 1. Using a statistical distance, the total variation distance, we determine how to model contestant predictions. For these random variables we also analyze various scoring rules and find that the Pseudospherical and Power scores with β slightly larger than 1 and the Logarithmic score perform best across various numbers of categories.

Similarly, we extend our analysis to continuous random variables. Because of time constraints, we only consider Normal distributions with known variance. We use the same statistical distance, the total variation distance, to determine how to model contestant predictions. We again compare several scoring rules and find that the Power and Pseudospherical scores with β close to 1 and the Logarithmic score perform best in this scenario.
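The scoring rules named above can be written down for categorical forecasts as in the following sketch. It uses the common textbook forms of the logarithmic, power, and pseudospherical scores in their positively oriented versions (higher is better), so they would be negated to match the lower-is-better convention of this abstract. The exact sign and normalisation used in the thesis are not given here and are assumptions.

```python
import math

def log_score(probs, outcome):
    """Logarithmic score: log of the probability given to the realized category."""
    return math.log(probs[outcome])

def power_score(probs, outcome, beta):
    """Power score for beta > 1; beta = 2 gives a quadratic (Brier-type) score."""
    penalty = (beta - 1) * sum(p ** beta for p in probs)
    return beta * probs[outcome] ** (beta - 1) - penalty

def pseudospherical_score(probs, outcome, beta):
    """Pseudospherical score for beta > 1; beta = 2 gives the spherical score."""
    norm = sum(p ** beta for p in probs) ** (1.0 / beta)
    return (probs[outcome] / norm) ** (beta - 1)

def expected_score(rule, truth, report):
    """Expected score of a report when outcomes follow the true distribution."""
    return sum(q * rule(report, i) for i, q in enumerate(truth))

if __name__ == "__main__":
    truth = [0.2, 0.3, 0.5]
    report = [0.3, 0.3, 0.4]
    rule = lambda p, i: power_score(p, i, beta=1.5)
    print(expected_score(rule, truth, truth), expected_score(rule, truth, report))
```

All three rules are proper: in expectation, reporting the true distribution scores at least as well as any other report, so the paradox arises from the variance of realized scores rather than from a flaw in the rules.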

Deep Learning and Side-Channel Analysis

A Language Model-Inspired framework

Master thesis (2025) - J.N. Ferreira Henriques De Oliveira Martinho, S. Picek, J. Söhl

As the world becomes increasingly digital, cybersecurity, and cryptography in particular, has become a defining concern of this century. Beyond designing robust algorithms, it is vital to evaluate the resilience of devices to adversaries who exploit various aspects of algorithm execution. Side-channel analysis targets physical leakages, such as power consumption and electromagnetic emissions, to extract secret information. State-of-the-art research identifies machine learning attacks using Convolutional Neural Networks (CNNs) and Multi-Layer Perceptrons (MLPs) as the most effective. Language Models have achieved success across diverse domains, some unrelated to language. This thesis investigates their applicability to side-channel analysis and compares their performance with current state-of-the-art methods. Sane or Silly, a language model-inspired framework, is introduced and used to attack the ASCAD datasets. Results demonstrate that this approach can successfully retrieve the key in both ASCADf and ASCADv using only one trace, regardless of whether the secret masks are known during profiling. Desynchronization hindered but did not fully prevent successful attacks. These findings highlight the potential of language models as powerful tools for side-channel analysis. ...

Using masking techniques in authorship attribution

Finding the optimal number of words to mask, taking into account the performance and topic robustness of the model

Bachelor thesis (2025) - A.I. Elias, J. Söhl, C. Kraaikamp

Masking techniques can be useful in authorship attribution because they make attribution methods less dependent on topic. However, if too much is masked, relevant information is lost and the attribution method performs worse, so a careful trade-off has to be made. The aim of this research is to recommend the optimal number of words to mask for different datasets. In addition, different masking techniques are compared across different classification methods. We consider masking based on a general word list (the COCA word list) and masking based on a frequency list built separately for each dataset. Two classification methods are used: support vector machines and logistic regression. For three different datasets (tweets, literary texts, and letters) we examine which masking approach and which classification method work best by comparing their performance. Differences in topic robustness are also examined on the basis of the literature and, combined with the performance results, lead to a recommendation on the optimal number of words to mask. ...
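A minimal sketch of frequency-list masking follows. It assumes that the words kept are the most frequent ones in a word list and that every other word is replaced by a mask token; the tokenisation, the `*` token, and the function names are hypothetical choices, not details from the thesis.

```python
import re
from collections import Counter

def build_frequency_list(texts, top_n):
    """Return the top_n most frequent lower-cased words in a collection of texts."""
    counts = Counter(w for t in texts for w in re.findall(r"\w+", t.lower()))
    return {w for w, _ in counts.most_common(top_n)}

def mask_text(text, keep_words, mask_token="*"):
    """Replace every word that is not in keep_words with the mask token."""
    def replace(match):
        word = match.group(0)
        return word if word.lower() in keep_words else mask_token
    return re.sub(r"\w+", replace, text)

if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    keep = build_frequency_list(corpus, top_n=3)
    print(mask_text("The cat sat on the sofa", keep))
```

Lowering `top_n` masks more words: topic-specific vocabulary disappears first, but so does information that helps to tell authors apart, which is the trade-off the abstract describes.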

Authorship Attribution in a Forensic Context

Master thesis (2024) - W. Hajer, J. Söhl, A. F. van Luenen, G.F. Nane

Authorship attribution is the task of determining the unknown author of a text. In forensic authorship attribution, the likelihood that a suspect has written a specific text of unknown origin is computed based on reference texts from both the suspect and a background population. The current method used at the Netherlands Forensic Institute contains a manual and a computational part. In this thesis, we attempted to improve the computational part of this process. We study this problem from three directions.

Firstly, the performance of state-of-the-art computational authorship attribution methods was assessed on Dutch, forensically relevant corpora. The compared methods were support vector machines combined with masking, using either word or character n-grams as features, BERT-based models using a mean pooling strategy to handle long texts and the baseline, which consists of a logistic regression model with the 100 most frequent Dutch words as features. We notice similar performance differences between state-of-the-art methods as in the literature. The best-performing method was a support vector machine without masking using character n-grams as features. In comparison, both the baseline and BERT-based models perform worse on our corpora.

Secondly, a score-based likelihood ratio system was created to modify the computational authorship attribution methods for usage in forensics. This method is based on kernel density estimators and uses cross-calibration to handle the small number of training and calibration texts of the suspect. For most methods, the performance is in line with the previous performances outside the likelihood ratio system, except for the BERT-based methods, which significantly underperform when part of a likelihood ratio system. This is likely caused by the combination of cross-calibration and the randomness in finetuning BERT models.
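The core of a score-based likelihood ratio can be sketched as follows: kernel density estimates of the classifier score are fitted to same-author and different-author calibration scores, and the LR is the ratio of the two densities at the observed score. The Gaussian kernel, the fixed bandwidth, and all names are assumptions of this sketch, and the cross-calibration step described above is not shown.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate as a function of x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm
    return density

def score_based_lr(score, same_author_scores, different_author_scores, bandwidth=0.2):
    """LR = density of the score under 'same author' / density under 'different author'."""
    f_same = gaussian_kde(same_author_scores, bandwidth)
    f_diff = gaussian_kde(different_author_scores, bandwidth)
    # Note: far from all calibration scores the denominator can underflow to zero.
    return f_same(score) / f_diff(score)

if __name__ == "__main__":
    # Hypothetical calibration scores from an attribution classifier.
    same = [0.80, 0.90, 0.85]
    diff = [0.10, 0.20, 0.15]
    for s in (0.15, 0.50, 0.85):
        print(s, score_based_lr(s, same, diff))
```

With only a handful of calibration scores per suspect, the estimated densities are sensitive to each individual score, which is one reason small calibration sets need special handling.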

Additionally, authorship attribution methods should be topic-robust, such that their attribution is not biased by the topic of a text. We introduced two new metrics to measure the topic-robustness of authorship attribution methods, ‘topic impact’ and ‘conversation impact’. These metrics can only be used on specific types of corpora: the topic impact can be computed on topic-controlled corpora and the conversation impact on conversational corpora. To study whether both metrics measure the topic-robustness of authorship attribution methods for their respective corpus type, we computed the correlation between the results of the two metrics across varying authorship attribution methods and found a correlation of 0.68. As a result, we cannot conclude that the conversation impact is a perfect metric to measure the topic-robustness of methods using conversational corpora, but it does give a good indication of large differences between methods.

Using this new metric, we found that our best-performing methods suffered from a high conversation impact and, as a result, might be more likely to have a low topic-robustness. If more of the infrequent words were masked, the conversation impact decreased, but so did the performance. A trade-off between high performance and high topic-robustness must be made when a model is chosen for real forensic case work. The conversation impact metric we proposed can help quantify these effects on forensically relevant corpora and therefore assist in making better choices.
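The score-based likelihood ratio idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes that similarity scores are already available for same-author and different-author text pairs; the score values, the fixed bandwidth, and the function names are hypothetical, and the cross-calibration step described in the abstract is not reproduced here.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density estimate built from Gaussian kernels centred on `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

def score_based_lr(score, same_author_scores, diff_author_scores, bandwidth=0.1):
    """Likelihood ratio: density of the score under 'same author' over 'different author'."""
    numerator = gaussian_kde(same_author_scores, bandwidth)(score)
    denominator = gaussian_kde(diff_author_scores, bandwidth)(score)
    return numerator / denominator

# Hypothetical calibration scores (illustrative numbers, not taken from the thesis).
same_author = [0.70, 0.80, 0.85, 0.90, 0.75]
diff_author = [0.20, 0.30, 0.35, 0.25, 0.40]

lr_high = score_based_lr(0.8, same_author, diff_author)  # score typical of same-author pairs
lr_low = score_based_lr(0.3, same_author, diff_author)   # score typical of different-author pairs
```

A ratio above 1 supports the same-author hypothesis and a ratio below 1 supports the different-author hypothesis. In practice the bandwidth would be chosen from the data rather than fixed.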

Prediction of Future Values of Player Performance KPIs in Football

Master thesis (2024) - K.W. van Arem, Jakob Söhl, Floris Goes-Smit

The introduction of data-based modeling in football (soccer) in the last decade has led to the creation of models that describe player performance through key performance indicators (KPIs). However, relying solely on historical and current KPI values is insufficient for scouting departments, as predicting future values could significantly enhance transfer decision-making. This research aimed to identify the optimal model for forecasting the development of player performance KPIs over the next year, focusing on explainability, uncertainty quantification, and predictive performance.
To achieve this, we implemented linear models, tree-based models, and time-series-based kNN models to forecast two specific KPIs one year in the future: SciSkill, which measures the general quality of a player, and Estimated Transfer Value, representing the player’s monetary value. Tree-based models showed the best predictive performance. The random forest in particular emerged as the best due to its explainable predictions, uncertainty quantification method based on bagging, and good predictive performance. In the SciSkill case study, the random forest model achieved low loss values, especially for young players. For the Estimated Transfer Value, the random forest model demonstrated the best predictive performance on the general set of players, and specifically on the subset of players valued at over €10 million.
Our findings suggest that tree-based models, particularly the random forest, are well-suited for predicting the future development of football player performance KPIs. Although it is important to monitor the predictive performance using the most recent data, the insights and the resulting models of this research can enhance scouting decisions via both data-informed and data-based decision-making. Finally, this research paves the way to study the influence of time series information or contextual information on player performance metrics.

Covariance intersection for continuous Kalman filters

Master thesis (2024) - J.C.C. Schikhof, F.H.J. Redig, R.C. Kraaij, J. Söhl

The Kalman filter is a recursive algorithm that estimates the state of a dynamic system subject to measurement and model noise. If all noise terms affecting the system are white Gaussian noise with known mean and variance, and all noise terms are independent of each other, then the Kalman filter is the optimal estimator for the state variable. When measurements are collected from multiple sources, the covariance between these sources should be known or the sources should be independent to ensure that the estimate made by the Kalman filter is optimal. When the covariance between dependent measurement sources is not known, various methods exist which provide a solution to this problem. This thesis discusses two methods: the H∞ filter and covariance intersection... ...
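For reference, the standard discrete-time form of covariance intersection fuses two estimates $(\hat{x}_1, P_1)$ and $(\hat{x}_2, P_2)$ whose cross-covariance is unknown. The symbols and the weight $\omega$ are notation assumed here, and the continuous-time formulation studied in the thesis may differ:

$$P^{-1} = \omega P_1^{-1} + (1-\omega) P_2^{-1}, \qquad \hat{x} = P\left(\omega P_1^{-1}\hat{x}_1 + (1-\omega) P_2^{-1}\hat{x}_2\right), \qquad \omega \in [0,1]$$

The weight $\omega$ is commonly chosen to minimize the trace or the determinant of the fused covariance $P$.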

Efficiënt schatten van viscositeit en diffusiviteit met beperkte rekencapaciteit (Efficient estimation of viscosity and diffusivity with limited computational capacity)

Bachelor thesis (2024) - R.H.P. Rouwhorst, J. Söhl, I.A.M. Goddijn

In this thesis, a method is developed to estimate the viscosity and self-diffusivity of molecules by means of molecular simulations. The simulations are carried out in cubic volumes with different edge lengths $L$, where both the computational cost and the accuracy of the measurements depend on the cube size. Measurements in smaller cubes require less computing power but yield inaccurate results, whereas measurements in larger cubes give more precise estimates at a higher computational cost.

The relation between the true self-diffusivity $D^\infty_{\text{self}}$ and the measured self-diffusivity $D^{\text{MS}}_{\text{self}}$ is described by the formula:

$$D^\infty_{\text{self}} = D^{\text{MS}}_{\text{self}} + \frac{\xi k_B T}{6\pi \eta L}$$

where $L$ is the edge length and $\eta$ the viscosity. This leads to a linear regression model in which $D^{\text{MS}}_{\text{self}}$ is the dependent variable and $\frac{1}{L}$ the independent variable. The intercept and the slope must be estimated accurately.

Weighted regression was used to address heteroscedasticity in the simulation errors. Measurements in larger cubes carry more weight in the regression because they are more accurate, while measurements in smaller cubes carry less weight. The variance function $\sigma(x)$ and the cost function $f(x)$ play a crucial role in determining the optimal edge lengths and weights. In particular, it is shown that for higher orders of the variance function (e.g. $\sigma(x) = x^3$) larger edge lengths are preferred.

In addition, the simultaneous estimates of both the viscosity and the self-diffusivity were improved by using the covariance matrix of the regression. The dependence between the intercept and the slope is used to shrink the confidence region by 85\% compared with a traditional approach without covariance. This is illustrated using confidence regions, which showed that including the covariance helps reduce the uncertainty in the estimates.
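The regression described in this abstract can be sketched as a weighted least-squares fit of $D^{\text{MS}}_{\text{self}}$ against $\frac{1}{L}$. The sketch below is illustrative only: the data are synthetic and noise-free, and the weights assume a measurement variance proportional to $(1/L)^2$, which is a hypothetical choice rather than the variance function used in the thesis.

```python
def weighted_linear_fit(x, y, weights):
    """Weighted least squares for y = intercept + slope * x."""
    total_weight = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total_weight
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total_weight
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Synthetic, noise-free example in arbitrary units (all values are assumptions).
box_lengths = [2.0, 4.0, 8.0, 16.0]
d_inf_true = 5.0    # plays the role of the true self-diffusivity (the intercept)
slope_true = -3.0   # plays the role of -xi * k_B * T / (6 * pi * eta)

inv_l = [1.0 / length for length in box_lengths]
d_ms = [d_inf_true + slope_true * x for x in inv_l]

# Assumed weights: inverse variance, with variance proportional to (1/L)^2,
# so that larger boxes count more heavily.
weights = [length ** 2 for length in box_lengths]

intercept, slope = weighted_linear_fit(inv_l, d_ms, weights)
```

Under the formula above, the intercept estimates $D^\infty_{\text{self}}$, and the viscosity follows from the slope as $\eta = -\xi k_B T / (6\pi \cdot \text{slope})$.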

Applications of Statistical Theory to Sensor Data Analysis

Doctoral thesis (2024) - M.G. Ciszewski, G. Jongbloed, J. Söhl

Technological progress irreversibly changes the nature of sports. The relevance of technology is readily apparent to most spectators of tennis, football and many other elite sports. Some technologies, however, have changed sport in ways that many spectators might not be aware of. Behind any professional sport, there are countless hours of training and preparation. Athletes push their own limits in pursuit of perfection. Coaches try to make sure that the training their athletes go through improves performance without overstraining them, which can lead to injury. Today's technology helps with this training process, and coaches need to be able to use it to provide good feedback to their athletes.
This thesis is written in the context of the Citius Altius Sanius (CAS) project aimed at injury prevention and performance improvement in sports. The CAS project combines the expertise of data scientists, industrial designers and biomechanical engineers together with the resources of sports associations and sports equipment designers among others. The goal of the CAS project is to initiate collaboration between various universities and departments to develop sensor technology, provide analysis based on the sensor data and provide a clear guideline of feedback to the athlete.
The primary goal of this thesis is to extract meaningful insights from sensor data through statistical modeling. Two sources of sensor data are used within the thesis: data from prototype sensor trousers worn by football players during training and data from a sensor sleeve worn by tennis players during serve practice. The research employs supervised learning algorithms within the framework of machine learning and deep learning models for capturing intricate patterns in the data as well as functional data analysis techniques such as functional principal components analysis and functional regression models applied for imputation purposes and dimension reduction.
We used a neural network architecture that mixes convolutional and recurrent layers consistently throughout this thesis. The main application of this network lies in recognizing football-related activities using sensor data. The neural network achieves good accuracy and is easily adaptable to other human activity recognition problems. We also considered various other models for this task; however, none could match the computational speed and accuracy of the neural network. Nonetheless, given the plethora of methods that were tested and our dissatisfaction with the accuracy measures used to assess them, a novel quality measure was introduced for activity recognition problems that leverages domain knowledge to determine the accuracy of an activity recognition method. In our application, one of the constraints is the length of the predicted activities. The measure accounts for the fact that activities such as jumping or passing a ball realistically have a minimum duration. Instances where a prediction model outputs an activity shorter than physically plausible incur harsh penalties.
We also propose a novel post-processing procedure tailored specifically to human activity recognition problems, ensuring that predictive models adhere to physical constraints, such as the minimum duration of activities. This post-processing method aims to increase the accuracy of prediction models which violate these constraints and as a result, to narrow the gap in accuracy between different prediction methods.
In the context of tennis, we encountered difficulties in predicting the serve performance metrics using sensor data. While predicting the ball speed can be easily achieved, accurately predicting the velocity-accuracy index (VA index), which combines ball speed with serve accuracy, proved more complex. To assess the effectiveness of our model in distinguishing true predictions from noise, we applied a permutation test. Notably, the main contribution of this research lies in the rigorous formulation of the null hypothesis for this test, linking it to established permutation test theory.
This research contributes to the fields of sports science and data analysis by offering insights into activity recognition and performance prediction using sensor data. The methodologies developed here have potential applications across various other sports as well as activities unrelated to sports. While the data provided for this research comes from wearable sensors, it is also possible to apply these models and procedures to other types of sensor data or even beyond.
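The permutation test mentioned in this abstract can be illustrated with a generic sketch. It assumes that the test statistic is the Pearson correlation between predicted and observed values and that the null hypothesis is "predictions are unrelated to outcomes". The data, the function names, and the choice of statistic are hypothetical; the thesis's own formulation of the null hypothesis is not reproduced here.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_p_value(predicted, observed, n_permutations=999, seed=0):
    """One-sided p-value: how often shuffled outcomes correlate at least as well."""
    rng = random.Random(seed)
    statistic = pearson(predicted, observed)
    shuffled = list(observed)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between predictions and outcomes
        if pearson(predicted, shuffled) >= statistic:
            exceed += 1
    # Adding 1 to both counts includes the observed arrangement and avoids p = 0.
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical predictions and measurements (illustrative numbers only).
predicted = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.0]
observed = [1.2, 1.9, 3.1, 4.0, 4.8, 6.1, 6.9, 8.3]

p_value = permutation_p_value(predicted, observed)
```

A small p-value indicates that the agreement between predictions and outcomes is unlikely to arise from noise alone.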

On the Influence of Whitening Transformations on Hyperspectral Data

Master thesis (2023) - G.K. van der Wal, M.B. van Gijzen, J. Söhl, R. G. Satink

The whitening transformation transforms a random matrix into a whitened matrix with expectation 0 and covariance matrix I. By removing the first and second order statistical structures, higher order structures can be looked at for better classification. This is why Stage Gate 11 B.V. has employed whitening in the preprocessing of their hyperspectral data. The aim of this work is to gain insight into the whitening transformation and how it influences hyperspectral data.
To gain this insight, synthetic data was created and used to make synthetic scans. The signal-to-noise ratio of a target spectrum was calculated, and Monte Carlo simulations were used to reveal hidden patterns in the data. In the case of a high-contrast scenario, multi-area whitening was employed and the cosine similarity between the target spectrum and its signature was determined. It was observed that the shape and intensity of the whitened target spectrum differ depending on whether pixels or wavelengths were used as observations. However, both are subject to the ‘bleeding’ effect. Further, it was found that if the number of pixels in the scan is greater than the number of spectral bands (548), then the signal-to-noise ratio improves as the number of whitened pixels in the scan increases. In the case of a high-contrast scenario, multi-area whitening guarantees the uniformity of the spectra, resulting in a higher cosine similarity between the target spectrum and its signature. But as multi-area whitening uses a smaller number of pixels in the scan, it cannot be concluded whether multi-area whitening is better than global whitening, as it is not known how the increase in cosine similarity and the decrease in signal-to-noise ratio relate to the classification process. Finally, it is concluded that when working with real and unknown data, using pixels as observations is much more feasible. ...

The Expressive Power of (Multi)Set-based higher-order Graph Neural Networks

Master thesis (2023) - A. VASILEIOU, J. Söhl, Christopher Morris

Graph data is widely used in various applications, driving the rapid development of graph-based machine learning methods. However, traditional algorithms tailored for graphs have constraints in capturing intricate node relationships and higher-order patterns. Recent insights from prior research have shed light on comparing different graph neural network architectures. This work introduced higher-order neural networks capable of grasping complex graph patterns. Nevertheless, these models encounter scalability issues that hinder their application to real-world datasets. This thesis builds on this foundation by theoretically assessing the various neural architectures proposed in those studies. Moreover, we present novel models aiming to find a middle ground between capturing higher-order patterns and maintaining scalability. Our objective is to enhance the modeling capabilities of graph-based algorithms and address existing limitations. Additionally, we implemented our models on benchmark datasets to gauge their performance. The outcomes confirm that our models achieve notably improved generalization compared to conventional graph neural networks. Furthermore, our models exhibit substantial scalability enhancements when contrasted with other higher-order graph neural networks. This research contributes to graph machine learning by offering more efficient and scalable methods for capturing higher-order patterns in graph data. ...

Bootstrap-based bias correction for the out-of-sample Sharpe ratio

Bachelor thesis (2023) - D. Karjadi, J. Söhl, L.E. Meester

When making an investment, one objective could be to find a portfolio that maximizes the future Sharpe ratio, known as the out-of-sample Sharpe ratio. Since future data is not available, the Sharpe ratio needs to be predicted using historical data, the in-sample data. This is often done using the Sharpe Ratio Information Criterion, which determines the bias of the in-sample Sharpe ratio to estimate the out-of-sample Sharpe ratio. However, this approach assumes that the covariance matrix is known. In portfolio management, the covariance matrix is typically unknown and can only be estimated. This project uses the bootstrap method to estimate the out-of-sample Sharpe ratio using the estimated covariance matrix and methods analogous to those used for the Akaike Information Criterion. By eliminating the assumption of a known covariance matrix, this method becomes more widely applicable. Simulations are also performed with a known covariance matrix, demonstrating that the bootstrap method is an effective approach for estimating the out-of-sample Sharpe ratio. We then consider some extensions of the bootstrap method, and finally we apply the bootstrap method to stocks in the Dutch and American stock markets, showing that the in-sample Sharpe ratio is often overly optimistic compared to the out-of-sample Sharpe ratio. We thus achieved our goal of finding an effective way to estimate the out-of-sample Sharpe ratio without assuming that the covariance matrix is known, which makes this method much more suitable for predicting the future Sharpe ratio.
...

Football activity recognition

Improving and testing football activity recognition based on signal data using deep learning

Master thesis (2023) - R. Tebbens, K.M.B. Jansen, G. Jongbloed, J. Söhl, M.G. Ciszewski

There is a rising demand for player statistics in the world of football. With the developments in wearable sensors over recent years, Human Activity Recognition (HAR) based on wearable IMU sensors can be used to meet this demand. This thesis builds upon earlier research on this topic, in which an end-to-end pipeline based on deep learning was created that can be trained and used for football activity recognition. The goal is to test and improve said pipeline. This was done by adding changes of direction (CODs) to the classifiable activities. Furthermore, run velocities were modeled as a spectrum with several categories depending on the speed. A combination of convolutional and recurrent layers resulted in test accuracies up to 88.9%.
Afterwards, the pipeline was used to evaluate larger datasets containing football drills and a football physiotherapy training session. For this, a sliding window evaluation procedure was proposed. These evaluations gave promising results. Many actions and football-related activities could be recognized; however, many smaller, shorter actions were missed. This can be attributed to a lack of training data: the data contained few activities with the ball, so the deep learning models could not be trained accordingly. Later, it was investigated whether additional training on activities with the ball improved the evaluation. This was indeed confirmed, since the evaluations showed more detailed and realistic results. Including even more training data could result in the pipeline performing reliably in real-life football scenarios. ...

Mathematics as a secret weapon against criminals

Employing score-based likelihood ratio systems for the comparison of handwriting and studying their quality of performance

Bachelor thesis (2022) - A.J. Wijker, J. Söhl, A.T. Hensbergen

In this report a new approach to (forensic) handwriting analysis is presented; score-based likelihood ratio (SLR) systems are employed and their quality of performance is studied. These systems compare elements of handwriting based on their characteristics and give an insight into the degree of uncertainty of the statement that two writings have the same writer. They can be used in forensic and fraud investigations. ...

Detecting irrigation of potato parcels in the Northern Netherlands using remotely sensed SAR images

Master thesis (2022) - R.M. Vos, A.W. Heemink, Frederike de Visser-Bleijenberg, J. Söhl

As a response to the dry summer of 2018, Witteveen+Bos developed a model for water demand prediction to improve insight into water demands. Validation by water board "Hunze en Aas" has revealed the predictive power of the irrigation model to be very limited. For this thesis project, we developed a methodology for the detection of irrigation of crop parcels based on the radar vegetation index (RVI) derived from remote SAR images. This methodology can be used to improve the existing irrigation model.

To achieve this, we developed a novel model to describe the evolution of a vegetation index (such as RVI) during the growth season. Unlike existing models, the model presented in this thesis includes the effect of precipitation deficit, both as a temporary inhibitor of a vegetation index, and as a long-term influence on the crop growth. The model is non-linear in many of its model parameters. Therefore, heuristic calibration methods are unavoidable. We show that the standard calibration methods non-linear least squares and differential evolution are outperformed by a hybrid of both methods that we specifically designed for this application.

After calibrating the model to time series of 1167 potato parcels in the north-east of the Netherlands, we investigate different ways to cluster the model parameters. We propose explanations for three important clusterings through their RVI time series and (speculative) environmental factors. Comparison with information on irrigated parcels for the years 2018-2020 reveals a statistically significant correlation between some of the clusters and irrigation. However, the variation in irrigation rate never exceeded a factor of two. Therefore, no accurate classifier can be built based on these clusters.

We recommend two important ways to improve the current implementation. Firstly, the baseline RVI is consistently overestimated, resulting in mostly negative normalized RVI. Because of this, the model cannot properly describe precipitation deficit-driven fluctuations in the RVI. These fluctuations are an important part of system behaviour, so improving the estimation of the baseline RVI should be the first priority for future research.

Secondly, the exact irrigation dates of a set of parcels will be very useful. Comparing these dates to the corresponding RVI time series will make it possible to uncover features of the RVI evolution that are indicators of irrigation. The model parameterization can then be tuned to optimize sensitivity to these features.
