S.R. Bongers | TU Delft Repository

The Impact of Initial Start Distribution Mismatch on Policy Evaluation in Behavior-agnostic Reinforcement Learning

Bachelor thesis (2024) - T. Sabău, F.A. Oliehoek, S.R. Bongers, C.M. Jonker

Behavior-agnostic reinforcement learning is a rapidly expanding research area focusing on developing algorithms capable of learning effective policies without explicit knowledge of the environment's dynamics or specific behavior policies. It proposes robust techniques to perform off-policy evaluation, namely Distribution Correction Estimation (DICE) methods, in the context of infinite horizon Markov Decision Processes (MDPs). This research paper investigates the impact of the initial start distribution mismatch on the accuracy of DICE estimators in behavior-agnostic reinforcement learning. To achieve this, seven systematic initial start distributions were created and utilized to calculate the initial start distribution mismatch via Kullback–Leibler (KL) divergence. Furthermore, off-policy evaluation performance was assessed using DICE estimators, with Mean Squared Error (MSE) comparisons against ground truth values. The study reveals that, based on the conducted experiments, the initial start distribution mismatch does not have a clear influence on the performance of the DICE estimators. Therefore, future research is required to increase the scope of the experiments and address some of the limitations of this study to accurately assess the impact of the initial start distribution mismatch on off-policy evaluation using DICE methods. This paper underscores the complexity of the initial start distribution choice in behavior-agnostic reinforcement learning, calling for further research to effectively evaluate its impact across diverse environments and measures. Additionally, exploring the relation between the initial start distribution and policies could provide deeper insights and further refine the understanding of their influence on DICE estimators. ...
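The two measures named in this abstract can be illustrated with a minimal sketch: the KL divergence between two discrete initial-state distributions, and the MSE of a set of estimates against a ground-truth value. The function names, the two distributions, and the numbers are hypothetical and are not taken from the thesis, which may compute the mismatch in a different direction or over a different state space.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same states.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_squared_error(estimates, ground_truth):
    """MSE of repeated estimates against a single ground-truth value."""
    return sum((e - ground_truth) ** 2 for e in estimates) / len(estimates)

# Hypothetical initial-state distributions over four states.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

mismatch = kl_divergence(skewed, uniform)  # larger value = larger mismatch
error = mean_squared_error([0.9, 1.1, 1.0], ground_truth=1.0)
```

Note that the KL divergence is not symmetric, so the order of the two arguments is itself a modelling choice.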

The Effect of State-visitation Mismatch on Off-policy Performance in Behaviour-agnostic Reinforcement Learning

Bachelor thesis (2024) - K.Y. Chen, S.R. Bongers, F.A. Oliehoek, C.M. Jonker

Off-policy evaluation has several key problems, one of which is the “curse of horizon”. With recent breakthroughs [1] [2], new estimators have emerged that apply importance sampling to individual state-action pairs and rewards rather than to the whole trajectory. When the behaviour and target policies differ, a state-visitation mismatch occurs. This paper asks how the degree of state-visitation mismatch affects the overall target policy performance. The approach is to calculate the state-visitation mismatch with the KL divergence, which is computed from the state-visitation distribution of the behaviour policy and the distribution correction ratio of the DICE estimator. The state-visitation mismatch can be quantified in this way. Furthermore, the effect on the target policy performance is quantified by the MSE between the estimated, empirical cumulative reward and the reward estimated by the DICE estimator. By analysing the KL divergence and MSE values, one may argue that the state-visitation mismatch does impact the performance of the target policy, but further research needs to be conducted. ...

SimuDICE: Offline Policy Optimization Through Iterative World Model Updates and DICE Estimation

Bachelor thesis (2024) - C. Brita, F.A. Oliehoek, S.R. Bongers, C.M. Jonker

In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatched state-action distribution between the target policy and the behavioral policy that generated the data. Learning a dynamic model of the environment can improve the sample efficiency of the algorithm, but this mismatch can lead to the generation of suboptimal experiences. We propose SimuDICE, an algorithm that enhances the sampling of imaginary experiences using Dual stationary DIstribution Correction (DICE), and iteratively improves the DICE estimations with synthetically generated experiences. SimuDICE addresses the objective mismatch issue by iteratively updating both the world model and the DICE estimator, aligning the model's training objective (imitating the environment) with its usage objective (policy improvement). We show that SimuDICE requires less pre-collected data and fewer simulated experiences to achieve comparable results to other algorithms while having greater robustness to lower data quality. ...

Impact of State Visitation Mismatch Methods on the Performance of On-Policy Reinforcement Learning

Bachelor thesis (2024) - H. Cho, F.A. Oliehoek, S.R. Bongers, C.M. Jonker

In the field of reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitation mismatch methods on the performance of on-policy RL methods, an area crucial for improving policy performance in real-world applications where behavior policies are often unknown. Specifically, we compare the convergence speed and performance of Q-learning when initialized with Q-values learned through the Distribution Correction Estimation (DICE) method versus traditional random initialization. By generating datasets representing behavior and target policies, we employ the DICE estimator to initialize Q-values, and subsequently run Q-learning for both DICE-initialized and randomly-initialized scenarios. Our results demonstrate that initializing Q-learning with DICE Q-values enhances convergence speed, leading to faster attainment of near-optimal policies. This study provides valuable insights into the effectiveness of state visitation mismatch methods in improving the efficiency and performance of on-policy RL algorithms, contributing to the development of more robust RL applications in behavior-agnostic settings. ...

Use of sample-splitting and cross-fitting techniques to mitigate the risks of double-dipping in behaviour-agnostic reinforcement learning

Comparative Analysis

Bachelor thesis (2024) - Y. Aslan, S.R. Bongers, F.A. Oliehoek, C.M. Jonker

This paper addresses the issue of double-dipping in off-policy evaluation (OPE) in behaviour-agnostic reinforcement learning, where the same dataset is used for both training and estimation, leading to overfitting and inflated performance metrics especially for variance. We introduce SplitDICE, which incorporates sample-splitting and cross-fitting techniques to mitigate double-dipping effects in the DICE family of estimators. Focusing specifically on 2-fold and 5-fold cross-fitting strategies, the original off-policy dataset is partitioned with random-split to get separate training and evaluation datasets. Experimental results demonstrate that SplitDICE, particularly with 5-fold cross-fitting, significantly reduces error, bias, and variance compared to naive DICE implementations, providing a more doubly-robust solution for behavior-agnostic OPE. ...
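The idea of K-fold cross-fitting described in this abstract can be sketched as follows: the data is split at random into K folds, and each fold is evaluated with a model fitted only on the remaining folds. This is a generic illustration and not the SplitDICE implementation; `cross_fit`, `fit_baseline`, and `evaluate_with_baseline` are hypothetical names, and the toy fit and evaluate functions merely stand in for training a DICE estimator and computing its policy-value estimate.

```python
import random

def cross_fit(data, k, fit, evaluate, seed=0):
    """Average the per-fold estimates of a k-fold cross-fitting scheme."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)        # random split of the dataset
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds
    estimates = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in indices if i not in held]
        test = [data[i] for i in held_out]
        model = fit(train)                       # fitted without the held-out fold
        estimates.append(evaluate(model, test))  # scored only on the held-out fold
    return sum(estimates) / k

# Toy stand-ins (assumptions): the "model" is a baseline reward learned on
# the training folds, and the estimate corrects it with held-out residuals.
def fit_baseline(train):
    return sum(train) / len(train)

def evaluate_with_baseline(baseline, test):
    return baseline + sum(r - baseline for r in test) / len(test)

rewards = [float(i % 5) for i in range(100)]
estimate_2_fold = cross_fit(rewards, 2, fit_baseline, evaluate_with_baseline)
estimate_5_fold = cross_fit(rewards, 5, fit_baseline, evaluate_with_baseline)
```

Because no sample is used both to fit the model and to score it, the same dataset is never "dipped into" twice within a fold.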

Understanding Risk Extrapolation (REx) and when it finds Invariant Relationships

Bachelor thesis (2022) - J.L. Hofland, J.H. Krijthe, R.K.A. Karlsson, S.R. Bongers, T. Höllt

Generalizing models to new, unknown datasets is a common problem in machine learning. Algorithms that perform well on test instances with the same distribution as their training dataset often perform poorly on new datasets with a different distribution. This problem is caused by distributional shifts between training the model and applying it to a test domain. This paper addresses whether and in what situations Risk Extrapolation (REx) can tackle this problem of Out-Of-Distribution generalization by exploiting invariant relationships. These relationships are based on features that are invariant across all domains. By learning these relationships, REx aims to learn the concept of the problem we are trying to solve. We show in what situations REx can learn these invariant relationships and when it cannot. We translate the definition of an invariant relationship into a homoscedastic synthetic dataset with either covariate, confounded, anti-causal, or hybrid shift. We run experiments on REx that vary the sample complexity, the number of training domains, and the training domain distance. We show that REx performs better for invariant prediction in situations with larger sample sizes and training domain distance, and that if these criteria are met, REx performs equivalently in all four distributional shifts. We also compare REx to Invariant- and Empirical Risk Minimization and show that REx is less sensitive, and thus more robust, to shifts in the average distributional variance of the training domains, and that REx asymptotically outperforms these methods in the more complex distributional shifts. ...
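A minimal sketch of how a REx-style objective scores a model, assuming the variance-penalised variant (commonly called V-REx), may help here. The abstract does not say which variant the thesis uses. The function name, the risk values, and the value of `beta` are hypothetical.

```python
def vrex_objective(domain_risks, beta):
    """Sum of per-domain risks plus beta times their (population) variance.

    A large beta pushes the model towards equal risk in every training
    domain; beta = 0 ignores the variance and leaves only the summed risk.
    """
    n = len(domain_risks)
    mean_risk = sum(domain_risks) / n
    variance = sum((r - mean_risk) ** 2 for r in domain_risks) / n
    return sum(domain_risks) + beta * variance

# Two hypothetical models with the same total risk over two domains.
balanced = vrex_objective([1.0, 1.0], beta=10.0)    # equal risk in both domains
unbalanced = vrex_objective([0.0, 2.0], beta=10.0)  # risk concentrated in one domain
```

With equal total risk, the model whose risk is spread evenly across domains receives the lower objective value.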

Can Invariant Risk Minimization resist the temptation of learning spurious correlations?

Bachelor thesis (2022) - AJ.A.E. van Lith, R.K.A. Karlsson, S.R. Bongers, J.H. Krijthe

Learning algorithms can perform poorly in unseen environments when they learn spurious correlations. This is known as the out-of-domain (OOD) generalization problem. Invariant Risk Minimization (IRM) is a method that attempts to solve this problem by learning invariant relationships. Motivating examples as well as counterexamples have been proposed about the performance of IRM. This work aims to clarify when the method works well and when it fails by testing its ability to learn invariant relationships. Therefore, experiments are done on a synthetic data model which simulates four data distribution shifts: covariate shift (CS), confounder-based shift (CF), anti-causal shift (AC), and hybrid shift (HB). The experiments exploit IRM’s behaviour with respect to hetero- and homoskedasticity and adaptation of the training environments. We measure the error with regard to the optimal invariant predictor and compare to the non-invariant Empirical Risk Minimization (ERM). The results show that IRM is generally able to learn invariance for the CS and CF shifts, especially when the deviation between the training environments is large. In the AC and HB shifts, this strongly depends on the values of the training environments. ...

Evaluating the Performance of the Model Selection with Average ECE and Naive Calibration in Out-of-Domain Generalization Problems for Binary Classifiers

Bachelor thesis (2022) - A. Liu, J.H. Krijthe, R.K.A. Karlsson, S.R. Bongers, T. Höllt

Out-of-domain (OOD) generalization refers to learning a model from one or more different but related domain(s) that can be used in an unknown test domain. It is challenging for existing machine learning models. Several methods have been proposed to solve this problem, and multi-domain calibration is one of these methods. Model selection with the average expected calibration error (ECE) across training domains and naive calibration are two approaches to implementing multi-domain calibration. However, it might happen that neither approach can learn a genuinely well-calibrated model in the multi-domain setting. Hence, this paper intends to evaluate how naive calibration and model selection with average ECE perform in the OOD generalization problem for binary classifiers. We generated many synthetic datasets and set up three experiments to answer this question. Finally, the conclusions based on empirical results are obtained: 1) Although naive calibration can improve the average accuracy across unseen domains (OOD accuracy) and the average area under the ROC Curve across unseen domains (OOD AUROC) for some binary classifiers, it does not work for all binary classifiers. However, at least it does not make the model worse for OOD generalization. 2) On the synthetic datasets we generated, if the number of training domains increases, most binary classifiers' OOD accuracy will also increase. 3) Average ECE is a reasonable metric for selecting a model in the OOD generalization problem and is better than validation accuracy. This is because a strong linear relationship exists between OOD accuracy and the average ECE across the training domains. This linear relationship is stronger than the linear relationship between OOD accuracy and validation accuracy. ...
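For readers unfamiliar with the metric, the sketch below shows one common binned form of the expected calibration error for a binary classifier, averaged over several training domains. The equal-width binning, the number of bins, and the function names are assumptions; the thesis may use a different ECE variant.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |fraction positive - mean predicted prob|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry no weight
        mean_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(frac_pos - mean_prob)
    return ece

def average_ece(domains, n_bins=10):
    """Mean ECE over (probs, labels) pairs, one pair per training domain."""
    return sum(expected_calibration_error(p, y, n_bins) for p, y in domains) / len(domains)

# Hypothetical predictions for two training domains.
well_calibrated = ([0.8] * 5, [1, 1, 1, 1, 0])  # 80% predicted, 80% positive
overconfident = ([0.9, 0.9], [0, 0])            # 90% predicted, 0% positive
score = average_ece([well_calibrated, overconfident])
```

Model selection with average ECE would then prefer the candidate model with the lowest `score` across the training domains.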

Group Distributionally Robust Optimization for Solving Out-Of-Domain Generalization and Finding Causal Invariant Relationships

Bachelor thesis (2022) - Z. Guan, J.H. Krijthe, R.K.A. Karlsson, S.R. Bongers, T. Höllt

Out-of-Domain (OOD) generalization is a challenging problem in machine learning: learning a model from one or more domains that performs well on an unseen domain. Empirical Risk Minimization (ERM), the standard machine learning method, tends to learn spurious correlations in the training domain and may therefore perform badly when the unseen domain has a different distribution from the training domain. Group Distributionally Robust Optimization (group DRO) is a method proposed to handle the OOD generalization problem. In this paper, the goals are to 1) measure whether group DRO has better OOD generalization performance than ERM, and 2) evaluate whether group DRO finds causally invariant relationships between the input and output. Semi-synthetic bird images with different backgrounds are used to form our data sets and to construct a binary image classification problem for the experiments. Results show that group DRO improves OOD generalization performance over ERM, and that group DRO can find invariant relationships. However, the ability of group DRO to find invariant relationships is limited when the spurious correlation in the training domain is strong. ...
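The difference between the two training objectives can be sketched in a few lines: ERM averages the loss over all samples, whereas group DRO looks at the worst average loss over predefined groups. The loss values and the group labels below are hypothetical, and the sketch only evaluates the two objectives; it does not train a model.

```python
from collections import defaultdict

def erm_risk(losses):
    """Average loss over all samples, regardless of group."""
    return sum(losses) / len(losses)

def worst_group_risk(losses, groups):
    """Largest per-group average loss, the quantity group DRO minimises."""
    per_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        per_group[group].append(loss)
    return max(sum(v) / len(v) for v in per_group.values())

# Hypothetical per-sample losses; the minority group is fitted poorly.
losses = [0.1, 0.1, 0.1, 0.9]
groups = ["majority", "majority", "majority", "minority"]

average = erm_risk(losses)
worst = worst_group_risk(losses, groups)
```

A model can look acceptable under the ERM average while still failing on a small group, which is the situation the worst-group objective is meant to expose.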

Empirical Evaluation of the Performance of CEVAE under Misspecification of the Latent Dimensionality

Bachelor thesis (2022) - P. Barták, J.H. Krijthe, S.R. Bongers, A.R. Bidarra

Causal machine learning deals with the inference of causal relationships between variables in observational datasets. For certain datasets, it is correct to assume a causal graph where information about unobserved confounders can only be obtained through noisy proxies, and CEVAE aims to address this case. The number of dimensions of the latent space modelled by CEVAE must be specified ahead of time, and this paper investigates the effect of this dimensionality misspecification on the performance of CEVAE. Results support the idea that underspecification and overspecification both degrade the performance of CEVAE, but indicate that underspecification is worse, at least for the case with few confounders. In general, the model does not always achieve best performance when the model dimensionality corresponds to the data dimensionality. Finally, conclusions made on data with linear-Gaussian proxies are the same as those obtained with nonlinear-Gaussian proxies, which indicates these conclusions generalize over different datasets to some extent. ...

An empirical study of the effects of unconfoundedness on the performance of Propensity Score Matching

Bachelor thesis (2022) - A. Erdelský, J.H. Krijthe, S.R. Bongers, A.R. Bidarra

The purpose of this research is to analyze the performance of Propensity Score Matching (PSM), a causal inference method for causal effect estimation. More specifically, it investigates how PSM reacts to breaking the unconfoundedness assumption, one of its core conceptual pillars. This has been achieved by running PSM on synthetic data that upholds the unconfoundedness condition, and then comparing these results with measurements obtained from running the algorithm on data with confounding features with varying contribution to other variable values and hiding these features individually or in progressively higher numbers. These results are then also compared to Linear Regression, a generic machine learning algorithm. The results suggest that when hiding variables that only contribute to the main effect, treatment effect or treatment propensity calculation respectively, PSM performs with the same error no matter which of the three effects the hidden feature affects, making them equivalent in their error contribution. Additionally, in all experimental scenarios used in this work, PSM performed very similarly to Linear Regression and did not seem to offer any advantages over the latter in these specific situations. ...

Honesty in Causal Forests, is it worth it?

Bachelor thesis (2022) - M. Havelka, S.R. Bongers, J.H. Krijthe, A.R. Bidarra

Causal machine learning is a relatively new field which tries to find a causal relation between the treatment and the outcome, rather than a correlation between the features and the outcome. To achieve this, many different models were proposed, one of which is the causal forest. A causal forest is made up of a random forest with a different estimation function in the leaf node, which means it suffers from the same problems, such as overfitting easily. Honesty was introduced to ensure mathematically that forests do not overfit as easily. This research, however, only provided preliminary results, and no real testing was done in terms of causal inference. In this paper three scenarios are tested where a comparison is made between a causal forest with and without honesty. Based on the results, it seems that honesty does indeed help trees avoid overfitting. However, in a general setting it hurts the model, as it only trains with half of the available data. This makes the honest causal forest less accurate in general settings where there is not a lot of training data. In a setting where a large amount of data is provided, it seems that honesty does not change the performance, meaning it creates a theoretical guarantee against overfitting with no repercussions for the performance. ...

Treatment Effect Estimation of the DragonNet under Overlap Violations

Bachelor thesis (2022) - R.J. van Veen, S.R. Bongers, J.H. Krijthe, A.R. Bidarra

The large amounts of observational data available nowadays have sparked considerable interest in learning causal relations from such data using machine learning methods. One recent method for doing this, which provided promising results, is the DragonNet (Shi et al., 2019), which utilises neural networks in order to estimate average treatment effects in populations. The performance of the model, however, was not tested on datasets which contain low amounts of overlap between the treated and non-treated subpopulations, which makes it harder to accurately estimate treatment effects. Therefore, the goal of this paper is to investigate the performance of the DragonNet when used on datasets with (near) overlap violations. This has been done by looking at the mean absolute errors and variances of the estimated treatment effects and comparing these to other models. The results showed that the performance of the DragonNet becomes significantly worse compared to other models when large portions of the population suffer from low overlap. Additionally, the variance of the results also increases in these cases, making the results less reliable. From the obtained results, it can be concluded that it is best to choose another model for treatment effect estimation if relatively large amounts of overlap violations are suspected. ...

Empirical study of GANITE’s robustness to hidden confounders

Bachelor thesis (2022) - V.C.O. van Oudenhoven, J.H. Krijthe, S.R. Bongers, A.R. Bidarra

An empirical study is performed exploring the sensitivity to hidden confounders of GANITE, a method for Individualized Treatment Effect (ITE) estimation. Most real-world datasets do not measure all confounders, and thus it is important to know how crucial this is in order to obtain comparable predictions. This is explored through the removal of confounders with varying strengths and by removing subsets of the confounders simultaneously. The sensitivity is measured through the change in Precision in Estimating Heterogeneous Effects (PEHE) and through the divergence of the estimated Average Treatment Effect (ATE) from the ground truth (GT). Experiments are performed on synthetic and semi-synthetic data. Increasing the number of removed hidden confounders increases the error and variability of predictions, both for ITE and ATE. The strength of the removed confounders does not show a conclusive relationship with the error metrics. The effect of removing confounders with different causal graphs is explored but fails to show any clear patterns due to the high variance of the results. ...
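The two error measures named in this abstract can be sketched as follows, assuming the true individual treatment effects are known, as they are on synthetic data. The root form of PEHE is used here; some papers report the squared form instead. The function names and effect values are hypothetical.

```python
import math

def sqrt_pehe(true_ite, estimated_ite):
    """Root of the mean squared error between true and estimated ITEs."""
    squared = [(t - e) ** 2 for t, e in zip(true_ite, estimated_ite)]
    return math.sqrt(sum(squared) / len(squared))

def ate_error(true_ite, estimated_ite):
    """Absolute difference between the estimated and the true ATE."""
    true_ate = sum(true_ite) / len(true_ite)
    estimated_ate = sum(estimated_ite) / len(estimated_ite)
    return abs(estimated_ate - true_ate)

# Hypothetical effects: individual errors cancel out in the average.
true_effects = [1.0, 2.0, 3.0]
estimates = [0.0, 2.0, 4.0]

individual_error = sqrt_pehe(true_effects, estimates)
average_error = ate_error(true_effects, estimates)
```

In this toy case the ATE error is zero while the PEHE is not, which illustrates why the two measures are reported separately.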