R.K.A. Karlsson | TU Delft Repository

Safer causal inference

Theory and algorithms for falsification, trial augmentation and policy evaluation

Doctoral thesis (2026) - R.K.A. Karlsson, M.J.T. Reinders, J.H. Krijthe

Estimating the effect of an intervention on an outcome is a central challenge across science and society. In medicine, we may ask whether a drug effectively treats a disease, and in economics, whether a new policy reduces unemployment. Estimating such effects from data, a process known as causal inference, is essential but inherently difficult because it often relies on untestable assumptions to ensure unbiased identification of treatment effects. A key example of such an untestable assumption is the absence of unmeasured confounding, meaning that no hidden variable influences both the treatment and the outcome. When this assumption fails, which we cannot directly verify, treatment effect estimates may become biased. This can ultimately lead to untrustworthy conclusions and, in the worst case, unsafe decisions, such as prescribing the wrong drug to a patient. The central question of this dissertation is therefore whether we can develop methods for safer causal inference that either detect violations of its underlying assumptions or remain robust when those assumptions are violated.
In Part One, we address the first aspect: detecting violations of causal identification assumptions. We focus on settings with data from multiple sources, such as hospitals or locations, where distributional shifts naturally occur. Under specific independence conditions on the causal mechanisms driving these shifts, we first present a nonparametric test to falsify the assumption of no unmeasured confounding. To obtain these results, we introduce a novel technique utilizing hierarchical causal graphical models. Thereafter, we focus on improving the statistical efficiency of this test, which is achieved by reformulating the independence condition using parameterized linear models. Finally, we extend the hierarchical modeling approach to other identification settings, specifically by testing the validity of mediators and instrumental variables used in two additional common identification strategies.
In Parts Two and Three, we instead develop methods that are robust when causal identification assumptions are violated. We revisit two commonly occurring problem settings in causal inference and demonstrate that it is possible to develop methods that either remove the need for the assumptions traditionally made or rely on weaker and more plausible ones. In the first setting, we study the problem of augmenting randomized trials using external data to improve efficiency in treatment effect estimation. Typically, such approaches rely on a transportability assumption that relates the populations underlying the trial and external data. But when this transportability assumption is violated, integrating external data can introduce substantial bias. To address this, we propose a novel and efficient estimator that incorporates external data and show that this estimator improves inference on the average treatment effect while guaranteeing that it never performs worse, and sometimes performs better, than the estimator that relies solely on trial data. We further adapt this estimator to learn heterogeneous treatment effects within the trial population and show that similar safety guarantees hold for this problem.
In the second setting, we examine the evaluation of treatment allocation strategies using Qini curves. Standard methods for estimating Qini curves assume no interference between treated units, meaning that the treatment of one unit does not affect others. However, when interference is present, these Qini curves can be misleading and lead to incorrect evaluation of treatment allocation strategies. We therefore propose multiple estimators to handle the interference, specifically in settings where units within a cluster may affect one another but not units in other clusters. We identify a bias-variance trade-off in these estimators and, through both theoretical and empirical results, provide practical guidance on how practitioners can choose among them. The dissertation concludes with a discussion of broader considerations, limitations of the presented research, and potential directions for future work. We find that it is indeed possible to make causal inference safer by detecting assumption violations and reducing reliance on untestable assumptions. Nonetheless, many open and important questions remain, offering promising avenues for further research on this topic.
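For readers unfamiliar with Qini curves, the following is a minimal sketch of the standard estimator that assumes no interference and a randomized treatment. It is not one of the interference-aware estimators proposed in the dissertation; the function name `qini_curve` and the toy data are illustrative assumptions. Units are ranked by a targeting score, and for each top-k prefix the cumulative outcome of treated units is compared with the cumulative outcome of control units, rescaled to the size of the treated group.

```python
def qini_curve(scores, treated, outcomes):
    """Qini values after targeting the top-k units (k = 0..n), ranked by score.

    Assumes randomized treatment and no interference between units.
    """
    # Rank units from highest to lowest targeting score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_t = n_c = 0          # number of treated / control units seen so far
    y_t = y_c = 0.0        # cumulative outcomes of treated / control units
    curve = [0.0]
    for i in order:
        if treated[i]:
            n_t += 1
            y_t += outcomes[i]
        else:
            n_c += 1
            y_c += outcomes[i]
        # Rescale control outcomes to the size of the treated group so far.
        scaled_control = y_c * n_t / n_c if n_c > 0 else 0.0
        curve.append(y_t - scaled_control)
    return curve


if __name__ == "__main__":
    # Hypothetical toy data: four units, two treated and two controls.
    print(qini_curve([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], [1, 0, 1, 1]))
```

When the treatment of one unit affects the outcomes of others, the control group in each prefix no longer provides a valid comparison, which is the failure mode the abstract describes.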

Falsification of Unconfoundedness by Testing Independence of Causal Mechanisms

Journal article (2025) - Rickard K.A. Karlsson, Jesse H. Krijthe

A major challenge in estimating treatment effects in observational studies is the reliance on untestable conditions such as the assumption of no unmeasured confounding. In this work, we propose an algorithm that can falsify the assumption of no unmeasured confounding in a setting with observational data from multiple heterogeneous sources, which we refer to as environments. Our proposed falsification strategy leverages a key observation that unmeasured confounding can cause observed causal mechanisms to appear dependent. Building on this observation, we develop a novel two-stage procedure that detects these dependencies with high statistical power while controlling false positives. The algorithm does not require access to randomized data and, in contrast to other falsification approaches, functions even under transportability violations when the environment has a direct effect on the outcome of interest. To showcase the practical relevance of our approach, we show that our method is able to efficiently detect confounding on both simulated and semi-synthetic data. ...
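The key observation can be illustrated with a small simulation. This is a simplified linear sketch under assumptions of my own, not the authors' two-stage procedure: every name (`mechanism_params`, `permutation_pvalue`, `gamma`) and every distributional choice is hypothetical. Each environment shifts the hidden variable U, the treatment mechanism, and the outcome mechanism independently. When U affects both treatment and outcome (`gamma != 0`), the per-environment parameters estimated from observed data alone become correlated across environments.

```python
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def mechanism_params(n_env, n_obs, gamma, rng):
    """Per-environment estimates of E[T] and of the intercept of Y regressed on T."""
    t_params, y_params = [], []
    for _ in range(n_env):
        mu_u = rng.gauss(0.0, 2.0)    # environment shift of the hidden variable U
        alpha = rng.gauss(0.0, 0.5)   # independent shift of the treatment mechanism
        beta = rng.gauss(0.0, 1.0)    # independent shift of the outcome mechanism
        u = [mu_u + rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        t = [alpha + gamma * ui + rng.gauss(0.0, 1.0) for ui in u]
        y = [beta + ti + gamma * ui + rng.gauss(0.0, 1.0) for ti, ui in zip(t, u)]
        mt, my = sum(t) / n_obs, sum(y) / n_obs
        # Ordinary least squares of Y on T using observed data only (U is hidden).
        slope = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
                 / sum((ti - mt) ** 2 for ti in t))
        t_params.append(mt)
        y_params.append(my - slope * mt)
    return t_params, y_params


def permutation_pvalue(a, b, rng, n_perm=999):
    """Permutation test for dependence, using |Pearson correlation| as the statistic."""
    observed = abs(pearson(a, b))
    shuffled = list(b)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(a, shuffled)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(0)
    for gamma in (0.0, 1.0):  # 0.0: no hidden confounding, 1.0: hidden confounding
        t_par, y_par = mechanism_params(200, 200, gamma, rng)
        print(gamma, pearson(t_par, y_par), permutation_pvalue(t_par, y_par, rng))
```

A correlation test only captures linear dependence between two scalar summaries; it is used here purely to make the observation concrete.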

Benchmarking surrogate-based optimisation algorithms on expensive black-box functions

Journal article (2023) - Laurens Bliek, Arthur Guijt, Rickard Karlsson, Sicco Verwer, Mathijs de Weerdt

Surrogate algorithms such as Bayesian optimisation are specifically designed for black-box optimisation problems with expensive objectives, such as hyperparameter tuning or simulation-based optimisation. In the literature, these algorithms are usually evaluated with synthetic benchmarks which are well established but have no expensive objective, and only on one or two real-life applications which vary wildly between papers. There is a clear lack of standardisation when it comes to benchmarking surrogate algorithms on real-life, expensive, black-box objective functions. This makes it very difficult to draw conclusions on the effect of algorithmic contributions and to give substantial advice on which method to use when. A new benchmark library, EXPObench, provides first steps towards such a standardisation. The library is used to provide an extensive comparison of six different surrogate algorithms on four expensive optimisation problems from different real-life applications. This has led to new insights regarding the relative importance of exploration, the evaluation time of the objective, and the model used. We also provide rules of thumb for which surrogate algorithm to use in which situation. A further contribution is that we make the algorithms and benchmark problem instances publicly available, contributing to more uniform analysis of surrogate algorithms. Most importantly, we include the results of the six algorithms on all evaluated problem instances. This unique new dataset lowers the bar for researching new methods as the number of expensive evaluations required for comparison and for the creation of new surrogate models is significantly reduced. ...

Putting Causal Identification to the Test: Falsification using Multi-Environment Data

Preprint (2023) - R.K.A. Karlsson, S. Creastă, J.H. Krijthe

We study the problem of falsifying the assumptions behind a set of broadly applied causal identification strategies: namely back-door adjustment, front-door adjustment, and instrumental variable estimation. While these assumptions are untestable from observational data in general, we show that with access to data coming from multiple heterogeneous environments, there exist novel independence constraints that can be used to falsify the validity of each strategy. Most interestingly, we make no parametric assumptions, relying instead on the assumption that changes between environments follow the principle of independent causal mechanisms. ...

Detecting hidden confounding in observational data using multiple environments

Preprint (2023) - R.K.A. Karlsson, J.H. Krijthe

A common assumption in causal inference from observational data is that there is no hidden confounding. Yet it is, in general, impossible to verify this assumption from a single dataset. Under the assumption of independent causal mechanisms underlying the data-generating process, we demonstrate a way to detect unobserved confounders given multiple observational datasets coming from different environments. We present a theory for testable conditional independencies that are only absent when there is hidden confounding and examine cases where its assumptions are violated: degenerate and dependent mechanisms, and faithfulness violations. Additionally, we propose a procedure to test these independencies and study its empirical finite-sample behavior using simulation studies and semi-synthetic data based on a real-world dataset. In most cases, the proposed procedure correctly predicts the presence of hidden confounding, particularly when the confounding bias is large. ...

Continuous Surrogate-Based Optimization Algorithms Are Well-Suited for Expensive Discrete Problems

Conference paper (2021) - Rickard Karlsson, Laurens Bliek, Sicco Verwer, Mathijs de Weerdt

One method to solve expensive black-box optimization problems is to use a surrogate model that approximates the objective based on previously observed evaluations. The surrogate, which is cheaper to evaluate, is optimized instead to find an approximate solution to the original problem. In the case of discrete problems, recent research has revolved around discrete surrogate models that are specifically constructed to deal with these problems. A main motivation is that the literature considers continuous methods, such as Bayesian optimization with Gaussian processes as the surrogate, to be sub-optimal (especially in higher dimensions) because they ignore the discrete structure by, e.g., rounding off real-valued solutions to integers. However, we claim that this is not true. In fact, we present empirical evidence showing that continuous surrogate models display competitive performance on a set of high-dimensional discrete benchmark problems, including a real-life application, against state-of-the-art discrete surrogate-based methods. Our experiments with different kinds of discrete decision variables and time constraints also give more insight into which algorithms work well on which type of problem. ...
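The loop described above (fit a cheap continuous surrogate, optimize it, round the real-valued solution to integers, evaluate the true objective) can be sketched as follows. This is an illustration under stated assumptions, not the paper's setup: it swaps the Gaussian-process surrogate for a quadratic ridge regression, uses random search to optimize the surrogate, and the objective, the bounds, and all names are hypothetical.

```python
import numpy as np


def objective(x):
    """Hypothetical stand-in for an expensive discrete objective (to be minimised)."""
    target = np.array([3, 7, 1, 5])
    return float(np.sum((np.asarray(x) - target) ** 2))


def features(X):
    """Quadratic feature map [1, x, x^2] used by the continuous surrogate."""
    X = np.atleast_2d(X).astype(float)
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])


def fit_surrogate(X, y, ridge=1e-3):
    """Ridge-regularised least squares on the quadratic features."""
    Phi = features(X)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)


def surrogate_optimise(n_init=8, n_iter=20, dim=4, low=0, high=9, seed=0):
    rng = np.random.default_rng(seed)
    # Initial design: random integer points evaluated on the true objective.
    X = rng.integers(low, high + 1, size=(n_init, dim))
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        w = fit_surrogate(X, y)
        # Optimise the cheap surrogate over continuous candidates ...
        candidates = rng.uniform(low, high, size=(2000, dim))
        best_cont = candidates[np.argmin(features(candidates) @ w)]
        # ... then round the real-valued solution onto the integer grid.
        x_new = np.clip(np.rint(best_cont), low, high).astype(int)
        X = np.vstack([X, x_new])
        y = np.append(y, objective(x_new))
    best = int(np.argmin(y))
    return X[best], float(y[best]), float(y[:n_init].min())


if __name__ == "__main__":
    x_best, y_best, y_init = surrogate_optimise()
    print("best point:", x_best, "best value:", y_best, "best initial value:", y_init)
```

The sketch has no explicit exploration term, so it can revisit points it has already evaluated; acquisition functions in Bayesian optimization exist to address that.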