M. Havelka | TU Delft Repository

Leave-Multiple-Out Informal Benchmarking

Understanding the Behavior of Informal Benchmarking for Multivariate Confounding

Bachelor thesis (2026) - N.T. Borodjiev, M. Havelka, J.H. Krijthe, A. Anand

Informal benchmarking is a popular approach for calibrating sensitivity bounds for hidden confounding by treating observed covariates as if they were unobserved. While leave-one-out (LOO) benchmarking removes a single covariate, leave-multiple-out (LMO) benchmarking removes sets of covariates to approximate multidimensional confounding. In this study, we examine whether LMO benchmarking recovers the confounding strength as the number of dropped features increases. Using a synthetic dataset with bounded covariates and a known confounding structure, we compare empirical bounds with an oracle-like benchmark and the true theoretical value. The theoretical bound increases monotonically as more covariates are omitted, but the empirical LMO bound does not follow this pattern: it plateaus and then declines. The experiments show that this behavior is not explained by estimation error alone. Rather, it is a consequence of informal benchmarking being restricted to the given sample: large bounds are obtained only from individuals with certain covariate values. This issue becomes more important as larger subsets are omitted, because the strongest theoretical benchmarks depend on increasingly specific patterns in the omitted covariates. As a result, LMO benchmarking may be more reliable for small omitted subsets, but should be interpreted with increasing caution for larger ones. We conclude that LMO informal benchmarking results should be read as sample-realized benchmarks rather than as the maximum confounding strength possible over the full covariate space. ...

When the Propensity Model Is Wrong

Informal Benchmarking and a False Sense of Robustness in Causal Sensitivity Analysis

Bachelor thesis (2026) - R. Vízner, J.H. Krijthe, M. Havelka, A. Anand

Causal effect estimates from observational data rely on the assumption that all confounders, variables that influence both treatment and outcome, are observed. Sensitivity analysis with the Marginal Sensitivity Model (MSM) relaxes this assumption through a parameter Γ that bounds how strongly a hidden confounder may distort an individual’s probability of treatment, but choosing a realistic value for Γ is difficult. A common solution, Informal Benchmarking (IB), estimates Γ by removing observed covariates from the propensity model (the model of treatment probability) and measuring the resulting shift. Because IB depends entirely on this model, this paper investigates how IB and the resulting sensitivity bounds behave when the propensity model is misspecified. A controlled simulation study isolates a single functional-form error: a non-linear term that is part of the true treatment mechanism is omitted from the fitted model. Even though the benchmark is computed only on covariates that are individually well specified, the omitted term shrinks every fitted coefficient toward zero, and this leakage deflates the benchmark below the value a correctly specified model reports. The result is falsely robust bounds that understate the true risk of hidden confounding, the more dangerous direction of error, and the effect grows with the strength of the omitted term while standard diagnostics give no warning. A simple safeguard is proposed: refit the propensity model with a richer specification and rerun the benchmark, treating any rise in the estimate as evidence that the original was deflated. ...
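The leave-one-out benchmarking step described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names (`fit_logistic`, `loo_gamma`), the linear logistic propensity model, the use of the sample maximum as the summary of the odds shift, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_logistic(X, t, n_iter=50, ridge=1e-8):
    """Fit a logistic regression with intercept by Newton's method."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)
        grad = Xd.T @ (t - p)
        # Tiny ridge term keeps the Hessian invertible (an assumption for stability).
        hess = Xd.T @ (Xd * w[:, None]) + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def log_odds(X, beta):
    """Fitted log-odds of treatment for each row of X."""
    return beta[0] + X @ beta[1:]

def loo_gamma(X, t):
    """Hypothetical helper: one benchmark Gamma per covariate.

    For covariate j, refit the propensity model without column j and take
    the largest odds ratio (in either direction) between the full and the
    reduced model over the individuals in the sample.
    """
    full = log_odds(X, fit_logistic(X, t))
    gammas = []
    for j in range(X.shape[1]):
        X_red = np.delete(X, j, axis=1)
        red = log_odds(X_red, fit_logistic(X_red, t))
        # exp(|difference in log-odds|) equals max(OR, 1/OR).
        gammas.append(float(np.exp(np.max(np.abs(full - red)))))
    return np.array(gammas)

# Synthetic data (assumed for illustration): covariate 0 drives treatment
# strongly, covariate 1 weakly, covariate 2 not at all.
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-1, 1, size=(n, 3))
true_logit = 1.5 * X[:, 0] + 0.2 * X[:, 1]
t = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

gammas = loo_gamma(X, t)
print(gammas)
```

Because every value is computed from the fitted model, a misspecified propensity model changes the benchmark itself. This is why the proposed safeguard reruns the same procedure with a richer specification.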

Applying Informal Benchmarking to the f-Sensitivity Model

Benchmarking the Unobserved

Bachelor thesis (2026) - A. Slics, J.H. Krijthe, M. Havelka, A. Anand

Sensitivity analysis asks how much unobserved confounding would overturn a causal conclusion. Every framework leaves the analyst to choose how much confounding to allow for. For the marginal sensitivity model (MSM), informal benchmarking sets this choice from the data: each observed covariate is dropped in turn, and the resulting shift in treatment odds is taken as a plausible value. We ask whether the same idea transfers to the f-sensitivity model, whose parameter ρ bounds confounding by an average within each covariate value rather than by a single worst case. We show that it does. The transfer relies on a single new quantity, a benchmark ρ_bench: the symmetric-KL divergence that a dropped covariate induces between the treatment arms. We take the strongest covariate rather than the average, as informal benchmarking does for the MSM and as ρ requires. We compute ρ_bench from the observed covariates. It is stable across seeds, and it separates covariates that the MSM treats as identical. As a rare confounding spike grows, ρ_bench stays nearly flat while the MSM’s worst-case reading climbs, which is the behavior to be expected of an average-based measure. On simulated data with a known hidden confounder, the benchmark recovers the divergence that the confounder induces, and it covers the true ρ in every scenario tested. One shortcoming is that it can under-report confounding that is concentrated in a low-density region of a covariate’s range. ...

Benchmarking the Unobserved

Coverage Failure in Omitted-Variable Sensitivity Bounds

Bachelor thesis (2026) - V. Popdonchev, M. Havelka, J.H. Krijthe, A. Anand

Researchers often use observational data to estimate the causal effect of a treatment on an outcome. The central threat to such estimates is an unobserved confounder: a variable that affects both the treatment and the outcome but is not measured. An omitted confounder biases the estimated effect, and this bias does not shrink as the sample grows. Sensitivity analysis addresses this threat by asking how strong a hidden confounder would need to be to overturn a result. A widely used method for linear regression, introduced by Cinelli and Hazlett [6], extended by Chernozhukov et al. [5], and implemented in the sensemakr software, reports an upper bound on the possible bias together with a small set of summary statistics. The bound is valid for whatever confounder strengths the analyst specifies; it cannot supply those strengths, which are unknown. In practice they are supplied by benchmarking, which compares the confounder to an observed covariate. This makes two claims at once: that the confounder is no stronger than the covariate, and that the covariate is an appropriate reference for it. The second claim cannot be checked from the data. We study the formal leave-one-out version of this procedure and ask a question its validity proof leaves open: when this assumption is false, does the reported bound still contain the true bias? We answer with Monte Carlo simulations in which the confounder is known, so that the bias and the bound can be compared directly. The bound covers the true bias until the confounder reaches roughly the strength of the covariate it is benchmarked against, and then fails sharply rather than gradually. In raw terms, the strength at which it fails depends on the covariate set, so to locate it we derive a relative-strength coordinate that expresses the confounder’s strength in the bound’s own units. In these units the failure sits at a single point across structurally different covariate sets, the point at which the confounder overtakes the benchmark, and it shifts predictably when the worst-case assumption behind the bound is relaxed. None of the summary statistics warn of any of this: as the bound begins to fail, every one moves in the direction that appears more reassuring. We conclude that these summary statistics, read on their own, do not establish that an estimate is robust to unobserved confounding; they establish robustness only when the benchmarking assumption holds. We recommend that analysts either defend this assumption explicitly or set the confounder’s strengths directly from subject-matter knowledge, rather than treating the reported statistics as sufficient. ...
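For orientation, the omitted-variable bias expression of Cinelli and Hazlett, on which this kind of bound rests, can be written as below. The abstract does not state the formula; it is reproduced here from that framework as background. The notation is an assumption of this sketch: D is the treatment, Y the outcome, X the observed covariates, Z the unobserved confounder, and df the residual degrees of freedom of the fitted regression.

```latex
% Magnitude of the bias in the estimated treatment coefficient when a
% single confounder Z is omitted, in terms of two partial R^2 values.
\[
|\widehat{\mathrm{bias}}|
  = \widehat{\mathrm{se}}(\hat{\tau}_{\mathrm{res}})
    \sqrt{\frac{R^2_{Y \sim Z \mid D, X}\; R^2_{D \sim Z \mid X}}
               {1 - R^2_{D \sim Z \mid X}}\;\mathrm{df}}
\]
```

The two partial R² values are the "confounder strengths" that the analyst must specify. Benchmarking replaces them with values derived from an observed covariate, and it is this substitution whose validity the abstract examines.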