J.H. Krijthe
61 records found
Optimising Labeling
The limits of weakly supervised osteophytes severity grading and localization in Hip X-Rays
teoarthritis (OA), but grading their severity in specific hip locations is a time-consuming process that requires an expert. In many cases it is expensive to scale datasets with location-annotated severity labelling by experts, whereas weak labels, which contain only the global presence of osteophytes, are much easier to obtain. This paper investigates whether such weak global labels can improve localized severity grading through a multitask deep learning framework.
We study a ResNet-18 based convolutional network that shares and updates its weights across two types of output head: a global binary classification head and four regional ordinal heads for femur superior, femur inferior, acetabulum superior and acetabulum inferior. The model is trained under four supervision strategies: a strong-only configuration using only quadrant-level labels, a masked baseline that incorporates weakly labelled negatives via label propagation and ignores weak positives in the local loss, and two Multi-Instance Learning (MIL) variants that use a Noisy-OR loss to propagate weak positive labels to the quadrants. We systematically vary the ratio of weak to strong labels and evaluate performance using quadratic weighted Cohen’s kappa as the primary metric.
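As an illustration of the Noisy-OR idea mentioned above, the following minimal sketch aggregates per-quadrant positive probabilities into a single image-level probability and scores it against a weak global label. It is not the thesis implementation: the function names, the independence assumption between quadrants, and the use of binary cross-entropy on the aggregated probability are assumptions made here for illustration.

```python
import math


def noisy_or(region_probs):
    """Probability that at least one region is positive.

    Assumes (for illustration) that regions are independent, so
    P(any positive) = 1 - prod_i (1 - p_i).
    """
    prod_negative = 1.0
    for p in region_probs:
        prod_negative *= 1.0 - p
    return 1.0 - prod_negative


def noisy_or_bce(region_probs, global_label, eps=1e-7):
    """Binary cross-entropy between the Noisy-OR probability and a weak label.

    global_label is 1 if osteophytes are present anywhere in the image, else 0.
    eps clamps the probability to keep the logarithm finite.
    """
    p_image = min(max(noisy_or(region_probs), eps), 1.0 - eps)
    if global_label == 1:
        return -math.log(p_image)
    return -math.log(1.0 - p_image)


# Hypothetical probabilities for the four quadrants of one hip.
quadrant_probs = [0.5, 0.5, 0.0, 0.0]
print(noisy_or(quadrant_probs))
print(noisy_or_bce(quadrant_probs, global_label=1))
```

A weak positive label only tells the loss that some quadrant should be positive, so the gradient is free to push up whichever quadrant is already most confident.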
Experiments show that the masked baseline with weak labels improves regional kappa score compared to the strong-only configuration, while MIL variants fail to outperform the baseline and can degrade performance at higher weak-to-strong ratios. We further observe that selecting checkpoints by minimal joint validation loss underestimates the achievable kappa score because the global task converges faster, whereas selecting by maximal kappa score yields substantially better localized grading. Overall, the findings highlight the trade-off between localization and classification performance in weakly supervised multitask learning pipelines for regional osteophyte grading in hip X-rays.
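For readers unfamiliar with the primary metric, here is a minimal pure-Python sketch of quadratic weighted Cohen's kappa for integer grades. The function name and the choice to raise an error in the degenerate single-class case are assumptions of this sketch; it is not taken from the thesis code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted Cohen's kappa for integer grades in [0, n_classes)."""
    n = len(y_true)
    # Observed confusion counts: rows are true grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row_totals = [sum(observed[i]) for i in range(n_classes)]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Disagreement weight grows with the squared grade distance.
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected counts if true and predicted grades were independent.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when only one grade occurs")
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], n_classes=4))
print(quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], n_classes=4))
```

Because the weights are quadratic, confusing grade 0 with grade 3 is penalised nine times as heavily as confusing adjacent grades. A loss that treats all errors alike does not make this distinction.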
Anatomical Priors for Weakly Supervised Osteophyte Detection and Localization in Hip X-rays
Evaluating BoneFinder-Derived Guidance Under Image-Level Supervision
This work investigates whether anatomical priors derived from landmark points can improve weakly supervised osteophyte detection and localization in hip X-rays when only image-level labels are available. We propose modified ResNet-18 architectures that integrate anatomical guidance to highlight likely osteophyte regions.
We evaluate the proposed models across varying training data sizes. The results show that models with anatomical guidance generally outperform baseline models, with the most consistent improvements observed in classification metrics, while localization results are less conclusive. Additionally, experiments performed without guidance during testing led to reduced classification performance. Overall, the results suggest that anatomical priors provide useful complementary information for weakly supervised osteophyte detection, although they do not fully compensate for limited training data. Moreover, the benefit of guidance information varies across architectures and training set sizes.
Annotation-Efficient Osteophyte Severity Estimation in Hip X-rays
Combining Binary Presence Labels with Limited OARSI Grade Supervision
osteoarthritis, is expensive because it requires expert annotation, whereas coarser binary presence labels are far easier to obtain. This study investigates how effectively these binary labels can be combined with a limited number of graded labels to estimate ordinal osteophyte severity in hip X-ray crops, and whether the choice of which samples to grade matters. We formulate the task as cumulative ordinal regression over four anatomical locations per hip, in which binary labels supervise the presence threshold and graded labels supervise the higher severity thresholds, while thresholds with no available grade are left unsupervised. A binary-only baseline detected osteophyte presence well and produced confidence scores that rose with true grade, but could not resolve the higher grades. A few graded labels enabled ordinal expected-severity estimates and reduced macro-averaged mean absolute error, with the largest gains at the smallest budgets and diminishing returns beyond. Comparing score-stratified sampling against random selection of the graded subset, the score-based strategy was competitive but not consistently better, indicating that most of the benefit comes from adding graded supervision rather than from how the samples are chosen. All results are reported on a held-out test set, averaged over three seeds. Combining many binary labels with relatively few graded labels is a promising way to reduce expert annotation burden while still producing useful ordinal severity estimates.
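The cumulative formulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: it assumes grades 0 to 3 (hence three thresholds), models each threshold k as a probability P(grade >= k), and uses hypothetical helper names. Thresholds without a grade are masked out of the loss, as the abstract describes.

```python
import math


def threshold_targets(present, grade=None, n_thresholds=3):
    """Build per-threshold targets and a supervision mask for one location.

    Threshold index 0 is the presence threshold (grade >= 1); index k is
    grade >= k + 1. Without a grade, only the presence threshold is supervised.
    """
    targets = [0.0] * n_thresholds
    mask = [0.0] * n_thresholds
    targets[0] = 1.0 if present else 0.0
    mask[0] = 1.0
    if grade is not None:
        for k in range(1, n_thresholds):
            targets[k] = 1.0 if grade >= k + 1 else 0.0
            mask[k] = 1.0
    return targets, mask


def masked_bce(probs, targets, mask, eps=1e-7):
    """Mean binary cross-entropy over the supervised thresholds only."""
    total = 0.0
    for p, t, m in zip(probs, targets, mask):
        p = min(max(p, eps), 1.0 - eps)
        total += m * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / sum(mask)


def expected_severity(probs):
    """Expected grade under the cumulative model: sum_k P(grade >= k)."""
    return sum(probs)


# Hypothetical model outputs P(grade >= 1), P(grade >= 2), P(grade >= 3).
probs = [0.9, 0.5, 0.1]
binary_targets, binary_mask = threshold_targets(present=True)
graded_targets, graded_mask = threshold_targets(present=True, grade=2)
print(masked_bce(probs, binary_targets, binary_mask))
print(masked_bce(probs, graded_targets, graded_mask))
print(expected_severity(probs))
```

With only a binary label the higher thresholds receive no gradient. This is consistent with the reported finding that a binary-only baseline detects presence but cannot resolve the higher grades.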
Landmark-Based Anatomical Priors as Penalty Masks in Weakly Supervised Learning
Effects on Classification Performance and Heatmap Distribution in Hip Osteophyte Detection
When the Propensity Model Is Wrong
Informal Benchmarking and a False Sense of Robustness in Causal Sensitivity Analysis
Leave-Multiple-Out Informal Benchmarking
Understanding the Behavior of Informal Benchmarking for Multivariate Confounding
Benchmarking the Unobserved
Coverage Failure in Omitted-Variable Sensitivity Bounds
Applying Informal Benchmarking to the f-Sensitivity Model
Benchmarking the Unobserved
We develop a framework combining input-distribution diagnostics, label-distribution analysis, and bidirectional cross-domain model evaluation to assess whether observed differences are consistent with annotation shift. The approach is evaluated through controlled synthetic experiments and experiments using osteoarthritis radiographs.
Across both settings, annotation shift produces characteristic directional asymmetries in cross-domain prediction errors that differ from the signatures of prevalence and acquisition shifts. These asymmetries provide a basis for distinguishing annotation shift from other forms of domain shift, enabling more reliable interpretation of cross-domain model failures.
Positivity Sized-Up Effectively
Assessing Stochastic Positivity in Causal Inference via Effective Sample Size
The primary aim of this study was to develop and validate a machine learning prediction model for respiratory deterioration in mechanically ventilated Intensive Care Unit (ICU) patients. The secondary aim was to identify physiological parameters associated with respiratory failure during mechanical ventilation.
Methods
Two distinct prediction models were developed using data from ICU patients admitted to the Leiden University Medical Centre (LUMC) between 2018 and 2023. Patients receiving invasive mechanical ventilation (IMV) for at least 48 hours with a PaO2/FiO2 ratio below 40 kPa were included and allocated to COVID training, COVID test, or non-COVID test sets. Model 1 predicts respiratory deterioration within six hours after switching from controlled to assisted ventilation. Model 2 is an hourly updating model predicting respiratory deterioration occurring more than six hours after this switch. XGBoost models were cross-validated on the COVID training set to identify the optimal observation windows and prediction horizons, after which feature selection and hyperparameter optimisation were performed. Model 1 was optimised for the area under the receiver operating characteristic curve (AUROC) and Model 2 for the area under the precision-recall curve (AUPRC). Discriminative performance, generalisability, and clinical utility were evaluated on the COVID and non-COVID test sets.
Results
A total of 296 patients were included in the COVID training set, 78 in the COVID test set, and 755 in the non-COVID test set. For Model 1, a one-hour observation window was selected. The most important features were the mean fraction of inspired oxygen (FiO2), propofol infusion rate, and peripheral oxygen saturation (SpO2). This model achieved an AUROC of 0.78 on the COVID test set and 0.76 on the non-COVID test set. For Model 2, a two-hour observation window and a six-hour prediction horizon were selected, with the SpO2/FiO2 ratio as the most important input feature. This model achieved an AUPRC of 0.05 on the COVID test set and 0.03 on the non-COVID test set.
Conclusion
Model 1 demonstrated moderate discriminative performance but limited clinical utility at relevant operating points. Model 2 showed very limited predictive value, primarily due to extreme class imbalance. Consequently, neither model is currently suitable for clinical implementation. With larger datasets and more advanced modelling techniques, Model 1 may have the potential to become a clinically useful tool to support decisions on switching from controlled to assisted ventilation.
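Because AUPRC depends strongly on class imbalance, it is usually read against the positive prevalence, which is roughly what a random ranking achieves in expectation. The sketch below shows one common way to estimate AUPRC (average precision) and the prevalence baseline. The function names and the toy labels are illustrative assumptions and are not data from the study. Tied scores are not handled specially.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision at each positive, ranked by score."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total_precision = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total_precision += hits / rank
    return total_precision / n_pos


def prevalence(y_true):
    """Fraction of positive labels, the approximate AUPRC of a random ranking."""
    return sum(y_true) / len(y_true)


# Toy example with made-up labels and scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(labels, scores))
print(prevalence(labels))
```

An AUPRC that looks small in absolute terms can therefore only be interpreted once the event rate in the corresponding test set is known.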
Safer causal inference
Theory and algorithms for falsification, trial augmentation and policy evaluation
In Part One, we address the first aspect of detecting violations of causal identification assumptions. We focus on settings with data from multiple sources, such as hospitals or locations, where distributional shifts naturally occur. Under specific independence conditions on the causal mechanisms driving these shifts, we first present a nonparametric test to falsify the assumption of no unmeasured confounding. To obtain these results, we introduce a novel technique utilizing hierarchical causal graphical models. Thereafter, we focus on improving the statistical efficiency of this test, which is achieved by reformulating the independence condition using parameterized linear models. Finally, we extend the hierarchical modeling approach to other identification settings, specifically by testing the validity of mediators and instrumental variables used in two additional common identification strategies.
In Parts Two and Three, we instead develop methods that are robust when causal identification assumptions are violated. We revisit two commonly occurring problem settings in causal inference and demonstrate that it is possible to develop methods that either remove the need for the assumptions traditionally made or rely on weaker and more plausible ones. In the first setting, we study the problem of augmenting randomized trials using external data to improve efficiency in treatment effect estimation. Typically, such approaches rely on a transportability assumption that relates the populations underlying the trial and external data. But when this transportability assumption is violated, integrating external data can introduce substantial bias. To address this, we propose a novel and efficient estimator that incorporates external data and show that this estimator improves inference on the average treatment effect while guaranteeing that it never performs worse, and sometimes performs better, than the estimator that relies solely on trial data. We further adapt this estimator to learn heterogeneous treatment effects within the trial population and show that similar safety guarantees hold for this problem.
In the second setting, we examine the evaluation of treatment allocation strategies using Qini curves. Standard methods for estimating Qini curves assume no interference between treated units, meaning that the treatment of one unit does not affect others. However, when interference is present, these Qini curves can be misleading and lead to incorrect evaluation of treatment allocation strategies. We therefore propose multiple estimators to handle the interference, specifically in settings where units within a cluster may affect one another but not units in other clusters. We identify a bias-variance trade-off in these estimators and, through both theoretical and empirical results, provide practical guidance on how practitioners can choose among them. The dissertation concludes with a discussion of broader considerations, limitations of the presented research, and potential directions for future work. We find that it is indeed possible to make causal inference safer by detecting assumption violations and reducing reliance on untestable assumptions. Nonetheless, many open and important questions remain, offering promising avenues for further research on this topic.
Adversarial generative models applied to diagnosing Osteoarthritis
Evaluating different techniques for fine-tuning discriminator models to classify osteoarthritis
This paper systematically investigates how the depth and width of TARNet affect conditional average treatment effect (CATE) estimation in diverse simulated data environments. The research addresses two central questions: how TARNet's performance varies across data regimes (e.g., confounding strength, sample size), and how its optimal architecture changes in response to these conditions.
A comprehensive set of simulation-based experiments is conducted using the CATENets framework, isolating and varying factors such as sample size, feature dimensionality, confounding strength, and the presence of noise. The results demonstrate that deeper architectures generally yield better performance in complex or high-dimensional scenarios, whereas narrower networks are preferable in small-sample or high-noise settings due to their regularizing effect. Furthermore, the findings suggest that there is no universally optimal architecture. The best configuration depends on the specific characteristics of the data. The study concludes with practical recommendations for architecture selection based on the experiments conducted.