
J.H. Krijthe


61 records found

The limits of weakly supervised osteophytes severity grading and localization in Hip X-Rays

Bachelor thesis (2026) - A.D. Ye, G. van Tulder, J.H. Krijthe, I.M. Olkhovskaia
Osteophytes are bony protrusions that are key radiographic indicators of hip osteoarthritis (OA), but grading their severity at specific hip locations is a time-consuming process that requires an expert. Scaling datasets with location-annotated severity labels from experts is often expensive, whereas weak labels, which record only the global presence of osteophytes, are much easier to obtain. This paper investigates whether such weak global labels can improve localized severity grading through a multitask deep learning framework.
We study a ResNet-18-based convolutional network that shares and updates its weights across two types of output head: a global binary classification head and four regional ordinal heads for femur superior, femur inferior, acetabulum superior and acetabulum inferior. The model is trained under four supervision strategies: a strong-only configuration using only quadrant-level labels, a masked baseline that incorporates weakly labelled negatives via label propagation and ignores weak positives in the local loss, and two Multi-Instance Learning (MIL) variants that use a Noisy-OR loss to propagate weak positive labels to the quadrants. We systematically vary the ratio of weak to strong labels and evaluate performance using quadratic weighted Cohen’s kappa as the primary metric.
Experiments show that the masked baseline with weak labels improves the regional kappa score compared to the strong-only configuration, while the MIL variants fail to outperform the baseline and can degrade performance at higher weak-to-strong ratios. We further observe that selecting checkpoints by minimal joint validation loss underestimates the achievable kappa score, due to faster convergence of the global task, whereas selecting by maximal kappa score yields substantially better localized grading. Overall, the findings highlight the trade-off between localization and classification performance in weakly supervised multitask learning pipelines for regional osteophyte grading in hip X-rays. ...
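
The two quantities named in this abstract, a Noisy-OR aggregation of regional probabilities and quadratic weighted Cohen's kappa, can be illustrated with a minimal sketch. This is not the thesis code: the function names are hypothetical, and the mapping from each ordinal head to a single per-quadrant "osteophyte present" probability is assumed to have been done already.

```python
import math

def noisy_or(region_probs):
    """Noisy-OR: P(at least one region positive) = 1 - prod(1 - p_r)."""
    none_positive = 1.0
    for p in region_probs:
        none_positive *= (1.0 - p)
    return 1.0 - none_positive

def noisy_or_bce(region_probs, weak_label, eps=1e-7):
    """Binary cross-entropy between the Noisy-OR probability and a weak global label (0 or 1)."""
    p = min(max(noisy_or(region_probs), eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(weak_label * math.log(p) + (1 - weak_label) * math.log(1.0 - p))

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with weights w_ij = (i - j)^2 / (K - 1)^2 on integer grades 0..K-1."""
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]          # weighted observed disagreement
            den += w * row[i] * col[j] / n     # weighted disagreement expected by chance
    # Raises ZeroDivisionError in the degenerate case where all true and
    # predicted grades fall in one identical class.
    return 1.0 - num / den
```

With four quadrant probabilities of 0.1, 0.2, 0.0 and 0.5, `noisy_or` gives 1 - 0.9 * 0.8 * 1.0 * 0.5 = 0.64, so a single confident quadrant is enough to explain a weak positive label.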

Evaluating BoneFinder-Derived Guidance Under Image-Level Supervision

Bachelor thesis (2026) - I. Onea, J.H. Krijthe, G. van Tulder, I.M. Olkhovskaia
Osteophytes, bony projections associated with Osteoarthritis, are traditionally identified through time-consuming and subjective manual X-ray assessment. While deep learning approaches have shown promising results in medical image analysis, relatively few methods are designed to detect the presence and localization of osteophytes, particularly in settings where only image-level labels are available and precise pixel-level annotations are missing.

This work investigates whether anatomical priors derived from landmark points can improve weakly supervised osteophyte detection and localization in hip X-rays when only image-level labels are available. We propose modified ResNet-18 architectures that integrate anatomical guidance to highlight likely osteophyte regions.

We evaluate the proposed models across varying training data sizes. The results show that models with anatomical guidance generally outperform baseline models, with the most consistent improvements observed in classification metrics, while localization results are less conclusive. Additionally, experiments performed without guidance during testing led to reduced classification performance. Overall, the results suggest that anatomical priors provide useful complementary information for weakly supervised osteophyte detection, although they do not fully compensate for limited training data. Moreover, the benefit of guidance information varies across architectures and training set sizes. ...

Combining Binary Presence Labels with Limited OARSI Grade Supervision

Detailed OARSI grading of osteophytes, an important radiographic indicator of hip osteoarthritis, is expensive because it requires expert annotation, whereas coarser binary presence labels are far easier to obtain. This study investigates how effectively these binary labels can be combined with a limited number of graded labels to estimate ordinal osteophyte severity in hip X-ray crops, and whether the choice of which samples to grade matters. We formulate the task as cumulative ordinal regression over four anatomical locations per hip, in which binary labels supervise the presence threshold and graded labels supervise the higher severity thresholds, while thresholds with no available grade are left unsupervised. A binary-only baseline detected osteophyte presence well and produced confidence scores that rose with true grade, but could not resolve the higher grades. A few graded labels enabled ordinal expected-severity estimates and reduced macro-averaged mean absolute error, with the largest gains at the smallest budgets and diminishing returns beyond. Comparing score-stratified sampling against random selection of the graded subset, the score-based strategy was competitive but not consistently better, indicating that most of the benefit comes from adding graded supervision rather than from how the samples are chosen. All results are reported on a held-out test set, averaged over three seeds. Combining many binary labels with relatively few graded labels is a promising way to reduce expert annotation burden while still producing useful ordinal severity estimates. ...
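
The cumulative formulation described above can be sketched as follows. This is an illustrative reading, not the study's implementation: the helper names are hypothetical, and three thresholds (grades 0 to 3) are an assumption. Each threshold k models P(grade > k); a graded sample supervises every threshold, while a binary-only sample supervises only the first (presence) threshold and masks the rest.

```python
import math

def cumulative_targets(presence, grade=None, n_thresholds=3):
    """Return (targets, mask) for the thresholds P(grade > k), k = 0..n_thresholds-1."""
    if grade is not None:
        # Graded sample: every threshold is supervised.
        targets = [1.0 if grade > k else 0.0 for k in range(n_thresholds)]
        mask = [1.0] * n_thresholds
    else:
        # Binary-only sample: only the presence threshold (k = 0) is supervised.
        targets = [float(presence)] + [0.0] * (n_thresholds - 1)
        mask = [1.0] + [0.0] * (n_thresholds - 1)
    return targets, mask

def masked_bce(probs, targets, mask, eps=1e-7):
    """Binary cross-entropy averaged over the supervised thresholds only."""
    total, count = 0.0, 0.0
    for p, t, m in zip(probs, targets, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += m * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        count += m
    return total / count if count else 0.0

def expected_severity(probs):
    """E[grade] = sum over k of P(grade > k)."""
    return sum(probs)
```

For example, threshold probabilities of 0.9, 0.5 and 0.1 correspond to an expected severity of 1.5, which is the kind of ordinal estimate a binary-only model cannot calibrate for the higher grades.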
Weakly supervised osteophyte classification in hip X-ray images is challenging because only image-level labels are available, providing no explicit information about osteophyte location. However, anatomical landmarks can be used to identify regions where osteophytes are most likely to occur and guide the model towards clinically relevant structures. At the same time, broader anatomical context may also contain useful information for classification. As a result, it remains unclear whether models benefit more from broad anatomical context or from localized regions centered on anatomically relevant structures. This project evaluates whether anatomically guided preprocessing can improve weakly supervised hip osteophyte classification compared to a baseline preprocessing approach. Hip X-rays from the Osteoarthritis Initiative (OAI) and CHECK datasets were processed using two strategies: broad femoral-head-centered crops and localized landmark-based crops generated using BoneFinder anatomical landmarks. ResNet-18 models were trained for binary osteophyte classification and evaluated using ROC-AUC. We further hypothesized that anatomically guided preprocessing would be particularly beneficial when training data is limited, as focusing on clinically relevant regions may improve data efficiency. To investigate this, additional experiments were conducted using reduced training set sizes (50%, 25%, and 10% of the available training data). Unexpectedly, the results show that the baseline preprocessing approach consistently achieved higher classification performance than the anatomically guided approach across all evaluated anatomical regions, despite using lower-resolution crops than the landmark-guided approach. For example, the baseline model achieved an ROC-AUC of 0.889 for superior femoral osteophyte classification, whereas the corresponding landmark-based model achieved an ROC-AUC of 0.783. Reducing the training set size generally reduced performance for both approaches.
These findings suggest that localized landmark-based crops do not necessarily improve weakly supervised osteophyte classification and that broader anatomical context may provide information that is important for accurate prediction. Future work could investigate alternative localization strategies and more precise osteophyte annotations. The source code used in this study is publicly available at: https://github.com/egeyarar/osteophyte-classification ...

Effects on Classification Performance and Heatmap Distribution in Hip Osteophyte Detection

Bachelor thesis (2026) - M. Chen, G. van Tulder, J.H. Krijthe
Weakly supervised learning can reduce the annotation burden for radiographic osteophyte detection because models can be trained with image-level labels rather than pixel-level masks. However, image-level supervision does not specify where the pathology is located, and a classifier may therefore base its decisions on irrelevant anatomical regions. This paper studies whether landmark-based anatomical priors can improve the classification performance and spatial behaviour of a weakly supervised hip osteophyte classifier. Using the CHECK and OAI datasets, we train a ResNet-18 baseline to predict four binary osteophyte targets and compare it with a prior-guided model that adds a penalty to class activation maps during training. The penalty is constructed from BoneFinder landmarks and uses a plateau Gaussian mask around four anatomical target zones. Performance is evaluated using AUC, heatmap centre-of-mass distance, peak distance, spread, paired Wilcoxon signed-rank tests, and qualitative heatmap visualizations. The prior-guided model produces more compact heatmaps that are significantly closer to the landmark-defined anatomical target zones, with mean centre-of-mass distance reductions between 23.9 and 74.0 pixels, mean peak distance reductions between 24.9 and 73.1 pixels, and spread reductions between 24.5 and 33.7 when evaluated on positive osteophyte cases only. Classification performance remains similar to the baseline, with AUC differences between -0.01 and +0.01. These findings indicate that landmark-based penalty masks can improve alignment of class-discriminative heatmaps in weakly supervised hip osteophyte detection without requiring pixel-level osteophyte annotations. ...
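
A rough sketch of the ingredients mentioned in this abstract follows. The exact form of the plateau Gaussian mask and of the penalty is not given here, so both are assumptions: the mask is taken to be 1 within a radius of the target and to decay as a Gaussian beyond it, and the penalty is taken to be the share of (non-negative) activation mass falling outside the mask. All function names are hypothetical.

```python
import math

def plateau_gaussian(height, width, center, radius, sigma):
    """Mask equal to 1 within `radius` of `center`, with Gaussian decay outside (assumed form)."""
    cy, cx = center
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            excess = max(0.0, math.hypot(y - cy, x - cx) - radius)
            row.append(math.exp(-(excess ** 2) / (2.0 * sigma ** 2)))
        mask.append(row)
    return mask

def prior_penalty(cam, mask):
    """Share of non-negative activation mass lying outside the mask (assumed penalty)."""
    total = sum(sum(row) for row in cam)
    outside = sum(a * (1.0 - m)
                  for cam_row, mask_row in zip(cam, mask)
                  for a, m in zip(cam_row, mask_row))
    return outside / total if total > 0 else 0.0

def centre_of_mass(cam):
    """Activation-weighted mean (row, column) position of a non-negative heatmap."""
    total = sum(sum(row) for row in cam)
    cy = sum(y * a for y, row in enumerate(cam) for a in row) / total
    cx = sum(x * a for row in cam for x, a in enumerate(row)) / total
    return cy, cx

def com_distance(cam, target):
    """Euclidean distance in pixels between the heatmap centre of mass and a target point."""
    cy, cx = centre_of_mass(cam)
    return math.hypot(cy - target[0], cx - target[1])
```

A heatmap whose mass sits entirely inside the plateau incurs zero penalty, while mass far from the target zone is penalised almost in full, which is one way such a term could pull activation maps toward the landmark-defined zones.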

Informal Benchmarking and a False Sense of Robustness in Causal Sensitivity Analysis

Bachelor thesis (2026) - R. Vízner, J.H. Krijthe, M. Havelka, A. Anand
Causal effect estimates from observational data rely on the assumption that all confounders, variables that influence both treatment and outcome, are observed. Sensitivity analysis with the Marginal Sensitivity Model (MSM) relaxes this assumption through a parameter Γ that bounds how strongly a hidden confounder may distort an individual’s probability of treatment, but choosing a realistic value for Γ is difficult. A common solution, Informal Benchmarking (IB), estimates Γ by removing observed covariates from the propensity model (the model of treatment probability) and measuring the resulting shift. Because IB depends entirely on this model, this paper investigates how IB and the resulting sensitivity bounds behave when the propensity model is misspecified. A controlled simulation study isolates a single functional-form error: a non-linear term that is part of the true treatment mechanism is omitted from the fitted model. Even though the benchmark is computed only on covariates that are individually well specified, the omitted term shrinks every fitted coefficient toward zero, and this leakage deflates the benchmark below the value a correctly specified model reports. The result is falsely robust bounds that understate the true risk of hidden confounding, the more dangerous direction of error, and the effect grows with the strength of the omitted term while standard diagnostics give no warning. A simple safeguard is proposed: refit the propensity model with a richer specification and rerun the benchmark, treating any rise in the estimate as evidence that the original was deflated. ...
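
To make the benchmarking step concrete, the sketch below computes a benchmark Γ from two sets of fitted propensity scores: one from the full model and one from a model refitted without the benchmarked covariate. It assumes the common convention of taking the largest odds ratio (in either direction) over all units; the thesis may use a different summary, and the function names are hypothetical.

```python
def odds(p):
    """Odds corresponding to a probability strictly between 0 and 1."""
    return p / (1.0 - p)

def benchmark_gamma(e_full, e_reduced):
    """Largest per-unit odds ratio between full-model and reduced-model propensities.

    e_full:    propensity scores fitted with all observed covariates
    e_reduced: propensity scores fitted with the benchmarked covariate removed
    """
    gamma = 1.0
    for p_full, p_reduced in zip(e_full, e_reduced):
        ratio = odds(p_full) / odds(p_reduced)
        # The MSM bounds the odds ratio in both directions, so take the larger one.
        gamma = max(gamma, ratio, 1.0 / ratio)
    return gamma
```

If the fitted model omits a non-linear term and its coefficients shrink toward zero as the abstract describes, `e_full` and `e_reduced` move closer together and this benchmark shrinks with them, which is the deflation the study warns about.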

Understanding the Behavior of Informal Benchmarking for Multivariate Confounding

Bachelor thesis (2026) - N.T. Borodjiev, M. Havelka, J.H. Krijthe, A. Anand
Informal benchmarking is a popular approach for calibrating sensitivity bounds for hidden confounding by treating observed covariates as if they were unobserved. While leave-one-out (LOO) benchmarking removes a single covariate, leave-multiple-out (LMO) benchmarking removes sets of covariates to approximate multidimensional confounding. In this study, we examine whether LMO benchmarking recovers the confounding strength as the number of features dropped increases. Using a synthetic dataset with bounded covariates and known confounding structure, we compare empirical bounds with an Oracle-like benchmark and the true theoretical value. The theoretical bound increases monotonically as more covariates are omitted, but the empirical LMO bound does not follow this pattern: it plateaus and then declines. The experiments show that this behavior is not explained by estimation error alone. Rather, it is a consequence of informal benchmarking being restricted by the given sample: large bounds are obtained from individuals with certain covariate values. This issue becomes more important as larger subsets are omitted, because the strongest theoretical benchmarks depend on increasingly specific patterns in the omitted covariates. As a result, LMO benchmarking may be more reliable for small omitted subsets, but should be interpreted with increasing caution for larger ones. We conclude that LMO informal benchmarking results should be read as sample-realized benchmarks rather than as the maximum confounding strength possible over the full covariate space. ...

Coverage Failure in Omitted-Variable Sensitivity Bounds

Bachelor thesis (2026) - V. Popdonchev, M. Havelka, J.H. Krijthe, A. Anand
Researchers often use observational data to estimate the causal effect of a treatment on an outcome. The central threat to such estimates is an unobserved confounder: a variable that affects both the treatment and the outcome but is not measured. An omitted confounder biases the estimated effect, and this bias does not shrink as the sample grows. Sensitivity analysis addresses this threat by asking how strong a hidden confounder would need to be to overturn a result. A widely used method for linear regression, introduced by Cinelli and Hazlett [6] and extended by Chernozhukov et al. [5] and implemented in the sensemakr software, reports an upper bound on the possible bias together with a small set of summary statistics. The bound is valid for whatever confounder strengths the analyst specifies; it cannot supply those strengths, which are unknown. In practice they are supplied by benchmarking, which compares the confounder to an observed covariate. This makes two claims at once: that the confounder is no stronger than the covariate, and that the covariate is an appropriate reference for it. The second claim cannot be checked from the data. We study the formal leave-one-out version of this procedure and ask a question its validity proof leaves open: when this assumption is false, does the reported bound still contain the true bias? We answer with Monte Carlo simulations in which the confounder is known, so that the bias and the bound can be compared directly. The bound covers the true bias until the confounder reaches roughly the strength of the covariate it is benchmarked against, and then fails sharply rather than gradually. The strength at which it fails depends on the covariate set in raw terms, so to locate it we derive a relative-strength coordinate that expresses the confounder’s strength in the bound’s own units. 
In these units the failure sits at a single point across structurally different covariate sets, the point at which the confounder overtakes the benchmark, and it shifts predictably when the worst-case assumption behind the bound is relaxed. None of the summary statistics warn of any of this: as the bound begins to fail, every one moves in the direction that appears more reassuring. We conclude that these summary statistics, read on their own, do not establish that an estimate is robust to unobserved confounding; they establish robustness only when the benchmarking assumption holds. We recommend that analysts either defend this assumption explicitly or set the confounder’s strengths directly from subject-matter knowledge, rather than treating the reported statistics as sufficient. ...
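
For orientation, the omitted-variable bias bound of Cinelli and Hazlett that this abstract builds on can be written as a short function. This is a sketch of the published formula as commonly stated, not of the thesis's simulation code; the function and argument names are hypothetical. The two partial R² values are exactly the "confounder strengths" that benchmarking is meant to supply.

```python
import math

def ovb_bias_bound(se, df, r2_yz_dx, r2_dz_x):
    """Bound on |bias| of a regression coefficient under an omitted confounder Z.

    se:        standard error of the estimated treatment coefficient
    df:        residual degrees of freedom of the fitted regression
    r2_yz_dx:  partial R^2 of Z with the outcome, given treatment and covariates
    r2_dz_x:   partial R^2 of Z with the treatment, given covariates
    """
    if not (0.0 <= r2_yz_dx <= 1.0 and 0.0 <= r2_dz_x < 1.0):
        raise ValueError("partial R^2 values must lie in [0, 1], with r2_dz_x < 1")
    return se * math.sqrt(df * r2_yz_dx * r2_dz_x / (1.0 - r2_dz_x))
```

The bound is only as good as the two strengths passed in: if the true confounder is stronger than the benchmark covariate used to set them, the returned value can fall below the true bias.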
Bachelor thesis (2026) - A. Slics, J.H. Krijthe, M. Havelka, A. Anand
Sensitivity analysis asks how much unobserved confounding would overturn a causal conclusion. Every framework leaves the analyst to choose how much confounding to allow for. For the marginal sensitivity model (MSM), informal benchmarking sets this choice from the data. Each observed covariate is dropped in turn, and the resulting shift in treatment odds is taken as a plausible value. We ask whether the same idea transfers to the f-sensitivity model, whose parameter ρ bounds confounding by an average within each covariate value rather than by a single worst case. We show that it does. The transfer relies on a single new quantity, a benchmark ρ_bench. This is the symmetric-KL divergence that a dropped covariate induces between the treatment arms. We take the strongest covariate rather than the average, as informal benchmarking does for the MSM and as ρ requires. We compute ρ_bench from the covariates. It is stable across seeds, and it separates covariates that the MSM treats as identical. As a rare confounding spike grows, ρ_bench stays nearly flat while the MSM’s worst-case reading climbs, which is the behavior to be expected. On simulated data with a known hidden confounder, the benchmark recovers the divergence that the confounder induces, and it covers the true ρ in every scenario tested. Its shortcoming is that it can under-report confounding that is concentrated in a low-density region of a covariate’s range. ...
This paper introduces a diagnostic framework for assessing annotation shift in cross-domain machine learning, with a focus on medical imaging applications. We formally define annotation shift as a change in the conditional distribution of assigned labels given the underlying target state. This distinction separates annotation-related effects from prevalence and acquisition-related shifts, which may produce similar observable patterns.

We develop a framework combining input-distribution diagnostics, label-distribution analysis, and bidirectional cross-domain model evaluation to assess whether observed differences are consistent with annotation shift. The approach is evaluated through controlled synthetic experiments and experiments using osteoarthritis radiographs.

Across both settings, annotation shift produces characteristic directional asymmetries in cross-domain prediction errors that differ from the signatures of prevalence and acquisition shifts. These asymmetries provide a basis for distinguishing annotation shift from other forms of domain shift, enabling more reliable interpretation of cross-domain model failures. ...

Assessing Stochastic Positivity in Causal Inference via Effective Sample Size

Master thesis (2026) - Q.B. Hofstede, J.H. Krijthe
Causal inference relies on several key identifying assumptions, including positivity: all treatment levels must have non-zero probability for every possible covariate combination. Violations lead to unreliable causal effect estimates, yet positivity is often overlooked, and existing diagnostics have limitations. This assumption is particularly relevant for observational data, because treatment assignment is not independent of confounders. To remove this dependence, inverse probability of treatment weighting (IPTW) estimators can be used. However, IPTW relies on the positivity assumption, and near-violations lead to extreme weights and unstable estimates. We investigate effective sample size (ESS) as a practical diagnostic for evaluating the estimability of causal effects in the face of near-positivity violations. The key contribution is a theoretical definition of ‘targeted ESS’ that aligns with causal inference. Targeted ESS can quantify how many observations effectively contribute to weighted estimates and can serve as an intuitive tool for communicating positivity concerns. Through analysis and simulations, we demonstrate its strengths and limitations. Notably, targeted ESS cannot detect severe cases of positivity violations or propensity model misspecifications. Additionally, we show why conventional ESS is not generally suitable in this setting. This work offers practical guidance for assessing IPTW estimate reliability in observational causal inference. ...
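
As background, the conventional (Kish) effective sample size for a set of IPTW weights can be computed as below. This is the conventional quantity that the abstract argues is not generally suitable, shown only to make the idea concrete; it is not the thesis's targeted ESS, whose definition is not given here. Function names are hypothetical.

```python
def iptw_weights(treatment, propensity):
    """Unstabilised IPTW weights: 1/e(x) for treated units, 1/(1 - e(x)) for controls."""
    return [t / e + (1 - t) / (1.0 - e) for t, e in zip(treatment, propensity)]

def kish_ess(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq
```

With equal weights the ESS equals the number of observations. A single treated unit with a propensity of 0.01 receives a weight of 100 and can dominate the sum, so the ESS drops sharply, which is the symptom of a near-positivity violation.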
Objective
The primary aim of this study was to develop and validate a machine learning prediction model for respiratory deterioration in mechanically ventilated Intensive Care Unit (ICU) patients. The secondary aim was to identify physiological parameters associated with respiratory failure during mechanical ventilation.

Methods
Two distinct prediction models were developed using data from ICU patients admitted to the Leiden University Medical Centre (LUMC) between 2018 and 2023. Patients receiving invasive mechanical ventilation (IMV) for at least 48 hours with a PaO2/FiO2 ratio below 40 kPa were included and allocated to COVID training, COVID test, or non-COVID test sets. Model 1 predicts respiratory deterioration within six hours after switching from controlled to assisted ventilation. Model 2 is an hourly updating model predicting respiratory deterioration occurring more than six hours after this switch. XGBoost models were cross-validated on the COVID training set to identify the optimal observation windows and prediction horizons, after which feature selection and hyperparameter optimisation were performed. Model 1 was optimised for the area under the receiver operating characteristic curve (AUROC) and Model 2 for the area under the precision-recall curve (AUPRC). Discriminative performance, generalisability, and clinical utility were evaluated on the COVID and non-COVID test sets.

Results
A total of 296 patients were included in the COVID training set, 78 in the COVID test set, and 755 in the non-COVID test set. For Model 1, a one-hour observation window was selected. The most important features were the mean fraction of inspired oxygen (FiO2), propofol infusion rate, and peripheral oxygen saturation (SpO2). This model achieved an AUROC of 0.78 on the COVID test set and 0.76 on the non-COVID test set. For Model 2, a two-hour observation window and a six-hour prediction horizon were selected, with the SpO2/FiO2 ratio as the most important input feature. This model achieved an AUPRC of 0.05 on the COVID test set and 0.03 on the non-COVID test set.

Conclusion
Model 1 demonstrated moderate discriminative performance but limited clinical utility at relevant operating points. Model 2 showed very limited predictive value, primarily due to extreme class imbalance. Consequently, neither model is currently suitable for clinical implementation. With larger datasets and more advanced modelling techniques, Model 1 may have the potential to become a clinically useful decision support tool to support decisions on switching from controlled to assisted ventilation. ...

Theory and algorithms for falsification, trial augmentation and policy evaluation

Doctoral thesis (2026) - R.K.A. Karlsson, M.J.T. Reinders, J.H. Krijthe
Estimating the effect of an intervention on an outcome is a central challenge across science and society. In medicine, we may ask whether a drug effectively treats a disease, and in economics, whether a new policy reduces unemployment. Estimating such effects from data, a process known as causal inference, is essential but inherently difficult because it often relies on untestable assumptions to ensure unbiased identification of treatment effects. A key example of such an untestable assumption is the absence of unmeasured confounding, meaning that no hidden variable influences both the treatment and the outcome. When this assumption fails, something which we cannot directly verify, treatment effect estimates may become biased. This ultimately can lead to untrustworthy conclusions and, in the worst case, unsafe decisions, such as prescribing the wrong drug to a patient. The central question of this dissertation is therefore whether we can develop methods for safer causal inference that either detect violations of its underlying assumptions or remain robust when those assumptions are violated.
In Part One, we address the first aspect of detecting violations of causal identification assumptions. We focus on settings with data from multiple sources, such as hospitals or locations, where distributional shifts naturally occur. Under specific independence conditions on the causal mechanisms driving these shifts, we first present a nonparametric test to falsify the assumption of no unmeasured confounding. To obtain these results, we introduce a novel technique utilizing hierarchical causal graphical models. Thereafter, we focus on improving the statistical efficiency of this test, which is achieved by reformulating the independence condition using parameterized linear models. Finally, we extend the hierarchical modeling approach to other identification settings, specifically by testing the validity of mediators and instrumental variables used in two additional common identification strategies.
In Parts Two and Three, we develop methods that instead are robust when causal identification assumptions are violated. We revisit two commonly occurring problem settings when doing causal inference and demonstrate that it is possible to develop methods that either remove the need for, or rely on, weaker and more plausible assumptions than those traditionally made. In the first setting, we study the problem of augmenting randomized trials using external data to improve efficiency in treatment effect estimation. Typically, such approaches rely on a transportability assumption that relates the populations underlying the trial and external data. But when this transportability assumption is violated, integrating external data can introduce substantial bias. To address this, we propose a novel and efficient estimator that incorporates external data and show that this estimator improves inference on the average treatment effect while guaranteeing that it never performs worse, and sometimes performs better, than the estimator that relies solely on trial data. We further adapt this estimator to learn heterogeneous treatment effects within the trial population and show that similar safety guarantees hold for this problem.
In the second setting, we examine the evaluation of treatment allocation strategies using Qini curves. Standard methods for estimating Qini curves assume no interference between treated units, meaning that the treatment of one unit does not affect others. However, when interference is present, these Qini curves can be misleading and lead to incorrect evaluation of treatment allocation strategies. We therefore propose multiple estimators to handle the interference, specifically in settings where units within a cluster may affect one another but not units in other clusters. We identify a bias-variance trade-off in these estimators and, through both theoretical and empirical results, provide practical guidance on how practitioners can choose among them. The dissertation concludes with a discussion of broader considerations, limitations of the presented research, and potential directions for future work. We find that it is indeed possible to make causal inference safer by detecting assumption violations and reducing reliance on untestable assumptions. Nonetheless, many open and important questions remain, offering promising avenues for further research on this topic. ...
Master thesis (2025) - L. Goemans, J.H. Krijthe, G. van Tulder, T. Höllt
Osteoarthritis (OA) is a prevalent musculoskeletal disease, and radiographic assessment remains the standard for diagnosis and grading. However, expert grading is subjective and intensity-based automated methods are sensitive to imaging variability. As a potential solution to these problems, landmark-based approaches are worth exploring. Landmark-based representations of bone geometry offer an alternative to pixel-based inputs, reducing sensitivity to imaging artifacts and emphasizing structural variation. This thesis compares four landmark encodings (raw x,y coordinates, Procrustes-aligned points, pairwise distances, and polar coordinates) and evaluates them using both linear dimensionality reduction (PCA) and nonlinear generative modeling (VAEs) on hip radiographs from a publicly available dataset. We evaluate reconstruction fidelity, latent space traversal, correlation with clinical outcomes, and classification performance. Results show that raw point coordinates provide a strong baseline, often matching or outperforming more complex encodings in classification, while alternative representations improved interpretability but not discriminative power. PCA preserved clinically meaningful variability, whereas VAEs underperformed in this unsupervised setting. These findings suggest that landmark annotations already contain sufficient information for supervised OA tasks, while more advanced models may be needed for unsupervised or generative applications. ...
Combining data from Randomized Controlled Trials (RCTs) is a widely used method to estimate causal treatment effects. In order to combine data, the property of transportability, under which different covariate vectors exhibit similar treatment benefit, must hold between the RCTs. However, differences in study design, execution, and the underlying effect modifier distributions can violate transportability, which could in turn lead to incorrect causal treatment effect estimates. This thesis addresses the challenge of validating transportability between multiple RCTs and identifying subsets of RCTs between which transportability holds. Our contributions include studying a linear regression-based framework for testing transportability between multiple RCTs and a clustering-based approach for identifying transportable RCT subgroups. Through simulations and analysis of real-world RCTs concerning corticosteroid treatment for community-acquired pneumonia (CAP), we evaluate the power, robustness, and limitations of our proposed framework. ...

Evaluating different techniques for fine-tuning discriminator models to classify osteoarthritis

Osteoarthritis is a chronic joint disease in which the protective cartilage between bones deteriorates over time, leading to pain, stiffness, and reduced mobility. Diagnosis is a time-consuming and somewhat subjective process. To address this challenge, machine learning techniques can be applied. However, training supervised models on medical images is often challenging because of the limited availability of labeled training data. Self-supervised methods, which pretrain models to learn useful features without labels, offer a potential solution to this issue. In this paper, we explore the use of Generative Adversarial Networks (GANs) as a pre-training step for osteoarthritis diagnosis. The first step is the training of a GAN on a semi-public dataset of x-ray images. In the second stage, we explore different strategies for fine-tuning the discriminator model to diagnose osteoarthritis. Our experiments suggest that while GAN-based pre-training offers slight improvements over purely supervised approaches, the performance gains remain modest. ...
Bachelor thesis (2025) - D. Stoyanova, J.H. Krijthe, G. van Tulder, M. Weinmann
Self-supervised learning (SSL) is a promising approach for medical imaging tasks because it reduces the need for labeled data, but most existing SSL methods treat each scan as an isolated sample and overlook the fact that patients often have multiple radiographs taken over time. These longitudinal sequences (multiple scans of the same hip acquired at different visits) encode the natural progression of osteoarthritis (OA) and thus could enrich representation learning. In this study, we evaluate whether incorporating temporal information from these longitudinal radiographic sequences into SSL pretraining yields more transferable representations and leads to improved downstream classification of hip OA severity. We focus on a temporal contrastive task (Contrastive Predictive Coding, CPC), which learns to predict future scan representations from earlier ones, and compare it to a SimCLR-based pretraining that treats each radiograph independently. We also investigate a multitask framework that combines both objectives, either by sequentially pretraining with CPC then SimCLR, or by interleaving the two tasks. Experiments on the Osteoarthritis Initiative (OAI) dataset for binary classification of KL-grade severity show that CPC alone does not surpass SimCLR-based pretraining. However, both the sequential and interleaved multitask approaches significantly improve classification accuracy over either single-task method. These findings demonstrate that, although temporal prediction by itself is not sufficient, combining temporal and within-scan contrastive learning can yield stronger models for hip OA severity assessment. ...
Bachelor thesis (2025) - Z. Yancheva, J.H. Krijthe, G. van Tulder, M. Weinmann
Supervised learning approaches have proven useful in diagnosing osteoarthritis from X-ray images, aiding professionals in an otherwise time-consuming and subjective process. However, labeled data is scarce in the medical field. For this reason, we investigate SimCLR, a contrastive self-supervised approach capable of learning useful representations from unlabeled data. Specifically, we explore a core component of this method: its data augmentation techniques. While these augmentations are highly effective at introducing variability in conventional image datasets, they are too aggressive for medical images, often altering their semantic meaning. In this paper, we implement custom anatomy-aware augmentation techniques, which aim to preserve the main region of interest needed for a diagnosis. We evaluate these anatomy-aware augmentations, including Gaussian blur, contrast enhancement, random resized crop, and random erasing, against their classical counterparts by training multiple encoders on different combinations of those augmentations. Our findings show that applying the anatomy-aware approach to all of a model's data augmentations does not lead to a significant improvement in performance. However, selectively applying anatomy-awareness to geometric augmentations shows promising initial results. ...
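For readers unfamiliar with why augmentation choice matters in SimCLR, the following is a minimal sketch of the contrastive (NT-Xent) objective it optimizes, written in plain Python. It is not the thesis implementation: the function names, the toy two-dimensional embeddings, and the temperature value are all illustrative assumptions. Two augmented views of the same image form the positive pair, so an augmentation that alters the diagnostic content of a radiograph changes what the encoder is asked to treat as "the same".

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """Average NT-Xent loss over a batch.

    views_a[i] and views_b[i] are embeddings of two augmented views of
    image i (hypothetical inputs; a real pipeline would get them from an
    encoder). Every other embedding in the batch acts as a negative.
    """
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        positive_logit = cosine(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp over all candidates except i itself.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)


# Toy check: views that stay close to their source image give a lower loss
# than views that resemble a different image.
anchors = [[1.0, 0.0], [0.0, 1.0]]
faithful_views = [[0.9, 0.1], [0.1, 0.9]]
distorted_views = [[0.1, 0.9], [0.9, 0.1]]
loss_faithful = nt_xent(anchors, faithful_views)
loss_distorted = nt_xent(anchors, distorted_views)
```

In this toy setup the distorted views stand in for an augmentation that changes what an image shows, which is the failure mode the anatomy-aware augmentations above aim to avoid.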
Estimating the Conditional Average Treatment Effect (CATE) with neural networks adapted for causal inference, like TARNet, is a promising approach, yet the impact of model architecture on performance remains underexplored.
This paper systematically investigates how the depth and width of TARNet affect CATE estimation in diverse simulated data environments. It addresses two central questions: how TARNet's performance varies across data regimes (e.g., confounding strength, sample size), and how its optimal architecture changes in response to these conditions.
A comprehensive set of simulation-based experiments is conducted using the CATENets framework, isolating and varying factors such as sample size, feature dimensionality, confounding strength, and the presence of noise. The results demonstrate that deeper architectures generally yield better performance in complex or high-dimensional scenarios, whereas narrower networks are preferable in small-sample or high-noise settings due to their regularizing effect. Furthermore, the findings suggest that there is no universally optimal architecture: the best configuration depends on the specific characteristics of the data. The study concludes with practical recommendations for architecture selection based on the experiments conducted. ...
Bachelor thesis (2025) - R. Allu, R.K.A. Karlsson, J.H. Krijthe
Interventional Normalizing Flows (INFs) are a recently proposed method for estimating interventional outcome distributions from observational data. A central component of this approach is the nuisance flow, which estimates the propensity score and the conditional outcome distribution. INFs are claimed to be doubly robust, meaning they can yield valid estimates even if only one of these components is correctly specified. This study investigates the practical limits of this robustness by asking two questions: (1) How do interventional estimates behave when nuisance flow components are entirely misspecified? and (2) How sensitive are these estimates to more realistic imperfections such as suboptimal hyperparameters or injected noise? Through experiments on four benchmark datasets with varying levels of confounding and distributional complexity, we find that INFs remain robust under low-confounding conditions even when both nuisance components are misspecified. However, in high-confounding settings, even partial misspecification can cause estimates to degrade substantially, undermining the doubly robust property. These results highlight the importance of carefully validating nuisance components and suggest that the theoretical guarantees of INFs may not always hold in practice. ...
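The double-robustness property described above can be illustrated with a much simpler estimator than an INF. The sketch below uses the classic augmented inverse-propensity-weighted (AIPW) estimator of the average treatment effect, not the INF method itself, on synthetic data. The data-generating process, the deliberately misspecified nuisance models, and all function names are assumptions made for illustration only.

```python
import random


def aipw_ate(data, propensity, outcome_model):
    """AIPW estimate of the average treatment effect.

    data: list of (x, t, y) tuples with binary treatment t.
    propensity(x): model of P(T=1 | x).
    outcome_model(x, t): model of E[Y | x, t].
    """
    total = 0.0
    for x, t, y in data:
        e = propensity(x)
        mu1 = outcome_model(x, 1)
        mu0 = outcome_model(x, 0)
        # Outcome-model contrast plus inverse-propensity-weighted residuals.
        total += (mu1 - mu0
                  + t * (y - mu1) / e
                  - (1 - t) * (y - mu0) / (1 - e))
    return total / len(data)


def simulate(n, seed=0):
    """Synthetic confounded data with a true treatment effect of 2.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()                      # confounder in [0, 1)
        e = 0.2 + 0.6 * x                     # treatment is likelier for large x
        t = 1 if rng.random() < e else 0
        y = 2.0 * t + 3.0 * x + rng.gauss(0.0, 0.1)
        rows.append((x, t, y))
    return rows


def true_propensity(x):
    return 0.2 + 0.6 * x


def wrong_propensity(x):
    return 0.5                                # ignores the confounder


def true_outcome(x, t):
    return 2.0 * t + 3.0 * x


def wrong_outcome(x, t):
    return 2.5                                # ignores both x and t


data = simulate(20000)
est_outcome_ok = aipw_ate(data, wrong_propensity, true_outcome)
est_propensity_ok = aipw_ate(data, true_propensity, wrong_outcome)
est_both_wrong = aipw_ate(data, wrong_propensity, wrong_outcome)
```

With either nuisance model correct, the estimate stays close to the true effect of 2.0; with both misspecified, the confounding bias returns. This is the textbook guarantee whose practical limits the abstract examines for INFs.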