G. van Tulder

17 records found

The limits of weakly supervised osteophytes severity grading and localization in Hip X-Rays

Bachelor thesis (2026) - A.D. Ye, G. van Tulder, J.H. Krijthe, I.M. Olkhovskaia
Osteophytes are bony protrusions that are key radiographic indicators of hip osteoarthritis (OA), but grading their severity in specific hip locations is a time-consuming process that requires an expert. In many cases it is expensive to scale datasets with location-annotated severity labelling by experts, whereas weak labels, containing only the global presence of osteophytes, are much easier to obtain. This paper investigates whether such weak global labels can improve localized severity grading through a multitask deep learning framework.
We study a ResNet-18 based convolutional network that shares and updates its weights across two types of output heads: a global binary classification head and four regional ordinal heads for femur superior, femur inferior, acetabulum superior and acetabulum inferior. The model is trained under four supervision strategies: a strong-only configuration using only quadrant-level labels, a masked baseline that incorporates weakly labelled negatives via label propagation and ignores weak positives in the local loss, and two Multi-Instance Learning (MIL) variants that use a Noisy-OR loss to propagate weak positive labels to the quadrants. We systematically vary the ratio of weak to strong labels and evaluate performance using quadratic weighted Cohen’s kappa as the primary metric.
Experiments show that the masked baseline with weak labels improves the regional kappa score compared to the strong-only configuration, while the MIL variants fail to outperform the baseline and can degrade performance at higher weak-to-strong ratios. We further observe that selecting checkpoints by minimal joint validation loss underestimates the achievable kappa score, due to faster convergence of the global task, whereas selecting by maximal kappa score yields substantially better localized grading. Overall, the findings highlight the trade-off between localization and classification performance in weakly supervised multitask learning pipelines for regional osteophyte grading in hip X-rays. ...
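As an illustration of two ingredients named in the abstract above, the following is a minimal NumPy sketch of a Noisy-OR aggregation of per-quadrant probabilities into an image-level probability, and of quadratic weighted Cohen's kappa. It is not taken from the thesis: the function names, the binary cross-entropy form of the loss, and the use of plain NumPy instead of a deep learning framework are all assumptions made for illustration.

```python
import numpy as np

def noisy_or(p_regions):
    """Image-level probability that at least one region is positive.

    p_regions: array of shape (..., n_regions) with per-region probabilities.
    """
    p = np.asarray(p_regions, dtype=float)
    return 1.0 - np.prod(1.0 - p, axis=-1)

def noisy_or_bce(p_regions, y_global, eps=1e-7):
    """Binary cross-entropy between the Noisy-OR probability and weak labels.

    Assumed loss form; the thesis may define its Noisy-OR loss differently.
    """
    p_img = np.clip(noisy_or(p_regions), eps, 1.0 - eps)
    y = np.asarray(y_global, dtype=float)
    return float(np.mean(-(y * np.log(p_img) + (1.0 - y) * np.log(1.0 - p_img))))

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal grades."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # Expected confusion matrix under independent marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

With four quadrant probabilities per image, `noisy_or` yields one value per image, so a weak positive label can push up the quadrant probabilities without saying which quadrant is affected.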
Weakly supervised osteophyte classification in hip X-ray images is challenging because only image-level labels are available, providing no explicit information about osteophyte location. However, anatomical landmarks can be used to identify regions where osteophytes are most likely to occur and guide the model towards clinically relevant structures. At the same time, broader anatomical context may also contain useful information for classification. As a result, it remains unclear whether models benefit more from broad anatomical context or from localized regions centered on anatomically relevant structures. This project evaluates whether anatomically guided preprocessing can improve weakly supervised hip osteophyte classification compared to a baseline preprocessing approach. Hip X-rays from the Osteoarthritis Initiative (OAI) and CHECK datasets were processed using two strategies: broad femoral head centered crops and localized landmark based crops generated using BoneFinder anatomical landmarks. ResNet-18 models were trained for binary osteophyte classification and evaluated using ROC-AUC. We further hypothesized that anatomically guided preprocessing would be particularly beneficial when training data is limited, as focusing on clinically relevant regions may improve data efficiency. To investigate this, additional experiments were conducted using reduced training set sizes (50%, 25%, and 10% of the available training data). Unexpectedly, the results show that the baseline preprocessing approach consistently achieved higher classification performance than the anatomically guided approach across all evaluated anatomical regions, despite using lower resolution crops than the landmark-guided approach. For example, the baseline model achieved an ROC-AUC of 0.889 for superior femoral osteophyte classification, whereas the corresponding landmark-based model achieved an ROC-AUC of 0.783. Reducing the training set size generally reduced performance for both approaches. 
These findings suggest that localized landmark-based crops do not necessarily improve weakly supervised osteophyte classification and that broader anatomical context may provide information that is important for accurate prediction. Future work could investigate alternative localization strategies and more precise osteophyte annotations. The source code used in this study is publicly available at: https://github.com/egeyarar/osteophyte-classification ...

Combining Binary Presence Labels with Limited OARSI Grade Supervision

Detailed OARSI grading of osteophytes, an important radiographic indicator of hip osteoarthritis, is expensive because it requires expert annotation, whereas coarser binary presence labels are far easier to obtain. This study investigates how effectively these binary labels can be combined with a limited number of graded labels to estimate ordinal osteophyte severity in hip X-ray crops, and whether the choice of which samples to grade matters. We formulate the task as cumulative ordinal regression over four anatomical locations per hip, in which binary labels supervise the presence threshold and graded labels supervise the higher severity thresholds, while thresholds with no available grade are left unsupervised. A binary-only baseline detected osteophyte presence well and produced confidence scores that rose with true grade, but could not resolve the higher grades. A few graded labels enabled ordinal expected-severity estimates and reduced macro-averaged mean absolute error, with the largest gains at the smallest budgets and diminishing returns beyond. Comparing score-stratified sampling against random selection of the graded subset, the score-based strategy was competitive but not consistently better, indicating that most of the benefit comes from adding graded supervision rather than from how the samples are chosen. All results are reported on a held-out test set, averaged over three seeds. Combining many binary labels with relatively few graded labels is a promising way to reduce expert annotation burden while still producing useful ordinal severity estimates. ...
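The cumulative ordinal formulation described above can be sketched as follows. This is a minimal illustration rather than the study's implementation: it assumes grades 0 to 3 with three thresholds P(grade > k), a per-threshold binary cross-entropy, and a mask that switches off thresholds for which no grade is available. All function names are hypothetical.

```python
import numpy as np

def cumulative_targets(grade, n_thresholds=3):
    """Targets t_k = 1 if grade > k, for k = 0 .. n_thresholds - 1."""
    return (grade > np.arange(n_thresholds)).astype(float)

def masked_threshold_loss(probs, targets, mask, eps=1e-7):
    """Mean binary cross-entropy over supervised thresholds only.

    probs:   predicted P(grade > k) per threshold
    targets: cumulative targets (values under mask == 0 are ignored)
    mask:    1 where the threshold is supervised, 0 where it is not
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    t = np.asarray(targets, dtype=float)
    m = np.asarray(mask, dtype=float)
    bce = -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    return float((bce * m).sum() / max(m.sum(), 1.0))

def expected_severity(probs):
    """Expected grade, using E[grade] = sum_k P(grade > k)."""
    return float(np.sum(probs))
```

Under these assumptions, a sample with only a binary presence label would use the mask `[1, 0, 0]`, while a graded sample would use `[1, 1, 1]` with targets from `cumulative_targets`.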

Evaluating BoneFinder-Derived Guidance Under Image-Level Supervision

Bachelor thesis (2026) - I. Onea, J.H. Krijthe, G. van Tulder, I.M. Olkhovskaia
Osteophytes, bony projections associated with Osteoarthritis, are traditionally identified through time-consuming and subjective manual X-ray assessment. While deep learning approaches have shown promising results in medical image analysis, relatively few methods are designed to detect the presence and localization of osteophytes, particularly in settings where only image-level labels are available and precise pixel-level annotations are missing.

This work investigates whether anatomical priors derived from landmark points can improve weakly supervised osteophyte detection and localization in hip X-rays when only image-level labels are available. We propose modified ResNet-18 architectures that integrate anatomical guidance to highlight likely osteophyte regions.

We evaluate the proposed models across varying training data sizes. The results show that models with anatomical guidance generally outperform baseline models, with the most consistent improvements observed in classification metrics, while localization results are less conclusive. Additionally, experiments performed without guidance during testing led to reduced classification performance. Overall, the results suggest that anatomical priors provide useful complementary information for weakly supervised osteophyte detection, although they do not fully compensate for limited training data. Moreover, the benefit of guidance information varies across architectures and training set sizes. ...

Effects on Classification Performance and Heatmap Distribution in Hip Osteophyte Detection

Bachelor thesis (2026) - M. Chen, G. van Tulder, J.H. Krijthe
Weakly supervised learning can reduce the annotation burden for radiographic osteophyte detection because models can be trained with image-level labels rather than pixel-level masks. However, image-level supervision does not specify where the pathology is located, and a classifier may therefore base its decisions on irrelevant anatomical regions. This paper studies whether landmark-based anatomical priors can improve the classification performance and spatial behaviour of a weakly supervised hip osteophyte classifier. Using the CHECK and OAI datasets, we train a ResNet-18 baseline to predict four binary osteophyte targets and compare it with a prior-guided model that adds a penalty to class activation maps during training. The penalty is constructed from BoneFinder landmarks and uses a plateau Gaussian mask around four anatomical target zones. Performance is evaluated using AUC, heatmap centre-of-mass distance, peak distance, spread, paired Wilcoxon signed-rank tests, and qualitative heatmap visualizations. The prior-guided model produces more compact heatmaps that are significantly closer to the landmark-defined anatomical target zones, with mean centre-of-mass distance reductions between 23.9 and 74.0 pixels, mean peak distance reductions between 24.9 and 73.1 pixels, and spread reductions between 24.5 and 33.7 when evaluated on positive osteophyte cases only. Classification performance remains similar to the baseline, with AUC differences between -0.01 and +0.01. These findings indicate that landmark-based penalty masks can improve alignment of class-discriminative heatmaps in weakly supervised hip osteophyte detection without requiring pixel-level osteophyte annotations. ...
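A minimal sketch of a landmark-centred penalty mask of the kind described above is given below. The abstract does not define the plateau Gaussian, so the form used here (value 1 within a fixed radius of the target point, Gaussian decay beyond it) is an assumption, as are the function names, the parameters `plateau_radius` and `sigma`, and the normalised penalty.

```python
import numpy as np

def plateau_gaussian_mask(shape, center, plateau_radius, sigma):
    """Mask equal to 1 within plateau_radius of center, decaying as a Gaussian outside.

    shape:  (height, width) of the activation map
    center: (row, col) of the anatomical target point
    """
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    dist = np.hypot(yy - center[0], xx - center[1])
    excess = np.maximum(dist - plateau_radius, 0.0)
    return np.exp(-0.5 * (excess / sigma) ** 2)

def outside_zone_penalty(cam, mask, eps=1e-8):
    """Fraction of positive activation mass that falls outside the target zone."""
    cam = np.maximum(np.asarray(cam, dtype=float), 0.0)
    return float((cam * (1.0 - mask)).sum() / (cam.sum() + eps))
```

In a training loop, a term like `outside_zone_penalty` would be added to the classification loss with some weighting factor; how the penalty is actually weighted and combined in the paper is not stated in the abstract.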
This paper introduces a diagnostic framework for assessing annotation shift in cross-domain machine learning, with a focus on medical imaging applications. We formally define annotation shift as a change in the conditional distribution of assigned labels given the underlying target state. This distinction separates annotation-related effects from prevalence and acquisition-related shifts, which may produce similar observable patterns.

We develop a framework combining input-distribution diagnostics, label-distribution analysis, and bidirectional cross-domain model evaluation to assess whether observed differences are consistent with annotation shift. The approach is evaluated through controlled synthetic experiments and experiments using osteoarthritis radiographs.

Across both settings, annotation shift produces characteristic directional asymmetries in cross-domain prediction errors that differ from the signatures of prevalence and acquisition shifts. These asymmetries provide a basis for distinguishing annotation shift from other forms of domain shift, enabling more reliable interpretation of cross-domain model failures. ...
Master thesis (2025) - L. Goemans, J.H. Krijthe, G. van Tulder, T. Höllt
Osteoarthritis (OA) is a prevalent musculoskeletal disease, and radiographic assessment remains the standard for diagnosis and grading. However, expert grading is subjective and intensity-based automated methods are sensitive to imaging variability. As a potential solution to these problems, landmark-based approaches are worth exploring. Landmark-based representations of bone geometry offer an alternative to pixel-based inputs, reducing sensitivity to imaging artifacts and emphasizing structural variation. This thesis compares four landmark encodings (raw x,y coordinates, Procrustes-aligned points, pairwise distances, and polar coordinates) and evaluates them using both linear dimensionality reduction (PCA) and nonlinear generative modeling (VAEs) on hip radiographs from a publicly available dataset. We evaluate reconstruction fidelity, latent space traversal, correlation with clinical outcomes, and classification performance. Results show that raw point coordinates provide a strong baseline, often matching or outperforming more complex encodings in classification, while alternative representations improved interpretability but not discriminative power. PCA preserved clinically meaningful variability, whereas VAEs underperformed in this unsupervised setting. These findings suggest that landmark annotations already contain sufficient information for supervised OA tasks, while more advanced models may be needed for unsupervised or generative applications. ...

Evaluating different techniques for fine-tuning discriminator models to classify osteoarthritis

Osteoarthritis is a chronic joint disease in which the protective cartilage between bones deteriorates over time, leading to pain, stiffness, and reduced mobility. Diagnosis is a time-consuming and somewhat subjective process. To address this challenge, machine learning techniques can be applied. However, training supervised models on medical images is often challenging because of the limited availability of labeled training data. Self-supervised methods, which pretrain models to learn useful features without labels, offer a potential solution to this issue. In this paper, we explore the use of Generative Adversarial Networks (GANs) as a pre-training step for osteoarthritis diagnosis. The first step is the training of a GAN on a semi-public dataset of x-ray images. In the second stage, we explore different strategies for fine-tuning the discriminator model to diagnose osteoarthritis. Our experiments suggest that while GAN-based pre-training offers slight improvements over purely supervised approaches, the performance gains remain modest. ...

How effectively can a VAE’s latent space reflect osteoarthritis severity and enable diagnostic accuracy under label scarcity and label noise?

Bachelor thesis (2025) - P. Dimieva, G. van Tulder
Osteoarthritis (OA) is a prevalent and progressive joint disease whose diagnosis from radiographs often requires expert-labeled data, which is expensive and time-consuming to obtain. Variational Autoencoders (VAEs) offer a way to learn compact, unsupervised representations that may be reused for downstream classification in low-label scenarios. In this work, we assess whether a VAE can learn latent features from hip radiographs that support OA classification with minimal supervision. We evaluate the model’s reconstruction quality, latent space structure, and diagnostic utility under label scarcity and label noise. Results show that VAE-derived features outperform raw pixel and random baselines, suggesting the latent space captures disease-relevant structure. These findings underscore the potential of VAEs as scalable, label-efficient tools for clinical imaging tasks like OA diagnosis. ...
Bachelor thesis (2025) - Z. Yancheva, J.H. Krijthe, G. van Tulder, M. Weinmann
Supervised learning approaches have proven to be useful in diagnosing Osteoarthritis from X-ray images, aiding professionals in an otherwise time-consuming and subjective process. However, in the medical field, labeled data is scarce. For this reason, we investigate a contrastive self-supervised approach, SimCLR, capable of learning useful representations from unlabeled data. Specifically, we explore a core component of this method – the data augmentation techniques. While these augmentations are highly effective in introducing variability in conventional image datasets, they are too aggressive for medical images, often altering their semantic meaning. In this paper, we implement custom anatomy-aware augmentation techniques, which aim to preserve the main region of interest needed for a diagnosis. We evaluate these anatomy-aware augmentations including Gaussian blur, Contrast enhancement, Random resized crop, and Random erasing, against their classical counterparts by training multiple encoders based on different combinations of those augmentations. The findings of our study have shown that utilizing this anatomy-aware approach for all data augmentations a model uses does not lead to a significant improvement in its performance. However, selective use of anatomy-awareness on geometric-based approaches seems to show promising initial results. ...
Bachelor thesis (2025) - D. Stoyanova, J.H. Krijthe, G. van Tulder, M. Weinmann
Self-supervised learning (SSL) is a promising approach for medical imaging tasks by reducing the need for labeled data, but most existing SSL methods treat each scan as an isolated sample and overlook the fact that patients often have multiple radiographs taken over time. These longitudinal sequences—multiple scans of the same hip acquired at different visits—encode the natural progression of osteoarthritis (OA) and thus could enrich representation learning. In this study, we evaluate whether incorporating temporal information from these longitudinal radiographic sequences into SSL pretraining yields more transferable representations and leads to improved downstream classification of hip OA severity. We focus on a temporal contrastive task (Contrastive Predictive Coding, CPC), which learns to predict future scan representations from earlier ones, and compare it to a SimCLR-based pretraining that treats each radiograph independently. We also investigate a multitask framework that combines both objectives — either by sequentially pretraining with CPC then SimCLR, or by interleaving the two tasks. Experiments on the Osteoarthritis Initiative (OAI) dataset for binary classification of KL-grade severity show that CPC alone does not surpass SimCLR-based pretraining. However, both the sequential and interleaved multitask approaches significantly improve classification accuracy over either single-task method. These findings demonstrate that even though temporal prediction by itself isn’t sufficient — combining temporal and within-scan contrastive learning can yield stronger models for hip OA severity assessment. ...
Self Supervised Learning (SSL) has been shown to effectively utilise unlabelled data for pre-training models used in down-stream medical tasks. This property of SSL enables it to use much larger datasets when compared to supervised models, which require manually labelled data. Medical classification tasks often require the identification of patterns inside a small Region Of Interest (ROI) known to be relevant for radiographic diagnosis. This contrasts standard image classification tasks, which generally rely on broader patterns. To guide a model in learning such anatomically relevant features, we investigated the hip osteoarthritis classification performance of a ROI-guided Masked Autoencoder (MAE) with a Convolutional Neural Network (CNN)-based architecture. Unlike conventional MAEs, which learn latent features by reconstructing randomly masked images, our alternative uses generated anatomical landmarks to exclusively mask the ROI or background. Contradicting similar research on Vision Transformer (ViT)-based MAEs, random masking outperformed our ROI-guided alternatives, revealing a fundamental difference in what drives performance for the two architectures, and guiding future research on more sophisticated ROI-guided masking strategies. The code is available on GitHub: https://github.com/Jasperdetweede/AnatAMAE/ ...
Master thesis (2025) - J. Luu, J.H. Krijthe, G. van Tulder, E. Demirović
Statistical distribution alignment methods for domain adaptation assume similar class distributions across domains, but this assumption cannot always be guaranteed in medical imaging data. This research investigates the effect of cross-domain class imbalance on statistical distribution alignment in unsupervised domain adaptation for medical image classification. Our experiments demonstrate that statistical distribution alignment using MMD performs reliably under mild domain shifts but struggles when both severe cross-domain class imbalance and complex domain shifts are present. To address this, we implement class-conditioned domain alignment with a new weighted minibatch sampling method. Under conditions of extreme domain shift and severe cross-domain class imbalance, combining statistical distribution alignment with more complex sampling strategies results in small improvements compared to alignment with random sampling, suggesting that class-conditioned distribution alignment offers limited practical benefits. The model appears robust to label noise, but since the performance gains are tiny, the choice of sampling strategy could have limited influence on overall performance. In our experiments, we employ the CHECK and OAI hip X-ray datasets to investigate binary osteoarthritis classification under varying levels of domain shift and cross-domain class imbalance. ...
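For readers unfamiliar with MMD (maximum mean discrepancy), the alignment criterion mentioned above, the following is a minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel. The kernel choice, the `gamma` parameter, and the function names are assumptions for illustration; the thesis's class-conditioned alignment and weighted minibatch sampling are not reproduced here.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two feature samples.

    source, target: arrays of shape (n_samples, n_features)
    """
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return float(k_ss + k_tt - 2.0 * k_st)
```

Used as a training loss, a term like this pulls the feature distributions of two domains together. If the class proportions differ between domains, matching the overall distributions can misalign the classes, which is the failure mode the abstract investigates.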

Evaluating the Impact of Traditional Data Augmentation Techniques on the generalizability across Datasets

An accurate segmentation model for hip components could improve the diagnosis of Osteoarthritis, a prevalent age-related condition affecting joints. A significant challenge in developing effective and robust segmentation models is the domain differences across various datasets. In this study, we investigate the impact of different data augmentation and preprocessing techniques on the generalizability of femur segmentation models across datasets. Using two labeled datasets, we evaluate the performance of a U-Net segmentation model, focusing on the effectiveness of augmentations like image flipping, random rotations, blur, contrast, and brightness adjustments. Our findings reveal that certain augmentations, particularly random rotations of up to 15 degrees, vertical image flipping and light blurring, significantly improve the model’s generalization to another dataset, reducing boundary errors and enhancing segmentation accuracy. These results underscore the potential of targeted data augmentations in developing robust, generalizable models for hip joint component segmentation. ...

A Study on Generalization of Hip X-Ray Segmentation for Osteoarthritis

Osteoarthritis is a degenerative disease that affects the aging population by degrading the cartilage in the joints. Early and accurate diagnosis of this disease is key to effective treatment, and clinicians often use X-ray imaging for this purpose. This allows medical professionals to manually measure the joint space width (JSW) in X-ray images to determine the progression of the disease. This method, however, proves to be both time-consuming and variable between professionals. This research addresses the automation of the measurement of the JSW for the hip, using deep learning techniques, to improve precision and efficiency. The automated measurement of the JSW is challenged by variations in the imaging conditions across different clinical settings. To address these discrepancies and maintain good performance, domain adaptation techniques are used to counter these domain shifts and ensure a consistent JSW segmentation across different imaging domains. The study investigates whether a specific domain adaptation technique can enhance the accuracy and robustness of deep learning models specifically for femur segmentation in X-ray images across different datasets. A base deep learning model is developed for femur segmentation, and supervised domain adaptation is applied. The study compares the performance of the adapted model with the base model across two different datasets. Results indicate that supervised domain adaptation does not significantly improve the model’s robustness and accuracy in femur segmentation across two different datasets. These unexpected findings suggest that incorporating domain adaptation techniques may not always lead to a more reliable and efficient diagnosis of osteoarthritis, reducing the manual workload for clinicians. ...

Segmentation of the hip joint space based on a radial projection originating from the center of the femoral head

The severity of hip osteoarthritis (OA) is measured, among other things, by the minimal distance between the femoral head and the acetabular roof in an X-ray image. However, the whole joint space profile might be a more accurate estimator, since it would include irregularities in the bone surface. These irregular bulges (osteophytes) on the bone surface are one of the signals that a person might have OA. Thus the stage of OA might be better estimated automatically by having this data in the joint space profile instead of just using the minimal joint space.

For this joint space profile, the distance between the femoral head and the acetabular roof needs to be calculated. Therefore, the positions of these parts in the hip joint need to be known. These can be retrieved from e.g. a segmentation mask.

One way of calculating the distance in a joint is to use a radial projection. A radial projection is a way of projecting points from a curved space to a plane by projecting lines from a central point along increasing angles.

In this paper, we investigate how the joint space profile can be segmented most accurately from a radial projection originating from the center of the femoral head by comparing several noise-filtering and edge-finding algorithms. We then show that a custom algorithm based on the theory behind edge detection in noisy images works most reliably and accurately.

There are still multiple points of improvement for this algorithm. The femoral head can be segmented more accurately than the acetabular roof; the segmentation of the latter could be optimized by detecting the brightest line (peaks) instead of the most sudden change (steepest gradient) in the X-ray image as the edge for the femoral head. The algorithm could be further improved by taking care of local outliers of those edges.

In conclusion, this paper compares multiple ways of segmenting the joint space of the hip joint. The best-performing algorithm could in the future be used in an assisting tool for doctors to highlight important irregularities and measurements in the hip joint space. ...
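The radial projection described in this abstract can be sketched in a few lines of NumPy. This is an illustration only: it uses nearest-neighbour sampling along rays from an assumed femoral head centre and a plain steepest-gradient edge pick, not the custom noise-aware algorithm the paper reports as best. Function names and parameters are hypothetical.

```python
import numpy as np

def radial_projection(image, center, n_angles=180, n_radii=100, max_radius=None):
    """Sample a 2D image along rays from center; returns (n_angles, n_radii).

    center is (row, col). Nearest-neighbour sampling is used for simplicity.
    """
    h, w = image.shape
    cy, cx = center
    if max_radius is None:
        max_radius = min(cy, cx, h - 1 - cy, w - 1 - cx)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    radii = np.linspace(0.0, max_radius, n_radii)
    ys = cy + radii[None, :] * np.sin(angles[:, None])
    xs = cx + radii[None, :] * np.cos(angles[:, None])
    yi = np.clip(np.rint(ys).astype(int), 0, h - 1)
    xi = np.clip(np.rint(xs).astype(int), 0, w - 1)
    return image[yi, xi]

def steepest_gradient_edge(profile):
    """Per ray, the radial index with the largest absolute intensity change."""
    return np.argmax(np.abs(np.diff(profile, axis=1)), axis=1)
```

Each row of the projected image is one ray, so a roughly circular bone contour becomes a roughly horizontal line, and the joint space width per angle is the difference between two detected edge positions along that row.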
Deep learning based architectures have been applied to semantic segmentation tasks in medical imaging with great success. However, such models are heavily reliant on the quality of the ground truth segmentation mask and hence are susceptible to label noise. To address this issue, this paper introduces SuperLoss, a loss function that pushes semantic boundaries towards superpixel edges. Superpixels are compact, homogeneous regions within an image that group pixels with similar characteristics, such as pixel intensity. Our loss can be combined with other loss functions for different segmentation architectures. We demonstrate our framework on a combination of two large public datasets of hip joint X-Ray images. We compare a U-Net model with and without our loss, when trained with different fractions of noise in the training dataset. Our approach achieves a 1-2% improvement in Intersection-over-Union and Hausdorff distance for some cases, yet yields worse results in some other cases. We also perform hypothesis testing and show that our results are statistically significant with low to medium effect size. ...