Multi-Metric Clinical Validation of an Auto-Contour Refinement Tool in Head-and-Neck Radiotherapy
J.L. Scharn (TU Delft - Mechanical Engineering)
N. Tümer – Mentor (TU Delft - Mechanical Engineering)
Frank J.W.M. Dankers – Mentor (Leiden University Medical Center)
Prerak Mody – Mentor (Leiden University Medical Center)
Marius Staring – Graduation committee member (Leiden University Medical Center)
Q. Tao – Graduation committee member (TU Delft - Applied Sciences)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Background:
Interactive segmentation models combine auto-segmentation methods with user interaction to overcome the inconvenience of manually adjusting contours generated by imperfect auto-contouring models. However, these models have not yet been implemented for tumor target volume segmentation in clinical radiotherapy settings. Therefore, this study validates an auto-contour refinement tool previously developed at the LUMC for Head-and-Neck (H&N) radiotherapy and assesses its robustness and trustworthiness.
Methods:
In a user study, six non-expert participants iteratively refined a contour-refinement model prediction to align it as closely as possible with the corresponding ground truth for six tumor volumes from six patients. The contour-refinement model updated its prior predictions based on user-provided foreground (tumor) and background (non-tumor) scribbles. This enabled three-dimensional (3D) refinement until a satisfactory result was achieved.
User inputs were collected and analyzed. Robustness of the model was evaluated with performance metrics such as Dice and Surface Dice, and trustworthiness was evaluated with two metrics newly introduced in this study: local and non-local (Surface) Dice.
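As an illustration of the metrics named above, the following is a minimal sketch of the volumetric Dice score and one possible way to split it into a local and a non-local part. The abstract does not define the local and non-local metrics, so the split used here (voxels within a fixed radius of a scribble count as local, all others as non-local) is an assumption, as are the function names `dice`, `local_mask`, and `local_nonlocal_dice` and the `radius` parameter. Surface Dice is not covered.

```python
import numpy as np


def dice(pred, truth):
    """Volumetric Dice overlap between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement here
    return 2.0 * np.logical_and(pred, truth).sum() / denom


def local_mask(scribble, radius):
    """Voxels within `radius` (Euclidean, in voxel units) of any scribble voxel.

    Hypothetical definition of the "local" region; brute force, so only
    suitable for small arrays.
    """
    scribble = np.asarray(scribble, dtype=bool)
    pts = np.argwhere(scribble)  # (M, ndim) scribble coordinates
    if len(pts) == 0:
        return np.zeros(scribble.shape, dtype=bool)
    # (N, ndim) coordinates of every voxel, in C order
    grid = np.indices(scribble.shape).reshape(scribble.ndim, -1).T
    d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)  # (N, M)
    return (d2.min(axis=1) <= radius ** 2).reshape(scribble.shape)


def local_nonlocal_dice(pred, truth, scribble, radius):
    """Dice restricted to the local region and to its complement (assumed split)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    near = local_mask(scribble, radius)
    return dice(pred[near], truth[near]), dice(pred[~near], truth[~near])


# Toy 2D example: the prediction over-segments one column next to the scribble.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True
scribble = np.zeros((4, 4), dtype=bool)
scribble[1, 3] = True

overall = dice(pred, truth)
local, non_local = local_nonlocal_dice(pred, truth, scribble, radius=1)
print(overall, local, non_local)
```

Under this assumed split, a trustworthy refinement step would be expected to change the local score while leaving the non-local score largely unchanged.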
Results:
Robust behavior is observed, as the model reacts in a highly consistent manner across all users. Only minor differences in model performance (Delta Dice scores of 0.1407 vs. 0.1296) were observed across users when different user inputs were applied.
The AI pencil yields a strong initial improvement compared to manual annotations (27.4% vs. 6.4%, Wilcoxon p = 0.047), whereas subsequent iterations show variability. This variability was frequently observed in cases of incorrect user input, distortions caused by dental implants, anatomically complex regions, and during the segmentation of slices at the tumor boundaries.
In all other cases, the model showed high trustworthiness, as it followed the user's intent during the contouring process.
Conclusion:
The incorporation of user feedback into the contour-refinement model results in a rapid improvement in segmentation quality across the entire volume. However, manual refinement by clinicians remains necessary for anatomically complex slices.
Overall, this research shows that the model is robust to variations in user input and that, apart from the first few iterations, it produces no spurious changes in non-local areas. These are important findings when working towards clinical adoption of interactive contour-refinement models.