Title
Analyzing Components of a Transformer under Different Dataset Scales in 3D Prostate CT Segmentation
Author
Tan, Yicong (Student TU Delft)
Mody, P. (Leiden University Medical Center)
van der Valk, Viktor (Leiden University Medical Center)
Staring, M. (Leiden University Medical Center)
van Gemert, J.C. (TU Delft Pattern Recognition and Bioinformatics)
Contributor
Colliot, Olivier (editor)
Isgum, Ivana (editor)
Date
2023
Abstract
Literature on medical imaging segmentation claims that hybrid UNet models containing both Transformer and convolutional blocks perform better than purely convolutional UNet models. This recently touted success of hybrid Transformers warrants an investigation into which of its components contribute to its performance. Moreover, previous work is limited to analysis at fixed dataset scales and to unfair comparisons with other models whose parameter counts are not equivalent. Here, we investigate the performance of a hybrid Transformer network, i.e. the nnFormer, for organ segmentation in prostate CT scans. We do this in the context of replacing its various components and by constructing learning curves that plot model performance at different dataset scales. To compare with literature, the first experiment replaces all the shifted-window (Swin) Transformer blocks of the nnFormer with convolutions. Results show that the convolution prevails as the data scale increases. In the second experiment, to reduce complexity, the self-attention mechanism within the Swin-Transformer block is replaced with a similar albeit simpler spatial mixing operation, i.e. max-pooling. We observe improved performance for max-pooling at smaller dataset scales, indicating that the window-based Transformer may not be the best choice at either small or large dataset scales. Finally, since convolution has an inherent local inductive bias of positional information, we conduct a third experiment to instill such a property into the Transformer by exploring two kinds of positional encodings. The results show only insignificant improvements after adding positional encoding, indicating the hybrid Swin-Transformer's deficiency in capturing positional information given our dataset at its various scales.
Through this work, we hope to motivate the community to use learning curves under fair experimental settings to evaluate the efficacy of newer architectures like Transformers for their medical imaging tasks. Code is available at https://github.com/prerakmody/window-transformer-prostate-segmentation.
Subject
3D Swin-Transformer
Convolution
Pooling
Positional Encoding
Learning Curves
Radiotherapy
Segmentation
To reference this document use:
http://resolver.tudelft.nl/uuid:161767b4-242d-4a01-a1fd-8a95f2c3c611
DOI
https://doi.org/10.1117/12.2651572
Publisher
SPIE
Embargo date
2023-10-03
ISBN
9781510660335
Source
Medical Imaging 2023: Image Processing
Event
Medical Imaging 2023: Image Processing, 2023-02-19 → 2023-02-23, San Diego, United States
Series
Progress in Biomedical Optics and Imaging - Proceedings of SPIE, 1605-7422, 12464
Bibliographical note
Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project, https://www.openaccess.nl/en/you-share-we-take-care. Otherwise, as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses Dutch legislation to make this work public.
Part of collection
Institutional Repository
Document type
conference paper
Rights
© 2023 Yicong Tan, P. Mody, Viktor van der Valk, M. Staring, J.C. van Gemert