Intensity-Aware Rank Estimation for Dimensionality Reduction in Imaging Mass Spectrometry

Abstract

Imaging Mass Spectrometry (IMS) is a spectral imaging technique that reveals the spatial distribution of molecules by collecting a mass spectrum for every pixel across a tissue sample. As such, IMS enables the detection of anomalies introduced by disease in tissue samples and offers deeper molecular-level insight into biological processes. The dimensionality of IMS data is high, because every bin (or ion) along a mass spectrum represents a separate image and each image contains a relatively large number of pixels. Manual analysis suffers from this high dimensionality because visualization becomes increasingly difficult. Furthermore, computational analysis of such large datasets becomes problematic or infeasible in terms of both time and computational resources. Moreover, the dimensionality of current IMS measurements hampers new applications that capture even more data. Linear dimensionality reduction methods, such as Principal Component Analysis (PCA) and Nonnegative Matrix Factorization (NMF), seek to reduce these datasets to a set of (principal) components. These components span an underlying feature subspace within the original measurement space. Rank estimation determines the number of components needed to represent the original dataset in a lower-dimensional space while incurring minimal information loss. In the context of IMS, this task is typically performed without the use of domain-specific knowledge. Intensity-aware rank estimation seeks to utilize domain knowledge, in the form of an ion intensity threshold, to help estimate the rank. This threshold emerges naturally from IMS, due to prior knowledge of instrument and ionization process inaccuracies in the low ion intensity region. The ion intensity threshold defines a lower bound above which variations in measurements are considered reliable.
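As a rough illustration of the conventional, intensity-agnostic rank estimation described above, the following minimal Python sketch picks the smallest number of principal components whose cumulative explained variance reaches a chosen cutoff. The function name, the 99.9% cutoff, and the synthetic pixels-by-ions matrix are assumptions made for illustration only; they are not taken from the thesis.

```python
import numpy as np

def explained_variance_rank(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained variance reaches `target` (a hypothetical cutoff)."""
    Xc = X - X.mean(axis=0)                  # center every ion image (column)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values of centered data
    ratio = s**2 / np.sum(s**2)              # explained-variance ratio per component
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio reaches the target, as a count.
    k = np.searchsorted(cumulative, target) + 1
    return int(min(k, len(s)))

# Synthetic stand-in for an IMS dataset: rows are pixels, columns are ions.
rng = np.random.default_rng(0)
n_pixels, n_ions, true_rank = 500, 60, 3
scores = rng.random((n_pixels, true_rank))
loadings = rng.random((true_rank, n_ions))
X = 100 * scores @ loadings + rng.normal(scale=0.01, size=(n_pixels, n_ions))

rank = explained_variance_rank(X, target=0.999)
print(rank)
```

Note that the cutoff here is an arbitrary percentage with no connection to the physics of the instrument, which is the gap that an intensity threshold is meant to address.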
Establishing an intensity-aware version of rank estimation requires the threshold, defined in the original measurement space, to be linked to the abstract feature subspace, defined by NMF or PCA, where the rank estimation takes place. This connection is nontrivial to make and is, therefore, a central topic of this thesis. Furthermore, intensity-aware rank estimation requires the abstract subspace to represent the majority of the information above the threshold in the first set of components, which is not guaranteed in pure NMF and PCA formulations. In this thesis, we demonstrate threshold-aware rank estimation and residual-fraction rank estimation, two methods that make rank estimation for PCA intensity-aware. Threshold-aware rank estimation applies a histogram transformation to the intensities in the original measurement space to emphasize threshold-exceeding intensities. Subsequently, we estimate the rank based on the percentage of explained variance. Residual-fraction rank estimation uses untransformed measurements and instead estimates the rank based on the ratio of the above- and below-threshold residuals. We demonstrate that both methods find the correct rank in a synthetic dataset. With threshold-aware rank estimation applied to an IMS dataset, we show that applying the transformation before PCA leads to a lower overall rank estimate based on a percentage of the explained variance. With residual-fraction rank estimation applied to an IMS dataset, we show that we can obtain rank estimates, based on the structure of the dataset, that are close to cross-validation rank estimates for the same dataset.
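The abstract does not spell out how the residual ratio is computed, so the following Python sketch shows only one possible reading of residual-fraction rank estimation and should not be taken as the thesis's definition. It assumes that a residual entry counts as "above threshold" when its absolute value exceeds the ion intensity threshold, and it returns the smallest rank at which the share of residual energy in such entries drops to a tolerance. The function name, the `max_fraction` tolerance, and the synthetic data are all hypothetical.

```python
import numpy as np

def residual_fraction_rank(X, threshold, max_fraction=0.01):
    """Hypothetical reading of residual-fraction rank estimation:
    smallest PCA rank whose reconstruction residual carries at most
    `max_fraction` of its energy in entries exceeding `threshold`."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    for k in range(len(s) + 1):
        X_k = mean + (U[:, :k] * s[:k]) @ Vt[:k]   # rank-k PCA reconstruction
        R = X - X_k                                # residual in measurement space
        energy = R**2
        total = energy.sum()
        above = energy[np.abs(R) > threshold].sum()
        fraction = above / total if total > 0 else 0.0
        if fraction <= max_fraction:               # remaining variation is
            return k                               # (assumed) unreliable
    return len(s)

# Synthetic data (assumption): rank-3 signal plus low-intensity noise.
rng = np.random.default_rng(0)
n_pixels, n_ions, true_rank = 500, 60, 3
scores = rng.random((n_pixels, true_rank))
loadings = rng.random((true_rank, n_ions))
X = 100 * scores @ loadings + rng.normal(scale=0.5, size=(n_pixels, n_ions))

estimated = residual_fraction_rank(X, threshold=5.0)
print(estimated)
```

In this sketch the threshold is applied directly in the original measurement space, so no percentage of explained variance has to be chosen; the tolerance `max_fraction` remains a free parameter of the illustration.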

Files

Thesis_thijsvanwinden_final.pd... (pdf)
(pdf | 8.65 Mb)
- Embargo expired on 17-04-2020
Unknown license