Data-driven Parameter Optimization and Spatial Awareness for Uniform Manifold Approximation and Projection (UMAP) of IMS data sets

A step towards parameter-free dimensionality reduction

Abstract

Imaging Mass Spectrometry (IMS) is a powerful technique capable of extracting unlabeled spatial and chemical information from a biological tissue sample. Continued technological advances have led to rapid growth in IMS data set sizes, which scale quadratically with the increasingly fine spatial resolution of the ion images. Dimensionality reduction techniques aim to make large data sets practically approachable by reducing the number of dimensions in the data while retaining as much information as possible. Nonlinear Dimensionality Reduction (NLDR) methods attempt to uncover an underlying nonlinear manifold or structure in the data by constructing a Low-Dimensional (LD) feature space whose variables are nonlinear combinations of the original features (i.e. mass-to-charge ratio (m/z) bins in IMS measurements). t-Distributed Stochastic Neighbor Embedding (t-SNE), which uses force-directed graphs, has been a common NLDR approach. Uniform Manifold Approximation and Projection (UMAP) works in a similar manner, but rests on a more versatile mathematical foundation in topology, is generally faster, and tends to scale better than t-SNE.

In this thesis, we introduce UMAPLUS, an extension of UMAP that takes not only spectral but also spatial information into account. Our experiments show that UMAPLUS extracts the structure of a synthetic IMS data set better than UMAP. Besides UMAP’s standard spectral and random initializations, we introduce a new spatial initialization that provides more intuitive insight into the LD embedding relative to the spatial image domain. Standard UMAP has several parameters that influence the consistency, reliability, and quality of its LD embedding. Thus far, the influence of these parameters on IMS data sets has barely been investigated, and previous studies have consistently applied the default settings of UMAP. In this thesis, we construct Data-Driven UMAP (DD-UMAP), which optimizes the UMAP parameters in a data-driven manner. Several distance metrics are investigated, with cosine similarity providing the most robust results. A naive 1-D optimization procedure is compared to a multivariate Bayesian optimization approach that can optimize multiple parameters simultaneously. Using an evaluation function based partly on the cost function of UMAP and partly on added spatial information, DD-UMAP is able to estimate and use an optimized input parameter set for UMAP in an unsupervised manner on unlabeled IMS data sets. This is demonstrated on synthetic data and on two real-world IMS data sets: a full mouse pup and a murine kidney.