Retrieving actionable information from large datasets is becoming more computationally expensive as datasets continue to grow. Reducing dataset sizes with dimensionality reduction techniques is often necessary for statistical analysis techniques, such as classification, to be computationally feasible. Most dimensionality reduction methods use only the data itself to accomplish their task. Datasets used for classification, however, are also accompanied by a set of class labels. This extra information can improve dimensionality reduction techniques by explicitly preserving features that explain differences between classes. A field where high-dimensional and large datasets are standard is Imaging Mass Spectrometry (IMS), a technique that simultaneously records the abundance and spatial location of molecules throughout biological tissue samples. Classification has been applied to IMS datasets for a wide range of scenarios, including the diagnosis of disease, distinguishing between tumour types for personalized treatment, and identifying biomarkers. A recently introduced dimensionality reduction method called Soft Discriminant Map (SDM), designed to incorporate class information and prevent overfitting when used on high-dimensional datasets, is a promising candidate to reduce the size and dimensionality of IMS datasets. However, SDM currently requires manual setting of a free parameter β that influences class separation in the newly constructed feature space. This thesis explores the use of SDM on IMS datasets in classification use cases and proposes a framework to set β in a data-driven way: Data-Driven Soft Discriminant Map (DD-SDM). Furthermore, the sensitivity of the classification performance to changes in β is examined. DD-SDM is compared to similar state-of-the-art dimensionality reduction methods in terms of classification performance.
The experiments show that DD-SDM successfully finds a value of β for which the classification performance is on par with, or in some scenarios better than, that of state-of-the-art dimensionality reduction methods while using fewer features. Setting β either too low or too high results in a suboptimal feature space and worsens classification performance. Golden section search, the search strategy used to find the optimal β in DD-SDM, succeeds in finding the optimal β in fewer iterations than more naive methods. Using an artificial dataset in combination with a novel evaluation metric, the Peak Conservation Score (PCS), the experiments demonstrate the distinctive ability of DD-SDM to discard features that are common between classes and to actively select for discriminative features. The DD-SDM framework is furthermore applied to real-world IMS measurements of rat brain and mouse kidney tissue.
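
As an illustration of the search strategy named above, the following is a minimal Python sketch of a generic golden section search that maximizes a unimodal score over an interval. It is not the DD-SDM implementation. The function names, the search interval [0, 1], the tolerance, and the toy objective `toy_score` are all assumptions made for this example; in DD-SDM the objective would be a classification performance measure evaluated as a function of β.

```python
import math


def golden_section_maximize(f, lo, hi, tol=1e-3, max_evals=100):
    """Maximize a unimodal function f on [lo, hi] by golden section search.

    Returns (x_best, n_evals), where x_best is the midpoint of the final
    bracket and n_evals counts how often f was evaluated.
    """
    inv_phi = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)  # left interior point
    d = a + inv_phi * (b - a)  # right interior point
    fc, fd = f(c), f(d)
    n_evals = 2
    while (b - a) > tol and n_evals < max_evals:
        if fc > fd:
            # Maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
        n_evals += 1
    return (a + b) / 2, n_evals


def toy_score(beta):
    # Hypothetical stand-in for "classification performance at this beta":
    # a single peak at beta = 0.3, worse when beta is too low or too high.
    return -(beta - 0.3) ** 2


beta_hat, n_evals = golden_section_maximize(toy_score, 0.0, 1.0, tol=1e-4)
print(f"estimated beta: {beta_hat:.4f} after {n_evals} evaluations")
```

Each iteration reuses one of the two interior evaluations and shrinks the bracket by a constant factor of about 0.618, so only one new evaluation of the objective is needed per step. This matters when every evaluation involves constructing a feature space and training a classifier. The approach relies on the score having a single peak within the interval, which is consistent with the observation that setting β too low or too high worsens performance.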