High-Dimensional Data Visualization via Sampling-Based Approaches
Effect of Perplexity at different levels of Sampling-Based Approach
M.A. Bhatti (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Martin Skrodzki – Mentor (TU Delft - Computer Graphics and Visualisation)
K Hildebrandt – Mentor (TU Delft - Computer Graphics and Visualisation)
C. Lofi – Graduation committee member (TU Delft - Web Information Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Visualizing high-dimensional data is a key challenge in modern data analysis. t-distributed Stochastic Neighbor Embedding (t-SNE) is a popular nonlinear dimensionality reduction technique that maps such data into a low-dimensional embedding while preserving local relationships. A critical hyperparameter in t-SNE is perplexity. Choosing an appropriate perplexity value for a particular use case is non-trivial, especially for large datasets, where repeated t-SNE computations become computationally prohibitive. To mitigate this, the sampling-based approach runs t-SNE twice: first on a downsampled subset of the data and then on the full dataset. This introduces two perplexity parameters: the sample perplexity for the first run and the full perplexity for the second run.
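As a rough illustration of where the two perplexity parameters enter, the two runs could be sketched with scikit-learn as follows. This is a minimal sketch and not the thesis implementation: the function name `two_stage_tsne`, uniform random sampling, and the way the second run is seeded (each point starts near the embedded position of its nearest sampled point) are all assumptions made for this example, since the abstract does not specify how the first run informs the second.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors


def two_stage_tsne(X, sample_size, sample_perplexity, full_perplexity, seed=0):
    """Hypothetical two-run t-SNE: embed a sample, then embed the full data."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Assumption: uniform random downsampling without replacement.
    sample_idx = rng.choice(n, size=sample_size, replace=False)

    # First run: t-SNE on the sample only, using the sample perplexity.
    sample_emb = TSNE(
        n_components=2,
        perplexity=sample_perplexity,
        init="pca",
        random_state=seed,
    ).fit_transform(X[sample_idx])

    # Assumption: every point starts at the embedded position of its nearest
    # sampled point (in the original space), plus a small jitter so that
    # points sharing a sampled neighbour do not start exactly on top of
    # each other.
    nn = NearestNeighbors(n_neighbors=1).fit(X[sample_idx])
    nearest = nn.kneighbors(X, return_distance=False)[:, 0]
    init = sample_emb[nearest] + rng.normal(scale=1e-2, size=(n, 2))

    # Second run: t-SNE on the full dataset, using the full perplexity.
    full_emb = TSNE(
        n_components=2,
        perplexity=full_perplexity,
        init=init.astype(np.float32),
        random_state=seed,
    ).fit_transform(X)

    return sample_idx, sample_emb, full_emb
```

Calling `two_stage_tsne(X, sample_size=60, sample_perplexity=10, full_perplexity=30)` on an array `X` with a few hundred rows returns the sampled indices together with both embeddings. In scikit-learn, each perplexity must be smaller than the number of points in the corresponding run.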
In this work, we systematically investigate the impact of varying combinations of sample perplexity and full perplexity on the quality of the final t-SNE embedding. Our findings show that sample perplexity predominantly determines the global layout of the embedding, while full perplexity influences local refinement. We also compare our approach with different strategies for choosing perplexity values, and find that while some offer better preservation of structural details, they provide less flexibility.