Quasi-Newton Methods for tSNE

Abstract

tSNE is a popular technique for visualizing high-dimensional data. It finds a low-dimensional representation of the data, also known as an embedding, by optimizing a highly non-linear cost function. The optimization is performed iteratively, often with first-order methods such as gradient descent (GD). The performance of iterative optimization procedures depends on the time it takes to execute each iteration and on the number of iterations required to converge to an optimum. In the context of tSNE, different methods that tackle these issues have appeared, but, despite their success, they are still limited by their separate development. For instance, BH-SNE, one of the most popular variants of tSNE, uses the Barnes-Hut algorithm to accelerate each iteration, but still relies on GD, which needs a large number of iterations to converge. On the other hand, quasi-Newton optimization methods are effective at reducing the number of iterations, but available implementations are constrained by their additional per-iteration cost. In our work, we attempted to reconcile these two research branches. To this end, we investigated the use of Line Search (LS) and Trust Region (TR) based quasi-Newton methods for tSNE and improved their performance by identifying and addressing several of their bottlenecks. Among the aspects of quasi-Newton optimization for tSNE that we covered were the selection of a suitable Hessian approximation, the use of approximate gradients and cost function evaluations, and the choice of the linear solver that the algorithm uses to find the next iterate. The resulting techniques, which we denote SD-LS-CG and SD-TR-CG, have a lower per-iteration cost than alternative quasi-Newton methods and converge in fewer iterations than the traditional implementation of tSNE, which uses a highly tuned version of GD. The results suggest that these quasi-Newton methods could be a better alternative to fast tSNE implementations such as BH-SNE, given their overall faster convergence. Furthermore, we observed that SD-LS-CG consistently avoided local optima when initialized with PCA, suggesting that the choice of Hessian approximation can play a role similar to that of the early exaggeration phase in tSNE.
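
To make the general class of methods discussed above concrete, the sketch below shows a generic line-search quasi-Newton iteration for a tSNE-style cost, in which the model system B p = -g is solved approximately with conjugate gradients (CG). This is only an illustrative sketch, not the SD-LS-CG algorithm itself: the callbacks `cost_and_grad` and `hvp` are hypothetical placeholders for a tSNE cost/gradient routine and a Hessian-approximation vector product, and details such as the choice of Hessian approximation, approximate gradient and cost evaluations, and stopping criteria are omitted.

```python
import numpy as np

def conjugate_gradient(hvp, g, tol=1e-5, max_iter=50):
    """Approximately solve B p = -g with CG, given only
    Hessian-approximation vector products hvp(v) ~= B v."""
    p = np.zeros_like(g)
    r = -g.copy()                # residual of B p = -g at p = 0
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Bd = hvp(d)
        alpha = rs_old / (d @ Bd)
        p += alpha * d
        r -= alpha * Bd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return p

def quasi_newton_embedding(cost_and_grad, hvp, y0, n_iter=100):
    """Generic line-search quasi-Newton loop: each iteration solves the
    model system with CG and backtracks on the (tSNE-style) cost."""
    y = y0.copy()
    f, g = cost_and_grad(y)
    for _ in range(n_iter):
        # Search direction from the quasi-Newton model.
        p = conjugate_gradient(hvp, g.ravel()).reshape(y.shape)
        # Simple backtracking line search with an Armijo condition.
        step = 1.0
        while True:
            f_new, g_new = cost_and_grad(y + step * p)
            if f_new < f + 1e-4 * step * (g.ravel() @ p.ravel()) or step < 1e-4:
                break
            step *= 0.5
        y, f, g = y + step * p, f_new, g_new
    return y
```

Note that this sketch evaluates the exact cost and gradient at every trial step; the approach summarized in the abstract additionally relies on approximate gradients and cost function evaluations to keep the per-iteration cost low.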