Partial Least Squares on Hyperspectral data

A data-driven approach for predicting the growth of potato plants

Abstract

The potato is a plant that can grow from another potato tuber. Farmers and producers have to contend with the declining quality of soil and are producing less vital potato tubers. The differences in emergence speed between batches of the same variety of potato tubers cannot be explained. This is why Project ‘Flight to Vitality’ (FtV) is developing a diagnostic test that uses various measurements to predict the vitality of different varieties of potato tubers. One of these measurements consists of hyperspectral images taken of potato tubers prior to planting. Another measurement is the actual growth of the potato plants, which was recorded after emergence. This thesis attempts to relate both measurements by means of regression. After grouping both data sets by batch of tubers, we are left with an explanatory matrix that has more columns than rows. This causes an ordinary least squares model to overfit the training data and generalize poorly to new data. Therefore an alternative regression model, Partial Least Squares (PLS), is proposed. The goal of PLS is to find latent variables that maximize the covariance between both sets, explain the variance in the explanatory space as well as possible, and provide good predictions in the response space. The predictive performance of PLS, which is measured for the most part with the Mean Squared Error, depends on the number of latent variables used. Cross-validation is performed to find this number. By splitting the data into training and testing sets, we evaluate PLS and find that some of the variances are explained well by PLS. This model is extended by including variable selection on the explanatory variables, which are the frequency bands taken from hyperspectral imaging. With this extension, the same experiment was rerun and we saw that the predictive performance can increase. However, the number of variables omitted seems to be inconsistent across different experiments.
We have also found that the performance can decrease slightly while the number of omitted variables remains much more consistent (with the omitted variables overlapping across different experiments). With this established baseline model, we have also looked at the predictive information in different tuber parts. In a similar experiment, the model appears to be capable of explaining almost all variance and performs very well. We have also looked at leaving one tuber variety out before training the model. This variety is then used for evaluation. From this experiment, it seems that there is a strong variety-dependent component in the FtV data, which makes it impossible to predict the relative vitality of the tubers. Finally, we compare PLS with LSQR, another model that belongs to the same family of Krylov subspace methods. The same experiment is conducted on both models, and it seems that PLS and LSQR are not identical in terms of the number of latent variables found. Regardless, as an independent regression estimator, LSQR is much faster than PLS.
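
To illustrate the core procedure summarized above, the following is a minimal sketch of PLS regression with cross-validation used to choose the number of latent variables. It is not the thesis's implementation: it assumes a single response variable (PLS1 fitted with the NIPALS algorithm), uses NumPy only, and runs on synthetic data with more columns than rows instead of the FtV data. The function names `pls1_fit` and `cv_mse` and all data dimensions are hypothetical choices made for this example.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit PLS1 with NIPALS; return (coef, intercept) for y ~ X @ coef + intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean          # centre both blocks
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                        # direction of maximal covariance with y
        w /= np.linalg.norm(w)               # (assumes y is not already fully explained)
        t = Xk @ w                           # scores: one latent variable
        tt = t @ t
        p = Xk.T @ t / tt                    # X loadings
        c = yk @ t / tt                      # regression of y on the scores
        Xk = Xk - np.outer(t, p)             # deflate X
        yk = yk - c * t                      # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space
    return coef, y_mean - x_mean @ coef


def cv_mse(X, y, max_components, n_folds=5, seed=0):
    """Cross-validated Mean Squared Error for 1..max_components latent variables."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = {}
    for k in range(1, max_components + 1):
        sq_err = 0.0
        for fold in folds:
            train = np.setdiff1d(idx, fold)  # all samples not in the held-out fold
            coef, b0 = pls1_fit(X[train], y[train], k)
            sq_err += np.sum((y[fold] - (X[fold] @ coef + b0)) ** 2)
        scores[k] = sq_err / len(y)
    return scores


# Synthetic stand-in for the grouped data: 30 rows (batches), 100 columns (bands),
# generated from 3 hidden factors plus noise. These numbers are assumptions.
rng = np.random.default_rng(0)
n, p, k_true = 30, 100, 3
T = rng.normal(size=(n, k_true))
X = T @ rng.normal(size=(k_true, p)) + 0.1 * rng.normal(size=(n, p))
y = T @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

scores = cv_mse(X, y, max_components=8)
best_k = min(scores, key=scores.get)         # number of latent variables with lowest CV MSE
coef, intercept = pls1_fit(X, y, best_k)
print("selected number of latent variables:", best_k)
print("cross-validated MSE:", scores[best_k])
```

Because each weight vector is chosen proportional to the covariance between the deflated explanatory matrix and the response, the fit stays well defined even though the matrix has more columns than rows, which is the situation in which ordinary least squares breaks down.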