K. Batselier
Please Note
32 records found
1
Tensor Networks for Model Predictive Control
A study of tensor train representation of the semi-explicit MPC operators
On top of this, hyperparameter tuning for these types of models can be costly and complex. A fully Bayesian kernel machine, employing a CP decomposition has been shown to perform well on regression and classification tasks without additional computational cost. hyperparameter tuning was aided for this model through the introduction of ARD priors. This model is still limited to a relatively small data dimension D, due to numerical instabilities caused by repeated Hadamard products in the CPD formulation. Alternative tensor decompositions such as the tensor train decomposition exist that do not rely on Hadamard products in their formulation and may therefore be more numerically stable.
This thesis proposes a Bayesian Tensor Train Kernel Machine, a fully Bayesian tensor train decomposed kernel machine. The proposed model will include ARD hyperpriors to aid in model selection, these priors allow for automatic inference of model complexity, as well as increasing model interpretability. Mean-field variational inference is used to approximate posterior distributions of all model parameters and hyperparameters. Experiments on synthetic datasets show the functioning of the ARD hyperpriors, and the increased numerical stability compared to the CP based model. Experiments on real world datasets showcases the performance of the proposed model in terms of prediction accuracy and uncertainty quantification compared to currently available models. ...
On top of this, hyperparameter tuning for these types of models can be costly and complex. A fully Bayesian kernel machine, employing a CP decomposition has been shown to perform well on regression and classification tasks without additional computational cost. hyperparameter tuning was aided for this model through the introduction of ARD priors. This model is still limited to a relatively small data dimension D, due to numerical instabilities caused by repeated Hadamard products in the CPD formulation. Alternative tensor decompositions such as the tensor train decomposition exist that do not rely on Hadamard products in their formulation and may therefore be more numerically stable.
This thesis proposes a Bayesian Tensor Train Kernel Machine, a fully Bayesian tensor train decomposed kernel machine. The proposed model will include ARD hyperpriors to aid in model selection, these priors allow for automatic inference of model complexity, as well as increasing model interpretability. Mean-field variational inference is used to approximate posterior distributions of all model parameters and hyperparameters. Experiments on synthetic datasets show the functioning of the ARD hyperpriors, and the increased numerical stability compared to the CP based model. Experiments on real world datasets showcases the performance of the proposed model in terms of prediction accuracy and uncertainty quantification compared to currently available models.
Inference is performed within a coordinate ascent variational inference (CAVI) framework. The proposed algorithm was applied to fUS datasets and validated against simultaneous recordings of neural firing rates. Results demonstrate that the model successfully captures neural activity, achieving a Pearson correlation coefficient (PCC) of approximately 0.24-0.30, and provides a modest improvement over the conventional boxcar input stimulus model. Additionally, the jointly estimated HRFs were consistent with existing literature, and regional HRF estimates revealed differences in response dynamics between the visual cortex and hippocampus, highlighting region-specific hemodynamic properties. ...
Inference is performed within a coordinate ascent variational inference (CAVI) framework. The proposed algorithm was applied to fUS datasets and validated against simultaneous recordings of neural firing rates. Results demonstrate that the model successfully captures neural activity, achieving a Pearson correlation coefficient (PCC) of approximately 0.24-0.30, and provides a modest improvement over the conventional boxcar input stimulus model. Additionally, the jointly estimated HRFs were consistent with existing literature, and regional HRF estimates revealed differences in response dynamics between the visual cortex and hippocampus, highlighting region-specific hemodynamic properties.
A Novel Multivariate Hidden Markov Model Learning method using Coupled Canonical Polyadic Decomposition
Applied on the Sleep Physionet dataset to uncover sleep stages
HMMs represent the probability of transitions between different states of a system, which operates as a discrete Markov chain and is not directly observable. Additionally, they describe the probability of observing specific measurements based on the current state of the model. These probabilities are characterized by a state transition matrix (T), an emission matrix (O), and an initial state distribution (π).
The current methods for fitting HMMs to data primarily utilize the Baum-Welch algorithm, which is a specific type of Expectation Maximization (EM). One of the advantages of Baum-Welch is its ability to be extended to continuous observations or multivariate data. However, the algorithm involves a forward-backward pass during each iteration, which makes it increasingly slow as the size of the dataset grows. Moreover, it can easily become trapped in local minima.
An alternative approach is to use Canonical Polyadic Decomposition (CPD) to decompose a Joint Probability Tensor (JPT). This decomposition allows for the extraction of factor matrices that can be used to calculate the HMM matrices (T, O, π). Compared to the Baum-Welch algorithm, CPD and other JPT decomposition methods tend to be faster. However, they are often sensitive and may suffer from instability if the data does not fully capture the statistical behavior of the underlying HMM. Additionally, methods for JPT decomposition in HMM learning have not yet been extended to multivariate datasets or continuous settings.
To enhance stability and accommodate multivariate data, we propose a novel method that extends JPT decomposition HMM learning to a multivariate setting. This involves using a coupled CPD problem, where each observational sequence has a separate emission matrix (O) but shares a common state transition matrix (T). By transitioning from Baum-Welch to CPD, the process of learning HMMs can be significantly accelerated. Coupled CPD makes HMM learning more robust than Uncoupled CPD by relating the multivariate data through a common transition matrix T and initial distribution π. This improvement should enhance the iterability of HMM methods and make them more suitable for rapid prototyping in scientific research and the aforementioned fields.
We show in the results that Coupled CPD indeed outperforms both Uncoupled CPD and the industry-standard Baum-Welch, in most cases. It has improved stability over both methods, as well as significantly improving the calculation time, and results in higher accuracy when it comes to estimating the HMM matrices. Finally, more future improvements are suggested concerning calculation time and extensions.
...
HMMs represent the probability of transitions between different states of a system, which operates as a discrete Markov chain and is not directly observable. Additionally, they describe the probability of observing specific measurements based on the current state of the model. These probabilities are characterized by a state transition matrix (T), an emission matrix (O), and an initial state distribution (π).
The current methods for fitting HMMs to data primarily utilize the Baum-Welch algorithm, which is a specific type of Expectation Maximization (EM). One of the advantages of Baum-Welch is its ability to be extended to continuous observations or multivariate data. However, the algorithm involves a forward-backward pass during each iteration, which makes it increasingly slow as the size of the dataset grows. Moreover, it can easily become trapped in local minima.
An alternative approach is to use Canonical Polyadic Decomposition (CPD) to decompose a Joint Probability Tensor (JPT). This decomposition allows for the extraction of factor matrices that can be used to calculate the HMM matrices (T, O, π). Compared to the Baum-Welch algorithm, CPD and other JPT decomposition methods tend to be faster. However, they are often sensitive and may suffer from instability if the data does not fully capture the statistical behavior of the underlying HMM. Additionally, methods for JPT decomposition in HMM learning have not yet been extended to multivariate datasets or continuous settings.
To enhance stability and accommodate multivariate data, we propose a novel method that extends JPT decomposition HMM learning to a multivariate setting. This involves using a coupled CPD problem, where each observational sequence has a separate emission matrix (O) but shares a common state transition matrix (T). By transitioning from Baum-Welch to CPD, the process of learning HMMs can be significantly accelerated. Coupled CPD makes HMM learning more robust than Uncoupled CPD by relating the multivariate data through a common transition matrix T and initial distribution π. This improvement should enhance the iterability of HMM methods and make them more suitable for rapid prototyping in scientific research and the aforementioned fields.
We show in the results that Coupled CPD indeed outperforms both Uncoupled CPD and the industry-standard Baum-Welch, in most cases. It has improved stability over both methods, as well as significantly improving the calculation time, and results in higher accuracy when it comes to estimating the HMM matrices. Finally, more future improvements are suggested concerning calculation time and extensions.
3D Primal Attention
Using the primal dual KMLSVD framework to describe self-attention in 3D
We propose that TNs, with their ability to represent high-dimensional data through low-rank structures, can effectively alleviate the limitations of kernel machines. Our research is structured around three central inquiries: first, we examine how TNs can accelerate kernel machine scalability while accurately approximating kernel functions; second, we elucidate the theoretical links between TN-constrained kernel machines and Gaussian processes, providing insights into convergence and generalization; finally, we introduce a novel optimization framework characterizing a specific TN, the multi-linear singular value decomposition (MLSVD), in terms of primal and dual problems. ...
We propose that TNs, with their ability to represent high-dimensional data through low-rank structures, can effectively alleviate the limitations of kernel machines. Our research is structured around three central inquiries: first, we examine how TNs can accelerate kernel machine scalability while accurately approximating kernel functions; second, we elucidate the theoretical links between TN-constrained kernel machines and Gaussian processes, providing insights into convergence and generalization; finally, we introduce a novel optimization framework characterizing a specific TN, the multi-linear singular value decomposition (MLSVD), in terms of primal and dual problems.
While probabilistic methods have many benefits, such as recursive estimation and uncertainty quantification, they often come with substantial memory and compute requirements.
Computational challenges are particularly pronounced in large-scale settings, where data sets contain a high number of measurements, and for high-dimensional problems, which require exponentially many parameters to describe probability distributions.
These scenarios can suffer from the curse of dimensionality, which requires exponentially growing computing resources, making conventional approaches computationally intractable.
This dissertation addresses computational challenges by leveraging tensor networks (TNs) to develop computationally efficient probabilistic algorithms.
TNs, also known as tensor decompositions, extend matrix decomposition to higher dimensions by representing large multidimensional arrays, i.e., tensors, in a compact, decomposed format, defined by TN components and TN ranks.
Under the assumption of low-rank structure, TNs enable efficient storage and computation, making large-scale and high-dimensional problems more tractable, even on resource-constrained hardware such as conventional laptops.
The focus of this work is on scalable solutions for Bayesian estimation problems involving Gaussian distributions and exact inference, including recursive filtering and Gaussian process (GP) regression. ...
While probabilistic methods have many benefits, such as recursive estimation and uncertainty quantification, they often come with substantial memory and compute requirements.
Computational challenges are particularly pronounced in large-scale settings, where data sets contain a high number of measurements, and for high-dimensional problems, which require exponentially many parameters to describe probability distributions.
These scenarios can suffer from the curse of dimensionality, which requires exponentially growing computing resources, making conventional approaches computationally intractable.
This dissertation addresses computational challenges by leveraging tensor networks (TNs) to develop computationally efficient probabilistic algorithms.
TNs, also known as tensor decompositions, extend matrix decomposition to higher dimensions by representing large multidimensional arrays, i.e., tensors, in a compact, decomposed format, defined by TN components and TN ranks.
Under the assumption of low-rank structure, TNs enable efficient storage and computation, making large-scale and high-dimensional problems more tractable, even on resource-constrained hardware such as conventional laptops.
The focus of this work is on scalable solutions for Bayesian estimation problems involving Gaussian distributions and exact inference, including recursive filtering and Gaussian process (GP) regression.
Many studies aim to remedy this problem from an implementation perspective by developing efficient data loading and processing pipelines.
This study, on the other hand, explores an alternative approach by augmenting data representations to achieve lower storage complexity known as point cloud compression.
A broad analysis is presented on novel point cloud compression codecs using tensor decomposition methods.
Several point cloud representations and tensor decomposition methods are considered over a range of hyperparameter choices and compression values.
In order to assess the performance of the presented codecs: the compression rate, quality of the reconstruction, and time complexity is compared to the octree-based baseline model: TMC13.
Compared to the baseline model, the performance of the presented tensor decomposition-based codecs falls short. One of the presented codecs does notably outperform the others. This codec uses synthetic tensorization followed by sorting using z-location and decomposition using the TT-SVD algorithm.
Sorting by z-value isolates the ground plane, which is a dominant low-rank feature, which can effectively be decomposed using the TT-SVD algorithm yielding adequate results.
Several limitations of the presented tensor decomposition-based codecs are: the omission of bitwise compression on the factor matrices, and the trade-off between bitwise precision and truncation due to tensor decomposition.
Future work could improve in these areas along with considering the use of different heuristics and optimizing the tensor network topology. ...
Many studies aim to remedy this problem from an implementation perspective by developing efficient data loading and processing pipelines.
This study, on the other hand, explores an alternative approach by augmenting data representations to achieve lower storage complexity known as point cloud compression.
A broad analysis is presented on novel point cloud compression codecs using tensor decomposition methods.
Several point cloud representations and tensor decomposition methods are considered over a range of hyperparameter choices and compression values.
In order to assess the performance of the presented codecs: the compression rate, quality of the reconstruction, and time complexity is compared to the octree-based baseline model: TMC13.
Compared to the baseline model, the performance of the presented tensor decomposition-based codecs falls short. One of the presented codecs does notably outperform the others. This codec uses synthetic tensorization followed by sorting using z-location and decomposition using the TT-SVD algorithm.
Sorting by z-value isolates the ground plane, which is a dominant low-rank feature, which can effectively be decomposed using the TT-SVD algorithm yielding adequate results.
Several limitations of the presented tensor decomposition-based codecs are: the omission of bitwise compression on the factor matrices, and the trade-off between bitwise precision and truncation due to tensor decomposition.
Future work could improve in these areas along with considering the use of different heuristics and optimizing the tensor network topology.
Towards Sustainable CNNs: Tensor decompositions for Green AI solutions
Exploring Energy Consumption of Large CNNs
The ever-increasing complexity of Artificial Intelligence (AI) models has led to environmental challenges due to high computation and energy demands. This thesis explores the application of tensor decomposition methods—CP, Tucker, and TT—to improve the energy efficiency of large Convolutional Neural Networks (CNNs) during inference by reducing energy consumption. The energy consumption of several convolution layers was measured using a watt meter across various CNN configurations and different hardware architectures (Central Processing Unit(CPU) and Graphics Processing Unit (GPU)). In addition, several regression models were fitted to estimate energy savings, incorporating memory usage. It was found that TT decomposition consistently provided the most significant energy savings across various compression ratios, influenced by CNN hyperparameters such as input/output channels, feature sizes, and kernel sizes, whereas CP decomposition was the least effective in reducing energy. The GPU implementations generally resulted in additional energy consumption, and the GPU regression models suggested a need for more complex relationships. The thesis also revealed that the efficiency of tensor decompositions might be highly dependent on the implementation details of software libraries, such as TensorlyTorch, which can significantly impact the computation and memory complexities. These findings underscore the importance of both hardware specific considerations and careful software implementation in achieving energy efficient CNNs, providing a foundation for further research in energy-constrained environments. ...
The ever-increasing complexity of Artificial Intelligence (AI) models has led to environmental challenges due to high computation and energy demands. This thesis explores the application of tensor decomposition methods—CP, Tucker, and TT—to improve the energy efficiency of large Convolutional Neural Networks (CNNs) during inference by reducing energy consumption. The energy consumption of several convolution layers was measured using a watt meter across various CNN configurations and different hardware architectures (Central Processing Unit(CPU) and Graphics Processing Unit (GPU)). In addition, several regression models were fitted to estimate energy savings, incorporating memory usage. It was found that TT decomposition consistently provided the most significant energy savings across various compression ratios, influenced by CNN hyperparameters such as input/output channels, feature sizes, and kernel sizes, whereas CP decomposition was the least effective in reducing energy. The GPU implementations generally resulted in additional energy consumption, and the GPU regression models suggested a need for more complex relationships. The thesis also revealed that the efficiency of tensor decompositions might be highly dependent on the implementation details of software libraries, such as TensorlyTorch, which can significantly impact the computation and memory complexities. These findings underscore the importance of both hardware specific considerations and careful software implementation in achieving energy efficient CNNs, providing a foundation for further research in energy-constrained environments.
Canonical Polyadic Decomposition in Autoencoders for ECG Analysis
Exploring the effect of the CPD in unsupervised transfer learning methods for cardiac arrhythmia detection
The unsupervised transfer learning method was designed with an autoencoder for the unsupervised part and a linear network for the supervised part. The first experiment explored four models to select the most optimal architecture to apply the CPD to. These models include an autoencoder adaptation of the ResNet and ConvNeXt models, the U-Net autoencoder and a basic implementation of a convolutional autoencoder. In the second experiment, the autoencoder model is decomposed using the CPD and evaluated at various compression ratios on its reconstruction capabilities, classification accuracy and computational performance. The CPD implementation is also tested on its convergence speed and data efficiency as compared to its uncompressed counterpart.
The first experiment found that the basic implementation of a convolutional autoencoder performed best overall. The U-Net model had high reconstruction quality, however lacked the predictive accuracy. The ResNet model was found to have slightly worse reconstruction and prediction capabilities while having a larger parameter count. The ConvNeXt model failed to accurately reconstruct the images.
The second experiment showed the CP-decomposed model approached the uncompressed model in terms of predictive capabilities, while having lower reconstruction qualities. This is likely due to the regularization effect of the CPD, suggesting significant redundancy in the uncompressed model. Despite the reduction in forward pass FLOPs for the CP-decomposed model, it was found that both the computational complexity of the backpropagation process was higher than the uncompressed model at the lower compression ratios and that the memory allocation suffered a significant increase. This resulted in longer and less efficient training of the CP-decomposed models. It was furthermore found that the CP-decomposed models converged faster and had higher data efficiency as compared to the uncompressed model. ...
The unsupervised transfer learning method was designed with an autoencoder for the unsupervised part and a linear network for the supervised part. The first experiment explored four models to select the most optimal architecture to apply the CPD to. These models include an autoencoder adaptation of the ResNet and ConvNeXt models, the U-Net autoencoder and a basic implementation of a convolutional autoencoder. In the second experiment, the autoencoder model is decomposed using the CPD and evaluated at various compression ratios on its reconstruction capabilities, classification accuracy and computational performance. The CPD implementation is also tested on its convergence speed and data efficiency as compared to its uncompressed counterpart.
The first experiment found that the basic implementation of a convolutional autoencoder performed best overall. The U-Net model had high reconstruction quality, however lacked the predictive accuracy. The ResNet model was found to have slightly worse reconstruction and prediction capabilities while having a larger parameter count. The ConvNeXt model failed to accurately reconstruct the images.
The second experiment showed the CP-decomposed model approached the uncompressed model in terms of predictive capabilities, while having lower reconstruction qualities. This is likely due to the regularization effect of the CPD, suggesting significant redundancy in the uncompressed model. Despite the reduction in forward pass FLOPs for the CP-decomposed model, it was found that both the computational complexity of the backpropagation process was higher than the uncompressed model at the lower compression ratios and that the memory allocation suffered a significant increase. This resulted in longer and less efficient training of the CP-decomposed models. It was furthermore found that the CP-decomposed models converged faster and had higher data efficiency as compared to the uncompressed model.
Epileptic seizure classification using scalp EEG data
A support tensor machine approach
Sparse reconstruction for High Dimensional Tensors
Low complexity methods for large scale sensing
This thesis focuses on the application of block compressed sensing to signals of high dimension to gain insight into the relation between reconstruction performance and computational complexity. This is done by, first investigating how theoretical reconstruction guarantees change, when the problem is divided into smaller sub-problems and by doing a complexity analysis of the reconstruction itself. Each sub-problem solves for a portion of a signal, defined as a block. Next, experiments are conducted in order to get insight into the trade-off between computational complexity and quality of the reconstruction. It can be found that, by using this block-wise approach, the computational complexity of the reconstruction problem decreases, but at the same time, quality of the reconstruction deteriorates. Besides, a method to compensate for this performance loss is proposed. The key idea of this method is that, by propagating prior information among the different blocks, the reconstructions of the blocks can be improved. Finally, block compressed sensing and prior-aware block compressed sensing are analysed in a higher order tensor compressed sensing setting. Nevertheless, this setting was found to exhibit a less favourable complexity-performance trade-off than the normal one, as this setting resulted in both a more complex and a less accurate reconstruction than the normal one. ...
This thesis focuses on the application of block compressed sensing to signals of high dimension to gain insight into the relation between reconstruction performance and computational complexity. This is done by, first investigating how theoretical reconstruction guarantees change, when the problem is divided into smaller sub-problems and by doing a complexity analysis of the reconstruction itself. Each sub-problem solves for a portion of a signal, defined as a block. Next, experiments are conducted in order to get insight into the trade-off between computational complexity and quality of the reconstruction. It can be found that, by using this block-wise approach, the computational complexity of the reconstruction problem decreases, but at the same time, quality of the reconstruction deteriorates. Besides, a method to compensate for this performance loss is proposed. The key idea of this method is that, by propagating prior information among the different blocks, the reconstructions of the blocks can be improved. Finally, block compressed sensing and prior-aware block compressed sensing are analysed in a higher order tensor compressed sensing setting. Nevertheless, this setting was found to exhibit a less favourable complexity-performance trade-off than the normal one, as this setting resulted in both a more complex and a less accurate reconstruction than the normal one.
Uncertainty quantification for tensor network constrained kernel machines
A frequentist and Bayesian approach
Three different methods are investigated for quantifying the uncertainty of the predictions of the CPD constrained kernel machine. Firstly, the delta method is proposed which is a frequentist approach that linearizes a nonlinear parametric model around the estimated model. By estimating the covariance of the model parameters, the delta method can estimate the uncertainty in the model predictions based on the estimated parameter uncertainties. The delta method is compared to two other methods that are able to reflect the prediction uncertainty: the Bayesian method and Single Bayesian Core (SBC) method. The Bayesian method treats the parameters in the factor matrices of the CPD as probability distributions rather than single values and the SBC method incorporates both frequentist and Bayesian aspects. A comparison between the three different methods is performed based on an assessment on the correctness and informativeness of the uncertainty measures of prediction intervals and confidence Intervals.
It was found by regression and classification experiments that all three methods can provide valuable uncertainty quantification measures in terms of correctness and informativeness for the CPD constrained kernel machine. However, the Bayesian method provides in general more conservative uncertainty intervals compared to the delta and SBC method. A major drawback of the Bayesian method is its lack of scalability as the size of the mean and covariance, constructed by the unscented transform in the Bayesian method, scale exponentially. Furthermore, the delta and SBC method produce high quality uncertainty intervals and the methods provide remarkably similar uncertainty quantification on the prediction error variance. ...
Three different methods are investigated for quantifying the uncertainty of the predictions of the CPD constrained kernel machine. Firstly, the delta method is proposed which is a frequentist approach that linearizes a nonlinear parametric model around the estimated model. By estimating the covariance of the model parameters, the delta method can estimate the uncertainty in the model predictions based on the estimated parameter uncertainties. The delta method is compared to two other methods that are able to reflect the prediction uncertainty: the Bayesian method and Single Bayesian Core (SBC) method. The Bayesian method treats the parameters in the factor matrices of the CPD as probability distributions rather than single values and the SBC method incorporates both frequentist and Bayesian aspects. A comparison between the three different methods is performed based on an assessment on the correctness and informativeness of the uncertainty measures of prediction intervals and confidence Intervals.
It was found by regression and classification experiments that all three methods can provide valuable uncertainty quantification measures in terms of correctness and informativeness for the CPD constrained kernel machine. However, the Bayesian method provides in general more conservative uncertainty intervals compared to the delta and SBC method. A major drawback of the Bayesian method is its lack of scalability as the size of the mean and covariance, constructed by the unscented transform in the Bayesian method, scale exponentially. Furthermore, the delta and SBC method produce high quality uncertainty intervals and the methods provide remarkably similar uncertainty quantification on the prediction error variance.
Tensor decomposition for Independent Component Analysis
Through implicit cumulant tensor manipulation
It is studied how decomposing the cumulant tensor in implicit fashion can be used to solve the BSS problem while simultaneously addressing its scalablity issue. A class of implicit cumulant tensor decomposition algorithms is derived which scale more favorably than their explicit counterparts in terms of either computational cost, storage cost or both. Firstly, a novel QR-Tensor algorithm (QRT) is introduced which allows for the simultaneous diagonalization of a tensor’s outer-slices. It is theoretically shown how an implicit version of the QRT algorithm can be used to solve the BSS problem at a linearly scaling computational cost. Secondly, a fixed-point Canonical Polyadic Decomposition (CPD) iteration method is presented. It reduces the computational complexity from a quartic dependence to a linear dependence on the amount of signals to estimate. The source estimation performance of the devised implicit decomposition methods is compared to that of the state-of-the-art FastICA for an artificial linear BSS problem.
Results show that both fixed-point CPD and QRT are superior to FastICA when it comes to the computation time needed to reach convergence, while producing estimated sources of similar quality. It is shown that when the amount of sources to estimate is increased both QRT and FastICA struggle to converge. In contrast, the fixed-point CPD method converges within a consistent amount of iterations, suggesting a method more suitable for the estimation of a large amount of sources. ...
It is studied how decomposing the cumulant tensor in implicit fashion can be used to solve the BSS problem while simultaneously addressing its scalablity issue. A class of implicit cumulant tensor decomposition algorithms is derived which scale more favorably than their explicit counterparts in terms of either computational cost, storage cost or both. Firstly, a novel QR-Tensor algorithm (QRT) is introduced which allows for the simultaneous diagonalization of a tensor’s outer-slices. It is theoretically shown how an implicit version of the QRT algorithm can be used to solve the BSS problem at a linearly scaling computational cost. Secondly, a fixed-point Canonical Polyadic Decomposition (CPD) iteration method is presented. It reduces the computational complexity from a quartic dependence to a linear dependence on the amount of signals to estimate. The source estimation performance of the devised implicit decomposition methods is compared to that of the state-of-the-art FastICA for an artificial linear BSS problem.
Results show that both fixed-point CPD and QRT are superior to FastICA when it comes to the computation time needed to reach convergence, while producing estimated sources of similar quality. It is shown that when the amount of sources to estimate is increased both QRT and FastICA struggle to converge. In contrast, the fixed-point CPD method converges within a consistent amount of iterations, suggesting a method more suitable for the estimation of a large amount of sources.
This thesis uses tensors and graphs to represent available data. Emphasis is given to capturing higher-order interactions present between the data. The use of tensors is motivated as matrices cannot capture data with higher-order relations, such as variation of user ratings to items with time. The transition to using tensors has been promising with the development of efficient tensor decomposition methods and powerful machines. Graphs can capture the correlation between different entities, providing additional information intrinsic to the underlying graph structure. A Graph Regularized CANDECOMP/PARAFAC (GRCP) tensor decomposition model framework is proposed in this thesis. The thesis highlights how to graph Laplacian regularizers (GLRs) benefit CP tensor decomposition methods to build RS. The model framework is evaluated with the MovieLens data set. The model records lesser Normalized Mean Squared Error (NMSE) values than those reported in the literature. The combination of varied data sources notably aids in overcoming the drawbacks of current RS models, offering scalability with computational efficiency in linear time. ...
This thesis uses tensors and graphs to represent available data. Emphasis is given to capturing higher-order interactions present between the data. The use of tensors is motivated as matrices cannot capture data with higher-order relations, such as variation of user ratings to items with time. The transition to using tensors has been promising with the development of efficient tensor decomposition methods and powerful machines. Graphs can capture the correlation between different entities, providing additional information intrinsic to the underlying graph structure. A Graph Regularized CANDECOMP/PARAFAC (GRCP) tensor decomposition model framework is proposed in this thesis. The thesis highlights how to graph Laplacian regularizers (GLRs) benefit CP tensor decomposition methods to build RS. The model framework is evaluated with the MovieLens data set. The model records lesser Normalized Mean Squared Error (NMSE) values than those reported in the literature. The combination of varied data sources notably aids in overcoming the drawbacks of current RS models, offering scalability with computational efficiency in linear time.
All-at-once optimization for kernel machines with canonical polyadic decompositions
Enabling large scale learning for kernel machines
It is studied in which situation it is more advantageous to apply iterative all-at-once opti- mization compared to Alternating Least Squares (ALS) to solve the CPD constrained kernel machine problem. Specifically, all-at-once gradient descent methods are studied. An efficient analytical algorithm for the all-at-once gradient is derived. Furthermore, it is shown that automatic differentiation (AD) can also be applied, but it is found to be slower than the analytical method.
The selection of a step size is found to be challenging. It is shown that the magnitude of the gradient of the mean squared error (MSE) term decreases for an increasing number of features. As a result, selecting the step size becomes more difficult for more features. To overcome this, the Line search and the Adam method are studied. A general expression for the exact line search solution is derived. It can be applied to compute the optimal step size for any step direction and any number of features. However, the Adam method performs better in terms of loss after training, convergence and the training run time. The mini-batch Adam method is used to evaluate the performance of all-at-once optimization for large scale learning.
It is found that the Adam method no longer performs well for data sets with around 16 features or more, likely due to the decrease in the magnitude of the gradient of the MSE term. On large scale data sets with fewer features, the Adam method outperforms ALS in terms of run time until convergence while achieving similar training and validation losses. The Adam method reached convergence on a data set with 11 million samples within ten minutes. Furthermore, it is shown that the scaling of the run time of the Adam method in terms of the feature map order and the CP-rank is more than an order lower than the scaling of ALS when the methods are run on a GPU. This makes to Adam method more suitable for more complex models. ...
It is studied in which situation it is more advantageous to apply iterative all-at-once opti- mization compared to Alternating Least Squares (ALS) to solve the CPD constrained kernel machine problem. Specifically, all-at-once gradient descent methods are studied. An efficient analytical algorithm for the all-at-once gradient is derived. Furthermore, it is shown that automatic differentiation (AD) can also be applied, but it is found to be slower than the analytical method.
The selection of a step size is found to be challenging. It is shown that the magnitude of the gradient of the mean squared error (MSE) term decreases for an increasing number of features. As a result, selecting the step size becomes more difficult for more features. To overcome this, the Line search and the Adam method are studied. A general expression for the exact line search solution is derived. It can be applied to compute the optimal step size for any step direction and any number of features. However, the Adam method performs better in terms of loss after training, convergence and the training run time. The mini-batch Adam method is used to evaluate the performance of all-at-once optimization for large scale learning.
It is found that the Adam method no longer performs well for data sets with around 16 features or more, likely due to the decrease in the magnitude of the gradient of the MSE term. On large scale data sets with fewer features, the Adam method outperforms ALS in terms of run time until convergence while achieving similar training and validation losses. The Adam method reached convergence on a data set with 11 million samples within ten minutes. Furthermore, it is shown that the scaling of the run time of the Adam method in terms of the feature map order and the CP-rank is more than an order lower than the scaling of ALS when the methods are run on a GPU. This makes to Adam method more suitable for more complex models.