Statistical Integration of Heterogeneous Data with PO2PLS

The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high-dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for testing the relationship between two datasets is proposed, and its asymptotic distribution is derived. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS. Supplementary materials for this article are available online.


Introduction
Many studies collect multiple omics datasets to gather novel insights into various stages of biological processes: genome-wide DNA markers reflecting the genetic code, transcriptomics and epigenetics, providing information on expressed and silenced genes, proteomics measuring the abundance of proteins.To link the information in these omics datasets, a joint integration approach is needed (Richardson et al. 2016).Several challenges exist: datasets are often high dimensional, measurements are highly correlated within and across datasets, and the presence of heterogeneity among datasets due to measuring different biological levels and using different technologies to measure them.Many machine learning methods have been proposed that address some of these challenges and are therefore increasingly popular (Li et al. 2016).However, they neither provide statistical evidence for a relationship between the datasets nor identify relevant variables that contribute to this relationship.We propose a probabilistic latent variable modeling framework for inferring the relationship between two omics datasets x and y.Our method reduces data dimensionality, captures correlations within and between sets of variables, addresses heterogeneity and performs statistical inference on the relation between x and y.
For our probabilistic approach, we propose to use multivariate normal distributions for x and y.The correlation structure between and within x and y is modeled by joint and data-specific components, formed by linear combinations of the variables in x and y.
(van der Kloet et al. 2016;Shu et al. 2020).All parameters of the model are identifiable and estimated with maximum likelihood.To this end, a memory-efficient EM algorithm (Meng and Rubin 1993) is implemented that can handle high dimensional data.We derive standard errors for the estimators and formulate a global test statistic for the null hypothesis of no relation between x and y.For overparametrized models such as latent variable models, the regularity conditions under which maximum likelihood estimators are asymptotically normally distributed may not hold (Sun et al. 2015).We will derive the asymptotic distribution of our estimators as well as the distribution of our proposed test statistic; we apply the mathematical theory that investigates asymptotic properties of estimators based on minimizing a proper discrepancy function (Shapiro 1983).Finally, to deal with the size of the resulting asymptotic covariance matrix, which quadratically increases with the number of x and y variables, we develop an approximation that is very fast to compute, even in high dimensions.
Various latent variable approaches are available, differing in models and estimation techniques.Instead of a maximum likelihood approach, algorithmic methods have been popular.These methods are often sequential, and the algorithm stops when the information in the datasets is sufficiently captured.Examples of algorithmic methods that only include joint parts are partial least squares (PLS) (Wold 1973) and canonical correlation analysis (CCA) (Hotelling 1936).Methods also incorporating data specific parts are twoway orthogonal PLS (O2PLS) (Trygg and Wold 2003) and JIVE (Lock et al. 2013).JIVE is less flexible than O2PLS as it restricts the joint components of x and y to be exactly equal (details are given in Section 2.1).We have recently shown that when this assumption does not hold, convergence problems may arise and the performance of the estimators might be poor (el Bouhaddani et al. 2018b).A shortcoming of algorithmic approaches is that standard errors are not available; hence a global test requires (computer-intensive) permutations to provide a p-value.
In contrast to algorithmic approaches, likelihood approaches provide a way to calculate standard errors.In addition, likelihood-based methods can assume a direction, i.e. x influences y, which may lead to more efficient estimation of the true relation.An example is envelope regression (Cook and Zhang 2015), which fully models the covariance structure and is therefore not suited for high dimensional data.Alternatively, probabilistic PLS (PPLS) (el Bouhaddani et al. 2018a) uses a simpler covariance structure with less parameters and is applicable to high dimensional datasets.In contrast to PPLS and envelope regression, SIFA (Li and Jung 2017) models specific components.However, just as JIVE, SIFA assumes the joint components to be exactly equal and might not perform well if this condition does not hold.Our novel data integration framework, probabilistic O2PLS (PO2PLS), models joint and specific parts in x and y.The models of SIFA and PPLS can be viewed as specific cases of our PO2PLS model.
Nowadays, omics data are available in studies based on different designs, such as crosssectional and follow-up population studies for common phenotypes, and case-control studies for rare diseases.We apply PO2PLS to data from a large cross-sectional population study and a small case-control study.For the population study, DNA markers (p ≈ 10 5 ) and glycomics (q = 20) data are available for N = 885 subjects (Wahl et al. 2018).This study has been part of genome-wide association studies, which test for associations between a marker and a glycan using single pair methods.Since glycans are highly correlated, these methods do not fully use the available information.We will perform a global test for association between the genetic markers and the glycan abundances, and assess which genes and glycans contribute most to this association.For the case-control study, epigenetics and transcriptomics data (p, q ≈ 10 4 ) are available for 23 subjects, of which 13 suffer from hypertrophic cardiomyopathy (HCM) and ten are healthy controls.Differential expression analyses are usually performed to infer significant relations between each pair of measurements.Instead, we globally test for an association between epigenetic activity and gene transcription.Genes in the joint components contributing to this association may play an important role in HCM.
The main contributions of this paper are threefold.We propose an EM algorithm to estimate the parameters of our PO2PLS model, which is computationally efficient and freely available on GitHub (github.com/selbouhaddani/PO2PLS) and will soon be released on CRAN.We formulate a global test to test the null hypothesis of no relationship between x and y.We show the added value of our methods by applying them to omics datasets from two different studies.In Section 2, the PO2PLS model is formulated, and identifiability of the parameters is shown.Furthermore, maximum likelihood estimates are derived, and a global test of the relation between x and y is proposed.In Section 3, the performance of PO2PLS is studied in a range of simulation scenarios.We focus on feature selection, prediction performance, type I error and power of the statistical test.In Section 4, PO2PLS is applied to the case studies to test and describe the relation between two sets of omics variables.We conclude with a discussion.

PO2PLS: model and estimation
2.1 The model Let x and y be two random row-vectors of size p and q, respectively.In the PO2PLS model, both x and y are expressed in terms of a joint part, a specific part, and a noise part.The joint parts involve random vectors t and u of size r, with r usually a small number.The specific parts involve independent random vectors t ⊥ and u ⊥ of size r x and r y , respectively.
The noise random vectors are denoted by e (p-dimensional), f (q-dimensional) and h (rdimensional).Here, h represents heterogeneity in the joint parts, leading to differences between t and u.More precisely, the PO2PLS model for x and y is described by The parameter matrices W (p × r) and C (q × r) are called joint loadings.The matrices W ⊥ (p × r x ) and C ⊥ (q × r y ) are referred to as data-specific loadings.
The random vectors e and f are independent multivariate normally distributed random vectors, with zero mean and covariance matrices σ 2 e I p and σ 2 f I q , respectively.Furthermore, t, t ⊥ , u ⊥ and h are zero mean multivariate normals, with diagonal covariance matrices Σ t , Σ t ⊥ , Σ u ⊥ and Σ h , respectively.The covariance matrix of u follows from (1): Σ u = B T Σ t B + Σ h .Here, B is a diagonal r × r matrix.

All parameters are collected in
It parameterizes the distribution of (x, y) ∼ N (0, Σ θ ) (the explicit expression for Σ θ is given in the supplementary material).
Note that the model for the relation between u and t is taken asymmetrically, as often a certain hierarchy is assumed for x and y (Crick 1970).For instance, it is reasonable to assume that genetic variability induces glycomic variation, so a model for u in terms of t better reflects the underlying biology.
PO2PLS as a general data integration framework PO2PLS models the relationship between x and y through t and u as described in (1).It can be seen as a generalization of other models.Firstly, if the joint principal components (JPCs) are assumed to be exactly equal, i.e. u = t, the SIFA model is retrieved.In this case, B = I and Σ h = 0, so u and t have the same scale and a correlation of one.However, datasets are typically heterogeneous, so the two sets of JPCs should represent different mechanisms (e.g.genetic versus glycomic pathways).Therefore, they may not be perfectly correlated or on the same scale.Also, assuming homogeneity of datasets can negatively affect estimation performance (el Bouhaddani et al. 2018b).Secondly, if additional to assuming u = t, the columns of the concatenated components (W W ⊥ ) and (CC ⊥ ) are orthogonal, the JIVE model is recovered.
In this case, combinations of features involved in the joint and specific parts have to be orthogonal, which is a strong restriction.Thirdly, the probabilistic PLS model is obtained by setting Σ t ⊥ and Σ u ⊥ to zero in (1).
In the envelope regression (ER) model, the number of noise variance parameters to estimate is of order O(p + q), whereas PPLS and PO2PLS introduce one σ 2 e and σ 2 f for x and y, respectively.When p or q is larger than the sample size (i.e. a high dimensional setting) or the covariance matrix of x or y is singular, the ER estimator cannot be obtained due to singularity.
From an estimation point of view, PO2PLS can be placed in the category of maximum likelihood estimators using EM (details about estimation is found below in Section 2.3).
Other probabilistic approaches, such as ER, directly optimize the likelihood over Grassmann manifolds (Cook and Zhang 2015).Since this involves calculating the covariance of (x, y), it is not feasible to use in high dimensions.Alternative approaches to maximum likelihood consider sequential algorithms to estimate joint and specific components.For example, the O2PLS estimator (see Trygg and Wold (2003); el Bouhaddani et al. ( 2016)) is as follows: first the covariance between xW and yC is optimized, then the covariance between xW and x − xW W T is optimized to get estimates for the specific parts, and finally after subtracting these parts, the covariance between x * W and y * C is optimized with the star indicating a deflation step.Our EM implementation of PO2PLS appears to be competitive with fast algorithmic approaches in terms of memory usage and is reasonably fast in high dimensional settings (see Section 3).In Table 1, an overview is shown with several methods and their features.
Table 1: Features of several data integration methods.An 'X' indicates presence of a feature.All methods estimate a joint part.The first row indicates methods that also estimate specific components W ⊥ and C ⊥ .The second row indicates methods that are based on a probability distribution for x and y.For the next row, an X is placed if the method does not restrict the model to u = t.The last row indicates methods that can cope with data where p, q are larger than the sample size.

Identifiability of PO2PLS
Linear latent variable models are typically unidentifiable due to rotation indeterminacy of the loading components.For example, given a rotation matrix R such that RR T = I, the models x = tW T and x = (tR)(W R) T yield the same x while W and W R are not the same.Note that if Cov(t) is diagonal with distinct elements, Cov(tR) is not diagonal unless R is also diagonal.In PCA, the loading matrices are restricted to be semi-orthogonal, i.e.W T W = I, whereas in Factor analysis, the latent variables are standard normally distributed.However, these assumptions separately do not solve the rotation indeterminacy.In PO2PLS, identifiability can be obtained using similar assumptions, namely semi-orthogonal loading matrices and diagonal covariance matrices for the latent variables.
The assumptions in PO2PLS are firstly, and [CC ⊥ ] must not have linearly dependent columns.Note that the columns of W ⊥ and C ⊥ do not have to be orthogonal to the columns of W and C, respectively.Second, the diagonal elements of B are restricted to be positive.This does not restrict the PO2PLS model, as Finally, the sequence (σ 2 t k b k ) r k=1 is assumed to be strictly decreasing in k.Regarding the number of components, we assume that 0 < r + r x < p and 0 < r + r y < q, where r is positive and both r x and r y are non-negative.
Given these assumptions, the loading matrices are identified up to sign and the other parameters in θ are uniquely identified.The following Theorem makes this precise.
Theorem 2.1 Let r, r x , r y and θ satisfy the above assumptions.Let Σ θ 1 and Σ θ 2 be the covariance matrices corresponding to PO2PLS parameters θ 1 and θ 2 , and suppose and all other parameters in θ 1 and θ 2 are equal.
The proof is given in the supplementary material.

Maximum Likelihood Estimation of the parameters
We propose maximum likelihood to estimate θ.Contrary to the sequential O2PLS algorithm, the estimation is simultaneous over both joint and specific parts.The log of the likelihood associated with the PO2PLS model ( 1) is given by Note that L is a complicated and highly non-linear function of θ, and its computation requires computing and storing covariance matrices of size (p + q) 2 .If the latent variables t, u, t ⊥ and u ⊥ would be observable, maximizing the log-likelihood becomes analytically tractable and computationally feasible, even for large p and q.However, the latent variables are not observable.In an EM algorithm (Dempster et al. 1977), predictions are calculated for these missing quantities, and iteratively, maximizers are obtained.Therefore, we propose an EM algorithm to obtain maximum likelihood estimates for θ.
Denote the complete data vector by (x, y, t, u, t ⊥ , u ⊥ ).For each current estimate θ , the EM algorithm considers the objective function Here, the complete data likelihood can be written (with abuse of notation) as These factors depend on distinct sets of parameters.For example f (x|t, t ⊥ ) depends only on W , W ⊥ and σ 2 e , yielding separate optimization problems.The expectation step involves a conditional expectation of the complete data likelihood.
Since f in (3) is a multivariate normal density, this expectation can be written in terms of the first and second conditional moments of the latent variables t, u, t ⊥ and u ⊥ given x and y.Focusing on the first factor in (4), the conditional expectation of log f (x|t, t ⊥ ) is given by This expectation involves first and second conditional moments of the vector (t, t ⊥ ) given θ , x and y.These terms can be explicitly calculated and are given in the supplementary material.
In the maximization step, the function in ( 5) is optimized over all semi-orthogonal matrices W and W ⊥ .By introducing Lagrange multipliers Λ W and Λ W ⊥ , maximizing (5) over semi-orthogonal W and W ⊥ is then equivalent to minimizing the following objective function Note that the objective function involves both W and W ⊥ and cannot be decoupled.Instead of numerical optimization, we consider a variant of EM that performs sequential optimization (Meng and Rubin 1993).First, ( 6) is minimized over W , keeping W ⊥ constant.Then we minimize over W ⊥ , keeping W equal to its minimizer.Under standard conditions, this algorithm monotonically approaches a (local) maximum of the observed likelihood L (Meng and Rubin 1993).
The above derivation is conditional on the dimensions of the latent spaces.Typically, the number of components r, r x and r y are unknown a priori.Strategies that can be used to select the number of PO2PLS components include cross-validation (Geisser 1993) and eigenvalue (scree) plots (Mardia et al. 1979).
The expectation and maximization step for the other parts in (4) are calculated analogously (see the supplementary material).In this calculation, the orthogonalization operator is used to obtain semi-orthogonal loading matrices, defined as follows.
Definition 2.2 Let A be a p × a full rank matrix with singular value decomposition A = U DV T .Let R = V D. Then we define the operator orth : R p×a → R p×a as orth (A) = A(R T ) −1 .
Using this operator, the EM parameter updates are made explicit in Theorem A.1 in the Appendix.

Statistical inference: formulation of a global test
One of the challenges in data integration is to assess the statistical evidence for the relationship between x and y.In our model, this relationship is represented by the equation u = tB + h in (1).Thus the null hypothesis of no relationship corresponds with To test this null hypothesis, we propose the following Wald-type test statistic, We refer to ( 7) with (8) as the global test.To apply the global test statistic in practice, the asymptotic distribution of all parameters θ, including B, needs to be derived.Since our model is overparameterized, standard maximum likelihood theory cannot readily be implemented.
Under certain regularity conditions, consistency of the estimator θ and its asymptotic distribution N (θ, Π θ ) follows from Shapiro's Proposition 4.2 (Shapiro 1986) applied to the PO2PLS model ( 1).Here, a suitable discrepancy function S the sample covariance matrix of (x, y) is used.Details and proofs are given in the supplement.
Given the asymptotic covariance matrix Π θ , standard errors for the elements of θ are obtained by calculating the square root of the diagonal elements of Π θ .An estimate of Π θ is obtained from the inverse observed Fisher information matrix.In an EM algorithm, this matrix is given by (Louis 1982): Here, S( θ) = ∇L( θ) and B(θ) = −∇ 2 L( θ) are the gradient and negative of the second derivative of the log likelihood L, respectively, evaluated in θ.The derivation of the Fisher information matrix for the parameters of the PO2PLS model is given in the supplement.
To obtain the standard errors for B, the submatrix of I −1 θ with respect to B has to be calculated.However, this requires inverting a matrix of size O((pqr) 2 ), which is computationally infeasible even for moderate p and q.Under the assumptions that B and θ/ B are asymptotically independent and Σh is non random, the observed Fisher information matrix I B and thus SE B are given by the following formula, Details of the derivation of this formula are given in the supplementary material.Note that the first part on the right-hand side is the Fisher information matrix based on the general linear model, had t and u been observed.Standard errors for B are given by the square root of the diagonal elements of I −1 B .Thus, to test the global hypothesis (7), we apply our statistic T B and calculate the corresponding p-value.

Simulation study
We conduct a simulation study to evaluate the performance of PO2PLS in terms of feature selection, prediction and performance of our global test.Four metrics are considered: true positive rates, root mean squared error of the prediction, type I error and power.We compare PO2PLS to existing approaches PLS, O2PLS, PPLS and SIFA, covering algorithmic and probabilistic methods with and without specific parts (see Table 1).We investigate robustness against model assumptions.Finally we assess computational efficiency.
For performance in feature selection and prediction ability, we consider combinations of small and large sample sizes (N = 100, 1000) and low and high dimensional data (p = 2000, 10000; q = 25, 125).We also include two proportions of noise relative to the total variation: in the 'small noise proportion', we set the variance of e and f to be 40% of the variance of x and y.In the 'large noise proportion', these values are 95% and 5% for x and y, respectively.We set B = I and Σ h = 0 to comply with the SIFA assumptions, see Section 2.1.The impact of heterogeneity of joint parts is considered by increasing the joint residual variance Σ h from 0% to 80% of the total joint variance Σ u .Finally, we set r, r x , and r y to five components.These scenarios are commonly encountered in data analysis.
To assess the feature selection performance, we calculate the proportion of true top 25% features among the estimated top 25% (i.e True Positives Rate, TPR).We then average these proportions across components to obtain an aggregated measure.Predictive performance is measured by calculating the RMSEP, defined as the square root of E||y − ŷ|| 2 with ŷ predicted from x.The RMSEP is calculated in both training and test data; the test data consist of N = 10 4 independent samples generated from the same model as the training data.
To evaluate the performance of the PO2PLS global test T B described in Section 2.4, we first estimate the type I error for increasing sample size of 50, 500, 5000, and 10000.
The dimension of x is set to 20.The number of simulation replicates here is 50000.Next, we consider increasing dimensionality, namely p = 20, 200, 2000.The sample size is set to 500, and we replicate 2000 times.The type I error is calculated as the proportion of rejecting the null hypothesis B = 0 at a 5% level when simulating under this hypothesis.
Next, we estimate the power of T B in (8).We compare four procedures, namely using the normal distribution for T B with calculated standard errors using our approximation of its covariance matrix, with standard errors obtained from parametric and from non-parametric bootstrapping, and with using the empirical distribution of T B via permutations.The proportion of false and true rejections are reported and compared for increasing B. We consider a sample size of 50 and 500, and dimensionality p of 20 and 200.The number of bootstrap and permutation iterations is 250 and 500, respectively, and we repeat 500 times.In all three simulations, the dimension of y is kept to 5, the noise proportion is 50%, and we set r = 2, r x = 1, and r y = 0.
Three additional simulation studies are carried out to study the robustness of PO2PLS against model deviations.PO2PLS is applied to high dimensional simulated datasets from a selected case-control study design, mimicking the second data analysis in Section 4. We compare the error of predicting the outcome using the PO2PLS joint components with aforementioned alternatives.Then, we assess the impact of rank misspecification when fitting PO2PLS, by estimating too few components, and the impact of non-normality of the latent variables, using four commonly encountered distributions.Finally, we study the computational efficiency of the PO2PLS implementation, measured by the cpu time and memory demand of the EM algorithm.Details of these simulations and results are given in the supplementary material.

Simulation results
We first present the accuracy and prediction performance in the low dimensional setting, see Figure 1.Boxplots of the accuracy and prediction error are shown across the scenarios.Differences in accuracy with respect to PO2PLS are also shown.In terms of feature selection, PO2PLS performed good compared to the other methods.When considering the TPR difference between each method and PO2PLS per simulation run, PO2PLS generally had the highest TPR.This difference tend to increase with larger noise proportions and more heterogeneous joint parts settings.The differences between PO2PLS and PPLS are not shown for better visual comparison.Regarding the prediction error, PO2PLS generally performed better than the other methods.SIFA had the highest prediction error when heterogeneity between the joint parts was present.Furthermore, PLS and O2PLS seemed to overfit in noisy, small sample size scenarios: the training error was lower than the test error compared to the other methods.In the high dimensional settings, similar results were obtained.Details can be found in the supplementary material.For the high dimensional settings, the implementation of SIFA gave 'out-of-memory' errors.Hence, we could not include SIFA in these comparisons.
Results for the global inference are shown in Figure 2. The type I error of the PO2PLS test was around 5% for increasing sample size and dimensionality.Based on the proportion of rejections under the null hypothesis (7), the PO2PLS test had type I error around 5% for all but the smallest sample size; in that case, the type I error was about 7%.It also had more power under the alternative than the other approaches, with the permutation test being severely underpowered in small sample size.
We briefly present the key results of the additional simulations.In the selected casecontrol simulation study, PO2PLS had highest TPR, and suffered less from overfitting than PLS and O2PLS.When estimating one component less than the true number of joint and specific components, PO2PLS performed similarly to the algorithmic methods (PLS and O2PLS).Further, PO2PLS was robust against non-normal distributions.Finally, the increase in CPU time and memory usage for increasing data dimensionality was similar across the methods.The full findings are given in the supplementary material.

Applications to omics datasets
We illustrate the PO2PLS model with datasets from two different studies.Firstly, PO2PLS is applied to test and estimate genetic contributions to glycomic variation in a population based cohort.Secondly, PO2PLS is used to infer a relation between DNA regulation and gene expression using data from a case-control study.Here we also investigate whether the joint components reveal the case control status and whether the top features overlap with findings in cardiovascular diseases.For comparison, we also applied O2PLS to these datasets.

Data integration in a population cohort
Glycosylation is one of the most common post-translational modifications that enrich the functionality of proteins in many biological processes, such as cell signaling, immune response and apoptosis (Wahl et al. 2018).Previously, genome-wide association studies (GWAS) were performed between pairs of single nucleotide polymorphisms (SNPs) and glycans to investigate genetic regulation of glycosylation (Lauc et al. 2010;Wahl et al. 2018).However, glycans abundances are highly correlated and associated with multiple genes.For example, the glycan G0 was found associated with multiple genes, including FUT8, and this gene was itself associated with multiple glycans (Klarić et al. 2020).Therefore a multivariate approach might provide new insights.We first confirm that genetics play a significant role in regulating of glycans.Then, we investigate whether the joint glycan components represent biological structures.Finally, we compare our top genes with genes identified in GWAS.
Genetic and glycomic data were measured, yielding 333858 genotyped SNPs and 20 IgG1 glycan abundances for N = 885 participants in the Croatian Korcula cohort (Lauc et al. 2010).The SNPs were aggregated on the gene level by combining SNPs around the same gene with PCA, yielding a Genetic PCs (GPCs) dataset.Then, the GPCs and glycomics datasets were pre-processed, resulting in datasets X (p = 37819) and Y (q = 20), respectively.Based on scree plots of the eigenvalues of X T X, X T Y and Y T Y , five joint, five genetic-specific, and no glycan-specific components were retained.
A global test for the association between genetics and glycans was performed using PO2PLS.The T B statistic for each component was between four (for the first component) and three (for the last component).With corresponding p-values of 10 −5 and 10 −3 , there is statistical evidence of a relationship between genetics and glycans.
The loading values of each glycan variable for the five joint components are depicted in Figure 3.Each joint glycan component appears to represent different aspects of glycans and their molecular structure.While the first component represents the 'average' glycan (first component), the second component represents presence of fucose, the third component represents the presence of galactose, and the last two components represent GlcNAc (el Bouhaddani et al. 2018b).The top gene in the second joint genetic component is FUT8 which has been linked to fucosylation (Lauc et al. 2010).Note that the second glycan component reflects "presence of fucose".The same article reports more genes linked to glycosylation that we did not find, but their GWAS results are based on imputed genetic data from multiple cohorts.With our joint approach, several other top genes were found, e.g.DNAJC10 and AKAP9, that have links to synthesis and degradation of glycoproteins or (more generally) with inflammation and immune responses.
A second independent study of 714 participants from the Croatian Vis cohort is available.To replicate our findings in the Korcula cohort, we apply PO2PLS to this cohort and compare the components underlying the genetics and glycomics data.The results from the second study, shown in the supplementary material, are consistent with the above findings, indicating that the obtained components are not specific to one study.Finally, we compared the prediction error of Y given X of the models estimated with PO2PLS and O2PLS in Korcula, evaluated using the data from Vis.The ratio of training (Korcula) and test (Vis) error appeared to be 5/23 for O2PLS and 20/21 for PO2PLS.This is conform the simulation study that O2PLS is prone to overfitting.

GlcNAc
No Yes

Data integration in a case-control study
Hypertrophic cardiomyopathy (HCM) is a rare heart muscle disease negatively affecting blood circulation and leading to heart failure.Several studies have shown that several molecular factors, such as epigenetics and gene transcription, play an important role in HCM (Hemerich et al. 2019).We investigate whether epigenetic variation affects transcription and test these relationship using PO2PLS.Since the samples consist of HCM cases and controls, an obvious question is whether one of the joint components represents this segregation of cases and controls.
Data on epigenetics (DNA regulation) and transcriptomics (gene expression) is available, obtained from the heart tissue of thirteen HCM patients and ten controls.Epigenetic data were measured using ChIP-seq, yielding regulation levels of 33642 regions after preprocessing.Transcriptomics data were measured using RNA-seq, yielding 15882 expression levels after pre-processing (TMM normalization, followed by log transformation).Statistical challenges are the small sample size of 23 and the large number of features (around 45000).
PO2PLS is applied to the epigenetics (X) and transcriptomics (Y ) data, using two joint components and one specific component for both datasets.These numbers are determined using scree plots.The T B test statistic for the first component was 9.12, and 2.35 for the second component.The p-values were smaller than 0.001 for the first component and 0.018 for the second, so the two component were statistically significant.
To investigate whether the top genes in the joint components are involved in cardiovascular outcomes, we clustered the 500 genes with highest loading values in the first joint PC using DisGeNET (a database of gene-disease associations (Sabater-Molina et al. 2018)).
The top 10 most significant clusters appear to represent a broad spectrum of cardiovascular diseases (Table 2.
In Figure 4, PO2PLS scores are plotted for the first two joint components, and each dot is colored according to its case-control status.The plots indicate that the first joint component picked up the case-control segregation.Additionally, the O2PLS scores are plotted, showing a similar pattern as PO2PLS.

Discussion
We propose probabilistic two-way orthogonal partial least squares (PO2PLS) to model the relation between two sets of variables x and y in the presence of data-specific characteristics.
Our method is suited for heterogeneous, high dimensional, correlated datasets commonly available in the life sciences.For estimation, we derived a memory efficient EM algorithm.
For testing, we derived a Wald type test statistic and its approximate distribution under the null hypothesis of no relationship between x and y.
Via an extensive simulation study, we showed that PO2PLS often performed better than PPLS, SIFA, O2PLS and PLS.In terms of feature selection and prediction, it performed better than PLS, PPLS and SIFA when heterogeneity exists between the datasets.
These results were expected since, contrary to the other methods, PO2PLS models the heterogeneity and therefore better estimates the joint components.PO2PLS performed better than O2PLS and PLS in terms of prediction when the datasets are small.For noisy and small datasets, PO2PLS also had a better true positive rate than O2PLS and PLS.
PO2PLS had a smaller risk of overfitting, probably because it models all the available information in the data.This reduction in overfitting was also confirmed in studying the relationship between genetic data and glycans, for which we had a replication cohort.The common belief is that PLS and O2PLS, as distribution-free methods, are more suited for small sample size scenarios than probabilistic methods (Wold 1985).Contrary to this belief, in these scenarios, PO2PLS yielded better true positive rate and prediction performance.
Via simulations, we also showed that PO2PLS is robust against model deviations such as using a too small number of components and non-normality of the data.
We also showed with simulations that our proposed test statistic for testing the null hypothesis of no relationship is asymptotically normally distributed and performs well in terms of type I error and power.In algorithmic latent variable approaches, testing for a relationship is carried out by empirically estimating the distribution of the test statistic.Since this is time-consuming, evidence for relevance of the top features (e.g.genes, proteins, glycans) is instead obtained by relating the findings to historical ones (Domingo-Fernández et al. 2019).For example, it is tested whether specific molecular pathways or interaction networks are over-represented in the top feature ranking.Such an approach has the advantage that prior domain knowledge is incorporated.A drawback is a focus on existing findings and a bias against novel discoveries.Moreover, it is often unclear how much evidence exists for pathways and networks in these databases.Each database uses its own scoring mechanisms, often not based on a formal scoring method.Further, there might be a lack of information in the context for new diseases or measurement techniques, and using the information on related diseases or datasets may result in incorrect conclusions about relations (Mubeen et al. 2019).A formal testing procedure quantifies the evidence and might lead to the identification of relevant relationships.
PO2PLS was applied to omics data from two case studies.The first one is a typical epidemiological population cohort, designed to identify new molecular drivers and build omics predictors for common diseases.These studies are also well suited to study relationships between multiple omics datasets.We applied PO2PLS to genetic and glycomics datasets.The relationship between genetics and glycomics was statistically significant, which confirmed the known high heritability of glycans and the multiple hits of genome-wide association studies (GWAS) (Zaytseva et al. 2020).Moreover, our findings overlapped with GWAS results.We did not replicate all GWAS findings since we restricted ourselves to genotyped SNPs in a gene's neighborhood.On the other hand, modeling the joint distribution of glycans and genes also led to new findings.We replicated the estimated components with relevant features in a second cohort study.The second case study was a small case-control study.To identify molecular markers for rare diseases, omics datasets are measured in cases and controls.Typically these studies are small, either because of the limited number of available cases (rare disease) or costs.Note that multiple diseases can be studied in epidemiological studies, while a case control study is typically limited to one outcome.We applied PO2PLS to epigenetic and transcriptomic data in HCM cases and controls.The relationship between the two sets was statistically significant.Clustering of the top genes using DisGeNET showed that the top genes are in gene clusters associated with several cardiovascular diseases.Moreover, when plotting the first two joint components against each other, a structure representing case control status was evident.This might be expected since all analyses are conditional on the outcome status, and the outcome is a collider for features of the datasets that affect the outcome variable (Balliu et al. 2015;Tissier et al. 2017).More research is needed here.
A possible approach is to include the outcome variable in the model.Several penalized regression models have been used to identify sets of variables related across the different datasets x or y which predict z (Vinga 2020).These approaches do not model the within and across correlations and are hard to interpret when correlations between x and y are present (Tissier 2018).For a more holistic approach, one could consider the joint distribution of (x, y, z).Based on the probabilistic O2PLS framework, this distribution can be specified conditional on latent joint and specific variables.In such a framework, the relation between x and y is modeled, and their association with the outcome z is simultaneously incorporated and estimated.Extending our framework in this direction would enable formal tests for the relationship between x and y jointly with the outcome.
More generally, z might be a third dataset instead of an outcome.Here, the interest may lie in inferring relations between the three sets of variables.A complication is that the direction of the relationship between the sets of variables needs to be considered, which might be unknown.The majority of integration approaches for more than two datasets avoid this issue by specifying a common set of latent variables t for all sets of variables (Meng et al. 2016), similar to SIFA.Another approach proposes optimizing a sum of objective functions for each pair of datasets (Löfstedt and Trygg 2011), while accounting for heterogeneity in the joint parts.
Many epidemiological cohort studies have multiple omics datasets measured.Currently, we are developing a meta-analysis approach to obtain more robust results by including multiple cohorts in one analysis.For factor analysis, several methods have been proposed to combine the estimated correlation matrices (Cheung 2015) or factor loadings (Jak and Cheung 2020) across cohorts.However, the pooling step is not based on the asymptotic variance of the estimators, but an arbitrary covariance matrix.For PO2PLS, the asymptotic variance is available as output (for low-dimensional data).Therefore, as an alternative, the PO2PLS model can be extended by adding cohort-common and cohort-specific parameters to the model.Maximum likelihood estimation would yield an 'optimal shared joint space' that incorporates information from each cohort.In such a framework, integration is possible in both 'horizontal' (i.e.across studies) and 'vertical' (across datasets in the same study) direction.
Several extensions of the model can be considered.For example, a penalty term can be added to the likelihood function to incorporate prior belief about which variables are more important or belong together.For O2PLS, such a method was recently proposed (Gu et al. 2021).Extending this approach to PO2PLS would be straightforward.Another extension uses functional counterparts to model functional data such as images or temporal data from devices, which are topics of future research.To conclude, PO2PLS is a complete framework to test for relationships between omics datasets, identify relevant features and predict outcomes.
APPENDIX: An EM algorithm for PO2PLS Theorem A.1 Let X and Y be data matrices with N i.i.d.PO2PLS replicates of (x, y) across the rows.Let r, r x and r y be fixed, satisfying max(r + r x , r + r y ) < N .The loading matrix W is estimated with the following iterative scheme in k, given known starting values for k = 0. Here, The proof is given in the supplementary material.

Figure 1 :
Figure 1: Simulation study: feature selection and prediction error.Upper left figure: proportion of true top 25% among estimated top 25% (TPR) for each method, stratified by simulation scenario.The left and right boxplots represent low and high noise, respectively.Lower left figure: Difference in TPR of several methods and PO2PLS.Lower values are in favor of PO2PLS.Right figure: Root mean squared error of prediction stratified by method and scenario.The left and right boxplots represent the training and test error, respectively.The black line represents the median test error when using true parameter values.

Figure 2 :
Figure 2: Simulation study: global inference.Left panel : Type I error of the global test based on asymptotic PO2PLS statistic, for increasing sample size and p = 20 (upper plot), and increasing dimensionality and N = 500 (lower plot).These are based on 50000 and 2000 replicates, respectively.For all plots, 95% confidence intervals are added based on the binomial distribution, indicated by the points above and below the lines.The dashed horizontal lines represent the 5% rejection level.Right panel : Rejection proportions of the global test performed by the four approaches (asymptotic, non-parametric, parametric bootstrapping, and permutations), for increasing effect size.The sample size was N = 50 with p = 200 (upper plot) and N = 500 with p = 20 (lower plot).These are based on 500 replicates.

Figure 3 :
Figure 3: Glycomic PO2PLS joint components.The five JPCs are plotted one by one, from left to right, from top to bottom.The dots represent the loading values of each glycan, indicating their importance in the genetic-glycomic relationship.The colors and shapes represent the biological grouping of the glycans.In the last row and column, a graphical representation of the structure of a particular glycan is shown.

Figure 4 :
Figure 4: Joint principal component scores across HCM cases and controls.The two joint component scores are plotted against each other, where the first JPC is on the x-axis.The upper plots show transcriptomics resp.epigenetics scores from the PO2PLS fit.The lower plots show O2PLS scores.Each dot represents either an HCM patient (blue circle) or a control (red triangle).

Table 2 :
Annotation of genes in the first transcriptomics joint PC in the HCM analysis.Using PO2PLS, the top 500 genes were clustered using DisGeNET (a database of gene-disease associations).These top 500 genes are primary drivers of the association with epigenetics across HCM cases and controls.Here, p-values are calculated with a Fisher exact test and corrected for multiple testing.The 10 most significant clusters are shown.