Classification with O-PLS-DA
Partial least squares (PLS) is a versatile algorithm which can be used to predict either continuous or discrete/categorical variables. Classification with PLS is termed PLS-DA, where the DA stands for discriminant analysis. The PLS-DA algorithm has many favorable properties for dealing with multivariate data; one of the most important of which is how variable collinearity is dealt with, and the model’s ability to rank variables’ predictive capacities within a multivariate context. Orthogonal signal correction PLS-DA or O-PLS-DA is an extension of PLS-DA which seeks to maximize the explained variance between groups in a single dimension or the first latent variable (LV), and separate the within group variance (orthogonal to classification goal) into orthogonal LVs. The variable loadings and/or coefficient weights from a validated O-PLS-DA model can be used to rank all variables with respect to their performance for discriminating between groups. This can be used part of a dimensional reduction or feature selection task which seek to identify the top predictors for a given model.
Like with most predictive modeling or forecasting tasks, model validation is a critical requirement. Otherwise the produced models maybe overfit or perform no better than coin flips. Model validation is the process of defining the models performance, and thus ensuring that the model’s internal variable rankings are actually informative.
Below is a demonstration of the development and validation of an O-PLS-DA multivariate classification model for the famous Iris data set.
O-PLS-DA model validation Tutorial
- Data pretreatment and preparation
- Model optimization
- Permutation testing
- Internal cross-validation
- External cross-validation
The Iris data only contains 4 variables, but the sample sizes are favorable for demonstrating a two tiered testing and training scheme (internal and external cross-validation). However O-PLS really shines when building models with many correlated variables (coming soon).
Orthogonal Signal Correction Partial Least Squares (O-PLS) in R
I often need to analyze and model very wide data (variables >>>samples), and because of this I gravitate to robust yet relatively simple methods. In my opinion partial least squares (PLS) is a particular useful algorithm. Simply put, PLS is an extension of principal components analysis (PCA), a non-supervised method to maximizing variance explained in X, which instead maximizes the covariance between X and Y(s). Orthogonal signal correction partial least squares (O-PLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between X and Y on the first latent variable, and components >1 capture variance in X which is orthogonal (or unrelated) to Y.
Because R does not have a simple interface for O-PLS, I am in the process of writing a package, which depends on the existing package pls.
Today I wanted to make a small example of conducting O-PLS in R, and at the same time take a moment to try out the R package knitr and RStudio for markdown generation.
You can take a look at the O-PLS/O-PLS-DA tutorials.
I was extremely impressed with ease of using knitr and generating markdown from code using RStudio. A big thank you to Yihui Xie and the RStudio developers (Joe Cheng). This is an amazing capability which I will make much more use of in the future!
Evaluation of Orthogonal Signal Correction for PLS modeling (O-PLS)
Partial least squares projection to latent structures or PLS is one of my favorite modeling algorithms.
PLS is an optimal algorithm for predictive modeling using wide data or data with rows << variables. While there is s a wealth of literature regarding the application of PLS to various tasks, I find it especially useful for biological data which is often very wide and comprised of heavily inter-correlated parameters. In this context PLS is useful for generating single dimensional answers for multidimensional or multi-
factorial questions while overcoming the masking effects of redundant information or multicollinearity.
In my opinion an optimal PLS-based classification/discrimination model (PLS-DA) should capture the maximum difference between groups/classes being classified in the first dimension or latent variable (LV) and all information orthogonal to group discrimination should omitted from the model.
Unfortunately this is almost never the case and typically the plane of separation between class scores in PLS-DA models span two or more dimensions. This is sub-optimal because we are then forced to consider more than one dimension or model latent variable (LV) when answering the question: how are variables the same/different between classes and which of differences are the most important.
To the left is an example figure showing how principal components (PCA), PLS-DA and orthogonal signal correction PLS-DA (O-PLS-DA) vary in their ability to capture the maximum variance between classes (red and cyan) in the first dimension or LV (x-axis).
The aim O-PLS-DA is to maximize the variance between groups in the first dimension (x-axis).
Unfortunately there are no user friendly functions in R for carrying out O-PLS. Note- the package muma contains functions for O-PLS, but it is not easy to use because it is deeply embedded within an automated reporting scheme.
Luckily Ron Wehrens published an excellent book titled Chemometrics with R which contains an R code example for carrying out O-PLS and O-PLS-DA.
I adapted his code to make some user friendly functions (see below) for generating O-PLS models and plotting their results . I then used these to generate PLS-DA and O-PLS-DA models for a human glycomics data set. Lastly I compare O-PLS-DA to OPLS-DA (trademarked Umetrics, calculated using SIMCA 13) model scores.
The first task is to calculate a large (10 LV) exploratory model for 0 and 1 O-LVs.
Doing this we see that a 2 component model minimize the root mean squared error of prediction on the training data (RMSEP), and the O-PLS-DA model has a lower error than PLS-DA. Based on this we can calculate and compare the sample scores, variable loadings, and changes in model weights for 0 and 1 orthogonal signal corrected (OSC) latent variables PLS-DA models.
Comparing model (sample/row) scores between PLS-DA (0 OSC) and O-PLS-DA (1 OSC) models we can see that the O-PLS-DA model did a better job of capturing the maximal separation between the two sample classes (0 and 1) in the first dimension (x-axis).
Next we can look at how model variable loadings for the 1st LV are different between the PLS-DA and O-PLS-DA models.
We can see that for the majority of variables the magnitude for the model loading was not changed much however there were some parameters whose sign for the loading changed (example: variable 8). If we we want to use the loadings in the 1st LV to encode the variables importance for discriminating between classes in some other visualization (e.g. to color and size nodes in a model network) we need to make sure that the sign of the variable loading accurately reflects each parameters relative change between classes.
To specifically focus on how orthogonal signal correction effects the models perception of variables importance or weights we can calculate the differences in weights (delta weights ) between PLS-DA and O-PLS-DA models.
Comparing changes in weights we see that there looks to be a random distribution of increases or decreases in weight. variables 17 and 44 were the most increased in weight post OSC and 10 and 38 most decreased. Next we probably would want to look at the change in weight relative to the absolute weight (not shown).
Finally I wanted to compare O-PLS-DA and OPLS-DA model scores. I used Simca 13 to calculate the OPLS-DA (trademarked) model parameters and then devium and inkcape to make a scores visualization.
Generally PLS-DA and OPLS-DA show a similar degree of class separation in the 1st LV. I was happy to see that the O-PLS-DA model seems to have the largest class scores resolution and likely the best predictive performance of all three algorithms, but I will need to validate this by doing model permutations and training and testing evaluations.
Check out the R code used for this example HERE.
PCA to PLS modeling analysis strategy for WIDE DATA
Working with wide data is already hard enough, add to this row outliers and things can get murky fast.
Here is an example of an anlysis of a wide data set, 24 rows x 84 columns.
Using imDEV, written in R, to calculate and visualize a principal components analysis (PCA) on this data set. We find that 7 components capture >80% of the variance in the data or X. We can also clearly see that the first dimension, capturing 35% of the variance in X, is skewed towards one outlier, larger black point in the plots in the lower left triangle of the figure below, representing PCA scores.
In this plot representing the results from a PCA:
- Bottom left triangle = PCA SCORES, red and cyan ellipses display 2 groups, outlier is marked by a larger black point
- Diagonal = PCA Eigen values or variance in X explained by the component
- Top right triangle = PCA Loadings, representing linear combinations variable weights to reconstruct Sample scores
In actuality the large variance in the outlier is due to a single value imputation used to back fill missing values in 40% of the columns (variable) for this row (sample).
Removing this sample (and another more moderate outlier) and using partial least squares projection to latent structures discriminant analysis or PLS-DA to generate a projection of X which maximizes its covariance with Y which here is sample membership among the two groups noted by point and ellipse color (red = group 1, cyan = group 2).
The PLS scores separation for the two groups is largely captured in the first latent variable (LV1, x-axis). However we can’t be sure that this separation is better than random chance. To test this we can generate permuted NULL models by using our original X data to discriminate between randomly permuted sample group assignments (Y). When doing the random assignments we can optionally conserve the proportion of cases in each group or maintain this at 1:1.
Comparing our models (vertical hashed line) cross-validated fit to the training data , Q^2, to 100 randomly permuted models (TOP PANEL above, cyan distribution), we see that generally our “one” model is better fit than that achieved for the random data. Another interesting parameter is the comparison of our model’s root mean error of prediction, RMSEP, or out of sample error to the permuted models’ (BOTTOM PANEL above, dark green distribution).
To have more confidence in the assessment of our model we can conduct training and testing validations. We can do this by randomly splitting our original X data into 2/3 training and 1/3 test sets. By using the training data to fit the model, and then using this to predict group memberships for the test data we can get an idea of the model’s out of sample classification error, visualized below using an receiver operator characteristic curve (ROC, RIGHT PANEL).
Wow this one model looks “perfect”, based on its assessment using one training/testing evaluation (ROC curve above). However it is important to repeat this procedure to evaluate its performance for other random splits of the data into training and test sets.
After permuting the samples training/test assignments 100 times; now we see that our original “one” model (cyan line in ROC curve in the TOP PANEL) was overly optimistic compared to the average performance of 100 models (green lines and distribution above).
Now that we have more confidence in our models performance we can compare the distributions for its performance metrics to those we calculated for the permuted NULL models above. In the case of the “one” model we can use a single sample t-Test or for the “100” model a two-sample t-Test to determine the probability of achieving a similar performance to our model by random chance.
Now by looking at our models RMSEP compared to random chance (BOTTOM LEFT PANEL, out of sample error for our model, green, compared to random chance, yellow) we can be confident that our model is worth exploring further.
For example, we can now investigate the variables’ weights and loadings in the PLS model to understand key differences in these parameters which are driving the above models discrimination performance of our two groups of interest.
Visualizing Sample Scores Trajectory for Repeated Measures
PLS-DA sample scores for a discrimination model identifying multivariate changes in metabolomic measurements before (pre) or after (post) some experimental manipulation.
Based on scores plot for samples given changes in >300 biological parameters; it looks like there are two patterns of samples movement through this principal predictive plane. Few move in the direction capturing the most variance in data matrix or (x-axis, 31%), but the majority show an interaction between x and y (the second dimension explaining only 8%). Also, the most pre or before looking samples in the first dimension (142 and 77, note farthest right) are the least changed post or after the experimental treatment.
Covariate Adjustement for PLS-DA Models
A typical experiment may involve the testing of a wide variety of factors. For instance, here is an example of an experiment aimed at determining metabolic differences between two plant cultivars at three different ontological stages and in two different tissues. Exploratory principal components analysis (PCA) can be used to evaluate the major modes of variance in the data prior to conducting any univariate tests.
Based on the PCA (autoscaled data) we can see that the majority of the differences are driven by differences between tissues. This is evident from the scores separation in (a) between leaf and fruit tissues, which is driven by metabolites with large positive/negative loadings on the first dimension or x-axis in (b). A lesser mode of variance is captured in the second dimension, and particularly in fruit we can see that there is some separation in scores between the two cultivars and their different ontological stages. Based on this it was concluded to carry out test in leaf and fruit tissue separately. Additionally in order to identify the effects of cultivar on the metabolomic profiles which are independent of stage and vice versa, a linear covariate adjustments were applied to the data.
Again using PCA and focusing on fruit tissue, we can evaluate the variance in the data given our hypotheses (differences between cultivars or stages). Looking at (a) we can see that there is not a clear separation in scores in any one dimension between cultivars or stages. However there is separation in two dimensions. This is problematic in that this suggest that there is an interaction between cultivar and stage, which will complicate any univariate tests for these factors. We can see that carrying out linear covariate adjustment either for cultivar (b) or stage (c) translate the variance for the target hypothesis into one dimension, which therefore simplifies its testing. Note, this is exactly what is done when doing an analysis of covariance or ANCOVA. However if we want to use this same favorable variance environment for multivariate modeling like for example partial least squares projection to latent structures discriminant analysis (PLS-DA) we need to covariate adjust the data which in this case is achieved by taking the residuals from linear model for the covariate we want to adjust for.
Now that we have adjusted the data for covariate effects we can test the primary hypotheses (differences between cultivars, stages and tissues) using PLS-DA. Quick visual inspection of model scores can be used to get a feel for the quality of the models. Ideally we would like to see a scores separation between the various levels of our hypotheses in one dimension. We can see that both fruit models are of higher quality than that for leaf. However to fully validate these models we need to carry out some permutation testing or something similar. The benefit of PLS-DA is that we can use the information about the variables contribution to the scores separation or loadings to identify metabolomic differences between cultivars or with increasing maturity or stage.
Here is an example where PLS-DA variable loadings are mapped into a biochemical context using a chemical similarity network. This network represents differences in metabolites due to cultivar, wherein significant differences in metabolite means (ANCOVA, FDR adjusted p-value < 0.05) between cultivars are represented by nodes or vertices which are colored based on the sign of the loading and their size used to encode the magnitude of their loading in the model.
We can now compare the two networks representing metabolomic differences due to cultivar (far top) or to stage (above) to identify biochemical changes due to these factors which are independent of each others effects (or interaction).
Discriminating Between Iris Species
The Iris data set is a famous for its use to compare unsupervised classifiers.
The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that no single variable nor bivariate combination can achieve this.
One approach to improve the separation between the two closely related Iris species, I.versicolor (blue) and I.virginica (green), is to use a combination of all 4 measurements, by constructing principal components (PCs).
Using the singular value decomposition to calculate PCs we see that the sample scores above are not resolved for the two species of interest.
Another approach is to use a supervised projection method like partial least squares (PLS), to identify Latent Variables (LVs) which are data projections similar to those of PCA, but which are also correlated with the species label. Interestingly this approach leads to a projection which changes the relative orientation of I. versicolor and I. verginica to I. setaosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.
Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores we can see that this approach is the most success for resolving the two closely related species. However, the loadings from this method, which help relate how the variables are combined achieve the classification, are impossible to interpret. In the case of the function used above(nlPca, pcaMethods R package) the loadings are literally NA.






























