PLS-DA | Creative Data Solutions

Classification with O-PLS-DA

Partial least squares (PLS) is a versatile algorithm which can be used to predict either continuous or discrete/categorical variables. Classification with PLS is termed PLS-DA, where the DA stands for discriminant analysis. The PLS-DA algorithm has many favorable properties for dealing with multivariate data; one of the most important of which is how variable collinearity is dealt with, and the model’s ability to rank variables’ predictive capacities within a multivariate context. Orthogonal signal correction PLS-DA or O-PLS-DA is an extension of PLS-DA which seeks to maximize the explained variance between groups in a single dimension or the first latent variable (LV), and separate the within group variance (orthogonal to classification goal) into orthogonal LVs. The variable loadings and/or coefficient weights from a validated O-PLS-DA model can be used to rank all variables with respect to their performance for discriminating between groups. This can be used part of a dimensional reduction or feature selection task which seek to identify the top predictors for a given model.

Like with most predictive modeling or forecasting tasks, model validation is a critical requirement. Otherwise the produced models maybe overfit or perform no better than coin flips. Model validation is the process of defining the models performance, and thus ensuring that the model’s internal variable rankings are actually informative.

Below is a demonstration of the development and validation of an O-PLS-DA multivariate classification model for the famous Iris data set.

O-PLS-DA model validation Tutorial

The Iris data only contains 4 variables, but the sample sizes are favorable for demonstrating a two tiered testing and training scheme (internal and external cross-validation). However O-PLS really shines when building models with many correlated variables (coming soon).

September 29, 2013 | Categories: Uncategorized | Tags: model validation, O-PLS, O-PLS-DA, PLS-DA, predictive modeling, r-bloggers | 6 Comments

Sessions in Metabolomics 2013

The international summer sessions in metabolomics 2013 came to a happy conclusion this past Friday Sept 6th 2013. I had the pleasure of teaching the topics covering metabolomic data analysis. The class was split into lecture and lab sections. The lab section consisted of a hands on data analysis of:

fresh vs. lyophilized treatment comparison for tomatillo leaf primary metabolomics
tomatillo vs. pumpkin leaf primary metabolites

The majority of the data analyses were implemented using the open source software imDEV and Devium-web.

Download the FULL LAB. Take a look at the goals folder for each lesson. You can follow along with the lesson plans by looking at each subsections respective excel file (.xlsx). When you are done with a section unhide all the worksheets (right click on a tab at the bottom) to view the solutions .

The lectures, preceding the lab, covered the basics of metabolomic data analysis including:

Data Quality Overview and Statistical Analysis

View this document on Scribd

Multivariete Data analysis

View this document on Scribd

Metabolomic Case Studies

View this document on Scribd

September 8, 2013 | Categories: Uncategorized | Tags: ANCOVA, biochemical network, chemical similarity network, Cytoscape, metabolomics, network, O-PLS-DA, PCA, PLS, PLS-DA, summer sessions in metabolomics, tutorial, west coast metabolomics center | Leave a comment

Orthogonal Signal Correction Partial Least Squares (O-PLS) in R

I often need to analyze and model very wide data (variables >>>samples), and because of this I gravitate to robust yet relatively simple methods. In my opinion partial least squares (PLS) is a particular useful algorithm. Simply put, PLS is an extension of principal components analysis (PCA), a non-supervised method to maximizing variance explained in X, which instead maximizes the covariance between X and Y(s). Orthogonal signal correction partial least squares (O-PLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between X and Y on the first latent variable, and components >1 capture variance in X which is orthogonal (or unrelated) to Y.

Because R does not have a simple interface for O-PLS, I am in the process of writing a package, which depends on the existing package pls.

Today I wanted to make a small example of conducting O-PLS in R, and at the same time take a moment to try out the R package knitr and RStudio for markdown generation.

You can take a look at the O-PLS/O-PLS-DA tutorials.

I was extremely impressed with ease of using knitr and generating markdown from code using RStudio. A big thank you to Yihui Xie and the RStudio developers (Joe Cheng). This is an amazing capability which I will make much more use of in the future!

July 28, 2013 | Categories: Uncategorized | Tags: Devium, knitr, O-PLS, O-PLS-DA, PLS, PLS-DA, R, r-bloggers, RStudio | 2 Comments

Network Mapping Video

Here are a video and slides for a presentation of mine about my favorite topic :

View this document on Scribd

June 14, 2013 | Categories: Uncategorized | Tags: biochemical network, biochmical network, chemical similarity network, clustering, Cytoscape, data analysis, data visualization, metabolomics, multivariate, network, network mapping, networks, O-PLS, O-PLS-DA, PCA, PLS, PLS-DA | 1 Comment

Tutorials Covering Biological Data Analysis Strategies

I’ve posted two new tutorials focused on intermediate and advanced strategies for biological, and specifically metabolomic data analysis (click titles for pdfs).

View this document on Scribd

May 29, 2013 | Categories: Uncategorized | Tags: ANCOVA, chemical similarity network, classification, climate, correlation network, covariate adjustment, data analysis, data visualization, Gaussian graphical Markov metabolic network, imDEV, metabolomics, network, PCA, PLS, PLS-DA, R, research, science, TeachingDemos, tutorial | Leave a comment

Evaluation of Orthogonal Signal Correction for PLS modeling (O-PLS)

Partial least squares projection to latent structures or PLS is one of my favorite modeling algorithms.

PLS is an optimal algorithm for predictive modeling using wide data or data with rows << variables. While there is s a wealth of literature regarding the application of PLS to various tasks, I find it especially useful for biological data which is often very wide and comprised of heavily inter-correlated parameters. In this context PLS is useful for generating single dimensional answers for multidimensional or multi-
factorial questions while overcoming the masking effects of redundant information or multicollinearity.

In my opinion an optimal PLS-based classification/discrimination model (PLS-DA) should capture the maximum difference between groups/classes being classified in the first dimension or latent variable (LV) and all information orthogonal to group discrimination should omitted from the model.

Unfortunately this is almost never the case and typically the plane of separation between class scores in PLS-DA models span two or more dimensions. This is sub-optimal because we are then forced to consider more than one dimension or model latent variable (LV) when answering the question: how are variables the same/different between classes and which of differences are the most important.

To the left is an example figure showing how principal components (PCA), PLS-DA and orthogonal signal correction PLS-DA (O-PLS-DA) vary in their ability to capture the maximum variance between classes (red and cyan) in the first dimension or LV (x-axis).

The aim O-PLS-DA is to maximize the variance between groups in the first dimension (x-axis).

Unfortunately there are no user friendly functions in R for carrying out O-PLS. Note- the package muma contains functions for O-PLS, but it is not easy to use because it is deeply embedded within an automated reporting scheme.

Luckily Ron Wehrens published an excellent book titled Chemometrics with R which contains an R code example for carrying out O-PLS and O-PLS-DA.

I adapted his code to make some user friendly functions (see below) for generating O-PLS models and plotting their results . I then used these to generate PLS-DA and O-PLS-DA models for a human glycomics data set. Lastly I compare O-PLS-DA to OPLS-DA (trademarked Umetrics, calculated using SIMCA 13) model scores.

The first task is to calculate a large (10 LV) exploratory model for 0 and 1 O-LVs.

Doing this we see that a 2 component model minimize the root mean squared error of prediction on the training data (RMSEP), and the O-PLS-DA model has a lower error than PLS-DA. Based on this we can calculate and compare the sample scores, variable loadings, and changes in model weights for 0 and 1 orthogonal signal corrected (OSC) latent variables PLS-DA models.

Comparing model (sample/row) scores between PLS-DA (0 OSC) and O-PLS-DA (1 OSC) models we can see that the O-PLS-DA model did a better job of capturing the maximal separation between the two sample classes (0 and 1) in the first dimension (x-axis).

Next we can look at how model variable loadings for the 1st LV are different between the PLS-DA and O-PLS-DA models.

We can see that for the majority of variables the magnitude for the model loading was not changed much however there were some parameters whose sign for the loading changed (example: variable 8). If we we want to use the loadings in the 1st LV to encode the variables importance for discriminating between classes in some other visualization (e.g. to color and size nodes in a model network) we need to make sure that the sign of the variable loading accurately reflects each parameters relative change between classes.

To specifically focus on how orthogonal signal correction effects the models perception of variables importance or weights we can calculate the differences in weights (delta weights ) between PLS-DA and O-PLS-DA models.

Comparing changes in weights we see that there looks to be a random distribution of increases or decreases in weight. variables 17 and 44 were the most increased in weight post OSC and 10 and 38 most decreased. Next we probably would want to look at the change in weight relative to the absolute weight (not shown).

Finally I wanted to compare O-PLS-DA and OPLS-DA model scores. I used Simca 13 to calculate the OPLS-DA (trademarked) model parameters and then devium and inkcape to make a scores visualization.

Generally PLS-DA and OPLS-DA show a similar degree of class separation in the 1st LV. I was happy to see that the O-PLS-DA model seems to have the largest class scores resolution and likely the best predictive performance of all three algorithms, but I will need to validate this by doing model permutations and training and testing evaluations.

Check out the R code used for this example HERE.

March 15, 2013 | Categories: Uncategorized | Tags: ggplot2, O-PLS-DA, orthogonal signal correction, PLS, PLS-DA, R | Leave a comment

PCA to PLS modeling analysis strategy for WIDE DATA

Working with wide data is already hard enough, add to this row outliers and things can get murky fast.

Here is an example of an anlysis of a wide data set, 24 rows x 84 columns.

Using imDEV, written in R, to calculate and visualize a principal components analysis (PCA) on this data set. We find that 7 components capture >80% of the variance in the data or X. We can also clearly see that the first dimension, capturing 35% of the variance in X, is skewed towards one outlier, larger black point in the plots in the lower left triangle of the figure below, representing PCA scores.

In this plot representing the results from a PCA:

Bottom left triangle = PCA SCORES, red and cyan ellipses display 2 groups, outlier is marked by a larger black point
Diagonal = PCA Eigen values or variance in X explained by the component
Top right triangle = PCA Loadings, representing linear combinations variable weights to reconstruct Sample scores

In actuality the large variance in the outlier is due to a single value imputation used to back fill missing values in 40% of the columns (variable) for this row (sample).

Removing this sample (and another more moderate outlier) and using partial least squares projection to latent structures discriminant analysis or PLS-DA to generate a projection of X which maximizes its covariance with Y which here is sample membership among the two groups noted by point and ellipse color (red = group 1, cyan = group 2).

The PLS scores separation for the two groups is largely captured in the first latent variable (LV1, x-axis). However we can’t be sure that this separation is better than random chance. To test this we can generate permuted NULL models by using our original X data to discriminate between randomly permuted sample group assignments (Y). When doing the random assignments we can optionally conserve the proportion of cases in each group or maintain this at 1:1.

Comparing our models (vertical hashed line) cross-validated fit to the training data , Q^2, to 100 randomly permuted models (TOP PANEL above, cyan distribution), we see that generally our “one” model is better fit than that achieved for the random data. Another interesting parameter is the comparison of our model’s root mean error of prediction, RMSEP, or out of sample error to the permuted models’ (BOTTOM PANEL above, dark green distribution).

To have more confidence in the assessment of our model we can conduct training and testing validations. We can do this by randomly splitting our original X data into 2/3 training and 1/3 test sets. By using the training data to fit the model, and then using this to predict group memberships for the test data we can get an idea of the model’s out of sample classification error, visualized below using an receiver operator characteristic curve (ROC, RIGHT PANEL).

Wow this one model looks “perfect”, based on its assessment using one training/testing evaluation (ROC curve above). However it is important to repeat this procedure to evaluate its performance for other random splits of the data into training and test sets.

After permuting the samples training/test assignments 100 times; now we see that our original “one” model (cyan line in ROC curve in the TOP PANEL) was overly optimistic compared to the average performance of 100 models (green lines and distribution above).

Now that we have more confidence in our models performance we can compare the distributions for its performance metrics to those we calculated for the permuted NULL models above. In the case of the “one” model we can use a single sample t-Test or for the “100” model a two-sample t-Test to determine the probability of achieving a similar performance to our model by random chance.

Now by looking at our models RMSEP compared to random chance (BOTTOM LEFT PANEL, out of sample error for our model, green, compared to random chance, yellow) we can be confident that our model is worth exploring further.

For example, we can now investigate the variables’ weights and loadings in the PLS model to understand key differences in these parameters which are driving the above models discrimination performance of our two groups of interest.

March 2, 2013 | Categories: Uncategorized | Tags: imDEV, model validation, outliers, PCA, pcaMethods, permutation, PLS, PLS-DA, R | Leave a comment

Visualizing Sample Scores Trajectory for Repeated Measures

PLS-DA sample scores for a discrimination model identifying multivariate changes in metabolomic measurements before (pre) or after (post) some experimental manipulation.

Based on scores plot for samples given changes in >300 biological parameters; it looks like there are two patterns of samples movement through this principal predictive plane. Few move in the direction capturing the most variance in data matrix or (x-axis, 31%), but the majority show an interaction between x and y (the second dimension explaining only 8%). Also, the most pre or before looking samples in the first dimension (142 and 77, note farthest right) are the least changed post or after the experimental treatment.

February 15, 2013 | Categories: Uncategorized | Tags: network, PLS-DA, repeated measures, scores trajectory | Leave a comment

Visualization of Multivariate Biological Models (PLS-DA and O-PLS-DA)

Its not uncommon to be faced by multiple questions at the same time. For instance imagine the following experimental design. You have one MAIN question: what is different between groups A and B, but among groups A and B are subgroups 1 and 2. This complicates things because now the answer to the MAIN question (what is different between A and B) may be slightly different for the two sub groups A|1, A|2 and B|1, B|2.

In statistics we can account for these types of experimental designs by choosing different tests. For instance in the case outlined above we could use a two-way analysis of variance (2-way ANOVA) to identify differences between A|B which are independent of differences between 1|2 (and interaction between A|B and 1|2). In the case of multivariate modeling we can achieve a similar effect by using covariate adjustments. For example we can use the residuals from a simple linear model for differences between 1|2 as the 1|2-effect adjusted data to be used to test for differences between A|B. Here is a visual example of this approach using:

2) 1|2–adjusted PLS-DA model for A|B1) PCA to evaluate the data variance between A and B (GREEN and RED) and 1 and 2 (SMALL or LARGE)

3) 1|2–adjusted O-PLS-DA model for A|B

Based on the PCA we see that the differences between A|B are also affected by 1|2. This is evident in distribution of scores based on LARGE|SMALL among A ( A|1 (GREEN|SMALL) is more different (further right) from all B than A|2 (GREEN|LARGE). The same can be said for B, and in particular the greatest differences between all groups is between those which have the greatest separation in the X-axis (1st principal component) which are RED|LARGE and GREEN|SMALL.

To identify the greatest difference between RED|GREEN which is independent of differences due to SMALL|LARGE, we can use a SMALL|LARGE -adjusted data to create a PLS-DA model to discriminate between RED|GREEN.

This projection of the differences between A|B is the same for SMALL|LARGE groups. Ideally we want the two groups scores to be maximally separated in the X-axis or 1st LV. We see that this is not the case above, and instead the explanation of how the variables contribute to differences between GREEN|RED needs to be answered by explaining scores variance in X and Y axes or two dimensions.

Next we try the O-PLS-DA algorithm, which aims to rotate the projection of the data to maximize the separation between GREEN|RED on the X-axis and capture unrelated or orthogonal variance on the Y-axis.
The O-PLS-DA model loadings for the 1st LV provide information regarding differences in variable magnitudes between the two groups (GREEN|RED).

We can use network mapping to visualize these weights within a domain specific context. In the case of metabolomics data this is best achieved using biochemical/chemical similarity networks.

We can create these networks by assigning edges between vertices (representing metabolites) based on biochemical relationships (KEGG RPAIRs ) or chemical similarities (Tanimoto coefficient >0.7). We can then map the O-PLS-DA model loadings to this network’s visual properties (vertex: size, color, border, and inset graphic).

For example we can map vertex size to the matabolite’s importance in the explained discrimination between groups (loading on O-PLS-DA LV 1) and color the direction of change (blue, decrease; red, increase). Metabolites displaying significant differences between RED and GREEN groups (two-way ANOVA, p < 0.05 adjusting for 1|2) are shown at maximum size, with a black border and contain a box-plot visualization.

Here is network mapping the O-PLS-DA model loadings into a biological context and displaying graphs for import parameters means among groups stratified by A|B and 1|2 (left to right: A|1, A|2,B|1,B|2).

Here is another network with the same edge and vertex properties as above, except the inset graphs show differences between groups A|B adjusted for the effect of 1|2.

February 9, 2013 | Categories: Uncategorized | Tags: ANOVA, chemical similarity network, covariate adjustment, Cytoscape, ExCytR, imDEV, metabolomics, O-PLS, O-PLS-DA, PCA, PLS-DA | Leave a comment

Covariate Adjustement for PLS-DA Models

A typical experiment may involve the testing of a wide variety of factors. For instance, here is an example of an experiment aimed at determining metabolic differences between two plant cultivars at three different ontological stages and in two different tissues. Exploratory principal components analysis (PCA) can be used to evaluate the major modes of variance in the data prior to conducting any univariate tests.

Based on the PCA (autoscaled data) we can see that the majority of the differences are driven by differences between tissues. This is evident from the scores separation in (a) between leaf and fruit tissues, which is driven by metabolites with large positive/negative loadings on the first dimension or x-axis in (b). A lesser mode of variance is captured in the second dimension, and particularly in fruit we can see that there is some separation in scores between the two cultivars and their different ontological stages. Based on this it was concluded to carry out test in leaf and fruit tissue separately. Additionally in order to identify the effects of cultivar on the metabolomic profiles which are independent of stage and vice versa, a linear covariate adjustments were applied to the data.

Again using PCA and focusing on fruit tissue, we can evaluate the variance in the data given our hypotheses (differences between cultivars or stages). Looking at (a) we can see that there is not a clear separation in scores in any one dimension between cultivars or stages. However there is separation in two dimensions. This is problematic in that this suggest that there is an interaction between cultivar and stage, which will complicate any univariate tests for these factors. We can see that carrying out linear covariate adjustment either for cultivar (b) or stage (c) translate the variance for the target hypothesis into one dimension, which therefore simplifies its testing. Note, this is exactly what is done when doing an analysis of covariance or ANCOVA. However if we want to use this same favorable variance environment for multivariate modeling like for example partial least squares projection to latent structures discriminant analysis (PLS-DA) we need to covariate adjust the data which in this case is achieved by taking the residuals from linear model for the covariate we want to adjust for.

Now that we have adjusted the data for covariate effects we can test the primary hypotheses (differences between cultivars, stages and tissues) using PLS-DA. Quick visual inspection of model scores can be used to get a feel for the quality of the models. Ideally we would like to see a scores separation between the various levels of our hypotheses in one dimension. We can see that both fruit models are of higher quality than that for leaf. However to fully validate these models we need to carry out some permutation testing or something similar. The benefit of PLS-DA is that we can use the information about the variables contribution to the scores separation or loadings to identify metabolomic differences between cultivars or with increasing maturity or stage.

Here is an example where PLS-DA variable loadings are mapped into a biochemical context using a chemical similarity network. This network represents differences in metabolites due to cultivar, wherein significant differences in metabolite means (ANCOVA, FDR adjusted p-value < 0.05) between cultivars are represented by nodes or vertices which are colored based on the sign of the loading and their size used to encode the magnitude of their loading in the model.

We can now compare the two networks representing metabolomic differences due to cultivar (far top) or to stage (above) to identify biochemical changes due to these factors which are independent of each others effects (or interaction).

January 13, 2013 | Categories: Uncategorized | Tags: ANCOVA, chemical similarity network, covariate adjustment, Cytoscape, PCA, PLS-DA | 2 Comments

Comparison of Serum vs Urine metabolites +

Primary metabolites in human serum or urine.

Oh oh, there seem to be some outliers: serum samples looking like urine and vice versa. Fix these and evaluate using PCA and hierarchical clustering on rank correlations.

Now things look more believable. Next let us test the effects of data pre-treatment on PLS-DA model scores for a 3 group comparison in serum. Ideally group scores would be maximally resolved in the dimension of the first latent variable (x) and inter-group variance would be orthogonal or in the y-axis.

Compared to raw data (TOP) where ~ 3 top variables (glucose, urea and mannitol) dominate the variance structure, the autoscaled model, due to variable-wise mean subtraction and division by the standard deviation, displays a more balanced contribution to scores variance by variables. The larger separation between WHITE and RED class scores along the x-axis suggest improved classifier performance over raw data model and overview of samples with scores outside their respective group’s Hotelling’s T ellipse (95%) might point to a sample outlier to further investigate or potentially exclude from the current test.

December 16, 2012 | Categories: Uncategorized | Tags: autoscaling, clustering, imDEV, metabolomics, normalizations, outliers, PCA, PLS-DA | Leave a comment

Discriminating Between Iris Species

The Iris data set is a famous for its use to compare unsupervised classifiers.

The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that no single variable nor bivariate combination can achieve this.

One approach to improve the separation between the two closely related Iris species, I.versicolor (blue) and I.virginica (green), is to use a combination of all 4 measurements, by constructing principal components (PCs).

Using the singular value decomposition to calculate PCs we see that the sample scores above are not resolved for the two species of interest.

Another approach is to use a supervised projection method like partial least squares (PLS), to identify Latent Variables (LVs) which are data projections similar to those of PCA, but which are also correlated with the species label. Interestingly this approach leads to a projection which changes the relative orientation of I. versicolor and I. verginica to I. setaosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.

Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores we can see that this approach is the most success for resolving the two closely related species. However, the loadings from this method, which help relate how the variables are combined achieve the classification, are impossible to interpret. In the case of the function used above(nlPca, pcaMethods R package) the loadings are literally NA.

August 4, 2012 | Categories: Uncategorized | Tags: classification, data analysis, histogram, imDEV, Iris Data, neural networks, non-linear PCA, PCA, pcaMethods, PLS-DA, R, scatterplot, scatterplot matrix | Leave a comment