2014 UC Davis Proteomics Workshop
Recently I had the pleasure of teaching data analysis at the 2014 UC Davis Proteomics Workshop. This included a hands-on lab for making gene ontology enrichment networks. You can check out my lecture and tutorial below or download all the material.
Introduction
Tutorial

2014 UC Davis Proteomics Workshop Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Using Repeated Measures to Remove Artifacts from Longitudinal Data
Recently I was tasked with evaluating and, most importantly, removing analytical variance from a longitudinal metabolomic analysis carried out over a few years and including >2,500 measurements for >5,000 patients. Even with state-of-the-art analytical instruments and techniques, long-term biological studies are plagued with unwanted trends which are unrelated to the original experimental design and stem from analytical sources of variance (noise added by the process of measurement). Below is an example of a metabolomic measurement with and without analytical variance.

The noise pattern can be estimated based on replicated measurements of quality control samples embedded at a ratio of 1:10 within the larger experimental design. The process of data normalization is used to remove analytical noise from biological signal on a variable-specific basis. At the bottom of this post, you can find an in-depth presentation of how data quality can be estimated and a comparison of many common data normalization approaches. From my analysis I concluded that a relatively simple LOESS normalization is a very powerful method for removal of analytical variance. While LOESS (or LOWESS), locally weighted scatterplot smoothing, is a relatively simple approach to implement, great care has to be taken when optimizing each variable-specific model.
In particular, the span parameter (alpha) controls the degree of smoothing and is a major determinant of whether the model (calculated from repeated measures) is underfit, well fit, or overfit with regard to correcting analytical noise in samples. Below is a visualization of the effect of the span parameter on the model fit.

One method to estimate the appropriate span parameter is to use cross-validation with quality control samples. Having identified an appropriate span, a LOESS model can be generated from repeated measures data (black points) and is used to remove the analytical noise from all samples (red points).
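As a rough sketch of this idea (not the code from the original analysis; `qc`, `samples`, `order`, and `intensity` are hypothetical names), the span can be chosen by leave-one-out cross-validation over the QCs, and the fitted drift then subtracted from all samples:

```r
# Hypothetical inputs: qc and samples are data.frames with injection 'order'
# and measured 'intensity' for a single metabolite.

# Leave-one-out CV error on the QCs for a candidate span
cv_error <- function(span, qc) {
  errs <- sapply(seq_len(nrow(qc)), function(i) {
    fit <- loess(intensity ~ order, data = qc[-i, ], span = span,
                 control = loess.control(surface = "direct"))  # allow extrapolation
    (predict(fit, newdata = qc[i, ]) - qc$intensity[i])^2
  })
  mean(errs, na.rm = TRUE)
}

spans <- seq(0.3, 1, by = 0.1)
best_span <- spans[which.min(sapply(spans, cv_error, qc = qc))]

# Fit the drift model on the QCs and remove it from the experimental samples
fit <- loess(intensity ~ order, data = qc, span = best_span,
             control = loess.control(surface = "direct"))
normalized <- samples$intensity - predict(fit, newdata = samples) +
  median(qc$intensity)  # re-center on the QC median
```

In practice this loop would be repeated independently for every measured variable, since each metabolite can show a different drift pattern.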

Having done this we can now evaluate the effect of removing analytical noise from quality control samples (QCs, training data, black points above) and experimental samples (test data, red points) by calculating the relative standard deviation of the measured variable (standard deviation / mean * 100). In the case of the single analyte ornithine, we can see (above) that the LOESS normalization reduces the overall analytical noise to a large degree. However, we cannot expect the performance for the training data (noise only) to converge with that of the test set, which contains both noise and true biological signal.
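The metric used above is simple to compute; a one-line R version:

```r
# Relative standard deviation (%): standard deviation / mean * 100
rsd <- function(x) sd(x) / mean(x) * 100

rsd(c(98, 102, 100, 101, 99))  # ~1.58 (%)
```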
In addition to evaluating the normalization specific removal of analytical noise on a univariate level we can also use principal components analysis (PCA) to evaluate this for all variables simultaneously. Below is an example of the PCA scores for non-normalized and LOESS normalized data.

We can clearly see that the two largest modes of variance in the raw data correspond to when the samples were analyzed, so-called batch effects. Batch effects can mask true biological variability, and one goal of normalization is to remove them, which we can see is accomplished in the LOESS-normalized data (above right).
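This kind of check is easy to reproduce (a sketch only; `X` and `batch` are hypothetical objects) by coloring PCA scores by acquisition batch:

```r
# X: samples-by-metabolites matrix; batch: factor of acquisition dates
pc <- prcomp(X, center = TRUE, scale. = TRUE)  # autoscaled PCA
plot(pc$x[, 1:2], col = as.integer(batch), pch = 16,
     xlab = "PC1", ylab = "PC2", main = "Scores colored by batch")
legend("topright", legend = levels(batch),
       col = seq_along(levels(batch)), pch = 16)
```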
However, be forewarned: proper model validation is critical to avoiding over-fitting and producing complete nonsense.

In case you are interested, the full analysis and presentation can be found below, as well as the majority of the R code used for the analysis and visualizations.
Principal Components Analysis Shiny App
I’ve recently started experimenting with making Shiny apps, and today I wanted to make a basic app for calculating and visualizing principal components analysis (PCA). Here is the basic interface I came up with. Test drive the app for yourself or check out the R code HERE.
library(shiny)
runGist("5846650") # launch the app directly from its GitHub gist
Above is an example of the user interface, which consists of data upload (from .csv for now) and options for conducting PCA using the pcaMethods package. The various outputs include visualization of the eigenvalues and cross-validated eigenvalues (Q2), which are helpful for selecting the optimal number of model components.
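Under the hood, the calculation the app relies on looks roughly like this (a sketch with a hypothetical matrix `X`; the app exposes more options):

```r
library(pcaMethods)

# X: samples-by-variables matrix; unit-variance (auto) scaling,
# with cross-validation to obtain Q2 for choosing the number of components
pc <- pca(X, method = "nipals", nPcs = 5, scale = "uv",
          center = TRUE, cv = "q2")
pc@R2cum   # cumulative variance explained per component
pc@cvstat  # cross-validated Q2 values
```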
The PCA scores plot can be used to evaluate extreme (leverage) or moderate (DModX) outliers. Hotelling’s T-squared confidence intervals, drawn as an ellipse, would also be a good addition here.
The variable loadings can be used to evaluate the effects of data scaling and other pre-treatments.
The next step is to interface the calculation of PCA to a dynamic plot which can be used to map meta data to plotting characteristics.
Biochemical, Chemical and Mass Spectral Similarity Network
Here is an example of a network leveraging three dominant aspects of metabolomic experiments (biochemical, chemical and mass spectral knowledge) to connect measured variables. This is a network for a blinded data set (sample ids are not known), which I’ve made for a member of my lab presenting their work at the Metabolomics Society Conference in Glasgow, Scotland.
Without knowing the experimental design we can still analyze our data for analytical effects. For example, below is a principal components analysis of ~400 samples and 600 variables, where I’ve annotated the sample scores to show data acquisition date (color) and experimental samples or laboratory quality controls (shape). One thing to look for is trends or groupings in the PCA scores which are correlated with analytical conditions like batch, date, technician, etc.
Finally we can take a look at the PCA variable loadings, which highlight a major bottleneck in metabolomics experiments: the large number of structurally unknown molecular features.
Even using feature-rich electron impact mass spectra (GC/TOF), only 40% of the unknown variables could be connected to known species based on a cosine correlation >0.75. To give you an idea, the cosine correlation (dot product) between the mass spectra of two structurally very similar molecules, xylose and xylitol, is ~0.8.
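The cosine similarity mentioned above is just the normalized dot product of two spectra aligned to a common m/z grid; a minimal R version (toy intensity vectors, not the real xylose/xylitol spectra):

```r
# Cosine similarity between two aligned intensity vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

s1 <- c(100, 0, 35, 80, 5)   # toy spectrum 1
s2 <- c(90, 10, 40, 70, 0)   # toy spectrum 2
cosine(s1, s2)               # ~0.99
```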
Multivariate Modeling Strategy
The following is an example of a clinical study aimed at identification of circulating metabolites related to disease phenotype or grade/severity/type (tissue histology, 4 classifications including controls).
The challenge is to make sense of 300 metabolic measurements for 300 patients.
The goal is to identify metabolites related to disease while accounting for covariates such as gender and smoking.
The steps
- Exploratory Data Analysis – principal components analysis (PCA)
- Statistical Analysis – covariate adjustment and analysis of covariance or ANCOVA
- Multivariate Classification Modeling – orthogonal signal correction partial least squares discriminant analysis (O-PLS-DA)
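The statistical analysis step can be sketched in R as follows (hypothetical object and column names; this shows the general ANCOVA pattern rather than the study's actual code):

```r
# metabolites: samples-by-metabolites matrix; clinical: data.frame with
# disease grade and covariates for the same samples
pvals <- apply(metabolites, 2, function(m) {
  fit <- lm(m ~ grade + gender + smoking, data = clinical)
  anova(fit)["grade", "Pr(>F)"]  # disease effect adjusted for covariates
})
p.adjust(pvals, method = "fdr")  # multiple-testing correction
```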
Data exploration is useful for getting an idea of the data structure and to identify unusual or unexpected trends.
The PCA above, conducted on autoscaled data (300 samples by 300 measurements), was useful for identifying an interesting 2-cluster structure in the sample scores (top left). Unfortunately the goal of the study, disease severity, could not explain this pattern (top center). An unknown covariate was identified as the cause of the observed clustering of samples (top right).
Next, various covariate adjustment strategies were applied to the data and evaluated using unsupervised PCA (bottom left) and supervised O-PLS-DA.
Even after the initial covariate adjustment for the 2-cluster effect there remained a newly visible covariate (top left), the source of which could not be linked to the meta data.
After data pre-treatment and evaluation of testing strategies (top right), the next challenge is to select the best classifiers of disease status. Feature selection was undertaken to improve model performance and simplify its interpretation.
Variable correlation with O-PLS-DA sample scores and the magnitude of variable loadings in the model were used to select, from the full feature set (~300), only the 64 (21%) top features which explained most of the model’s classification performance.
In conclusion, preliminary data exploration was used to identify an unknown source of variance which negatively affected the experimental goal of identifying metabolic predictors of disease severity. Multivariate techniques, PCA and O-PLS-DA, were used to identify an optimal covariate adjustment and hypothesis testing strategy. Finally, O-PLS-DA modeling, including feature selection, training/testing validation (n=100) and permutation testing (n=100), was used to identify the top features (21%) most predictive of patient classification as displaying or not displaying the disease phenotype.
PCA to PLS modeling analysis strategy for WIDE DATA
Working with wide data is already hard enough, add to this row outliers and things can get murky fast.
Here is an example of an analysis of a wide data set, 24 rows x 84 columns.
Using imDEV (written in R) to calculate and visualize a principal components analysis (PCA) on this data set, we find that 7 components capture >80% of the variance in the data (X). We can also clearly see that the first dimension, capturing 35% of the variance in X, is skewed towards one outlier, the larger black point in the plots in the lower left triangle of the figure below, representing the PCA scores.
In this plot representing the results from a PCA:
- Bottom left triangle = PCA scores; red and cyan ellipses display the 2 groups, and the outlier is marked by a larger black point
- Diagonal = PCA eigenvalues, i.e. the variance in X explained by each component
- Top right triangle = PCA loadings, the variable weights whose linear combinations reconstruct the sample scores
In actuality the large variance in the outlier is due to a single-value imputation used to back-fill missing values in 40% of the columns (variables) for this row (sample).
After removing this sample (and another, more moderate outlier), partial least squares projection to latent structures discriminant analysis (PLS-DA) was used to generate a projection of X which maximizes its covariance with Y, here sample membership in the two groups noted by point and ellipse color (red = group 1, cyan = group 2).
The PLS scores separation for the two groups is largely captured in the first latent variable (LV1, x-axis). However we can’t be sure that this separation is better than random chance. To test this we can generate permuted NULL models by using our original X data to discriminate between randomly permuted sample group assignments (Y). When doing the random assignments we can optionally conserve the proportion of cases in each group or maintain this at 1:1.
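Generating the permuted group assignments is simple; the two options mentioned above look like this (a sketch, with `y` a vector of observed group labels):

```r
# Straight shuffle: preserves the observed group proportions
shuffle_keep_proportions <- function(y) sample(y)

# Balanced shuffle: forces an ~1:1 assignment across the group levels
shuffle_balanced <- function(y) {
  sample(rep(levels(factor(y)), length.out = length(y)))
}
```

Each permuted label vector is then used to refit the model, building up the NULL distributions of the performance metrics.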
Comparing our model’s cross-validated fit to the training data, Q^2 (vertical dashed line), to 100 randomly permuted models (TOP PANEL above, cyan distribution), we see that our “one” model generally fits better than the random data. Another interesting comparison is between our model’s root mean squared error of prediction (RMSEP), or out-of-sample error, and the permuted models’ (BOTTOM PANEL above, dark green distribution).
To have more confidence in the assessment of our model we can conduct training and testing validations. We can do this by randomly splitting our original X data into 2/3 training and 1/3 test sets. By using the training data to fit the model, and then using it to predict group memberships for the test data, we can get an idea of the model’s out-of-sample classification error, visualized below using a receiver operating characteristic curve (ROC, RIGHT PANEL).
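A single train/test evaluation can be sketched as below, with logistic regression standing in for PLS-DA (`X` and `y` are hypothetical; any classifier returning class probabilities fits the same pattern):

```r
set.seed(1)
d <- data.frame(class = factor(y), X)            # combine labels and predictors
train <- sample(nrow(d), size = round(2/3 * nrow(d)))  # 2/3 training split

fit  <- glm(class ~ ., data = d[train, ], family = binomial)
prob <- predict(fit, newdata = d[-train, ], type = "response")

# out-of-sample classification error at a 0.5 probability cutoff
pred <- levels(d$class)[(prob > 0.5) + 1]
mean(pred != d$class[-train])
```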
Wow this one model looks “perfect”, based on its assessment using one training/testing evaluation (ROC curve above). However it is important to repeat this procedure to evaluate its performance for other random splits of the data into training and test sets.
After randomly reassigning samples to training and test sets 100 times, we now see that our original “one” model (cyan line in the ROC curve in the TOP PANEL) was overly optimistic compared to the average performance of the 100 models (green lines and distribution above).
Now that we have more confidence in our model’s performance we can compare the distributions of its performance metrics to those calculated for the permuted NULL models above. In the case of the “one” model we can use a one-sample t-test, or for the “100” models a two-sample t-test, to determine the probability of achieving similar performance to our model by random chance.
Now, looking at our model’s RMSEP compared to random chance (BOTTOM LEFT PANEL; out-of-sample error for our model in green, random chance in yellow), we can be confident that our model is worth exploring further.
For example, we can now investigate the variables’ weights and loadings in the PLS model to understand the key differences in these parameters which drive the model’s discrimination of our two groups of interest.
Covariate Adjustment for PLS-DA Models
A typical experiment may involve the testing of a wide variety of factors. For instance, here is an example of an experiment aimed at determining metabolic differences between two plant cultivars at three different ontological stages and in two different tissues. Exploratory principal components analysis (PCA) can be used to evaluate the major modes of variance in the data prior to conducting any univariate tests.
Based on the PCA (autoscaled data) we can see that the majority of the differences are driven by differences between tissues. This is evident from the scores separation in (a) between leaf and fruit tissues, which is driven by metabolites with large positive/negative loadings on the first dimension or x-axis in (b). A lesser mode of variance is captured in the second dimension, and particularly in fruit we can see that there is some separation in scores between the two cultivars and their different ontological stages. Based on this it was concluded to carry out tests in leaf and fruit tissue separately. Additionally, in order to identify the effects of cultivar on the metabolomic profiles which are independent of stage, and vice versa, linear covariate adjustments were applied to the data.
Again using PCA and focusing on fruit tissue, we can evaluate the variance in the data given our hypotheses (differences between cultivars or stages). Looking at (a) we can see that there is not a clear separation in scores between cultivars or stages in any one dimension; however, there is separation in two dimensions. This is problematic in that it suggests an interaction between cultivar and stage, which will complicate any univariate tests for these factors. Carrying out a linear covariate adjustment for either cultivar (b) or stage (c) translates the variance for the target hypothesis into one dimension, which therefore simplifies its testing. Note, this is exactly what is done in an analysis of covariance, or ANCOVA. However, if we want to use this same favorable variance structure for multivariate modeling, for example partial least squares projection to latent structures discriminant analysis (PLS-DA), we need to covariate-adjust the data itself, which in this case is achieved by taking the residuals from a linear model for the covariate we want to adjust for.
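The residual-based adjustment described above is a one-liner per variable; a sketch with hypothetical names (`X_fruit`, `stage`):

```r
# Remove the stage effect from every metabolite, leaving cultivar-related
# variance in the residuals
adjust <- function(x, covariate) resid(lm(x ~ covariate))
X_adjusted <- apply(X_fruit, 2, adjust, covariate = stage)
# X_adjusted can now be modeled for cultivar (e.g., with PLS-DA)
```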
Now that we have adjusted the data for covariate effects we can test the primary hypotheses (differences between cultivars, stages and tissues) using PLS-DA. Quick visual inspection of model scores can be used to get a feel for the quality of the models. Ideally we would like to see a scores separation between the various levels of our hypotheses in one dimension. We can see that both fruit models are of higher quality than that for leaf. However, to fully validate these models we need to carry out permutation testing or something similar. The benefit of PLS-DA is that we can use the information about the variables’ contributions to the scores separation, the loadings, to identify metabolomic differences between cultivars or with increasing maturity or stage.
Here is an example where PLS-DA variable loadings are mapped into a biochemical context using a chemical similarity network. This network represents differences in metabolites due to cultivar, wherein significant differences in metabolite means (ANCOVA, FDR adjusted p-value < 0.05) between cultivars are represented by nodes or vertices which are colored based on the sign of the loading and their size used to encode the magnitude of their loading in the model.
We can now compare the two networks representing metabolomic differences due to cultivar (far top) or to stage (above) to identify biochemical changes due to these factors which are independent of each other’s effects (or their interaction).
Discriminating Between Iris Species
The Iris data set is famous for its use in comparing classification methods.
The goal is to use information about flower characteristics to accurately classify the 3 species of Iris. We can look at scatter plots of the 4 variables in the data set and see that no single variable nor bivariate combination can achieve this.
One approach to improve the separation between the two closely related Iris species, I. versicolor (blue) and I. virginica (green), is to use a combination of all 4 measurements by constructing principal components (PCs).
Using the singular value decomposition to calculate PCs we see that the sample scores above are not resolved for the two species of interest.
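This PCA is easy to reproduce with base R on the built-in iris data:

```r
# PCA of the four flower measurements via SVD (prcomp) on autoscaled data
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pc)                         # variance explained by each PC
plot(pc$x[, 1:2], col = iris$Species,
     xlab = "PC1", ylab = "PC2")    # versicolor and virginica still overlap
```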
Another approach is to use a supervised projection method like partial least squares (PLS) to identify latent variables (LVs), which are data projections similar to those of PCA but which are also correlated with the species label. Interestingly, this approach leads to a projection which changes the relative orientation of I. versicolor and I. virginica to I. setosa. However, this supervised approach is not enough to identify a hyperplane of separation between all three species.
Non-linear PCA via neural networks can be used to identify the hypersurface of separation, shown above. Looking at the scores we can see that this approach is the most successful at resolving the two closely related species. However, the loadings from this method, which would describe how the variables are combined to achieve the classification, are impossible to interpret. In the case of the function used above (nlpca, pcaMethods R package) the loadings are literally NA.