ggplot2 | Creative Data Solutions

Push it to the limit: SOM + Clustering + Networks

What is the highest dimensional visualization you can think of? Now imagine it being interactive. The following details a Frankenstein visualization packing a smorgasbord of multivariate goodness.

First up: self-organizing maps (SOM). I fell in love with SOMs after using the kohonen package. The wines data set example is a beautiful display of information.

Elegantly, making the visualization above is relatively easy. SOM is used to organize the data into related groups on a grid. Hierarchical cluster analysis (HCA) is then used to classify the SOM codes (the codebook vector of each grid unit) into three groups.
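As a rough sketch of these two steps (not the code behind the figures), something like the following should work. It assumes kohonen version 3 or later, which provides getCodes(); the 5 x 5 grid, the seed and the default hclust() settings are illustrative choices of mine.

```r
library(kohonen)
data(wines)  # 177 wines x 13 measured compounds, shipped with kohonen

set.seed(7)  # SOM training is stochastic, so fix the seed
som_fit <- som(scale(wines), grid = somgrid(5, 5, "hexagonal"))

# one codebook vector per grid unit (25 x 13 matrix)
codes <- getCodes(som_fit, 1)

# HCA on the codebook vectors, cut into three groups of grid units
unit_cluster <- cutree(hclust(dist(codes)), k = 3)

# each wine inherits the cluster of the grid unit it was mapped to
sample_cluster <- unit_cluster[som_fit$unit.classif]
table(sample_cluster)
```

The unit_cluster vector is what would drive the hexagon background colors, while sample_cluster can be used to group the wines themselves.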


HCA cluster information is mapped to the SOM grid using hexagon background colors. The radial bar plots show the variable (wine compounds’) patterns for samples (wines).


The goal for this project was to reproduce the kohonen.plot using ggplot2 and make it interactive using shiny.


The main idea was to use SOM to calculate the grid coordinates, geom_hexagon for the grid packing and any ggplot for the hexagon-inset sub plots. Some basic inset plots could be bar or line plots.
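To make the grid packing idea concrete, here is a minimal base R sketch of how hexagon outlines can be computed around grid centres. The helper names hex_grid() and hex_vertices() are hypothetical, and the layout convention (unit spacing, every second row shifted by half a unit) is my assumption rather than what the app uses.

```r
# centres of an xdim x ydim hexagonal grid with unit spacing;
# every second row is shifted right by half a unit
hex_grid <- function(xdim, ydim) {
  centres <- expand.grid(col = seq_len(xdim), row = seq_len(ydim))
  centres$x <- centres$col + ifelse(centres$row %% 2 == 0, 0.5, 0)
  centres$y <- 1 + (centres$row - 1) * sqrt(3) / 2
  centres$unit <- seq_len(nrow(centres))
  centres
}

# six corners of a pointy-top hexagon; a circumradius of 1/sqrt(3)
# makes neighbouring hexagons touch when centres are 1 unit apart
hex_vertices <- function(cx, cy, r = 1 / sqrt(3)) {
  angle <- seq(30, 330, by = 60) * pi / 180
  data.frame(x = cx + r * cos(angle), y = cy + r * sin(angle))
}

grid <- hex_grid(4, 4)
polys <- do.call(rbind, lapply(grid$unit, function(u) {
  cbind(unit = u, hex_vertices(grid$x[u], grid$y[u]))
}))

# polys has one row per hexagon corner and could be drawn with
# ggplot2::geom_polygon(aes(x, y, group = unit))
head(polys)
```

Getting the radius slightly wrong is exactly what produces gaps or overlaps between hexagons, so it is worth deriving it from the grid spacing rather than tuning it by eye.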

Part of the beauty is the organization of any ggplot you can think of (optionally grouping the input data or SOM codes) based on the SOM unit classification.

A Pavlovian response might be: does it network?


Yes we can (network). Above is an example of different correlation patterns between wine components in related groups of wines. For example, the green grid points identify wines showing a correlation between phenols and flavonoids (probably reds?). Their distance from each other could be explained (?) by the small grid size (see below).
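For anyone wondering what goes into such a network, here is a small sketch of turning a correlation matrix into an edge list. The data are simulated, the variable names are only placeholders, and both the Spearman method and the 0.6 cutoff are arbitrary assumptions for illustration.

```r
set.seed(3)
n <- 60
phenols <- rnorm(n)
X <- data.frame(
  phenols    = phenols,
  flavonoids = phenols + rnorm(n, sd = 0.4),  # built to correlate with phenols
  alcohol    = rnorm(n),
  ash        = rnorm(n)
)

r <- cor(X, method = "spearman")

# keep each variable pair once (upper triangle of the matrix)
idx <- which(upper.tri(r), arr.ind = TRUE)
edges <- data.frame(
  from = rownames(r)[idx[, 1]],
  to   = colnames(r)[idx[, 2]],
  r    = r[idx]
)

# an edge is drawn only when the correlation is strong enough
edges <- edges[abs(edges$r) >= 0.6, ]
edges
```

Running this separately on the samples of each SOM unit (or cluster) gives one small network per hexagon.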

The next question might be, does it scale?


There is potential. The 4 x 4 grid shows radial bar plot patterns for 16 sub groups among the 3 larger sample groups. The next 6 x 6 plot shows wine compound profiles for 36 ~related subsets of wines.

A useful side effect is that we can use SOM quality metrics to give us an extra-dimensional view into tuning the visualization. For example we can visualize the number of samples per grid point or distances between grid points (dissimilarity in patterns).

This is useful to identify parts of the somClustPlot showing the number of mapped samples and greatest differences.

One problem I experienced was getting the hexagon packing just right. I ended up making controls to move the hexagons ~up/down and zoom in/out on the plot. It is not perfect but shows potential (?) for scaffolding highly multivariate visualizations. Some of my other concerns include the stochastic nature of SOM and the need for random initialization of the embedding. Make sure to call set.seed() before fitting to make the result reproducible, and you might want to try a few seeds. Maybe someone out there knows how to make this aspect of SOM more robust?
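Here is a minimal sketch of the seed advice, again assuming the kohonen package and its wines data. The wrapper fit_som() is a hypothetical helper, and using the mean distance of samples to their winning unit as a quality score is just one simple choice among several.

```r
library(kohonen)
data(wines)

# hypothetical wrapper: fix the seed, then fit a small SOM
fit_som <- function(seed) {
  set.seed(seed)
  som(scale(wines), grid = somgrid(4, 4, "hexagonal"))
}

# same seed, same mapping of samples to grid units
fit_a <- fit_som(1)
fit_b <- fit_som(1)
identical(fit_a$unit.classif, fit_b$unit.classif)

# try a few seeds and compare the mean distance of each sample
# to its winning unit (lower means a tighter fit)
seeds <- c(1, 7, 42)
quality <- sapply(seeds, function(s) mean(fit_som(s)$distances))
names(quality) <- seeds
quality
```

If the quality score or the cluster assignments change a lot between seeds, that is a hint the grid is too large for the data or the training is too short.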

May 19, 2016 | Categories: Uncategorized | Tags: data visualization, ggplot2, multivariate, network, r-bloggers, SOM | 3 Comments

Multivariate Data Analysis and Visualization Through Network Mapping

Recently I had the pleasure of speaking about one of my favorite topics, Network Mapping. This is a continuation of a general theme I’ve previously discussed and involves the merger of statistical and multivariate data analysis results with a network.

Over the past year I’ve been working on two major tools, DeviumWeb and MetaMapR, which aid the process of biological data (metabolomic) network mapping.

DeviumWeb– is a shiny based GUI written in R which is useful for:

data manipulation, transformation and visualization

statistical analysis (hypothesis testing, FDR, power analysis, correlations, etc)

clustering (hierarchical, TODO: k-means, SOM, distribution)

principal components analysis (PCA)

orthogonal partial least squares multivariate modeling (O-/PLS/-DA)

MetaMapR– is also a shiny based GUI written in R which is useful for calculation and visualization of various networks including:

biochemical

structural similarity

mass spectral similarity

correlation

Both of these projects are under development, and my ultimate goal is to design a one-stop-shop ecosystem for network mapping.

In addition to network mapping, the video above and presentation below also discuss normalization schemes for longitudinal data and genomic, proteomic and metabolomic functional analysis both on a pathway and global level.

As always happy network mapping!

June 27, 2014 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, correlation network, Cytoscape, data analysis, data visualization, DeviumWeb, ggplot2, metabolomics, MetaMapR, multivariate, network mapping, O-PLS, R, r-bloggers, shiny, statistical analysis | 6 Comments

Using Repeated Measures to Remove Artifacts from Longitudinal Data

Recently I was tasked with evaluating and, most importantly, removing analytical variance from a longitudinal metabolomic analysis carried out over a few years and including >2,5000 measurements for >5,000 patients. Even using state-of-the-art analytical instruments and techniques, long term biological studies are plagued with unwanted trends which are unrelated to the original experimental design and stem from analytical sources of variance (noise added by the process of measurement). Below is an example of a metabolomic measurement with and without analytical variance.

The noise pattern can be estimated based on replicated measurements of quality control samples embedded at a ratio of 1:10 within the larger experimental design. The process of data normalization is used to remove analytical noise from biological signal on a variable specific basis. At the bottom of this post, you can find an in-depth presentation of how data quality can be estimated and a comparison of many common data normalization approaches. From my analysis I concluded that a relatively simple LOESS normalization is a very powerful method for removal of analytical variance. While LOESS (or LOWESS), locally weighted scatterplot smoothing, is a relatively simple approach to implement, great care has to be taken when optimizing each variable-specific model.

In particular, the span parameter or alpha controls the degree of smoothing and is a major determinant of whether the model (calculated from repeated measures) is underfit, just right or overfit with regards to correcting analytical noise in samples. Below is a visualization of the effect of the span parameter on the model fit.

One method to estimate the appropriate span parameter is to use cross-validation with quality control samples. Having identified an appropriate span, a LOESS model can be generated from repeated measures data (black points) and is used to remove the analytical noise from all samples (red points).
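The idea can be sketched in base R as follows. This is a toy example on simulated data, not the analysis code linked at the end of the post: the drift shape, the candidate spans, the leave-one-out scheme and the choice to divide by the fitted trend are all assumptions made for illustration.

```r
set.seed(1)
n <- 200
run_order <- seq_len(n)
is_qc <- run_order %% 10 == 1                # every 10th injection is a QC

drift  <- 1 + 0.3 * sin(run_order / 40)      # simulated analytical drift
signal <- ifelse(is_qc, 100, rnorm(n, mean = 100, sd = 15))
intensity <- signal * drift * rnorm(n, mean = 1, sd = 0.03)

qc <- data.frame(run_order = run_order[is_qc], intensity = intensity[is_qc])

# leave-one-out cross-validation error of a LOESS fit on the QCs only
loo_error <- function(span) {
  errs <- sapply(seq_len(nrow(qc)), function(i) {
    fit <- loess(intensity ~ run_order, data = qc[-i, ], span = span,
                 degree = 1, control = loess.control(surface = "direct"))
    qc$intensity[i] - predict(fit, newdata = qc[i, ])
  })
  sqrt(mean(errs^2))
}

spans <- c(0.3, 0.5, 0.75, 1)
cv_error <- sapply(spans, loo_error)
best_span <- spans[which.min(cv_error)]

# fit on all QCs, then remove the estimated trend from every sample
fit <- loess(intensity ~ run_order, data = qc, span = best_span,
             degree = 1, control = loess.control(surface = "direct"))
trend <- predict(fit, newdata = data.frame(run_order = run_order))
normalized <- intensity / trend * median(qc$intensity)
```

The surface = "direct" setting lets the model predict slightly outside the range of the QC injections, which matters for samples run after the last QC.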

Having done this we can now evaluate the effect of removing analytical noise from quality control samples (QCs, training data, black points above) and samples (test data, red points) by calculating the relative standard deviation of the measured variable (standard deviation/mean *100). In the case of the single analyte, ornithine, we can see (above) that the LOESS normalization will reduce the overall analytical noise to a large degree. However we can not expect that the performance for the training data (noise only) will converge with that of the test set, which contains both noise and true biological signal.
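The relative standard deviation used here is simple to compute. A minimal helper, with the function name rsd() being my own choice, could look like this:

```r
# relative standard deviation in percent: standard deviation / mean * 100
rsd <- function(x) {
  100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}

# toy example: three replicate measurements of one analyte
replicates <- c(90, 100, 110)
rsd(replicates)
```

Calculating it separately for the QCs and for the samples, before and after normalization, gives the four numbers needed for the comparison described above.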

In addition to evaluating the normalization specific removal of analytical noise on a univariate level we can also use principal components analysis (PCA) to evaluate this for all variables simultaneously. Below is an example of the PCA scores for non-normalized and LOESS normalized data.

We can clearly see that the two largest modes of variance in the raw data explain differences in when the samples were analyzed, which is termed batch effects. Batch effects can mask true biological variability, and one goal of normalizations is to remove them, which we can see is accomplished in the LOESS normalized data (above right).

However be forewarned, proper model validation is critical to avoiding over-fitting and producing complete nonsense.

In case you are interested the full analysis and presentation can be found below as well as the majority of the R code used for the analysis and visualizations.

June 4, 2014 | Categories: Uncategorized | Tags: batch effects, data analysis, data quality, data visualization, Devium, ggplot2, normalizations, PCA, quality controls, r-bloggers, tutorial | Leave a comment

Tutorials- Statistical and Multivariate Analysis for Metabolomics

I recently had the pleasure of participating in the 2014 WCMC Statistics for Metabolomics Short Course. The course was hosted by the NIH West Coast Metabolomics Center and focused on statistical and multivariate strategies for metabolomic data analysis. A variety of topics were covered using 8 hands-on tutorials which focused on:

data quality overview

statistical and power analysis

clustering

principal components analysis (PCA)

partial least squares (O-/PLS/-DA)

metabolite enrichment analysis

biochemical and structural similarity network construction

network mapping

I am happy to have taught the course using all open source software, including R and Cytoscape. The data analysis and visualization were done using Shiny-based apps: DeviumWeb and MetaMapR. Check out some of the slides below or download all the class material and try it out for yourself.

2014 WCMC LC-MS Data Processing and Statistics for Metabolomics by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Special thanks to the developers of Shiny and Radiant by Vincent Nijs.

February 17, 2014 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, Cytoscape, data analysis, data visualization, Devium, ggplot2, hierarchical clustering, mass spectral similarity, metabolomics, MetaMapR, network, O-PLS, O-PLS-DA, PCA, R, r-bloggers, shiny, TeachingDemos, tutorial | 13 Comments

Principal Components Analysis Shiny App

I’ve recently started experimenting with making Shiny apps, and today I wanted to make a basic app for calculating and visualizing principal components analysis (PCA). Here is the basic interface I came up with. Test drive the app for yourself or check out the R code HERE.

library(shiny)
runGist("5846650")

Above is an example of the user interface, which consists of data upload (from .csv for now) and options for conducting PCA using the pcaMethods package. The various outputs include visualization of the eigenvalues and cross-validated eigenvalues (q2), which are helpful for selecting the optimal number of model components. The PCA scores plot can be used to evaluate extreme (leverage) or moderate (DmodX) outliers. A Hotelling’s T-squared confidence ellipse would also be a good addition for this.

The variable loadings can be used to evaluate the effects of data scaling and other pre-treatments.
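To illustrate the point about scaling, here is a small sketch using base R's prcomp() instead of pcaMethods (so no cross-validated q2 here) and the built-in USArrests data, both of which are my substitutions for the app's own code and data.

```r
X <- USArrests  # 50 states x 4 variables with very different variances

pca_raw    <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(X, center = TRUE, scale. = TRUE)

# eigenvalues and the fraction of variance explained per component
eigenvalues <- pca_scaled$sdev^2
explained   <- eigenvalues / sum(eigenvalues)
round(explained, 3)

# loadings on the first component, with and without unit-variance scaling
round(cbind(raw = pca_raw$rotation[, 1], scaled = pca_scaled$rotation[, 1]), 2)
```

Without scaling, the variable with the largest variance dominates the first component; after scaling, the loadings are spread across the variables.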

The next step is to interface the calculation of PCA to a dynamic plot which can be used to map meta data to plotting characteristics.

June 23, 2013 | Categories: Uncategorized | Tags: ggplot2, PCA, R, r-bloggers, shiny | 2 Comments

Dynamic Data Visualizations in the Browser Using Shiny

After being busy the last two weeks teaching and attending academic conferences, I finally found some time to do what I love, program data visualizations using R. After being interested in Shiny for a while, I finally decided to pull the trigger and build my first Shiny app!

I wanted to make a proof of concept app which contained the following dynamics which are the basics of any UI design:

1) dynamic UI options

2) dynamically updated plot based on UI inputs

Here is what I came up with.

Check out the app for yourself or the R code HERE.

library(shiny)
runGist('5792778')

The app consists of a user interface (UI) for selecting the data, variable to plot, grouping factor for colors and four plotting options: boxplot (above), histogram, density plot and bar graph. As an added bonus the user can select to show or hide jittered points in the boxplot visualization.

Generally #2 above was well described and easy to implement, but it took a lot of trial and error to figure out how to implement #1. Basically, to generate dynamic UI objects, a placeholder for each object needs to be created using the function shiny::uiOutput() in the ui.R file, and the object itself (with its arguments) is then built in the server.R file using the function shiny::renderUI(). After getting this to work everything else fell into place.
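A stripped-down sketch of this pattern is shown below. It is not the code from the gist: it is written as a single file rather than separate ui.R and server.R files, and the input names, the two built-in data sets and the histogram are illustrative assumptions.

```r
library(shiny)

ui <- fluidPage(
  selectInput("dataset", "Data", choices = c("iris", "mtcars")),
  uiOutput("variable_ui"),   # placeholder, filled in by the server
  plotOutput("plot")
)

server <- function(input, output, session) {
  data_selected <- reactive(get(input$dataset, "package:datasets"))

  # dynamic UI: the choices depend on the data set that was picked
  output$variable_ui <- renderUI({
    numeric_cols <- names(Filter(is.numeric, data_selected()))
    selectInput("variable", "Variable to plot", choices = numeric_cols)
  })

  # dynamically updated plot based on the UI inputs
  output$plot <- renderPlot({
    req(input$variable, input$variable %in% names(data_selected()))
    hist(data_selected()[[input$variable]],
         main = input$variable, xlab = input$variable)
  })
}

# shinyApp(ui, server)  # uncomment to launch the app
```

The req() call keeps the plot from running before the dynamic input exists, which avoids one of the less informative errors mentioned below.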

Having some experience with making UIs in VBA (Visual Basic) and gWidgets, I find Shiny a joy to work with once you understand some of its inner workings. One aspect which made the learning experience frustrating was the lack of informative errors coming from Shiny functions. Even using all the R debugging tools, having Shiny constantly tell me something was not correctly called from a reactive environment or the error was in the runApp() did not really help. My advice to anyone learning Shiny is to take a look at the tutorials, and particularly the section on Dynamic UI. Then pick a small example to reverse engineer. Don’t start off too complicated, or else you will have a hard time understanding which sections of code are not working as expected.

Finally here are some screen shots, and keep an eye out for more advanced shiny apps in the near future.

June 16, 2013 | Categories: Uncategorized | Tags: bar graph, boxplot, data visualization, density plot, ggplot2, histogram, R, r-bloggers, shiny | 7 Comments

Evaluation of Orthogonal Signal Correction for PLS modeling (O-PLS)

Partial least squares projection to latent structures or PLS is one of my favorite modeling algorithms.

PLS is an optimal algorithm for predictive modeling using wide data, or data with rows << variables. While there is a wealth of literature regarding the application of PLS to various tasks, I find it especially useful for biological data, which is often very wide and comprised of heavily inter-correlated parameters. In this context PLS is useful for generating single dimensional answers for multidimensional or multi-factorial questions while overcoming the masking effects of redundant information or multicollinearity.

In my opinion an optimal PLS-based classification/discrimination model (PLS-DA) should capture the maximum difference between groups/classes being classified in the first dimension or latent variable (LV), and all information orthogonal to group discrimination should be omitted from the model.

Unfortunately this is almost never the case, and typically the plane of separation between class scores in PLS-DA models spans two or more dimensions. This is sub-optimal because we are then forced to consider more than one dimension or model latent variable (LV) when answering the question: how are variables the same/different between classes and which of the differences are the most important?

To the left is an example figure showing how principal components (PCA), PLS-DA and orthogonal signal correction PLS-DA (O-PLS-DA) vary in their ability to capture the maximum variance between classes (red and cyan) in the first dimension or LV (x-axis).

The aim of O-PLS-DA is to maximize the variance between groups in the first dimension (x-axis).

Unfortunately there are no user friendly functions in R for carrying out O-PLS. Note- the package muma contains functions for O-PLS, but it is not easy to use because it is deeply embedded within an automated reporting scheme.

Luckily Ron Wehrens published an excellent book titled Chemometrics with R which contains an R code example for carrying out O-PLS and O-PLS-DA.

I adapted his code to make some user friendly functions (see below) for generating O-PLS models and plotting their results. I then used these to generate PLS-DA and O-PLS-DA models for a human glycomics data set. Lastly I compare O-PLS-DA to OPLS-DA (trademarked by Umetrics, calculated using SIMCA 13) model scores.

The first task is to calculate a large (10 LV) exploratory model for 0 and 1 O-LVs.

Doing this we see that a 2 component model minimizes the root mean squared error of prediction (RMSEP) on the training data, and the O-PLS-DA model has a lower error than PLS-DA. Based on this we can calculate and compare the sample scores, variable loadings, and changes in model weights for PLS-DA models with 0 and 1 orthogonal signal corrected (OSC) latent variables.

Comparing model (sample/row) scores between PLS-DA (0 OSC) and O-PLS-DA (1 OSC) models we can see that the O-PLS-DA model did a better job of capturing the maximal separation between the two sample classes (0 and 1) in the first dimension (x-axis).

Next we can look at how model variable loadings for the 1st LV are different between the PLS-DA and O-PLS-DA models.

We can see that for the majority of variables the magnitude of the model loading was not changed much; however, there were some parameters whose sign for the loading changed (example: variable 8). If we want to use the loadings in the 1st LV to encode the variables' importance for discriminating between classes in some other visualization (e.g. to color and size nodes in a model network) we need to make sure that the sign of the variable loading accurately reflects each parameter's relative change between classes.

To specifically focus on how orthogonal signal correction affects the model's perception of variable importance or weights, we can calculate the differences in weights (delta weights) between PLS-DA and O-PLS-DA models.

Comparing changes in weights we see that there looks to be a random distribution of increases or decreases in weight. Variables 17 and 44 were the most increased in weight post OSC and 10 and 38 most decreased. Next we probably would want to look at the change in weight relative to the absolute weight (not shown).

Finally I wanted to compare O-PLS-DA and OPLS-DA model scores. I used SIMCA 13 to calculate the OPLS-DA (trademarked) model parameters and then Devium and Inkscape to make a scores visualization.

Generally PLS-DA and OPLS-DA show a similar degree of class separation in the 1st LV. I was happy to see that the O-PLS-DA model seems to have the largest class scores resolution and likely the best predictive performance of all three algorithms, but I will need to validate this by doing model permutations and training and testing evaluations.

Check out the R code used for this example HERE.

March 15, 2013 | Categories: Uncategorized | Tags: ggplot2, O-PLS-DA, orthogonal signal correction, PLS, PLS-DA, R | Leave a comment

Power Calculations – relationship between test power, effect size and sample size

I was interested in modeling the relationship between the power and sample size, while holding the significance level constant (p = 0.05), for the common two-sample t-Test. Luckily R has great support for power analysis and I found the function I was looking for in the package pwr.

To calculate the power for the two-sample T-test at different effect and sample sizes I needed to wrap the basic function power.t.test().

# Need pwr package
if(!require(pwr)){install.packages("pwr");library("pwr")}

# t-TEST
#---------------------------------

d<-seq(.1,2,by=.1) # effect sizes
n<-1:150 # sample sizes

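# power for every combination of effect size (columns) and sample size (rows)
# note: power.t.test() comes from base R's stats package; with sd = 1,
# delta is equal to Cohen's d
t.test.power.effect<-as.data.frame(do.call("cbind",lapply(seq_along(d),function(i)
  {

	sapply(seq_along(n),function(j)
		{
			power.t.test(n=n[j],delta=d[i],sd=1,sig.level=0.05,power=NULL,type="two.sample")$power
		})
	})))

t.test.power.effect[is.na(t.test.power.effect)]<-0 # some powers couldn't be calculated, set these to zero
colnames(t.test.power.effect)<-paste(d,"effect size")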

The object t.test.power.effect is a 150 x 20 data frame which lists the power for sample sizes from 1 to 150 and effect sizes from 0.1 to 2 in steps of 0.1. While this is useful as a lookup table, we would ideally like to see a visualization of it. Here is some example code to plot this data using the base and ggplot2 packages.

#plot results using base
#------------------------------------------------
obj<-t.test.power.effect # object to plot
cols<-1:ncol(obj)
color<-rainbow(length(cols), alpha=.5) # colors
lwd<-5 # line thickness

#highlight important effect sizes (defined before they are used for the line types)
imp<-c(2,5,8) # cuts, i.e. d = 0.2, 0.5 and 0.8
cuts<-c("small","medium","large") # based on Cohen 1988
color[imp]<-"black"
lty<-rep(1,length(color))
lty[imp]<-2:(length(imp)+1)
wording<-d
wording[imp]<-cuts

par(fig=c(0,.8,0,1)) # main plot uses the left 80% of the device

#initialize plot
plot(1,type="n",frame.plot=FALSE,xlab="sample size",ylab="power",xlim=c(1,150),ylim=c(0,1),main="t-Test", axes = FALSE)
#add custom axis and grid
abline(v=seq(0,150,by=10),col = "lightgray", lty = "dotted")
abline(h=seq(0,1,by=.05),col = "lightgray", lty = "dotted")
axis(1,seq(0,150,by=10))
axis(2,seq(0,1,by=.05))
#plot lines
for(i in 1:length(cols)){lines(1:150,obj[,cols[i]],col=color[i],lwd=lwd,lty=lty[i])}
#legend
par(fig=c(.65,1,0,1),new=TRUE)
plot.new()
legend("top",legend=wording,col=color,lwd=3,lty=lty,title="Effect Size",bty="n")

Which makes the following graph.

Based on this graph, we can see the relationship between power, effect sizes and sample number. I’ve marked the cutoffs suggested by Cohen 1988 delineating small, medium and large effect sizes. Based on this we can see that if we are designing an experiment and are trying to select a sample size for which our test will be powered at 0.8, we need to consider the expected effect of our experimental treatment. If we think that our treatment should have a moderate effect, we should consider somewhere around 60 samples per group. However, an even better analysis would be to directly calculate the sample number needed to achieve some power and significance level given experimentally derived effect sizes based on preliminary data!
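That direct calculation is a one-liner with power.t.test() from base R's stats package: leave n unspecified and supply the power instead. In this sketch the effect sizes are Cohen's conventional cutoffs rather than values derived from preliminary data, so treat the numbers as placeholders.

```r
# solve for n per group at 80% power and a 0.05 significance level,
# for a medium effect (delta = 0.5 with sd = 1, i.e. Cohen's d = 0.5)
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
                       power = 0.8, type = "two.sample")
n_medium <- ceiling(needed$n)  # round up to whole samples per group
n_medium

# the same calculation for small, medium and large effects
effect_sizes <- c(small = 0.2, medium = 0.5, large = 0.8)
n_per_group <- sapply(effect_sizes, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.8)$n)
})
n_per_group
```

With real preliminary data, delta and sd would be replaced by the observed difference in means and the pooled standard deviation.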

And just for kicks here is the same data plotted using ggplot2.

#plot results using ggplot2
#------------------------------------------------
library(ggplot2);library(reshape)
# x11() # optional: open a separate graphics device
obj<-cbind(size=1:150,t.test.power.effect) #flip object for melting
melted<-cbind(melt(obj, id="size"),effect=rep(d,each=150)) # melt and bind with effect for mapping
# refer to columns by name inside aes() instead of melted$...
ggplot(data=melted, aes(x=size, y=value, color=as.factor(effect))) +
geom_line(linewidth=2,alpha=.5) + # use size=2 with ggplot2 older than 3.4.0
ylab("power") + xlab("sample size") + ggtitle("t-Test")+theme_minimal()

# wow ggplot2 is amazing in its brevity
# need to tweak legend and lty, but otherwise very similar

A little tweaking and these graphs are basically the same. Wow I really need to stop using base for my plots and fully embrace learning ggplot2!

And here is the gist if you want the source code.

January 17, 2013 | Categories: Uncategorized | Tags: effect size, ggplot2, power, power calculations, pwr, R, sample size, significance level | Leave a comment

Excel + Cytoscape + R = ExCytR

My new project is coming along nicely and should be released early 2013. It builds on the structures developed in imDEV to link Excel, Cytoscape and R using RExcel, RCytoscape, and CytoscapeRPC. This trio can be used to rapidly generate beautiful and informative network representations of data.

Here is an example of an undirected Gaussian graphical Markov metabolic network calculated from time course metabolomic measurements generated by gas chromatography time-of-flight mass spectrometry (GC/TOF).

Nodes represent metabolomic variables whose characteristics encode chemometric data and the results of statistical analyses and multivariate modeling. ggplot2 is used to generate graphs of the time course data representing the means and standard error of metabolite concentrations in two study populations. The connections between nodes or edges are calculated from q-order partial correlations using the R package qpgraph.
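To give a feel for why partial correlations make sparser networks than plain correlations, here is a base R sketch on simulated data. Note that it computes full-order partial correlations from the inverse covariance matrix, which is a simpler stand-in and not the q-order method implemented in qpgraph.

```r
set.seed(42)
n <- 200
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.5)   # b depends on a
c <- b + rnorm(n, sd = 0.5)   # c depends on b, and on a only through b
X <- cbind(a = a, b = b, c = c)

# full-order partial correlations from the precision (inverse covariance) matrix
precision <- solve(cov(X))
pcor <- -cov2cor(precision)
diag(pcor) <- 1

round(cor(X), 2)   # a and c look strongly related
round(pcor, 2)     # but not once b is accounted for
```

In the network, the a to c edge would be dropped while a to b and b to c remain, which is what makes these graphs easier to read than plain correlation networks.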