Machine Learning Powered Biological Network Analysis

Video

dave_data

Metabolomic network analysis can be used to interpret experimental results within a variety of contexts including biochemical relationships, structural and spectral similarity, and empirical correlation. Machine learning is useful for modeling relationships in the context of pattern recognition, clustering, classification and regression-based predictive modeling. The combination of metabolomic networks and machine learning based predictive models offers a unique method to visualize empirical relationships while testing key experimental hypotheses. The following presentation focuses on data analysis, visualization, machine learning and network mapping approaches used to create richly mapped metabolomic networks. Learn more at www.createdatasol.com

dave

The following presentation also shows a sneak peek of a new data analysis and visualization software, DAVe: Data Analysis and Visualization engine. Check out some early features. DAVe is built in R and seeks to support a seamless environment for advanced data analysis and machine learning tasks, as well as biological functional and network analysis.

As an aside, building the main site (in progress) was a fun opportunity to experiment with Jekyll, Ruby and embedding slick interactive canvas elements into websites. You can check out all the code here: https://github.com/dgrapov/CDS_jekyll_site.

slides: https://www.slideshare.net/dgrapov/machine-learning-powered-metabolomic-network-analysis

June 11, 2017 | Categories: Uncategorized | Tags: clustering, data analysis, data visualization, genomics, machine learning, network, pathways, proteomics, R, r-bloggers, science, shiny, software, statistics

Complex Systems Biology Informed Data Analysis

Metabolomics and the greater sphere of ‘Omic analyses are a burgeoning set of tools for the investigation of environmental and organismal mechanisms and interactions. Carrying out data analyses within complex biological system contexts is rewarding but also difficult. The following presentation considers the components involved in conducting multivariate data analysis, modeling and visualization within biological contexts.

slides: https://www.slideshare.net/dgrapov/complex-systems-biology-informed-data-analysis-and-machine-learning

June 11, 2017 | Categories: Uncategorized | Tags: clustering, data visualization, genomics, lectures, machine learning, metabolomics, network, pathways, proteomics, research, science, software, statistical analysis

Push it to the limit: SOM + Clustering + Networks

What is the highest dimensional visualization you can think of? Now imagine it being interactive. The following details a Frankenstein visualization packing a smorgasbord of multivariate goodness.

composite2
First, enter self-organizing maps (SOM). I first fell into a love dream with SOMs after using the kohonen package. The wines data set example is a beautiful display of information.

Making the visualization above is relatively easy. SOM is used to organize the data into related groups on a grid. Hierarchical cluster analysis (HCA) is then used to classify the SOM codes into three groups.
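The two steps above can be sketched in a few lines. This is a minimal sketch, not the code behind the figure: it assumes kohonen version 3 or later (for getCodes()), and the 5 x 4 grid, the seed and the scaling are illustrative choices.

```r
# Sketch only: grid size, seed and scaling are illustrative assumptions.
library(kohonen)

data(wines)      # loads the `wines` matrix shipped with kohonen
set.seed(7)      # SOM training is stochastic

wine_som <- som(scale(wines), grid = somgrid(5, 4, "hexagonal"))

# Codebook vectors: one row per SOM unit
codes <- getCodes(wine_som)

# Cluster the SOM units, then cut the tree into three groups
unit_cluster <- cutree(hclust(dist(codes)), k = 3)

# Each wine inherits the cluster of the unit it was mapped to
wine_cluster <- unit_cluster[wine_som$unit.classif]
table(wine_cluster)
```

The cluster labels in unit_cluster are what would drive the hexagon background colors on the grid.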

clusters

HCA cluster information is mapped to the SOM grid using hexagon background colors. The radial bar plots show the variable (wine compounds’) patterns for samples (wines).

radial

The goal for this project was to reproduce the plot.kohonen output using ggplot2 and make it interactive using shiny.

som1

The main idea was to use SOM to calculate the grid coordinates, geom_hexagon for the grid packing and any ggplot for the hexagon-inset subplots. Basic inset plots could be bar or line plots.
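To make the grid packing concrete, here is a hedged sketch of how hexagon outlines could be computed for a hexagonal grid. The helper hex_grid_polygons is hypothetical (it is not from the original project), the unit centres are generated directly rather than taken from a trained SOM, and plain geom_polygon is swapped in for the geom_hexagon mentioned above.

```r
# Hypothetical helper (not from the original project): hexagon outlines
# for an xdim x ydim hexagonal grid with unit spacing between centres.
hex_grid_polygons <- function(xdim, ydim) {
  # Unit centres: every other row is shifted by half a unit,
  # and rows sit sqrt(3)/2 apart
  centres <- expand.grid(col = seq_len(xdim), row = seq_len(ydim))
  centres$x <- centres$col + ifelse(centres$row %% 2 == 0, 0.5, 0)
  centres$y <- centres$row * sqrt(3) / 2
  centres$unit <- seq_len(nrow(centres))

  # A pointy-top hexagon with circumradius 1/sqrt(3) tiles this layout
  angles <- (90 + 60 * 0:5) * pi / 180
  radius <- 1 / sqrt(3)

  do.call(rbind, lapply(centres$unit, function(u) {
    data.frame(unit = u,
               x = centres$x[u] + radius * cos(angles),
               y = centres$y[u] + radius * sin(angles))
  }))
}

hex <- hex_grid_polygons(4, 4)

# Optional plot, built only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(hex, ggplot2::aes(x, y, group = unit)) +
    ggplot2::geom_polygon(fill = "grey90", colour = "white") +
    ggplot2::coord_equal()
}
```

Inset subplots would then be positioned at the unit centres, one per hexagon.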

Part of the beauty is the organization of any ggplot you can think of (optionally grouping the input data or SOM codes) based on the SOM unit classification.

A Pavlovian response might be: does it network?

network

Yes we can (network). Above is an example of different correlation patterns between wine components in related groups of wines. For example, the green grid points identify wines showing a correlation between phenols and flavonoids (probably reds?). Their distance from each other could be explained (?) by the small grid size (see below).

The next question might be, does it scale?

multi

more lines

There is potential. The 4 x 4 grid shows radial bar plot patterns for 16 subgroups among the 3 larger sample groups. The next 6 x 6 plot shows wine compound profiles for 36 ~related subsets of wines.

A useful side effect is that we can use SOM quality metrics to give us an extra-dimensional view into tuning the visualization. For example we can visualize the number of samples per grid point or distances between grid points (dissimilarity in patterns).

This is useful for identifying which parts of the somClustPlot hold the most mapped samples and show the greatest differences.
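Both quality metrics can be pulled from a fitted kohonen model. The following is a rough sketch assuming kohonen version 3 or later; the 4 x 4 grid and seed are illustrative, and the neighbour distance here is one simple definition computed by hand (mean codebook distance to adjacent grid units), not necessarily the one used in the app.

```r
# Sketch only: grid size and seed are illustrative assumptions.
library(kohonen)

data(wines)
set.seed(7)
fit <- som(scale(wines), grid = somgrid(4, 4, "hexagonal"))
n_units <- nrow(fit$grid$pts)

# Samples mapped to each grid point (empty units are kept as zero)
counts <- table(factor(fit$unit.classif, levels = seq_len(n_units)))

# Mean codebook distance from each unit to its immediate grid neighbours
code_dist <- as.matrix(dist(getCodes(fit)))
grid_dist <- as.matrix(dist(fit$grid$pts))
neighbour <- grid_dist > 0 & grid_dist < 1.01   # adjacent units sit 1 apart
mean_neighbour_dist <- rowSums(code_dist * neighbour) / rowSums(neighbour)
```

Units with high counts and large neighbour distances are the natural places to look first.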

One problem I experienced was getting the hexagon packing just right. I ended up making controls to move the hexagons ~up/down and zoom in/out on the plot. It is not perfect but shows potential (?) for scaffolding highly multivariate visualizations. Some of my other concerns include the stochastic nature of SOM and its need for random initialization of the embedding. Make sure to call set.seed() before training to make the result reproducible, and you might want to try a few seeds. Maybe someone out there knows how to make this aspect of SOM more robust?
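One way to try a few seeds is to refit the map under each seed and compare a simple fit summary. This is a sketch under assumptions: som_error is a hypothetical helper, and the mean sample-to-unit distance is just one convenient summary, not an established selection rule.

```r
# Sketch only: helper name, grid size and seeds are illustrative assumptions.
library(kohonen)

data(wines)
X <- scale(wines)

# Mean distance of each sample to its best-matching unit
som_error <- function(seed) {
  set.seed(seed)
  fit <- som(X, grid = somgrid(4, 4, "hexagonal"))
  mean(fit$distances)
}

seeds  <- 1:5
errors <- vapply(seeds, som_error, numeric(1))
data.frame(seed = seeds, mean_distance = errors)
```

If the summary varies a lot between seeds, the grouping on the grid is probably not stable enough to interpret.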

May 19, 2016 | Categories: Uncategorized | Tags: data visualization, ggplot2, multivariate, network, r-bloggers, SOM

Try’in to 3D network: Quest (shiny + plotly)

I have an unnatural obsession with 4-dimensional networks. It might have started with a dream, but VR might make it a reality one day. For now I will settle for 3D networks in Plotly.

Presentation: R users group (more)

More: networkly

April 9, 2016 | Categories: Uncategorized | Tags: network, networkly, plotly, R, r-bloggers, shiny

Network Visualization with Plotly and Shiny

R users: networkly: network visualization in R using Plotly

In addition to their more common uses, networks can be used as powerful multivariate data visualizations and exploration tools. Networks not only provide mathematical representations of data but are also one of the few data visualization methods capable of easily displaying multivariate variable relationships. The process of network mapping involves using the network manifold to display a variety of other information e.g. statistical, machine learning or functional analysis results (see more mapped network examples).

netmaping

The combination of Plotly and Shiny is awesome for creating your very own network mapping tools. Networkly is an R package which can be used to create 2-D and 3-D interactive networks which are rendered with plotly and can be easily integrated into shiny apps or markdown documents. All you need to get started is an edge list and node attributes which can then be used to generate interactive 2-D and 3-D networks with customizable edge (color, width, hover, etc) and node (color, size, hover, label, etc) properties.
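As a rough illustration of the two inputs, here is a base R sketch that builds an edge list and a node attribute table from a correlation matrix. The column names (source, target, color, width, id, size, hover) and the 0.8 cut-off are assumptions made for this example and are not networkly's required schema; the built-in mtcars data stands in for real measurements.

```r
# Illustrative inputs only: column names are assumptions,
# not networkly's required schema.
data(mtcars)
r <- cor(mtcars)

# Edge list: one row per variable pair with a strong correlation
idx <- which(upper.tri(r) & abs(r) > 0.8, arr.ind = TRUE)
edges <- data.frame(
  source = rownames(r)[idx[, "row"]],
  target = colnames(r)[idx[, "col"]],
  weight = r[idx],
  stringsAsFactors = FALSE
)
edges$color <- ifelse(edges$weight > 0, "red", "blue")  # sign of the correlation
edges$width <- abs(edges$weight) * 5                    # strength of the correlation

# Node attributes: one row per variable that appears in the edge list
node_ids <- sort(unique(c(edges$source, edges$target)))
degree <- table(c(edges$source, edges$target))
nodes <- data.frame(
  id = node_ids,
  size = as.integer(degree[node_ids]),  # number of connections
  hover = paste(node_ids, "mean:", round(colMeans(mtcars)[node_ids], 2)),
  stringsAsFactors = FALSE
)
```

Any statistical or machine learning result can be mapped the same way, by adding it as another column of the edge or node table.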

2-Dimensional Network (interactive version)

2dnetwork

3-Dimensional Network (interactive version)

3dnetwork

View all code used to generate the networks above.

February 28, 2016 | Categories: Uncategorized | Tags: data analysis, data visualization, network, network mapping, networkly, plotly, R, r-bloggers, shiny

Omic data integration strategies

Check out the full article pre-print version.

August 23, 2015 | Categories: Uncategorized | Tags: data analysis, data integration, genomics, metabolomics, network, pathways, proteomics

2014 UC Davis Proteomics Workshop

Recently I had the pleasure of teaching data analysis at the 2014 UC Davis Proteomics Workshop. This included a hands-on lab for making gene ontology enrichment networks. You can check out my lecture and tutorial below or download all the material.

Introduction

Tutorial

2014 UC Davis Proteomics Workshop Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

August 9, 2014 | Categories: Uncategorized | Tags: correlation network, Cytoscape, data analysis, data visualization, enrichment, gene ontology, multivariate, network, network enrichment, PCA, PLS, proteomics, r-bloggers, tutorial

High Dimensional Biological Data Analysis and Visualization

High dimensional biological data shares many qualities with other forms of data. Typically it is wide (samples << variables), complicated by experimental design and made up of complex relationships driven by both biological and analytical sources of variance. Luckily the powerful combination of R, Cytoscape (< v3) and the R package RCytoscape can be used to generate high dimensional and highly informative representations of complex biological (and really any type of) data. Check out the following examples of network mapping in action or view a more in-depth presentation of the techniques used below.

Partial correlation network highlighting changes in tumor compared to control tissue from the same patient.

Biochemical and structural similarity network of changes in tumor compared to control tissue from the same patient.

Hierarchical clusters (color) mapped to a biochemical and structural similarity network displaying difference before and after drug administration.

Partial correlation network displaying changes in metabolite relationships in response to drug treatment.

Partial correlation network displaying changes in disease and response to drug treatment.
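For readers new to partial correlation networks, the core calculation can be sketched in base R. This is a minimal sketch on simulated data, not the code behind the figures above: partial_cor is a hypothetical helper, and inverting the covariance matrix only works when there are more samples than variables, so wide data would need a regularized estimate instead.

```r
# Hypothetical helper: partial correlations from the inverse
# covariance (precision) matrix.
partial_cor <- function(x) {
  precision <- solve(cov(x))
  pc <- -precision / sqrt(outer(diag(precision), diag(precision)))
  diag(pc) <- 1
  dimnames(pc) <- list(colnames(x), colnames(x))
  pc
}

# Simulated chain m1 -> m2 -> m3: m1 and m3 are related only through m2
set.seed(1)
n  <- 500
m1 <- rnorm(n)
m2 <- m1 + rnorm(n)
m3 <- m2 + rnorm(n)
x  <- cbind(m1 = m1, m2 = m2, m3 = m3)

round(cor(x), 2)          # m1 and m3 are clearly correlated
round(partial_cor(x), 2)  # but not once m2 is accounted for
```

Drawing edges only for the non-negligible partial correlations removes the indirect m1 to m3 link that a plain correlation network would show.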

Check out the full presentation below.

February 22, 2014 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, clustering, correlation network, Cytoscape, data analysis, data visualization, Devium, metabolomics, multivariate, network, network mapping, O-PLS-DA, r-bloggers, tutorial

Tutorials- Statistical and Multivariate Analysis for Metabolomics

I recently had the pleasure of participating in the 2014 WCMC Statistics for Metabolomics Short Course. The course was hosted by the NIH West Coast Metabolomics Center and focused on statistical and multivariate strategies for metabolomic data analysis. A variety of topics were covered using 8 hands-on tutorials which focused on:

data quality overview

statistical and power analysis

clustering

principal components analysis (PCA)

partial least squares (O-/PLS/-DA)

metabolite enrichment analysis

biochemical and structural similarity network construction

network mapping

I am happy to have taught the course using all open source software, including R and Cytoscape. The data analysis and visualization were done using Shiny-based apps: DeviumWeb and MetaMapR. Check out some of the slides below or download all the class material and try it out for yourself.

2014 WCMC LC-MS Data Processing and Statistics for Metabolomics by Dmitry Grapov is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Special thanks to the developers of Shiny and to Vincent Nijs for Radiant.

February 17, 2014 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, Cytoscape, data analysis, data visualization, Devium, ggplot2, hierarchical clustering, mass spectral similarity, metabolomics, MetaMapR, network, O-PLS, O-PLS-DA, PCA, R, r-bloggers, shiny, TeachingDemos, tutorial

Connecting Data with Context: Metabolomic Examples

I recently gave a presentation of some of my work in network mapping to my research lab. The following covers my progress in the development of my metabolomic network mapping tool MetaMapR, and its application to a variety of data sets including a comparison of normal and malignant lung tissue from the same patient.

November 21, 2013 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, correlation network, Cytoscape, data analysis, data visualization, Gaussian graphical Markov metabolic network, metabolomics, MetaMapR, multivariate, network, network mapping

Sessions in Metabolomics 2013

The international summer sessions in metabolomics 2013 came to a happy conclusion this past Friday, Sept 6th 2013. I had the pleasure of teaching the topics covering metabolomic data analysis. The class was split into lecture and lab sections. The lab section consisted of a hands-on data analysis of:

fresh vs. lyophilized treatment comparison for tomatillo leaf primary metabolomics
tomatillo vs. pumpkin leaf primary metabolites

The majority of the data analyses were implemented using the open source software imDEV and Devium-web.

Download the FULL LAB. Take a look at the goals folder for each lesson. You can follow along with the lesson plans by looking at each subsection's respective Excel file (.xlsx). When you are done with a section, unhide all the worksheets (right click on a tab at the bottom) to view the solutions.

The lectures, preceding the lab, covered the basics of metabolomic data analysis including:

Data Quality Overview and Statistical Analysis


Multivariate Data Analysis


Metabolomic Case Studies


September 8, 2013 | Categories: Uncategorized | Tags: ANCOVA, biochemical network, chemical similarity network, Cytoscape, metabolomics, network, O-PLS-DA, PCA, PLS, PLS-DA, summer sessions in metabolomics, tutorial, west coast metabolomics center

Network Mapping Video

Here are a video and slides for a presentation of mine about my favorite topic:


June 14, 2013 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, clustering, Cytoscape, data analysis, data visualization, metabolomics, multivariate, network, network mapping, networks, O-PLS, O-PLS-DA, PCA, PLS, PLS-DA

American Society for Mass Spectrometry 2013

I am getting ready to present at the upcoming American Society for Mass Spectrometry (ASMS) conference in Minneapolis, Minnesota (don’tcha know).

If you are around, check out my talk in the section Oral: ThOB am – Informatics: Metabolomics on Thursday (06/14) at 8:30 am in room L100. Here is a teaser.

Above is a network representation of biochemical (red edges, KEGG RPAIRS) and structural similarities (gray edges, Tanimoto coefficient > 0.7) of > 1100 biological molecules (see here for some of their descriptions). Keep an eye out for all the R code used to generate this network as well as all the slides from my talk.
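For context on the gray edges, the Tanimoto coefficient between two binary structural fingerprints is the number of bits set in both divided by the number of bits set in either. Here is a small base R sketch using made-up 8-bit fingerprints (real PubChem fingerprints are much longer); the tanimoto helper and the molecule names are hypothetical.

```r
# Tanimoto coefficient: shared bits / bits set in either fingerprint
tanimoto <- function(a, b) {
  a <- as.logical(a)
  b <- as.logical(b)
  either <- sum(a | b)
  if (either == 0) return(NA_real_)  # undefined for two empty fingerprints
  sum(a & b) / either
}

# Toy 8-bit fingerprints (hypothetical, not real PubChem fingerprints)
fp <- rbind(
  mol_A = c(1, 1, 1, 1, 0, 0, 0, 0),
  mol_B = c(1, 1, 1, 1, 1, 0, 0, 0),
  mol_C = c(0, 0, 0, 1, 1, 1, 1, 1)
)

# Pairwise similarity matrix
n <- nrow(fp)
sim <- matrix(1, n, n, dimnames = list(rownames(fp), rownames(fp)))
for (i in seq_len(n)) for (j in seq_len(n)) sim[i, j] <- tanimoto(fp[i, ], fp[j, ])

# Keep structural-similarity edges above the 0.7 cut-off
edge_idx <- which(upper.tri(sim) & sim > 0.7, arr.ind = TRUE)
data.frame(source = rownames(sim)[edge_idx[, "row"]],
           target = colnames(sim)[edge_idx[, "col"]],
           tanimoto = sim[edge_idx])
```

With these toy fingerprints only mol_A and mol_B (4 shared bits out of 5 set in either) pass the cut-off.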

Here is my talk abstract.

Multivariate and network tools for analysis and visualization of metabolomic data

Dmitry Grapov¹,²; Oliver Fiehn¹,²

¹West Coast Metabolomics Center, Davis, CA; ²University of California Davis, Davis, California

NOVEL ASPECT: A software tool for calculation and mapping of statistical and multivariate results from metabolomic experiments into biologically relevant contexts.

————————

INTRODUCTION: While a variety of tools capable of producing network representations of metabolomic data exist, none are fully integrated with statistical and multivariate methods necessary to analyze, visualize and summarize the high dimensional data. We have developed an open source toolset for the analysis of high dimensional biological data which combines the computational capabilities of the R statistical programming environment with the network mapping and visualization features of Cytoscape. A graphical user interface is used to seamlessly integrate calculation and interpretation of statistical and multivariate results in the context of network graphs which are constructed based on biological relationships, chemical similarities or empirical variable dependencies.

—————

METHODS: An R based GUI utilizing RCytoscape and CytoscapeRPC is used to connect R and Cytoscape. Data import, manipulation and export are achieved through an interface to MS Excel and Google Docs. R packages provide a variety of analysis methods including: parametric and non-parametric multiple hypothesis testing, false discovery rate correction, exploratory principal and independent components analyses, hierarchical and model based clustering, and multivariate predictive modeling such as partial least squares and support vector machines. Relationships between biological parameters can be represented in the form of networks which are connected based on user defined edge lists or on PubChem chemical identifiers, which are used to construct biochemical and chemical similarity networks based on KEGG reactant pairs and Tanimoto distances, or Gaussian Markov networks based on partial correlations.

—————-

ABSTRACT: Comparisons of plasma primary metabolite excursion patterns during an oral glucose tolerance test (OGTT) were used to model changes in metabolism associated with a diet and exercise intervention. Plasma aliquots, taken at 30 minute intervals (0-120 minutes), were analyzed by GC/TOF and used to compare metabolite levels (n=323) in a cohort of overweight women before and after a 14 week dietary and exercise regimen. Mixed effects models, partial least squares and partial least squares discriminant analysis (PLS-DA) were used to study OGTT and intervention-associated changes in metabolite baselines, area under the curve for OGTT-associated excursions, and metabolite time course patterns. Metabolic changes due to the oral infusion of glucose were visualized by mapping statistical test p-values and the variable coefficient weights from an intervention-adjusted PLS model for time during the OGTT into a network connected based on KEGG reactant pairs and Tanimoto distances > 70. Vertices, representing metabolites, were sized and colored based on the absolute PLS coefficient magnitude and sign respectively. Metabolites showing significant perturbations during the OGTT (false discovery rate (q = 0.05) adjusted p-value < 0.05) were highlighted with node-inset graphs displaying means and confidence intervals during the time course for before and after intervention comparisons. This network was useful for identifying OGTT-associated interactions between the major biochemical domains (lipids, amino acids, organic acids, and carbohydrates). In a follow-up analysis a Gaussian Markov partial correlation network was used to investigate intervention-associated changes in metabolite-metabolite and metabolite-clinical parameter (insulin, hormones) dependency relationships.

June 5, 2013 | Categories: Uncategorized | Tags: biochemical network, chemical similarity network, chemical translations, Cytoscape, multivariate, network, R

Tutorials Covering Biological Data Analysis Strategies

I’ve posted two new tutorials focused on intermediate and advanced strategies for biological, and specifically metabolomic, data analysis (click titles for pdfs).


May 29, 2013 | Categories: Uncategorized | Tags: ANCOVA, chemical similarity network, classification, climate, correlation network, covariate adjustment, data analysis, data visualization, Gaussian graphical Markov metabolic network, imDEV, metabolomics, network, PCA, PLS, PLS-DA, R, research, science, TeachingDemos, tutorial

Visualizing Sample Scores Trajectory for Repeated Measures

PLS-DA sample scores for a discrimination model identifying multivariate changes in metabolomic measurements before (pre) or after (post) some experimental manipulation.

Based on the scores plot for samples given changes in >300 biological parameters, it looks like there are two patterns of sample movement through this principal predictive plane. Few move in the direction capturing the most variance in the data matrix (x-axis, 31%), but the majority show an interaction between x and y (the second dimension explaining only 8%). Also, the most pre- or before-looking samples in the first dimension (142 and 77, farthest right) are the least changed post, or after, the experimental treatment.

February 15, 2013 | Categories: Uncategorized | Tags: network, PLS-DA, repeated measures, scores trajectory

Data analysis approaches to modeling changes in primary metabolism


February 1, 2013 | Categories: Uncategorized | Tags: chemical similarity network, Cytoscape, data visualization, Devium, imDEV, metabolomics, network, OGTT, PCA, PLS, R, statistics

Anaerobic Stress in Seeds – A Chemical Similarity Network Story

The chemical similarity network or CSN is a great tool for organizing biological data based on known biochemistry or chemical structural similarity. Here is an example CSN for visualizing metabolomic changes (measured via GC/TOF) due to anaerobic stress in germinating seeds.

In this network edges are formed for chemical similarity scores > 75. Node color describes significant (adjusted p-value < 0.05, q-value = 0.05, paired t-test) increase (red), decrease (blue) or no change (gray) in anaerobic relative to aerobic treatments. Node size is inversely proportional to the test's p-value.
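The node color and size rules described above could be derived along these lines. This is a sketch on simulated data, not the seed data set: the metabolite names and effect sizes are made up, Benjamini-Hochberg is assumed as the adjustment method, and -log10(p) is used as one possible inverse mapping of p-value to node size.

```r
# Simulated paired measurements (hypothetical names and effects)
set.seed(42)
n_pairs <- 8
metabolites <- c("alanine", "lactate", "sucrose")
shift <- c(1.5, -1.5, 0)   # simulated anaerobic minus aerobic effects

aerobic   <- sapply(metabolites, function(m) rnorm(n_pairs, mean = 10))
anaerobic <- aerobic + sapply(shift, function(s) rnorm(n_pairs, mean = s, sd = 0.5))

# Paired t-test per metabolite, then false discovery rate adjustment
p <- sapply(metabolites, function(m) {
  t.test(anaerobic[, m], aerobic[, m], paired = TRUE)$p.value
})
p_adj  <- p.adjust(p, method = "BH")   # assumption: Benjamini-Hochberg
change <- colMeans(anaerobic - aerobic)

# Node attribute table for import alongside the edge list
nodes <- data.frame(
  id = metabolites,
  color = ifelse(p_adj >= 0.05, "gray", ifelse(change > 0, "red", "blue")),
  size = -log10(p),   # smaller p-value, larger node
  stringsAsFactors = FALSE
)
nodes
```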

This CSN was not hard to construct and minimally requires knowledge of analyte PubChem chemical identifiers (CIDs). CIDs can be used to calculate the chemical similarity matrix using online tools provided by PubChem. This symmetric matrix can be easily formatted to create an edge list containing the basic information: source, target and similarity score.

Here is a function for converting square symmetric matrices to edge lists using the R statistical programming environment.

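mat.to.edge.list<-function(mat)
{

 #accessory function: index pairs for every off-diagonal element of one triangle
 all.pairs<-function(r,type="one")
   {
    switch(type,
    one = list(first = rep(1:r,rep(r,r))[lower.tri(diag(r))],
    second = rep(1:r, r)[lower.tri(diag(r))]),
    two = list(first = rep(1:r, r)[lower.tri(diag(r))],
    second = rep(1:r,rep(r,r))[lower.tri(diag(r))]))
   }

 #row (first) and column (second) indices of the upper triangle
 ids<-all.pairs(ncol(mat))

 #one row per pair: source name, target name and matrix value
 tmp<-as.data.frame(do.call("rbind",lapply(seq_along(ids$first) ,function(i)
  {
   value<-mat[ids$first[i],ids$second[i]]
   name<-c(colnames(mat)[ids$first[i]],colnames(mat)[ids$second[i]])
   c(name,value)
  })),stringsAsFactors=FALSE)
 colnames(tmp)<-c("source","target","value")
 #c(name,value) coerces the value to character, so convert it back to numeric
 tmp$value<-as.numeric(tmp$value)
 return(tmp)
 }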

The function mat.to.edge.list will convert a square symmetric matrix to an edge list through the extraction of the upper triangle excluding the diagonal or self edges.
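The same upper-triangle extraction can also be written more compactly with upper.tri() and which(). The following is a hedged alternative sketch rather than the function above: upper_to_edge_list is a hypothetical name, the identifiers and scores in the toy matrix are made up, and the matrix is assumed to carry matching row and column names.

```r
# Hypothetical alternative: upper triangle of a symmetric matrix to an edge list
upper_to_edge_list <- function(mat) {
  idx <- which(upper.tri(mat), arr.ind = TRUE)   # excludes the diagonal
  data.frame(source = rownames(mat)[idx[, "row"]],
             target = colnames(mat)[idx[, "col"]],
             value = mat[idx],
             stringsAsFactors = FALSE)
}

# Toy similarity matrix with made-up scores
ids <- c("cid_1", "cid_2", "cid_3")
sim <- matrix(c(100,  80,  40,
                 80, 100,  76,
                 40,  76, 100), 3, 3, dimnames = list(ids, ids))

edges <- upper_to_edge_list(sim)
edges[edges$value > 75, ]   # keep only edges above a similarity cut-off
```

Filtering on the value column is where a similarity threshold, such as the > 75 used for the network above, would be applied.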

This edge list can now be visualized as a CSN using some software (see brief instructions here). I prefer to use Cytoscape for this. The edge list merely contains instructions for which vertices or nodes representing metabolites should be connected.

An additional node annotation or attribute table can also be imported into Cytoscape and used to alter the node properties based on statistical results.

December 31, 2012 | Categories: Uncategorized | Tags: chemical similarity network, Cytoscape, ExCytR, metabolomics, network, R

ExCytR Concept

The concept is to make a GUI to provide a static and dynamic linking between data and its network representations.

Static access will involve making networks based on data and metadata stored in some table or spreadsheet.

Dynamic control will provide interactive access to network construction and annotation properties.

Together, these will provide rapid generation of information rich networks, based on tests of internal data properties or from exogenous semantic knowledge. Here is an example of a network representation of a time course metabolomic experiment. This network is used to encode dependence between top parameters of a PLS-DA model discriminating between pre- and post-experimental interventions. Larger nodes show variables meeting the 5% significance cutoff (p < 0.05) for a mixed effects model to identify intervention related differences between unbalanced baseline and area under the curve for metabolite excursion measurements during an oral glucose tolerance test (OGTT). Node color signifies increase (red) or decrease (blue) in post- relative to pre-intervention average values. Node shape and outline display metabolite classification and presence in a PLS-DA model respectively. Node graphs, created in ggplot2, show box plots for pre- (red) and post-intervention (green) class distribution medians, upper and lower quartiles, and outliers.

The interactions between model parameters which exist only in pre-intervention samples are shown in the network below.

Connections are made between metabolites which have a non-zero partial correlation, extracted from a qpnetwork trimmed at a threshold where the numbers of nodes and edges are ~equal. In this network all edges meet the 5% significance level based on tests of Pearson's correlations.
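The trimming idea can be illustrated in base R by keeping only as many of the strongest edges as there are nodes. This is a loose sketch on simulated data: full-order partial correlations from the inverse covariance matrix are swapped in for the qpnetwork estimate, and the variable names and sizes are made up.

```r
# Sketch only: full-order partial correlations stand in for the qpnetwork
set.seed(3)
x <- matrix(rnorm(200 * 6), 200, 6, dimnames = list(NULL, paste0("met", 1:6)))
x[, 2] <- x[, 1] + rnorm(200, sd = 0.5)   # one built-in dependency

precision <- solve(cov(x))
pc <- -precision / sqrt(outer(diag(precision), diag(precision)))

# Rank candidate edges by absolute partial correlation
strength <- sort(abs(pc[upper.tri(pc)]), decreasing = TRUE)

# Cut-off at the n-th strongest edge for n nodes
cutoff <- strength[ncol(x)]
adjacency <- abs(pc) >= cutoff & upper.tri(pc)
sum(adjacency)   # number of retained edges
```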