Methods and instructions to generate the figures in the paper "A communal catalogue reveals Earth’s multiscale microbial diversity" (Thompson et al., Nature, 2017). This paper describes a meta-analysis of the first 97 studies subjected to 16S rRNA amplicon sequencing through the Earth Microbiome Project, which constitute EMP 16S Release 1.
Source data (ready-to-plot) for the main figures -- and notebooks for isolating source data from other processed files -- are in methods/figure-data.
Raw and processed data files for generating the figures are found in several places:
- All data files (except sequences) required to generate the figures are in data and/or available from ftp://ftp.microbio.me/emp/release1. FTP directory contents are listed in data/ftp_contents.txt.
- For archival purposes, sample metadata, observation tables and information (trees and taxonomies), alpha- and beta-diversity results, and observation summaries for trading cards are archived at Zenodo with DOI 10.5281/zenodo.890000.
- Sequence files are available directly from EBI (see below).
- The mapping file (metadata) for analyses, unless otherwise noted, is emp_qiime_mapping_qc_filtered.tsv in data/mapping_files.
Code and notebooks for data processing and figure generation are in the top-level directory code, as described below.
Samples spanning diverse environments and capturing spatial, temporal, and/or physicochemical covariation were solicited from the global scientific community. Sample processing -- DNA extraction and sequencing -- was organized by sample type: soil samples at LBNL, water samples at ANL, and fecal and other samples at CU Boulder. The first 27,751 samples from 97 studies are contained in EMP Release 1.
This section describes the commands to download the raw sequence data from EBI and perform OTU picking using different methods, including generating phylogenetic trees and taxonomies for reference sequences, if necessary.
Per-study sequence files can be downloaded directly from EBI using scripts in code/download-sequences:
- download_ebi_fasta.sh (FASTA)
- download_ebi_fastq.sh (FASTQ)
FASTA sequences are used by the steps below. The sequences from EBI were demultiplexed and minimally quality filtered using the QIIME 1 command split_libraries_fastq.py with a Phred quality threshold of 3 and default parameters.
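The quality-filtering step can be sketched as truncating each read at the first low-quality base. This is a deliberately simplified stand-in for split_libraries_fastq.py, which also enforces limits on low-quality runs and on the minimum fraction of the read retained:

```python
# Simplified sketch of Phred-based read truncation, in the spirit of
# QIIME 1's split_libraries_fastq.py with a quality threshold of 3.
# The real script has more options (maximum bad-run length, minimum
# per-read fraction retained); this shows only the core idea.

def truncate_read(seq, quals, threshold=3):
    """Truncate a read at the first base whose Phred score is <= threshold."""
    for i, q in enumerate(quals):
        if q <= threshold:
            return seq[:i]
    return seq

# Bases from position 5 onward fall below the threshold and are trimmed.
print(truncate_read("ACGTACGTAC", [30, 30, 28, 25, 20, 2, 2, 2, 2, 2]))  # ACGTA
```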
Any adapter contamination was removed using the code in code/02-sequence-processing:
adaptor_cleanup.ipynb
Four separate OTU picking procedures were run on the EMP Release 1 data: de novo using Deblur, closed-reference using Greengenes 13.8, closed-reference using Silva 123, and open-reference using Greengenes 13.8.
Deblur code for picking sOTUs (a.k.a. tag sequences or amplicon sequence variants) is in code/03-otu-picking-trees/deblur:
- run_deblur_emp_original.sh (pre-release version used for this meta-analysis)
- run_deblur_emp_new.sh (using the published Deblur version)
Code and documentation for Deblur can be found on GitHub, and the manuscript describing Deblur is available from mSystems.
Closed-reference OTU picking against Greengenes 13.8 (97% OTUs) was done using the QIIME 1 script pick_closed_reference_otus.py with code in code/03-otu-picking-trees/closed-ref:
closed_reference_otu_picking.ipynb
Closed-reference OTU picking against Silva 123 16S (97% OTUs) was done using the QIIME 1 script pick_closed_reference_otus.py with code in code/03-otu-picking-trees/closed-ref:
closed_reference_otu_picking.ipynb
Open-reference OTU picking against Greengenes 13.8 was done using the QIIME 1 script pick_open_reference_otus.py with code in code/03-otu-picking-trees/open-ref:
open_reference_otu_picking.ipynb
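The difference between the closed- and open-reference strategies above can be illustrated with a toy sketch. Exact string matching stands in for the 97%-identity uclust search that the QIIME scripts actually perform, and the OTU identifiers are illustrative:

```python
# Toy illustration (NOT QIIME's uclust-based implementation) of closed-
# vs open-reference OTU picking: closed-reference discards reads that
# fail to match the reference, while open-reference clusters them de
# novo instead. Real picking uses 97% identity, not exact matching.

def pick_otus(reads, reference, open_reference=False):
    table = {}          # OTU id -> count
    unmatched = []
    for read in reads:
        otu = reference.get(read)   # exact match stands in for 97% identity
        if otu is not None:
            table[otu] = table.get(otu, 0) + 1
        else:
            unmatched.append(read)
    if open_reference:
        # de novo "clustering" of unmatched reads (here: one OTU per unique read)
        for read in unmatched:
            otu = "new." + read
            table[otu] = table.get(otu, 0) + 1
        unmatched = []
    return table, unmatched

ref = {"ACGT": "OTU_1", "TTGG": "OTU_2"}
closed, dropped = pick_otus(["ACGT", "ACGT", "GGCC"], ref)
print(closed, dropped)   # {'OTU_1': 2} ['GGCC']
```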
Deblur sequences were inserted into the Greengenes reference tree using SEPP. The code for this method is in code/03-otu-picking-trees/deblur:
run_sepp.sh
The output trees are available on the FTP site and the Zenodo archive:
- ftp://ftp.microbio.me/emp/release1/otu_info/deblur/
- https://zenodo.org/record/890000/files/emp_observation_info_deblur.tar.gz
The reference tree for Greengenes 13.8 (97% OTUs) was 97_otus.tree, available from the EMP FTP site and the Greengenes FTP site.
The reference tree for Silva 123 16S (97% OTUs) was 97_otus.tre, available from the EMP FTP site and the Silva website.
Reference sequences from the open-reference OTU picking were aligned using PyNAST and the tree built using FastTree.
Deblur and OTU tables were rarefied (subsampled) to generate equal numbers of observations (sequences) per sample, used in many of the analyses as described below. Deblur tables were rarefied to 5000 observations per sample, and reference-based OTU tables were rarefied to 10000 observations per sample, each using the QIIME 1 script single_rarefaction.py.
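Rarefaction can be sketched as sampling without replacement from the pooled observations of each sample, mirroring what single_rarefaction.py does; samples below the target depth are discarded:

```python
import random

# Sketch of rarefaction (subsampling without replacement) to a fixed
# depth, as done by QIIME 1's single_rarefaction.py. Samples with fewer
# total observations than the depth are dropped.

def rarefy(counts, depth, seed=0):
    """counts: dict of observation -> count for one sample."""
    pool = [obs for obs, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None                        # sample too shallow; discarded
    rng = random.Random(seed)
    subsample = rng.sample(pool, depth)    # draw without replacement
    rarefied = {}
    for obs in subsample:
        rarefied[obs] = rarefied.get(obs, 0) + 1
    return rarefied

sample = {"otu1": 600, "otu2": 300, "otu3": 100}
r = rarefy(sample, 500)
print(sum(r.values()))   # 500
```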
Deblur/OTU tables were subset to generate tables with more even representation across sample types and studies, used in many of the analyses as described below. Subsetting of the tables is accomplished by running the following IPython notebooks in code/04-subsets-prevalence:
- summarize_observation_counts.ipynb
- subset_samples_by_empo_and_study.ipynb
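The even-subsetting idea amounts to capping the number of samples per (EMPO level 3, study) combination so that no single environment or study dominates. A minimal sketch; the column names (empo_3, study_id) follow the EMP mapping file, while the cap and the sample tuples are illustrative:

```python
import random
from collections import defaultdict

# Sketch of even subsampling: keep at most n samples per (EMPO level 3,
# study) combination. Sample tuples and the cap value are illustrative.

def subset_evenly(samples, n_per_group=2, seed=0):
    """samples: iterable of (sample_id, empo_3, study_id) tuples."""
    groups = defaultdict(list)
    for sample_id, empo_3, study_id in samples:
        groups[(empo_3, study_id)].append(sample_id)
    rng = random.Random(seed)
    chosen = []
    for members in groups.values():
        rng.shuffle(members)                  # random choice within each group
        chosen.extend(members[:n_per_group])
    return sorted(chosen)

samples = [(f"s{i}", "Soil (non-saline)", "study_1") for i in range(5)]
samples.append(("t1", "Water (saline)", "study_2"))
print(len(subset_evenly(samples)))   # 3 (two soil samples + one water sample)
```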
This section describes how metadata were refined to enable interoperability and meta-analysis. QIIME mapping files were downloaded from Qiita and refined to fix errors, standardize formatting, and add fields specific for this meta-analysis.
Three IPython notebooks are provided to curate study metadata (step 1), refine sample metadata (step 2), and generate new sample information files for those studies to upload to Qiita (step 3). These notebooks are located in code/01-metadata:
- metadata_refine_step1_studies.ipynb
- metadata_refine_step2_samples.ipynb
- metadata_refine_step3_qiita.ipynb
A notebook for looking up higher (less semantically granular) terms for a given ENVO biome term (climbing up the hierarchy), which is incorporated in step 2 above, is in code/01-metadata:
envo_hierarchy_lookup.ipynb
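The lookup amounts to walking a child-to-parent map upward until the root is reached. A minimal sketch with illustrative ENVO terms (the notebook derives the real hierarchy from the ENVO ontology itself):

```python
# Sketch of climbing an ENVO hierarchy: given a child -> parent mapping,
# walk upward to collect increasingly general terms. The terms below are
# illustrative placeholders, not the notebook's actual data source.

def ancestors(term, parent_of):
    out = []
    while term in parent_of:
        term = parent_of[term]
        out.append(term)
    return out

parent_of = {
    "freshwater lake biome": "freshwater biome",
    "freshwater biome": "aquatic biome",
    "aquatic biome": "biome",
}
print(ancestors("freshwater lake biome", parent_of))
# ['freshwater biome', 'aquatic biome', 'biome']
```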
A metadata template generator, for future metadata collection, is provided as a Python script and a Markdown version of an IPython notebook in code/01-metadata:
- metadata_template_generator.py
- metadata_template_generator.md
This section describes the code for generating each figure from the observation and metadata tables produced above. Figures are numbered as in Thompson et al. (2017).
The Sankey diagram was generated from the mapping file column empo_3 using the Google Charts Sankey diagram. Code for this version of the Sankey diagram is linked (note: text labels can be turned on by changing line 61):
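The input such a Sankey chart expects, rows of (source, target, weight), can be derived from the EMPO hierarchy: each sample contributes one empo_1 → empo_2 edge and one empo_2 → empo_3 edge. A sketch with illustrative sample values:

```python
from collections import Counter

# Sketch of preparing Sankey rows (source, target, count) from the EMPO
# hierarchy. The sample tuples below are illustrative, not EMP data.

def sankey_rows(samples):
    edges = Counter()
    for empo_1, empo_2, empo_3 in samples:
        edges[(empo_1, empo_2)] += 1
        edges[(empo_2, empo_3)] += 1
    return [(src, dst, n) for (src, dst), n in sorted(edges.items())]

samples = [
    ("Free-living", "Non-saline", "Water (non-saline)"),
    ("Free-living", "Non-saline", "Soil (non-saline)"),
    ("Host-associated", "Animal", "Animal distal gut"),
]
for row in sankey_rows(samples):
    print(row)
```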
The world map was generated using the Python Basemap package from mapping file columns latitude_deg and longitude_deg using an IPython notebook in code/01-metadata:
map_samples_by_empo.ipynb
Alpha-diversity code is contained in code/05-alpha-diversity. Alpha-diversity metrics for the Deblur 90-bp table (QC-filtered subset) were computed using a script in code/05-alpha-diversity:
alpha_diversity.py
The results for the Deblur 90-bp table rarefied to 5000 sequences per sample were added to the mapping file as the columns adiv_observed_otus, adiv_chao1, adiv_shannon, and adiv_faith_pd.
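Three of these per-sample metrics can be computed directly from a vector of observation counts. A minimal sketch; Faith's PD is omitted because it additionally requires the phylogenetic tree, and Shannon entropy is reported in bits (log base 2), matching QIIME 1's default:

```python
import math
from collections import Counter

# Sketch of three alpha-diversity metrics from one sample's counts:
# observed OTUs, bias-corrected Chao1, and Shannon entropy (bits).
# Faith's PD is omitted (it needs the phylogenetic tree).

def alpha_metrics(counts):
    counts = [n for n in counts if n > 0]
    total = sum(counts)
    observed = len(counts)                                  # adiv_observed_otus
    shannon = -sum((n / total) * math.log2(n / total) for n in counts)
    freq = Counter(counts)                                  # abundance frequencies
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    chao1 = observed + f1 * (f1 - 1) / (2 * (f2 + 1))       # bias-corrected Chao1
    return observed, chao1, shannon

print(alpha_metrics([10, 10, 1, 1, 2]))
```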
Alpha-diversity boxplots were generated using an IPython notebook:
alpha_diversity_boxplots.ipynb
Scatter plots and maximum likelihood estimation (MLE) fits of alpha-diversity versus pH and temperature were generated using an IPython notebook:
mle_curve_fits.ipynb
Additional code for macroecological analyses of the EMP data is at https://github.com/klocey/emp_macroeco.
Beta-diversity code is contained in code/06-beta-diversity. Code for performing 'Snappy' UniFrac and principal coordinates analysis on the EMP Deblur 90-bp biom table (QC-filtered subset, rarefied to 5000 observations per sample) is in a Markdown file:
unifrac_and_principal_coordinates.md
Code for estimating average 16S rRNA copy number is in code/07-env-effects-corr:
rrna_copy_number_analysis.ipynb
A GitHub repository that can be used to easily replicate the results is located here:
As noted in the repository, the data files that were used are posted at this Dropbox link:
For consistency with the file naming conventions that were used, the biom tables in Dropbox are renamed as follows (they have also each been rarefied to 5000 reads):
emp_deblur_90bp.subset_2k.rare_5000.biom -> Global.Global2000Subset.Bacteria.EMP.biom
otu_subset.emp_deblur_90bp.subset_2k.lt_1.0_pc_samp.biom -> Global.Global2000Subset.BacteriaSubset1.EMP.biom
otu_subset.emp_deblur_90bp.subset_2k.lt_5.0_pc_samp.biom -> Global.Global2000Subset.BacteriaSubset5.EMP.biom
otu_subset.emp_deblur_90bp.subset_2k.lt_10.0_pc_samp.biom -> Global.Global2000Subset.BacteriaSubset10.EMP.biom
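A minimal sketch of applying that renaming, with the filenames taken from the table above (the directory argument and helper name are illustrative, not part of the repository):

```python
import os

# Sketch of the renaming step: map each original biom filename to the
# convention used by the nestedness repository, renaming only files
# that are actually present in the given directory.

RENAMES = {
    "emp_deblur_90bp.subset_2k.rare_5000.biom":
        "Global.Global2000Subset.Bacteria.EMP.biom",
    "otu_subset.emp_deblur_90bp.subset_2k.lt_1.0_pc_samp.biom":
        "Global.Global2000Subset.BacteriaSubset1.EMP.biom",
    "otu_subset.emp_deblur_90bp.subset_2k.lt_5.0_pc_samp.biom":
        "Global.Global2000Subset.BacteriaSubset5.EMP.biom",
    "otu_subset.emp_deblur_90bp.subset_2k.lt_10.0_pc_samp.biom":
        "Global.Global2000Subset.BacteriaSubset10.EMP.biom",
}

def apply_renames(directory, renames=RENAMES):
    renamed = []
    for old, new in renames.items():
        src = os.path.join(directory, old)
        if os.path.exists(src):
            os.rename(src, os.path.join(directory, new))
            renamed.append(new)
    return renamed
```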
Code for plotting the nestedness binary heatmaps and NODF statistics is in code/08-cooccurrence-nestedness:
- nestedness_binary_heatmaps.ipynb
- nestedness_nodf_plots.ipynb
Code for generating plots of EMPO level 3 distribution for genera and tag sequences is in code/09-specificity-entropy:
entropy_environment_by_taxon.ipynb
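The specificity measure behind these plots, the Shannon entropy of a taxon's distribution across EMPO level 3 environments, can be sketched as follows; low entropy means the taxon is concentrated in few environments. Taxon names and counts below are illustrative:

```python
import math
from collections import defaultdict

# Sketch of environment-specificity scoring: Shannon entropy (bits) of
# each taxon's abundance distribution across EMPO level 3 environments.

def environment_entropy(observations):
    """observations: list of (taxon, empo_3, count) tuples."""
    per_taxon = defaultdict(lambda: defaultdict(float))
    for taxon, env, count in observations:
        per_taxon[taxon][env] += count
    entropies = {}
    for taxon, envs in per_taxon.items():
        total = sum(envs.values())
        entropies[taxon] = -sum(
            (n / total) * math.log2(n / total) for n in envs.values() if n
        )
    return entropies

obs = [("g__Bacillus", "Soil (non-saline)", 50),
       ("g__Bacillus", "Water (non-saline)", 50),
       ("g__Vibrio", "Water (saline)", 100)]
print(environment_entropy(obs))   # the specialist g__Vibrio has entropy 0.0
```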
Code for generating plots of entropy of EMPO level 3 distribution vs. taxonomic level and branch length is in code/09-specificity-entropy:
R_entropy_plots
The pairplot of physicochemical metadata was generated using an IPython notebook in code/01-metadata:
physicochemical_pairplot.ipynb
The histogram of median sequence length after trimming (output of split_libraries.py, i.e., sequences downloaded from EBI) was generated using an IPython notebook in code/02-sequence-processing:
sequence_length.ipynb
Alpha-diversity boxplots were generated using an IPython notebook in code/05-alpha-diversity:
alpha_diversity_boxplots.ipynb
Beta-diversity code is contained in code/06-beta-diversity (see above).
Alpha-diversity histograms were generated using an IPython notebook in code/05-alpha-diversity:
alpha_diversity_90bp_100bp_150bp.ipynb
Beta-diversity Procrustes code and notebook are in code/06-beta-diversity:
procrustes_90_150.ipynb
See section 3.11.2 for code to summarize sequence/OTU distributions.
Code for generating prevalence plots is in code/04-subsets-prevalence:
otu_prevalence.ipynb
3.9 Environmental effect sizes, sample classification, and correlation patterns (Extended Data Fig. 5)
Code for calculating and plotting effect sizes is in code/07-env-effects-corr:
- clean_map_emp.py
- effect_size_rda.ipynb
- effect_size_main.ipynb
Code for carrying out random forest analysis and SourceTracker2 analysis, and plotting results, is in code/06-beta-diversity:
- random-forest/rf_confusion_matrix.ipynb
- sourcetracker_mapping_file_and_execution.ipynb
- sourcetracker_mixing_proportions.ipynb
Code for investigations of alpha-diversity covariation with latitude is in code/07-env-effects-corr:
latitudinal_diversity.ipynb
Code for plotting the NODF statistics with the most prevalent sequences removed, and for alternate 2000-sample subsets, is in code/08-cooccurrence-nestedness:
nestedness_otu_subsets.ipynb
Code for generating random subsets of samples evenly distributed by environment and study, plus a summary figure, is in code/04-subsets-prevalence:
- summarize_observation_counts.ipynb (summarize the number of sequences/OTUs for each sample)
- subset_samples_by_empo_and_study.ipynb (randomly subsample the samples by environment and study)
IPython notebooks for generating sequence/OTU distribution statistics, LaTeX macros, and charts for EMP Trading Cards are in code/10-sequence-lookup:
- summarize_otu_distributions.ipynb (generate summary statistics for each sequence/OTU)
- otu_entropy.ipynb (determine which sequences have the most skewed environment distributions)
- otu_trading_cards.ipynb (generate LaTeX macros and charts for trading cards)
Python code and TeX and HTML markup for rendering EMP Trading Cards as PDFs or webpages are provided in these folders:
- trading-card-latex
- trading-card-html
The distribution of any Deblur sequence, as a function of any metadata variable, can be plotted using this notebook in code/10-sequence-lookup:
otu_scatter_plots.ipynb
Deblur sequence utility scripts are also in code/10-sequence-lookup:
- get_v4_region.py (extract the V4 region of an input 16S sequence, specifying the desired length, e.g., to match 90-bp or 150-bp Deblur sequences)
- verify_amplicon_type.py (guess the amplicon type, e.g., 16S V4, using the first few basepairs of a multi-FASTA file)
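A toy sketch of the V4-extraction idea behind get_v4_region.py, assuming exact primer matching: locate the forward primer site and slice a fixed length downstream. The real script must also handle degenerate primer bases and reverse complements, and the primer string below is an illustrative non-degenerate stand-in:

```python
# Toy sketch of V4 extraction: find the (non-degenerate, illustrative)
# forward-primer site in a 16S sequence and take a fixed-length slice
# downstream, e.g., 90 bp to match 90-bp Deblur sequences.

def extract_v4(seq, primer, length=90):
    i = seq.find(primer)
    if i == -1:
        return None                       # primer site not found
    start = i + len(primer)
    return seq[start:start + length]

toy_seq = "TTTT" + "GTGCCAGC" + "ACGTACGTAA"
print(extract_v4(toy_seq, "GTGCCAGC", length=8))  # ACGTACGT
```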










