Tutorial: applying strainFlye to the "SheepGut" dataset¶
This tutorial will walk you through the various commands strainFlye offers.
0. Following this tutorial¶
Depending on your dataset and goals, you could either walk through the entirety of this document or skip around a bit. You can "jump to" many steps of the pipeline without trouble; for example, if you already have a BCF file of single-nucleotide mutations, you may prefer to skip all of the naïve mutation calling / FDR estimation steps and jump to the phasing analyses.
Note that many of the steps of this pipeline can take a while to run for large datasets like SheepGut. (In my experience: for the full SheepGut dataset on our cluster, "a while" usually means somewhere around 24 hours.) Smaller datasets and/or faster machines should of course decrease this duration. Section 5 of this tutorial ("Optional: Filter the FASTA file in order to focus on certain contigs") discusses how you can subset your dataset, after the align step, to certain contigs of interest—this can also speed things up.
1. The two main strainFlye inputs¶
The strainFlye pipeline takes as input two main types of data:
A set of reads (in FASTA / FASTQ format).
A set of contigs (in FASTA format) assembled from these reads.
This is all that is needed if you are starting the pipeline at its beginning (with align). Later pipeline steps will require other inputs that can be produced by intermediate pipeline steps.
1.1. Details about these types of inputs¶
Regarding reads: We designed strainFlye in the context of PacBio Circular Consensus Sequencing (CCS) "HiFi" reads (Wenger & Peluso et al., 2019). However, in theory it should still work with other reasonably long and accurate reads.
Regarding contigs: We don't impose any restriction on the assembler you use to construct these. We have tested strainFlye on metaFlye (Kolmogorov et al., 2020) and hifiasm-meta (Feng et al., 2022) output, but strainFlye should in theory work with the outputs of other assemblers.
1.2. The SheepGut dataset¶
We'll be applying strainFlye to the SheepGut dataset that is shown in the strainFlye paper. (This dataset has also been described in Kolmogorov et al., 2020 and Bickhart & Kolmogorov et al., 2022.) Please see the strainFlye paper's "Data sets" section for details about acquiring reads and contigs for the SheepGut dataset.
Note that the "contigs" we use for the SheepGut dataset really correspond to edge sequences in the assembly_graph.gfa file produced by metaFlye. These edge sequences may be slightly different from the contigs / scaffolds in the assembly.fasta file produced by metaFlye: see Flye's manual for more information. (You could use either type of sequence with strainFlye, although I personally recommend using edges: it's useful to have context about where exactly in the assembly graph a sequence is, and things like gaps in scaffolds [represented by Ns] will cause strainFlye to complain.)
2. Introduction¶
Let's take care of a few things before the tutorial starts.
2.1. Installing strainFlye¶
Before following along with this tutorial, we assume that you have already installed strainFlye. Please see strainFlye's README for installation instructions.
2.2. What commands are available through strainFlye?¶
!strainFlye
Usage: strainFlye [OPTIONS] COMMAND [ARGS]...

  Pipeline for the analysis of rare mutations in metagenomes.

  Please consult https://github.com/fedarko/strainFlye if you have any
  questions, comments, etc. about strainFlye. Thank you for using this tool!

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  align   Align reads to contigs, and filter the resulting alignment.
  call    [+] Call mutations in contigs naïvely & compute diversity indices.
  fdr     [+] Estimate and fix FDRs for contigs' naïve mutation calls.
  spot    [+] Identify putative mutational hotspots or coldspots.
  smooth  [+] Create and assemble smoothed and virtual reads.
  link    [+] Create link graphs showing co-occurring alleles.
  matrix  [+] Create codon and amino acid mutation matrices.
  dynam   [+] Compute simple information about growth dynamics.
  utils   [+] Miscellaneous utility commands provided with strainFlye.
2.3. Importing and configuring some utilities¶
You shouldn't need to do much (if any) programming to use strainFlye's commands; that said, we will be using Python to help with a few small tasks throughout this tutorial, and you will probably need to do some programming to plot the results we produce. (This philosophy was inspired partially by Schloss 2020.) We'll import some useful Python packages here to reduce clutter later on in this notebook.
(If you prefer, you could of course use another language instead of Python.)
import time
import skbio
import pandas as pd
import matplotlib
from matplotlib import pyplot
# Make the plots look nice -- this is the style we use in the paper
pyplot.style.use("ggplot")
3. Convert the assembly graph GFA file to a FASTA file of contigs¶
You can skip this step if: you already have a FASTA file describing contigs in your assembly graph.
Our assembly graph (the GFA file) contains the sequences of the contigs that we will use in many downstream analyses, but we'll need to have a FASTA file that just describes these contigs' sequences (independent of the assembly graph topology).
There are some bash one-liners you can use to convert a GFA 1 file to a FASTA file, but strainFlye also provides a utility command (strainFlye utils gfa-to-fasta) to do this for you. We'll use this here. (Our solution may be a bit slower than a bash one-liner, but it performs some useful sanity checking on the GFA file.)
!strainFlye utils gfa-to-fasta \
--graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
--output-fasta /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
Using strainFlye version "0.2.0-dev".
--------
utils gfa-to-fasta @ 0.00s: Starting...
Input GFA file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
Output FASTA file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
--------
utils gfa-to-fasta @ 16.41s: Done. Output FASTA file contains 78,793 sequences.
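For reference, the core of this GFA-to-FASTA conversion is simple enough to sketch in pure Python. This is only an illustration of the idea (reading segment sequences from GFA "S" lines); it is not strainFlye's actual implementation, which also performs sanity checks on the file:

```python
def gfa_to_fasta(gfa_lines):
    """Convert GFA 1 "S" (segment) lines to a FASTA-formatted string.

    Segments whose sequence is "*" (i.e. not stored in the GFA file) are
    skipped. Unlike strainFlye utils gfa-to-fasta, this sketch performs no
    validation of the input.
    """
    chunks = []
    for line in gfa_lines:
        if line.startswith("S\t"):
            fields = line.rstrip("\n").split("\t")
            name, seq = fields[1], fields[2]
            if seq != "*":
                chunks.append(f">{name}\n{seq}\n")
    return "".join(chunks)
```

You could apply this to a file object directly (e.g. `gfa_to_fasta(open("graph.gfa"))`), but for real datasets the dedicated command, or a bash one-liner, is preferable.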
4. Align reads to contigs; filter the resulting alignment¶
You can skip this step if: you already have a BAM file representing an alignment of reads to contigs, and this BAM file does not contain secondary alignments / partially-mapped reads / overlapping supplementary alignments (these all may cause problems in downstream analyses).
We'll need to align reads back to these contigs. The resulting alignment, and/or the mutations that we call from it, will be used in pretty much all downstream steps—so it's important to make sure that it is of good quality!
The strainFlye align command uses minimap2 to perform alignment, and then does some extra filtering on the resulting alignment.
Note that this command, in particular, may take a while to run. Sequence alignment is computationally expensive! On our cluster, strainFlye align ran on the full SheepGut dataset in 62,941.21 seconds (aka about 17.5 hours).
!strainFlye align
Usage: strainFlye align [OPTIONS] READS...
Align reads to contigs, and filter the resulting alignment.
Files of reads should be in the FASTA or FASTQ formats; GZIP'd files are
allowed.
This command involves multiple steps, including:
1) Align reads to contigs (using minimap2) to generate a SAM file
2) Convert this SAM file to a sorted and indexed BAM file
3) Filter overlapping supplementary alignments (OSAs) from this BAM file
4) Filter partially-mapped reads from this BAM file
Note that we only sort the alignment file once, although we do re-index it
after the two filtering steps. This decision is motivated by
https://www.biostars.org/p/131333/#131335.
Options:
-c, --contigs PATH FASTA file of contigs to which reads will be
aligned. [required]
-g, --graph PATH GFA 1-formatted file describing an assembly
graph of the contigs. This is used in the
"filter partially-mapped reads" step to make
the filter less strict for reads mapped to
adjacent contigs in the graph. This isn't
required; if it isn't passed, adjacent
contigs will not be considered in this
filter. [default: (no graph)]
-p, --minimap2-params TEXT Additional parameters to pass to minimap2,
besides the contig and read information.
Depending on the size of your dataset and
the amount of memory your system has, you
may want to adjust the -I parameter; see
minimap2's documentation for details. Please
note that we do not perform any validation
on this string before passing it to minimap2
(so if you are allowing users to run
strainFlye through a web server, be careful
about shell injection). [default: -ax asm20
--secondary=no -I 8g --MD]
-o, --output-dir DIRECTORY Directory to which an output BAM file
(final.bam) and BAM index file will be
written. Some temporary files will also be
written to this directory. [required]
--rm-tmp-bam / --no-rm-tmp-bam Remove temporary (before filtering, and
after just filtering OSAs) BAM files, to
reduce space requirements. [default: rm-
tmp-bam]
--verbose / --no-verbose Display extra details during alignment
filtering. [default: no-verbose]
-h, --help Show this message and exit.
# Reads file(s) are specified after all of the other parameters to this command
!strainFlye align \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
--graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/alignment \
/Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz \
/Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/*.fasta.gz
Using strainFlye version "0.2.0-dev".
--------
align @ 0.00s: Starting...
Input file(s) of reads:
1. /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz
2. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz
3. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz
4. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz
5. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz
6. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz
7. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz
8. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz
9. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
Input graph file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
minimap2 parameters: -ax asm20 --secondary=no -I 8g --MD
Remove temporary BAM files?: Yes
Verbose?: No
Output directory: /Poppy/mfedarko/sftests/tutorial-output/alignment
--------
align @ 0.00s: Loading and checking contig information...
align @ 19.89s: Looks good. The FASTA file describes 78,793 contigs.
--------
align @ 19.89s: Sanity-checking that the assembly graph and FASTA file describe the same contigs (based on names and lengths)...
align @ 38.12s: Everything looks good so far.
--------
align @ 38.12s: Running minimap2 --> samtools view --> samtools sort...
align @ 38.12s: Exact command we're running: minimap2 -ax asm20 --secondary=no -I 8g --MD /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz | samtools view -b - | samtools sort - -o /Poppy/mfedarko/sftests/tutorial-output/alignment/sorted-unfiltered.bam
[M::mm_idx_gen::67.042*1.68] collected minimizers
[M::mm_idx_gen::84.530*1.94] sorted minimizers
[M::main::84.583*1.94] loaded/built the index for 78793 target sequence(s)
[M::mm_mapopt_update::104.604*1.76] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 78793
[M::mm_idx_stat::113.809*1.70] distinct minimizers: 466400347 (83.22% are singletons); average occurrences: 1.330; average spacing: 5.502
[M::worker_pipeline::234.045*2.35] mapped 60083 sequences
[[Note from Marcus: trimming down minimap2 outputs here to reduce notebook length...]]
[M::worker_pipeline::41791.167*3.09] mapped 46553 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax asm20 --secondary=no -I 8g --MD /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz
[M::main] Real time: 41792.733 sec; CPU: 129283.952 sec; Peak RSS: 30.432 GB
[bam_sort_core] merging from 526 files and 1 in-memory blocks...
align @ 45,699.81s: Done running minimap2 --> samtools view --> samtools sort.
--------
align @ 45,699.81s: Indexing the sorted BAM...
align @ 46,270.42s: Done indexing the sorted BAM.
--------
align @ 46,270.42s: Filtering reads with overlapping supplementary alignments (OSAs)...
align @ 46,938.50s: Done with pass 1 of the OSA filter; moving on to pass 2...
align @ 50,284.20s: Done filtering reads with overlapping supplementary alignments.
--------
align @ 50,284.24s: Indexing the OSA-filtered BAM...
align @ 50,868.70s: Done indexing the OSA-filtered BAM.
--------
align @ 50,868.70s: Removing the before-filtering BAM and its index to save space...
align @ 50,876.99s: Done removing these files.
--------
align @ 50,876.99s: Filtering partially-mapped reads...
align @ 50,877.11s: Loading assembly graph...
align @ 50,894.56s: Loaded assembly graph.
align @ 56,868.92s: Done filtering out partially-mapped reads.
--------
align @ 56,870.69s: Indexing the final BAM...
align @ 57,332.50s: Done indexing the final BAM.
--------
align @ 57,332.50s: Removing the just-OSA-filtered BAM and its index to save space...
align @ 57,341.75s: Done removing these files.
--------
align @ 57,341.76s: Done.
This generates a BAM file (final.bam) and BAM index file (final.bam.bai) in the specified output directory.
We can use this BAM file for many analyses downstream—the first of these will be mutation calling.
5. Optional: Filter the FASTA file in order to focus on certain contigs¶
We just aligned our dataset's reads against all contigs in the assembly graph. This is standard practice (see, e.g., this tutorial); aligning reads against all contigs probably yields a more accurate alignment than aligning them against only a subset of these contigs (although proving whether this is "best practice" is a challenging question, and one that I will sidestep right now).
However, now that we have this alignment, we don't necessarily need to perform mutation calling, phasing, etc. on all contigs in our dataset (although, if you want to, you could!). This is because—with the exception of FDR estimation and fixing—the rest of the steps in strainFlye's pipeline treat each contig in isolation. So, we can focus on a smaller set of individual contigs we specifically care about in order to speed things up.
We're going to do the following:
- For naïve mutation calling, FDR estimation, and FDR fixing, we'll just focus on the "long" contigs in this dataset (defining a contig as "long" if its length is at least 1 Mbp, aka 1,000,000 bp).
In theory, these long contigs represent putative metagenome-assembled genomes (MAGs).
The motivation behind still focusing on a relatively large set of contigs at this step is that, when we perform FDR estimation, we'll need to select a decoy contig. (So it is useful to still have a "pool" of many potential decoy contigs to pick from.) It's also interesting to be able to plot FDR curves for many target contigs.
- Beginning with Section 8 ("Phasing analyses"), we'll focus on an even smaller set of just three long contigs shown throughout the strainFlye paper: CAMP, BACT1, and BACT2 (corresponding to edge_6104, edge_1671, and edge_2358, respectively).
- Many of these downstream steps take a while to run and produce a lot of data, so focusing on a small number of contigs helps keep this tutorial (relatively) quick.
Of course, these criteria are not hard requirements you need to follow in your own research! Maybe you'd prefer to specifically focus on contigs with high coverages, or maybe on contigs with good CheckM completeness or contamination values. Or maybe you'd like to keep considering all contigs in the full dataset! Your decision should depend on your goals and your dataset.
In any case, how do we "focus on" certain contigs? We can filter our FASTA file to a subset of contigs present in the full dataset, and use this filtered FASTA file as input to strainFlye's commands. As an example of this, we will use Python (in particular, the scikit-bio library) to create two such filtered FASTA files: one including just the "long" contigs, and one including just the CAMP, BACT1, and BACT2 contigs. (I'm aware that this is a bit inefficient, sorry!)
# (This uses scikit-bio; see http://scikit-bio.org/ for more details.)
input_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta"
# Produce a filtered FASTA file containing only contigs >= 1 Mbp long
long_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta"
len_threshold = 1000000
# ... and another containing just the three "selected" contigs from the paper
three_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta"
t0 = time.time()
num_long_contigs = 0
num_selected_contigs = 0
with open(long_contigs_fp, "w") as lf:
    with open(three_contigs_fp, "w") as tf:
        for contig in skbio.io.read(input_contigs_fp, format="fasta", constructor=skbio.DNA):
            if len(contig) >= len_threshold:
                skbio.io.write(contig, format="fasta", into=lf)
                num_long_contigs += 1
            # All of these three contigs are also >= 1 Mbp long, but let's be
            # paranoid and future-proof this anyway
            if contig.metadata["id"] in ["edge_6104", "edge_1671", "edge_2358"]:
                skbio.io.write(contig, format="fasta", into=tf)
                num_selected_contigs += 1
t1 = time.time()
print(f"Found {num_long_contigs:,} contigs with lengths \u2265 {len_threshold:,} bp. Took {t1 - t0:,.2f} sec.")
assert num_selected_contigs == 3, f"Number of selected contigs was {num_selected_contigs:,}?"
print("Found all 3 of the selected contigs (CAMP, BACT1, BACT2).")
Found 468 contigs with lengths ≥ 1,000,000 bp. Took 12.52 sec.
Found all 3 of the selected contigs (CAMP, BACT1, BACT2).
The remainder of this tutorial will focus on these 468 long contigs.
In any case, now we can get on to some more interesting stuff!
6. Perform naïve mutation calling, then estimate and fix mutation calls' FDRs¶
You can skip this step if: You already have a BCF file describing single-nucleotide, non-multi-allelic mutations in your contigs.
The analyses downstream of this step (hotspot/coldspot identification, phasing) take as input a set of identified single-nucleotide mutations (or, if you prefer to use different terminology, "called variants", "called SNVs", ...) in which we have some confidence. How does strainFlye identify these mutations?
There are a few steps (as our paper describes). First, we will naïvely call mutations using a simple threshold-based method (referred to as "NaiveFreq" in the paper). We can then estimate the false-discovery rates (FDRs) of the mutations called for each contig using the target-decoy approach. If desired, we can then adjust the called mutations to fix the estimated FDRs of these mutation calls below a specified threshold.
6.1. $p$-mutations and $r$-mutations?¶
So, our first step will be performing this simple threshold-based calling. What do we mean by "threshold" here?
strainFlye supports calling two basic types of mutations: $p$-mutations and $r$-mutations. The docs explain the difference between these two types best:
!strainFlye call
Usage: strainFlye call [OPTIONS] COMMAND [ARGS]...
[+] Call mutations in contigs naïvely & compute diversity indices.
Consider a position "pos" in a contig. Using the alignment, we can count how
many reads have a (mis)match operation to "pos" with one of the four
nucleotides (A, C, G, T; we ignore degenerate nucleotides in reads). We
represent these four nucleotides' counts at pos as follows:
N1 = # reads of the most-common aligned nucleotide at pos,
N2 = # reads of the second-most-common aligned nucleotide at pos,
N3 = # reads of the third-most-common aligned nucleotide at pos,
N4 = # reads of the fourth-most-common aligned nucleotide at pos.
(We break ties arbitrarily.)
strainFlye supports two types of naïve mutation calling based on these
counts: p-mutations and r-mutations. These are described below.
p-mutations (naïve percentage-based mutation calling)
-----------------------------------------------------
This takes as input some percentage p in the range (0%, 50%]. Define
freq(pos) = N2 / (N1 + N2 + N3 + N4). This value, which is inherently
constrained to the range [0%, 50%], is an estimate of the mutation frequency
of this position. We classify pos as a p-mutation if freq(pos) ≥ p, AND if
N2 ≥ the --min-alt-pos parameter (an integer representing a minimum number
of reads that must support the alternate nucleotide).
r-mutations (naïve read-count-based mutation calling)
-----------------------------------------------------
This takes as input some integer r > 0. We classify pos as an r-mutation if
N2 ≥ r.
Diversity indices
-----------------
Later on in the pipeline, if we perform FDR estimation on these called
mutations using the "strainFlye fdr" module, we will need to select a "decoy
contig" with relatively few mutations. A contig with low diversity indices
may represent a promising decoy contig; so, for the sake of convenience,
both commands output this information.
Options:
-h, --help Show this message and exit.
Commands:
p-mutation Call p-mutations and compute diversity indices.
r-mutation Call r-mutations and compute diversity indices.
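In plain Python, the two definitions above boil down to the following. (This is a restatement of the help text, not strainFlye's actual code; `min_p` here is a fraction in (0, 0.5], e.g. 0.0015 for $p = 0.15\%$.)

```python
def is_p_mutation(n1, n2, n3, n4, min_p, min_alt_pos=2):
    """Naive percentage-based call: freq(pos) = N2 / (N1+N2+N3+N4) must be
    >= p, AND N2 must be supported by at least min_alt_pos reads."""
    total = n1 + n2 + n3 + n4
    if total == 0:
        return False
    return (n2 / total) >= min_p and n2 >= min_alt_pos

def is_r_mutation(n2, min_r):
    """Naive read-count-based call: N2 >= r."""
    return n2 >= min_r
```

For example, a position covered by 100 reads of the most-common nucleotide and 5 reads of the second-most-common nucleotide has freq(pos) ≈ 4.7%, so it would be called as a $p$-mutation for any $p$ up to that frequency.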
6.2. Understanding these (sub)commands¶
First off, note that strainFlye call doesn't do anything besides show help info if you run it by itself. This is because, unlike strainFlye align, strainFlye call has two subcommands: p-mutation and r-mutation. Which of these you use will depend on how you want to naïvely call mutations. You can invoke one of these subcommands by writing out the full chain of commands: for example, strainFlye call p-mutation.
6.2.1. Input and output¶
Probably the most important parameter at this step is the minimum threshold. Both of these subcommands, strainFlye call p-mutation and strainFlye call r-mutation, take as input a minimum version of their corresponding threshold (either --min-p or --min-r).
These commands each output:
A BCF (binary variant call format) file describing all mutations called naïvely across the contigs, based on the minimum $p$ or $r$ threshold set (--min-p or --min-r).
A TSV (tab separated values) file describing the contigs' computed diversity indices, for various values of $p$ or $r$ (configurable using the --div-index-p-list or --div-index-r-list parameters).
- Long story short, diversity indices indicate how many of a contig's "sufficiently-covered" positions have called mutations: in general, higher diversity indices imply higher mutation rates.
- If enough positions in a contig are not "sufficiently-covered," then we will not compute the diversity index for this contig. Please see the Supplemental Material "Diversity index details" in the strainFlye paper for details.
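To make the "sufficiently-covered" idea concrete, here is a rough sketch of a diversity index computation. It follows the description above (a position counts as sufficiently covered if its coverage is at least min-read-number / $p$), but it simplifies the paper's exact rule for when a contig has too few sufficiently-covered positions:

```python
def diversity_index(coverages, is_mutated, p, min_read_number=5):
    """Fraction of sufficiently-covered positions with called p-mutations.

    coverages[i] is the (mis)match coverage at position i of a contig;
    is_mutated[i] says whether a p-mutation was called there. p is a
    fraction in (0, 0.5]. This is an illustrative sketch, not strainFlye's
    implementation.
    """
    min_cov = min_read_number / p
    covered_calls = [m for c, m in zip(coverages, is_mutated) if c >= min_cov]
    if not covered_calls:
        # Simplified stand-in for "we will not compute the diversity index"
        return None
    return sum(covered_calls) / len(covered_calls)
```

For example, with $p = 5\%$ and the default min-read-number of 5, a position needs coverage of at least 100 to count towards the index.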
6.2.2. Interpreting the output¶
The default minimum value of $p$ (or $r$) used in these commands is fairly low. As you might expect, using such a low threshold for calling a position as a mutation may yield many false positives: we will almost certainly identify many real mutations, but also many "false" mutations that happen to occur as the result of sequencing errors, alignment errors, etc. Viewed another way, the false discovery rate (FDR) of the mutation calls generated at this step (defined as the ratio of false positives to the total number of positive calls, true and false) will probably be high—although this depends on a variety of factors, including which contig(s) we are calling mutations in, what our goals are in the first place, etc.
FDR estimation and fixing¶
After we run this command, we can use strainFlye's FDR estimation and fixing functionality to attempt to address this problem. This process involves adjusting the "minimum" value of $p$ (or $r$) used for each contig to reduce the FDR as needed.
Depending on your dataset and your goals, you may or may not want to do this: as of writing, many of the analyses in the strainFlye paper don't perform FDR fixing on their input mutations, and instead just make use of "un-fixed" naïvely called $p$-mutations. (That being said, this is mostly a historical artifact of us implementing the FDR fixing code close to the end of the project.)
In general, if you are going to perform FDR estimation / fixing, we recommend only doing this for $p$-mutations (and not $r$-mutations); this is discussed in the strainFlye paper, in the Supplemental Material section named "Identifying mutations based solely on read counts."
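Conceptually, FDR fixing raises each target contig's threshold until the estimated FDR drops below the desired limit. The toy sketch below illustrates that idea only; it is not strainFlye's implementation, and the `fdr_by_p` input (a hypothetical mapping from candidate $p$ values to estimated FDRs for one target contig) is something strainFlye computes for you:

```python
def fixed_threshold(fdr_by_p, max_fdr):
    """Return the smallest candidate p whose estimated FDR is <= max_fdr,
    or None if no candidate threshold achieves the desired FDR."""
    for p in sorted(fdr_by_p):
        if fdr_by_p[p] <= max_fdr:
            return p
    return None
```

So, a contig whose estimated FDR at $p = 0.15\%$ is too high might still have its mutation calls kept at some higher threshold.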
6.3. Naïvely call $p$-mutations ($p = 0.15\%$) and compute diversity indices for various values of $p$¶
Now that we know what we're doing, we're ready to call mutations and compute diversity indices! We'll do $p$-mutation calling at a minimum $p$ of $0.15\%$, which matches what we used for Figure 2 in the paper. The default diversity index values of $p$ (ranging from $0.5\%$ to $50\%$) should be good for us.
Like alignment, this command will also take a while—we need to check each position in the alignment for each of the input contigs. If you'd like, you can use the --verbose flag to display some extra information while this command is running, to make the wait more tolerable (and assure you that it isn't frozen somewhere).
!strainFlye call p-mutation
Usage: strainFlye call p-mutation [OPTIONS]
Call p-mutations and compute diversity indices.
The primary parameter for this command is the lower bound of p, defined by
--min-p. The BCF output will include "mutations" for all positions that pass
this (likely very low) threshold; this BCF can be filtered using the
utilities contained in the "strainFlye fdr" module.
Options:
-c, --contigs PATH FASTA file of contigs in which to naïvely
call mutations. All contigs in this FASTA
file should also be contained in the BAM
file; it's ok if the BAM file contains
contigs not in this FASTA file (we'll ignore
them). [required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--min-p INTEGER RANGE Minimum value of p for which to call
p-mutations. This is scaled up by 100 (i.e.
the default of 50 corresponds to 50 / 100 =
0.5%) in order to bypass floating-point
precision issues. [default: 50; 0<x<=5000]
--min-alt-pos INTEGER RANGE In order for us to call a p-mutation at a
position, this position's alternate
nucleotide must be supported by at least
this many reads. [default: 2; x>=1]
--div-index-p-list TEXT List of values of p for which we'll compute
diversity indices. These should all be
separated by commas; and, as with --min-p,
these are scaled up by 100. Please don't use
commas as thousands separators. [default:
50,100,200,500,1000,2500,5000]
-m, --min-read-number FLOAT RANGE
Parameter that impacts the minimum
(mis)match coverage needed in order to
consider "counting" a position / mutation
towards the diversity index. Given a value
of p (converted to the range (0, 0.5]), a
position must have a coverage of at least
(--min-read-number / p) in order to be
"sufficiently covered" and thus counted
towards the diversity index. [default: 5;
x>0]
-o, --output-dir DIRECTORY Directory to which an output BCF file
(describing the called mutations), BCF index
file, and diversity index TSV file will be
written. Some temporary files will also be
written to this directory. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
!strainFlye call p-mutation \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--min-p 15 \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/call-p15
6.4. Estimating FDRs using the target-decoy approach¶
We now have both our initial mutation calls (which, as we've discussed, probably have a high FDR) and information about our contigs' diversity indices. We will use the target-decoy approach to attempt to estimate and thus control the FDR of our mutation calls. This is done by the strainFlye fdr estimate and strainFlye fdr fix commands.
As discussed in our paper, we can select one of our $C$ contigs as a decoy contig (a.k.a. a decoy genome), and compute a mutation rate for it ($\text{rate}_{\text{decoy}}$). For each of the other $C - 1$ target contigs, we can then estimate the FDR of the mutations identified in that contig as $\dfrac{\text{rate}_{\text{decoy}}}{\text{rate}_{\text{target}}}$.
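In code, this estimate is just a ratio of mutation rates. (Here each rate is approximated as the number of called mutations divided by contig length; strainFlye's actual computation is more careful about, e.g., which positions are considered.)

```python
def estimated_fdr(decoy_mutations, decoy_length, target_mutations, target_length):
    """Target-decoy FDR estimate: rate_decoy / rate_target.

    Each rate is approximated as (# called mutations) / (contig length).
    """
    rate_decoy = decoy_mutations / decoy_length
    rate_target = target_mutations / target_length
    return rate_decoy / rate_target
```

For example, a decoy with 10 called mutations per Mbp, against a target with 100 called mutations per Mbp, gives an estimated FDR of 10%.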
6.4.1. Manually selecting a decoy contig¶
If you'd like, we could go through the diversity indices produced by strainFlye call p-mutation ourselves, in an attempt to select a reasonable-seeming decoy contig. This notebook demonstrates this sort of process.
6.4.2. Letting strainFlye fdr estimate automatically select a decoy contig¶
The optional notebook discussed above shows that edge_6104 is probably a good decoy contig, so we could, if desired, just pass it to strainFlye fdr estimate using that command's -dc or --decoy-contig option. However, to illustrate another option, we'll instead pass our diversity index TSV file to strainFlye fdr estimate and let it do the job of selecting a decoy contig. (Spoiler alert: it'll select edge_6104 anyway.)
The Methods section of our paper provides details about how the automatic decoy contig selection algorithm works. If you'd prefer, you can also go through the exact source code for it, which is reasonably well-documented: see the autoselect_decoy() function in this file.
!strainFlye fdr
Usage: strainFlye fdr [OPTIONS] COMMAND [ARGS]...

  [+] Estimate and fix FDRs for contigs' naïve mutation calls.

Options:
  -h, --help  Show this message and exit.

Commands:
  estimate  Estimate the FDRs of contigs' mutation calls.
  fix       Fix contigs' mutation calls' estimated FDRs to an upper limit.
!strainFlye fdr estimate
Usage: strainFlye fdr estimate [OPTIONS]
Estimate the FDRs of contigs' mutation calls.
We do this using the target-decoy approach (TDA). Given a set of C contigs,
we select a "decoy contig" with relatively few called mutations. We then
compute a mutation rate for this decoy contig, and use this mutation rate
(along with the mutation rates of the other C - 1 "target" contigs) to
estimate the FDRs of all of these target contigs' mutation calls.
We can produce multiple FDR estimates for a single target contig's calls by
varying the p or r threshold used (from the --min-p or --min-r threshold
used to generate the input BCF file, up to the --high-p or --high-r
threshold given here). Using this information (and information about the
numbers of mutations called per megabase), we can plot an FDR curve for a
given target contig's mutation calls.
This command accepts an input BCF file of p- or r-mutations; however, in
general we recommend using p-mutations (rather than r-mutations) for FDR
estimation / fixing. Please see the Supplemental Material section of the
strainFlye paper named "Identifying mutations based solely on read counts"
for details.
Options:
-c, --contigs PATH FASTA file of contigs for which we will
estimate mutation calls' FDRs. All contigs
in this FASTA file should also be contained
in the BAM and BCF files. It's ok if the BAM
file contains contigs not in this FASTA file
(we'll ignore them), but all contigs
described in the BCF file's header must also
be in this FASTA file. [required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing naïvely called
p- or r-mutations in the FASTA file's
contigs, produced by "strainFlye call".
[required]
-di, --diversity-indices PATH TSV file describing the diversity indices of
a set of contigs. Used to automatically
select a decoy contig. This option is
mutually exclusive with --decoy-contig.
[default: (nothing)]
-dc, --decoy-contig TEXT Name of a specific contig to use as the
decoy contig for FDR estimation. This option
is mutually exclusive with --diversity-
indices. [default: (nothing)]
-dctx, --decoy-contexts [Full|CP2|Tv|Nonsyn|Nonsense|CP2Tv|CP2Nonsyn|CP2Nonsense|TvNonsyn|TvNonsense|CP2TvNonsense|Everything]
"Context-dependent" types of positions
and/or mutations to which the computation of
mutation rates in the decoy contig will be
limited. The "Full" option will consider the
entire decoy contig; all other options will
limit the computation to certain types of
positions and/or potential mutations. (CP2
will focus on positions in the second codon
position of genes predicted in the decoy
contig using Prodigal; Nonsyn will focus on
potential nonsynonymous mutations in these
genes; Nonsense will focus on potential
nonsense mutations in these genes; and Tv
will focus on potential transversion
mutations.) You can specify this option
multiple times to generate multiple sets of
FDR estimates. (You can also specify
"Everything" to use all available contexts.)
[default: CP2; required]
-hp, --high-p INTEGER RANGE (Only applies if the input BCF file
describes p-mutations.) p-mutations with a
mutation rate (freq(pos)) greater than or
equal to this are considered "indisputable,"
and will not be included in FDR estimation.
Like the values of p used as parameters to
"strainFlye call p-mutation", this is in the
range (0, 5000], such that h = N corresponds
to (N / 100)%. Corresponds to the
"highFrequency" threshold mentioned in the
paper. [default: 500; 0<x<=5000]
-hr, --high-r INTEGER RANGE (Only applies if the input BCF file
describes r-mutations.) r-mutations with an
alternate nucleotide read coverage greater
than or equal to this are considered
indisputable, and will not be included in
FDR estimation. [default: 100; x>0]
-dml, --decoy-min-length INTEGER RANGE
Minimum length of a potential decoy contig
(in bp). Only used if --diversity-indices is
specified. [default: 1000000; x>0]
-dmac, --decoy-min-average-coverage FLOAT RANGE
Minimum average (mis)match coverage of a
potential decoy contig. Only used if
--diversity-indices is specified. [default:
500; x>0]
-o, --output-dir DIRECTORY Directory to which TSV files describing the
estimated FDRs and numbers of naïvely called
mutations per megabase will be written. In
all TSV files, rows correspond to target
contigs and columns correspond to p or r
values. We will generate one estimated FDR
TSV file for every time --decoy-context is
specified, and just one number-of-naïvely-
called-mutations-per-megabase file.
[required]
-h, --help Show this message and exit.
!strainFlye fdr estimate \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
--diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
--decoy-contexts Everything \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info
Using strainFlye version "0.1.0-dev".
--------
fdr estimate @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input diversity indices file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv
Manually-set decoy contig: None
Decoy contig context-dependent position / mutation type(s): ('Everything',)
High p threshold (only used if the BCF describes p-mutations): 500
High r threshold (only used if the BCF describes r-mutations): 100
Minimum length of a potential decoy contig (only used if diversity indices are specified): 1,000,000 bp
Minimum average coverage of a potential decoy contig (only used if diversity indices are specified): 500.0x
Output directory: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info
--------
fdr estimate @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
fdr estimate @ 5.42s: The FASTA file describes 468 contig(s).
fdr estimate @ 5.50s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
fdr estimate @ 19.55s: The FASTA file's contig(s) and BCF file's contig(s) match.
fdr estimate @ 19.55s: Also, the input BCF file describes p-mutations (minimum p = 15).
fdr estimate @ 19.56s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
fdr estimate @ 19.56s: So far, these files seem good.
--------
fdr estimate @ 19.56s: Selecting a decoy contig based on the diversity indices...
fdr estimate @ 19.58s: Selected edge_6104 as the decoy contig.
fdr estimate @ 19.58s: Verified that this decoy contig is present in the FASTA, BAM, and BCF files.
fdr estimate @ 19.58s: (Sorry for doubting you.)
--------
fdr estimate @ 19.58s: Determining range of value(s) of p to consider...
fdr estimate @ 19.58s: We'll consider 485 value(s) of p: from 15 to 499.
fdr estimate @ 19.58s: p-mutations for p ≥ 500 will be considered "indisputable."
fdr estimate @ 19.58s: These "indisputable" mutations won't be included in the FDR estimation results.
--------
fdr estimate @ 19.58s: Computing mutation rates for edge_6104 at these threshold values, for each of the 11 decoy context(s)...
fdr estimate @ 21.67s: At least one of the decoy context(s) requires us to have gene predictions for edge_6104; running Prodigal...
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...1289244 bp seq created, 32.65 pct GC
Locating all potential starts and stops...30910 nodes
Looking for GC bias in different frames...frame bias scores: 2.68 0.19 0.12
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
-------------------------------------
Request: Single Genome, Phase: Gene Finding
Finding genes in sequence #1 (1289244 bp)...done!
fdr estimate @ 27.01s: Finished running Prodigal! It predicted 1,297 gene(s) in edge_6104.
fdr estimate @ 27.83s: At least one of the decoy context(s) requires us to know "unreasonable" positions (where reference and consensus disagree) in edge_6104; going through alignment...
fdr estimate @ 804.88s: Done: identified 0 unreasonable position(s) in edge_6104.
fdr estimate @ 804.88s: Computing mutation rates for decoy context "CP2" (1 / 11)...
fdr estimate @ 804.89s: Done computing "CP2" mutation rates for edge_6104.
fdr estimate @ 804.89s: Computing mutation rates for decoy context "CP2Nonsense" (2 / 11)...
fdr estimate @ 811.65s: Done computing "CP2Nonsense" mutation rates for edge_6104.
fdr estimate @ 811.65s: Computing mutation rates for decoy context "CP2Nonsyn" (3 / 11)...
fdr estimate @ 819.02s: Done computing "CP2Nonsyn" mutation rates for edge_6104.
fdr estimate @ 819.02s: Computing mutation rates for decoy context "CP2Tv" (4 / 11)...
fdr estimate @ 819.03s: Done computing "CP2Tv" mutation rates for edge_6104.
fdr estimate @ 819.03s: Computing mutation rates for decoy context "CP2TvNonsense" (5 / 11)...
fdr estimate @ 825.73s: Done computing "CP2TvNonsense" mutation rates for edge_6104.
fdr estimate @ 825.73s: Computing mutation rates for decoy context "Full" (6 / 11)...
fdr estimate @ 825.75s: Done computing "Full" mutation rates for edge_6104.
fdr estimate @ 825.75s: Computing mutation rates for decoy context "Nonsense" (7 / 11)...
fdr estimate @ 845.39s: Done computing "Nonsense" mutation rates for edge_6104.
fdr estimate @ 845.39s: Computing mutation rates for decoy context "Nonsyn" (8 / 11)...
fdr estimate @ 866.92s: Done computing "Nonsyn" mutation rates for edge_6104.
fdr estimate @ 866.92s: Computing mutation rates for decoy context "Tv" (9 / 11)...
fdr estimate @ 867.05s: Done computing "Tv" mutation rates for edge_6104.
fdr estimate @ 867.05s: Computing mutation rates for decoy context "TvNonsense" (10 / 11)...
fdr estimate @ 886.23s: Done computing "TvNonsense" mutation rates for edge_6104.
fdr estimate @ 886.23s: Computing mutation rates for decoy context "TvNonsyn" (11 / 11)...
fdr estimate @ 907.39s: Done computing "TvNonsyn" mutation rates for edge_6104.
fdr estimate @ 907.41s: Done computing mutation rates for all decoy context(s) for edge_6104.
--------
fdr estimate @ 907.41s: Computing mutation rates and FDR estimates for the 467 target contig(s)...
fdr estimate @ 941.67s: Done.
--------
fdr estimate @ 941.72s: Done.
Let's check on the TSV files that got written to the output directory. We should see one file for every decoy context, indicating the FDR estimates for each target contig for this context; and one lone "number of mutations per Mb" file, indicating the number of mutations per megabase for each target contig.
In general, we can plot these as FDR curves by using the FDR estimates as x-axis values and the "number of mutations per Mb" values as y-axis values.
!ls /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/
fdr-CP2Nonsense.tsv fdr-CP2Tv.tsv fdr-TvNonsense.tsv fdr-CP2Nonsyn.tsv fdr-Full.tsv fdr-TvNonsyn.tsv fdr-CP2.tsv fdr-Nonsense.tsv fdr-Tv.tsv fdr-CP2TvNonsense.tsv fdr-Nonsyn.tsv num-mutations-per-mb.tsv
6.5. Plotting FDR curves¶
This notebook (in the analysis code repository) demonstrates how we can plot some or all of these FDR estimates for our target contigs. See the screenshot below for an example figure, showing most of the decoy contexts for eight target contigs in the SheepGut dataset. (I put a decent amount of work into making those plots look fancy for the paper, but you don't need to do all this work unless you want to!)
Please note that this figure was created using a slightly older alignment (from before version 0.2.0 of strainFlye), so it will probably have slight differences from an analogous figure created using the latest version of strainFlye.
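If you just want a quick look without the notebook, here is a minimal sketch of pulling one contig's FDR curve out of these TSVs. It assumes the layout described by strainFlye fdr estimate's help text (rows are target contigs, columns are p values); the exact index column label may differ.

```python
import pandas as pd

def fdr_curve(fdr_tsv, mut_tsv, contig):
    """Return (x, y) lists for one target contig's FDR curve:
    x = estimated FDR at each p threshold, y = naively called mutations
    per megabase at the same thresholds. Assumes rows are target contigs
    and columns are p values, with contig names in the first column."""
    fdr = pd.read_csv(fdr_tsv, sep="\t", index_col=0)
    mut = pd.read_csv(mut_tsv, sep="\t", index_col=0)
    return fdr.loc[contig].tolist(), mut.loc[contig].tolist()

# Usage might look like (filenames taken from the listing above):
# x, y = fdr_curve("p15-fdr-info/fdr-Full.tsv",
#                  "p15-fdr-info/num-mutations-per-mb.tsv", "edge_1671")
# ...then plot with, e.g., matplotlib: plt.plot(x, y)
```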
6.6. Fixing mutation calls' FDRs to an upper limit of $1\%$¶
FDR estimates are nice (they lend some context to our naïve mutation calls), but we can take things a step further. strainFlye has the ability to fix the estimated FDR for each target contig's mutation calls to a specified upper limit.
Since we already have FDR curves showing how the estimated FDR for each target contig varies as $p$ varies, fixing the estimated FDR to an upper limit $F$ amounts to choosing, for each target contig, the "best" value of $p$ that yields an FDR ≤ $F$. The strainFlye fdr fix command takes care of this for us. (If you'd like more details on this process, see the Methods section of the strainFlye paper, and/or this unnecessarily fancy comment in the strainFlye code.)
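Per the fdr fix help text below, the "best" value is the lowest $p$ whose estimated FDR stays within the limit, since lower $p$ thresholds retain more mutation calls. The selection step for a single contig can be sketched as (a simplified illustration, not strainFlye's code):

```python
def optimal_p(fdr_by_p, max_fdr):
    """Given a mapping from p threshold to estimated FDR (both FDRs in %),
    return the lowest p whose estimated FDR is <= max_fdr, or None if no
    p qualifies. The lowest qualifying p is "best" because lower p
    thresholds retain more called mutations."""
    qualifying = [p for p, fdr in fdr_by_p.items() if fdr <= max_fdr]
    return min(qualifying) if qualifying else None

# Hypothetical FDR estimates for one contig, fixed at FDR <= 1%:
print(optimal_p({15: 4.2, 50: 1.8, 100: 0.9, 200: 0.4}, 1.0))  # 100
```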
We'll use the "CP2" FDR curve as our source of FDR estimates.
!strainFlye fdr fix
Usage: strainFlye fdr fix [OPTIONS]
Fix contigs' mutation calls' estimated FDRs to an upper limit.
This takes as input the estimated FDRs from "strainFlye fdr estimate" (if
you used multiple decoy contexts, then you will need to choose which set of
FDR estimates to use here) to guide us on how to fix the FDR for each
contig. Note that mutations that passed the "high" p or r threshold
specified for "strainFlye fdr estimate", and thus were not used for FDR
estimation, will all be included in the output BCF file from this command;
these mutations are considered "indisputable."
We include indisputable mutations from the decoy contig and from all target
contigs in our output BCF file. We will only consider including non-
indisputable mutations from the target contigs: the decision of which non-
indisputable mutations will be included is based on the lowest p or r
parameter for a target contig that yields an estimated FDR ≤ the fixed FDR
given here.
Options:
-b, --bcf PATH Indexed BCF file describing naïvely called p- or
r-mutations, produced by "strainFlye call".
[required]
-fi, --fdr-info FILE One of the estimated FDR TSV files produced by
"strainFlye fdr estimate". [required]
--fdr FLOAT RANGE False discovery rate at which identified mutations
will be fixed. This is interpreted as a
percentage: the default of 1 corresponds to an FDR
of 1%. We do not restrict this to an upper limit,
since estimated FDRs can technically exceed 100%.
[default: 1; x>0]
-o, --output-bcf FILE Filepath to which an output indexed BCF file
(describing the called mutations at the optimal p
or r value for each contig, along with any
"indisputable" mutations relative to the --high-p
or --high-r parameter used for strainFlye fdr
estimate) will be written. [required]
--verbose / --no-verbose Display extra details while writing the filtered
BCF. [default: no-verbose]
-h, --help Show this message and exit.
!strainFlye fdr fix \
--bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
--fdr-info /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/fdr-CP2.tsv \
--fdr 1 \
--output-bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Using strainFlye version "0.1.0-dev".
--------
fdr fix @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input FDR estimate file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/fdr-CP2.tsv
FDR at which to fix mutation calls: 1.00%
Verbose?: No
Output BCF file with mutation calls at the fixed FDR: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
--------
fdr fix @ 0.00s: Loading and checking BCF and TSV files...
fdr fix @ 17.34s: Looks good so far; decoy contig seems to be edge_6104.
fdr fix @ 17.34s: Looks like the cutoff for "indisputable" mutations was p = 500.
fdr fix @ 17.34s: All mutations passing this cutoff will be included in the output BCF file.
--------
fdr fix @ 17.34s: Based on the FDR information, finding optimal values of p for each contig...
fdr fix @ 17.36s: Done.
fdr fix @ 17.36s: For 239 / 467 contigs, there exist values of p (at least, considering the range from p = 15 to p = 499) that yield estimated FDRs ≤ 1.0%.
fdr fix @ 17.36s: These values range from p = 37 (edge_1300) to p = 475 (edge_1016).
fdr fix @ 17.36s: The mean of these values is p = 255.96.
--------
fdr fix @ 17.36s: Writing a filtered BCF file (to a temporary location, for now) including both (1) indisputable mutations from all contigs and (2) non-indisputable mutations from the target contigs that result in a FDR ≤ 1.00%...
fdr fix @ 32.55s: Done.
--------
fdr fix @ 32.55s: Updating the filtered BCF file's header to indicate that FDR fixing was done...
fdr fix @ 36.15s: Done.
--------
fdr fix @ 36.15s: Indexing the BCF file...
fdr fix @ 37.04s: Done indexing the BCF file.
--------
fdr fix @ 37.06s: Done.
It took us a few steps, but we have now generated a file (p15-fdr1pct.bcf) of $p$-mutation calls at a fixed (estimated) FDR of 1%.
Although our methodology has a few limitations (e.g. we don't support calling multi-allelic mutations yet), this BCF file can be used downstream for many types of analyses. In the next sections of the tutorial we'll demonstrate the additional commands supported by strainFlye, most of which make use of these mutation calls in some way.
7. Identify hotspots and coldspots¶
We've called mutations and estimated these calls' FDRs. Now we can get to the fun part: what's going on with these mutations?
Often, we're interested in analyzing where the mutations in a contig are located. Are there any particular "hotspot" regions where there are surprisingly many mutations? Are there any "coldspot" regions where there are, surprisingly, no or few mutations?
We've kept strainFlye's functionality for identifying these types of regions fairly minimal at the moment. Here we'll demonstrate identifying very basic hotspots and coldspots using strainFlye spot's commands.
!strainFlye spot
Usage: strainFlye spot [OPTIONS] COMMAND [ARGS]... [+]
Identify putative mutational hotspots or coldspots.
Many methods exist for identifying these sorts of hotspots or coldspots; strainFlye's implementations of these methods are intended mostly as a quick proof-of-concept for replicating the results shown in our paper, and are not extremely "feature-rich" quite yet.
Options:
-h, --help Show this message and exit.
Commands:
hot-features Identify hotspot features (for example, genes).
cold-gaps Identify long coldspot "gaps" without any mutations.
7.1. Identify hotspot features¶
!strainFlye spot hot-features
Usage: strainFlye spot hot-features [OPTIONS]
Identify hotspot features (for example, genes).
By "feature", we refer to a single continuous region within a contig, as
described in the file given for --features. These regions could describe
anything: predicted protein-coding genes, introns or exons, intergenic
regions of interest, etc. For now, we treat each feature independently (e.g.
we don't lump together exons from the same "Parent" gene; each feature is
considered separately as a potential "hotspot").
You can configure whether or not we classify a feature as a hotspot by
adjusting the --min-num-mutations and --min-perc-mutations parameters; at
least one of these parameters must be specified. If both parameters are
specified, then both checks (number of mutations in a feature, and
percentage of mutations in a feature) will need to pass in order for us to
label a feature as a hotspot.
Options:
-b, --bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-f, --features PATH Generic Feature Format version 3 (GFF3) file
describing "features" (e.g. predicted
protein-coding genes) in the contigs. We
will ignore features located on contigs that
are not described in the BCF file's header.
[required]
-mn, --min-num-mutations INTEGER RANGE
Label a feature as a "hotspot" if it
contains at least this many mutations.
[default: (no check); x>0]
-mp, --min-perc-mutations FLOAT RANGE
Label a feature as a "hotspot" if its
percentage of mutations ((# mutations /
feature length) × 100) is at least this
value. [default: (no check); 0<x<=100]
-o, --output-hotspots FILE Filepath to which an output tab-separated
values (TSV) file describing hotspot
features across all contigs will be written.
[required]
-h, --help Show this message and exit.
7.1.1. A note about "features"¶
Although we should be familiar with the FASTA and BCF input files by this point, the -f / --features input (a GFF3 file) may be surprising. strainFlye leaves the task of creating this file up to the user.
Predicted genes' coordinates are probably the most obvious type of "feature" for which we could look for hotspots. If you don't have gene predictions for your contigs yet, Prodigal is good (and should have been installed along with strainFlye, since it's used internally elsewhere in strainFlye). Here, we'll use it to predict protein-coding genes across all contigs.
It's important to note that Prodigal does not predict eukaryotic genes (i.e. genes that are split up into introns and exons). These genes will thus not be a perfect representation of all protein-coding genes in all contigs in the dataset, since we know that this sample does contain at least some eukaryotic genomes. (However, if you use another tool for predicting eukaryotic genes that produces GFF3 output, then these should also be usable as "features" for this command.)
Note that, now that we're done with FDR estimation/fixing, the rest of this tutorial will mostly focus on the three selected contigs from the paper (described in sheepgut_contigs_selected3.fasta, which we generated above). Because of this, we'll only bother running Prodigal on these three contigs (so we won't predict genes / identify hotspots in the other 465 long contigs).
# See https://github.com/hyattpd/Prodigal/wiki/cheat-sheet for details about these options.
#
# Note that, for the paper, I ran Prodigal in "normal" mode on certain contigs individually
# (https://github.com/fedarko/sheepgut/blob/main/inspect-seqs/prodigal.py), but here we just
# run Prodigal in "anonymous" mode on all contigs at once. The results should be fairly similar,
# although there'll probably be some differences.
!prodigal \
-i /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
-f gff \
-c \
-p meta \
-o /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Metagenomic, Phase: Training
Initializing training files...done!
-------------------------------------
Request: Metagenomic, Phase: Gene Finding
Finding genes in sequence #1 (2153394 bp)...done!
Finding genes in sequence #2 (2806161 bp)...done!
Finding genes in sequence #3 (1289244 bp)...done!
7.1.2. Running the command to identify hotspot features (in this case, gene predictions)¶
Now that we have our gene predictions, let's see whether any of them contain a lot of mutations. (Based on our findings in the paper, we know that these sorts of hotspots do exist in this dataset.)
strainFlye spot hot-features supports two types of basic thresholds for labelling a feature as a hotspot, --min-num-mutations and --min-perc-mutations. We'll use both here, and label a feature a "hotspot" if it meets both of the following criteria:
- it contains at least 5 mutations, and
- at least 2% of its positions have mutations.
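These two checks are simple enough to spell out directly. A minimal sketch of the criteria (an illustration only, not strainFlye's implementation):

```python
def is_hotspot(num_mutations, feature_length, min_num=5, min_perc=2.0):
    """A feature is labeled a hotspot only if it contains at least
    min_num mutations AND its percentage of mutated positions,
    (num_mutations / feature_length) * 100, is at least min_perc.
    (When only one option is given on the command line, strainFlye
    applies only that check.)"""
    perc_mutated = 100.0 * num_mutations / feature_length
    return num_mutations >= min_num and perc_mutated >= min_perc

# A 480 bp feature with 16 mutated positions (~3.33%) passes both checks:
print(is_hotspot(16, 480))  # True
```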
!strainFlye spot hot-features \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--features /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff \
--min-num-mutations 5 \
--min-perc-mutations 2 \
--output-hotspots /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv
Using strainFlye version "0.1.0-dev".
--------
spot hot-features @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Input feature file: /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff
Minimum number of mutations needed to call a feature a hotspot: 5
Minimum % of mutations needed to call a feature a hotspot: 2.0
Output file describing hotspot features: /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv
--------
spot hot-features @ 0.00s: Loading and checking the BCF file...
spot hot-features @ 12.41s: Looks good so far.
--------
spot hot-features @ 12.41s: Going through features in the GFF3 file and identifying hotspot features...
spot hot-features @ 12.64s: Identified 165 hotspot feature(s) across all 468 contigs in the BCF file.
--------
spot hot-features @ 12.64s: Writing out this information to a TSV file...
spot hot-features @ 12.64s: Done.
--------
spot hot-features @ 12.65s: Done.
The output of this command isn't anything special: it's a TSV file in which each row describes an identified hotspot feature, defining "hotspots" based on the --min-num-mutations and --min-perc-mutations options we set earlier. Let's load this file using pandas.read_csv() and get a brief sense of what it looks like:
hotspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv", sep="\t")
hotspots.head()
| | Contig | FeatureID | FeatureStart_1IndexedInclusive | FeatureEnd_1IndexedInclusive | NumberMutatedPositions | PercentMutatedPositions |
|---|---|---|---|---|---|---|
| 0 | edge_1671 | 1_77 | 67385 | 67864 | 16 | 3.33 |
| 1 | edge_1671 | 1_78 | 67943 | 68584 | 13 | 2.02 |
| 2 | edge_1671 | 1_81 | 70405 | 70773 | 44 | 11.92 |
| 3 | edge_1671 | 1_94 | 86530 | 88992 | 95 | 3.86 |
| 4 | edge_1671 | 1_96 | 89868 | 90809 | 57 | 6.05 |
Depending on your goals, you could focus on, for example, the hotspot genes with the highest mutation rates in a contig. We can find these for edge_1671 ("BACT1", as we name it in our paper) by filtering and then sorting the DataFrame:
# Ignore hotspots in CAMP / BACT2, and just focus on hotspots in BACT1
bact1_hotspots = hotspots[hotspots["Contig"] == "edge_1671"]
# Sort all the "hotspot genes" in BACT1 from high to low mutation rates (% of positions mutated).
# This is similar to the table of highly-mutated genes in the Supplemental Material of our paper,
# although unlike that table this uses FDR-fixed mutations (and it also uses different gene
# predictions, as discussed above).
bact1_hotspots.sort_values(["PercentMutatedPositions"], ascending=False)
| | Contig | FeatureID | FeatureStart_1IndexedInclusive | FeatureEnd_1IndexedInclusive | NumberMutatedPositions | PercentMutatedPositions |
|---|---|---|---|---|---|---|
| 47 | edge_1671 | 1_583 | 724871 | 725443 | 131 | 22.86 |
| 82 | edge_1671 | 1_860 | 1041656 | 1042084 | 86 | 20.05 |
| 118 | edge_1671 | 1_1217 | 1460034 | 1460204 | 34 | 19.88 |
| 14 | edge_1671 | 1_183 | 206606 | 207304 | 131 | 18.74 |
| 112 | edge_1671 | 1_1168 | 1402353 | 1402718 | 61 | 16.67 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | edge_1671 | 1_1427 | 1748001 | 1751735 | 76 | 2.03 |
| 67 | edge_1671 | 1_751 | 916664 | 918493 | 37 | 2.02 |
| 1 | edge_1671 | 1_78 | 67943 | 68584 | 13 | 2.02 |
| 16 | edge_1671 | 1_248 | 272253 | 273047 | 16 | 2.01 |
| 106 | edge_1671 | 1_1131 | 1361175 | 1361672 | 10 | 2.01 |
164 rows × 6 columns
7.2. Identify coldspot gaps¶
Similarly, strainFlye supports the identification of basic "coldspots"—here, defined as long gaps without any mutations. The main parameter is the minimum length needed to define a gap as a "coldspot." Let's test this out on the SheepGut dataset.
Note that the input to this command is just a BCF file—no contig needs to be given to it. So, this will implicitly find coldspots in all 468 long contigs in the SheepGut dataset described in the BCF file—not just the 3 selected contigs. That said, this command runs fairly quickly, even when considering many contigs, so it shouldn't be an issue.
!strainFlye spot cold-gaps
Usage: strainFlye spot cold-gaps [OPTIONS]
Identify long coldspot "gaps" without any mutations.
To clarify, we define a "gap" of length L on a contig as a collection of
continuous positions [N, N + 1, ... N + L - 2, N + L - 1] in which no
positions are mutations (based on the input BCF file).
If the --circular flag is specified, then we can loop around the contig from
right to left; otherwise, the left and right sides of the contig are hard
boundaries. To give an example of this, consider a 9-nucleotide contig that
has mutations at positions 4 and 6:
Mut. Mut.
1 2 3 4 5 6 7 8 9
If --circular is specified, then this contig has two gaps: one gap of length
1 (covering just position 5, between the two mutations), and another gap of
length 6 (starting at position 7 and looping around to position 3: [7, 8, 9,
1, 2, 3]).
If --circular is not specified, then this contig has three gaps: [1, 2, 3],
[5], and [7, 8, 9].
Options:
-b, --bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-l, --min-length INTEGER RANGE Label a gap between mutations in a contig as
a "coldspot" if the gap is at least this
long (in bp). [default: 5000; x>0;
required]
--circular / --no-circular If --circular is specified, we'll assume
that all contigs are circular: we'll
consider the gap "looping around" from the
rightmost mutation in each contig to the
leftmost mutation in the contig as a
potential coldspot. Otherwise, we will
assume all contigs are linear. [default:
no-circular]
--exact-pvals / --no-exact-pvals
If --exact-pvals is specified, we'll use the
method for computing exact longest-gap
p-values given in equation (3.1) of (Naus
1982). We use Python's decimal library with
default parameters to try to deal with large
numbers, but we can't guarantee that this
won't cause the program to fail in certain
cases. If --exact-pvals is not specified,
we'll use equation (3.3) of (Naus 1982),
which gives an approximation of the p-value.
[default: exact-pvals]
-o, --output-coldspots FILE Filepath to which an output tab-separated
values (TSV) file describing coldspots will
be written. The longest gap in each contig
will also be given a p-value, indicating the
probability of the longest gap being at
least this long (given the length of and
number of mutated positions in this contig,
and assuming that mutations occur with a
constant probability at any position in the
contig). See (Naus 1982) for details.
[required]
-h, --help Show this message and exit.
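The gap definition in the help text above is easy to sketch in Python. The function below is a simplified illustration (not strainFlye's implementation), and reproduces the 9-nucleotide worked example:

```python
def mutation_gaps(contig_length, mutated_positions, circular=False):
    """Return the lengths of maximal runs of consecutive non-mutated
    positions (1-indexed). With circular=True, the run after the last
    mutation wraps around to join the run before the first mutation."""
    muts = sorted(set(mutated_positions))
    if not muts:
        return [contig_length]
    gaps = []
    # gaps strictly between consecutive mutations
    for a, b in zip(muts, muts[1:]):
        if b - a > 1:
            gaps.append(b - a - 1)
    left = muts[0] - 1                # positions before the first mutation
    right = contig_length - muts[-1]  # positions after the last mutation
    if circular:
        if left + right > 0:
            gaps.append(left + right)  # the wrap-around gap
    else:
        if left > 0:
            gaps.append(left)
        if right > 0:
            gaps.append(right)
    return sorted(gaps)

# The 9-nucleotide example above, with mutations at positions 4 and 6:
print(mutation_gaps(9, [4, 6]))                 # [1, 3, 3] (linear)
print(mutation_gaps(9, [4, 6], circular=True))  # [1, 6]
```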
!strainFlye spot cold-gaps \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--output-coldspots /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv
Using strainFlye version "0.1.0-dev".
--------
spot cold-gaps @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Minimum coldspot gap length: 5,000 bp
Check for circular coldspot gaps?: No
Compute exact longest-gap p-values?: Yes
Output file describing coldspot gaps: /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv
--------
spot cold-gaps @ 0.00s: Loading and checking the BCF file...
spot cold-gaps @ 12.16s: Looks good so far.
--------
spot cold-gaps @ 12.16s: Going through contigs and identifying coldspot gaps...
spot cold-gaps @ 15.74s: Identified 16,154 coldspot gap(s) across all 468 contigs in the BCF file.
--------
spot cold-gaps @ 15.74s: Writing out this information to a TSV file...
spot cold-gaps @ 15.75s: Done.
--------
spot cold-gaps @ 15.76s: Done.
Again, cold-gaps will output a simple TSV file describing its identified coldspots:
coldspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv", sep="\t")
coldspots.head()
| | Contig | Start_1IndexedInclusive | End_1IndexedInclusive | Length | LongestGap_P_Value |
|---|---|---|---|---|---|
| 0 | edge_8 | 3492 | 10939 | 7448 | NaN |
| 1 | edge_8 | 10941 | 16941 | 6001 | NaN |
| 2 | edge_8 | 35630 | 45223 | 9594 | NaN |
| 3 | edge_8 | 45225 | 53014 | 7790 | NaN |
| 4 | edge_8 | 58003 | 63032 | 5030 | NaN |
There are many ways we could make use of this information—similarly to the hotspot example, we could try sorting these gaps from longest to shortest for the BACT1 contig. This is shown below. (Of the gaps that we identify, the longest one in each contig is assigned a p-value; please see the Supplemental Material of our paper for details on how we compute this, and the assumptions made.)
coldspots[coldspots["Contig"] == "edge_1671"].sort_values(["Length"], ascending=False)
| | Contig | Start_1IndexedInclusive | End_1IndexedInclusive | Length | LongestGap_P_Value |
|---|---|---|---|---|---|
| 2536 | edge_1671 | 1216892 | 1239536 | 22645 | 9.443778e-106 |
| 2537 | edge_1671 | 1618448 | 1637614 | 19167 | NaN |
| 2538 | edge_1671 | 1797912 | 1813581 | 15670 | NaN |
| 2534 | edge_1671 | 1194740 | 1207381 | 12642 | NaN |
| 2532 | edge_1671 | 979572 | 990495 | 10924 | NaN |
| 2530 | edge_1671 | 740701 | 750399 | 9699 | NaN |
| 2535 | edge_1671 | 1207881 | 1216890 | 9010 | NaN |
| 2539 | edge_1671 | 2140896 | 2147770 | 6875 | NaN |
| 2533 | edge_1671 | 1183003 | 1188753 | 5751 | NaN |
| 2540 | edge_1671 | 2147772 | 2153394 | 5623 | NaN |
| 2531 | edge_1671 | 974177 | 979570 | 5394 | NaN |
| 2529 | edge_1671 | 541220 | 546507 | 5288 | NaN |
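Since only the longest gap in each contig receives a p-value, it can also be handy to pull out just that row for every contig at once. Here is a small pandas sketch of this, using a toy stand-in DataFrame (with made-up subsets of the columns/rows above) rather than the full TSV:

```python
import pandas as pd

# Toy stand-in for the coldspots DataFrame; the real table comes from the
# TSV produced by "strainFlye spot cold-gaps".
coldspots = pd.DataFrame({
    "Contig": ["edge_8", "edge_8", "edge_1671", "edge_1671"],
    "Length": [7448, 6001, 22645, 5288],
})

# Sort all gaps from longest to shortest, then keep the first (i.e.
# longest) gap seen for each contig.
longest = (coldspots.sort_values("Length", ascending=False)
                    .groupby("Contig")
                    .head(1))
```

Applying the same `sort_values(...).groupby(...).head(1)` chain to the real coldspots DataFrame would give one row per contig, each corresponding to the gap that was assigned a p-value.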
8. Phasing analyses¶
8.1. Generating "smoothed haplotypes"¶
Given our called mutations, we can attempt to generate haplotypes that respect these mutations using strainFlye's smooth module.
The details and motivation for this are explained in depth in our paper. To briefly summarize: we will convert the reads aligned to each contig into "smoothed reads," which contain our called mutations and no other variation. Next, we will (optionally, depending on the --virtual-reads parameter) construct "virtual reads" to fill in low-coverage regions. Finally, we will assemble these reads using LJA to construct "smoothed haplotypes."
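To make the "smoothed read" idea concrete, here is a minimal Python sketch (an illustration only, not strainFlye's actual implementation, which operates on BAM alignments and also handles indels and alignments that it decides to ignore). A smoothed read matches the contig everywhere in its aligned span, except at called mutated positions, where the read's own base is kept:

```python
def smooth_read(contig_seq, aligned_pairs, mutated_positions):
    """Build a "smoothed" version of one linear alignment.

    contig_seq: contig sequence (string, 0-indexed).
    aligned_pairs: nonempty list of (contig_pos, read_base) tuples, one per
        aligned position, sorted by contig_pos. (Indels are ignored here
        for simplicity.)
    mutated_positions: set of 0-indexed positions with called mutations.
    """
    start = aligned_pairs[0][0]
    end = aligned_pairs[-1][0]
    # Start from the contig's own sequence over the aligned span...
    out = list(contig_seq[start:end + 1])
    # ...then restore the read's base only at called mutated positions.
    for pos, base in aligned_pairs:
        if pos in mutated_positions:
            out[pos - start] = base
    return "".join(out)
```

For example, a read aligned to positions 2–5 of a contig, differing from it at positions 3 and 4 but with only position 3 called as mutated, yields a smoothed read that keeps the read's allele at position 3 and the contig's base everywhere else.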
Note that these commands (both smooth create and smooth assemble) can take a while to run. You can add the --verbose flag to get more detailed logging information while these commands are running.
!strainFlye smooth
Usage: strainFlye smooth [OPTIONS] COMMAND [ARGS]...

  [+] Create and assemble smoothed and virtual reads.

Options:
  -h, --help  Show this message and exit.

Commands:
  create    Create smoothed and virtual reads for each contig.
  assemble  Assemble contigs' smoothed and virtual reads using LJA.
8.1.1. Creating smoothed and virtual reads¶
!strainFlye smooth create
Usage: strainFlye smooth create [OPTIONS]
Create smoothed and virtual reads for each contig.
Options:
-c, --contigs PATH FASTA file of contigs for which we will
create smoothed and virtual reads. All
contigs in this FASTA file should also be
contained in the BAM and BCF files; it's ok
if the BAM or BCF files contain contigs not
in this FASTA file (we'll ignore them).
[required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-di, --diversity-indices PATH TSV file describing the diversity indices of
a set of contigs, produced by one of the
"strainFlye call" commands. Only used if
--virtual-reads is specified. Along with
diversity indices, these files list contigs'
average coverages. If --virtual-reads is
specified, we will need to know contigs'
average coverages. So, if a diversity
indices file is provided here, then we can
avoid re-computing average coverages.
(Otherwise, if --virtual-reads is specified
but no diversity indices file is provided,
we'll need to compute average coverages;
this will take some extra time.) [default:
(nothing)]
--virtual-reads / --no-virtual-reads
If --virtual-reads is specified, we'll
create virtual reads covering "low-coverage"
regions in contigs. [default: virtual-
reads]
-vrp, --virtual-read-well-covered-perc FLOAT RANGE
Only used if --virtual-reads is specified.
In a contig with average coverage (based on
the BAM file, and only considering match +
mismatch counts) C, we will define a
position in this contig (with coverage P,
based only on the smoothed reads) as low-
coverage if ((P / C) × 100) is less than
this percentage. [default: 95; 0<=x<=100;
required]
-vrf, --virtual-read-flank INTEGER RANGE
Only used if --virtual-reads is specified.
When we add virtual reads spanning a single
continuous low-coverage region, these reads
will start and end this many positions
before and after the region. (For example,
the default of 100 means that the virtual
reads created for a low-coverage region of
5,000 bp will all be 5,200 bp.) [default:
100; x>=0; required]
-o, --output-dir DIRECTORY Directory to which smoothed reads (and
virtual reads, if --virtual-reads is
specified) will be written. Each contig's
reads will be written to a gzipped FASTA
file in this directory named
[contig].fasta.gz. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
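The low-coverage rule behind -vrp/--virtual-read-well-covered-perc can be sketched in a few lines of Python (a simplified illustration under the definitions in the help text above, not strainFlye's actual code):

```python
def find_low_coverage_runs(smoothed_cov, avg_cov, well_covered_perc=95.0):
    """Return (start, end) pairs (0-indexed, inclusive) for each run of
    consecutive "low-coverage" positions in a contig.

    A position with smoothed-read coverage P is low-coverage if
    (P / C) * 100 < well_covered_perc, where C (avg_cov) is the contig's
    average coverage based on the BAM file. Virtual reads would then be
    created spanning each run, extended by --virtual-read-flank positions
    on either side.
    """
    runs, start = [], None
    for i, p in enumerate(smoothed_cov):
        if (p / avg_cov) * 100 < well_covered_perc:
            if start is None:
                start = i  # a new low-coverage run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the current run just ended
            start = None
    if start is not None:
        runs.append((start, len(smoothed_cov) - 1))  # run reaches contig end
    return runs
```

With the default of 95%, a contig averaging 100x coverage would have every position with smoothed-read coverage below 95x flagged as low-coverage.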
!strainFlye smooth create \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads \
--verbose
Using strainFlye version "0.1.0-dev".
--------
smooth create @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Input diversity indices file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv
Create virtual reads?: Yes
"Well-covered" percentage used for virtual read creation: 95.00%
Create "flanks" of this length on either side of each virtual read: 100
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/reads
--------
smooth create @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
smooth create @ 0.10s: The FASTA file describes 3 contig(s).
smooth create @ 0.19s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
smooth create @ 12.33s: All FASTA contig(s) are included in the BCF file (the header of this BCF file describes 468 contig(s)).
smooth create @ 12.34s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
smooth create @ 12.34s: So far, these files seem good.
--------
smooth create @ 12.34s: All contigs must be > (2 × --virtual-read-flank) = 200 bp long. Checking this...
smooth create @ 12.34s: All contigs meet this minimum length.
--------
smooth create @ 12.34s: Going through contigs and creating smoothed and virtual reads...
smooth create @ 12.34s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
smooth create @ 12.39s: Contig edge_1671 has 23,821 mutated position(s).
smooth create @ 5,086.59s: From the 262,757 linear alignment(s) to contig edge_1671: created 228,902 smoothed read(s) and ignored 33,855 linear alignment(s).
smooth create @ 5,087.21s: Contig edge_1671 (average coverage 1,415.55x, based on the BAM file) has 538 run(s) of consecutive low-coverage (≤ 1,344.77x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 5,542.74s: Created 43,596 virtual read(s) total for contig edge_1671.
smooth create @ 5,542.77s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
smooth create @ 5,542.78s: Contig edge_2358 has 17 mutated position(s).
smooth create @ 8,102.10s: From the 742,784 linear alignment(s) to contig edge_2358: created 719,654 smoothed read(s) and ignored 23,130 linear alignment(s).
smooth create @ 8,102.85s: Contig edge_2358 (average coverage 2,993.59x, based on the BAM file) has 612 run(s) of consecutive low-coverage (≤ 2,843.91x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 8,518.15s: Created 103,347 virtual read(s) total for contig edge_2358.
smooth create @ 8,518.17s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
smooth create @ 8,518.18s: Contig edge_6104 has 35 mutated position(s).
smooth create @ 10,215.96s: From the 476,373 linear alignment(s) to contig edge_6104: created 476,232 smoothed read(s) and ignored 141 linear alignment(s).
smooth create @ 10,216.38s: Contig edge_6104 (average coverage 4,158.79x, based on the BAM file) has 361 run(s) of consecutive low-coverage (≤ 3,950.85x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 10,398.82s: Created 80,090 virtual read(s) total for contig edge_6104.
smooth create @ 10,398.83s: Done.
--------
smooth create @ 10,398.93s: Done.
8.1.2. Assembling these reads¶
This step assumes that LJA (specifically, its simple_ec branch) has already been installed. Please see LJA's manual for installation instructions.
I have installed LJA at a specific location on our cluster. So that strainFlye can easily run LJA on each contig's reads file, we pass the location of LJA's binary executable to strainFlye below using the --lja-bin parameter.
!strainFlye smooth assemble
Usage: strainFlye smooth assemble [OPTIONS]
Assemble contigs' smoothed and virtual reads using LJA.
Please note that this command relies on the "simple_ec" branch of LJA being
installed on your system. See strainFlye's README (and/or LJA's manual) for
details on installing LJA.
Options:
-r, --reads-dir DIRECTORY Directory produced by "strainFlye smooth create"
containing smoothed (and optionally virtual)
reads. We will use LJA to assemble each
*.fasta.gz file in this directory (representing
reads from different contigs) independently.
[required]
-p, --lja-params TEXT Additional parameters to pass to LJA, besides
the --reads and --output-dir parameters. To
explain our defaults: the --simpleec flag is
currently only available on the simple_ec branch
of LJA. This flag tells LJA to perform "simple"
error correction by removing all edges in the de
Bruijn graph with k-mer coverage less than
--Cov-threshold (10), as well as reads passing
through these edges. This "simple" error
correction is done instead of running the
default LJA error correction module, mowerDBG.
Please note that we do not perform any
validation on this string before passing it to
LJA (so if you are allowing users to run
strainFlye through a web server, be careful
about shell injection). [default: --simpleec
--Cov-threshold 10]
-b, --lja-bin FILE Location of LJA's "lja" binary file, which can
be used to run LJA. This should be located in
the bin/ directory constructed when you compiled
LJA. If this is not provided, we will check to
see if LJA is available in your $PATH.
[default: (Look for LJA's bin in the $PATH
environment variable)]
-o, --output-dir DIRECTORY Directory to which output LJA assemblies (one
subdirectory per *.fasta.gz file in the input
--reads-dir) will be written. [required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
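Regarding the shell-injection caveat in the --lja-params help text: one reasonable way to compose such a command line (a sketch with made-up paths, not necessarily how strainFlye itself does it) is to split the parameter string into tokens rather than handing the whole string to a shell:

```python
import shlex

# Hypothetical paths, for illustration only.
lja_bin = "/path/to/LJA/bin/lja"
reads = "/path/to/smooth/reads/edge_1671.fasta.gz"
out_dir = "/path/to/smooth/assemblies/edge_1671"
lja_params = "--simpleec --Cov-threshold 10"  # the default --lja-params value

# shlex.split() tokenizes the parameter string so the command can be run
# without shell=True; the parameter string itself is still unvalidated,
# which is why the help text above warns about untrusted input.
cmd = [lja_bin, "--reads", reads] + shlex.split(lja_params) + ["--output-dir", out_dir]
```

A list like `cmd` could then be passed to something like `subprocess.run(cmd)` to invoke LJA directly.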
!strainFlye smooth assemble \
--reads-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads \
--lja-bin /home/mfedarko/software/LJA-branch/bin/lja \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies \
--verbose
Using strainFlye version "0.1.0-dev". -------- smooth assemble @ 0.00s: Starting... Input reads directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/reads LJA parameters: --simpleec --Cov-threshold 10 LJA binary (if None, we'll look in $PATH): /home/mfedarko/software/LJA-branch/bin/lja Verbose?: Yes Output directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies -------- smooth assemble @ 0.00s: Assembling each *.fasta.gz file in /Poppy/mfedarko/sftests/tutorial-output/smooth/reads... smooth assemble @ 0.00s: Found file edge_1671.fasta.gz, presumably for contig edge_1671. Assembling. smooth assemble @ 0.00s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_1671.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671 00:00:00 2Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:10 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:10 5Mb INFO: LJA pipeline started 00:00:10 5Mb INFO: Performing initial correction with k = 5001 00:00:10 0Mb INFO: Reading reads 00:00:10 0Mb INFO: Extracting minimizers 00:00:22 3.1Gb INFO: Finished read processing 00:00:22 3.1Gb INFO: 3804650 hashs collected. Starting sorting. 00:00:22 3.1Gb INFO: Finished sorting. Total distinct minimizers: 26485 00:00:22 3.1Gb INFO: Starting construction of sparse de Bruijn graph 00:00:22 3.1Gb INFO: Vertex map constructed. 00:00:22 3.1Gb INFO: Filling edge sequences. 00:00:36 3.2Gb INFO: Finished sparse de Bruijn graph construction. 00:00:36 3.2Gb INFO: Collecting tips 00:00:36 3.2Gb INFO: Added 1672 artificial minimizers from tips. 00:00:36 3.2Gb INFO: Collected 52384 old edges. 00:00:36 3.2Gb INFO: New minimizers added to sparse graph. 00:00:36 3.2Gb INFO: Refilling graph with old edges. 00:00:37 3.2Gb INFO: Filling graph with new edges. 
00:00:37 3.2Gb INFO: Finished fixing sparse de Bruijn graph. 00:00:37 3.2Gb INFO: Starting to extract disjointigs. 00:00:37 3.2Gb INFO: Finished extracting 2434 disjointigs of total size 18962614 00:00:37 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/disjointigs.fasta" 00:00:38 35Mb INFO: Filling bloom filter with k+1-mers. 00:00:39 35Mb INFO: Filled 31425472 bits out of 217285760 00:00:39 35Mb INFO: Finished filling bloom filter. Selecting junctions. 00:00:39 36Mb INFO: Collected 5299 junctions. 00:00:39 36Mb INFO: Starting DBG construction. 00:00:39 36Mb INFO: Vertices created. 00:00:39 36Mb INFO: Filled dbg edges. Adding hanging vertices 00:00:39 36Mb INFO: Added 1 hanging vertices 00:00:39 36Mb INFO: Merging unbranching paths 00:00:39 36Mb INFO: Ended merging edges. Resulting size 2748 00:00:40 36Mb INFO: Cleaning edge coverages 00:00:40 36Mb INFO: Collecting alignments of sequences to the graph 00:00:51 3.2Gb INFO: Alignment collection finished. Total length of alignments is 677345 00:00:52 3.2Gb INFO: Could not correct 4910 reads. They will be removed. 00:00:52 3.2Gb INFO: Uncorrected reads were removed. 00:00:52 3.2Gb INFO: Applying changes to the graph 00:00:55 3.2Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/corrected_reads.fasta" 00:01:14 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:01:14 0Mb INFO: Loading graph from fasta 00:01:14 13Mb INFO: Finished loading graph 00:01:15 126Mb INFO: Looking for unique edges 00:01:15 126Mb INFO: Marked 76 long edges as unique 00:01:15 126Mb INFO: Marking extra edges as unique based on read paths 00:01:15 126Mb INFO: Marked 81 edges as unique 00:01:15 126Mb INFO: Splitting graph with unique edges 00:01:15 126Mb INFO: Processing 100 components 00:01:15 126Mb INFO: Finished unique edges search. 
Found 89 unique edges 00:01:15 126Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:01:15 126Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 00:01:15 126Mb INFO: Resolving repeats 00:01:15 126Mb INFO: Constructing paths 00:01:17 169Mb INFO: Building graph 00:01:17 169Mb INFO: Increasing k 00:01:17 170Mb INFO: Finished increasing k 00:01:17 170Mb INFO: Exporting remaining active transitions 00:01:17 170Mb INFO: Export to Dot 00:01:17 170Mb INFO: Export to GFA and compressed contigs 00:01:17 178Mb INFO: Finished repeat resolution 00:01:17 5Mb INFO: Performing polishing and homopolymer uncompression 00:01:18 6Mb INFO: Aligning reads back to assembly 00:01:27 2.2Gb INFO: Finished alignment. 00:01:27 2.2Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/uncompressing/alignments.txt" 00:01:27 2.2Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_1671.fasta.gz"] 00:02:45 3.4Gb INFO: Uncompressing homopolymers in contigs 00:02:46 3.4Gb INFO: Total zero covered nucleotides 0 00:02:46 3.4Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:02:46 3.4Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg.gfa" 00:02:46 3.4Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/assembly.fasta" 00:02:46 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/corrected_reads.fasta" 00:02:46 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg/mdbg.hpc.gfa" 00:02:46 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg.gfa" 00:02:46 
5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/assembly.fasta" 00:02:46 5Mb INFO: LJA pipeline finished smooth assemble @ 166.78s: Finished running LJA on file edge_1671.fasta.gz. smooth assemble @ 166.78s: Found file edge_2358.fasta.gz, presumably for contig edge_2358. Assembling. smooth assemble @ 166.78s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_2358.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358 00:00:00 5Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:09 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:09 5Mb INFO: LJA pipeline started 00:00:09 5Mb INFO: Performing initial correction with k = 5001 00:00:09 0Mb INFO: Reading reads 00:00:09 0Mb INFO: Extracting minimizers 00:00:45 5.4Gb INFO: Finished read processing 00:00:45 5.4Gb INFO: 10414993 hashs collected. Starting sorting. 00:00:46 5.4Gb INFO: Finished sorting. Total distinct minimizers: 8259 00:00:46 5.4Gb INFO: Starting construction of sparse de Bruijn graph 00:00:46 5.4Gb INFO: Vertex map constructed. 00:00:46 5.4Gb INFO: Filling edge sequences. 00:01:21 7.8Gb INFO: Finished sparse de Bruijn graph construction. 00:01:21 7.8Gb INFO: Collecting tips 00:01:21 7.8Gb INFO: Added 4 artificial minimizers from tips. 00:01:21 7.8Gb INFO: Collected 16526 old edges. 00:01:21 7.8Gb INFO: New minimizers added to sparse graph. 00:01:21 7.8Gb INFO: Refilling graph with old edges. 00:01:21 7.8Gb INFO: Filling graph with new edges. 00:01:21 7.8Gb INFO: Finished fixing sparse de Bruijn graph. 00:01:21 7.8Gb INFO: Starting to extract disjointigs. 
00:01:21 7.8Gb INFO: Finished extracting 20 disjointigs of total size 2169195 00:01:21 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/disjointigs.fasta" 00:01:21 13Mb INFO: Filling bloom filter with k+1-mers. 00:01:22 14Mb INFO: Filled 9578413 bits out of 66213600 00:01:22 14Mb INFO: Finished filling bloom filter. Selecting junctions.
00:01:22 14Mb INFO: Collected 779 junctions. 00:01:22 14Mb INFO: Starting DBG construction. 00:01:22 14Mb INFO: Vertices created. 00:01:22 14Mb INFO: Filled dbg edges. Adding hanging vertices 00:01:22 14Mb INFO: Added 0 hanging vertices 00:01:22 14Mb INFO: Merging unbranching paths 00:01:23 14Mb INFO: Ended merging edges. Resulting size 16 00:01:23 14Mb INFO: Cleaning edge coverages 00:01:23 14Mb INFO: Collecting alignments of sequences to the graph 00:01:54 6.7Gb INFO: Alignment collection finished. Total length of alignments is 662331 00:01:55 6.7Gb INFO: Could not correct 3 reads. They will be removed. 00:01:55 6.7Gb INFO: Uncorrected reads were removed. 00:01:55 6.7Gb INFO: Applying changes to the graph 00:02:00 6.7Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/corrected_reads.fasta" 00:03:04 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:03:04 0Mb INFO: Loading graph from fasta 00:03:04 11Mb INFO: Finished loading graph 00:03:06 265Mb INFO: Looking for unique edges 00:03:06 265Mb INFO: Marked 8 long edges as unique 00:03:06 265Mb INFO: Marking extra edges as unique based on read paths 00:03:06 265Mb INFO: Marked 8 edges as unique 00:03:06 265Mb INFO: Splitting graph with unique edges 00:03:06 265Mb INFO: Processing 5 components 00:03:06 265Mb INFO: Finished unique edges search. Found 22 unique edges 00:03:06 265Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:03:06 265Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 
00:03:06 265Mb INFO: Resolving repeats 00:03:06 265Mb INFO: Constructing paths 00:03:11 496Mb INFO: Building graph 00:03:11 496Mb INFO: Increasing k 00:03:11 496Mb INFO: Finished increasing k 00:03:11 496Mb INFO: Exporting remaining active transitions 00:03:11 496Mb INFO: Export to Dot 00:03:11 496Mb INFO: Export to GFA and compressed contigs 00:03:11 0.5Gb INFO: Finished repeat resolution 00:03:12 5Mb INFO: Performing polishing and homopolymer uncompression 00:03:12 9Mb INFO: Aligning reads back to assembly 00:03:38 5.8Gb INFO: Finished alignment. 00:03:38 5.8Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/uncompressing/alignments.txt" 00:03:38 5.8Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_2358.fasta.gz"] 00:06:12 7Gb INFO: Uncompressing homopolymers in contigs 00:06:12 7Gb INFO: Total zero covered nucleotides 0 00:06:12 7Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:06:12 7Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg.gfa" 00:06:12 7Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/assembly.fasta" 00:06:13 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/corrected_reads.fasta" 00:06:13 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg/mdbg.hpc.gfa" 00:06:13 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg.gfa" 00:06:13 5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/assembly.fasta" 00:06:13 5Mb INFO: LJA pipeline finished smooth assemble @ 539.88s: Finished running LJA on file 
edge_2358.fasta.gz. smooth assemble @ 539.88s: Found file edge_6104.fasta.gz, presumably for contig edge_6104. Assembling. smooth assemble @ 539.89s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_6104.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104 00:00:00 2Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:09 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:09 5Mb INFO: LJA pipeline started 00:00:09 5Mb INFO: Performing initial correction with k = 5001 00:00:09 0Mb INFO: Reading reads 00:00:09 0Mb INFO: Extracting minimizers 00:00:30 5.6Gb INFO: Finished read processing 00:00:30 5.6Gb INFO: 5736312 hashs collected. Starting sorting. 00:00:30 5.6Gb INFO: Finished sorting. Total distinct minimizers: 3721 00:00:30 5.6Gb INFO: Starting construction of sparse de Bruijn graph 00:00:30 5.6Gb INFO: Vertex map constructed. 00:00:30 5.6Gb INFO: Filling edge sequences. 00:00:50 5.7Gb INFO: Finished sparse de Bruijn graph construction. 00:00:50 5.7Gb INFO: Collecting tips 00:00:50 5.7Gb INFO: Added 14 artificial minimizers from tips. 00:00:50 5.7Gb INFO: Collected 7448 old edges. 00:00:50 5.7Gb INFO: New minimizers added to sparse graph. 00:00:50 5.7Gb INFO: Refilling graph with old edges. 00:00:50 5.7Gb INFO: Filling graph with new edges. 00:00:50 5.7Gb INFO: Finished fixing sparse de Bruijn graph. 00:00:50 5.7Gb INFO: Starting to extract disjointigs. 00:00:50 5.7Gb INFO: Finished extracting 26 disjointigs of total size 1074351 00:00:50 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/disjointigs.fasta" 00:00:50 9Mb INFO: Filling bloom filter with k+1-mers. 00:00:51 9Mb INFO: Filled 4370802 bits out of 30218400 00:00:51 9Mb INFO: Finished filling bloom filter. Selecting junctions. 
00:00:51 9Mb INFO: Collected 392 junctions. 00:00:51 9Mb INFO: Starting DBG construction. 00:00:51 9Mb INFO: Vertices created. 00:00:51 9Mb INFO: Filled dbg edges. Adding hanging vertices 00:00:51 9Mb INFO: Added 0 hanging vertices 00:00:51 9Mb INFO: Merging unbranching paths 00:00:51 9Mb INFO: Ended merging edges. Resulting size 34 00:00:51 9Mb INFO: Cleaning edge coverages 00:00:51 9Mb INFO: Collecting alignments of sequences to the graph 00:01:11 5.7Gb INFO: Alignment collection finished. Total length of alignments is 435418 00:01:11 5.7Gb INFO: Could not correct 24 reads. They will be removed. 00:01:11 5.7Gb INFO: Uncorrected reads were removed. 00:01:11 5.7Gb INFO: Applying changes to the graph 00:01:14 5.7Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/corrected_reads.fasta" 00:01:50 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:01:50 0Mb INFO: Loading graph from fasta 00:01:50 7Mb INFO: Finished loading graph 00:01:51 243Mb INFO: Looking for unique edges 00:01:51 243Mb INFO: Marked 6 long edges as unique 00:01:51 243Mb INFO: Marking extra edges as unique based on read paths 00:01:51 243Mb INFO: Marked 6 edges as unique 00:01:51 243Mb INFO: Splitting graph with unique edges 00:01:51 243Mb INFO: Processing 4 components 00:01:51 243Mb INFO: Finished unique edges search. Found 10 unique edges 00:01:51 243Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:01:51 243Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 
00:01:51 243Mb INFO: Resolving repeats 00:01:51 243Mb INFO: Constructing paths 00:01:54 326Mb INFO: Building graph 00:01:54 326Mb INFO: Increasing k 00:01:54 326Mb INFO: Finished increasing k 00:01:54 326Mb INFO: Exporting remaining active transitions 00:01:54 326Mb INFO: Export to Dot 00:01:54 326Mb INFO: Export to GFA and compressed contigs 00:01:54 337Mb INFO: Finished repeat resolution 00:01:54 5Mb INFO: Performing polishing and homopolymer uncompression 00:01:54 5Mb INFO: Aligning reads back to assembly 00:02:10 3.5Gb INFO: Finished alignment. 00:02:10 3.5Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/uncompressing/alignments.txt" 00:02:10 3.5Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_6104.fasta.gz"]
00:03:52 3.6Gb INFO: Uncompressing homopolymers in contigs 00:03:52 3.6Gb INFO: Total zero covered nucleotides 0 00:03:52 3.6Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:03:52 3.6Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg.gfa" 00:03:52 3.6Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/assembly.fasta" 00:03:53 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/corrected_reads.fasta" 00:03:53 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg/mdbg.hpc.gfa" 00:03:53 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg.gfa" 00:03:53 5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/assembly.fasta" 00:03:53 5Mb INFO: LJA pipeline finished smooth assemble @ 772.97s: Finished running LJA on file edge_6104.fasta.gz. smooth assemble @ 772.97s: Done. -------- smooth assemble @ 772.97s: Done.
Once you've created these assemblies, you can analyze them as you would any other assembly of a MAG. For information about LJA-specific details of the outputs (e.g. what each output file contains), please see LJA's manual.
8.1.3. Visualizing assembly graphs¶
One thing we may want to do is visualize the assembly graphs created by LJA, to see how "simple" each assembly wound up being. Bandage, AGB, and MetagenomeScope are a few of the many tools that can do this.
As an example, here's a MetagenomeScope visualization of the multiplex de Bruijn graph (MDBG) for edge_6104 (CAMP). This visualization makes it clear that, although there were 35 mutated positions in CAMP in the BCF file (according to the logs from strainFlye smooth create above), the final MDBG—after applying a $k$-mer coverage filter—really consists of two "bubble" structures, i.e. the boxes colored blue. (The pink segment in this figure indicates a sequence that is shared between both bubbles, duplicated here for ease of visualization.)

Why does this graph look different from the other multiplex de Bruijn graph shown for CAMP in the strainFlye paper?¶
For context: Figure 6 in our paper also shows a MDBG for CAMP, produced by running LJA on the smoothed and virtual reads created for CAMP. Why does that visualization include more sequences?
This is primarily due to differences in the input mutations used for the smoothing process. The mutations used as input for the paper's Figure 6 were simply the output of naïve $p$-mutation calling with $p = 1\%$. Here, on the other hand, the input BCF file we gave to strainFlye smooth create was produced by FDR estimation and fixing (using CAMP as the decoy contig). This meant that none of the CAMP mutations included in this BCF file were "rare" (i.e. had frequencies below the value of the -hp/--high-p parameter of strainFlye fdr estimate, which defaults to $p = 5\%$).
(The input alignment is also slightly different.)
8.2. Constructing link graphs¶
As a related analysis, we can construct link graphs showing "how much" of a contig we can potentially phase. These graphs are described in detail in the Supplemental Material of our paper.
Long story short, creating link graphs in strainFlye has two steps (like performing smoothed-read assembly above). First, we'll compute nucleotide (co-)occurrence data for each contig; then, we'll convert this data into link graphs.
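As a rough sketch of what the first step computes (strainFlye's actual implementation works from the BAM and BCF files and writes per-contig pickle files), the (co-)occurrence data boils down to counting which nucleotides each read carries at mutated positions, both singly and in pairs:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(reads):
    """Tally nucleotide (co-)occurrences at mutated positions.

    reads: list of dicts, one per read, each mapping a mutated position
        (that the read spans) to the nucleotide the read has there.
    Returns (nt_counts, pair_counts): counts of each (position, nucleotide)
    observation, and of each (position pair, nucleotide pair) observation.
    """
    nt_counts = Counter()
    pair_counts = Counter()
    for read in reads:
        # Single-position counts: how often each nucleotide appears at
        # each mutated position.
        for pos, nt in read.items():
            nt_counts[(pos, nt)] += 1
        # Pair counts: for every pair of mutated positions spanned by the
        # same read, record the pair of nucleotides the read carries.
        for p1, p2 in combinations(sorted(read), 2):
            pair_counts[((p1, p2), (read[p1], read[p2]))] += 1
    return nt_counts, pair_counts
```

The second step would then use counts like these to decide which pairs of alleles are "linked" strongly enough to include as edges in a contig's link graph.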
!strainFlye link
Usage: strainFlye link [OPTIONS] COMMAND [ARGS]...

  [+] Create link graphs showing co-occurring alleles.

  The "nt" command should be run first; this generates information about the
  counts and co-occurrences of nucleotides at mutated positions in a contig.
  The "graph" command takes as input the information produced by "nt" and
  creates link graphs (one link graph per contig) from it. There are many
  parameters that impact the graph creation, so we have split this into two
  steps in order to make it easy to rerun the "graph" step with different
  parameter settings if needed (the "nt" command will likely take longer to
  run).

Options:
  -h, --help  Show this message and exit.

Commands:
  nt     Compute (co-)occurrence information for mutations' nucleotides.
  graph  Convert (co-)occurrence information into a link graph structure.
8.2.1. Computing nucleotide (co-)occurrence information¶
This command will take a while (as shown below, it took ~47.84 hours to run on our cluster, or just under two days). As with other long-running commands in the pipeline, using --verbose may be helpful if you'd like to see intermediate logging messages.
A brief note about filtering¶
Note that, in the paper's Supplemental Material, we limited our focus to mutated positions with coverages of at least 1,000x. We do not do this here. If you'd like to perform this same sort of filtering—and only consider mutations occurring at positions with at least a certain minimum coverage—you'll need to filter the BCF file yourself. (You could probably use bcftools to do this; if you are using a BCF file produced by strainFlye, then you can filter based on the MDP info tag, which describes the number of matching + mismatching reads at a mutated position.)
!strainFlye link nt
Usage: strainFlye link nt [OPTIONS]
Compute (co-)occurrence information for mutations' nucleotides.
Options:
-c, --contigs PATH FASTA file of contigs for which we will create
link graphs. All contigs in this FASTA file
should also be contained in the BAM and BCF
files; it's ok if the BAM or BCF files contain
contigs not in this FASTA file (we'll ignore
them). [required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing single-nucleotide
mutations in a set of contigs. [required]
-o, --output-dir DIRECTORY Directory to which information about nucleotide
frequencies at the mutated positions in each
contig, as well as co-occurrence information
between pairs of mutated positions, will be
written. Each contig will be represented by two
"pickle" files within this directory, one
describing nucleotide frequencies at each
position and one describing nucleotide pair
frequencies at pairs of positions; each file's
name will be prefixed with the corresponding
contig name. [required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
!strainFlye link nt \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt \
--verbose
Using strainFlye version "0.1.0-dev".
-------- link nt @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/link/nt
-------- link nt @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
link nt @ 0.12s: The FASTA file describes 3 contig(s).
link nt @ 0.22s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
link nt @ 11.96s: All FASTA contig(s) are included in the BCF file (the header of this BCF file describes 468 contig(s)).
link nt @ 11.98s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
link nt @ 11.98s: So far, these files seem good.
-------- link nt @ 11.98s: Going through contigs and computing nucleotide (co-)occurrence information...
link nt @ 11.98s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
link nt @ 11.99s: Contig edge_1671 has 23,821 mutated position(s). Going through alignments to this contig...
link nt @ 899.59s: Found 256,509 read(s) spanning ≥ 1 mutated position in contig edge_1671. Computing (co-)occurrence info...
link nt @ 3,115.82s: Wrote out (co-)occurrence info for contig edge_1671.
link nt @ 3,115.82s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
link nt @ 3,115.82s: Contig edge_2358 has 17 mutated position(s). Going through alignments to this contig...
link nt @ 4,448.62s: Found 22,384 read(s) spanning ≥ 1 mutated position in contig edge_2358. Computing (co-)occurrence info...
link nt @ 4,451.40s: Wrote out (co-)occurrence info for contig edge_2358.
link nt @ 4,451.40s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
link nt @ 4,451.40s: Contig edge_6104 has 35 mutated position(s). Going through alignments to this contig...
link nt @ 5,292.62s: Found 8,670 read(s) spanning ≥ 1 mutated position in contig edge_6104. Computing (co-)occurrence info...
link nt @ 5,293.61s: Wrote out (co-)occurrence info for contig edge_6104.
link nt @ 5,293.61s: Done.
-------- link nt @ 5,293.66s: Done.
8.2.2. Converting this information to create link graphs¶
This next command directly takes as input the information we just computed. You can experiment with various parameters of graph construction here, if you'd like; the defaults match what we have set in the paper. (The help text for this command, shown below, explains these parameters in detail.)
!strainFlye link graph
Usage: strainFlye link graph [OPTIONS]
Convert (co-)occurrence information into a link graph structure.
We create one link graph per contig. These graphs are undirected. A node (i,
Ni) in the graph represents an allele: in other words, the occurrence of
nucleotide Ni (one of {A, C, G, T}) at position i (1-indexed).
What do edges between nodes represent? To answer that, let's define reads(i,
Ni) as the number of reads at which Ni is aligned to i. Let's also define
reads(i, j, Ni, Nj) as the number of reads at which Ni is aligned to i, and
Nj is aligned to j.
Given these definitions, we define the link weight between two alleles (i,
Ni) and (j, Nj) as
reads(i, j, Ni, Nj)
link(i, j, Ni, Nj) = -----------------------------------
max(reads(i, Ni), reads(j, Nj))
We create an edge between nodes (i, Ni) and (j, Nj) if both of the following
conditions hold:
1. The number of reads spanning both i and j with any (non-degenerate)
nucleotides aligned to either position is ≥ the --min-span parameter.
2. link(i, j, Ni, Nj) > the --low-link parameter.
Options:
-n, --nt-dir DIRECTORY Directory produced by "strainFlye link nt"
containing "pickle" files describing
nucleotide (co-)occurrence information for
certain contigs. [required]
--min-nt-count INTEGER RANGE We will only create a node representing a
given nucleotide at a mutated position if
this nucleotide is supported by at least
this many reads at this position. [default:
2; x>=1]
--min-span INTEGER RANGE One of the prerequisites for creating an
edge between two nodes (representing
nucleotides Ni, Nj at mutated positions i
and j) is that the total number of reads
spanning both i and j, across all
combinations of (non-degenerate) nucleotides
at both i and j, is at least this parameter.
[default: 501; x>=1]
--low-link FLOAT RANGE The other prerequisite for creating an edge
between two nodes (representing nucleotides
Ni, Nj at mutated positions i and j) is that
link(i, j, Ni, Nj) > this parameter. Note
this check is not inclusive (i.e. the
default of 0 means that, if reads(i, j, Ni,
Nj) = 0, we will never create an edge
between (i, Ni) and (j, Nj). [default: 0;
0<=x<1]
-f, --output-format [dot|nx] Output format used for each contig's link
graph. "dot" will write out graphs in
Graphviz' DOT language; "nx" will write out
graphs to "pickle" files, in which each link
graph is a NetworkX object. [default: dot]
--min-penwidth-clamp FLOAT RANGE
Only used if --output-format is "dot". Edges
will be assigned a "penwidth" attribute by
interpolating their link weight from the
range [0, 1] to the range [0, --max-
penwidth]; penwidths less than --min-
penwidth-clamp will then be clamped to
--min-penwidth-clamp. [default: 0.1; x>=0]
--max-penwidth FLOAT RANGE Only used if --output-format is "dot". See
the description for --min-penwidth-clamp.
[default: 5; x>=0]
-o, --output-dir DIRECTORY Directory to which graphs will be written.
We'll write one graph file per contig; each
file's name will be prefixed with the
corresponding contig name. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
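To make the definitions above concrete, here is a tiny Python sketch of the edge-creation rule (a toy re-derivation using the default --min-span and --low-link values, not strainFlye's actual code):

```python
# reads_1 maps alleles (i, Ni) -> read counts; reads_2 maps allele pairs
# -> co-occurrence counts; span maps (i, j) -> number of spanning reads.
def build_link_edges(reads_1, reads_2, span, min_span=501, low_link=0.0):
    edges = {}
    for ((i, ni), (j, nj)), co in reads_2.items():
        # link(i, j, Ni, Nj) = reads(i, j, Ni, Nj) / max(reads(i, Ni), reads(j, Nj))
        link = co / max(reads_1[(i, ni)], reads_1[(j, nj)])
        # Both conditions must hold: enough spanning reads, and link above the cutoff
        if span[(i, j)] >= min_span and link > low_link:
            edges[((i, ni), (j, nj))] = link
    return edges
```

For example, an allele pair with 600 co-occurring reads out of max(800, 650) supporting reads, and 700 spanning reads, gets an edge with link weight 0.75; a pair with zero co-occurrences never gets an edge under the default --low-link of 0.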
!strainFlye link graph \
--nt-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/link/graph \
--verbose
Using strainFlye version "0.1.0-dev".
-------- link graph @ 0.00s: Starting...
Input nucleotide information directory: /Poppy/mfedarko/sftests/tutorial-output/link/nt
To make a node (i, Ni): reads(i, Ni) must be ≥ 2
To make an edge btwn (i, Ni) & (j, Nj): spanCount(i, j) must be ≥ 501
To make an edge btwn (i, Ni) & (j, Nj): link(i, j, Ni, Nj) must be > 0.00
Output graph file format: dot
In dot output, edges' penwidths will be scaled from link weights to [0, 5.0] and then clamped to a minimum of 0.1
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/link/graph
-------- link graph @ 0.00s: Going through (co-)occurrence information and creating link graphs...
link graph @ 25.65s: Creating a link graph for contig edge_1671...
link graph @ 25.73s: The link graph for contig edge_1671 has 49,167 node(s).
link graph @ 54.96s: The link graph for contig edge_1671 has 8,801,972 edge(s).
link graph @ 84.48s: Wrote out a link graph (format: "dot") for contig edge_1671.
link graph @ 86.67s: Creating a link graph for contig edge_2358...
link graph @ 86.67s: The link graph for contig edge_2358 has 41 node(s).
link graph @ 86.67s: The link graph for contig edge_2358 has 13 edge(s).
link graph @ 93.57s: Wrote out a link graph (format: "dot") for contig edge_2358.
link graph @ 93.57s: Creating a link graph for contig edge_6104...
link graph @ 93.57s: The link graph for contig edge_6104 has 80 node(s).
link graph @ 93.57s: The link graph for contig edge_6104 has 2,190 edge(s).
link graph @ 93.58s: Wrote out a link graph (format: "dot") for contig edge_6104.
link graph @ 93.58s: Done.
-------- link graph @ 93.58s: Done.
8.2.3. Visualizing CAMP's link graph¶
Since we kept -f/--output-format set to the default of dot, the above command created a DOT file for each contig's link graph. These files can easily be visualized using Graphviz. (Graphviz comes installed by default on many Linux systems; see its website for installation instructions. I've found that installing it from conda-forge works well.)
Let's visualize the link graph for CAMP (edge_6104) as an example of how we might visualize link graphs in practice. This contig doesn't contain many mutations, so it makes for a simple starting point.
To visualize this graph we'll use Graphviz' sfdp layout algorithm, which is useful for visualizing large undirected graphs.
# make sure sfdp (and Graphviz) are installed
!which sfdp
/home/mfedarko/miniconda3/envs/strainflye/bin/sfdp
!sfdp \
/Poppy/mfedarko/sftests/tutorial-output/link/graph/edge_6104_linkgraph.gv -Tpng \
> edge_6104_linkgraph.png

And there we have it!
8.2.4. Interpreting link graphs¶
What's going on in this graph? We see two large "groups" of alleles with high-link-weight edges. Each allele is labelled with four pieces of information:
- Position on the contig (1-indexed)
- Nucleotide (one of A, C, G, T)
- Count
- Frequency
Notice how the frequencies of alleles in the left group are mostly ~16%, and how the frequencies of alleles in the right group are mostly ~84%. These groups correspond to the two main "haplotypes" of gene 1217 in CAMP (ranging from left = 1,208,927 bp to right = 1,210,075 bp). We know from our paper (see Figure 5, as well as Supplemental Material "Haplotypes of the most mutated gene in CAMP") that the mutations in gene 1217 of CAMP have one ~84% haplotype and one ~16% haplotype: we see this pattern recapitulated clearly here.
There's also one mutation with a similar frequency distribution at position 766,857 bp, but it isn't close enough to gene 1217's mutations to be connected to them in the graph.
Notice how there are a few low-count, low-frequency alleles on the periphery of this link graph (e.g. node 1,209,001 (T), with a count of 3x). These alleles pop up because, when we create nodes corresponding to alleles at mutated positions in a contig, we create these nodes for all alleles at these mutated positions (with a count of at least the --min-nt-count parameter of strainFlye link graph)—not just those corresponding to the most common or second most common alleles at a mutated position. You can increase the --min-nt-count parameter to a higher number (e.g. 10) if you'd like to remove these low-count allele nodes.
Comparing this link graph with Figure S19 (bottom) in our paper¶
We note that this graph is a bit different from Figure S19 (bottom) in our paper, which also shows a link graph for CAMP. Why is this the case? Here are a few reasons:
Figure S19's link graphs were created using simple $p = 0.5\%$ mutations (filtered to positions with a minimum coverage of 1,000x), whereas our link graphs were created from post-FDR-fixing mutations without any coverage filtering (and since we used CAMP as our decoy, we only include "indisputable" mutations in CAMP—i.e. those with frequency $p ≥ 5\%$).
Figure S19 (bottom) only shows the largest component of CAMP's link graph, whereas this visualization shows the entire graph. (Since we consider what is likely a smaller set of mutations here, though, the full graph is not much larger.)
In Figure S19, I colored each node (allele) in the link graph based on the gene(s) in which it was located. (This analysis notebook contains the code I used to do this.) Here we just visualize these DOT files as strainFlye outputs them, without any gene-coloring information; you can adjust nodes' colors yourself if desired, although for large graphs this will probably require some programming.
The alignment used here is slightly different from that used in the paper.
Other notes about link graphs¶
When visualizing massive link graphs (e.g. for BACT1's largest component, as shown in Figure S19 (top) in our paper), you can adjust the dpi attribute of the graph (for BACT1's largest component, I added dpi=20; to the top of its DOT file) to reduce the visualization's filesize. You can also try removing the label attribute of nodes, since for massive graphs at low resolution these labels will be hard to read anyway.
Graphviz includes the ccomps command-line tool, which splits DOT files up into their connected components. This can be useful if you want to visualize individual components of a large DOT file.
9. Codon / amino acid mutation matrices¶
We don't discuss this much in the main text of the strainFlye paper, but it's possible to create mutation matrices showing how codons mutate into each other, specific to an individual contig. We can also derive amino acid mutation matrices from these. These matrices can in theory be useful as an alternative to more general amino acid mutation matrices like PAM (Dayhoff et al., 1978) or BLOSUM (Henikoff & Henikoff, 1992).
In our paper's Supplemental Material, we show off a few of these matrices for three MAGs. strainFlye supports creating these matrices for arbitrary contigs, and here we demonstrate how we can easily do this!
!strainFlye matrix
Usage: strainFlye matrix [OPTIONS] COMMAND [ARGS]...

  Create codon and amino acid mutation matrices.

Options:
  -h, --help  Show this message and exit.

Commands:
  count  Count 3-mers aligned to contigs' codons.
  fill   Call codon mutations from 3-mer counts and fill in matrices.
Similarly to strainFlye link, the strainFlye matrix module is split up into two steps (count and fill). The rationale is that count takes a while and doesn't require many parameters, whereas fill runs much more quickly and has many adjustable parameters, so it's convenient to have a "break" in the process this way.
9.1. Counting 3-mers aligned to codons¶
We'll reuse prodigal_anonymous_selected3.gff, the file of protein-coding gene predictions in the three selected contigs that we generated above using Prodigal. This tells strainFlye where each contig's genes are, and from there we can figure out where its codons are (and what reads are aligned to each of these codons).
This will take a while, so you may want to use --verbose and/or filter your FASTA file (e.g. if you only want to draw a mutation matrix for a single contig, filtering your FASTA file to this one contig should speed things up a lot).
!strainFlye matrix count
Usage: strainFlye matrix count [OPTIONS]
Count 3-mers aligned to contigs' codons.
Options:
-c, --contigs PATH FASTA file of contigs in which we'll count
3-mers aligned to codons in predicted genes. All
contigs in this FASTA file should also be
contained in the BAM file; it's ok if the BAM
file contains contigs not in this FASTA file
(we'll ignore them). We'll also ignore contigs
for which no gene coordinates are given.
[required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
-g, --genes PATH Generic Feature Format version 3 (GFF3) file
describing gene coordinates within contigs.
We'll only consider features with a "type" of
one of {"CDS", "SO:0000316"}. [required]
-o, --output-dir DIRECTORY Directory to which 3-mer count information for
each contig will be written. Each contig will be
represented by one "pickle" file in this
directory named [contig]_3mers.pickle.
[required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
!strainFlye matrix count \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--genes /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/count \
--verbose
9.2. Calling codon mutations from aligned 3-mer counts, creating mutation matrices¶
The long part is over! Now that we have counts of which 3-mers were aligned to the codons in the coding sequences of our contigs, we can call codon mutations.
In our pipeline, codon mutations are called similarly to single-nucleotide mutations: we support either $p$- or $r$-based codon mutation calling (using frequencies or read counts, respectively). Here we'll do $p$-mutation calling at $p = 0.5\%$, to be consistent with the paper's Supplemental Material.
Minor sidenote: skipping this step and calling codon mutations manually¶
If you'd prefer, you could in theory call codon mutations from these counts using a more sophisticated method—this is one of the advantages of splitting up count and fill into two separate commands. (However: in order to do this, you'd need to load the 3-mer count data yourself—it's stored as a pickled representation of a CodonCounter object, which is defined in strainFlye's source code here. Feel free to open an issue on GitHub if you have questions about interacting with this object.)
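Loading one of these pickle files might look like the following sketch (the helper function is ours; as always with pickle, only load files from sources you trust):

```python
import pickle

def load_pickled_counts(path):
    # Load a pickled object (e.g. a CodonCounter written by
    # "strainFlye matrix count") from disk
    with open(path, "rb") as f:
        return pickle.load(f)

# For example (hypothetical path, following the [contig]_3mers.pickle naming):
# counts = load_pickled_counts(
#     "/Poppy/mfedarko/sftests/tutorial-output/matrix/count/edge_1671_3mers.pickle"
# )
```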
!strainFlye matrix fill
Usage: strainFlye matrix fill [OPTIONS]
Call codon mutations from 3-mer counts and fill in matrices.
You can call either p-mutations (based on codon frequencies) or r-mutations
(based just on codon counts) using -p or -r. These methods work analogously
to how they are described in "strainFlye call", with the addition that we do
not call codon mutations for "unreasonable" codons where the most common
aligned 3-mer (or, in the case of a tie, any of the most common aligned
3-mers) at this codon is not exactly the same as the "reference" codon's
3-mer.
It's worth noting that the amino acid mutation matrix is derived directly
from the codon mutation matrix: that is, as soon as we identify a mutation
from one codon (C1) into another codon (C2), we also record a mutation in
the amino acid matrix from translate(C1) into translate(C2). We could
arguably do this in a different way (count the translations of each 3-mer
aligned to a codon, then call amino acid mutations separately), but for now
we just do things the one way.
Options:
-c, --count-dir DIRECTORY Directory produced by "strainFlye matrix
count" containing "pickle" files describing
the counts of 3-mers aligned to contigs'
codons. [required]
-p INTEGER RANGE If specified, we'll call codon p-mutations
at this p threshold. This is scaled up by
100 (i.e. "-p 50" corresponds to 50 / 100 =
0.5%) in order to bypass floating-point
precision issues. Mutually exclusive with
-r. [default: (nothing); 0<x<=5000]
--min-alt-pos INTEGER RANGE Only used if -p is specified. In order for
us to call a p-mutation at a position, this
position's alternate nucleotide must be
supported by at least this many reads.
[default: 2; x>=1]
-r INTEGER RANGE If specified, we'll call codon r-mutations
at this r threshold. Mutually exclusive with
-p. [default: (nothing); x>=1]
-f, --output-format [tsv|json] Output format for matrix information.
[default: tsv]
-o, --output-dir DIRECTORY Directory to which matrix information will
be written. Within this directory, we'll
create one subdirectory for each contig
represented in --count-dir. Each
subdirectory will contain a codon mutation
matrix, amino acid mutation matrix, and
counts of codons and amino acids in the
"reference" contig sequence. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
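As a concrete illustration, the "unreasonable codon" check described in the help text above can be sketched like this (our own re-derivation from the help text, not strainFlye's code):

```python
from collections import Counter

def is_reasonable(ref_codon, aligned_3mer_counts):
    # A codon is "unreasonable" (and skipped) if the most common aligned 3-mer
    # (or, in a tie, any of the most common 3-mers) differs from the reference codon
    counts = Counter(aligned_3mer_counts)
    top = counts.most_common(1)[0][1]
    return all(kmer == ref_codon for kmer, n in counts.items() if n == top)
```

So a codon whose reference 3-mer dominates the alignment pile-up is kept, while one where a different 3-mer is most common (or tied for most common) is skipped.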
!strainFlye matrix fill \
--count-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/count \
-p 50 \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/fill \
--verbose
Using strainFlye version "0.1.0-dev".
-------- matrix fill @ 0.00s: Starting...
Input 3-mer count directory: /Poppy/mfedarko/sftests/tutorial-output/matrix/count
p: 50
Minimum number of supporting reads needed to call a p-mutation: 2
r: None
Output matrix information file format: tsv
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/matrix/fill
-------- matrix fill @ 0.00s: Performing codon p-mutation calling (p = 0.50%).
-------- matrix fill @ 0.00s: Going through aligned 3-mer counts and creating matrices...
matrix fill @ 1.26s: Creating matrices for contig edge_1671...
matrix fill @ 66.95s: Created an output directory for contig edge_1671.
matrix fill @ 68.38s: Creating matrices for contig edge_2358...
matrix fill @ 143.64s: Created an output directory for contig edge_2358.
matrix fill @ 144.56s: Creating matrices for contig edge_6104...
matrix fill @ 183.52s: Created an output directory for contig edge_6104.
matrix fill @ 183.52s: Done.
-------- matrix fill @ 183.64s: Done.
9.3. Visualizing mutation matrices¶
First, let's load and inspect a codon mutation matrix. We'll use BACT1 (edge_1671)'s matrix, just because we know from our paper that this MAG has a lot of mutations.
An entry (row, col) in the codon or amino acid matrix gives the number of times the codon or amino acid row mutates into the codon or amino acid col. For example, in BACT1's codon mutation matrix (shown below), we saw the codon AAA mutate into the codon AAC exactly 15 times.
Note: We don't do this here, but—depending on what you want to use this matrix for—it is worth thinking about if you want to normalize and/or subset it in some way. For example, you may want to divide each entry in each row by the total number of mutations in this row (dividing each entry in AAA's row by the total number of mutations from AAA into another codon, etc.), in order to convert each entry from counts to percentages. Making these sorts of changes should be feasible with pandas, once you've loaded the initial matrix.
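For example, here is a sketch of that row-normalization with pandas, using a toy 3×3 slice (a real codon matrix is 64×64, with NaN on the diagonal as in strainFlye's output):

```python
import pandas as pd

# Toy 3x3 "codon matrix": rows = "mutated from", columns = "mutated into"
mat = pd.DataFrame(
    [[None, 15.0, 5.0],
     [10.0, None, 30.0],
     [0.0, 4.0, None]],
    index=["AAA", "AAC", "AAG"],
    columns=["AAA", "AAC", "AAG"],
    dtype=float,
)
# Divide each entry by its row's total mutation count (sum() ignores the NaN
# diagonal), converting counts to percentages
row_pcts = mat.div(mat.sum(axis=1), axis=0) * 100
```

Here, the AAA row totals 20 mutations, so the 15 AAA→AAC mutations become 75%.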
# Load the codon matrix for BACT1 using pandas
import pandas as pd

cmat = pd.read_csv(
    "/Poppy/mfedarko/sftests/tutorial-output/matrix/fill/edge_1671/edge_1671_codon_matrix.tsv",
    sep="\t",
    index_col=0,  # the first column holds the "mutated from" codon labels
)
cmat
| AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | ... | TCG | TCT | TGA | TGC | TGG | TGT | TTA | TTC | TTG | TTT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAA | NaN | 15.0 | 243.0 | 53.0 | 11.0 | 3.0 | 0.0 | 2.0 | 53.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| AAC | 35.0 | NaN | 3.0 | 316.0 | 2.0 | 7.0 | 0.0 | 1.0 | 0.0 | 45.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| AAG | 265.0 | 7.0 | NaN | 15.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| AAT | 53.0 | 254.0 | 20.0 | NaN | 5.0 | 0.0 | 0.0 | 14.0 | 2.0 | 8.0 | ... | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACA | 16.0 | 1.0 | 3.0 | 8.0 | NaN | 94.0 | 76.0 | 168.0 | 6.0 | 3.0 | ... | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TGT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 95.0 | 1.0 | NaN | 0.0 | 1.0 | 0.0 | 2.0 |
| TTA | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 7.0 | 195.0 | 11.0 |
| TTC | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | NaN | 3.0 | 187.0 |
| TTG | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 264.0 | 0.0 | NaN | 15.0 |
| TTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2.0 | 12.0 | 157.0 | 12.0 | NaN |
64 rows × 64 columns
The following code is a simplified version of how we visualized these matrices in the paper; the original notebook for visualizing these matrices is here.
# Get the tab20 color map. This returns an object that we can call with, e.g. (0) to get the first
# color, (1) to get the second color, etc. See https://stackoverflow.com/a/25408562.
# (List of matplotlib color maps: https://matplotlib.org/stable/gallery/color/colormap_reference.html)
import matplotlib
import skbio
from matplotlib import pyplot

cmap = matplotlib.cm.get_cmap("tab20")
colors = [cmap(i) for i in range(20)]
hexcolors = [matplotlib.colors.to_hex(c) for c in colors]
# Map each of the 20 amino acids to a unique tab20 color; keep the stop codon (item 21)'s
# text color as black so that it also has a unique color.
aa2color = {}
aas = "ACDEFGHIKLMNPQRSTVWY*"
for i, aa in enumerate(aas[:-1]):
    aa2color[aa] = hexcolors[i]
aa2color["*"] = "#000000"

fig, ax = pyplot.subplots()
im = ax.imshow(cmat.values)
ax.grid(False)
ax.tick_params(
    top=True, bottom=True, left=True, right=True,
    labeltop=True, labelbottom=True, labelleft=True, labelright=True
)
toplabels = []
leftlabels = []
for oi, o in enumerate(cmat.index):
    t = skbio.DNA(o).translate()
    # Assign each codon a color based on its translation
    toplabels.append(f"{t}\n{o}")
    leftlabels.append(f"{t} {o}")
# As in "tick i's". This specifies where to put all the ticks. (If we omit this,
# then only a few of the tick labels are shown.)
tick_is = range(len(cmat.index))
ax.set_xticks(tick_is)
ax.set_yticks(tick_is)
ax.set_xticklabels(toplabels)
ax.set_yticklabels(leftlabels)
def set_text_style(tl, fontsize=20):
    # Figure out the color for this tick based on its amino acid / stop codon
    tl.set_color(aa2color[tl.get_text().split()[0]])
    tl.set_fontweight("bold")
    tl.set_fontfamily("monospace")
    tl.set_fontsize(fontsize)
for tl in ax.get_yticklabels():
    set_text_style(tl)
for tl in ax.get_xticklabels():
    set_text_style(tl, fontsize=9)
ax.set_ylabel("Mutated from", fontsize=30)
ax.set_xlabel("Mutated into", fontsize=30)
ax.set_title('Codon mutation matrix ($p = 0.5\%$) for edge_1671 ("BACT1")', fontsize=40, y=1.035)
# Call tight_layout() before adding the colorbar: https://stackoverflow.com/a/48922336
fig.tight_layout()
# Use thousands separators in colorbar labels if needed: https://stackoverflow.com/a/47230864
cbar_formatter = lambda x, p: format(int(x), ",")
# The aspect=60 thing reduces the colorbar's width to look roughly equivalent to that of a matrix cell
# Also, note that adding the colorbar here causes a matplotlib warning about "Auto-removal of grids": as
# far as I can tell (https://github.com/matplotlib/matplotlib/issues/21723), this warning is not important.
fig.colorbar(im, aspect=60, format=cbar_formatter)
fig.set_size_inches(27, 27)
/home/mfedarko/miniconda3/envs/strainflye/lib/python3.7/site-packages/ipykernel_launcher.py:48: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.
So that's an example of how we can visualize mutation matrices! Making them look pretty takes a bit of work, but matplotlib helps a lot. There are lots of other ways to add more information into these visualizations—showing the exact number of mutations in each cell of the matrix, showing the "reference counts" (number of times we saw a codon in a contig's genes) on the diagonal of the codon matrix, etc. The notebook I used to visualize matrices for the paper goes much more in-depth, although its code is a bit of a rat's nest.
Of course, mutation matrices have other uses besides visualization—as mentioned, they could in theory be useful as a way to compare alignments of genes or proteins, similarly to the PAM or BLOSUM matrices. strainFlye just focuses on computing these matrices, so whatever you do with them is up to you :)
10. Growth dynamics analyses¶
One of the nice things about having complete or nearly-complete MAGs is that we can take a look at both their GC skews and their coverages.
GC skews can sometimes be indicative of where the origin and terminus of replication are in a genome (this is discussed extensively in Chapter 1 of the Bioinformatics Algorithms textbook). Similarly, we know from (Korem, Zeevi, Suez et al., 2015) that the coverage throughout a MAG can also be indicative of the origin of replication.
In our paper, we demonstrate the applicability of HiFi reads to this sort of analysis by plotting coverage vs. skew for three MAGs in the SheepGut dataset; as we might expect, these values are anti-correlated. (Why anti-correlated, specifically? The skew minimum indicates the origin of replication, and coverage for a genome undergoing replication should be at its maximum near the origin of replication.)
The strainFlye dynam covskew command can help us reproduce this sort of analysis. Rather than plotting coverage and skew for every position within a MAG, we usually want to create "bins" of positions within a MAG and then plot coverage and skew per bin: strainFlye outputs this information for each contig in an easy-to-read TSV file that we can then use as input for plotting.
This command will take some time to run, so you may want to use --verbose.
10.1. Compute coverage/skew information¶
!strainFlye dynam covskew
Usage: strainFlye dynam covskew [OPTIONS]
Compare coverage and GC skew within contigs.
Bins
----
We split each contig into bin(s) of a fixed number of positions (determined
by the --bin-length parameter), starting from the leftmost end of the
contig (which we define as position 1). We continue creating bins until
we run out of positions: the rightmost bin will contain fewer positions
than --bin-length if the contig length is not divisible by --bin-length.
(And if the contig's length is ≤ --bin-length, we'll just create one
bin for all positions in this contig.)
Normalized coverages
--------------------
For each bin, we compute the median coverage of all positions in this bin.
We then compute M, the median of these medians across all bins in a contig.
We finally compute "normalized" coverages by dividing each bin's median
coverage by M, and then clamping these normalized coverages to the range
[1 - ϵ, 1 + ϵ]. (If M = 0, then we cannot do this normalization:
in this case, we define each bin's normalized coverage as "NA".) The
--norm-coverage-epsilon parameter can be used to adjust ϵ.
Cumulative GC skews
-------------------
For each bin, we compute the skew (G - C) / (G + C), where G is the number
of G nucleotides in this bin and C is the number of C nucleotides in this
bin. (If a bin has zero G or C nucleotides, then we define its skew as 0.)
We make these skews cumulative by, starting from the second-from-the-left
bin, defining this bin's cumulative skew as the sum of its skew and the
cumulative skew of the bin exactly to the left of it. This approach is
derived from (Grigoriev, 1998).
Options:
-c, --contigs PATH FASTA file of contigs for which we will
compute coverage and skew information. All
contigs in this FASTA file should also be
contained in the BAM file; it's ok if the
BAM file contains contigs not in this FASTA
file (we'll ignore them). [required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
-l, --bin-length INTEGER RANGE Bin length (in bp). [default: 10000; x>=1]
-e, --norm-coverage-epsilon FLOAT RANGE
We clamp all bins' normalized coverages to
within this distance of 1.0. For example,
the default of ϵ = 0.3 means we clamp to the
range [0.7, 1.3]. [default: 0.3; x>0]
-o, --output-dir DIRECTORY Directory to which TSV files describing
contigs' bins, coverages, and skews will be
written. We'll write one TSV file per contig
(named [contig]_covskew.tsv); each bin
within a contig will be described in one row
in its TSV file. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
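Before running the command, it may help to see the binning, normalization, and cumulative-skew definitions from the help text spelled out in code. The following is just an illustrative Python sketch of those definitions—it is not strainFlye's actual implementation:

```python
# Illustrative sketch of the "dynam covskew" math described in the help text
# above -- not strainFlye's actual implementation.
import statistics

def bin_sequence(length, bin_length=10_000):
    """Return (start, end) 1-indexed inclusive coordinates for each bin."""
    bins = []
    pos = 1
    while pos <= length:
        bins.append((pos, min(pos + bin_length - 1, length)))
        pos += bin_length
    return bins

def normalized_coverages(bin_median_covs, epsilon=0.3):
    """Divide each bin's median coverage by the median-of-medians (M), then clamp."""
    M = statistics.median(bin_median_covs)
    if M == 0:
        # Can't normalize by zero; define all normalized coverages as "NA".
        return ["NA"] * len(bin_median_covs)
    return [min(max(c / M, 1 - epsilon), 1 + epsilon) for c in bin_median_covs]

def cumulative_skews(bin_seqs):
    """Compute (G - C) / (G + C) per bin, then make the skews cumulative."""
    cum = []
    total = 0.0
    for seq in bin_seqs:
        g, c = seq.upper().count("G"), seq.upper().count("C")
        total += (g - c) / (g + c) if (g + c) > 0 else 0.0
        cum.append(total)
    return cum
```

For example, `bin_sequence(23_394)` produces two full 10,000 bp bins plus one smaller rightmost bin of 3,394 bp, matching the binning rule described above.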
!strainFlye dynam covskew \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/covskew \
--verbose
Using strainFlye version "0.1.0-dev".
--------
dynam covskew @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Bin length: 10,000 bp
Normalized coverage epsilon: 0.3
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/covskew
--------
dynam covskew @ 0.00s: Loading and checking FASTA and BAM files...
dynam covskew @ 0.35s: The FASTA file describes 3 contig(s).
dynam covskew @ 0.35s: All of these are included in the BAM file (which has 78,793 reference(s)), with the same lengths.
--------
dynam covskew @ 0.35s: Going through contigs and computing coverage/skew information...
dynam covskew @ 0.35s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
dynam covskew @ 0.35s: Creating 215 bins of length 10,000 bp and 1 smaller bin of length 3,394 bp for contig edge_1671...
dynam covskew @ 231.35s: For contig edge_1671's bins' covs, pre-norm.: min = 4.0x; max = 1,606.0x; median (M) = 1,435.5x.
dynam covskew @ 231.35s: For contig edge_1671's cumulative bin skews: min = 0.0963; max = 4.9053.
dynam covskew @ 231.39s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
dynam covskew @ 231.39s: Creating 280 bins of length 10,000 bp and 1 smaller bin of length 6,161 bp for contig edge_2358...
dynam covskew @ 1,172.61s: For contig edge_2358's bins' covs, pre-norm.: min = 2,427.0x; max = 4,194.0x; median (M) = 2,940.0x.
dynam covskew @ 1,172.61s: For contig edge_2358's cumulative bin skews: min = -2.1523; max = 2.1512.
dynam covskew @ 1,172.67s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
dynam covskew @ 1,172.67s: Creating 128 bins of length 10,000 bp and 1 smaller bin of length 9,244 bp for contig edge_6104...
dynam covskew @ 1,888.12s: For contig edge_6104's bins' covs, pre-norm.: min = 1,455.0x; max = 5,104.5x; median (M) = 4,108.0x.
dynam covskew @ 1,888.12s: For contig edge_6104's cumulative bin skews: min = 0.0137; max = 7.6426.
dynam covskew @ 1,888.15s: Done.
--------
dynam covskew @ 1,888.19s: Done.
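As a quick sanity check on the log above, the reported bin counts follow directly from the contig lengths and the 10,000 bp bin length:

```python
# Check the bin counts reported in the log: each contig gets
# floor(length / bin_length) full bins, plus one smaller bin if there is a
# remainder. Contig lengths are taken from the log output above.
contig_lengths = {"edge_1671": 2_153_394, "edge_2358": 2_806_161, "edge_6104": 1_289_244}
bin_length = 10_000
for name, length in contig_lengths.items():
    full_bins, leftover = divmod(length, bin_length)
    msg = f"{name}: {full_bins} bins of {bin_length:,} bp"
    if leftover:
        msg += f" and 1 smaller bin of {leftover:,} bp"
    print(msg)
```

This reproduces the "215 bins ... and 1 smaller bin of length 3,394 bp" (and so on) messages from the log.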
10.2. Visualize coverage and skew¶
So we've got this information, but what can we do with it? The most obvious answer is that we can plot it without too much effort. Visually, this will let us see if (normalized) coverage and skew are anti-correlated. (You could also imagine performing some sort of correlation analysis / regression between these quantities, but let's keep it simple for now.)
Here, we'll replicate part of Figure S17 in our paper (the exact code that generated that figure is in this notebook). For the sake of simplicity, we'll just do this for a single contig: CAMP, or edge_6104.
# Load the coverage / skew TSV file, and see what the first few lines (bins) in it look like
import pandas as pd

covskew = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/covskew/edge_6104_covskew.tsv", sep="\t")
covskew.head()
| | LeftPos_1IndexedInclusive | CenterPos_1Indexed | NormalizedCoverage | CumulativeSkew |
|---|---|---|---|---|
| 0 | 1 | 5000.5 | 1.183057 | 0.013732 |
| 1 | 10001 | 15000.5 | 1.164070 | 0.107924 |
| 2 | 20001 | 25000.5 | 1.180136 | 0.143035 |
| 3 | 30001 | 35000.5 | 1.155794 | 0.265320 |
| 4 | 40001 | 45000.5 | 1.169426 | 0.506219 |
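(As an aside, the `CenterPos_1Indexed` values above appear to be simple bin midpoints: for example, the first bin spans positions 1 through 10,000, so its center is (1 + 10,000) / 2 = 5,000.5. A quick check for the full-length bins shown above:)

```python
# The CenterPos_1Indexed values above appear to be bin midpoints
# (for the full-length 10,000 bp bins, at least; the final, shorter bin
# would use its actual right endpoint instead).
bin_length = 10_000
for left in (1, 10_001, 20_001, 30_001, 40_001):
    print((left + (left + bin_length - 1)) / 2)  # 5000.5, 15000.5, ...
```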
from matplotlib import pyplot

cov_color = "#0760ad"
skew_color = "#ad0707"
fig, ax = pyplot.subplots()
# twinx() lets us plot coverage and skew on the same plot.
ax_skew = ax.twinx()
# Plot coverage and skew with both different colors and line styles -- the use of different line styles
# makes this plot easier to interpret for colorblind readers. The (0, (6, 1)) line style makes the coverage
# line use dense dashes, which matches the version of this plot shown in the strainFlye paper's supplemental
# material; see https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html.
ax.plot(
covskew["CenterPos_1Indexed"], covskew["NormalizedCoverage"], color=cov_color, lw=5, linestyle=(0, (6, 1)),
label="Coverage"
)
ax_skew.plot(
covskew["CenterPos_1Indexed"], covskew["CumulativeSkew"], color=skew_color, lw=5,
label="Skew"
)
# turn off skew gridlines, which get in the way visually.
ax_skew.grid(False)
# plot the lower and upper normalized coverage bounds, for reference
for ybound in [0.7, 1.3]:
ax.axhline(y=ybound, linestyle=":", color=cov_color, lw=3)
ax.set_xlabel("Sequence position", labelpad=12, fontsize=20)
ax.set_ylabel("Normalized Coverage", labelpad=12, fontsize=20, color=cov_color)
ax_skew.set_ylabel("Cumulative GC Skew", labelpad=28, fontsize=20, color=skew_color, rotation=-90)
ax.set_title("Coverage vs. GC skew in CAMP", fontsize=25)
ax.xaxis.set_major_formatter("{x:,.0f}")
# The bin positions are 1-indexed, so let's use 1-indexing on the x-axis to make this clear
ax.set_xticks([1, 2e5, 4e5, 6e5, 8e5, 10e5, 12e5])
ax.tick_params(axis="both", labelsize=13)
ax_skew.tick_params(axis="both", labelsize=13)
# Let's add a legend just to make it clear which line is which (mostly useful for colorblind folks)
fig.legend(loc=(0.625, 0.31), handlelength=10, fontsize=12)
fig.set_size_inches(15, 5)
Nice—we've recreated the top plot in Figure S17 without too much work. The TSV files strainFlye dynam covskew produces for each contig simplify the process of creating these sorts of plots, and thus simplify the process of investigating whether or not a contig / MAG is undergoing replication.
10.3. Computing peak-to-trough ratios (PTRs)¶
The above plot makes it clear that coverage and skew are generally anti-correlated for CAMP, which provides some evidence that this particular MAG might be undergoing replication. If desired, we could take things further and compute the peak-to-trough ratio (PTR) (Korem, Zeevi, Suez et al., 2015) for CAMP to try to gain information about its replication rate.
10.3.1. Some background on PTRs¶
From (Korem, Zeevi, Suez et al., 2015) (emphasis mine):
Summed across the population, the copy number of a DNA region will be higher the closer that region is to the replication origin and, conversely, lower the closer that region is to the terminus (20, 21). Hence, the ratio between DNA copy number near the replication origin and that near the terminus, which we term peak-to-trough ratio (PTR), should reflect the growth rate of the bacterial population. At higher growth rates, a larger fraction of cells undergo DNA replication and more active replication forks are present in each cell (22). This results in a ratio higher than 1:1 between near-origin DNA and near-terminus DNA, thereby providing a quantitative readout of the population growth rate (21).
10.3.2. Computing (rough) PTRs using the output of strainFlye dynam covskew¶
First, we'll make the assumption that the bin with the minimum skew corresponds to the approximate location of the replication origin of CAMP, and that the bin with the maximum skew corresponds to the approximate location of the replication terminus of CAMP.
This is not always a safe assumption to make (skew diagrams are not always perfect indicators of this—see, for example, the skew diagram for Thermotoga petrophila). Also, we should be wary of situations like CAMP where the full genome was incompletely assembled (there are a few other contigs in the same component of the assembly graph in which CAMP is located, as discussed in our paper's Supplemental Material). That said, in this case this is a reasonable assumption—the PTRs we'll compute here are not gospel, and should be interpreted as only rough estimates.
After making this assumption, we'll identify the normalized coverages in the minimum-skew and maximum-skew bins:
# Which bins have the smallest and largest skews?
# The "keep" parameter to nsmallest() and nlargest() can be used to adjust how pandas
# breaks ties (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html),
# but that shouldn't matter too much.
smallest_skew_ncov = covskew.nsmallest(1, ["CumulativeSkew"])["NormalizedCoverage"].iloc[0]
largest_skew_ncov = covskew.nlargest(1, ["CumulativeSkew"])["NormalizedCoverage"].iloc[0]
print(f"The normalized coverage in the bin with the minimum skew is {smallest_skew_ncov:.4f}.")
print(f"The normalized coverage in the bin with the maximum skew is {largest_skew_ncov:.4f}.")
The normalized coverage in the bin with the minimum skew is 1.1831.
The normalized coverage in the bin with the maximum skew is 0.9282.
... And we can compute the PTR as the ratio of normalized coverage in the bin with the smallest skew (corresponding, per our assumption, to the replication origin) to normalized coverage in the bin with the largest skew (corresponding, per our assumption, to the replication terminus).
Note that if either of these normalized coverages has been "clamped" (i.e. it is set to 0.7 or 1.3, given the default -e/--norm-coverage-epsilon parameter for strainFlye dynam covskew), then I would recommend not computing the PTR for this contig—or, if you do compute the PTR, I'd at least recommend being skeptical of it. You could try to limit the impacts of clamping by increasing -e/--norm-coverage-epsilon, although this will make the normalized coverages more susceptible to outlier coverages.
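For example, you could flag values that sit exactly at a clamp boundary before computing the PTR. This `maybe_clamped` helper is hypothetical (it is not part of strainFlye) and assumes the default ϵ = 0.3:

```python
# Hypothetical helper (not part of strainFlye): flag normalized coverages
# sitting exactly at a clamp boundary, which suggests they were clamped.
# Assumes the default epsilon of 0.3, i.e. clamp bounds of 0.7 and 1.3.
def maybe_clamped(ncov, epsilon=0.3, tol=1e-9):
    return abs(ncov - (1 - epsilon)) < tol or abs(ncov - (1 + epsilon)) < tol

# For CAMP, neither bin's normalized coverage was clamped:
print(maybe_clamped(1.1831), maybe_clamped(0.9282))  # False False
```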
camp_ptr = smallest_skew_ncov / largest_skew_ncov
print(f"The PTR for CAMP is {smallest_skew_ncov:.4f} / {largest_skew_ncov:.4f} = {camp_ptr:.4f}.")
The PTR for CAMP is 1.1831 / 0.9282 = 1.2746.
Nice—this matches the PTR we computed for CAMP in our Supplemental Material.
11. Wrapping up¶
That's it for this tutorial! To review, here we used strainFlye to:
- Align reads to contigs (`align`)
- Naïvely call mutations in contigs (`call p-mutation`)
- Estimate the false discovery rates (FDRs) of these mutation calls (`fdr estimate`)
- Fix mutation calls' FDRs to an upper limit (`fdr fix`)
- Identify "hotspot" predicted genes (`spot hot-features`)
- Identify "coldspot" gaps between mutations (`spot cold-gaps`)
- Create and assemble smoothed and virtual reads (`smooth create`, `smooth assemble`)
- Create link graphs (`link nt`, `link graph`)
- Compute codon and amino acid mutation matrices (`matrix count`, `matrix fill`)
- Compare contigs' coverage and skew to get information on growth dynamics (`dynam covskew`)
Hopefully this gives you a taste of the various tasks for which strainFlye can be useful. If you have any questions, feel free to open an issue on strainFlye's GitHub page.