Tutorial: applying strainFlye to the "SheepGut" dataset¶
This tutorial will walk you through the various commands strainFlye offers.
0. Following this tutorial¶
Depending on your dataset and goals, you could either walk through the entirety of this document or skip around a bit. You can "jump to" many steps of the pipeline without trouble; for example, if you already have a BCF file of single-nucleotide mutations, you may prefer to skip all of the naïve mutation calling / FDR estimation steps and jump to the phasing analyses.
Note that many of the steps of this pipeline can take a while to run for large datasets like SheepGut. (In my experience: for the full SheepGut dataset on our cluster, "a while" usually means somewhere around 24 hours.) Smaller datasets and/or faster machines should of course decrease this duration. Section 5 of this tutorial ("Optional: Filter the FASTA file in order to focus on certain contigs") discusses how you can subset your dataset, after the align step, to certain contigs of interest—this can also speed things up.
1. The two main strainFlye inputs¶
The strainFlye pipeline takes as input two main types of data:
A set of reads (in FASTA / FASTQ format).
A set of contigs (in FASTA format) assembled from these reads.
This is all that is needed if you are starting the pipeline at its beginning (with align). Later pipeline steps will require other inputs that can be produced by intermediate pipeline steps.
1.1. Details about these types of inputs¶
Regarding reads: We designed strainFlye in the context of PacBio Circular Consensus Sequencing (CCS) "HiFi" reads (Wenger & Peluso et al., 2019). However, in theory it should still work with other reasonably long and accurate reads.
Regarding contigs: We don't impose any restriction on the assembler you use to construct these. We have tested strainFlye on metaFlye (Kolmogorov et al., 2020) and hifiasm-meta (Feng et al., 2022) output, but strainFlye should in theory work with the outputs of other assemblers.
1.2. The SheepGut dataset¶
We'll be applying strainFlye to the SheepGut dataset that is shown in the strainFlye paper. (This dataset has also been described in Kolmogorov et al., 2020 and Bickhart & Kolmogorov et al., 2022.) Please see the strainFlye paper's "Data sets" section for details about acquiring reads and contigs for the SheepGut dataset.
Note that the "contigs" we use for the SheepGut dataset really correspond to edge sequences in the assembly_graph.gfa file produced by metaFlye. These edge sequences may be slightly different from the contigs / scaffolds in the assembly.fasta file produced by metaFlye: see Flye's manual for more information. (You could use either type of sequence with strainFlye, although I personally recommend using edges: it's useful to have context about where exactly in the assembly graph a sequence is, and things like gaps in scaffolds [represented by Ns] will cause strainFlye to complain.)
2. Introduction¶
Let's take care of a few things before the tutorial starts.
2.1. Installing strainFlye¶
Before following along with this tutorial, we assume that you have already installed strainFlye. Please see strainFlye's README for installation instructions.
2.2. What commands are available through strainFlye?¶
!strainFlye
Usage: strainFlye [OPTIONS] COMMAND [ARGS]...

  Pipeline for the analysis of rare mutations in metagenomes.

  Please consult https://github.com/fedarko/strainFlye if you have any
  questions, comments, etc. about strainFlye. Thank you for using this tool!

Options:
  -v, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  align   Align reads to contigs, and filter the resulting alignment.
  call    [+] Call mutations in contigs naïvely & compute diversity indices.
  fdr     [+] Estimate and fix FDRs for contigs' naïve mutation calls.
  spot    [+] Identify putative mutational hotspots or coldspots.
  smooth  [+] Create and assemble smoothed and virtual reads.
  link    [+] Create link graphs showing co-occurring alleles.
  matrix  [+] Create codon and amino acid mutation matrices.
  dynam   [+] Compute simple information about growth dynamics.
  utils   [+] Miscellaneous utility commands provided with strainFlye.
2.3. Importing and configuring some utilities¶
You shouldn't need to do much (if any) programming to use strainFlye's commands; that said, we will be using Python to help with a few small tasks throughout this tutorial, and you will probably need to do some programming to plot the results we produce. (This philosophy was inspired partially by Schloss 2020.) We'll import some useful Python packages here to reduce clutter later on in this notebook.
(If you prefer, you could of course use another language instead of Python.)
import time
import skbio
import pandas as pd
import matplotlib
from matplotlib import pyplot
# Make the plots look nice -- this is the style we use in the paper
pyplot.style.use("ggplot")
3. Convert the assembly graph GFA file to a FASTA file of contigs¶
You can skip this step if: you already have a FASTA file describing contigs in your assembly graph.
Our assembly graph (the GFA file) contains the sequences of the contigs that we will use in many downstream analyses, but we'll need to have a FASTA file that just describes these contigs' sequences (independent of the assembly graph topology).
There are some bash one-liners you can use to convert a GFA 1 file to a FASTA file, but strainFlye also provides a utility command (strainFlye utils gfa-to-fasta) to do this for you. We'll use this here. (Our solution may be a bit slower than a bash one-liner, but it performs some useful sanity checking on the GFA file.)
!strainFlye utils gfa-to-fasta \
--graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
--output-fasta /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
Using strainFlye version "0.2.0-dev".
--------
utils gfa-to-fasta @ 0.00s: Starting...
Input GFA file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
Output FASTA file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
--------
utils gfa-to-fasta @ 16.41s: Done. Output FASTA file contains 78,793 sequences.
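For reference, the core of this GFA-to-FASTA conversion is simple enough to sketch in pure Python. This is only an illustration of the idea (reading segment sequences from GFA "S" lines); it is not strainFlye's actual implementation, which also performs sanity checks on the file:

```python
def gfa_to_fasta(gfa_lines):
    """Convert GFA 1 "S" (segment) lines to a FASTA-formatted string.

    Segments whose sequence is "*" (i.e. not stored in the GFA file) are
    skipped. Unlike strainFlye utils gfa-to-fasta, this sketch performs no
    validation of the input.
    """
    chunks = []
    for line in gfa_lines:
        if line.startswith("S\t"):
            fields = line.rstrip("\n").split("\t")
            name, seq = fields[1], fields[2]
            if seq != "*":
                chunks.append(f">{name}\n{seq}\n")
    return "".join(chunks)
```

You could apply this to a file object directly (e.g. `gfa_to_fasta(open("graph.gfa"))`), but for real datasets the dedicated command, or a bash one-liner, is preferable.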
4. Align reads to contigs; filter the resulting alignment¶
You can skip this step if: you already have a BAM file representing an alignment of reads to contigs, and this BAM file does not contain secondary alignments / partially-mapped reads / overlapping supplementary alignments (these all may cause problems in downstream analyses).
We'll need to align reads back to these contigs. The resulting alignment, and/or the mutations that we call from it, will be used in pretty much all downstream steps—so it's important to make sure that it is of good quality!
The strainFlye align command uses minimap2 to perform alignment, and then does some extra filtering on the resulting alignment.
Note that this command, in particular, may take a while to run. Sequence alignment is computationally expensive! On our cluster, strainFlye align ran on the full SheepGut dataset in 62,941.21 seconds (aka about 17.5 hours).
!strainFlye align
Usage: strainFlye align [OPTIONS] READS...
Align reads to contigs, and filter the resulting alignment.
Files of reads should be in the FASTA or FASTQ formats; GZIP'd files are
allowed.
This command involves multiple steps, including:
1) Align reads to contigs (using minimap2) to generate a SAM file
2) Convert this SAM file to a sorted and indexed BAM file
3) Filter overlapping supplementary alignments (OSAs) from this BAM file
4) Filter partially-mapped reads from this BAM file
Note that we only sort the alignment file once, although we do re-index it
after the two filtering steps. This decision is motivated by
https://www.biostars.org/p/131333/#131335.
Options:
-c, --contigs PATH FASTA file of contigs to which reads will be
aligned. [required]
-g, --graph PATH GFA 1-formatted file describing an assembly
graph of the contigs. This is used in the
"filter partially-mapped reads" step to make
the filter less strict for reads mapped to
adjacent contigs in the graph. This isn't
required; if it isn't passed, adjacent
contigs will not be considered in this
filter. [default: (no graph)]
-p, --minimap2-params TEXT Additional parameters to pass to minimap2,
besides the contig and read information.
Depending on the size of your dataset and
the amount of memory your system has, you
may want to adjust the -I parameter; see
minimap2's documentation for details. Please
note that we do not perform any validation
on this string before passing it to minimap2
(so if you are allowing users to run
strainFlye through a web server, be careful
about shell injection). [default: -ax asm20
--secondary=no -I 8g --MD]
-o, --output-dir DIRECTORY Directory to which an output BAM file
(final.bam) and BAM index file will be
written. Some temporary files will also be
written to this directory. [required]
--rm-tmp-bam / --no-rm-tmp-bam Remove temporary (before filtering, and
after just filtering OSAs) BAM files, to
reduce space requirements. [default: rm-
tmp-bam]
--verbose / --no-verbose Display extra details during alignment
filtering. [default: no-verbose]
-h, --help Show this message and exit.
# Reads file(s) are specified after all of the other parameters to this command
!strainFlye align \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta \
--graph /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/alignment \
/Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz \
/Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/*.fasta.gz
Using strainFlye version "0.2.0-dev".
--------
align @ 0.00s: Starting...
Input file(s) of reads:
1. /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz
2. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz
3. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz
4. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz
5. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz
6. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz
7. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz
8. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz
9. /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta
Input graph file: /Poppy/mfedarko/misc-data/sheepgut_flye_big_2.8_graph.gfa
minimap2 parameters: -ax asm20 --secondary=no -I 8g --MD
Remove temporary BAM files?: Yes
Verbose?: No
Output directory: /Poppy/mfedarko/sftests/tutorial-output/alignment
--------
align @ 0.00s: Loading and checking contig information...
align @ 19.89s: Looks good. The FASTA file describes 78,793 contigs.
--------
align @ 19.89s: Sanity-checking that the assembly graph and FASTA file describe the same contigs (based on names and lengths)...
align @ 38.12s: Everything looks good so far.
--------
align @ 38.12s: Running minimap2 --> samtools view --> samtools sort...
align @ 38.12s: Exact command we're running: minimap2 -ax asm20 --secondary=no -I 8g --MD /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz | samtools view -b - | samtools sort - -o /Poppy/mfedarko/sftests/tutorial-output/alignment/sorted-unfiltered.bam
[M::mm_idx_gen::67.042*1.68] collected minimizers
[M::mm_idx_gen::84.530*1.94] sorted minimizers
[M::main::84.583*1.94] loaded/built the index for 78793 target sequence(s)
[M::mm_mapopt_update::104.604*1.76] mid_occ = 100
[M::mm_idx_stat] kmer size: 19; skip: 10; is_hpc: 0; #seq: 78793
[M::mm_idx_stat::113.809*1.70] distinct minimizers: 466400347 (83.22% are singletons); average occurrences: 1.330; average spacing: 5.502
[M::worker_pipeline::234.045*2.35] mapped 60083 sequences
[[Note from Marcus: trimming down minimap2 outputs here to reduce notebook length...]]
[M::worker_pipeline::41791.167*3.09] mapped 46553 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -ax asm20 --secondary=no -I 8g --MD /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta /Poppy/mkolmogo/sheep_meta/data/sheep_poop_CCS_dedup.fastq.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200203_184822.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200211_145859.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200213_024013.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200214_212611.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200220_195233.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200222_075656.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200223_200916.Q20.fasta.gz /Poppy/mkolmogo/sheep_meta/data/ccs_sequel_II/m54337U_200227_201603.Q20.fasta.gz
[M::main] Real time: 41792.733 sec; CPU: 129283.952 sec; Peak RSS: 30.432 GB
[bam_sort_core] merging from 526 files and 1 in-memory blocks...
align @ 45,699.81s: Done running minimap2 --> samtools view --> samtools sort.
--------
align @ 45,699.81s: Indexing the sorted BAM...
align @ 46,270.42s: Done indexing the sorted BAM.
--------
align @ 46,270.42s: Filtering reads with overlapping supplementary alignments (OSAs)...
align @ 46,938.50s: Done with pass 1 of the OSA filter; moving on to pass 2...
align @ 50,284.20s: Done filtering reads with overlapping supplementary alignments.
--------
align @ 50,284.24s: Indexing the OSA-filtered BAM...
align @ 50,868.70s: Done indexing the OSA-filtered BAM.
--------
align @ 50,868.70s: Removing the before-filtering BAM and its index to save space...
align @ 50,876.99s: Done removing these files.
--------
align @ 50,876.99s: Filtering partially-mapped reads...
align @ 50,877.11s: Loading assembly graph...
align @ 50,894.56s: Loaded assembly graph.
align @ 56,868.92s: Done filtering out partially-mapped reads.
--------
align @ 56,870.69s: Indexing the final BAM...
align @ 57,332.50s: Done indexing the final BAM.
--------
align @ 57,332.50s: Removing the just-OSA-filtered BAM and its index to save space...
align @ 57,341.75s: Done removing these files.
--------
align @ 57,341.76s: Done.
This generates a BAM file (final.bam) and BAM index file (final.bam.bai) in the specified output directory.
We can use this BAM file for many analyses downstream—the first of these will be mutation calling.
5. Optional: Filter the FASTA file in order to focus on certain contigs¶
We just aligned our dataset's reads against all contigs in the assembly graph. This is standard practice (see, e.g., this tutorial); aligning reads against all contigs probably yields a more accurate alignment than aligning them against only a subset of these contigs (although proving whether this is "best practice" is a challenging question, and one that I will sidestep right now).
However, now that we have this alignment, we don't necessarily need to perform mutation calling, phasing, etc. on all contigs in our dataset (although, if you want to, you could!). This is because—with the exception of FDR estimation and fixing—the rest of the steps in strainFlye's pipeline treat each contig in isolation. So, we can focus on a smaller set of individual contigs we specifically care about in order to speed things up.
We're going to do the following:
- For naïve mutation calling, FDR estimation, and FDR fixing, we'll just focus on the "long" contigs in this dataset (defining a contig as "long" if its length is at least 1 Mbp, aka 1,000,000 bp).
In theory, these long contigs represent putative metagenome-assembled genomes (MAGs).
The motivation behind still focusing on a relatively large set of contigs at this step is that, when we perform FDR estimation, we'll need to select a decoy contig. (So it is useful to still have a "pool" of many potential decoy contigs to pick from.) It's also interesting to be able to plot FDR curves for many target contigs.
- Beginning with Section 8 ("Phasing analyses"), we'll focus on an even smaller set of just three long contigs shown throughout the strainFlye paper: CAMP, BACT1, and BACT2 (corresponding to edge_6104, edge_1671, and edge_2358, respectively).
- Many of these downstream steps take a while to run and produce a lot of data, so focusing on a small number of contigs helps keep this tutorial (relatively) quick.
Of course, these criteria are not hard requirements you need to follow in your own research! Maybe you'd prefer to specifically focus on contigs with high coverages, or maybe on contigs with good CheckM completeness or contamination values. Or maybe you'd like to keep considering all contigs in the full dataset! Your decision should depend on your goals and your dataset.
In any case, how do we "focus on" certain contigs? We can filter our FASTA file to a subset of contigs present in the full dataset, and use this filtered FASTA file as input to strainFlye's commands. As an example of this, we will use Python (in particular, the scikit-bio library) to create two such filtered FASTA files: one including just the "long" contigs, and one including just the CAMP, BACT1, and BACT2 contigs. (I'm aware that this is a bit inefficient, sorry!)
# (This uses scikit-bio; see http://scikit-bio.org/ for more details.)
input_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs.fasta"
# Produce a filtered FASTA file containing only contigs >= 1 Mbp long
long_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta"
len_threshold = 1000000
# ... and another containing just the three "selected" contigs from the paper
three_contigs_fp = "/Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta"
t0 = time.time()
num_long_contigs = 0
num_selected_contigs = 0
with open(long_contigs_fp, "w") as lf:
    with open(three_contigs_fp, "w") as tf:
        for contig in skbio.io.read(input_contigs_fp, format="fasta", constructor=skbio.DNA):
            if len(contig) >= len_threshold:
                skbio.io.write(contig, format="fasta", into=lf)
                num_long_contigs += 1
            # All of these three contigs are also >= 1 Mbp long, but let's be
            # paranoid and future-proof this anyway
            if contig.metadata["id"] in ["edge_6104", "edge_1671", "edge_2358"]:
                skbio.io.write(contig, format="fasta", into=tf)
                num_selected_contigs += 1
t1 = time.time()
print(f"Found {num_long_contigs:,} contigs with lengths \u2265 {len_threshold:,} bp. Took {t1 - t0:,.2f} sec.")
assert num_selected_contigs == 3, f"Number of selected contigs was {num_selected_contigs:,}?"
print("Found all 3 of the selected contigs (CAMP, BACT1, BACT2).")
Found 468 contigs with lengths ≥ 1,000,000 bp. Took 12.52 sec.
Found all 3 of the selected contigs (CAMP, BACT1, BACT2).
The remainder of this tutorial will focus on these 468 long contigs.
In any case, now we can get on to some more interesting stuff!
6. Perform naïve mutation calling, then estimate and fix mutation calls' FDRs¶
You can skip this step if: You already have a BCF file describing single-nucleotide, non-multi-allelic mutations in your contigs.
The analyses downstream of this step (hotspot/coldspot identification, phasing) take as input a set of identified single-nucleotide mutations (or, if you prefer to use different terminology, "called variants", "called SNVs", ...) in which we have some confidence. How does strainFlye identify these mutations?
There are a few steps (as our paper describes). First, we will naïvely call mutations using a simple threshold-based method (referred to as "NaiveFreq" in the paper). We can then estimate the false-discovery rates (FDRs) of the mutations called for each contig using the target-decoy approach. If desired, we can then adjust the called mutations to fix the estimated FDRs of these mutation calls below a specified threshold.
6.1. $p$-mutations and $r$-mutations?¶
So, our first step will be performing this simple threshold-based calling. What do we mean by "threshold" here?
strainFlye supports calling two basic types of mutations: $p$-mutations and $r$-mutations. The docs explain the difference between these two types best:
!strainFlye call
Usage: strainFlye call [OPTIONS] COMMAND [ARGS]...
[+] Call mutations in contigs naïvely & compute diversity indices.
Consider a position "pos" in a contig. Using the alignment, we can count how
many reads have a (mis)match operation to "pos" with one of the four
nucleotides (A, C, G, T; we ignore degenerate nucleotides in reads). We
represent these four nucleotides' counts at pos as follows:
N1 = # reads of the most-common aligned nucleotide at pos,
N2 = # reads of the second-most-common aligned nucleotide at pos,
N3 = # reads of the third-most-common aligned nucleotide at pos,
N4 = # reads of the fourth-most-common aligned nucleotide at pos.
(We break ties arbitrarily.)
strainFlye supports two types of naïve mutation calling based on these
counts: p-mutations and r-mutations. These are described below.
p-mutations (naïve percentage-based mutation calling)
-----------------------------------------------------
This takes as input some percentage p in the range (0%, 50%]. Define
freq(pos) = N2 / (N1 + N2 + N3 + N4). This value, which is inherently
constrained to the range [0%, 50%], is an estimate of the mutation frequency
of this position. We classify pos as a p-mutation if freq(pos) ≥ p, AND if
N2 ≥ the --min-alt-pos parameter (an integer representing a minimum number
of reads that must support the alternate nucleotide).
r-mutations (naïve read-count-based mutation calling)
-----------------------------------------------------
This takes as input some integer r > 0. We classify pos as an r-mutation if
N2 ≥ r.
Diversity indices
-----------------
Later on in the pipeline, if we perform FDR estimation on these called
mutations using the "strainFlye fdr" module, we will need to select a "decoy
contig" with relatively few mutations. A contig with low diversity indices
may represent a promising decoy contig; so, for the sake of convenience,
both commands output this information.
Options:
-h, --help Show this message and exit.
Commands:
p-mutation Call p-mutations and compute diversity indices.
r-mutation Call r-mutations and compute diversity indices.
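In plain Python, the two definitions above boil down to the following. (This is a restatement of the help text, not strainFlye's actual code; `min_p` here is a fraction in (0, 0.5], e.g. 0.0015 for $p = 0.15\%$.)

```python
def is_p_mutation(n1, n2, n3, n4, min_p, min_alt_pos=2):
    """Naive percentage-based call: freq(pos) = N2 / (N1+N2+N3+N4) must be
    >= p, AND N2 must be supported by at least min_alt_pos reads."""
    total = n1 + n2 + n3 + n4
    if total == 0:
        return False
    return (n2 / total) >= min_p and n2 >= min_alt_pos

def is_r_mutation(n2, min_r):
    """Naive read-count-based call: N2 >= r."""
    return n2 >= min_r
```

For example, a position covered by 100 reads of the most-common nucleotide and 5 reads of the second-most-common nucleotide has freq(pos) ≈ 4.7%, so it would be called as a $p$-mutation for any $p$ up to that frequency.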
6.2. Understanding these (sub)commands¶
First off, note that strainFlye call doesn't do anything besides show help info if you run it by itself. This is because, unlike strainFlye align, strainFlye call has two subcommands: p-mutation and r-mutation. Which of these you use will depend on how you want to naïvely call mutations. You can invoke one of these subcommands by writing out the full chain of commands: for example, strainFlye call p-mutation.
6.2.1. Input and output¶
Probably the most important parameter at this step is the minimum threshold. Both of these subcommands, strainFlye call p-mutation and strainFlye call r-mutation, take as input a minimum version of their corresponding threshold (either --min-p or --min-r).
These commands each output:
A BCF (binary variant call format) file describing all mutations called naïvely across the contigs, based on the minimum $p$ or $r$ threshold set (--min-p or --min-r).
A TSV (tab separated values) file describing the contigs' computed diversity indices, for various values of $p$ or $r$ (configurable using the --div-index-p-list or --div-index-r-list parameters).
- Long story short, diversity indices indicate how many of a contig's "sufficiently-covered" positions have called mutations: in general, higher diversity indices imply higher mutation rates.
- If enough positions in a contig are not "sufficiently-covered," then we will not compute the diversity index for this contig. Please see the Supplemental Material "Diversity index details" in the strainFlye paper for details.
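To make the "sufficiently-covered" idea concrete, here is a rough sketch of a diversity index computation. It follows the description above (a position counts as sufficiently covered if its coverage is at least min-read-number / $p$), but it simplifies the paper's exact rule for when a contig has too few sufficiently-covered positions:

```python
def diversity_index(coverages, is_mutated, p, min_read_number=5):
    """Fraction of sufficiently-covered positions with called p-mutations.

    coverages[i] is the (mis)match coverage at position i of a contig;
    is_mutated[i] says whether a p-mutation was called there. p is a
    fraction in (0, 0.5]. This is an illustrative sketch, not strainFlye's
    implementation.
    """
    min_cov = min_read_number / p
    covered_calls = [m for c, m in zip(coverages, is_mutated) if c >= min_cov]
    if not covered_calls:
        # Simplified stand-in for "we will not compute the diversity index"
        return None
    return sum(covered_calls) / len(covered_calls)
```

For example, with $p = 5\%$ and the default min-read-number of 5, a position needs coverage of at least 100 to count towards the index.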
6.2.2. Interpreting the output¶
The default minimum value of $p$ (or $r$) used in these commands is fairly low. As you might expect, using such a low threshold for calling a position as a mutation may yield many false positives: we will almost certainly identify many real mutations, but also many "false" mutations that happen to occur as the result of sequencing errors, alignment errors, etc. Viewed another way, the false discovery rate (FDR) of the mutation calls generated at this step (defined as the ratio of false positives to the total number of positive calls, true and false) will probably be high—although this depends on a variety of factors, including which contig(s) we are calling mutations in, what our goals are in the first place, etc.
FDR estimation and fixing¶
After we run this command, we can use strainFlye's FDR estimation and fixing functionality to attempt to address this problem. This process involves adjusting the "minimum" value of $p$ (or $r$) used for each contig to reduce the FDR as needed.
Depending on your dataset and your goals, you may or may not want to do this: as of writing, many of the analyses in the strainFlye paper don't perform FDR fixing on their input mutations, and instead just make use of "un-fixed" naïvely called $p$-mutations. (That being said, this is mostly a historical artifact of us implementing the FDR fixing code close to the end of the project.)
In general, if you are going to perform FDR estimation / fixing, we recommend only doing this for $p$-mutations (and not $r$-mutations); this is discussed in the strainFlye paper, in the Supplemental Material section named "Identifying mutations based solely on read counts."
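Conceptually, FDR fixing raises each target contig's threshold until the estimated FDR drops below the desired limit. The toy sketch below illustrates that idea only; it is not strainFlye's implementation, and the `fdr_by_p` input (a hypothetical mapping from candidate $p$ values to estimated FDRs for one target contig) is something strainFlye computes for you:

```python
def fixed_threshold(fdr_by_p, max_fdr):
    """Return the smallest candidate p whose estimated FDR is <= max_fdr,
    or None if no candidate threshold achieves the desired FDR."""
    for p in sorted(fdr_by_p):
        if fdr_by_p[p] <= max_fdr:
            return p
    return None
```

So, a contig whose estimated FDR at $p = 0.15\%$ is too high might still have its mutation calls kept at some higher threshold.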
6.3. Naïvely call $p$-mutations ($p = 0.15\%$) and compute diversity indices for various values of $p$¶
Now that we know what we're doing, we're ready to call mutations and compute diversity indices! We'll do $p$-mutation calling at a minimum $p$ of $0.15\%$, which matches what we used for Figure 2 in the paper. The default diversity index values of $p$ (ranging from $0.5\%$ to $50\%$) should be good for us.
Like alignment, this command will also take a while—we need to check each position in the alignment for each of the input contigs. If you'd like, you can use the --verbose flag to display some extra information while this command is running, to make the wait more tolerable (and assure you that it isn't frozen somewhere).
!strainFlye call p-mutation
Usage: strainFlye call p-mutation [OPTIONS]
Call p-mutations and compute diversity indices.
The primary parameter for this command is the lower bound of p, defined by
--min-p. The BCF output will include "mutations" for all positions that pass
this (likely very low) threshold; this BCF can be filtered using the
utilities contained in the "strainFlye fdr" module.
Options:
-c, --contigs PATH FASTA file of contigs in which to naïvely
call mutations. All contigs in this FASTA
file should also be contained in the BAM
file; it's ok if the BAM file contains
contigs not in this FASTA file (we'll ignore
them). [required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--min-p INTEGER RANGE Minimum value of p for which to call
p-mutations. This is scaled up by 100 (i.e.
the default of 50 corresponds to 50 / 100 =
0.5%) in order to bypass floating-point
precision issues. [default: 50; 0<x<=5000]
--min-alt-pos INTEGER RANGE In order for us to call a p-mutation at a
position, this position's alternate
nucleotide must be supported by at least
this many reads. [default: 2; x>=1]
--div-index-p-list TEXT List of values of p for which we'll compute
diversity indices. These should all be
separated by commas; and, as with --min-p,
these are scaled up by 100. Please don't use
commas as thousands separators. [default:
50,100,200,500,1000,2500,5000]
-m, --min-read-number FLOAT RANGE
Parameter that impacts the minimum
(mis)match coverage needed in order to
consider "counting" a position / mutation
towards the diversity index. Given a value
of p (converted to the range (0, 0.5]), a
position must have a coverage of at least
(--min-read-number / p) in order to be
"sufficiently covered" and thus counted
towards the diversity index. [default: 5;
x>0]
-o, --output-dir DIRECTORY Directory to which an output BCF file
(describing the called mutations), BCF index
file, and diversity index TSV file will be
written. Some temporary files will also be
written to this directory. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
!strainFlye call p-mutation \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--min-p 15 \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/call-p15
6.4. Estimating FDRs using the target-decoy approach¶
We now have both our initial mutation calls (which, as we've discussed, probably have a high FDR) and information about our contigs' diversity indices. We will use the target-decoy approach to attempt to estimate and thus control the FDR of our mutation calls. This is done by the strainFlye fdr estimate and strainFlye fdr fix commands.
As discussed in our paper, we can select one of our $C$ contigs as a decoy contig (a.k.a. a decoy genome), and compute a mutation rate for it ($\text{rate}_{\text{decoy}}$). For each of the other $C - 1$ target contigs, we can then estimate the FDR of the mutations identified in that contig as $\dfrac{\text{rate}_{\text{decoy}}}{\text{rate}_{\text{target}}}$.
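In code, this estimate is just a ratio of mutation rates. (Here each rate is approximated as the number of called mutations divided by contig length; strainFlye's actual computation is more careful about, e.g., which positions are considered.)

```python
def estimated_fdr(decoy_mutations, decoy_length, target_mutations, target_length):
    """Target-decoy FDR estimate: rate_decoy / rate_target.

    Each rate is approximated as (# called mutations) / (contig length).
    """
    rate_decoy = decoy_mutations / decoy_length
    rate_target = target_mutations / target_length
    return rate_decoy / rate_target
```

For example, a decoy with 10 called mutations per Mbp, against a target with 100 called mutations per Mbp, gives an estimated FDR of 10%.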
6.4.1. Manually selecting a decoy contig¶
If you'd like, we could go through the diversity indices produced by strainFlye call p-mutation ourselves, in an attempt to select a reasonable-seeming decoy contig. This notebook demonstrates this sort of process.
6.4.2. Letting strainFlye fdr estimate automatically select a decoy contig¶
The optional notebook discussed above shows that edge_6104 is probably a good decoy contig, so we could, if desired, just pass it to strainFlye fdr estimate using that command's -dc or --decoy-contig option. However, to illustrate another option, we'll instead pass our diversity index TSV file to strainFlye fdr estimate and let it do the job of selecting a decoy contig. (Spoiler alert: it'll select edge_6104 anyway.)
The Methods section of our paper provides details about how the automatic decoy contig selection algorithm works. If you'd prefer, you can also go through the exact source code for it, which is reasonably well-documented: see the autoselect_decoy() function in this file.
!strainFlye fdr
Usage: strainFlye fdr [OPTIONS] COMMAND [ARGS]...

  [+] Estimate and fix FDRs for contigs' naïve mutation calls.

Options:
  -h, --help  Show this message and exit.

Commands:
  estimate  Estimate the FDRs of contigs' mutation calls.
  fix       Fix contigs' mutation calls' estimated FDRs to an upper limit.
!strainFlye fdr estimate
Usage: strainFlye fdr estimate [OPTIONS]
Estimate the FDRs of contigs' mutation calls.
We do this using the target-decoy approach (TDA). Given a set of C contigs,
we select a "decoy contig" with relatively few called mutations. We then
compute a mutation rate for this decoy contig, and use this mutation rate
(along with the mutation rates of the other C - 1 "target" contigs) to
estimate the FDRs of all of these target contigs' mutation calls.
We can produce multiple FDR estimates for a single target contig's calls by
varying the p or r threshold used (from the --min-p or --min-r threshold
used to generate the input BCF file, up to the --high-p or --high-r
threshold given here). Using this information (and information about the
numbers of mutations called per megabase), we can plot an FDR curve for a
given target contig's mutation calls.
This command accepts an input BCF file of p- or r-mutations; however, in
general we recommend using p-mutations (rather than r-mutations) for FDR
estimation / fixing. Please see the Supplemental Material section of the
strainFlye paper named "Identifying mutations based solely on read counts"
for details.
Options:
-c, --contigs PATH FASTA file of contigs for which we will
estimate mutation calls' FDRs. All contigs
in this FASTA file should also be contained
in the BAM and BCF files. It's ok if the BAM
file contains contigs not in this FASTA file
(we'll ignore them), but all contigs
described in the BCF file's header must also
be in this FASTA file. [required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing naïvely called
p- or r-mutations in the FASTA file's
contigs, produced by "strainFlye call".
[required]
-di, --diversity-indices PATH TSV file describing the diversity indices of
a set of contigs. Used to automatically
select a decoy contig. This option is
mutually exclusive with --decoy-contig.
[default: (nothing)]
-dc, --decoy-contig TEXT Name of a specific contig to use as the
decoy contig for FDR estimation. This option
is mutually exclusive with --diversity-
indices. [default: (nothing)]
-dctx, --decoy-contexts [Full|CP2|Tv|Nonsyn|Nonsense|CP2Tv|CP2Nonsyn|CP2Nonsense|TvNonsyn|TvNonsense|CP2TvNonsense|Everything]
"Context-dependent" types of positions
and/or mutations to which the computation of
mutation rates in the decoy contig will be
limited. The "Full" option will consider the
entire decoy contig; all other options will
limit the computation to certain types of
positions and/or potential mutations. (CP2
will focus on positions in the second codon
position of genes predicted in the decoy
contig using Prodigal; Nonsyn will focus on
potential nonsynonymous mutations in these
genes; Nonsense will focus on potential
nonsense mutations in these genes; and Tv
will focus on potential transversion
mutations.) You can specify this option
multiple times to generate multiple sets of
FDR estimates. (You can also specify
"Everything" to use all available contexts.)
[default: CP2; required]
-hp, --high-p INTEGER RANGE (Only applies if the input BCF file
describes p-mutations.) p-mutations with a
mutation rate (freq(pos)) greater than or
equal to this are considered "indisputable,"
and will not be included in FDR estimation.
Like the values of p used as parameters to
"strainFlye call p-mutation", this is in the
range (0, 5000], such that h = N corresponds
to (N / 100)%. Corresponds to the
"highFrequency" threshold mentioned in the
paper. [default: 500; 0<x<=5000]
-hr, --high-r INTEGER RANGE (Only applies if the input BCF file
describes r-mutations.) r-mutations with an
alternate nucleotide read coverage greater
than or equal to this are considered
indisputable, and will not be included in
FDR estimation. [default: 100; x>0]
-dml, --decoy-min-length INTEGER RANGE
Minimum length of a potential decoy contig
(in bp). Only used if --diversity-indices is
specified. [default: 1000000; x>0]
-dmac, --decoy-min-average-coverage FLOAT RANGE
Minimum average (mis)match coverage of a
potential decoy contig. Only used if
--diversity-indices is specified. [default:
500; x>0]
-o, --output-dir DIRECTORY Directory to which TSV files describing the
estimated FDRs and numbers of naïvely called
mutations per megabase will be written. In
all TSV files, rows correspond to target
contigs and columns correspond to p or r
values. We will generate one estimated FDR
TSV file for every time --decoy-context is
specified, and just one number-of-naïvely-
called-mutations-per-megabase file.
[required]
-h, --help Show this message and exit.
!strainFlye fdr estimate \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
--diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
--decoy-contexts Everything \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info
Using strainFlye version "0.1.0-dev".
--------
fdr estimate @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_atleast_1Mbp.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input diversity indices file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv
Manually-set decoy contig: None
Decoy contig context-dependent position / mutation type(s): ('Everything',)
High p threshold (only used if the BCF describes p-mutations): 500
High r threshold (only used if the BCF describes r-mutations): 100
Minimum length of a potential decoy contig (only used if diversity indices are specified): 1,000,000 bp
Minimum average coverage of a potential decoy contig (only used if diversity indices are specified): 500.0x
Output directory: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info
--------
fdr estimate @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
fdr estimate @ 5.42s: The FASTA file describes 468 contig(s).
fdr estimate @ 5.50s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
fdr estimate @ 19.55s: The FASTA file's contig(s) and BCF file's contig(s) match.
fdr estimate @ 19.55s: Also, the input BCF file describes p-mutations (minimum p = 15).
fdr estimate @ 19.56s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
fdr estimate @ 19.56s: So far, these files seem good.
--------
fdr estimate @ 19.56s: Selecting a decoy contig based on the diversity indices...
fdr estimate @ 19.58s: Selected edge_6104 as the decoy contig.
fdr estimate @ 19.58s: Verified that this decoy contig is present in the FASTA, BAM, and BCF files.
fdr estimate @ 19.58s: (Sorry for doubting you.)
--------
fdr estimate @ 19.58s: Determining range of value(s) of p to consider...
fdr estimate @ 19.58s: We'll consider 485 value(s) of p: from 15 to 499.
fdr estimate @ 19.58s: p-mutations for p ≥ 500 will be considered "indisputable."
fdr estimate @ 19.58s: These "indisputable" mutations won't be included in the FDR estimation results.
--------
fdr estimate @ 19.58s: Computing mutation rates for edge_6104 at these threshold values, for each of the 11 decoy context(s)...
fdr estimate @ 21.67s: At least one of the decoy context(s) requires us to have gene predictions for edge_6104; running Prodigal...
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Single Genome, Phase: Training
Reading in the sequence(s) to train...1289244 bp seq created, 32.65 pct GC
Locating all potential starts and stops...30910 nodes
Looking for GC bias in different frames...frame bias scores: 2.68 0.19 0.12
Building initial set of genes to train from...done!
Creating coding model and scoring nodes...done!
Examining upstream regions and training starts...done!
-------------------------------------
Request: Single Genome, Phase: Gene Finding
Finding genes in sequence #1 (1289244 bp)...done!
fdr estimate @ 27.01s: Finished running Prodigal! It predicted 1,297 gene(s) in edge_6104.
fdr estimate @ 27.83s: At least one of the decoy context(s) requires us to know "unreasonable" positions (where reference and consensus disagree) in edge_6104; going through alignment...
fdr estimate @ 804.88s: Done: identified 0 unreasonable position(s) in edge_6104.
fdr estimate @ 804.88s: Computing mutation rates for decoy context "CP2" (1 / 11)...
fdr estimate @ 804.89s: Done computing "CP2" mutation rates for edge_6104.
fdr estimate @ 804.89s: Computing mutation rates for decoy context "CP2Nonsense" (2 / 11)...
fdr estimate @ 811.65s: Done computing "CP2Nonsense" mutation rates for edge_6104.
fdr estimate @ 811.65s: Computing mutation rates for decoy context "CP2Nonsyn" (3 / 11)...
fdr estimate @ 819.02s: Done computing "CP2Nonsyn" mutation rates for edge_6104.
fdr estimate @ 819.02s: Computing mutation rates for decoy context "CP2Tv" (4 / 11)...
fdr estimate @ 819.03s: Done computing "CP2Tv" mutation rates for edge_6104.
fdr estimate @ 819.03s: Computing mutation rates for decoy context "CP2TvNonsense" (5 / 11)...
fdr estimate @ 825.73s: Done computing "CP2TvNonsense" mutation rates for edge_6104.
fdr estimate @ 825.73s: Computing mutation rates for decoy context "Full" (6 / 11)...
fdr estimate @ 825.75s: Done computing "Full" mutation rates for edge_6104.
fdr estimate @ 825.75s: Computing mutation rates for decoy context "Nonsense" (7 / 11)...
fdr estimate @ 845.39s: Done computing "Nonsense" mutation rates for edge_6104.
fdr estimate @ 845.39s: Computing mutation rates for decoy context "Nonsyn" (8 / 11)...
fdr estimate @ 866.92s: Done computing "Nonsyn" mutation rates for edge_6104.
fdr estimate @ 866.92s: Computing mutation rates for decoy context "Tv" (9 / 11)...
fdr estimate @ 867.05s: Done computing "Tv" mutation rates for edge_6104.
fdr estimate @ 867.05s: Computing mutation rates for decoy context "TvNonsense" (10 / 11)...
fdr estimate @ 886.23s: Done computing "TvNonsense" mutation rates for edge_6104.
fdr estimate @ 886.23s: Computing mutation rates for decoy context "TvNonsyn" (11 / 11)...
fdr estimate @ 907.39s: Done computing "TvNonsyn" mutation rates for edge_6104.
fdr estimate @ 907.41s: Done computing mutation rates for all decoy context(s) for edge_6104.
--------
fdr estimate @ 907.41s: Computing mutation rates and FDR estimates for the 467 target contig(s)...
fdr estimate @ 941.67s: Done.
--------
fdr estimate @ 941.72s: Done.
Let's check on the TSV files that got written to the output directory. We should see one file for every decoy context, indicating the FDR estimates for each target contig for this context; and one lone "number of mutations per Mb" file, indicating the number of mutations per megabase for each target contig.
In general, we can plot these as FDR curves by using the FDR estimates as x-axis values and the "number of mutations per Mb" values as y-axis values.
!ls /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/
fdr-CP2Nonsense.tsv fdr-CP2Tv.tsv fdr-TvNonsense.tsv fdr-CP2Nonsyn.tsv fdr-Full.tsv fdr-TvNonsyn.tsv fdr-CP2.tsv fdr-Nonsense.tsv fdr-Tv.tsv fdr-CP2TvNonsense.tsv fdr-Nonsyn.tsv num-mutations-per-mb.tsv
6.5. Plotting FDR curves¶
This notebook (in the analysis code repository) demonstrates how we can plot some or all of these FDR estimates for our target contigs. See the screenshot below for an example figure, showing most of the decoy contexts for eight target contigs in the SheepGut dataset. (I put a decent amount of work into making those plots look fancy for the paper, but you don't need to do all this work unless you want to!)
Please note that this figure was created using a slightly older alignment (from before version 0.2.0 of strainFlye), so it will probably have slight differences from an analogous figure created using the latest version of strainFlye.
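If you just want a quick look without the notebook, here is a minimal sketch of pulling one contig's FDR curve out of these TSVs. It assumes the layout described by strainFlye fdr estimate's help text (rows are target contigs, columns are p values); the exact index column label may differ.

```python
import pandas as pd

def fdr_curve(fdr_tsv, mut_tsv, contig):
    """Return (x, y) lists for one target contig's FDR curve:
    x = estimated FDR at each p threshold, y = naively called mutations
    per megabase at the same thresholds. Assumes rows are target contigs
    and columns are p values, with contig names in the first column."""
    fdr = pd.read_csv(fdr_tsv, sep="\t", index_col=0)
    mut = pd.read_csv(mut_tsv, sep="\t", index_col=0)
    return fdr.loc[contig].tolist(), mut.loc[contig].tolist()

# Usage might look like (filenames taken from the listing above):
# x, y = fdr_curve("p15-fdr-info/fdr-Full.tsv",
#                  "p15-fdr-info/num-mutations-per-mb.tsv", "edge_1671")
# ...then plot with, e.g., matplotlib: plt.plot(x, y)
```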
6.6. Fixing mutation calls' FDRs to an upper limit of $1\%$¶
FDR estimates are nice (they lend some context to our naïve mutation calls), but we can take things a step further. strainFlye has the ability to fix the estimated FDR for each target contig's mutation calls to a specified upper limit.
Since we already have FDR curves showing how the estimated FDR for each target contig varies as $p$ varies, fixing the estimated FDR to an upper limit $F$ amounts to choosing, for each target contig, the "best" value of $p$ that yields an FDR ≤ $F$. The strainFlye fdr fix command takes care of this for us. (If you'd like more details on this process, see the Methods section of the strainFlye paper, and/or this unnecessarily fancy comment in the strainFlye code.)
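Per the fdr fix help text below, the "best" value is the lowest $p$ whose estimated FDR stays within the limit, since lower $p$ thresholds retain more mutation calls. The selection step for a single contig can be sketched as (a simplified illustration, not strainFlye's code):

```python
def optimal_p(fdr_by_p, max_fdr):
    """Given a mapping from p threshold to estimated FDR (both FDRs in %),
    return the lowest p whose estimated FDR is <= max_fdr, or None if no
    p qualifies. The lowest qualifying p is "best" because lower p
    thresholds retain more called mutations."""
    qualifying = [p for p, fdr in fdr_by_p.items() if fdr <= max_fdr]
    return min(qualifying) if qualifying else None

# Hypothetical FDR estimates for one contig, fixed at FDR <= 1%:
print(optimal_p({15: 4.2, 50: 1.8, 100: 0.9, 200: 0.4}, 1.0))  # 100
```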
We'll use the "CP2" FDR curve as our source of FDR estimates.
!strainFlye fdr fix
Usage: strainFlye fdr fix [OPTIONS]
Fix contigs' mutation calls' estimated FDRs to an upper limit.
This takes as input the estimated FDRs from "strainFlye fdr estimate" (if
you used multiple decoy contexts, then you will need to choose which set of
FDR estimates to use here) to guide us on how to fix the FDR for each
contig. Note that mutations that passed the "high" p or r threshold
specified for "strainFlye fdr estimate", and thus were not used for FDR
estimation, will all be included in the output BCF file from this command;
these mutations are considered "indisputable."
We include indisputable mutations from the decoy contig and from all target
contigs in our output BCF file. We will only consider including non-
indisputable mutations from the target contigs: the decision of which non-
indisputable mutations will be included is based on the lowest p or r
parameter for a target contig that yields an estimated FDR ≤ the fixed FDR
given here.
Options:
-b, --bcf PATH Indexed BCF file describing naïvely called p- or
r-mutations, produced by "strainFlye call".
[required]
-fi, --fdr-info FILE One of the estimated FDR TSV files produced by
"strainFlye fdr estimate". [required]
--fdr FLOAT RANGE False discovery rate at which identified mutations
will be fixed. This is interpreted as a
percentage: the default of 1 corresponds to an FDR
of 1%. We do not restrict this to an upper limit,
since estimated FDRs can technically exceed 100%.
[default: 1; x>0]
-o, --output-bcf FILE Filepath to which an output indexed BCF file
(describing the called mutations at the optimal p
or r value for each contig, along with any
"indisputable" mutations relative to the --high-p
or --high-r parameter used for strainFlye fdr
estimate) will be written. [required]
--verbose / --no-verbose Display extra details while writing the filtered
BCF. [default: no-verbose]
-h, --help Show this message and exit.
!strainFlye fdr fix \
--bcf /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf \
--fdr-info /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/fdr-CP2.tsv \
--fdr 1 \
--output-bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Using strainFlye version "0.1.0-dev".
--------
fdr fix @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/naive-calls.bcf
Input FDR estimate file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr-info/fdr-CP2.tsv
FDR at which to fix mutation calls: 1.00%
Verbose?: No
Output BCF file with mutation calls at the fixed FDR: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
--------
fdr fix @ 0.00s: Loading and checking BCF and TSV files...
fdr fix @ 17.34s: Looks good so far; decoy contig seems to be edge_6104.
fdr fix @ 17.34s: Looks like the cutoff for "indisputable" mutations was p = 500.
fdr fix @ 17.34s: All mutations passing this cutoff will be included in the output BCF file.
--------
fdr fix @ 17.34s: Based on the FDR information, finding optimal values of p for each contig...
fdr fix @ 17.36s: Done.
fdr fix @ 17.36s: For 239 / 467 contigs, there exist values of p (at least, considering the range from p = 15 to p = 499) that yield estimated FDRs ≤ 1.0%.
fdr fix @ 17.36s: These values range from p = 37 (edge_1300) to p = 475 (edge_1016).
fdr fix @ 17.36s: The mean of these values is p = 255.96.
--------
fdr fix @ 17.36s: Writing a filtered BCF file (to a temporary location, for now) including both (1) indisputable mutations from all contigs and (2) non-indisputable mutations from the target contigs that result in a FDR ≤ 1.00%...
fdr fix @ 32.55s: Done.
--------
fdr fix @ 32.55s: Updating the filtered BCF file's header to indicate that FDR fixing was done...
fdr fix @ 36.15s: Done.
--------
fdr fix @ 36.15s: Indexing the BCF file...
fdr fix @ 37.04s: Done indexing the BCF file.
--------
fdr fix @ 37.06s: Done.
It took us a few steps, but we have now generated a file (p15-fdr1pct.bcf) of $p$-mutation calls at a fixed (estimated) FDR of 1%.
Although our methodology has a few limitations (e.g. we don't support calling multi-allelic mutations yet), this BCF file can be used downstream for many types of analyses. In the next sections of the tutorial we'll demonstrate the additional commands supported by strainFlye, most of which make use of these mutation calls in some way.
7. Identify hotspots and coldspots¶
We've called mutations and estimated these calls' FDRs. Now we can get to the fun part: what's going on with these mutations?
Often, we're interested in analyzing where the mutations in a contig are located. Are there any particular "hotspot" regions where there are surprisingly many mutations? Are there any "coldspot" regions where there are, surprisingly, no or few mutations?
We've kept strainFlye's functionality for identifying these types of regions fairly minimal at the moment. Here we'll demonstrate identifying very basic hotspots and coldspots using strainFlye spot's commands.
!strainFlye spot
Usage: strainFlye spot [OPTIONS] COMMAND [ARGS]... [+]
Identify putative mutational hotspots or coldspots.
Many methods exist for identifying these sorts of hotspots or coldspots; strainFlye's implementations of these methods are intended mostly as a quick proof-of-concept for replicating the results shown in our paper, and are not extremely "feature-rich" quite yet.
Options:
-h, --help Show this message and exit.
Commands:
hot-features Identify hotspot features (for example, genes).
cold-gaps Identify long coldspot "gaps" without any mutations.
7.1. Identify hotspot features¶
!strainFlye spot hot-features
Usage: strainFlye spot hot-features [OPTIONS]
Identify hotspot features (for example, genes).
By "feature", we refer to a single continuous region within a contig, as
described in the file given for --features. These regions could describe
anything: predicted protein-coding genes, introns or exons, intergenic
regions of interest, etc. For now, we treat each feature independently (e.g.
we don't lump together exons from the same "Parent" gene; each feature is
considered separately as a potential "hotspot").
You can configure whether or not we classify a feature as a hotspot by
adjusting the --min-num-mutations and --min-perc-mutations parameters; at
least one of these parameters must be specified. If both parameters are
specified, then both checks (number of mutations in a feature, and
percentage of mutations in a feature) will need to pass in order for us to
label a feature as a hotspot.
Options:
-b, --bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-f, --features PATH Generic Feature Format version 3 (GFF3) file
describing "features" (e.g. predicted
protein-coding genes) in the contigs. We
will ignore features located on contigs that
are not described in the BCF file's header.
[required]
-mn, --min-num-mutations INTEGER RANGE
Label a feature as a "hotspot" if it
contains at least this many mutations.
[default: (no check); x>0]
-mp, --min-perc-mutations FLOAT RANGE
Label a feature as a "hotspot" if its
percentage of mutations ((# mutations /
feature length) × 100) is at least this
value. [default: (no check); 0<x<=100]
-o, --output-hotspots FILE Filepath to which an output tab-separated
values (TSV) file describing hotspot
features across all contigs will be written.
[required]
-h, --help Show this message and exit.
7.1.1. A note about "features"¶
Although we should be familiar with the FASTA and BCF input files by this point, the -f / --features input (a GFF3 file) may be surprising. strainFlye leaves the task of creating this file up to the user.
Predicted genes' coordinates are probably the most obvious type of "feature" for which we could look for hotspots. If you don't have gene predictions for your contigs yet, Prodigal is good (and should have been installed along with strainFlye, since it's used internally elsewhere in strainFlye). Here, we'll use it to predict protein-coding genes across all contigs.
It's important to note that Prodigal does not predict eukaryotic genes (i.e. genes that are split up into introns and exons). These genes will thus not be a perfect representation of all protein-coding genes in all contigs in the dataset, since we know that this sample does contain at least some eukaryotic genomes. (However, if you use another tool for predicting eukaryotic genes that produces GFF3 output, then these should also be usable as "features" for this command.)
Note that, now that we're done with FDR estimation/fixing, the rest of this tutorial will mostly focus on the three selected contigs from the paper (described in sheepgut_contigs_selected3.fasta, which we generated above). Because of this, we'll only bother running Prodigal on these three contigs (so we won't predict genes / identify hotspots in the other 465 long contigs).
# See https://github.com/hyattpd/Prodigal/wiki/cheat-sheet for details about these options.
#
# Note that, for the paper, I ran Prodigal in "normal" mode on certain contigs individually
# (https://github.com/fedarko/sheepgut/blob/main/inspect-seqs/prodigal.py), but here we just
# run Prodigal in "anonymous" mode on all contigs at once. The results should be fairly similar,
# although there'll probably be some differences.
!prodigal \
-i /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
-f gff \
-c \
-p meta \
-o /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff
-------------------------------------
PRODIGAL v2.6.3 [February, 2016]
Univ of Tenn / Oak Ridge National Lab
Doug Hyatt, Loren Hauser, et al.
-------------------------------------
Request: Metagenomic, Phase: Training
Initializing training files...done!
-------------------------------------
Request: Metagenomic, Phase: Gene Finding
Finding genes in sequence #1 (2153394 bp)...done!
Finding genes in sequence #2 (2806161 bp)...done!
Finding genes in sequence #3 (1289244 bp)...done!
7.1.2. Running the command to identify hotspot features (in this case, gene predictions)¶
Now that we have our gene predictions, let's see whether any of them contain a lot of mutations. (Based on our findings in the paper, we know that these sorts of hotspots do exist in this dataset.)
strainFlye spot hot-features supports two types of basic thresholds for labelling a feature as a hotspot, --min-num-mutations and --min-perc-mutations. We'll use both here, and label a feature a "hotspot" if it meets both of the following criteria:
- it contains at least 5 mutations, and
- at least 2% of its positions have mutations.
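These two checks are simple enough to spell out directly. A minimal sketch of the criteria (an illustration only, not strainFlye's implementation):

```python
def is_hotspot(num_mutations, feature_length, min_num=5, min_perc=2.0):
    """A feature is labeled a hotspot only if it contains at least
    min_num mutations AND its percentage of mutated positions,
    (num_mutations / feature_length) * 100, is at least min_perc.
    (When only one option is given on the command line, strainFlye
    applies only that check.)"""
    perc_mutated = 100.0 * num_mutations / feature_length
    return num_mutations >= min_num and perc_mutated >= min_perc

# A 480 bp feature with 16 mutated positions (~3.33%) passes both checks:
print(is_hotspot(16, 480))  # True
```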
!strainFlye spot hot-features \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--features /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff \
--min-num-mutations 5 \
--min-perc-mutations 2 \
--output-hotspots /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv
Using strainFlye version "0.1.0-dev".
--------
spot hot-features @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Input feature file: /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff
Minimum number of mutations needed to call a feature a hotspot: 5
Minimum % of mutations needed to call a feature a hotspot: 2.0
Output file describing hotspot features: /Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv
--------
spot hot-features @ 0.00s: Loading and checking the BCF file...
spot hot-features @ 12.41s: Looks good so far.
--------
spot hot-features @ 12.41s: Going through features in the GFF3 file and identifying hotspot features...
spot hot-features @ 12.64s: Identified 165 hotspot feature(s) across all 468 contigs in the BCF file.
--------
spot hot-features @ 12.64s: Writing out this information to a TSV file...
spot hot-features @ 12.64s: Done.
--------
spot hot-features @ 12.65s: Done.
The output of this command isn't anything special: it's a TSV file in which each row describes an identified hotspot feature, defining "hotspots" based on the --min-num-mutations and --min-perc-mutations options we set earlier. Let's load this file using pandas.read_csv() and get a brief sense of what it looks like:
hotspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/hotspot-features-n5p2.tsv", sep="\t")
hotspots.head()
| | Contig | FeatureID | FeatureStart_1IndexedInclusive | FeatureEnd_1IndexedInclusive | NumberMutatedPositions | PercentMutatedPositions |
|---|---|---|---|---|---|---|
| 0 | edge_1671 | 1_77 | 67385 | 67864 | 16 | 3.33 |
| 1 | edge_1671 | 1_78 | 67943 | 68584 | 13 | 2.02 |
| 2 | edge_1671 | 1_81 | 70405 | 70773 | 44 | 11.92 |
| 3 | edge_1671 | 1_94 | 86530 | 88992 | 95 | 3.86 |
| 4 | edge_1671 | 1_96 | 89868 | 90809 | 57 | 6.05 |
Depending on your goals, you could focus on, for example, the hotspot genes with the highest mutation rates in a contig. We can find these for edge_1671 ("BACT1", as we name it in our paper) by filtering and then sorting the DataFrame:
# Ignore hotspots in CAMP / BACT2, and just focus on hotspots in BACT1
bact1_hotspots = hotspots[hotspots["Contig"] == "edge_1671"]
# Sort all the "hotspot genes" in BACT1 from high to low mutation rates (% of positions mutated).
# This is similar to the table of highly-mutated genes in the Supplemental Material of our paper,
# although unlike that table this uses FDR-fixed mutations (and it also uses different gene
# predictions, as discussed above).
bact1_hotspots.sort_values(["PercentMutatedPositions"], ascending=False)
| | Contig | FeatureID | FeatureStart_1IndexedInclusive | FeatureEnd_1IndexedInclusive | NumberMutatedPositions | PercentMutatedPositions |
|---|---|---|---|---|---|---|
| 47 | edge_1671 | 1_583 | 724871 | 725443 | 131 | 22.86 |
| 82 | edge_1671 | 1_860 | 1041656 | 1042084 | 86 | 20.05 |
| 118 | edge_1671 | 1_1217 | 1460034 | 1460204 | 34 | 19.88 |
| 14 | edge_1671 | 1_183 | 206606 | 207304 | 131 | 18.74 |
| 112 | edge_1671 | 1_1168 | 1402353 | 1402718 | 61 | 16.67 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | edge_1671 | 1_1427 | 1748001 | 1751735 | 76 | 2.03 |
| 67 | edge_1671 | 1_751 | 916664 | 918493 | 37 | 2.02 |
| 1 | edge_1671 | 1_78 | 67943 | 68584 | 13 | 2.02 |
| 16 | edge_1671 | 1_248 | 272253 | 273047 | 16 | 2.01 |
| 106 | edge_1671 | 1_1131 | 1361175 | 1361672 | 10 | 2.01 |
164 rows × 6 columns
7.2. Identify coldspot gaps¶
Similarly, strainFlye supports the identification of basic "coldspots"—here, defined as long gaps without any mutations. The main parameter is the minimum length needed to define a gap as a "coldspot." Let's test this out on the SheepGut dataset.
Note that the input to this command is just a BCF file—no contig needs to be given to it. So, this will implicitly find coldspots in all 468 long contigs in the SheepGut dataset described in the BCF file—not just the 3 selected contigs. That said, this command runs fairly quickly, even when considering many contigs, so it shouldn't be an issue.
!strainFlye spot cold-gaps
Usage: strainFlye spot cold-gaps [OPTIONS]
Identify long coldspot "gaps" without any mutations.
To clarify, we define a "gap" of length L on a contig as a collection of
continuous positions [N, N + 1, ... N + L - 2, N + L - 1] in which no
positions are mutations (based on the input BCF file).
If the --circular flag is specified, then we can loop around the contig from
right to left; otherwise, the left and right sides of the contig are hard
boundaries. To give an example of this, consider a 9-nucleotide contig that
has mutations at positions 4 and 6:
Mut. Mut.
1 2 3 4 5 6 7 8 9
If --circular is specified, then this contig has two gaps: one gap of length
1 (covering just position 5, between the two mutations), and another gap of
length 6 (starting at position 7 and looping around to position 3: [7, 8, 9,
1, 2, 3]).
If --circular is not specified, then this contig has three gaps: [1, 2, 3],
[5], and [7, 8, 9].
Options:
-b, --bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-l, --min-length INTEGER RANGE Label a gap between mutations in a contig as
a "coldspot" if the gap is at least this
long (in bp). [default: 5000; x>0;
required]
--circular / --no-circular If --circular is specified, we'll assume
that all contigs are circular: we'll
consider the gap "looping around" from the
rightmost mutation in each contig to the
leftmost mutation in the contig as a
potential coldspot. Otherwise, we will
assume all contigs are linear. [default:
no-circular]
--exact-pvals / --no-exact-pvals
If --exact-pvals is specified, we'll use the
method for computing exact longest-gap
p-values given in equation (3.1) of (Naus
1982). We use Python's decimal library with
default parameters to try to deal with large
numbers, but we can't guarantee that this
won't cause the program to fail in certain
cases. If --exact-pvals is not specified,
we'll use equation (3.3) of (Naus 1982),
which gives an approximation of the p-value.
[default: exact-pvals]
-o, --output-coldspots FILE Filepath to which an output tab-separated
values (TSV) file describing coldspots will
be written. The longest gap in each contig
will also be given a p-value, indicating the
probability of the longest gap being at
least this long (given the length of and
number of mutated positions in this contig,
and assuming that mutations occur with a
constant probability at any position in the
contig). See (Naus 1982) for details.
[required]
-h, --help Show this message and exit.
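The gap definition in the help text above is easy to sketch in Python. The function below is a simplified illustration (not strainFlye's implementation), and reproduces the 9-nucleotide worked example:

```python
def mutation_gaps(contig_length, mutated_positions, circular=False):
    """Return the lengths of maximal runs of consecutive non-mutated
    positions (1-indexed). With circular=True, the run after the last
    mutation wraps around to join the run before the first mutation."""
    muts = sorted(set(mutated_positions))
    if not muts:
        return [contig_length]
    gaps = []
    # gaps strictly between consecutive mutations
    for a, b in zip(muts, muts[1:]):
        if b - a > 1:
            gaps.append(b - a - 1)
    left = muts[0] - 1                # positions before the first mutation
    right = contig_length - muts[-1]  # positions after the last mutation
    if circular:
        if left + right > 0:
            gaps.append(left + right)  # the wrap-around gap
    else:
        if left > 0:
            gaps.append(left)
        if right > 0:
            gaps.append(right)
    return sorted(gaps)

# The 9-nucleotide example above, with mutations at positions 4 and 6:
print(mutation_gaps(9, [4, 6]))                 # [1, 3, 3] (linear)
print(mutation_gaps(9, [4, 6], circular=True))  # [1, 6]
```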
!strainFlye spot cold-gaps \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--output-coldspots /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv
Using strainFlye version "0.1.0-dev".
--------
spot cold-gaps @ 0.00s: Starting...
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Minimum coldspot gap length: 5,000 bp
Check for circular coldspot gaps?: No
Compute exact longest-gap p-values?: Yes
Output file describing coldspot gaps: /Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv
--------
spot cold-gaps @ 0.00s: Loading and checking the BCF file...
spot cold-gaps @ 12.16s: Looks good so far.
--------
spot cold-gaps @ 12.16s: Going through contigs and identifying coldspot gaps...
spot cold-gaps @ 15.74s: Identified 16,154 coldspot gap(s) across all 468 contigs in the BCF file.
--------
spot cold-gaps @ 15.74s: Writing out this information to a TSV file...
spot cold-gaps @ 15.75s: Done.
--------
spot cold-gaps @ 15.76s: Done.
Again, cold-gaps will output a simple TSV file describing its identified coldspots:
coldspots = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/coldspot-gaps-minlen5000-nocircular.tsv", sep="\t")
coldspots.head()
| | Contig | Start_1IndexedInclusive | End_1IndexedInclusive | Length | LongestGap_P_Value |
|---|---|---|---|---|---|
| 0 | edge_8 | 3492 | 10939 | 7448 | NaN |
| 1 | edge_8 | 10941 | 16941 | 6001 | NaN |
| 2 | edge_8 | 35630 | 45223 | 9594 | NaN |
| 3 | edge_8 | 45225 | 53014 | 7790 | NaN |
| 4 | edge_8 | 58003 | 63032 | 5030 | NaN |
There are many ways we could make use of this information—similarly to the hotspot example, we could try sorting these gaps from longest to shortest for the BACT1 contig. This is shown below. (Of the gaps that we identify, the longest one in each contig is assigned a p-value; please see the Supplemental Material of our paper for details on how we compute this, and the assumptions made.)
coldspots[coldspots["Contig"] == "edge_1671"].sort_values(["Length"], ascending=False)
| | Contig | Start_1IndexedInclusive | End_1IndexedInclusive | Length | LongestGap_P_Value |
|---|---|---|---|---|---|
| 2536 | edge_1671 | 1216892 | 1239536 | 22645 | 9.443778e-106 |
| 2537 | edge_1671 | 1618448 | 1637614 | 19167 | NaN |
| 2538 | edge_1671 | 1797912 | 1813581 | 15670 | NaN |
| 2534 | edge_1671 | 1194740 | 1207381 | 12642 | NaN |
| 2532 | edge_1671 | 979572 | 990495 | 10924 | NaN |
| 2530 | edge_1671 | 740701 | 750399 | 9699 | NaN |
| 2535 | edge_1671 | 1207881 | 1216890 | 9010 | NaN |
| 2539 | edge_1671 | 2140896 | 2147770 | 6875 | NaN |
| 2533 | edge_1671 | 1183003 | 1188753 | 5751 | NaN |
| 2540 | edge_1671 | 2147772 | 2153394 | 5623 | NaN |
| 2531 | edge_1671 | 974177 | 979570 | 5394 | NaN |
| 2529 | edge_1671 | 541220 | 546507 | 5288 | NaN |
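Since only the longest gap in each contig receives a p-value, it can also be handy to pull out just that row for every contig at once. Here is a small pandas sketch of this, using a toy stand-in DataFrame (with made-up subsets of the columns/rows above) rather than the full TSV:

```python
import pandas as pd

# Toy stand-in for the coldspots DataFrame; the real table comes from the
# TSV produced by "strainFlye spot cold-gaps".
coldspots = pd.DataFrame({
    "Contig": ["edge_8", "edge_8", "edge_1671", "edge_1671"],
    "Length": [7448, 6001, 22645, 5288],
})

# Sort all gaps from longest to shortest, then keep the first (i.e.
# longest) gap seen for each contig.
longest = (coldspots.sort_values("Length", ascending=False)
                    .groupby("Contig")
                    .head(1))
```

Applying the same `sort_values(...).groupby(...).head(1)` chain to the real coldspots DataFrame would give one row per contig, each corresponding to the gap that was assigned a p-value.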
8. Phasing analyses¶
8.1. Generating "smoothed haplotypes"¶
Given our called mutations, we can attempt to generate haplotypes that respect these mutations using strainFlye's smooth module.
The details and motivation for this are explained in depth in our paper. To briefly summarize: we will convert the reads aligned to each contig into "smoothed reads," which contain our called mutations and no other variation. Next, we will (optionally, depending on the --virtual-reads parameter) construct "virtual reads" to fill in low-coverage regions. Finally, we will assemble these reads using LJA to construct "smoothed haplotypes."
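To make the "smoothed read" idea concrete, here is a minimal Python sketch (an illustration only, not strainFlye's actual implementation, which operates on BAM alignments and also handles indels and alignments that it decides to ignore). A smoothed read matches the contig everywhere in its aligned span, except at called mutated positions, where the read's own base is kept:

```python
def smooth_read(contig_seq, aligned_pairs, mutated_positions):
    """Build a "smoothed" version of one linear alignment.

    contig_seq: contig sequence (string, 0-indexed).
    aligned_pairs: nonempty list of (contig_pos, read_base) tuples, one per
        aligned position, sorted by contig_pos. (Indels are ignored here
        for simplicity.)
    mutated_positions: set of 0-indexed positions with called mutations.
    """
    start = aligned_pairs[0][0]
    end = aligned_pairs[-1][0]
    # Start from the contig's own sequence over the aligned span...
    out = list(contig_seq[start:end + 1])
    # ...then restore the read's base only at called mutated positions.
    for pos, base in aligned_pairs:
        if pos in mutated_positions:
            out[pos - start] = base
    return "".join(out)
```

For example, a read aligned to positions 2–5 of a contig, differing from it at positions 3 and 4 but with only position 3 called as mutated, yields a smoothed read that keeps the read's allele at position 3 and the contig's base everywhere else.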
Note that these commands (both smooth create and smooth assemble) can take a while to run. You can add the --verbose flag to get more detailed logging information while these commands are running.
!strainFlye smooth
Usage: strainFlye smooth [OPTIONS] COMMAND [ARGS]...

  [+] Create and assemble smoothed and virtual reads.

Options:
  -h, --help  Show this message and exit.

Commands:
  create    Create smoothed and virtual reads for each contig.
  assemble  Assemble contigs' smoothed and virtual reads using LJA.
8.1.1. Creating smoothed and virtual reads¶
!strainFlye smooth create
Usage: strainFlye smooth create [OPTIONS]
Create smoothed and virtual reads for each contig.
Options:
-c, --contigs PATH FASTA file of contigs for which we will
create smoothed and virtual reads. All
contigs in this FASTA file should also be
contained in the BAM and BCF files; it's ok
if the BAM or BCF files contain contigs not
in this FASTA file (we'll ignore them).
[required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing single-
nucleotide mutations in a set of contigs.
[required]
-di, --diversity-indices PATH TSV file describing the diversity indices of
a set of contigs, produced by one of the
"strainFlye call" commands. Only used if
--virtual-reads is specified. Along with
diversity indices, these files list contigs'
average coverages. If --virtual-reads is
specified, we will need to know contigs'
average coverages. So, if a diversity
indices file is provided here, then we can
avoid re-computing average coverages.
(Otherwise, if --virtual-reads is specified
but no diversity indices file is provided,
we'll need to compute average coverages;
this will take some extra time.) [default:
(nothing)]
--virtual-reads / --no-virtual-reads
If --virtual-reads is specified, we'll
create virtual reads covering "low-coverage"
regions in contigs. [default: virtual-
reads]
-vrp, --virtual-read-well-covered-perc FLOAT RANGE
Only used if --virtual-reads is specified.
In a contig with average coverage (based on
the BAM file, and only considering match +
mismatch counts) C, we will define a
position in this contig (with coverage P,
based only on the smoothed reads) as low-
coverage if ((P / C) × 100) is less than
this percentage. [default: 95; 0<=x<=100;
required]
-vrf, --virtual-read-flank INTEGER RANGE
Only used if --virtual-reads is specified.
When we add virtual reads spanning a single
continuous low-coverage region, these reads
will start and end this many positions
before and after the region. (For example,
the default of 100 means that the virtual
reads created for a low-coverage region of
5,000 bp will all be 5,200 bp.) [default:
100; x>=0; required]
-o, --output-dir DIRECTORY Directory to which smoothed reads (and
virtual reads, if --virtual-reads is
specified) will be written. Each contig's
reads will be written to a gzipped FASTA
file in this directory named
[contig].fasta.gz. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
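The low-coverage rule behind -vrp/--virtual-read-well-covered-perc can be sketched in a few lines of Python (a simplified illustration under the definitions in the help text above, not strainFlye's actual code):

```python
def find_low_coverage_runs(smoothed_cov, avg_cov, well_covered_perc=95.0):
    """Return (start, end) pairs (0-indexed, inclusive) for each run of
    consecutive "low-coverage" positions in a contig.

    A position with smoothed-read coverage P is low-coverage if
    (P / C) * 100 < well_covered_perc, where C (avg_cov) is the contig's
    average coverage based on the BAM file. Virtual reads would then be
    created spanning each run, extended by --virtual-read-flank positions
    on either side.
    """
    runs, start = [], None
    for i, p in enumerate(smoothed_cov):
        if (p / avg_cov) * 100 < well_covered_perc:
            if start is None:
                start = i  # a new low-coverage run begins here
        elif start is not None:
            runs.append((start, i - 1))  # the current run just ended
            start = None
    if start is not None:
        runs.append((start, len(smoothed_cov) - 1))  # run reaches contig end
    return runs
```

With the default of 95%, a contig averaging 100x coverage would have every position with smoothed-read coverage below 95x flagged as low-coverage.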
!strainFlye smooth create \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--diversity-indices /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads \
--verbose
Using strainFlye version "0.1.0-dev".
--------
smooth create @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Input diversity indices file: /Poppy/mfedarko/sftests/tutorial-output/call-p15/diversity-indices.tsv
Create virtual reads?: Yes
"Well-covered" percentage used for virtual read creation: 95.00%
Create "flanks" of this length on either side of each virtual read: 100
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/reads
--------
smooth create @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
smooth create @ 0.10s: The FASTA file describes 3 contig(s).
smooth create @ 0.19s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
smooth create @ 12.33s: All FASTA contig(s) are included in the BCF file (the header of this BCF file describes 468 contig(s)).
smooth create @ 12.34s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
smooth create @ 12.34s: So far, these files seem good.
--------
smooth create @ 12.34s: All contigs must be > (2 × --virtual-read-flank) = 200 bp long. Checking this...
smooth create @ 12.34s: All contigs meet this minimum length.
--------
smooth create @ 12.34s: Going through contigs and creating smoothed and virtual reads...
smooth create @ 12.34s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
smooth create @ 12.39s: Contig edge_1671 has 23,821 mutated position(s).
smooth create @ 5,086.59s: From the 262,757 linear alignment(s) to contig edge_1671: created 228,902 smoothed read(s) and ignored 33,855 linear alignment(s).
smooth create @ 5,087.21s: Contig edge_1671 (average coverage 1,415.55x, based on the BAM file) has 538 run(s) of consecutive low-coverage (≤ 1,344.77x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 5,542.74s: Created 43,596 virtual read(s) total for contig edge_1671.
smooth create @ 5,542.77s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
smooth create @ 5,542.78s: Contig edge_2358 has 17 mutated position(s).
smooth create @ 8,102.10s: From the 742,784 linear alignment(s) to contig edge_2358: created 719,654 smoothed read(s) and ignored 23,130 linear alignment(s).
smooth create @ 8,102.85s: Contig edge_2358 (average coverage 2,993.59x, based on the BAM file) has 612 run(s) of consecutive low-coverage (≤ 2,843.91x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 8,518.15s: Created 103,347 virtual read(s) total for contig edge_2358.
smooth create @ 8,518.17s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
smooth create @ 8,518.18s: Contig edge_6104 has 35 mutated position(s).
smooth create @ 10,215.96s: From the 476,373 linear alignment(s) to contig edge_6104: created 476,232 smoothed read(s) and ignored 141 linear alignment(s).
smooth create @ 10,216.38s: Contig edge_6104 (average coverage 4,158.79x, based on the BAM file) has 361 run(s) of consecutive low-coverage (≤ 3,950.85x) positions (based on smoothed read coverages). Creating virtual reads...
smooth create @ 10,398.82s: Created 80,090 virtual read(s) total for contig edge_6104.
smooth create @ 10,398.83s: Done.
--------
smooth create @ 10,398.93s: Done.
8.1.2. Assembling these reads¶
This step assumes that LJA (specifically, its simple_ec branch) has already been installed. Please see LJA's manual for installation instructions.
I have installed LJA at a specific location on our cluster. So that strainFlye can easily run LJA on each contig's reads file, we pass the location of LJA's binary executable to strainFlye below using the --lja-bin parameter.
!strainFlye smooth assemble
Usage: strainFlye smooth assemble [OPTIONS]
Assemble contigs' smoothed and virtual reads using LJA.
Please note that this command relies on the "simple_ec" branch of LJA being
installed on your system. See strainFlye's README (and/or LJA's manual) for
details on installing LJA.
Options:
-r, --reads-dir DIRECTORY Directory produced by "strainFlye smooth create"
containing smoothed (and optionally virtual)
reads. We will use LJA to assemble each
*.fasta.gz file in this directory (representing
reads from different contigs) independently.
[required]
-p, --lja-params TEXT Additional parameters to pass to LJA, besides
the --reads and --output-dir parameters. To
explain our defaults: the --simpleec flag is
currently only available on the simple_ec branch
of LJA. This flag tells LJA to perform "simple"
error correction by removing all edges in the de
Bruijn graph with k-mer coverage less than
--Cov-threshold (10), as well as reads passing
through these edges. This "simple" error
correction is done instead of running the
default LJA error correction module, mowerDBG.
Please note that we do not perform any
validation on this string before passing it to
LJA (so if you are allowing users to run
strainFlye through a web server, be careful
about shell injection). [default: --simpleec
--Cov-threshold 10]
-b, --lja-bin FILE Location of LJA's "lja" binary file, which can
be used to run LJA. This should be located in
the bin/ directory constructed when you compiled
LJA. If this is not provided, we will check to
see if LJA is available in your $PATH.
[default: (Look for LJA's bin in the $PATH
environment variable)]
-o, --output-dir DIRECTORY Directory to which output LJA assemblies (one
subdirectory per *.fasta.gz file in the input
--reads-dir) will be written. [required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
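Regarding the shell-injection caveat in the --lja-params help text: one reasonable way to compose such a command line (a sketch with made-up paths, not necessarily how strainFlye itself does it) is to split the parameter string into tokens rather than handing the whole string to a shell:

```python
import shlex

# Hypothetical paths, for illustration only.
lja_bin = "/path/to/LJA/bin/lja"
reads = "/path/to/smooth/reads/edge_1671.fasta.gz"
out_dir = "/path/to/smooth/assemblies/edge_1671"
lja_params = "--simpleec --Cov-threshold 10"  # the default --lja-params value

# shlex.split() tokenizes the parameter string so the command can be run
# without shell=True; the parameter string itself is still unvalidated,
# which is why the help text above warns about untrusted input.
cmd = [lja_bin, "--reads", reads] + shlex.split(lja_params) + ["--output-dir", out_dir]
```

A list like `cmd` could then be passed to something like `subprocess.run(cmd)` to invoke LJA directly.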
!strainFlye smooth assemble \
--reads-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/reads \
--lja-bin /home/mfedarko/software/LJA-branch/bin/lja \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies \
--verbose
Using strainFlye version "0.1.0-dev". -------- smooth assemble @ 0.00s: Starting... Input reads directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/reads LJA parameters: --simpleec --Cov-threshold 10 LJA binary (if None, we'll look in $PATH): /home/mfedarko/software/LJA-branch/bin/lja Verbose?: Yes Output directory: /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies -------- smooth assemble @ 0.00s: Assembling each *.fasta.gz file in /Poppy/mfedarko/sftests/tutorial-output/smooth/reads... smooth assemble @ 0.00s: Found file edge_1671.fasta.gz, presumably for contig edge_1671. Assembling. smooth assemble @ 0.00s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_1671.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671 00:00:00 2Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:10 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:10 5Mb INFO: LJA pipeline started 00:00:10 5Mb INFO: Performing initial correction with k = 5001 00:00:10 0Mb INFO: Reading reads 00:00:10 0Mb INFO: Extracting minimizers 00:00:22 3.1Gb INFO: Finished read processing 00:00:22 3.1Gb INFO: 3804650 hashs collected. Starting sorting. 00:00:22 3.1Gb INFO: Finished sorting. Total distinct minimizers: 26485 00:00:22 3.1Gb INFO: Starting construction of sparse de Bruijn graph 00:00:22 3.1Gb INFO: Vertex map constructed. 00:00:22 3.1Gb INFO: Filling edge sequences. 00:00:36 3.2Gb INFO: Finished sparse de Bruijn graph construction. 00:00:36 3.2Gb INFO: Collecting tips 00:00:36 3.2Gb INFO: Added 1672 artificial minimizers from tips. 00:00:36 3.2Gb INFO: Collected 52384 old edges. 00:00:36 3.2Gb INFO: New minimizers added to sparse graph. 00:00:36 3.2Gb INFO: Refilling graph with old edges. 00:00:37 3.2Gb INFO: Filling graph with new edges. 
00:00:37 3.2Gb INFO: Finished fixing sparse de Bruijn graph. 00:00:37 3.2Gb INFO: Starting to extract disjointigs. 00:00:37 3.2Gb INFO: Finished extracting 2434 disjointigs of total size 18962614 00:00:37 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/disjointigs.fasta" 00:00:38 35Mb INFO: Filling bloom filter with k+1-mers. 00:00:39 35Mb INFO: Filled 31425472 bits out of 217285760 00:00:39 35Mb INFO: Finished filling bloom filter. Selecting junctions. 00:00:39 36Mb INFO: Collected 5299 junctions. 00:00:39 36Mb INFO: Starting DBG construction. 00:00:39 36Mb INFO: Vertices created. 00:00:39 36Mb INFO: Filled dbg edges. Adding hanging vertices 00:00:39 36Mb INFO: Added 1 hanging vertices 00:00:39 36Mb INFO: Merging unbranching paths 00:00:39 36Mb INFO: Ended merging edges. Resulting size 2748 00:00:40 36Mb INFO: Cleaning edge coverages 00:00:40 36Mb INFO: Collecting alignments of sequences to the graph 00:00:51 3.2Gb INFO: Alignment collection finished. Total length of alignments is 677345 00:00:52 3.2Gb INFO: Could not correct 4910 reads. They will be removed. 00:00:52 3.2Gb INFO: Uncorrected reads were removed. 00:00:52 3.2Gb INFO: Applying changes to the graph 00:00:55 3.2Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/corrected_reads.fasta" 00:01:14 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:01:14 0Mb INFO: Loading graph from fasta 00:01:14 13Mb INFO: Finished loading graph 00:01:15 126Mb INFO: Looking for unique edges 00:01:15 126Mb INFO: Marked 76 long edges as unique 00:01:15 126Mb INFO: Marking extra edges as unique based on read paths 00:01:15 126Mb INFO: Marked 81 edges as unique 00:01:15 126Mb INFO: Splitting graph with unique edges 00:01:15 126Mb INFO: Processing 100 components 00:01:15 126Mb INFO: Finished unique edges search. 
Found 89 unique edges 00:01:15 126Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:01:15 126Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 00:01:15 126Mb INFO: Resolving repeats 00:01:15 126Mb INFO: Constructing paths 00:01:17 169Mb INFO: Building graph 00:01:17 169Mb INFO: Increasing k 00:01:17 170Mb INFO: Finished increasing k 00:01:17 170Mb INFO: Exporting remaining active transitions 00:01:17 170Mb INFO: Export to Dot 00:01:17 170Mb INFO: Export to GFA and compressed contigs 00:01:17 178Mb INFO: Finished repeat resolution 00:01:17 5Mb INFO: Performing polishing and homopolymer uncompression 00:01:18 6Mb INFO: Aligning reads back to assembly 00:01:27 2.2Gb INFO: Finished alignment. 00:01:27 2.2Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/uncompressing/alignments.txt" 00:01:27 2.2Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_1671.fasta.gz"] 00:02:45 3.4Gb INFO: Uncompressing homopolymers in contigs 00:02:46 3.4Gb INFO: Total zero covered nucleotides 0 00:02:46 3.4Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:02:46 3.4Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg.gfa" 00:02:46 3.4Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/assembly.fasta" 00:02:46 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/k5001/corrected_reads.fasta" 00:02:46 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg/mdbg.hpc.gfa" 00:02:46 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/mdbg.gfa" 00:02:46 
5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_1671/assembly.fasta" 00:02:46 5Mb INFO: LJA pipeline finished smooth assemble @ 166.78s: Finished running LJA on file edge_1671.fasta.gz. smooth assemble @ 166.78s: Found file edge_2358.fasta.gz, presumably for contig edge_2358. Assembling. smooth assemble @ 166.78s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_2358.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358 00:00:00 5Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:09 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:09 5Mb INFO: LJA pipeline started 00:00:09 5Mb INFO: Performing initial correction with k = 5001 00:00:09 0Mb INFO: Reading reads 00:00:09 0Mb INFO: Extracting minimizers 00:00:45 5.4Gb INFO: Finished read processing 00:00:45 5.4Gb INFO: 10414993 hashs collected. Starting sorting. 00:00:46 5.4Gb INFO: Finished sorting. Total distinct minimizers: 8259 00:00:46 5.4Gb INFO: Starting construction of sparse de Bruijn graph 00:00:46 5.4Gb INFO: Vertex map constructed. 00:00:46 5.4Gb INFO: Filling edge sequences. 00:01:21 7.8Gb INFO: Finished sparse de Bruijn graph construction. 00:01:21 7.8Gb INFO: Collecting tips 00:01:21 7.8Gb INFO: Added 4 artificial minimizers from tips. 00:01:21 7.8Gb INFO: Collected 16526 old edges. 00:01:21 7.8Gb INFO: New minimizers added to sparse graph. 00:01:21 7.8Gb INFO: Refilling graph with old edges. 00:01:21 7.8Gb INFO: Filling graph with new edges. 00:01:21 7.8Gb INFO: Finished fixing sparse de Bruijn graph. 00:01:21 7.8Gb INFO: Starting to extract disjointigs. 
00:01:21 7.8Gb INFO: Finished extracting 20 disjointigs of total size 2169195 00:01:21 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/disjointigs.fasta" 00:01:21 13Mb INFO: Filling bloom filter with k+1-mers. 00:01:22 14Mb INFO: Filled 9578413 bits out of 66213600 00:01:22 14Mb INFO: Finished filling bloom filter. Selecting junctions.
00:01:22 14Mb INFO: Collected 779 junctions. 00:01:22 14Mb INFO: Starting DBG construction. 00:01:22 14Mb INFO: Vertices created. 00:01:22 14Mb INFO: Filled dbg edges. Adding hanging vertices 00:01:22 14Mb INFO: Added 0 hanging vertices 00:01:22 14Mb INFO: Merging unbranching paths 00:01:23 14Mb INFO: Ended merging edges. Resulting size 16 00:01:23 14Mb INFO: Cleaning edge coverages 00:01:23 14Mb INFO: Collecting alignments of sequences to the graph 00:01:54 6.7Gb INFO: Alignment collection finished. Total length of alignments is 662331 00:01:55 6.7Gb INFO: Could not correct 3 reads. They will be removed. 00:01:55 6.7Gb INFO: Uncorrected reads were removed. 00:01:55 6.7Gb INFO: Applying changes to the graph 00:02:00 6.7Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/corrected_reads.fasta" 00:03:04 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:03:04 0Mb INFO: Loading graph from fasta 00:03:04 11Mb INFO: Finished loading graph 00:03:06 265Mb INFO: Looking for unique edges 00:03:06 265Mb INFO: Marked 8 long edges as unique 00:03:06 265Mb INFO: Marking extra edges as unique based on read paths 00:03:06 265Mb INFO: Marked 8 edges as unique 00:03:06 265Mb INFO: Splitting graph with unique edges 00:03:06 265Mb INFO: Processing 5 components 00:03:06 265Mb INFO: Finished unique edges search. Found 22 unique edges 00:03:06 265Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:03:06 265Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 
00:03:06 265Mb INFO: Resolving repeats 00:03:06 265Mb INFO: Constructing paths 00:03:11 496Mb INFO: Building graph 00:03:11 496Mb INFO: Increasing k 00:03:11 496Mb INFO: Finished increasing k 00:03:11 496Mb INFO: Exporting remaining active transitions 00:03:11 496Mb INFO: Export to Dot 00:03:11 496Mb INFO: Export to GFA and compressed contigs 00:03:11 0.5Gb INFO: Finished repeat resolution 00:03:12 5Mb INFO: Performing polishing and homopolymer uncompression 00:03:12 9Mb INFO: Aligning reads back to assembly 00:03:38 5.8Gb INFO: Finished alignment. 00:03:38 5.8Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/uncompressing/alignments.txt" 00:03:38 5.8Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_2358.fasta.gz"] 00:06:12 7Gb INFO: Uncompressing homopolymers in contigs 00:06:12 7Gb INFO: Total zero covered nucleotides 0 00:06:12 7Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:06:12 7Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg.gfa" 00:06:12 7Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/assembly.fasta" 00:06:13 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/k5001/corrected_reads.fasta" 00:06:13 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg/mdbg.hpc.gfa" 00:06:13 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/mdbg.gfa" 00:06:13 5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_2358/assembly.fasta" 00:06:13 5Mb INFO: LJA pipeline finished smooth assemble @ 539.88s: Finished running LJA on file 
edge_2358.fasta.gz. smooth assemble @ 539.88s: Found file edge_6104.fasta.gz, presumably for contig edge_6104. Assembling. smooth assemble @ 539.89s: Running this command: /home/mfedarko/software/LJA-branch/bin/lja --reads /Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_6104.fasta.gz --simpleec --Cov-threshold 10 --output-dir /Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104 00:00:00 2Mb INFO: Hello! You are running La Jolla Assembler (LJA), a tool for genome assembly from PacBio HiFi reads 00:00:09 5Mb INFO: e213d10b906cc633b78a27eceb52d44073478d6d 00:00:09 5Mb INFO: LJA pipeline started 00:00:09 5Mb INFO: Performing initial correction with k = 5001 00:00:09 0Mb INFO: Reading reads 00:00:09 0Mb INFO: Extracting minimizers 00:00:30 5.6Gb INFO: Finished read processing 00:00:30 5.6Gb INFO: 5736312 hashs collected. Starting sorting. 00:00:30 5.6Gb INFO: Finished sorting. Total distinct minimizers: 3721 00:00:30 5.6Gb INFO: Starting construction of sparse de Bruijn graph 00:00:30 5.6Gb INFO: Vertex map constructed. 00:00:30 5.6Gb INFO: Filling edge sequences. 00:00:50 5.7Gb INFO: Finished sparse de Bruijn graph construction. 00:00:50 5.7Gb INFO: Collecting tips 00:00:50 5.7Gb INFO: Added 14 artificial minimizers from tips. 00:00:50 5.7Gb INFO: Collected 7448 old edges. 00:00:50 5.7Gb INFO: New minimizers added to sparse graph. 00:00:50 5.7Gb INFO: Refilling graph with old edges. 00:00:50 5.7Gb INFO: Filling graph with new edges. 00:00:50 5.7Gb INFO: Finished fixing sparse de Bruijn graph. 00:00:50 5.7Gb INFO: Starting to extract disjointigs. 00:00:50 5.7Gb INFO: Finished extracting 26 disjointigs of total size 1074351 00:00:50 0Mb INFO: Loading disjointigs from file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/disjointigs.fasta" 00:00:50 9Mb INFO: Filling bloom filter with k+1-mers. 00:00:51 9Mb INFO: Filled 4370802 bits out of 30218400 00:00:51 9Mb INFO: Finished filling bloom filter. Selecting junctions. 
00:00:51 9Mb INFO: Collected 392 junctions. 00:00:51 9Mb INFO: Starting DBG construction. 00:00:51 9Mb INFO: Vertices created. 00:00:51 9Mb INFO: Filled dbg edges. Adding hanging vertices 00:00:51 9Mb INFO: Added 0 hanging vertices 00:00:51 9Mb INFO: Merging unbranching paths 00:00:51 9Mb INFO: Ended merging edges. Resulting size 34 00:00:51 9Mb INFO: Cleaning edge coverages 00:00:51 9Mb INFO: Collecting alignments of sequences to the graph 00:01:11 5.7Gb INFO: Alignment collection finished. Total length of alignments is 435418 00:01:11 5.7Gb INFO: Could not correct 24 reads. They will be removed. 00:01:11 5.7Gb INFO: Uncorrected reads were removed. 00:01:11 5.7Gb INFO: Applying changes to the graph 00:01:14 5.7Gb INFO: Printing reads to fasta file "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/corrected_reads.fasta" 00:01:50 5Mb INFO: Performing repeat resolution by transforming de Bruijn graph into Multiplex de Bruijn graph 00:01:50 0Mb INFO: Loading graph from fasta 00:01:50 7Mb INFO: Finished loading graph 00:01:51 243Mb INFO: Looking for unique edges 00:01:51 243Mb INFO: Marked 6 long edges as unique 00:01:51 243Mb INFO: Marking extra edges as unique based on read paths 00:01:51 243Mb INFO: Marked 6 edges as unique 00:01:51 243Mb INFO: Splitting graph with unique edges 00:01:51 243Mb INFO: Processing 4 components 00:01:51 243Mb INFO: Finished unique edges search. Found 10 unique edges 00:01:51 243Mb INFO: Analysing repeats of multiplicity 2 and looking for additional unique edges 00:01:51 243Mb INFO: Finished processing of repeats of multiplicity 2. Found 0 erroneous edges. 
00:01:51 243Mb INFO: Resolving repeats 00:01:51 243Mb INFO: Constructing paths 00:01:54 326Mb INFO: Building graph 00:01:54 326Mb INFO: Increasing k 00:01:54 326Mb INFO: Finished increasing k 00:01:54 326Mb INFO: Exporting remaining active transitions 00:01:54 326Mb INFO: Export to Dot 00:01:54 326Mb INFO: Export to GFA and compressed contigs 00:01:54 337Mb INFO: Finished repeat resolution 00:01:54 5Mb INFO: Performing polishing and homopolymer uncompression 00:01:54 5Mb INFO: Aligning reads back to assembly 00:02:10 3.5Gb INFO: Finished alignment. 00:02:10 3.5Gb INFO: Printing alignments to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/uncompressing/alignments.txt" 00:02:10 3.5Gb INFO: Reading and processing initial reads from ["/Poppy/mfedarko/sftests/tutorial-output/smooth/reads/edge_6104.fasta.gz"]
00:03:52 3.6Gb INFO: Uncompressing homopolymers in contigs 00:03:52 3.6Gb INFO: Total zero covered nucleotides 0 00:03:52 3.6Gb INFO: Calculating overlaps between adjacent uncompressed edges 00:03:52 3.6Gb INFO: Printing final gfa file to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg.gfa" 00:03:52 3.6Gb INFO: Printing final assembly to "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/assembly.fasta" 00:03:53 5Mb INFO: Final homopolymer compressed and corrected reads can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/k5001/corrected_reads.fasta" 00:03:53 5Mb INFO: Final graph with homopolymer compressed edges can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg/mdbg.hpc.gfa" 00:03:53 5Mb INFO: Final graph can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/mdbg.gfa" 00:03:53 5Mb INFO: Final assembly can be found here: "/Poppy/mfedarko/sftests/tutorial-output/smooth/assemblies/edge_6104/assembly.fasta" 00:03:53 5Mb INFO: LJA pipeline finished smooth assemble @ 772.97s: Finished running LJA on file edge_6104.fasta.gz. smooth assemble @ 772.97s: Done. -------- smooth assemble @ 772.97s: Done.
Once you've created these assemblies, you can analyze them as you would any other assembly of a MAG. For information about LJA-specific details of the outputs (e.g. what each output file contains), please see LJA's manual.
8.1.3. Visualizing assembly graphs¶
One thing we may want to do is visualize the assembly graphs created by LJA, to see how "simple" each assembly wound up being. Bandage, AGB, and MetagenomeScope are a few of the many tools that can do this.
As an example, here's a MetagenomeScope visualization of the multiplex de Bruijn graph (MDBG) for edge_6104 (CAMP). This visualization makes it clear that, although there were 35 mutated positions in CAMP in the BCF file (according to the logs from strainFlye smooth create above), the final MDBG—after applying a $k$-mer coverage filter—really consists of two "bubble" structures, i.e. the boxes colored blue. (The pink segment in this figure indicates a sequence that is shared between both bubbles, duplicated here for ease of visualization.)

Why does this graph look different from the other multiplex de Bruijn graph shown for CAMP in the strainFlye paper?¶
For context: Figure 6 in our paper also shows a MDBG for CAMP, produced by running LJA on the smoothed and virtual reads created for CAMP. Why does that visualization include more sequences?
This is primarily due to differences in the input mutations used for the smoothing process. The mutations used as input for the paper's Figure 6 were simply the output of naïve $p$-mutation calling with $p = 1\%$. Here, on the other hand, the input BCF file we gave to strainFlye smooth create was produced by FDR estimation and fixing (using CAMP as the decoy contig). This meant that none of the CAMP mutations included in this BCF file were "rare" (i.e. had frequencies below the value of the -hp/--high-p parameter of strainFlye fdr estimate, which defaults to $p = 5\%$).
(The input alignment is also slightly different.)
8.2. Constructing link graphs¶
As a related analysis, we can construct link graphs showing "how much" of a contig we can potentially phase. These graphs are described in detail in the Supplemental Material of our paper.
Long story short, creating link graphs in strainFlye has two steps (like performing smoothed-read assembly above). First, we'll compute nucleotide (co-)occurrence data for each contig; then, we'll convert this data into link graphs.
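As a rough sketch of what the first step computes (strainFlye's actual implementation works from the BAM and BCF files and writes per-contig pickle files), the (co-)occurrence data boils down to counting which nucleotides each read carries at mutated positions, both singly and in pairs:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(reads):
    """Tally nucleotide (co-)occurrences at mutated positions.

    reads: list of dicts, one per read, each mapping a mutated position
        (that the read spans) to the nucleotide the read has there.
    Returns (nt_counts, pair_counts): counts of each (position, nucleotide)
    observation, and of each (position pair, nucleotide pair) observation.
    """
    nt_counts = Counter()
    pair_counts = Counter()
    for read in reads:
        # Single-position counts: how often each nucleotide appears at
        # each mutated position.
        for pos, nt in read.items():
            nt_counts[(pos, nt)] += 1
        # Pair counts: for every pair of mutated positions spanned by the
        # same read, record the pair of nucleotides the read carries.
        for p1, p2 in combinations(sorted(read), 2):
            pair_counts[((p1, p2), (read[p1], read[p2]))] += 1
    return nt_counts, pair_counts
```

The second step would then use counts like these to decide which pairs of alleles are "linked" strongly enough to include as edges in a contig's link graph.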
!strainFlye link
Usage: strainFlye link [OPTIONS] COMMAND [ARGS]...

  [+] Create link graphs showing co-occurring alleles.

  The "nt" command should be run first; this generates information about the
  counts and co-occurrences of nucleotides at mutated positions in a contig.
  The "graph" command takes as input the information produced by "nt" and
  creates link graphs (one link graph per contig) from it. There are many
  parameters that impact the graph creation, so we have split this into two
  steps in order to make it easy to rerun the "graph" step with different
  parameter settings if needed (the "nt" command will likely take longer to
  run).

Options:
  -h, --help  Show this message and exit.

Commands:
  nt     Compute (co-)occurrence information for mutations' nucleotides.
  graph  Convert (co-)occurrence information into a link graph structure.
8.2.1. Computing nucleotide (co-)occurrence information¶
This command will take a while (as shown below, it took ~47.84 hours to run on our cluster, or just under two days). As with other long-running commands in the pipeline, using --verbose may be helpful if you'd like to see intermediate logging messages.
A brief note about filtering¶
Note that, in the paper's Supplemental Material, we limited our focus to mutated positions with coverages of at least 1,000x. We do not do this here. If you'd like to perform this same sort of filtering—and only consider mutations occurring at positions with at least a certain minimum coverage—you'll need to filter the BCF file yourself. (You could probably use bcftools to do this; if you are using a BCF file produced by strainFlye, then you can filter based on the MDP info tag, which describes the number of matching + mismatching reads at a mutated position.)
!strainFlye link nt
Usage: strainFlye link nt [OPTIONS]
Compute (co-)occurrence information for mutations' nucleotides.
Options:
-c, --contigs PATH FASTA file of contigs for which we will create
link graphs. All contigs in this FASTA file
should also be contained in the BAM and BCF
files; it's ok if the BAM or BCF files contain
contigs not in this FASTA file (we'll ignore
them). [required]
--bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
--bcf PATH Indexed BCF file describing single-nucleotide
mutations in a set of contigs. [required]
-o, --output-dir DIRECTORY Directory to which information about nucleotide
frequencies at the mutated positions in each
contig, as well as co-occurrence information
between pairs of mutated positions, will be
written. Each contig will be represented by two
"pickle" files within this directory, one
describing nucleotide frequencies at each
position and one describing nucleotide pair
frequencies at pairs of positions; each file's
name will be prefixed with the corresponding
contig name. [required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
!strainFlye link nt \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--bcf /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt \
--verbose
Using strainFlye version "0.1.0-dev".
-------- link nt @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Input BCF file: /Poppy/mfedarko/sftests/tutorial-output/p15-fdr1pct.bcf
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/link/nt
-------- link nt @ 0.00s: Loading and checking FASTA, BAM, and BCF files...
link nt @ 0.12s: The FASTA file describes 3 contig(s).
link nt @ 0.22s: All FASTA contig(s) are included in the BAM file (this BAM file has 78,793 reference(s)).
link nt @ 11.96s: All FASTA contig(s) are included in the BCF file (the header of this BCF file describes 468 contig(s)).
link nt @ 11.98s: The lengths of all contig(s) in the FASTA file match the corresponding lengths in the BAM and BCF files.
link nt @ 11.98s: So far, these files seem good.
-------- link nt @ 11.98s: Going through contigs and computing nucleotide (co-)occurrence information...
link nt @ 11.98s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
link nt @ 11.99s: Contig edge_1671 has 23,821 mutated position(s). Going through alignments to this contig...
link nt @ 899.59s: Found 256,509 read(s) spanning ≥ 1 mutated position in contig edge_1671. Computing (co-)occurrence info...
link nt @ 3,115.82s: Wrote out (co-)occurrence info for contig edge_1671.
link nt @ 3,115.82s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
link nt @ 3,115.82s: Contig edge_2358 has 17 mutated position(s). Going through alignments to this contig...
link nt @ 4,448.62s: Found 22,384 read(s) spanning ≥ 1 mutated position in contig edge_2358. Computing (co-)occurrence info...
link nt @ 4,451.40s: Wrote out (co-)occurrence info for contig edge_2358.
link nt @ 4,451.40s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
link nt @ 4,451.40s: Contig edge_6104 has 35 mutated position(s). Going through alignments to this contig...
link nt @ 5,292.62s: Found 8,670 read(s) spanning ≥ 1 mutated position in contig edge_6104. Computing (co-)occurrence info...
link nt @ 5,293.61s: Wrote out (co-)occurrence info for contig edge_6104.
link nt @ 5,293.61s: Done.
-------- link nt @ 5,293.66s: Done.
8.2.2. Converting this information to create link graphs¶
This next command directly takes as input the information we just computed. You can experiment with various parameters of graph construction here, if you'd like; the defaults match what we have set in the paper. (The help text for this command, shown below, explains these parameters in detail.)
!strainFlye link graph
Usage: strainFlye link graph [OPTIONS]
Convert (co-)occurrence information into a link graph structure.
We create one link graph per contig. These graphs are undirected. A node (i,
Ni) in the graph represents an allele: in other words, the occurrence of
nucleotide Ni (one of {A, C, G, T}) at position i (1-indexed).
What do edges between nodes represent? To answer that, let's define reads(i,
Ni) as the number of reads at which Ni is aligned to i. Let's also define
reads(i, j, Ni, Nj) as the number of reads at which Ni is aligned to i, and
Nj is aligned to j.
Given these definitions, we define the link weight between two alleles (i,
Ni) and (j, Nj) as
reads(i, j, Ni, Nj)
link(i, j, Ni, Nj) = -----------------------------------
max(reads(i, Ni), reads(j, Nj))
We create an edge between nodes (i, Ni) and (j, Nj) if both of the following
conditions hold:
1. The number of reads spanning both i and j with any (non-degenerate)
nucleotides aligned to either position is ≥ the --min-span parameter.
2. link(i, j, Ni, Nj) > the --low-link parameter.
Options:
-n, --nt-dir DIRECTORY Directory produced by "strainFlye link nt"
containing "pickle" files describing
nucleotide (co-)occurrence information for
certain contigs. [required]
--min-nt-count INTEGER RANGE We will only create a node representing a
given nucleotide at a mutated position if
this nucleotide is supported by at least
this many reads at this position. [default:
2; x>=1]
--min-span INTEGER RANGE One of the prerequisites for creating an
edge between two nodes (representing
nucleotides Ni, Nj at mutated positions i
and j) is that the total number of reads
spanning both i and j, across all
combinations of (non-degenerate) nucleotides
at both i and j, is at least this parameter.
[default: 501; x>=1]
--low-link FLOAT RANGE The other prerequisite for creating an edge
between two nodes (representing nucleotides
Ni, Nj at mutated positions i and j) is that
link(i, j, Ni, Nj) > this parameter. Note
this check is not inclusive (i.e. the
default of 0 means that, if reads(i, j, Ni,
Nj) = 0, we will never create an edge
between (i, Ni) and (j, Nj). [default: 0;
0<=x<1]
-f, --output-format [dot|nx] Output format used for each contig's link
graph. "dot" will write out graphs in
Graphviz' DOT language; "nx" will write out
graphs to "pickle" files, in which each link
graph is a NetworkX object. [default: dot]
--min-penwidth-clamp FLOAT RANGE
Only used if --output-format is "dot". Edges
will be assigned a "penwidth" attribute by
interpolating their link weight from the
range [0, 1] to the range [0, --max-
penwidth]; penwidths less than --min-
penwidth-clamp will then be clamped to
--min-penwidth-clamp. [default: 0.1; x>=0]
--max-penwidth FLOAT RANGE Only used if --output-format is "dot". See
the description for --min-penwidth-clamp.
[default: 5; x>=0]
-o, --output-dir DIRECTORY Directory to which graphs will be written.
We'll write one graph file per contig; each
file's name will be prefixed with the
corresponding contig name. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
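To make the definitions above concrete, here is a tiny Python sketch of the edge-creation rule (a toy re-derivation using the default --min-span and --low-link values, not strainFlye's actual code):

```python
# reads_1 maps alleles (i, Ni) -> read counts; reads_2 maps allele pairs
# -> co-occurrence counts; span maps (i, j) -> number of spanning reads.
def build_link_edges(reads_1, reads_2, span, min_span=501, low_link=0.0):
    edges = {}
    for ((i, ni), (j, nj)), co in reads_2.items():
        # link(i, j, Ni, Nj) = reads(i, j, Ni, Nj) / max(reads(i, Ni), reads(j, Nj))
        link = co / max(reads_1[(i, ni)], reads_1[(j, nj)])
        # Both conditions must hold: enough spanning reads, and link above the cutoff
        if span[(i, j)] >= min_span and link > low_link:
            edges[((i, ni), (j, nj))] = link
    return edges
```

For example, an allele pair with 600 co-occurring reads out of max(800, 650) supporting reads, and 700 spanning reads, gets an edge with link weight 0.75; a pair with zero co-occurrences never gets an edge under the default --low-link of 0.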
!strainFlye link graph \
--nt-dir /Poppy/mfedarko/sftests/tutorial-output/link/nt \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/link/graph \
--verbose
Using strainFlye version "0.1.0-dev".
-------- link graph @ 0.00s: Starting...
Input nucleotide information directory: /Poppy/mfedarko/sftests/tutorial-output/link/nt
To make a node (i, Ni): reads(i, Ni) must be ≥ 2
To make an edge btwn (i, Ni) & (j, Nj): spanCount(i, j) must be ≥ 501
To make an edge btwn (i, Ni) & (j, Nj): link(i, j, Ni, Nj) must be > 0.00
Output graph file format: dot
In dot output, edges' penwidths will be scaled from link weights to [0, 5.0] and then clamped to a minimum of 0.1
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/link/graph
-------- link graph @ 0.00s: Going through (co-)occurrence information and creating link graphs...
link graph @ 25.65s: Creating a link graph for contig edge_1671...
link graph @ 25.73s: The link graph for contig edge_1671 has 49,167 node(s).
link graph @ 54.96s: The link graph for contig edge_1671 has 8,801,972 edge(s).
link graph @ 84.48s: Wrote out a link graph (format: "dot") for contig edge_1671.
link graph @ 86.67s: Creating a link graph for contig edge_2358...
link graph @ 86.67s: The link graph for contig edge_2358 has 41 node(s).
link graph @ 86.67s: The link graph for contig edge_2358 has 13 edge(s).
link graph @ 93.57s: Wrote out a link graph (format: "dot") for contig edge_2358.
link graph @ 93.57s: Creating a link graph for contig edge_6104...
link graph @ 93.57s: The link graph for contig edge_6104 has 80 node(s).
link graph @ 93.57s: The link graph for contig edge_6104 has 2,190 edge(s).
link graph @ 93.58s: Wrote out a link graph (format: "dot") for contig edge_6104.
link graph @ 93.58s: Done.
-------- link graph @ 93.58s: Done.
8.2.3. Visualizing CAMP's link graph¶
Since we kept -f/--output-format set to the default of dot, the above command created a DOT file for each contig's link graph. These files can easily be visualized using Graphviz. (Graphviz comes installed by default on many Linux systems; see its website for installation instructions. I've found that installing it from conda-forge works well.)
Let's visualize the link graph for CAMP (edge_6104) as an example of how we might visualize link graphs in practice. This contig doesn't contain many mutations, so it makes for a simple starting point.
To visualize this graph we'll use Graphviz' sfdp layout algorithm, which is useful for visualizing large undirected graphs.
# make sure sfdp (and Graphviz) are installed
!which sfdp
/home/mfedarko/miniconda3/envs/strainflye/bin/sfdp
!sfdp \
/Poppy/mfedarko/sftests/tutorial-output/link/graph/edge_6104_linkgraph.gv -Tpng \
> edge_6104_linkgraph.png

And there we have it!
8.2.4. Interpreting link graphs¶
What's going on in this graph? We see two large "groups" of alleles with high-link-weight edges. Each allele is labelled with four pieces of information:
- Position on the contig (1-indexed)
- Nucleotide (one of A, C, G, T)
- Count
- Frequency
Notice how the frequencies of alleles in the left group are mostly ~16%, and how the frequencies of alleles in the right group are mostly ~84%. These groups correspond to the two main "haplotypes" of gene 1217 in CAMP (ranging from left = 1,208,927 bp to right = 1,210,075 bp). We know from our paper (see Figure 5, as well as Supplemental Material "Haplotypes of the most mutated gene in CAMP") that the mutations in gene 1217 of CAMP have one ~84% haplotype and one ~16% haplotype: we see this pattern recapitulated clearly here.
There's also one mutation with a similar frequency distribution at position 766,857 bp, but it isn't close enough to gene 1217's mutations to be connected to them in the graph.
Notice how there are a few low-count, low-frequency alleles on the periphery of this link graph (e.g. node 1,209,001 (T), with a count of 3x). These alleles pop up because, when we create nodes corresponding to alleles at mutated positions in a contig, we create these nodes for all alleles at these mutated positions (with a count of at least the --min-nt-count parameter of strainFlye link graph)—not just those corresponding to the most common or second most common alleles at a mutated position. You can increase the --min-nt-count parameter to a higher number (e.g. 10) if you'd like to remove these low-count allele nodes.
Comparing this link graph with Figure S19 (bottom) in our paper¶
We note that this graph is a bit different from Figure S19 (bottom) in our paper, which also shows a link graph for CAMP. Why is this the case? Here are a few reasons:
Figure S19's link graphs were created using simple $p = 0.5\%$ mutations (filtered to positions with a minimum coverage of 1,000x), whereas our link graphs were created from post-FDR-fixing mutations without any coverage filtering (and since we used CAMP as our decoy, we only include "indisputable" mutations in CAMP—i.e. those with frequency $p ≥ 5\%$).
Figure S19 (bottom) only shows the largest component of CAMP's link graph, whereas this visualization shows the entire graph. (Since we consider what is likely a smaller set of mutations here, though, the full graph is not much larger.)
In Figure S19, I colored each node (allele) in the link graph based on the gene(s) in which it was located. (This analysis notebook contains the code I used to do this.) Here we just visualize these DOT files as strainFlye outputs them, without any gene-coloring information; you can adjust nodes' colors yourself if desired, although for large graphs this will probably require some programming.
The alignment used here is slightly different from that used in the paper.
Other notes about link graphs¶
When visualizing massive link graphs (e.g. for BACT1's largest component, as shown in Figure S19 (top) in our paper), you can adjust the dpi attribute of the graph (for BACT1's largest component, I added dpi=20; to the top of its DOT file) to reduce the visualization's filesize. You can also try removing the label attribute of nodes, since for massive graphs at low resolution these labels will be hard to read anyway.
Graphviz includes the ccomps command-line tool, which splits DOT files up into their connected components. This can be useful if you want to visualize individual components of a large DOT file.
9. Codon / amino acid mutation matrices¶
We don't discuss this much in the main text of the strainFlye paper, but it's possible to create mutation matrices showing how codons mutate into each other, specific to an individual contig. We can also derive amino acid mutation matrices from these. These matrices can in theory be useful as an alternative to more general amino acid mutation matrices like PAM (Dayhoff et al., 1978) or BLOSUM (Henikoff & Henikoff, 1992).
In our paper's Supplemental Material, we show off a few of these matrices for three MAGs. strainFlye supports creating these matrices for arbitrary contigs, and here we demonstrate how we can easily do this!
!strainFlye matrix
Usage: strainFlye matrix [OPTIONS] COMMAND [ARGS]...

  Create codon and amino acid mutation matrices.

Options:
  -h, --help  Show this message and exit.

Commands:
  count  Count 3-mers aligned to contigs' codons.
  fill   Call codon mutations from 3-mer counts and fill in matrices.
Similarly to strainFlye link, the strainFlye matrix module is split up into two steps (count and fill). The rationale is that count takes a while and doesn't require many parameters, whereas fill runs much more quickly and has many adjustable parameters, so it's convenient to have a "break" in the process this way.
9.1. Counting 3-mers aligned to codons¶
We'll reuse prodigal_anonymous_selected3.gff, the file of protein-coding gene predictions in the three selected contigs that we generated above using Prodigal. This tells strainFlye where each contig's genes are, and from there we can figure out where its codons are (and what reads are aligned to each of these codons).
This will take a while, so you may want to use --verbose and/or filter your FASTA file (e.g. if you only want to draw a mutation matrix for a single contig, filtering your FASTA file to this one contig should speed things up a lot).
!strainFlye matrix count
Usage: strainFlye matrix count [OPTIONS]
Count 3-mers aligned to contigs' codons.
Options:
-c, --contigs PATH FASTA file of contigs in which we'll count
3-mers aligned to codons in predicted genes. All
contigs in this FASTA file should also be
contained in the BAM file; it's ok if the BAM
file contains contigs not in this FASTA file
(we'll ignore them). We'll also ignore contigs
for which no gene coordinates are given.
[required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
-g, --genes PATH Generic Feature Format version 3 (GFF3) file
describing gene coordinates within contigs.
We'll only consider features with a "type" of
one of {"CDS", "SO:0000316"}. [required]
-o, --output-dir DIRECTORY Directory to which 3-mer count information for
each contig will be written. Each contig will be
represented by one "pickle" file in this
directory named [contig]_3mers.pickle.
[required]
--verbose / --no-verbose Display extra details while running. [default:
no-verbose]
-h, --help Show this message and exit.
!strainFlye matrix count \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--genes /Poppy/mfedarko/sftests/tutorial-output/prodigal_anonymous_selected3.gff \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/count \
--verbose
9.2. Calling codon mutations from aligned 3-mer counts, creating mutation matrices¶
The long part is over! Now that we have counts of which 3-mers were aligned to the codons in the coding sequences of our contigs, we can call codon mutations.
In our pipeline, codon mutations are called similarly to single-nucleotide mutations: we support either $p$- or $r$-based codon mutation calling (using frequencies or read counts, respectively). Here we'll do $p$-mutation calling at $p = 0.5\%$, to be consistent with the paper's Supplemental Material.
Minor sidenote: skipping this step and calling codon mutations manually¶
If you'd prefer, you could in theory call codon mutations from these counts using a more sophisticated method—this is one of the advantages of splitting up count and fill into two separate commands. (However: in order to do this, you'd need to load the 3-mer count data yourself—it's stored as a pickled representation of a CodonCounter object, which is defined in strainFlye's source code here. Feel free to open an issue on GitHub if you have questions about interacting with this object.)
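Loading one of these pickle files might look like the following sketch (the helper function is ours; as always with pickle, only load files from sources you trust):

```python
import pickle

def load_pickled_counts(path):
    # Load a pickled object (e.g. a CodonCounter written by
    # "strainFlye matrix count") from disk
    with open(path, "rb") as f:
        return pickle.load(f)

# For example (hypothetical path, following the [contig]_3mers.pickle naming):
# counts = load_pickled_counts(
#     "/Poppy/mfedarko/sftests/tutorial-output/matrix/count/edge_1671_3mers.pickle"
# )
```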
!strainFlye matrix fill
Usage: strainFlye matrix fill [OPTIONS]
Call codon mutations from 3-mer counts and fill in matrices.
You can call either p-mutations (based on codon frequencies) or r-mutations
(based just on codon counts) using -p or -r. These methods work analogously
to how they are described in "strainFlye call", with the addition that we do
not call codon mutations for "unreasonable" codons where the most common
aligned 3-mer (or, in the case of a tie, any of the most common aligned
3-mers) at this codon is not exactly the same as the "reference" codon's
3-mer.
It's worth noting that the amino acid mutation matrix is derived directly
from the codon mutation matrix: that is, as soon as we identify a mutation
from one codon (C1) into another codon (C2), we also record a mutation in
the amino acid matrix from translate(C1) into translate(C2). We could
arguably do this in a different way (count the translations of each 3-mer
aligned to a codon, then call amino acid mutations separately), but for now
we just do things the one way.
Options:
-c, --count-dir DIRECTORY Directory produced by "strainFlye matrix
count" containing "pickle" files describing
the counts of 3-mers aligned to contigs'
codons. [required]
-p INTEGER RANGE If specified, we'll call codon p-mutations
at this p threshold. This is scaled up by
100 (i.e. "-p 50" corresponds to 50 / 100 =
0.5%) in order to bypass floating-point
precision issues. Mutually exclusive with
-r. [default: (nothing); 0<x<=5000]
--min-alt-pos INTEGER RANGE Only used if -p is specified. In order for
us to call a p-mutation at a position, this
position's alternate nucleotide must be
supported by at least this many reads.
[default: 2; x>=1]
-r INTEGER RANGE If specified, we'll call codon r-mutations
at this r threshold. Mutually exclusive with
-p. [default: (nothing); x>=1]
-f, --output-format [tsv|json] Output format for matrix information.
[default: tsv]
-o, --output-dir DIRECTORY Directory to which matrix information will
be written. Within this directory, we'll
create one subdirectory for each contig
represented in --count-dir. Each
subdirectory will contain a codon mutation
matrix, amino acid mutation matrix, and
counts of codons and amino acids in the
"reference" contig sequence. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
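As a concrete illustration, the "unreasonable codon" check described in the help text above can be sketched like this (our own re-derivation from the help text, not strainFlye's code):

```python
from collections import Counter

def is_reasonable(ref_codon, aligned_3mer_counts):
    # A codon is "unreasonable" (and skipped) if the most common aligned 3-mer
    # (or, in a tie, any of the most common 3-mers) differs from the reference codon
    counts = Counter(aligned_3mer_counts)
    top = counts.most_common(1)[0][1]
    return all(kmer == ref_codon for kmer, n in counts.items() if n == top)
```

So a codon whose reference 3-mer dominates the alignment pile-up is kept, while one where a different 3-mer is most common (or tied for most common) is skipped.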
!strainFlye matrix fill \
--count-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/count \
-p 50 \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/matrix/fill \
--verbose
Using strainFlye version "0.1.0-dev".
-------- matrix fill @ 0.00s: Starting...
Input 3-mer count directory: /Poppy/mfedarko/sftests/tutorial-output/matrix/count
p: 50
Minimum number of supporting reads needed to call a p-mutation: 2
r: None
Output matrix information file format: tsv
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/matrix/fill
-------- matrix fill @ 0.00s: Performing codon p-mutation calling (p = 0.50%).
-------- matrix fill @ 0.00s: Going through aligned 3-mer counts and creating matrices...
matrix fill @ 1.26s: Creating matrices for contig edge_1671...
matrix fill @ 66.95s: Created an output directory for contig edge_1671.
matrix fill @ 68.38s: Creating matrices for contig edge_2358...
matrix fill @ 143.64s: Created an output directory for contig edge_2358.
matrix fill @ 144.56s: Creating matrices for contig edge_6104...
matrix fill @ 183.52s: Created an output directory for contig edge_6104.
matrix fill @ 183.52s: Done.
-------- matrix fill @ 183.64s: Done.
9.3. Visualizing mutation matrices¶
First, let's load and inspect a codon mutation matrix. We'll use BACT1 (edge_1671)'s matrix, just because we know from our paper that this MAG has a lot of mutations.
An entry (row, col) in the codon or amino acid matrix gives the number of times the codon or amino acid row mutates into the codon or amino acid col. For example, in BACT1's codon mutation matrix (shown below), we saw the codon AAA mutate into the codon AAC exactly 15 times.
Note: We don't do this here, but—depending on what you want to use this matrix for—it is worth thinking about if you want to normalize and/or subset it in some way. For example, you may want to divide each entry in each row by the total number of mutations in this row (dividing each entry in AAA's row by the total number of mutations from AAA into another codon, etc.), in order to convert each entry from counts to percentages. Making these sorts of changes should be feasible with pandas, once you've loaded the initial matrix.
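For example, here is a sketch of that row-normalization with pandas, using a toy 3×3 slice (a real codon matrix is 64×64, with NaN on the diagonal as in strainFlye's output):

```python
import pandas as pd

# Toy 3x3 "codon matrix": rows = "mutated from", columns = "mutated into"
mat = pd.DataFrame(
    [[None, 15.0, 5.0],
     [10.0, None, 30.0],
     [0.0, 4.0, None]],
    index=["AAA", "AAC", "AAG"],
    columns=["AAA", "AAC", "AAG"],
    dtype=float,
)
# Divide each entry by its row's total mutation count (sum() ignores the NaN
# diagonal), converting counts to percentages
row_pcts = mat.div(mat.sum(axis=1), axis=0) * 100
```

Here, the AAA row totals 20 mutations, so the 15 AAA→AAC mutations become 75%.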
# Load the codon matrix for BACT1 using pandas
import pandas as pd

cmat = pd.read_csv(
    "/Poppy/mfedarko/sftests/tutorial-output/matrix/fill/edge_1671/edge_1671_codon_matrix.tsv",
    sep="\t",
    index_col=0,  # the first column holds the "mutated from" codon labels
)
cmat
| AAA | AAC | AAG | AAT | ACA | ACC | ACG | ACT | AGA | AGC | ... | TCG | TCT | TGA | TGC | TGG | TGT | TTA | TTC | TTG | TTT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAA | NaN | 15.0 | 243.0 | 53.0 | 11.0 | 3.0 | 0.0 | 2.0 | 53.0 | 2.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| AAC | 35.0 | NaN | 3.0 | 316.0 | 2.0 | 7.0 | 0.0 | 1.0 | 0.0 | 45.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| AAG | 265.0 | 7.0 | NaN | 15.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| AAT | 53.0 | 254.0 | 20.0 | NaN | 5.0 | 0.0 | 0.0 | 14.0 | 2.0 | 8.0 | ... | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACA | 16.0 | 1.0 | 3.0 | 8.0 | NaN | 94.0 | 76.0 | 168.0 | 6.0 | 3.0 | ... | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TGT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 95.0 | 1.0 | NaN | 0.0 | 1.0 | 0.0 | 2.0 |
| TTA | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 7.0 | 195.0 | 11.0 |
| TTC | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | NaN | 3.0 | 187.0 |
| TTG | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 264.0 | 0.0 | NaN | 15.0 |
| TTT | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 12.0 | 0.0 | 0.0 | 0.0 | 2.0 | 12.0 | 157.0 | 12.0 | NaN |
64 rows × 64 columns
The following code is a simplified version of how we visualized these matrices in the paper; the original notebook for visualizing these matrices is here.
# Get the tab20 color map. This returns an object that we can call with, e.g. (0) to get the first
# color, (1) to get the second color, etc. See https://stackoverflow.com/a/25408562.
# (List of matplotlib color maps: https://matplotlib.org/stable/gallery/color/colormap_reference.html)
import matplotlib
import skbio
from matplotlib import pyplot

cmap = matplotlib.cm.get_cmap("tab20")
colors = [cmap(i) for i in range(20)]
hexcolors = [matplotlib.colors.to_hex(c) for c in colors]
# Map each of the 20 amino acids to a unique tab20 color; keep the stop codon (item 21)'s
# text color as black so that it also has a unique color.
aa2color = {}
aas = "ACDEFGHIKLMNPQRSTVWY*"
for i, aa in enumerate(aas[:-1]):
    aa2color[aa] = hexcolors[i]
aa2color["*"] = "#000000"

fig, ax = pyplot.subplots()
im = ax.imshow(cmat.values)
ax.grid(False)
ax.tick_params(
    top=True, bottom=True, left=True, right=True,
    labeltop=True, labelbottom=True, labelleft=True, labelright=True
)
toplabels = []
leftlabels = []
for oi, o in enumerate(cmat.index):
    t = skbio.DNA(o).translate()
    # Assign each codon a color based on its translation
    toplabels.append(f"{t}\n{o}")
    leftlabels.append(f"{t} {o}")
# As in "tick i's". This specifies where to put all the ticks. (If we omit this,
# then only a few of the tick labels are shown.)
tick_is = range(len(cmat.index))
ax.set_xticks(tick_is)
ax.set_yticks(tick_is)
ax.set_xticklabels(toplabels)
ax.set_yticklabels(leftlabels)
def set_text_style(tl, fontsize=20):
    # Figure out the color for this tick based on its amino acid / stop codon
    tl.set_color(aa2color[tl.get_text().split()[0]])
    tl.set_fontweight("bold")
    tl.set_fontfamily("monospace")
    tl.set_fontsize(fontsize)
for tl in ax.get_yticklabels():
    set_text_style(tl)
for tl in ax.get_xticklabels():
    set_text_style(tl, fontsize=9)
ax.set_ylabel("Mutated from", fontsize=30)
ax.set_xlabel("Mutated into", fontsize=30)
ax.set_title('Codon mutation matrix ($p = 0.5\%$) for edge_1671 ("BACT1")', fontsize=40, y=1.035)
# Call tight_layout() before adding the colorbar: https://stackoverflow.com/a/48922336
fig.tight_layout()
# Use thousands separators in colorbar labels if needed: https://stackoverflow.com/a/47230864
cbar_formatter = lambda x, p: format(int(x), ",")
# The aspect=60 thing reduces the colorbar's width to look roughly equivalent to that of a matrix cell
# Also, note that adding the colorbar here causes a matplotlib warning about "Auto-removal of grids": as
# far as I can tell (https://github.com/matplotlib/matplotlib/issues/21723), this warning is not important.
fig.colorbar(im, aspect=60, format=cbar_formatter)
fig.set_size_inches(27, 27)
/home/mfedarko/miniconda3/envs/strainflye/lib/python3.7/site-packages/ipykernel_launcher.py:48: MatplotlibDeprecationWarning: Auto-removal of grids by pcolor() and pcolormesh() is deprecated since 3.5 and will be removed two minor releases later; please call grid(False) first.
So that's an example of how we can visualize mutation matrices! Making them look pretty takes a bit of work, but matplotlib helps a lot. There are lots of other ways to add more information into these visualizations—showing the exact number of mutations in each cell of the matrix, showing the "reference counts" (number of times we saw a codon in a contig's genes) on the diagonal of the codon matrix, etc. The notebook I used to visualize matrices for the paper goes much more in-depth, although its code is a bit of a rat's nest.
Of course, mutation matrices have other uses besides visualization—as mentioned, they could in theory be useful as a way to compare alignments of genes or proteins, similarly to the PAM or BLOSUM matrices. strainFlye just focuses on computing these matrices, so whatever you do with them is up to you :)
10. Growth dynamics analyses¶
One of the nice things about having complete or nearly-complete MAGs is that we can take a look at both their GC skews and their coverages.
GC skews can sometimes be indicative of where the origin and terminus of replication are in a genome (this is discussed extensively in Chapter 1 of the Bioinformatics Algorithms textbook). Similarly, we know from (Korem, Zeevi, Suez et al., 2015) that the coverage throughout a MAG can also be indicative of the origin of replication.
In our paper, we demonstrate the applicability of HiFi reads to this sort of analysis by plotting coverage vs. skew for three MAGs in the SheepGut dataset; as we might expect, these values are anti-correlated. (Why anti-correlated, specifically? The skew minimum indicates the origin of replication, and coverage for a genome undergoing replication should be at its maximum near the origin of replication.)
The strainFlye dynam covskew command can help us reproduce this sort of analysis. Rather than plotting coverage and skew for every position within a MAG, we usually want to create "bins" of positions within a MAG and then plot coverage and skew per bin: strainFlye outputs this information for each contig in an easy-to-read TSV file that we can then use as input for plotting.
This command will take some time to run, so you may want to use --verbose.
10.1. Compute coverage/skew information¶
!strainFlye dynam covskew
Usage: strainFlye dynam covskew [OPTIONS]
Compare coverage and GC skew within contigs.
Bins
----
We split each contig into bin(s) of a fixed number of positions (determined
by the --bin-length parameter), starting from the leftmost end of the
contig (which we define as position 1). We continue creating bins until
we run out of positions: the rightmost bin will contain fewer positions
than --bin-length if the contig length is not divisible by --bin-length.
(And if the contig's length is ≤ --bin-length, we'll just create one
bin for all positions in this contig.)
Normalized coverages
--------------------
For each bin, we compute the median coverage of all positions in this bin.
We then compute M, the median of these medians across all bins in a contig.
We finally compute "normalized" coverages by dividing each bin's median
coverage by M, and then clamping these normalized coverages to the range
[1 - ϵ, 1 + ϵ]. (If M = 0, then we cannot do this normalization:
in this case, we define each bin's normalized coverage as "NA".) The
--norm-coverage-epsilon parameter can be used to adjust ϵ.
Cumulative GC skews
-------------------
For each bin, we compute the skew (G - C) / (G + C), where G is the number
of G nucleotides in this bin and C is the number of C nucleotides in this
bin. (If a bin has zero G or C nucleotides, then we define its skew as 0.)
We make these skews cumulative by, starting from the second-from-the-left
bin, defining this bin's cumulative skew as the sum of its skew and the
cumulative skew of the bin exactly to the left of it. This approach is
derived from (Grigoriev, 1998).
Options:
-c, --contigs PATH FASTA file of contigs for which we will
compute coverage and skew information. All
contigs in this FASTA file should also be
contained in the BAM file; it's ok if the
BAM file contains contigs not in this FASTA
file (we'll ignore them). [required]
-b, --bam PATH Sorted and indexed BAM file representing an
alignment of reads to contigs. [required]
-l, --bin-length INTEGER RANGE Bin length (in bp). [default: 10000; x>=1]
-e, --norm-coverage-epsilon FLOAT RANGE
We clamp all bins' normalized coverages to
within this distance of 1.0. For example,
the default of ϵ = 0.3 means we clamp to the
range [0.7, 1.3]. [default: 0.3; x>0]
-o, --output-dir DIRECTORY Directory to which TSV files describing
contigs' bins, coverages, and skews will be
written. We'll write one TSV file per contig
(named [contig]_covskew.tsv); each bin
within a contig will be described in one row
in its TSV file. [required]
--verbose / --no-verbose Display extra details while running.
[default: no-verbose]
-h, --help Show this message and exit.
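Before running the command, it may help to see the binning, normalization, and cumulative-skew definitions from the help text spelled out in code. The following is just an illustrative Python sketch of those definitions—it is not strainFlye's actual implementation:

```python
# Illustrative sketch of the "dynam covskew" math described in the help text
# above -- not strainFlye's actual implementation.
import statistics

def bin_sequence(length, bin_length=10_000):
    """Return (start, end) 1-indexed inclusive coordinates for each bin."""
    bins = []
    pos = 1
    while pos <= length:
        bins.append((pos, min(pos + bin_length - 1, length)))
        pos += bin_length
    return bins

def normalized_coverages(bin_median_covs, epsilon=0.3):
    """Divide each bin's median coverage by the median-of-medians (M), then clamp."""
    M = statistics.median(bin_median_covs)
    if M == 0:
        # Can't normalize by zero; define all normalized coverages as "NA".
        return ["NA"] * len(bin_median_covs)
    return [min(max(c / M, 1 - epsilon), 1 + epsilon) for c in bin_median_covs]

def cumulative_skews(bin_seqs):
    """Compute (G - C) / (G + C) per bin, then make the skews cumulative."""
    cum = []
    total = 0.0
    for seq in bin_seqs:
        g, c = seq.upper().count("G"), seq.upper().count("C")
        total += (g - c) / (g + c) if (g + c) > 0 else 0.0
        cum.append(total)
    return cum
```

For example, `bin_sequence(23_394)` produces two full 10,000 bp bins plus one smaller rightmost bin of 3,394 bp, matching the binning rule described above.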
!strainFlye dynam covskew \
--contigs /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta \
--bam /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam \
--output-dir /Poppy/mfedarko/sftests/tutorial-output/covskew \
--verbose
Using strainFlye version "0.1.0-dev".
--------
dynam covskew @ 0.00s: Starting...
Input contig file: /Poppy/mfedarko/sftests/tutorial-output/sheepgut_contigs_selected3.fasta
Input BAM file: /Poppy/mfedarko/sftests/tutorial-output/alignment/final.bam
Bin length: 10,000 bp
Normalized coverage epsilon: 0.3
Verbose?: Yes
Output directory: /Poppy/mfedarko/sftests/tutorial-output/covskew
--------
dynam covskew @ 0.00s: Loading and checking FASTA and BAM files...
dynam covskew @ 0.35s: The FASTA file describes 3 contig(s).
dynam covskew @ 0.35s: All of these are included in the BAM file (which has 78,793 reference(s)), with the same lengths.
--------
dynam covskew @ 0.35s: Going through contigs and computing coverage/skew information...
dynam covskew @ 0.35s: On contig edge_1671 (2,153,394 bp) (1 / 3 contigs = 33.33%).
dynam covskew @ 0.35s: Creating 215 bins of length 10,000 bp and 1 smaller bin of length 3,394 bp for contig edge_1671...
dynam covskew @ 231.35s: For contig edge_1671's bins' covs, pre-norm.: min = 4.0x; max = 1,606.0x; median (M) = 1,435.5x.
dynam covskew @ 231.35s: For contig edge_1671's cumulative bin skews: min = 0.0963; max = 4.9053.
dynam covskew @ 231.39s: On contig edge_2358 (2,806,161 bp) (2 / 3 contigs = 66.67%).
dynam covskew @ 231.39s: Creating 280 bins of length 10,000 bp and 1 smaller bin of length 6,161 bp for contig edge_2358...
dynam covskew @ 1,172.61s: For contig edge_2358's bins' covs, pre-norm.: min = 2,427.0x; max = 4,194.0x; median (M) = 2,940.0x.
dynam covskew @ 1,172.61s: For contig edge_2358's cumulative bin skews: min = -2.1523; max = 2.1512.
dynam covskew @ 1,172.67s: On contig edge_6104 (1,289,244 bp) (3 / 3 contigs = 100.00%).
dynam covskew @ 1,172.67s: Creating 128 bins of length 10,000 bp and 1 smaller bin of length 9,244 bp for contig edge_6104...
dynam covskew @ 1,888.12s: For contig edge_6104's bins' covs, pre-norm.: min = 1,455.0x; max = 5,104.5x; median (M) = 4,108.0x.
dynam covskew @ 1,888.12s: For contig edge_6104's cumulative bin skews: min = 0.0137; max = 7.6426.
dynam covskew @ 1,888.15s: Done.
--------
dynam covskew @ 1,888.19s: Done.
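As a quick sanity check on the log above, the reported bin counts follow directly from the contig lengths and the 10,000 bp bin length:

```python
# Check the bin counts reported in the log: each contig gets
# floor(length / bin_length) full bins, plus one smaller bin if there is a
# remainder. Contig lengths are taken from the log output above.
contig_lengths = {"edge_1671": 2_153_394, "edge_2358": 2_806_161, "edge_6104": 1_289_244}
bin_length = 10_000
for name, length in contig_lengths.items():
    full_bins, leftover = divmod(length, bin_length)
    msg = f"{name}: {full_bins} bins of {bin_length:,} bp"
    if leftover:
        msg += f" and 1 smaller bin of {leftover:,} bp"
    print(msg)
```

This reproduces the "215 bins ... and 1 smaller bin of length 3,394 bp" (and so on) messages from the log.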
10.2. Visualize coverage and skew¶
So we've got this information, but what can we do with it? The most obvious answer is that we can plot it without too much effort. Visually, this will let us see if (normalized) coverage and skew are anti-correlated. (You could also imagine performing some sort of correlation analysis / regression between these quantities, but let's keep it simple for now.)
Here, we'll replicate part of Figure S17 in our paper (the exact code that generated that figure is in this notebook). For the sake of simplicity, we'll just do this for a single contig: CAMP, or edge_6104.
# Load the coverage / skew TSV file, and see what the first few lines (bins) in it look like
import pandas as pd

covskew = pd.read_csv("/Poppy/mfedarko/sftests/tutorial-output/covskew/edge_6104_covskew.tsv", sep="\t")
covskew.head()
| | LeftPos_1IndexedInclusive | CenterPos_1Indexed | NormalizedCoverage | CumulativeSkew |
|---|---|---|---|---|
| 0 | 1 | 5000.5 | 1.183057 | 0.013732 |
| 1 | 10001 | 15000.5 | 1.164070 | 0.107924 |
| 2 | 20001 | 25000.5 | 1.180136 | 0.143035 |
| 3 | 30001 | 35000.5 | 1.155794 | 0.265320 |
| 4 | 40001 | 45000.5 | 1.169426 | 0.506219 |
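(As an aside, the `CenterPos_1Indexed` values above appear to be simple bin midpoints: for example, the first bin spans positions 1 through 10,000, so its center is (1 + 10,000) / 2 = 5,000.5. A quick check for the full-length bins shown above:)

```python
# The CenterPos_1Indexed values above appear to be bin midpoints
# (for the full-length 10,000 bp bins, at least; the final, shorter bin
# would use its actual right endpoint instead).
bin_length = 10_000
for left in (1, 10_001, 20_001, 30_001, 40_001):
    print((left + (left + bin_length - 1)) / 2)  # 5000.5, 15000.5, ...
```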
from matplotlib import pyplot

cov_color = "#0760ad"
skew_color = "#ad0707"
fig, ax = pyplot.subplots()
# twinx() lets us plot coverage and skew on the same plot.
ax_skew = ax.twinx()
# Plot coverage and skew with both different colors and line styles -- the use of different line styles
# makes this plot easier to interpret for colorblind readers. The (0, (6, 1)) line style makes the coverage
# line use dense dashes, which matches the version of this plot shown in the strainFlye paper's supplemental
# material; see https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html.
ax.plot(
covskew["CenterPos_1Indexed"], covskew["NormalizedCoverage"], color=cov_color, lw=5, linestyle=(0, (6, 1)),
label="Coverage"
)
ax_skew.plot(
covskew["CenterPos_1Indexed"], covskew["CumulativeSkew"], color=skew_color, lw=5,
label="Skew"
)
# turn off skew gridlines, which get in the way visually.
ax_skew.grid(False)
# plot the lower and upper normalized coverage bounds, for reference
for ybound in [0.7, 1.3]:
ax.axhline(y=ybound, linestyle=":", color=cov_color, lw=3)
ax.set_xlabel("Sequence position", labelpad=12, fontsize=20)
ax.set_ylabel("Normalized Coverage", labelpad=12, fontsize=20, color=cov_color)
ax_skew.set_ylabel("Cumulative GC Skew", labelpad=28, fontsize=20, color=skew_color, rotation=-90)
ax.set_title("Coverage vs. GC skew in CAMP", fontsize=25)
ax.xaxis.set_major_formatter("{x:,.0f}")
# The bin positions are 1-indexed, so let's use 1-indexing on the x-axis to make this clear
ax.set_xticks([1, 2e5, 4e5, 6e5, 8e5, 10e5, 12e5])
ax.tick_params(axis="both", labelsize=13)
ax_skew.tick_params(axis="both", labelsize=13)
# Let's add a legend just to make it clear which line is which (mostly useful for colorblind folks)
fig.legend(loc=(0.625, 0.31), handlelength=10, fontsize=12)
fig.set_size_inches(15, 5)
Nice—we've recreated the top plot in Figure S17 without too much work. The TSV files strainFlye dynam covskew produces for each contig simplify the process of creating these sorts of plots, and thus simplify the process of investigating whether or not a contig / MAG is undergoing replication.
10.3. Computing peak-to-trough ratios (PTRs)¶
The above plot makes it clear that coverage and skew are generally anti-correlated for CAMP, which provides some evidence that this particular MAG might be undergoing replication. If desired, we could take things further and compute the peak-to-trough ratio (PTR) (Korem, Zeevi, Suez et al., 2015) for CAMP to try to gain information about its replication rate.
10.3.1. Some background on PTRs¶
From (Korem, Zeevi, Suez et al., 2015) (emphasis mine):
Summed across the population, the copy number of a DNA region will be higher the closer that region is to the replication origin and, conversely, lower the closer that region is to the terminus (20, 21). Hence, the ratio between DNA copy number near the replication origin and that near the terminus, which we term peak-to-trough ratio (PTR), should reflect the growth rate of the bacterial population. At higher growth rates, a larger fraction of cells undergo DNA replication and more active replication forks are present in each cell (22). This results in a ratio higher than 1:1 between near-origin DNA and near-terminus DNA, thereby providing a quantitative readout of the population growth rate (21).
10.3.2. Computing (rough) PTRs using the output of strainFlye dynam covskew¶
First, we'll make the assumption that the bin with the minimum skew corresponds to the approximate location of the replication origin of CAMP, and that the bin with the maximum skew corresponds to the approximate location of the replication terminus of CAMP.
This is not always a safe assumption to make (skew diagrams are not always perfect indicators of this—see, for example, the skew diagram for Thermotoga petrophila). Also, we should be wary of situations like CAMP where the full genome was incompletely assembled (there are a few other contigs in the same component of the assembly graph in which CAMP is located, as discussed in our paper's Supplemental Material). That said, in this case this is a reasonable assumption—the PTRs we'll compute here are not gospel, and should be interpreted as only rough estimates.
After making this assumption, we'll identify the normalized coverages in the minimum-skew and maximum-skew bins:
# Which bins have the smallest and largest skews?
# The "keep" parameter to nsmallest() and nlargest() can be used to adjust how pandas
# breaks ties (see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nlargest.html),
# but that shouldn't matter too much.
smallest_skew_ncov = covskew.nsmallest(1, ["CumulativeSkew"])["NormalizedCoverage"].iloc[0]
largest_skew_ncov = covskew.nlargest(1, ["CumulativeSkew"])["NormalizedCoverage"].iloc[0]
print(f"The normalized coverage in the bin with the minimum skew is {smallest_skew_ncov:.4f}.")
print(f"The normalized coverage in the bin with the maximum skew is {largest_skew_ncov:.4f}.")
The normalized coverage in the bin with the minimum skew is 1.1831.
The normalized coverage in the bin with the maximum skew is 0.9282.
... And we can compute the PTR as the ratio of normalized coverage in the bin with the smallest skew (corresponding, per our assumption, to the replication origin) to normalized coverage in the bin with the largest skew (corresponding, per our assumption, to the replication terminus).
Note that if either of these normalized coverages has been "clamped" (i.e. it is set to 0.7 or 1.3, given the default -e/--norm-coverage-epsilon parameter for strainFlye dynam covskew), then I would recommend not computing the PTR for this contig—or, if you do compute the PTR, I'd at least recommend being skeptical of it. You could try to limit the impacts of clamping by increasing -e/--norm-coverage-epsilon, although this will make the normalized coverages more susceptible to outlier coverages.
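For example, you could flag values that sit exactly at a clamp boundary before computing the PTR. This `maybe_clamped` helper is hypothetical (it is not part of strainFlye) and assumes the default ϵ = 0.3:

```python
# Hypothetical helper (not part of strainFlye): flag normalized coverages
# sitting exactly at a clamp boundary, which suggests they were clamped.
# Assumes the default epsilon of 0.3, i.e. clamp bounds of 0.7 and 1.3.
def maybe_clamped(ncov, epsilon=0.3, tol=1e-9):
    return abs(ncov - (1 - epsilon)) < tol or abs(ncov - (1 + epsilon)) < tol

# For CAMP, neither bin's normalized coverage was clamped:
print(maybe_clamped(1.1831), maybe_clamped(0.9282))  # False False
```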
camp_ptr = smallest_skew_ncov / largest_skew_ncov
print(f"The PTR for CAMP is {smallest_skew_ncov:.4f} / {largest_skew_ncov:.4f} = {camp_ptr:.4f}.")
The PTR for CAMP is 1.1831 / 0.9282 = 1.2746.
Nice—this matches the PTR we computed for CAMP in our Supplemental Material.
11. Wrapping up¶
That's it for this tutorial! To review, here we used strainFlye to:
- Align reads to contigs (`align`)
- Naïvely call mutations in contigs (`call p-mutation`)
- Estimate the false discovery rates (FDRs) of these mutation calls (`fdr estimate`)
- Fix mutation calls' FDRs to an upper limit (`fdr fix`)
- Identify "hotspot" predicted genes (`spot hot-features`)
- Identify "coldspot" gaps between mutations (`spot cold-gaps`)
- Create and assemble smoothed and virtual reads (`smooth create`, `smooth assemble`)
- Create link graphs (`link nt`, `link graph`)
- Compute codon and amino acid mutation matrices (`matrix count`, `matrix fill`)
- Compare contigs' coverage and skew to get information on growth dynamics (`dynam covskew`)
Hopefully this gives you a taste of the various tasks for which strainFlye can be useful. If you have any questions, feel free to open an issue on strainFlye's GitHub page.