Tutorial

Task specific tutorials for various metagenomic analyses. Soon.

Binning mitochondrial reads from skimming data using krepp

To perform any kind of query, you need to either build an index from reference genomes or download a public krepp index from one of our servers. It is quite straightforward to download an index and get started (jump to Downloading an existing index). Alternatively, if you have your own reference mitogenomes, you can index them from scratch (see Indexing mitogenomes). Once you do either of these, the next step is querying your skimming data (e.g., a FASTA or FASTQ file), as we will discuss in detail.

Downloading an existing index

Download a relatively up-to-date mitogenome index by running the below wget command (or curl if wget is not available). Then, untar and decompress it.

wget --no-check-certificate https://ter-trees.ucsd.edu/data/krepp/index_mitochondrion-RefSeq20240913.tar.gz
tar -xzvf index_mitochondrion-RefSeq20240913.tar.gz

The resulting directory index_mitochondrion-RefSeq20240913 stores the index that we can perform queries against. You can jump to the next step.

Indexing mitogenomes

For example, we will use a recent RefSeq release of all available mitochondrial genomes. The very first step is to download all the mitogenomes available on RefSeq as of 2024/09/13, and decompress the downloaded file.

# 2024-09-13
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz
gunzip -d mitochondrion.1.1.genomic.fna.gz

Convert multi-line FASTA to 2-line FASTA for easier processing.

seqtk seq -l0 mitochondrion.1.1.genomic.fna > /tmp/mitochondrion_l0.1.1.genomic.fna && mv /tmp/mitochondrion_l0.1.1.genomic.fna mitochondrion.1.1.genomic.fna

Since krepp requires one reference sequence per file, we have to split concatenated mitogenomes into separate files:

mkdir mitochondrion_genomic_peracc
cd mitochondrion_genomic_peracc
cat ../mitochondrion.1.1.genomic.fna | sed 's/^>/>>/' | awk 'BEGIN{RS=">>";FS="\n"} NR>1{split($0,acc," "); fnme=acc[1]".fasta"; print ">" $0 > fnme; close(fnme);}'
cd ../

Then, list all sequence identifiers in another file and create a mapping between accession IDs and each file corresponding to a reference sequence.

grep "^>" mitochondrion.1.1.genomic.fna | sed 's/^>//' > seqids_mitochondrion.txt
cut -f1 -d' ' seqids_mitochondrion.txt | awk '{print $1"\t./mitochondrion_genomic_peracc/"$1".fasta";}' > input_map-mitochondrion.tsv

This file (input_map-mitochondrion.tsv) is going to be used as the input to krepp.

We are almost there. Now that we have a mapping, we can index mitogenomes by simply running the following krepp index command.

seq 0 7 | xargs -I{} bash -c "krepp index -i input_map-mitochondrion.tsv -o index_mitochondrion-RefSeq20240913 -k 29 -w 32 -h 13 -m 8 -r {} --no-frac --sdust-t 0 --num-threads 24"

Here, index_mitochondrion-RefSeq20240913 is the name of the directory where we store the index. For all other parameters, you can refer to the tutorial in the GitHub repository.

Querying your skimming data

We will use the dist subcommand of krepp via krepp dist. Suppose you have a FASTQ file named xyz.fq and an index in a directory index_mitochondrion-RefSeq20240913. To compute the distance between each read in xyz.fq and all sufficiently close mitogenomes indexed, run the command below.

# Compute distances and filter any mapping above 15% sequence similarity
krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter

This will output estimated distances to stdout in a tab-separated format. You can save this to a file with the -o option (e.g., add -o xyz-distances.tsv to the command above). The first 10 lines would look something like this.

# Print distances to stdout and get the first 10 lines
krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | head

Output:

#software: krepp  #version: v0.5.1  #invocation :krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter
SEQ_ID	REFERENCE_NAME	DIST
AG-359-G18_1-1273	NaN	NaN
AG-359-G18_1-1270	NC_065012.1 Amynthas deogyusanensis mitochondrion	2.00371e-05
AG-359-G18_1-1272	NaN	NaN
AG-359-G18_1-1269	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.0137122
AG-359-G18_1-1268	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.00638889
AG-359-G18_1-1267	NC_065012.1 Amynthas deogyusanensis mitochondrion	0.0163656
AG-359-G18_1-1271	NaN	NaN

The first column is for the read ID, the second column is for the accession ID of the references the corresponding read maps to, and the third column is for the distances between the reads and the mapped references. Notice that we used a threshold of 15% sequence similarity (--dist-max 0.15). This means that any read without a hit below 15% similarity will have NaN in the second and third columns. These are most likely non-mitochondrial reads (e.g., bacterial). You can simply filter such reads using their IDs and move forward with the rest. You may want to change the threshold value (e.g., --dist-max 0.1 for a more aggressive filtering or --dist-max 0.2 for a more conservative filtering), but 15% should do a decent job for this purpose.

To test these commands, I created a toy example of a skimming file consisting of 500 mitochondrial reads (generated from a randomly selected set of mitogenomes) and 500 bacterial reads. When I run the below command, it outputs 500 mitochondrial read IDs, meaning that we filter all bacterial reads and retain all mitochondrial reads, which is the desired output.

krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | grep -v NaN | cut -f1 | uniq

Good luck with your data!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial

Binning mitochondrial reads from skimming data using krepp

Downloading an existing index

Indexing mitogenomes

Querying your skimming data

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Menu

Home

Installation

A toy example for testing

Quickstart with a microbial index

Indexing reference genomes

Available reference indexes

Estimating distances from queries

Placing on a tree

Sketching & sequence distances

Tutorials

Clone this wiki locally