-
Notifications
You must be signed in to change notification settings - Fork 3
Tutorial
Task specific tutorials for various metagenomic analyses. Soon.
To perform any kind of query, you need to either build an index from reference genomes or download a public krepp index from one of our servers. It is quite straightforward to download an index and get started (jump to Downloading an existing index). Alternatively, if you have your own reference mitogenomes, you can index them from scratch (see Indexing mitogenomes). Once you do either of these, the next step is querying your skimming data (e.g., a FASTA or FASTQ file), as we will discuss in detail.
Download a relatively up-to-date mitogenome index by running the below wget command (or curl if wget is not available).
Then, untar and decompress it.
wget --no-check-certificate https://ter-trees.ucsd.edu/data/krepp/index_mitochondrion-RefSeq20240913.tar.gz
tar -xzvf index_mitochondrion-RefSeq20240913.tar.gzThe resulting directory index_mitochondrion-RefSeq20240913 stores the index that we can perform queries against.
You can jump to the next step.
For example, we will use a recent RefSeq release of all available mitochondrial genomes. The very first step is to download all the mitogenomes available on RefSeq as of 2024/09/13, and decompress the downloaded file.
# 2024-09-13
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/mitochondrion/mitochondrion.1.1.genomic.fna.gz
gunzip -d mitochondrion.1.1.genomic.fna.gzConvert multi-line FASTA to 2-line FASTA for easier processing.
seqtk seq -l0 mitochondrion.1.1.genomic.fna > /tmp/mitochondrion_l0.1.1.genomic.fna && mv /tmp/mitochondrion_l0.1.1.genomic.fna mitochondrion.1.1.genomic.fnaSince krepp requires one reference sequence per file, we have to split concatenated mitogenomes into separate files:
mkdir mitochondrion_genomic_peracc
cd mitochondrion_genomic_peracc
cat ../mitochondrion.1.1.genomic.fna | sed 's/^>/>>/' | awk 'BEGIN{RS=">>";FS="\n"} NR>1{split($0,acc," "); fnme=acc[1]".fasta"; print ">" $0 > fnme; close(fnme);}'
cd ../Then, list all sequence identifiers in another file and create a mapping between accession IDs and each file corresponding to a reference sequence.
grep "^>" mitochondrion.1.1.genomic.fna | sed 's/^>//' > seqids_mitochondrion.txt
cut -f1 -d' ' seqids_mitochondrion.txt | awk '{print $1"\t./mitochondrion_genomic_peracc/"$1".fasta";}' > input_map-mitochondrion.tsvThis file (input_map-mitochondrion.tsv) is going to be used as the input to krepp.
We are almost there. Now that we have a mapping, we can index mitogenomes by simply running the following krepp index command.
seq 0 7 | xargs -I{} bash -c "krepp index -i input_map-mitochondrion.tsv -o index_mitochondrion-RefSeq20240913 -k 29 -w 32 -h 13 -m 8 -r {} --no-frac --sdust-t 0 --num-threads 24"Here, index_mitochondrion-RefSeq20240913 is the name of the directory where we store the index.
For all other parameters, you can refer to the tutorial in the GitHub repository.
We will use the dist subcommand of krepp via krepp dist.
Suppose you have a FASTQ file named xyz.fq and an index in a directory index_mitochondrion-RefSeq20240913.
To compute the distance between each read in xyz.fq and all sufficiently close mitogenomes indexed, run the command below.
# Compute distances and filter any mapping above 15% sequence similarity
krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filterThis will output estimated distances to stdout in a tab-separated format.
You can save this to a file with the -o option (e.g., add -o xyz-distances.tsv to the command above).
The first 10 lines would look something like this.
# Print distances to stdout and get the first 10 lines
krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | headOutput:
#software: krepp #version: v0.5.1 #invocation :krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter
SEQ_ID REFERENCE_NAME DIST
AG-359-G18_1-1273 NaN NaN
AG-359-G18_1-1270 NC_065012.1 Amynthas deogyusanensis mitochondrion 2.00371e-05
AG-359-G18_1-1272 NaN NaN
AG-359-G18_1-1269 NC_065012.1 Amynthas deogyusanensis mitochondrion 0.0137122
AG-359-G18_1-1268 NC_065012.1 Amynthas deogyusanensis mitochondrion 0.00638889
AG-359-G18_1-1267 NC_065012.1 Amynthas deogyusanensis mitochondrion 0.0163656
AG-359-G18_1-1271 NaN NaN
The first column is for the read ID, the second column is for the accession ID of the references the corresponding read maps to, and the third column is for the distances between the reads and the mapped references.
Notice that we used a threshold of 15% sequence similarity (--dist-max 0.15).
This means that any read without a hit below 15% similarity will have NaN in the second and third columns.
These are most likely non-mitochondrial reads (e.g., bacterial).
You can simply filter such reads using their IDs and move forward with the rest.
You may want to change the threshold value (e.g., --dist-max 0.1 for a more aggressive filtering or --dist-max 0.2 for a more conservative filtering), but 15% should do a decent job for this purpose.
To test these commands, I created a toy example of a skimming file consisting of 500 mitochondrial reads (generated from a randomly selected set of mitogenomes) and 500 bacterial reads. When I run the below command, it outputs 500 mitochondrial read IDs, meaning that we filter all bacterial reads and retain all mitochondrial reads, which is the desired output.
krepp dist -i index_mitochondrion-RefSeq20240913 -q xyz.fq --dist-max 0.15 --filter | grep -v NaN | cut -f1 | uniqGood luck with your data!