-
Notifications
You must be signed in to change notification settings - Fork 22
Home
Note, EVidenceModeler (EVM) is no longer being maintained or supported. Please seek alternatives as necessary.
The EVidenceModeler (aka EVM) software combines ab intio gene predictions and protein and transcript alignments into weighted consensus gene structures. EVM provides a flexible and intuitive framework for combining diverse evidence types into a single automated gene structure annotation system.
Inputs to EVM include the genome sequence, gene predictions and alignment data in GFF3 format, and a list of numeric weight values to be applied to each type of evidence. The weights can be configured manually.
Ideally, inputs should include:
- at least 3 different ab initio predictors, such as among: GlimmerHMM, SNAP, Augustus, GeneMarkHMM, and FGeneSH.
- RNA-seq based transcript structures (alignment assemblies from PASA )
- Protein spliced alignments (for example, running miniprot with a relevant protein database )
Docker or Singularity images are provided for running EVM and are the easiest way to run EVM, requiring no additional software installation.
Otherwise, EVM can easily be obtained from the GitHub EVM Release site.
After downloading the software, untar-gunzip at its permanent location, ie. /usr/local/bin This location will be referred to below as $EVM_HOME.
Run 'make' in the software base directory to compile the included ParaFly plugin for parallel command execution.
- To fully leverage high quality transcript alignments, you should download and install PASA - more on use of PASA included below.
EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format. Note the gff3 format for spliced alignments and gene structures are very different - see examples in the included test data and as shown below.
- example of Gene structure gff3 formatting:
Contig1 genemark gene 26023 29114 . + . ID=1.tPRED000024;Name=genemark model 1.m000024
Contig1 genemark mRNA 26023 29114 . + . ID=1.m000024;Parent=1.tPRED000024
Contig1 genemark exon 26023 26402 . + . ID=1.e153;Parent=1.m000024
Contig1 genemark CDS 26023 26402 . + 0 ID=cds_of_1.m000024;Parent=1.m000024
Contig1 genemark exon 26872 26978 . + . ID=1.e154;Parent=1.m000024
Contig1 genemark CDS 26872 26978 . + 2 ID=cds_of_1.m000024;Parent=1.m000024
Contig1 genemark exon 27082 27137 . + . ID=1.e155;Parent=1.m000024
Contig1 genemark CDS 27082 27137 . + 1 ID=cds_of_1.m000024;Parent=1.m000024
Contig1 genemark exon 27227 27475 . + . ID=1.e156;Parent=1.m000024
Contig1 genemark CDS 27227 27475 . + 0 ID=cds_of_1.m000024;Parent=1.m000024
...
Special requirements for EVM GFF3: The structure of the gene prediction format is gene->(mRNA->(exon->cds(?))(+) )(+) . What is meant by this is that more than one mRNA can link to a gene, and each mRNA has one or more exons, each exon having at most one CDS segment but perhaps no CDS segment if it's entirely untranslated (UTR). The child/parent relationships between the features are indicated by the Parent= part in the 9th field. The identifiers for gene's, mRNA's, and exon's need to be unique. The CDS identifier does not need to be unique. Also, if an exon is shared among two different transcripts, be sure to duplicate them and present them as distinct exon features with unique ids. EVM doesn't consider evidence for alternative splicing. Validate your gene predictions gff3 file using the included script: $EVM_BASE_DIR/EvmUtils/gff3_gene_prediction_file_validator.pl . Recommended genefinders for eukaryotic genefinding include GlimmerHMM, SNAP, Augustus, GeneMarkHMM, and FGeneSH.
- example of spliced alignment gff3 formatting:
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 3052 3229 71.35 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 128 294
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 2882 2931 84.00 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 295 344
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 2740 2794 83.64 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 345 399
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 2528 2587 88.33 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 400 459
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 2353 2412 90.00 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 460 519
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 2164 2262 83.00 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 520 619
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 1868 1935 95.59 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 620 687
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 1698 1747 90.00 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 688 737
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 1486 1603 78.81 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 738 855
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 987 1130 69.86 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 856 998
Contig1 gap2-plant_gene_index.11282006.fasta EST_match 179 201 60.87 + . ID=match.gap2.plant_gene_index.11282006.fasta.5;Target=ZMGI:TC311907 999 1021
...
Alignments (EST or protein) provided to EVM should be spliced alignments, and not simple blast results. Spliced alignments will provide intron-aware alignments, with evidence for intron and exon structures. Blast will not do this. Recommended programs for generating spliced alignments of ESTs or proteins include AAT, Exonerate, and GeneWise. In the above format, the link between individual alignment segments of a single alignment chain are implied by all rows sharing the same identifier (ID=''). No parent/child relationships are explicitly indicated here, as is done with the gene prediction formats.
Format converters for popular gene predictors and alignment utilities are available in EVM. If it's missing one, just ask and provide an example.
Larger sample data corresponding to the Rice genome are available and can be downloaded as EVM_sample_data by running 'make large_sample_data' in the EVM software base directory.
The weights file has three columns including the evidence class, type, and weight. The class parameter can be one of the following: ABINITIO_PREDICTION, OTHER_PREDICTION, PROTEIN, or TRANSCRIPT. These are the only input types accepted by EVM currently. An example weight file looks like so (tab-delimited):
ABINITIO_PREDICTION augustus 1
ABINITIO_PREDICTION twinscan 1
ABINITIO_PREDICTION glimmerHMM 1
PROTEIN spliced_protein_alignments 1
PROTEIN genewise_protein_alignments 2
TRANSCRIPT spliced_transcript_alignments 1
TRANSCRIPT PASA_transcript_assemblies 10
OTHER_PREDICTION PASA_transdecoder 5
-
ABINITIO_PREDICTION : gene predictions derived from ab initio gene predictors. These provide evidence for predicted gene structures (introns and exons), coding regions, and intergenic non-coding regions (regions devoid of ab initio predictions). Uses GFF3-gene-structure input formatting.
-
OTHER_PREDICTION : gene predictions derived from transcript or protein alignments. These provide gene structure components (exons and introns), contribute to coding region identification, but DO NOT contribute to intergenic non-coding region identification (unlike the ABINITIO_PREDICTION type). Uses GFF3-gene-structure input formatting.
-
PROTEIN : contributes gene structures (exons and introns) and contributes to coding region identification. Does not contribute to non-coding intergenic region scoring. Uses GFF3-alignment input formatting.
-
TRANSCRIPT : contributes gene structures (exons and introns) but NOT coding or intergenic region scoring. Uses GFF3-alignment input formatting.
The weights can be set intuitively (ie. weight(pasa) >> weight (protein) >= weight(prediction)).
You can choose weights such as:
ab initio predictions, weight = 1
protein alignments, weight = 1
If you have high quality genewise alignments, perhaps weight that between 2 and 5.
alignments of ESTs from related species, weight = 1
PASA alignment assemblies, weight = 10
If you have RNA-Seq data and have reconstructed transcripts using a method such as Stringtie or PASA, you can use TransDecoder to identify likely coding regions and to prepare a GFF3 file corresponding to the corresponding protein-coding gene structures. This GFF3 file (with the genome as the reference coordinate system) should be provided to EVM as the evidence type 'OTHER_PREDICTION'.
If using PASA, this can be done simply by running the ORF extraction utility subsequent to the standard PASA alignment assembly pipeline. -
- The resulting 'pasa.assemblies.fasta.transdecoder.genome.gff3' GFF3 file can be used with EVM as the 'OTHER_PREDICTION' type.
- Also include the pasa alignment assembly gff3 as TRANSCRIPT type.
EVM relies on having high quality evidence as input in order to provide high quality (ideally better) predictions as output. Be sure that the evidence input to EVM is of sufficient quality. This means that your ab initio predictions have high exon accuracy and reasonable gene prediction accuracy. Likewise, if you're providing protein alignments, make sure these alignments look to provide good evidence for underlying homologous gene structures.
How do I do this?
-
Look at your data. Upload into IGV and see how the gene structures agree and differ among the evidence that you're inputting. I find that '.bed' formatted files work best for viewing in IGV. EVidenceModeler comes with a couple of gff3-to-bed conversion utilities. For gene prediction style gff3 inputs, you can convert to bed using: 'EVidenceModeler/EvmUtils/gene_gff3_to_bed.pl'. For alignment gff3 formatting (eg. the TRANSCRIPT and PROTEIN evidence types), you can use 'EVidenceModeler/EvmUtils/transcript_gff3_to_bed.pl'.
-
Examine exon and gene prediction accuracy for your input gene predictions. This would require that you have a gold standard to evaluate accuracy with. One way to put together such a gold standard set for benchmarking is to use BUSCO and identify genes in your genome that represent core orthologs. Using these core ortholog complete gene structures, evaluate your input ab initio predictions and ensure that they are providing high quality evidence. If not, then retraining them is required. If you need to retrain, try creating a training set based on high quality transcriptome (rna-seq) based evidence such as by using this module within PASA.
Now that you have generated all the required inputs to EVM, you are ready to execute the system. EVM execution performs the following steps: partitioning the inputs into smaller more efficiently processed chunks, executing EVM on each of the partitioned data sets (in parallel), and finally combining the outputs. The genome sequences and gff3 files are partitioned based on individual contigs, and large contigs are segmented into smaller overlapping chunks. The final outputs include the EVM gene predictions in GFF3 format and protein and CDS sequences in FASTA format.
Note, you'll likely have multiple different sources of gene predictions and alignments in different files. EVidenceModeler currently only accepts single input files for the different evidence types (unfortunately), and so you'll need to concatenate your predictions and alignments into individual files, respectively. The 2nd column should indicate the evidence type and this should match up identically with what is specified in the weights file.
Run EVM like so:
$EVM_HOME/EVidenceModeler \
--sample_id mySampleID \
--genome genome.fasta \
--gene_predictions gene_predictions.gff3 \
--protein_alignments protein_alignments.gff3 \
--transcript_alignments transcript_alignments.gff3 \
--segmentSize 100000 \
--overlapSize 10000
To reduce memory requirements, the --segmentSize parameter should be set to less than 1 Mb. The --overlapSize should be set to a length at least one or two expected gene lengths, to minimize the likelihood of missing a complete gene structure within any single segment length.
Use -h with this script to examine all the various options. Additional options of interest includes:
--stop_codons :list of stop codons that provide valid stops (default: TAA,TGA,TAG)
For organisms such as Tetrahymena, where only a single stop codon is used as 'stop', you would define that single stop codon with the above option. The others are read thru.
--RECURSE :recurse into long introns to find genes that are nested within introns of other genes
--forwardStrandOnly
--reverseStrandOnly
A separate directory is created under '${mySampleID}.partitions/' for every contig which houses the corresponding contig-specific subset of the data, and additional subdirectories will exist where long contigs were further processed into overlapping chunks.
The raw output (named 'evm.out') provided by EVM for each contig describes the consensus gene structures in a tab-delimited format, listing each exon with the set of evidences that fully support each exon structure. An example gene structure in this raw format is shown below:
# EVM prediction: 80081-81514 orient(+) score(5464) noncoding_equivalent(442) raw_noncoding(2193) offset(1751)
80081 80104 initial+ 1 3 glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80463 80561 internal+ 1 3 genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80656 80853 internal+ 1 3 gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861313,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308,nap-nraa/PIR:C84824/match.nap.nraa.48729919
81026 81170 internal+ 1 1 gap2-Ceres.arab.cdna/32440./match.gap2.Ceres.arab.cdna.24436708,gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861314,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,nap-nraa/GP:20198307/match.nap.nraa.48729935,nap-nraa/PIR:C84824/match.nap.nraa.48729925
81258 81514 terminal+ 2 3 genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
The final aggregated EVM outputs are organized in the working directory as follows:
- ${mySampleID}.EVM.gff3 - EVM gene structure outputs in GFF3-gene-structure format.
- ${mySampleID}.EVM.pep and .cds - protein and CDS sequences for EVM predictions in FASTA format.
- ${mySampleID}.EVM.bed - EVM gene structure outputs in BED format for viewing in IGV.
For most efficient navigation of the gene structures in IGV, first 'bgzip' compress and 'tabix' index the .bed.gz file. View the .bed.gz file in IGV.
Nowadays, RNA-seq is ubiquitous and, if you have transcriptome data across diverse tissues, timepoints, conditions, etc., such that you can detect expression for most genes and hence gene structure info for most genes solely based on transcriptome sequencing data, it may be possible to effectively annotate your genome leveraging RNA-seq alone. However - if genes aren't expressed, they remain hidden within the genome, and ab initio gene prediction coupled with protein alignments may be needed to reveal them!
Our RNA-seq-only approach to gene structure annotation involves EVM and PASA with the following steps:
- First run PASA to generate alignment assemblies (using Stringtie and/or Trinity de novo assembled transcripts from the RNA-seq data)
- Second, run EVM using just two inputs:
- PASA alignment assemblies (in GFF3-alignment format) as TRANSCRIPT type (result from standard PASA alignment assembly pipeline)
- PASA Orf structures (in GFF3-gene-structure format) as OTHER_PREDICTION type (result from PASA Orf construction
The weights file might look like so:
OTHER_PREDICTION transdecoder 1
TRANSCRIPT assembler-PASA 1
- Finally, import the EVM gene structures as your genome annotation into PASA for doing the annotation comparison and annotation update steps to incorporate UTRs and alternative splice variants.
-
The EVM paper: Haas et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 2008, 9:R7doi:10.1186/gb-2008-9-1-r7.
-
Using EVM along with PASA for genome annotation: Approaches to Fungal Genome Annotation, Mycology. 2011 Oct 3; 2(3): 118–141
- Visit and use our Google group: https://groups.google.com/forum/#!forum/evidencemodeler-users for assistance and subscribe for announcements of new releases.
- GINGER: an integrated method for high-accuracy prediction of gene structure in higher eukaryotes at the gene and exon level Takeaki Taniguchi, Miki Okuno, Takahiro Shinoda, Fumiya Kobayashi, Kazuki Takahashi, Hideaki Yuasa, Yuta Nakamura, Hiroyuki Tanaka, Rei Kajitani, Takehiko Itoh