Inspiration

All enzymes are made of one or more chains of amino acids, which determine their structure, behavior, and interactions with other enzymes and molecules. That means it should be possible to predict a protein's function and behavior from its amino acid sequence alone. A model able to perform this task would have many applications. In addition to enzymes from known organisms (which we have from studying their proteomes), there are vast numbers of metagenomic sequences: protein sequence data recovered from environmental samples. Being able to quickly annotate them with function using such a model (i.e., going beyond simple sequence similarity) would be indispensable.

What it does

Automatic annotation of enzyme/protein sequences: classifying each sequence into its enzyme/protein class.

How we built it

Evolutionary Scale Modeling (ESM) embeddings feeding an MLP (neural network) classifier.
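A minimal sketch of this pipeline, with illustrative sizes and random weights (the real system trains the MLP on ESM embeddings of the sequences; the dimensions and class count here are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: the embeddings are 1024-dimensional;
# n_classes stands in for the number of enzyme classes.
emb_dim, hidden, n_classes = 1024, 256, 7

# Toy batch of "embedded" sequences (in the real pipeline each row
# comes from the ESM language model, one vector per amino acid sequence).
X = rng.standard_normal((4, emb_dim)).astype(np.float32)

# One-hidden-layer MLP: embedding -> hidden (ReLU) -> class logits.
W1 = rng.standard_normal((emb_dim, hidden)).astype(np.float32) * 0.02
b1 = np.zeros(hidden, dtype=np.float32)
W2 = rng.standard_normal((hidden, n_classes)).astype(np.float32) * 0.02
b2 = np.zeros(n_classes, dtype=np.float32)

def predict_proba(x):
    h = np.maximum(x @ W1 + b1, 0.0)            # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # softmax over classes

probs = predict_proba(X)
print(probs.shape)  # (4, 7): one class distribution per sequence
```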

Challenges we ran into

The embedder we used is a huge language model that consumes a lot of computing resources. I started building the system on my MacBook Pro with 8 GB of RAM and a basic GPU. As you'll notice in the Feature_Generation module of the submitted system, I had to split the training data into multiple files because RAM usage would peak if I tried to embed all the sequences at once. The training set consisted of 858,777 amino acid sequences, each of which was embedded into a 1024-dimensional feature vector. Running the model became increasingly challenging, and I finally had to move to Google Cloud.
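The workaround above, embedding the sequences chunk by chunk so RAM never holds everything at once, can be sketched like this (`embed_in_chunks` and `dummy_embed` are hypothetical names; the real embedder is the ESM language model, and each chunk was written to its own file):

```python
from typing import Callable, Iterator, List

def embed_in_chunks(
    sequences: List[str],
    embed: Callable[[List[str]], List[List[float]]],
    chunk_size: int = 10_000,
) -> Iterator[List[List[float]]]:
    """Yield embeddings chunk by chunk instead of embedding all
    ~858k sequences at once (which exhausted 8 GB of RAM)."""
    for start in range(0, len(sequences), chunk_size):
        chunk = sequences[start:start + chunk_size]
        yield embed(chunk)  # in practice: persist each chunk to disk

# Usage with a dummy embedder that maps a sequence to a fixed-length vector.
def dummy_embed(seqs):
    return [[float(len(s))] * 4 for s in seqs]  # 4 dims for illustration

seqs = ["MKV", "GAVL", "MKTAYIAK"]
chunks = list(embed_in_chunks(seqs, dummy_embed, chunk_size=2))
print(len(chunks))  # 2 chunks: one of 2 sequences, one of 1
```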

Accomplishments that we're proud of

After a huge amount of training time and computing resources, we built a highly accurate model: around 93.7% accuracy on 253,146 unseen enzyme sequences with an undisclosed label distribution.

What we learned

We started off with a naive approach: using sequence similarity to cluster different classes of enzymes. As we moved further, we learned about various ways of embedding protein sequences that help machine learning models better distinguish the enzyme class to which a particular enzyme belongs. We also learned the efficiency/accuracy trade-off by experimenting with various sequence vectorizers: the basic Seq2Vec is a lightweight model that embeds a protein sequence quickly, whereas prottrans_t5_xl_u50 is a large model that takes much longer to embed but is far more accurate.
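The efficiency side of that trade-off can be measured with a simple timing harness like the sketch below; the embedders here are hypothetical stand-ins (the real comparison was between Seq2Vec and prottrans_t5_xl_u50), and accuracy is measured separately on the downstream classifier:

```python
import time

def time_embedder(name, embed, sequences):
    """Measure wall-clock time for one vectorizer over a batch of
    sequences and report the embedding dimensionality."""
    start = time.perf_counter()
    vectors = embed(sequences)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(vectors)} sequences, "
          f"{len(vectors[0])} dims, {elapsed:.4f}s")
    return vectors

# Stand-in embedders: a small, fast one and a large, slow one.
light = lambda seqs: [[float(len(s))] * 64 for s in seqs]
heavy = lambda seqs: [[float(len(s))] * 1024 for s in seqs]

seqs = ["MKV", "GAVL", "MKTAYIAK"]
v_light = time_embedder("light", light, seqs)
v_heavy = time_embedder("heavy", heavy, seqs)
```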

What's next for Automatic Enzyme Sequence Annotation

In this project, we trained the model on labeled enzyme sequences; however, the same approach can be applied to any kind of protein sequence.
