Inspiration

Inspired by https://doi.org/10.1038/s41598-022-12201-9, as well as other prominent papers on using Graph Neural Networks for machine learning tasks involving proteins.

What it does

A classification model that predicts protein binding sites using an LSTM + GNN.

How it works

An LSTM generates embeddings from the protein sequence, which are combined with the node features from the given dataset. Each protein is then built into a graph with edges between residues within 6 Å of each other (distances calculated manually + from the AF2 dataset). The graph is then passed through a DeeperGCN model.
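The graph construction step can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the function name `contact_edges`, the use of one 3D coordinate per residue, the toy coordinates, and the feature sizes are all assumptions made for the example. It also assumes the embeddings and dataset features are simply concatenated per residue. The output uses the `(2, E)` edge layout that PyTorch Geometric expects for `edge_index`.

```python
import numpy as np

def contact_edges(coords, cutoff=6.0):
    """Return a (2, E) array of directed edges between residues within `cutoff` angstroms."""
    coords = np.asarray(coords, dtype=float)         # (N, 3), one point per residue (assumed)
    diff = coords[:, None, :] - coords[None, :, :]   # (N, N, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)             # (N, N) pairwise distances
    mask = dist <= cutoff
    np.fill_diagonal(mask, False)                    # no self-loops
    src, dst = np.nonzero(mask)
    return np.stack([src, dst])

# Toy protein with 4 residues on a line (hypothetical coordinates, in angstroms).
coords = np.array([[0.0, 0.0, 0.0],
                   [3.8, 0.0, 0.0],
                   [7.6, 0.0, 0.0],
                   [20.0, 0.0, 0.0]])
edge_index = contact_edges(coords)

# Node features: per-residue embedding concatenated with dataset features.
# Both arrays are zero placeholders with made-up widths (8 and 3).
embeddings = np.zeros((4, 8))
dataset_features = np.zeros((4, 3))
x = np.concatenate([embeddings, dataset_features], axis=1)
```

Only neighbouring residues in the toy chain end up connected, and each contact appears in both directions so the graph is effectively undirected. The dense `(N, N)` distance matrix is fine for single proteins but would need a neighbour search for very large structures.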

Technologies used

  • PyTorch Geometric for the graph neural network.
  • bio_embeddings to generate the LSTM embeddings.
  • The AlphaFold2 dataset to get graph structures for all proteins in the dataset.
  • A DeeperGCN model implementation.
  • Trained locally on an RTX 3090 Ti in about 5 minutes.

Results

Approximately 0.83 ROC AUC, 0.49 PR AUC, and 0.22 F1 score on the test set, which is not as good as we had hoped. Training the model for a bit longer resulted in 0.77 ROC AUC, 0.50 PR AUC, and 0.50 F1 score, but we wanted to maximize ROC/PR.
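For context on how these three metrics relate: ROC AUC and PR AUC depend only on how the model ranks residues, while F1 also depends on the decision threshold, so F1 can move a lot even when the ranking barely changes. The sketch below illustrates this with plain NumPy. The labels and scores are made up, the function names are hypothetical, PR AUC is approximated by average precision, and tied scores are not handled specially in `average_precision`. It is not the evaluation code we ran.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """Probability that a random positive scores above a random negative (ties count half)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each positive."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float), kind="stable")
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits].mean()

def f1_at(y_true, y_score, threshold=0.5):
    """F1 score after turning scores into labels with a fixed threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_score, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up example: 3 binding residues out of 8, with low but well-ranked scores.
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_score = [0.05, 0.10, 0.45, 0.20, 0.40, 0.42, 0.15, 0.60]

auc = roc_auc(y_true, y_score)
ap = average_precision(y_true, y_score)
f1_default = f1_at(y_true, y_score, threshold=0.5)
f1_lowered = f1_at(y_true, y_score, threshold=0.35)
```

In this toy case the ranking metrics stay fixed while F1 changes substantially between the two thresholds, so tuning the threshold on a validation set may be one way to raise F1 without retraining.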

Challenges we ran into

  • Creating a graph out of the proteins took surprisingly long to figure out and generate

Is it overcomplicated?

Probably, but it was a great way for us to learn more and gain experience with GNNs and LSTMs.

Will it be outperformed by a simple random forest model?

Probably, but where's the fun in that?

Remarks

Cyclica, please give us a co-op for the fall term. The economy is suffering right now and it's hard to find a co-op. I need a salary to survive. I will work 16 hours a day if I have to, like AFM students.
