Inspiration
Inspired by https://doi.org/10.1038/s41598-022-12201-9, as well as other prominent papers on using Graph Neural Networks for machine learning tasks involving proteins.
What it does
A classification model for protein binding sites that combines an LSTM with a GNN.
How it works
An LSTM generates embeddings from the protein sequence, which are combined with the node features from the given dataset. Each protein is then turned into a graph with edges between residues that are within 6 Å of each other (distances calculated manually and from the AF2 dataset). The graph is then passed through a DeeperGCN model.
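The edge-building step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `build_contact_edges`, the use of one 3D coordinate per residue (for example the C-alpha atom), the inclusive `<=` cutoff, and the toy coordinates are all assumptions. The output follows the two-row source/target layout that PyTorch Geometric uses for `edge_index`.

```python
from math import dist

def build_contact_edges(coords, cutoff=6.0):
    """Return [sources, targets] for residue pairs within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue (assumed: one
    representative atom per residue). Each undirected contact is stored
    in both directions, and self-loops are not added.
    """
    sources, targets = [], []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                sources += [i, j]
                targets += [j, i]
    return [sources, targets]

# Toy example (made-up coordinates): three residues spaced 3.8 angstroms
# apart along a line, plus one residue far away from the rest.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
edge_index = build_contact_edges(coords)
print(edge_index)
```

The double loop is quadratic in the number of residues, which is fine for a sketch; for a large dataset a spatial index or a vectorized distance matrix would likely be faster.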
Technologies used
- Used PyTorch Geometric for the graph neural network.
- Used bio_embeddings to generate LSTM embeddings.
- Used the AlphaFold2 dataset to get graph structures for all proteins in the dataset.
- DeeperGCN model implementation.
- Trained locally on an RTX 3090 Ti in about 5 minutes.
Results
Approximately 0.83 ROC AUC, 0.49 PR AUC, and 0.22 F1 score on the test set, which is not as good as we had hoped. Training the model for a bit longer resulted in 0.77 ROC AUC, 0.50 PR AUC, and 0.50 F1 score, but we wanted to maximize ROC/PR AUC.
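One general reason F1 can move independently of ROC AUC and PR AUC is that F1 is computed at a single decision threshold, while the two AUC metrics summarize the ranking across all thresholds. The sketch below illustrates this with made-up labels and scores, not the project's data; the helper name `f1_at_threshold` is hypothetical, and whether the threshold explains the numbers above was not checked.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 score when every score >= threshold is predicted positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up, imbalanced toy data: 6 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60]

# The scores (and so the ranking-based AUCs) are identical in both calls;
# only the cut-off used to turn scores into predictions changes.
f1_default = f1_at_threshold(labels, scores, 0.5)
f1_lowered = f1_at_threshold(labels, scores, 0.4)
print(f1_default, f1_lowered)
```

With the same scores, lowering the threshold from 0.5 to 0.4 changes F1 on this toy data, so a threshold tuned on a validation set is one thing worth trying before comparing F1 between runs.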
Challenges we ran into
- Creating a graph out of the proteins took surprisingly long to figure out and generate
Is it overcomplicated?
Probably, but it was a great way for us to learn more about GNNs and LSTMs and gain experience with them.
Will it be outperformed by a simple random forest model?
Probably, but where's the fun in that?
Remarks
Cyclica, please give us a co-op for the fall term. The economy is suffering right now and it's hard to find a co-op. I need a salary to survive. I will work 16 hours a day if I have to, like AFM students.