💡 Inspiration
Teaming up as mainly Statistics students, we decided to take on the Cyclica challenge and explore the world of bioinformatics. We learned about AlphaFold2 protein structures and drug binding sites. We chose this challenge because the problem hasn't been fully solved in the real world yet. We ran into various modelling and computation issues given how heavily imbalanced the dataset was, but after many days of model training and data analysis, our team of four produced a customized prediction model for the Cyclica data.
🔍 What it does
Our goal is to build a classification model to predict the drug binding sites on AlphaFold2-predicted proteins. We do this by:
- Classifying each residue as either a 'drug binding site' or a 'non-binding site'
- Listing out the most important features/columns for drug binding
We trained a BalancedRandomForestClassifier on the available features, iterating many times to find the configuration that performed best on the data. We got the following results for the Balanced Random Forest (BRF):
- Mean F1 for BRF: 0.195
- Mean recall for BRF: 0.753
- Mean precision for BRF: 0.112
- Mean ROC-AUC for BRF: 0.860
- Mean balanced accuracy for BRF: 0.771
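The evaluation above can be sketched as follows. Since the Cyclica residue data isn't included here, this uses a synthetic imbalanced dataset, and scikit-learn's plain RandomForestClassifier with balanced class weights stands in for imbalanced-learn's BalancedRandomForestClassifier; the numbers it prints will not match ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the residue feature table: heavily imbalanced,
# with ~5% positive ("binding site") labels, like the challenge data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stand-in for imbalanced-learn's BalancedRandomForestClassifier:
# a random forest that reweights classes within each bootstrap sample.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced_subsample", random_state=0
)

# Cross-validate with the same metrics we reported.
metrics = ["f1", "recall", "precision", "roc_auc", "balanced_accuracy"]
scores = cross_validate(clf, X, y, cv=5, scoring=metrics)
for m in metrics:
    print(f"Mean {m}: {scores['test_' + m].mean():.3f}")
```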
We tried many algorithms and went through an entire data science lifecycle to arrive at these results. We found the following features to be most important for drug binding:
- feat_pLDDT
- coord_X
- coord_Y
- coord_Z
- feat_BBSASA
- feat_SCSASA
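A ranking like the one above can be read directly off a fitted random forest via its impurity-based importance scores. A minimal sketch on synthetic data, using the column names above purely as illustrative labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names mirroring the dataset's features (labels only; the
# underlying data here is synthetic, not the Cyclica table).
feature_names = ["feat_pLDDT", "coord_X", "coord_Y", "coord_Z",
                 "feat_BBSASA", "feat_SCSASA"]

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Sort features by the forest's impurity-based importances (they sum to 1).
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```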
With this process, our team successfully produced strong predictions of drug binding sites on AlphaFold2-predicted proteins. This model could potentially serve as a reference for future pharmaceutical and biochemistry research on drug binding to AlphaFold2-predicted proteins.
⚙️ How it was built
We built our model with the following Python packages:
- NumPy
- pandas
- scikit-learn
Statistical techniques include:
- Exploratory data analysis
- Data preprocessing
- Model selection
- Hyperparameter tuning
- Cross-validation
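The model selection and hyperparameter tuning steps above can be sketched with scikit-learn's GridSearchCV, which cross-validates every combination in a parameter grid. The grid and scoring metric below are illustrative assumptions, not the actual settings we tried:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the real residue features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hypothetical grid; the datathon grids were larger and aren't listed here.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

# 3-fold cross-validation over every grid combination, scored by F1.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=3, scoring="f1",
)
search.fit(X, y)
print(search.best_params_)
```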
Models tested:
- Random Forest
- Balanced Random Forest
- XGBoost
- LightGBM
- SVM
- Neural Networks
- Generalized Linear Models (logistic regression)
🚧 Challenges we ran into
Many of us were new to ML, so we had to learn the nuances of certain algorithms like SVM and neural networks, as well as deal with an imbalanced dataset. In addition, it was difficult for us to understand the different features because we were unfamiliar with their biochemical context.
Another challenge our team faced was the lack of computational power when processing large amounts of data, which limited how much we could fine-tune our models. We partially mitigated this issue by dividing computational tasks among team members.
✔️ Accomplishments that we're proud of
For many of us, it was our first time working in a group environment at a datathon, and we are all proud of what we were able to build during the event. Other accomplishments include:
- Applying our statistics knowledge to build and run models
- Exploring real data from the biological field
- Working together collaboratively as a team and teaching each other new concepts
📚 What we learned
Through the implementation of our model, we learned many new concepts such as, but not limited to:
- how to run different models on the data
- how to evaluate accuracy and other metrics
- how to choose the best model for the data

The biggest takeaway from this event is the importance of collaboration and helping one another to realize a shared vision.
Built With
- numpy
- pandas
- python
- scikit-learn
- tensorflow