Cyclica Submission

[Figure: ROC curve]
[Figure: Visualization of the target variable]

💡 Inspiration

As a team of mostly Statistics students, we decided to take on the Cyclica challenge and explore the world of bioinformatics. Along the way we learned about AlphaFold2 protein structures and different binding sites. We chose this challenge because the problem has not yet been fully solved in the real world. The dataset is heavily imbalanced, which caused a range of modelling and computation issues. After many days of model training and data analysis, the genius minds of four produced a one-of-a-kind custom prediction model for the Cyclica data.

🔍 What it does

Our goal is to build a classification model to predict the drug binding sites on AlphaFold2-predicted proteins. We do this by:

Classifying each residue as either a 'drug binding site' or a 'non-binding site'
Listing out the most important features/columns for drug binding

We trained a BalancedRandomForestClassifier on the corresponding features and ran many trials to find the configuration that performed best on the data. We got the following results:

Mean f1 for BRF: 0.195
Mean recall for BRF: 0.753
Mean precision for BRF: 0.112
Mean ROC-AUC for BRF: 0.860
Mean balanced_accuracy for BRF: 0.771
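
To make the relationship between these metrics concrete, here is a minimal sketch of how precision, recall, F1, and balanced accuracy follow from confusion-matrix counts. The counts are hypothetical and chosen only to resemble an imbalanced problem; they are not our data, and the function names are ours for illustration. The last lines check that the harmonic mean of the reported mean precision and mean recall is in line with the reported mean F1. This is only a rough sanity check, since an average of per-fold F1 scores need not equal the F1 of the averaged precision and recall.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def balanced_accuracy(tp, fp, fn, tn):
    """Average of the recall on the positive and on the negative class."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Hypothetical counts for illustration only (not the team's data):
# 1,000 residues, of which 30 are true binding sites.
tp, fn = 24, 6      # binding sites found / missed
fp, tn = 200, 770   # non-binding residues flagged / correctly ignored

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
bal_acc = balanced_accuracy(tp, fp, fn, tn)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} balanced_accuracy={bal_acc:.3f}")

# Rough sanity check: harmonic mean of the reported mean precision and recall.
reported_f1_check = 2 * 0.112 * 0.753 / (0.112 + 0.753)
print(f"harmonic mean of reported precision and recall: {reported_f1_check:.3f}")
```

With many more non-binding than binding residues, even a modest false-positive rate produces far more false positives than true positives, which is why high recall and low precision can occur together.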

We tried many algorithms that were new to us and worked through the entire data science lifecycle to arrive at these results. We found the following features to be the most important for drug binding:

feat_pLDDT
coord_X
coord_Y
coord_Z
feat_BBSASA
feat_SCSASA

With this process, our team produced a model that identifies drug binding sites on AlphaFold2-predicted proteins with high recall (0.753) and ROC-AUC (0.860), at the cost of low precision (0.112). This model could potentially serve as a reference for future research in the pharmaceutical and biochemical industries on drug binding to AlphaFold2-predicted proteins.

⚙️ How it was built

We built our model with the following Python packages:

NumPy
pandas
scikit-learn

Statistical techniques included:

Exploratory data analysis
Data preprocessing
Model selection
Hyperparameter tuning
Cross-validation
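
For an imbalanced dataset, the way cross-validation folds are formed matters, because a plain random split can leave a fold with very few binding sites. Below is a minimal pure-Python sketch of stratified fold assignment, which keeps the class ratio roughly constant in every fold. It is an illustration under our own assumptions (the helper names and the toy labels are hypothetical), not necessarily the exact splitting scheme used in this project; scikit-learn's StratifiedKFold implements the same idea.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign sample indices to n_splits folds, class by class,
    so every fold keeps roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def train_test_splits(labels, n_splits=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, n_splits, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = binding site, 0 = non-binding site (5% positives).
labels = [1] * 50 + [0] * 950

for train_idx, test_idx in train_test_splits(labels, n_splits=5):
    positives = sum(labels[i] for i in test_idx)
    print(f"test size={len(test_idx)}, binding sites in test fold={positives}")
```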

Models tested:

Random Forest
Balanced Random Forest
XGBoost
LightGBM
SVM
Neural Networks
Generalized Linear Models (logistic regression)
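
Of the models listed above, the balanced random forest differs from a plain random forest mainly in how each tree's training sample is drawn: the majority class is under-sampled so that every tree sees a balanced sample. The sketch below shows that sampling step only, in pure Python, with a hypothetical helper name and toy labels. It illustrates the general idea and is not a reproduction of the library's internals, whose details may differ.

```python
import random


def balanced_bootstrap(labels, seed=0):
    """Draw a bootstrap sample with equal counts from each class.

    Each class is sampled with replacement, and the number drawn per
    class equals the size of the smallest (minority) class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_minority = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choices(indices, k=n_minority))
    rng.shuffle(sample)
    return sample


# Hypothetical labels: 30 binding sites among 1,000 residues.
labels = [1] * 30 + [0] * 970
sample = balanced_bootstrap(labels, seed=42)
n_pos = sum(labels[i] for i in sample)
print(f"sample size={len(sample)}, binding={n_pos}, "
      f"non-binding={len(sample) - n_pos}")
```

In a full forest, each tree would be trained on its own such sample (with a different seed), so across many trees most of the majority-class data is still used.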

🚧 Challenges we ran into

Many of us were new to ML, so we had to learn the nuances of algorithms such as SVMs and neural networks while also dealing with an imbalanced dataset. In addition, the features were difficult to interpret because we were unfamiliar with their biochemical context.

Another challenge was the limited computing power available for processing large amounts of data, which prevented us from fine-tuning any of our models. We partially mitigated this by dividing up the computational tasks.

✔️ Accomplishments that we're proud of

For many of us, it was our first time working in a group at a datathon, and we are all proud of what we were able to build during the event. Other accomplishments include:

Learning how to run models using our statistics knowledge
Exploring real data from biological fields
Working collaboratively as a team and teaching each other new concepts

📚 What we learned

While implementing our model, we learned many new concepts, including:

how to run different models on the data
how to test accuracy and other metrics
how to choose the best model for the data

The biggest takeaway from this event is the importance of collaboration and help among peers in putting together a shared vision.