Virtual screening of bioassay data - Pharmahacks Challenge 1

Figures: Coefficient Matrix; MW vs Outcome; PSA vs Outcome; XLogP vs Outcome

Abstract

Drug screening is a necessary step in new drug discovery. However, bringing a drug to market takes an average of 15 years and $800 million. Machine learning (ML) can reduce this burden, since ML models can be used to predict interactions between ligands and a protein of interest. We therefore sought to improve upon existing drug-screening ML models. In our work (Challenge 1), we first used Binary Logistic Regression to pre-process our data and find which descriptors correlated with the outcome, because it can assess multiple independent variables at once against a Boolean outcome (Active or Inactive). Descriptors with a p-value below 0.05 in this analysis were used for modelling. We built our models in Python with pandas, matplotlib, seaborn, statsmodels, and scikit-learn. We used Random Forest with RandomizedSearchCV, XGBoost, SVM, and Naive Bayes with kernel density estimation, and combined them in a stacked model with XGBoost as the strongest learner. We monitored accuracy, F1, and Cohen's kappa scores. We achieved an accuracy of 0.98 and an average F1 score of 0.66 in predicting protein-ligand interactions.

What it does

By combining 4 ML models, we predict whether a given ligand will bind to a protein of interest. We also used hyperparameter tuning to fine-tune each model, which helped us achieve high accuracy.
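As an illustration of the hyperparameter tuning, here is a minimal sketch of a randomized search over max_depth and n_estimators for a Random Forest. The data is synthetic (generated with make_classification), and the candidate values, n_iter, cv, and the F1 scoring choice are assumptions made for the example, not the settings from our run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the bioassay descriptors (not the challenge data).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Candidate values are illustrative assumptions.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of random parameter combinations to try
    scoring="f1",    # F1 is more informative than accuracy on imbalanced classes
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print("Best parameters:", best_params)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy is a design choice for this sketch: with few Active compounds, a model can reach high accuracy while missing most of them.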

How we built it

We built our model using Python and several libraries:

pandas
matplotlib
seaborn
statsmodels
scikit-learn

We used 4 ML models: Random Forest with RandomizedSearchCV, XGBoost, SVM, and Naive Bayes with kernel density.

Our rationale for using our chosen models is mentioned below:

Random Forest - We believed a decision-tree-based algorithm would be ideal. More importantly, the researchers in the paper ran out of memory because they used only 2 gigabytes of heap space on a Windows system. Since we had more powerful computers, we were able to implement it. We also used RandomizedSearchCV to find the max_depth and n_estimators values that work best for our dataset.
XGBoost (gaussian) - It offers both a linear model solver and tree learning algorithms.
Support vector machine - It worked well, as mentioned in the research paper.
Gaussian Naive Bayes with kernel density estimation - It worked moderately well for the researchers in the paper we were given, so we implemented it with kernel density estimation to try to improve it.
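One possible reading of "Naive Bayes with kernel density estimation" is sketched below: each class-conditional feature distribution is estimated with a one-dimensional Gaussian-kernel KDE instead of a single fitted Gaussian. The class name KDENaiveBayes, the bandwidth value, and the toy data are hypothetical and only serve the illustration; this is not necessarily how our model was wired.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


class KDENaiveBayes:
    """Naive Bayes where each per-class, per-feature likelihood is a 1-D KDE (hypothetical helper)."""

    def __init__(self, bandwidth=0.5):
        self.bandwidth = bandwidth

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Log prior of each class from its frequency in the training labels.
        self.log_priors_ = np.log([np.mean(y == c) for c in self.classes_])
        # One KDE per (class, feature) pair: the "naive" independence assumption.
        self.kdes_ = [
            [
                KernelDensity(kernel="gaussian", bandwidth=self.bandwidth).fit(X[y == c][:, [j]])
                for j in range(X.shape[1])
            ]
            for c in self.classes_
        ]
        return self

    def _joint_log_likelihood(self, X):
        X = np.asarray(X, dtype=float)
        columns = []
        for log_prior, class_kdes in zip(self.log_priors_, self.kdes_):
            # Sum of per-feature log densities, plus the class log prior.
            log_lik = sum(kde.score_samples(X[:, [j]]) for j, kde in enumerate(class_kdes))
            columns.append(log_prior + log_lik)
        return np.column_stack(columns)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]


# Toy data: two separated clusters standing in for Inactive (0) and Active (1).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

model = KDENaiveBayes(bandwidth=0.5).fit(X_demo, y_demo)
acc = np.mean(model.predict(X_demo) == y_demo)
print("Training accuracy on toy data:", acc)
```

The bandwidth controls how smooth each estimated density is and would normally be tuned like any other hyperparameter.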

We also used a logistic regression model for preprocessing.

Challenges we ran into

While using Binary Logistic Regression for preprocessing, we ran into some issues with descriptors that contain binary data.
Also, while we were able to predict the outcomes with high accuracy, our f1 scores weren't high enough until we implemented a stacked model.
Since this was our first ML-related hackathon, we had no prior experience and had to explore a lot through the ML/AI resources provided.
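The stacked model mentioned above can be sketched with scikit-learn's StackingClassifier. Several things here are assumptions for the sake of a self-contained example: the data is synthetic, GradientBoostingClassifier stands in for XGBoost so that only scikit-learn is required, plain GaussianNB is used without kernel density estimation, and LogisticRegression is used as the final estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the bioassay descriptors (not the challenge data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ("svm", make_pipeline(StandardScaler(), SVC())),     # SVMs are sensitive to feature scale
    ("nb", GaussianNB()),
]

# The final estimator learns how to weight the base learners' cross-validated predictions.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
stacked_f1 = f1_score(y_test, y_pred)
print("Stacked model F1 on held-out data:", round(stacked_f1, 3))
```

Stacking can help F1 on imbalanced data because the final estimator can lean on whichever base learner separates the minority class best.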

Accomplishments that we're proud of

Our model predicted outcomes with an accuracy of 98%. In general, it produced very few false negatives and even fewer false positives.
In our demo run, we had no false positives and 9 false negatives, giving a cost matrix score of 45, calculated with well-established validation methods (cost of 5 per false negative × 9 false negatives + cost of 1 per false positive × 0 false positives).
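The cost calculation, together with the three scores we monitored, can be sketched as below. The label arrays are made up so that they contain 9 false negatives and 0 false positives like the demo run; the class sizes are hypothetical and the printed accuracy, F1, and kappa values are not our reported results.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    confusion_matrix,
    f1_score,
)

# Hypothetical labels: 20 Active (1) and 480 Inactive (0) compounds.
y_true = np.array([1] * 20 + [0] * 480)
# Predictions with 11 true positives, 9 false negatives, and no false positives.
y_pred = np.array([1] * 11 + [0] * 9 + [0] * 480)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Cost matrix from the write-up: 5 per false negative, 1 per false positive.
COST_FALSE_NEGATIVE = 5
COST_FALSE_POSITIVE = 1
cost = COST_FALSE_NEGATIVE * fn + COST_FALSE_POSITIVE * fp
print("Cost matrix score:", cost)  # 5 * 9 + 1 * 0 = 45

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
```

Weighting false negatives more heavily reflects that missing an active compound is costlier than re-testing an inactive one.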

What we learned

From our work, we understood that protein-ligand interactions involve multiple factors with intricate relationships. Understanding such intricacies requires us to harness the power of machine learning. Our team of a medical science student and 2 software engineering students learned a lot about ML!

What's next for Drug Screening using ML models

We will be working on the shortcomings of our existing model so that it predicts protein-ligand interactions better.
We will build more models that also cover the confirmatory screening test, so that the entire drug discovery process can be optimized.