Abstract

This project aims to identify the conditions, features, and patterns that indicate sufficient production of human cardiomyocytes (CMs) from hPSCs. Differentiation causes cells to undergo major changes in shape, size, and metabolism. These changes, together with a literature review of known cellular changes during differentiation, guided our choice of which parameters in the dataset were most likely to be biologically relevant. We dropped the preparatory conditions because they were fairly consistent across samples. A second selection round eliminated variables that represent the same changes in cell shape, size, and metabolic status in similar ways, in order to reduce variance.
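As an illustration of the second selection round, one common way to spot variables that carry near-duplicate information is a pairwise correlation filter. The sketch below is an assumption, not the procedure we actually followed (our selection was guided by the literature review); the column names, the example values, and the 0.9 cutoff are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_redundant_features(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical cutoff)."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Made-up feature table: cell_area and cell_perimeter move together,
# while lactate is only weakly related to either.
features = pd.DataFrame({
    "cell_area": [100.0, 120.0, 140.0, 160.0, 180.0],
    "cell_perimeter": [40.0, 44.0, 48.0, 52.0, 56.0],
    "lactate": [2.0, 1.0, 3.0, 1.5, 2.5],
})
reduced = drop_redundant_features(features)
print(list(reduced.columns))
```

A filter like this only flags linear redundancy, so it would complement rather than replace a biologically motivated choice of variables.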

The most reliable classification model was the Random Forest Classifier. We initially framed the problem as a regression: we trained a model to predict the dd10 CM content and then classified the predicted value. This method produced reliable true negatives but was very prone to false negatives. As a result, we switched to treating the problem as classification from the start. We first transformed the y variable into a binary one, assigning it 1 if the CM content was sufficient (greater than or equal to 90%) and 0 otherwise. We then trained the classification model and used it to classify the test data. This method produced more consistent true positives and improved model performance on the metrics we used, namely accuracy, precision, and recall.
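The binary transformation and the three metrics can be sketched in a few lines of plain Python. The 90% cutoff comes from the text above; the function names and the measured and predicted values are made up for illustration.

```python
THRESHOLD = 90.0  # percent CM content counted as "sufficient"


def to_binary_label(cm_content_percent: float) -> int:
    """Return 1 for sufficient CM content (>= 90%), otherwise 0."""
    return 1 if cm_content_percent >= THRESHOLD else 0


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Report 0.0 when the denominator is empty instead of dividing by zero.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up CM content values (percent) and made-up model predictions.
measured = [95.0, 90.0, 72.5, 88.0, 93.0, 60.0]
y_true = [to_binary_label(v) for v in measured]
y_pred = [1, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(y_true, metrics)
```

Note that a value of exactly 90% counts as sufficient, because the comparison is inclusive.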

Inspiration

hiPSC-derived cardiomyocytes have the potential to serve as powerful laboratory models of the heart, to act similarly to transplants, and to open a path to precision medicine. However, controlling hiPSC differentiation specifically into cardiomyocytes still involves many unknowns, which makes their production inefficient. We hope that these prediction models will help scientists home in on the environmental conditions that best promote cardiomyocyte differentiation.

What it does

It predicts whether the cardiomyocyte content will be sufficient or insufficient, based on a given set of conditions and on observed features and patterns.

How we built it

We used Python and pandas for data pre-processing, and then scikit-learn to build the Random Forest Classifier. It was trained on analytics from successful experiments differentiating hiPSCs into cardiomyocytes.
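A minimal sketch of this kind of pipeline is shown below. It does not reproduce our actual code or data: the table is synthetic, the column names (`feature_a`, `feature_b`, `cm_content`) are hypothetical, and the hyperparameters are scikit-learn defaults plus a fixed seed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the experimental table (not real measurements).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
# Made-up relationship: CM content rises with feature_a, plus noise.
cm = 85 + 10 * data["feature_a"] + rng.normal(scale=2, size=n)
data["cm_content"] = cm.clip(0, 100)

# Pre-processing with pandas: select features and binarize the target at 90%.
X = data[["feature_a", "feature_b"]]
y = (data["cm_content"] >= 90).astype(int)

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` in both the split and the model keeps the run reproducible, which makes it easier to compare different feature selections.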

Challenges we ran into

Our initial plan was to use regression analysis to predict a continuous y variable and then apply a function to classify the predicted value. In the best case, the Random Forest Regression model had an accuracy of 72%, but precision and recall were zero (i.e. no true positives). We then reviewed our approach and decided to frame the problem as a classification task instead. We converted the CM content into a binary variable (greater than or equal to 90%, or less than 90%) before using the data to train the model. Among the Random Forest Classifier, Artificial Neural Network, and Gradient Boosting Classifier, the RFC yielded the best accuracy, 88.89%.
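An accuracy of 72% alongside zero precision and recall can look contradictory, so here is a small worked example of how it can arise. The numbers are hypothetical: we assume 100 test samples of which 28 are truly sufficient, and a regressor whose predicted CM content never reaches the 90% cutoff. This is one scenario consistent with the reported metrics, not a reconstruction of our actual test set.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set: 28 sufficient samples (1) and 72 insufficient ones (0).
y_true = [1] * 28 + [0] * 72

# Hypothetical regression output: every predicted CM content stays below 90%.
predicted_percent = [80.0] * 100
y_pred = [int(p >= 90.0) for p in predicted_percent]

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 reports 0 instead of warning when no positives are predicted.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
print(accuracy, precision, recall)
```

In this scenario every correct prediction is a true negative, so accuracy simply mirrors the share of insufficient samples, which is why precision and recall are the more informative metrics here.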

Accomplishments that we're proud of

We are proud to have created a model that is very close to the original paper's accuracy of more than 90%!

What we learned

This weekend was a good opportunity to learn more about how different prediction models work, as well as how different approaches to data analysis compare.

What's next for Heart Searcher

We hope to keep improving the accuracy and to refine the parameters of the prediction model so that they reflect updated literature on stem cell differentiation. We would also like to expand beyond this experimental sample size.
