Built With
- python
- tensorflow
UPDATE #1

Introduction (copied from the proposal): Single-cell sequencing has allowed scientists to capture finer-grained patterns in transcriptional data than previous bulk-sequencing technologies. However, the resolution of interest differs by research question: some studies look at tissue-wide patterns, some are interested in transcriptomic changes between different cell types, and some require even finer-grained analysis of continuous cell states within each cell type. Most single-cell clustering and annotation algorithms, however, only yield results at one level of resolution despite the hierarchical nature of cell types. This has been identified as a major point of interest in the "Eleven grand challenges in single-cell data science" paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1926-6.

Moreover, cells often go through continuous developmental trajectories, and their cell type/cell state identity can be transitory. This is usually not captured by clustering algorithms that yield hard assignments to clusters, and the same methods do not quantify the uncertainty in their clustering results. Lastly, researchers often pool transcriptomic data from different experiments, and integrating data from different batches can introduce batch effects (technical noise) that degrade clustering quality.

As a result, I will be working on a deep learning model that performs probabilistic hierarchical clustering on single-cell transcriptomic data and also allows integrating data from different experiments. This is an unsupervised latent representation learning problem. I will address it using variational autoencoders with a latent space modeled by a nested Dirichlet process. While the Dirichlet process keeps the posterior of the latent space as flexible as possible, the nested structure helps learn hierarchical representations.
Each sample fed into the model (corresponding to a cell with transcriptomic data) will be probabilistically assigned to nodes down a "tree" of clusters. The architecture I am proposing has several modifications on top of a variational autoencoder, and I will implement these in steps, going from a simpler model to a more complicated one. I hope to finish implementing them all by the Deep Learning day:
- The latent representations in the VAE will be modeled as a nested Dirichlet distribution in order to capture the hierarchical nature of cell type annotations. This will be the minimum viable implementation: a model that captures the hierarchical nature of cell types and yields probabilistic hierarchical clusters.
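To give a feel for the probabilistic tree assignment, here is a minimal numpy sketch of a truncated stick-breaking construction for Dirichlet-process cluster weights, nested two levels deep. This is only an illustration of the idea, not the actual TensorFlow model; the truncation levels and concentration parameters are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_weights(alpha, num_sticks, rng):
    """Truncated stick-breaking construction of Dirichlet-process weights."""
    betas = rng.beta(1.0, alpha, size=num_sticks)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining
    # Fold the leftover stick mass into the last weight so they sum to 1.
    weights[-1] += 1.0 - weights.sum()
    return weights

# Two-level "tree": root-level clusters, each with its own child-level DP.
alpha_root, alpha_child = 1.0, 1.0   # placeholder concentrations
num_root, num_child = 4, 3           # placeholder truncation levels

root_probs = stick_breaking_weights(alpha_root, num_root, rng)
child_probs = np.stack(
    [stick_breaking_weights(alpha_child, num_child, rng) for _ in range(num_root)]
)

# A cell's probability of landing on leaf (i, j) is the product of the
# probabilities along its path: root cluster i, then child cluster j.
leaf_probs = root_probs[:, None] * child_probs
print(leaf_probs.round(3))
print(leaf_probs.sum())
```

In the real model these weights would come from the learned variational posterior rather than fixed prior draws, so each cell gets its own soft assignment down the tree.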
Challenges: Jointly tuning the neural network hyperparameters and the nested Dirichlet process hyperparameters proved challenging. I added penalty terms to the nested representations to control the depth of the "tree" and how large a cluster can be; this introduced additional hyperparameters to the model, and they significantly affect clustering quality. Currently, I am using an iterative approach: first optimizing the neural network hyperparameters, then optimizing the nested representation hyperparameters, with an upper bound on the number of alternating iterations (10). This seems to work fine so far but is an additional time sink in model training. I have also experimented with model interpretability, borrowing the implementation from "concrete autoencoders", but realized this will not be sufficient.
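The alternating tuning scheme above can be sketched as a simple coordinate search: hold one hyperparameter block fixed, grid-search the other, and swap, for at most 10 rounds. Everything here is a placeholder (the quadratic stand-in objective, the grids, the parameter names), not the actual training loop.

```python
import numpy as np

def objective(nn_param, ndp_param):
    """Placeholder standing in for validation loss of the full model."""
    return (nn_param - 2.0) ** 2 + (ndp_param - 0.5) ** 2

def tune_block(candidates, score_fn):
    """Grid-search one hyperparameter block while the other stays fixed."""
    scores = [score_fn(c) for c in candidates]
    return candidates[int(np.argmin(scores))]

nn_grid = np.linspace(0.0, 4.0, 9)    # stand-in for network hyperparameters
ndp_grid = np.linspace(0.0, 1.0, 11)  # stand-in for nested-DP / penalty weights

nn_best, ndp_best = nn_grid[0], ndp_grid[0]
for _ in range(10):  # upper bound of 10 alternating rounds
    nn_best = tune_block(nn_grid, lambda nn: objective(nn, ndp_best))
    ndp_best = tune_block(ndp_grid, lambda ndp: objective(nn_best, ndp))

print(nn_best, ndp_best)
```

With a real model, each `objective` call is a full training run, which is why capping the number of rounds matters: the cost grows linearly with the number of alternations.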
Insights: The interpretability method based on concrete autoencoders yields some results on which genes are most important to the clustering; however, these are not cluster-specific. I was hoping for cluster-specific results, so the user could tell which genes are overexpressed or underexpressed in particular cell clusters. I am currently looking into backpropagation-based interpretability methods I could use to get cluster-specific interpretability.
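As a toy illustration of the backpropagation-based direction, here is a cluster-specific saliency on a stand-in linear-softmax "cluster head": the gradient of one cluster's assignment probability with respect to the input genes, which has a closed form for this toy head. This is an assumed simplification for illustration, not the project's model, where the gradient would flow through the full encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy "cluster head": linear map from gene expression to cluster logits.
num_genes, num_clusters = 6, 3
W = rng.normal(size=(num_clusters, num_genes))  # placeholder weights
x = rng.normal(size=num_genes)                  # one cell's expression vector

p = softmax(W @ x)  # soft cluster assignment for this cell

def cluster_saliency(k):
    """Gradient of p_k w.r.t. the input genes.

    For a linear-softmax head this is p_k * (W_k - sum_j p_j W_j);
    in a deep model the same quantity comes from backpropagation."""
    return p[k] * (W[k] - p @ W)

for k in range(num_clusters):
    top_gene = int(np.argmax(np.abs(cluster_saliency(k))))
    print(f"cluster {k}: most influential gene index = {top_gene}")
```

Ranking genes by the magnitude (and sign) of this per-cluster gradient is one way to get the overexpressed/underexpressed readout described above.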
Plan: I am confident I can finish the hierarchical clustering implementation, since I already have some acceptable results there; however, the interpretability part is going to take longer than I initially expected. I also have not yet looked into integrating data across batches (different experiments), and I am not sure whether I will get to this stretch goal. Moving forward, I will focus on only one of these stretch goals rather than splitting my time between the two. I will make a decision on this after some literature review on interpretability methods.
Leave feedback in the comments!