scGenePT: Is language all you need for modeling single-cell perturbations?

Model Description

scGenePT is a collection of single-cell models for perturbation prediction. It builds on the scGPT [1] foundation model for scRNA-seq data by injecting gene-level language embeddings into the model architecture. The language embeddings are obtained by embedding gene-level information from different knowledge sources using LLMs. The knowledge sources include NCBI gene descriptions and UniProt protein summaries for protein-coding genes (as inspired by the GenePT [2] approach), as well as GO (Gene Ontology) gene annotations across three axes: Molecular Function, Biological Process, and Cellular Component.
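As a rough illustration of what injecting a language embedding at the gene level can look like, the sketch below projects a gene's language embedding to the model dimension and adds it to that gene's learned token embedding. The additive combination, the helper names `project` and `inject_language_embedding`, and the toy dimensions are all assumptions made for illustration; the actual architecture is defined in the repository's model code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a (d_model x d_lang) matrix by a d_lang vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def inject_language_embedding(
    gene_token_emb: Vector, language_emb: Vector, projection: Matrix
) -> Vector:
    """Hypothetical helper: add a projected language embedding to a gene token embedding."""
    projected = project(projection, language_emb)
    if len(projected) != len(gene_token_emb):
        raise ValueError("projection output must match the model dimension")
    # Assumption: the two representations are combined by element-wise addition.
    return [g + p for g, p in zip(gene_token_emb, projected)]


# Toy example with d_model = 2 and d_lang = 3.
token_emb = [0.5, -1.0]
lang_emb = [1.0, 2.0, 3.0]
proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
combined = inject_language_embedding(token_emb, lang_emb, proj)
print(combined)
```

In a real model the projection would be a learned layer and the vectors would be tensors; plain lists are used here only to keep the sketch dependency-free.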

📁 Data

All of the data, including pre-computed gene embeddings and trained models, can be found in the public S3 bucket:

s3://czi-scgenept-public/

The data can be accessed through the AWS CLI. In most cases, pip install awscli provides the functionality needed to list and download the files. For other ways of installing the AWS CLI, follow the official documentation.

To download a folder: aws s3 sync --no-sign-request s3://czi-scgenept-public/models/finetuned/scgenept_go_c output_dir

To download a file (aws s3 cp is used here because aws s3 sync operates on folders): aws s3 cp --no-sign-request s3://czi-scgenept-public/models/gene_embeddings/GO_C_gene_embeddings-gpt3.5-ada-concat.pickle output_dir/
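If you prefer to script the downloads, the following minimal sketch builds the corresponding AWS CLI invocation from Python. The helper name `build_download_command` is hypothetical; it chooses `aws s3 sync` for folders and `aws s3 cp` for single objects, and always passes `--no-sign-request` because the bucket is public.

```python
import shlex
from typing import List

BUCKET = "s3://czi-scgenept-public"


def build_download_command(key: str, dest: str, is_folder: bool) -> List[str]:
    """Return the aws CLI argument list for an anonymous download from the public bucket."""
    # `sync` mirrors a whole prefix (folder); `cp` copies a single object.
    action = "sync" if is_folder else "cp"
    source = f"{BUCKET}/{key.lstrip('/')}"
    return ["aws", "s3", action, "--no-sign-request", source, dest]


folder_cmd = build_download_command(
    "models/finetuned/scgenept_go_c", "output_dir", is_folder=True
)
print(shlex.join(folder_cmd))

# To actually run the download (requires the aws CLI and network access):
# import subprocess
# subprocess.run(folder_cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a destination path contains spaces.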

scGenePT Model Zoo

Trained scGenePT Models can be downloaded from this Google Drive link

| Model | Description | Download link (aws s3 sync --no-sign-request [...]) |
| --- | --- | --- |
| scgenept_ncbi | scGPT + NCBI Gene Card Summaries | s3://czi-scgenept-public/models/finetuned/scgenept_ncbi |
| scgenept_ncbi+uniprot | scGPT + NCBI Gene Card Summaries + UniProt Protein Summaries | s3://czi-scgenept-public/models/finetuned/scgenept_ncbi+uniprot |
| scgenept_go_c | scGPT + GO Cellular Components Annotations | s3://czi-scgenept-public/models/finetuned/scgenept_go_c |
| scgenept_go_f | scGPT + GO Molecular Functions Annotations | s3://czi-scgenept-public/models/finetuned/scgenept_go_f |
| scgenept_go_p | scGPT + GO Biological Processes Annotations | s3://czi-scgenept-public/models/finetuned/scgenept_go_p |
| scgenept_go_all | scGPT + GO_F + GO_C + GO_P | s3://czi-scgenept-public/models/finetuned/scgenept_go_all |
| scgpt | scGPT | s3://czi-scgenept-public/models/finetuned/scgpt |

scGPT Pretrained Model

| Pretrained model | Download from | Should be under |
| --- | --- | --- |
| scGPT model weights (whole-human) | scGPT Google Drive Link or s3://czi-scgenept-public/models/pretrained/scgpt | `models/pretrained/scgpt/` containing best_model.pt, args.json, and vocab.json |
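After downloading, a quick check like the one below can confirm that the three expected files are in place. This is a minimal sketch: the helper name `missing_pretrained_files` is hypothetical, and the default directory simply follows the "Should be under" location given above.

```python
from pathlib import Path
from typing import List

# File names listed for the pretrained scGPT model.
EXPECTED_FILES = ("best_model.pt", "args.json", "vocab.json")


def missing_pretrained_files(model_dir: str = "models/pretrained/scgpt") -> List[str]:
    """Return the expected files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_pretrained_files()
if missing:
    print("Missing pretrained scGPT files:", ", ".join(missing))
else:
    print("Pretrained scGPT model folder looks complete.")
```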

Pre-Computed Gene Embeddings
All gene embeddings can be found under s3://czi-scgenept-public/models/gene_embeddings/. You can download all of them at once using: aws s3 sync --no-sign-request s3://czi-scgenept-public/models/gene_embeddings models/gene_embeddings

| Gene embedding | Download from (aws s3 sync --no-sign-request [...]) | Should be under `models/gene_embeddings/` as |
| --- | --- | --- |
| NCBI Gene summaries | GenePT zenodo Link or s3://czi-scgenept-public/models/gene_embeddings/ | NCBI_gene_embeddings-gpt3.5-ada.pickle |
| NCBI Gene summaries + UniProt protein summaries | s3://czi-scgenept-public/models/gene_embeddings/ | NCBI+UniProt_embeddings-gpt3.5-ada.pkl |
| GO Cellular Components Annotations | s3://czi-scgenept-public/models/gene_embeddings/ | GO_C_gene_embeddings-gpt3.5-ada_concat.pickle or GO_C_gene_embeddings-gpt3.5-ada_avg.pickle |
| GO Molecular Function Annotations | s3://czi-scgenept-public/models/gene_embeddings/ | GO_F_gene_embeddings-gpt3.5-ada_concat.pickle or GO_F_gene_embeddings-gpt3.5-ada_avg.pickle |
| GO Biological Processes Annotations | s3://czi-scgenept-public/models/gene_embeddings/ | GO_P_gene_embeddings-gpt3.5-ada_concat.pickle or GO_P_gene_embeddings-gpt3.5-ada_avg.pickle |
| Aggregation of GO-C + GO-F + GO-P | s3://czi-scgenept-public/models/gene_embeddings/ | GO_all_gene_embeddings-gpt3.5-ada_concat.pickle or GO_all_gene_embeddings-gpt3.5-ada_avg.pickle |

The gene annotations can be downloaded from s3://czi-scgenept-public/models/gene_embeddings/gene_annotations
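Once downloaded, the embedding files can be inspected from Python. The sketch below assumes each pickle holds a dictionary mapping gene symbols to embedding vectors; that layout is an assumption, so verify it against the file you download. The helpers `load_gene_embeddings` and `coverage` are hypothetical, and the example runs on a small toy file rather than a real embedding file.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List, Sequence


def load_gene_embeddings(path: str) -> Dict[str, Sequence[float]]:
    """Load a pickled embedding file (assumed layout: gene symbol -> vector)."""
    with open(path, "rb") as handle:
        embeddings = pickle.load(handle)
    if not isinstance(embeddings, dict):
        raise TypeError(f"expected a dict, got {type(embeddings).__name__}")
    return embeddings


def coverage(embeddings: Dict[str, Sequence[float]], genes: List[str]) -> float:
    """Fraction of the requested genes that have an embedding."""
    if not genes:
        return 0.0
    return sum(gene in embeddings for gene in genes) / len(genes)


# Toy file standing in for a real download such as a GO-C embedding pickle.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_embeddings.pickle"
    with open(toy_path, "wb") as handle:
        pickle.dump({"FOSB": [0.1, 0.2], "JUN": [0.3, 0.4]}, handle)
    toy = load_gene_embeddings(str(toy_path))
    print(f"coverage: {coverage(toy, ['FOSB', 'JUN', 'MYC']):.2f}")
```

Because unpickling can execute arbitrary code, only load pickle files from sources you trust.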

📈 Training

Step 1: Download pretrained scGPT model
aws s3 sync --no-sign-request s3://czi-scgenept-public/models/pretrained/scgpt models/pretrained/scgpt

Step 2: Download pre-computed gene Embeddings

scGenePT can use multiple sources for textual gene annotations. The different sources and gene representations are described above, together with the download links. If you are only interested in one type of gene embedding, you only need to download that file.

Example for training a model using the GO-C embeddings: aws s3 cp --no-sign-request s3://czi-scgenept-public/models/gene_embeddings/GO_C_gene_embeddings-gpt3.5-ada-concat.pickle models/gene_embeddings/

Step 3: Environment setup

We highly recommend creating a virtual environment. The models were trained using flash-attn. However, flash-attn can be difficult to install, in which case the models can be trained without it.

conda create -y --name scgenept python=3.10 # or: python3.10 -m venv scgenept
conda activate scgenept # with venv: source scgenept/bin/activate
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install scgpt "flash-attn<1.0.5"

Step 4: Training Data

We use the processed versions of the Adamson and Norman datasets from GEARS. Note that the dataloaders differ between GEARS versions; we trained and evaluated the models on GEARS v0.0.2. The code snippet below is already embedded in the codebase, so no additional work is needed to train on these datasets.

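from gears import PertData

dataset_name = 'norman'  # or 'adamson'
split = 'simulation'  # illustrative value; use the split your experiment requires
batch_size = 64  # illustrative value
val_batch_size = 64  # illustrative value

pert_data = PertData("data/")  # folder where GEARS stores the downloaded data
pert_data.load(data_name=dataset_name)
pert_data.prepare_split(split=split, seed=1)
pert_data.get_dataloader(batch_size=batch_size, test_batch_size=val_batch_size)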
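Because the dataloaders differ between GEARS versions, it can help to check which version is installed before training. The sketch below is one way to do that with the standard library; it assumes GEARS was installed from PyPI under the distribution name `cell-gears`, and the helper `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Version the models were trained and evaluated with.
EXPECTED_GEARS_VERSION = "0.0.2"


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumption: the PyPI distribution name for GEARS is "cell-gears".
found = installed_version("cell-gears")
if found is None:
    print("GEARS does not appear to be installed.")
elif found != EXPECTED_GEARS_VERSION:
    print(f"GEARS {found} is installed; the models used {EXPECTED_GEARS_VERSION}.")
else:
    print(f"GEARS {found} matches the version used for training.")
```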

Step 5: Training script
⚠️ Note that training requires a GPU

python train.py --model-type=scgenept_ncbi+uniprot_gpt --num-epochs=20 --dataset=norman --device=cuda:0

The model-type to train can be passed through the --model-type argument, which can be one of:

| scGenePT | genePT | scGenePT_combined | scGPT |
| --- | --- | --- | --- |
| scgenept_ncbi_gpt | genept_ncbi_gpt | | |
| scgenept_ncbi+uniprot_gpt | genept_ncbi+uniprot_gpt | | |
| scgenept_go_c_gpt | go_c_gpt_concat | scgenept_ncbi+uniprot_gpt_go_c_gpt_concat | |
| scgenept_go_f_gpt | go_f_gpt_concat | scgenept_ncbi+uniprot_gpt_go_f_gpt_concat | scgpt |
| scgenept_go_p_gpt | go_p_gpt_concat | scgenept_ncbi+uniprot_gpt_go_p_gpt_concat | scgpt_counts |
| scgenept_go_all_gpt | go_all_gpt_concat | scgenept_ncbi+uniprot_gpt_go_all_gpt_concat | scgpt_tokens |

More details on model_type can be found in the get_embs_to_include(model_type) function under utils/data_loading.py. For each of the model types, a suffix _no_attention can be added, which means that the model won't use scGPT pre-trained attention. All other training parameters can be found in the script.
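To make the naming scheme concrete, the sketch below validates a model type string, strips the optional `_no_attention` suffix, and assembles the matching `train.py` command line. It is an illustration only and not the repository's `get_embs_to_include` function; the helpers `parse_model_type` and `build_train_command` are hypothetical, and the set of names is copied from the list above.

```python
from typing import List, Tuple

NO_ATTENTION_SUFFIX = "_no_attention"

# Model types listed above, before any optional suffix.
KNOWN_MODEL_TYPES = {
    "scgenept_ncbi_gpt", "scgenept_ncbi+uniprot_gpt",
    "scgenept_go_c_gpt", "scgenept_go_f_gpt",
    "scgenept_go_p_gpt", "scgenept_go_all_gpt",
    "genept_ncbi_gpt", "genept_ncbi+uniprot_gpt",
    "go_c_gpt_concat", "go_f_gpt_concat", "go_p_gpt_concat", "go_all_gpt_concat",
    "scgenept_ncbi+uniprot_gpt_go_c_gpt_concat",
    "scgenept_ncbi+uniprot_gpt_go_f_gpt_concat",
    "scgenept_ncbi+uniprot_gpt_go_p_gpt_concat",
    "scgenept_ncbi+uniprot_gpt_go_all_gpt_concat",
    "scgpt", "scgpt_counts", "scgpt_tokens",
}


def parse_model_type(model_type: str) -> Tuple[str, bool]:
    """Split a model type into (base name, uses pre-trained scGPT attention)."""
    use_attention = not model_type.endswith(NO_ATTENTION_SUFFIX)
    base = model_type if use_attention else model_type[: -len(NO_ATTENTION_SUFFIX)]
    if base not in KNOWN_MODEL_TYPES:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return base, use_attention


def build_train_command(
    model_type: str, dataset: str = "norman", num_epochs: int = 20, device: str = "cuda:0"
) -> List[str]:
    """Assemble the train.py invocation after validating the model type."""
    parse_model_type(model_type)
    return [
        "python", "train.py",
        f"--model-type={model_type}",
        f"--num-epochs={num_epochs}",
        f"--dataset={dataset}",
        f"--device={device}",
    ]


print(parse_model_type("scgenept_go_c_gpt_no_attention"))
print(" ".join(build_train_command("scgenept_ncbi+uniprot_gpt")))
```

Validating the name before launching avoids spending GPU time on a run that would fail on an unrecognized model type.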

📊 Inference

scgenept_tutorial - Tutorial showcasing how to use trained scGenePT models in inference mode for perturbation prediction. It uses models fine-tuned on the Norman dataset and offers examples of predicting post-perturbation expression responses for single and two-gene perturbations.
For inference, we recommend not using flash attention:

python3.10 -m venv scgenept
source scgenept/bin/activate
pip install -r requirements.txt
pip install scgpt

The same tutorial is also available as a Google Colab notebook.

🔖 Cite Us

If you use scGenePT in your analyses, please cite us:

Paper: Istrate, Ana-Maria, Donghui Li, and Theofanis Karaletsos. "scGenePT: Is language all you need for modeling single-cell perturbations?." bioRxiv (2024): 2024-10. bioRxiv Link

@article{istrate2024scgenept,
  title={scGenePT: Is language all you need for modeling single-cell perturbations?},
  author={Istrate, Ana-Maria and Li, Donghui and Karaletsos, Theofanis},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}

⭐ Acknowledgements

We would like to sincerely thank the authors of the following models and packages:

scGPT
GenePT

📎 References

[1] Cui, Haotian, et al. "scGPT: toward building a foundation model for single-cell multi-omics using generative AI." Nature Methods (2024): 1-11. Paper Link | GitHub Repo
[2] Chen, Yiqun, and James Zou. "GenePT: a simple but effective foundation model for genes and cells built from ChatGPT." bioRxiv (2024): 2023-10. Paper Link | GitHub Repo
[3] Roohani, Yusuf, Kexin Huang, and Jure Leskovec. "Predicting transcriptional outcomes of novel multigene perturbations with GEARS." Nature Biotechnology 42.6 (2024): 927-935. Paper Link | GitHub Repo
