This repository contains the snakemake pipeline for our benchmarking
analysis of scRNA-seq data integration tools. It is based on the
scib package and the previous
scib-pipeline from the
Theis Lab (Luecken et al., 2020).
Compared to this previous study, our benchmark focuses on (semi-)supervised tools, with the addition of our new version of STACAS. It includes our new integration metric CiLISI, computed with our scIntegrationMetrics R package. It also assesses how robust (semi-)supervised integration tools are to noise, which we introduce by shuffling cell type labels and by providing only partial annotations. Finally, it tests the capacity of (semi-)supervised tools to separate cell types when they are guided with a broader annotation. The main changes compared to the original pipeline are:
- Adding STACAS and semi-supervised STACAS
- Using embedding output (PCA computed on scaled integrated data with Seurat) for R based methods
- Embedding/latent space for integration with a size fixed (e.g. 30, 50) for all tools (reduced space or bottleneck layer of autoencoders)
- Testing noisy and missing cell type labels to guide integration (i.e. partially removing and shuffling cell type labels)
- New batch-correction metric CiLISI (cell type-aware LISI) computed with scIntegrationMetrics
- Packages of integration tools updated
As in the original pipeline, to reproduce the results from this study,
two separate conda environments are needed for python and R operations.
Please make sure you have either
mambaforge or
conda installed on your system to
be able to use the pipeline. We recommend using
mamba, which is also available through
conda, for faster package installations with a smaller memory footprint.
We provide python and R environment YAML files in envs/, together with
an installation script for setting up the correct environments in a
single command, based on the R version you want to use. Our new pipeline
currently only supports R 4.1. Call the script as follows:

```
bash envs/create_conda_environments.sh -r 4.1
```

Once installation is successful, you will have the python environment
scib-pipeline-R<version> and the R environment scib-R<version> that
you must specify in the config file.
| R version | Python environment name | R environment name | Test data config YAML file |
|---|---|---|---|
| 4.1 | scib-pipeline-R4.1 | scib-R4.1 | configs/test_data-R4.1.yml |
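For illustration, the environment names are typically set near the top of the config file. The exact key names below (`py_env`, `r_env`) are assumptions modelled on the original scib-pipeline format; check the example files in configs/ for the authoritative names.

```yaml
# Sketch only: environment keys expected by the pipeline config (assumed names)
py_env: scib-pipeline-R4.1   # python conda environment created above
r_env: scib-R4.1             # R conda environment created above
```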
This repository contains a snakemake pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.
The parameters and input files are specified in config files. A
description of the config formats and example files can be found in
configs/. You can use the example config that uses the test data to get
the pipeline running quickly, and then modify a copy of it to work with
your own data.
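As a rough sketch of what such a config contains, in addition to the environment keys shown above it typically lists the integration methods to run and the data scenarios with their batch and label columns. The key names below are assumptions modelled on the original scib-pipeline format; use the examples provided in configs/ as the authoritative reference.

```yaml
# Hypothetical minimal sketch -- start from a copy of the example configs instead.
ROOT: results                    # output directory (assumed key)

METHODS:                         # integration tools to run
  stacas:
    output_type: embed           # assumed: evaluate the embedding output
  harmony:
    output_type: embed

DATA_SCENARIOS:
  my_dataset:                    # arbitrary scenario name
    file: data/my_dataset.h5ad   # input AnnData file with normalized counts
    batch_key: batch             # obs column holding batch labels
    label_key: celltype         # obs column holding cell type annotations
```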
To call the pipeline on the test data, e.g. using R 4.1 to reproduce our benchmarking with the original annotations guiding the supervised tools:
```
snakemake --configfile configs/test_original_annotations-R4.1.yaml -n
```

This gives you an overview of the jobs that will be run. In order to execute these jobs with up to 10 cores, call:
```
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 10
```

We strongly recommend running this snakemake workflow on an HPC cluster, e.g. using SLURM. With the cluster configuration file configs/cluster.yml you can run the workflow as follows:

```
mkdir -p cluster/snakemake/; \
snakemake -j 100 --configfile configs/test_original_annotations-R4.1.yaml \
--cluster-config configs/cluster.yml \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
-N {cluster.N} \
-t {cluster.time} \
--job-name {cluster.name} \
--mem {cluster.mem} \
--cpus-per-task {cluster.cpus-per-task} \
--output {cluster.output} \
--error {cluster.error}"Then you can generate a table gathering the snakemake benchmark files (cpu time, memory usage...)
Then you can generate a table gathering the snakemake benchmark files (CPU time, memory usage, etc.):

```
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 1 benchmarks
```

More snakemake commands can be found in the snakemake documentation.
We provide the config files to reproduce our 3 different analyses, together with the R Markdown notebooks we used to generate the figures from the results of the pipeline, which you can find in the results directory:
| Analysis | config YAML file | Rmarkdown file |
|---|---|---|
| original annotations | test_original_annotations-R4.1.yml | originalAnnotationAnalysis.Rmd |
| robustness to noise | test_supervised_methods-R4.1.yml | SupervisedToolAnalysis.Rmd |
| final benchmark | test_final_benchmark-R4.1.yml | finalBenchmarkAnalysis.Rmd |
Some tools fail to integrate certain tasks. In order to complete the workflow and set the integration metrics to NA for these scenarios, you can use the script integration_fail_file.py as follows:
```
python scripts/integration_fail_file.py -c configs/review_tests.yaml -t Pancreas_rm8 -l unknown_15_shuffled_20 -m seuratrpca -v
```
Tools that are compared include:
- STACAS
- Scanorama
- scANVI
- FastMNN
- scGen
- scVI
- Seurat v4 (CCA and RPCA)
- Harmony
- Benchmarking atlas-level data integration in single-cell genomics. Luecken et al., 2020
- Semi-supervised integration of single-cell transcriptomics data. Andreatta et al., 2023