This repository contains the snakemake pipeline for our benchmarking
analysis of scRNA-seq data integration tools. It is based on the
scib package and the previous
scib-pipeline from the
Theis Lab (Luecken et al., 2020).
Compared to this previous study, our benchmark focuses on (semi-)supervised tools, with the addition of our new version of STACAS. It includes our new integration metric CiLISI, computed with our scIntegrationMetrics R package. It also assesses how robust (semi-)supervised integration tools are to noise, which we introduce by shuffling cell type labels and by providing only partial annotations. Finally, it tests the capacity of (semi-)supervised tools to separate cell types when they are guided with a broader annotation. The main changes compared to the original pipeline are:
- Adding STACAS and semi-supervised STACAS
- Using embedding output (PCA computed on scaled integrated data with Seurat) for R based methods
- Embedding/latent space for integration with a size fixed (e.g. 30, 50) for all tools (reduced space or bottleneck layer of autoencoders)
- Testing noisy and missing cell type labels to guide integration (i.e. partially removing and shuffling cell type labels)
- New batch-correction metric CiLISI (cell type-aware LISI) computed with scIntegrationMetrics
- Packages of integration tools updated
As in the original pipeline, to reproduce the results from this study,
two separate conda environments are needed for python and R operations.
Please make sure you have either
mambaforge or
conda installed on your system to
be able to use the pipeline. We recommend using
mamba, which is also available through
conda, for faster package installations with a smaller memory footprint.
We provide python and R environment YAML files in envs/, together with
an installation script for setting up the correct environments in a
single command, based on the R version you want to use. Our new pipeline
currently only supports R 4.1. Call the script as follows:

```
bash envs/create_conda_environments.sh -r 4.1
```

Once installation is successful, you will have the python environment
scib-pipeline-R<version> and the R environment scib-R<version> that
you must specify in the config file.
| R version | Python environment name | R environment name | Test data config YAML file |
|---|---|---|---|
| 4.1 | scib-pipeline-R4.1 | scib-R4.1 | configs/test_data-R4.1.yml |
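For illustration, the environment names are typically set near the top of the config file. The exact key names below (`py_env`, `r_env`) are assumptions modelled on the original scib-pipeline format; check the example files in configs/ for the authoritative names.

```yaml
# Sketch only: environment keys expected by the pipeline config (assumed names)
py_env: scib-pipeline-R4.1   # python conda environment created above
r_env: scib-R4.1             # R conda environment created above
```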
This repository contains a snakemake pipeline to run integration methods and metrics reproducibly for different data scenarios and preprocessing setups.
The parameters and input files are specified in config files. A
description of the config formats and example files can be found in
configs/. You can use the example config that uses the test data to get
the pipeline running quickly, and then modify a copy of it to work with
your own data.
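As a rough sketch of what such a config contains, in addition to the environment keys shown above it typically lists the integration methods to run and the data scenarios with their batch and label columns. The key names below are assumptions modelled on the original scib-pipeline format; use the examples provided in configs/ as the authoritative reference.

```yaml
# Hypothetical minimal sketch -- start from a copy of the example configs instead.
ROOT: results                    # output directory (assumed key)

METHODS:                         # integration tools to run
  stacas:
    output_type: embed           # assumed: evaluate the embedding output
  harmony:
    output_type: embed

DATA_SCENARIOS:
  my_dataset:                    # arbitrary scenario name
    file: data/my_dataset.h5ad   # input AnnData file with normalized counts
    batch_key: batch             # obs column holding batch labels
    label_key: celltype         # obs column holding cell type annotations
```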
To call the pipeline on the test data, e.g. using R 4.1 to reproduce our benchmarking with the original annotations guiding the supervised tools:
```
snakemake --configfile configs/test_original_annotations-R4.1.yaml -n
```

This gives you an overview of the jobs that will be run. In order to execute these jobs with up to 10 cores, call:
```
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 10
```

We strongly recommend running this snakemake workflow on an HPC cluster, e.g. using SLURM. With the cluster configuration file configs/cluster.yml you can run the workflow as follows:

```
mkdir -p cluster/snakemake/; \
snakemake -j 100 --configfile configs/test_original_annotations-R4.1.yaml \
--cluster-config configs/cluster.yml \
--cluster "sbatch -A {cluster.account} \
-p {cluster.partition} \
-N {cluster.N} \
-t {cluster.time} \
--job-name {cluster.name} \
--mem {cluster.mem} \
--cpus-per-task {cluster.cpus-per-task} \
--output {cluster.output} \
--error {cluster.error}"Then you can generate a table gathering the snakemake benchmark files (cpu time, memory usage...)
Then you can generate a table gathering the snakemake benchmark files (CPU time, memory usage, etc.):

```
snakemake --configfile configs/test_original_annotations-R4.1.yaml --cores 1 benchmarks
```

More snakemake commands can be found in the snakemake documentation.
We provide the config files to reproduce our 3 different analyses, together with the R Markdown notebooks we used to generate the figures from the results of the pipeline, which you can find in the results directory:
| Analysis | config YAML file | Rmarkdown file |
|---|---|---|
| original annotations | test_original_annotations-R4.1.yml | originalAnnotationAnalysis.Rmd |
| robustness to noise | test_supervised_methods-R4.1.yml | SupervisedToolAnalysis.Rmd |
| final benchmark | test_final_benchmark-R4.1.yml | finalBenchmarkAnalysis.Rmd |
Some tools fail to integrate certain tasks. In order to complete the workflow and set the integration metrics to NA for these scenarios, you can use the script integration_fail_file.py as follows:
```
python scripts/integration_fail_file.py -c configs/review_tests.yaml -t Pancreas_rm8 -l unknown_15_shuffled_20 -m seuratrpca -v
```
Tools that are compared include:
- STACAS
- Scanorama
- scANVI
- FastMNN
- scGen
- scVI
- Seurat v4 (CCA and RPCA)
- Harmony
- Benchmarking atlas-level data integration in single-cell genomics. Luecken et al., 2020
- Semi-supervised integration of single-cell transcriptomics data. Andreatta et al., 2023