Database instead of file-based logging and bookkeeping? Performance issues on HPC with many Snakemake users and millions of meta files #2969

@tamasgal

Description

We started to use Snakemake heavily in our experiment and are teaching students to do even small scientific analyses in a nice, reproducible way with the help of this tool. It works great, but we have big issues with our HPC clusters...

We process large amounts of data (O(100 TB)) and a large number of files (tens of millions), and Snakemake creates an extremely large number of meta files in its `.snakemake` folders. The biggest problem is the impact on the I/O performance of the HPC storage systems, which are often optimised for large files; the bottleneck is usually IOPS, since many users act at the same time. We basically constantly fight with non-responsive file systems (not even `ls` works), caused by the Snakemake files and the way Snakemake accesses them.
This is not tied to a specific HPC cluster; we face the same issues everywhere around the world.

Describe the solution you'd like
We already use options like shallow shadow and try to force Snakemake to put files on local storage, but the possibilities are limited since worker nodes need access to logs etc. I am wondering how much work it would be to allow the use of e.g. a database backend for metadata and log files: SQLite or some other serverless solution, which would heavily reduce the number of files.
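To illustrate the idea, here is a minimal sketch of what such a backend could look like: instead of one small JSON file per tracked output under `.snakemake/metadata/`, all records go into a single SQLite database, so the shared file system sees one file (plus SQLite's journal) rather than millions. The table layout, function names, and record fields below are illustrative assumptions, not Snakemake's actual schema or API.

```python
import json
import sqlite3

# Hypothetical sketch of a consolidated metadata store: one SQLite file
# replaces millions of tiny per-output JSON files. Schema and field
# names are assumptions for illustration only.

def open_store(path):
    """Open (or create) the metadata database at `path`."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "  output_path TEXT PRIMARY KEY,"
        "  record      TEXT NOT NULL"  # JSON blob, mirrors the per-file content
        ")"
    )
    return con

def store_record(con, output_path, record):
    """Upsert the metadata record for one workflow output."""
    con.execute(
        "INSERT OR REPLACE INTO metadata (output_path, record) VALUES (?, ?)",
        (output_path, json.dumps(record)),
    )
    con.commit()

def load_record(con, output_path):
    """Return the stored record as a dict, or None if absent."""
    row = con.execute(
        "SELECT record FROM metadata WHERE output_path = ?", (output_path,)
    ).fetchone()
    return json.loads(row[0]) if row else None

con = open_store(":memory:")  # a real deployment would use a file on disk
store_record(con, "results/sample1.h5", {"rule": "process", "version": "8.0"})
print(load_record(con, "results/sample1.h5")["rule"])  # → process
```

One caveat for HPC: SQLite's locking over network file systems (NFS, Lustre) is known to be fragile with concurrent writers, so a per-workflow database on local or node-attached storage, or WAL mode, would likely be needed.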

Describe alternatives you've considered
I would be happy to hear alternatives from Snakemake experts, maybe even pointers to existing solutions.

Metadata

Labels

enhancement (New feature or request)

Projects

Status

Done
