Snakemake very slow on large workflow #2354

@fgvieira

Snakemake version
7.30.1

Describe the bug
I am running a Snakemake workflow with one checkpoint and ~40k jobs. It starts well, but after ~12h the Snakemake process starts using a lot of CPU:

3063714 lnc113    20   0  279.1g   1.9g  30616 S  9600   0.2  64660:59 micromamba/envs/snakemake_env/bin/python3.11 micromamba/envs/snakemake_env/bin/snakemake --local-cores 5 --configfile config/config.yaml --rerun-incomplete --keep-going --latency-wait 60 --printshellcmds --reason --max-jobs-per-second 2 --use-conda --log-handler-script scripts/log_handler.py --conda-prefix .cache/snakemake/conda --cores 60 --notemp
 427913 user    20   0       0      0      0 Z   0.0   0.0   0:03.66 [cbc] <defunct>
1270004 user    20   0       0      0      0 Z   0.0   0.0   0:00.00 [bash] <defunct>
1270102 user    20   0       0      0      0 Z   0.0   0.0   0:00.00 [bash] <defunct>

Jobs are also marked as finished in "batches", with up to 1h between updates:

[...]
[Thu Jul 13 10:58:18 2023]
Finished job 37214.
30924 of 39428 steps (78%) done
[Thu Jul 13 10:58:18 2023]
Finished job 3469.
30925 of 39428 steps (78%) done
[Thu Jul 13 12:00:24 2023]
Finished job 24581.
30926 of 39428 steps (78%) done
[Thu Jul 13 12:00:24 2023]
Finished job 24021.
30927 of 39428 steps (78%) done
[...]
[Thu Jul 13 12:01:57 2023]
Finished job 670.
31166 of 39428 steps (79%) done
[Thu Jul 13 12:01:57 2023]
Finished job 22429.
31167 of 39428 steps (79%) done
[Thu Jul 13 12:38:03 2023]
Finished job 1707.
31168 of 39428 steps (79%) done
[Thu Jul 13 12:38:03 2023]
Finished job 30844.
31169 of 39428 steps (79%) done
[...]
[Thu Jul 13 12:38:55 2023]
Finished job 25337.
31340 of 39428 steps (79%) done
[Thu Jul 13 12:38:56 2023]
Finished job 13318.
31341 of 39428 steps (79%) done
[Thu Jul 13 12:54:38 2023]
Finished job 35211.
31342 of 39428 steps (79%) done
[Thu Jul 13 12:54:38 2023]
Finished job 34025.
31343 of 39428 steps (79%) done
[...]
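The stalls between batches can be measured directly from the log. The following is a minimal sketch, assuming the timestamp format shown in the excerpt above and an English locale for day and month names; `find_stalls` and the 5-minute threshold are hypothetical names and values chosen for illustration, not part of Snakemake.

```python
import re
from datetime import datetime, timedelta

# Matches timestamp lines as they appear in the log excerpt,
# e.g. "[Thu Jul 13 10:58:18 2023]".
TIMESTAMP = re.compile(r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]$")


def find_stalls(log_lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) for consecutive timestamps
    that are further apart than `threshold` (an arbitrary cut-off)."""
    stalls = []
    previous = None
    for line in log_lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # job messages and progress lines carry no timestamp
        current = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        if previous is not None and current - previous > threshold:
            stalls.append((previous, current, current - previous))
        previous = current
    return stalls


# A few lines copied from the excerpt above.
sample = """\
[Thu Jul 13 10:58:18 2023]
Finished job 3469.
30925 of 39428 steps (78%) done
[Thu Jul 13 12:00:24 2023]
Finished job 24581.
30926 of 39428 steps (78%) done
""".splitlines()

for before, after, gap in find_stalls(sample):
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: {gap}")
# prints "10:58:18 -> 12:00:24: 1:02:06"
```

Run against the full log, this would list every pause between batches. Gaps that span an elided `[...]` section would need the complete log to be meaningful.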
  • First, I am running Snakemake with a maximum of 60 cores (--cores 60), yet it is using all 96 available cores, leading to a load average of 15433.63, 14465.49, 13850.62.

  • Second, there are some thread exceptions:

Exception in thread Thread-543654:

I am not sure whether this affects the workflow, since it might just be a thread inside the Snakemake process.

  • Third, this seems similar to another reported issue, where the suspected cause is the check for correct job completion. However, should that cost really grow with the size of the workflow? As I see it, it should depend only on the number of currently running jobs.
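To put the load average from the first point in perspective, it can be normalised by the number of cores. This is a rough sketch using only the figures quoted above; `load_per_core` and `current_load_per_core` are hypothetical helper names, and the live check assumes a Unix system where `os.getloadavg()` is available.

```python
import os


def load_per_core(load_average, n_cores):
    """Average number of runnable tasks competing for each core."""
    return load_average / n_cores


def current_load_per_core():
    """Same ratio for the machine this runs on (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


# Figures from the report: 96 cores available, 1-minute load average 15433.63.
print(round(load_per_core(15433.63, 96), 1))  # prints 160.8
```

A value near 1 would mean the cores are just saturated. Roughly 160 runnable tasks per core suggests far more work is queued than either the 60 requested cores or the 96 available ones can serve.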

EDIT: I came across #1374, and switching to the greedy scheduler (--scheduler greedy) seems to fix the issue. So the problem might lie in the scheduler, which, in the presence of many files, takes a long time to choose which job to run next.

EDIT 2: even though the greedy scheduler seems to help, it does not completely fix the issue...

Labels: bug
