Description
Snakemake version
7.30.1
Describe the bug
I am running a Snakemake workflow with one checkpoint and ~40k jobs, and it all starts well. However, after ~12h it starts using a lot of CPU:
3063714 lnc113 20 0 279.1g 1.9g 30616 S 9600 0.2 64660:59 micromamba/envs/snakemake_env/bin/python3.11 micromamba/envs/snakemake_env/bin/snakemake --local-cores 5 --configfile config/config.yaml --rerun-incomplete --keep-going --latency-wait 60 --printshellcmds --reason --max-jobs-per-second 2 --use-conda --log-handler-script scripts/log_handler.py --conda-prefix .cache/snakemake/conda --cores 60 --notemp
427913 user 20 0 0 0 0 Z 0.0 0.0 0:03.66 [cbc] <defunct>
1270004 user 20 0 0 0 0 Z 0.0 0.0 0:00.00 [bash] <defunct>
1270102 user 20 0 0 0 0 Z 0.0 0.0 0:00.00 [bash] <defunct>
and jobs are marked as finished in "batches" that can take up to 1h to update:
[...]
[Thu Jul 13 10:58:18 2023]
Finished job 37214.
30924 of 39428 steps (78%) done
[Thu Jul 13 10:58:18 2023]
Finished job 3469.
30925 of 39428 steps (78%) done
[Thu Jul 13 12:00:24 2023]
Finished job 24581.
30926 of 39428 steps (78%) done
[Thu Jul 13 12:00:24 2023]
Finished job 24021.
30927 of 39428 steps (78%) done
[...]
[Thu Jul 13 12:01:57 2023]
Finished job 670.
31166 of 39428 steps (79%) done
[Thu Jul 13 12:01:57 2023]
Finished job 22429.
31167 of 39428 steps (79%) done
[Thu Jul 13 12:38:03 2023]
Finished job 1707.
31168 of 39428 steps (79%) done
[Thu Jul 13 12:38:03 2023]
Finished job 30844.
31169 of 39428 steps (79%) done
[...]
[Thu Jul 13 12:38:55 2023]
Finished job 25337.
31340 of 39428 steps (79%) done
[Thu Jul 13 12:38:56 2023]
Finished job 13318.
31341 of 39428 steps (79%) done
[Thu Jul 13 12:54:38 2023]
Finished job 35211.
31342 of 39428 steps (79%) done
[Thu Jul 13 12:54:38 2023]
Finished job 34025.
31343 of 39428 steps (79%) done
[...]
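To quantify these stalls instead of eyeballing the log, the gaps between consecutive timestamps can be extracted with a small script. This is only a sketch: `find_stalls` and `min_gap_seconds` are hypothetical names chosen for illustration, and it assumes the log uses the bracketed timestamp format shown above and that the English day and month abbreviations parse under the current locale.

```python
import re
from datetime import datetime

# Matches bracketed timestamps as they appear in the log above,
# e.g. "[Thu Jul 13 10:58:18 2023]" (format assumed from the excerpt).
TIMESTAMP = re.compile(r"^\[(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\]$")


def find_stalls(lines, min_gap_seconds=600):
    """Return (previous, current, gap_seconds) for each pair of consecutive
    timestamps that are at least min_gap_seconds apart."""
    stalls = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line.strip())
        if not match:
            continue  # skip "Finished job ..." and progress lines
        current = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((previous, current, gap))
        previous = current
    return stalls
```

Calling `find_stalls(open(path).read().splitlines())` on a saved log (where `path` is wherever the Snakemake output was captured) would list every pause longer than ten minutes, which makes it easier to check whether the pauses grow as the workflow progresses.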
- The first issue is that I am running snakemake specifying a maximum of 60 cores, and it is using 96 (all available cores), leading to a load of load average: 15433.63, 14465.49, 13850.62.
- Second, there are some thread exceptions:
Exception in thread Thread-543654:
But I am not sure if it has any effect on the workflow, since it might be a thread in the snakemake process.
- Third, this issue seems to be similar to this, and they suspect it is caused by the checks for correct job completion. However, should the cost of those checks really increase with the size of the workflow? As I see it, it should only depend on the number of currently running jobs.
EDIT: I came across #1374, and changing to the greedy scheduler (--scheduler greedy) seems to fix the issue. So it might be a problem with the scheduler, which, in the presence of a lot of files, takes a long time to choose which job to run next.
EDIT 2: even though the greedy scheduler seems to help, it does not completely fix the issue...