fix: use pulp internal timelimit, instead of stopit #2938
fix: use pulp internal timelimit, instead of stopit #2938johanneskoester merged 19 commits intosnakemake:mainfrom
Conversation
snakemake/scheduler.py
Outdated
| # Subsample jobs to be run (to speedup solver) | ||
| n_total_needrun = len(needrun) | ||
| solver_max_jobs = int(os.environ.get("SNAKEMAKE_SOLVER_MAX_JOBS", 1000)) | ||
| if n_total_needrun > solver_max_jobs: | ||
| import random | ||
| needrun = set(random.sample(tuple(needrun), k=solver_max_jobs)) | ||
|
|
There was a problem hiding this comment.
Mhm, interesting idea. We already have a 10s timeout for the ILP solver, which leads to falling back to the greedy solver that should be very fast. The stalling that people observe, does it happen because this fallback does not work, or is the greedy solver too slow as well to solve those instances?
There was a problem hiding this comment.
The timeout is checked in line 602 (609 in this PR).
|
Some cases are quite old, but it seems to be mostly that the timeout was not triggered (and changing to |
Could be that the timeout was introduced after those issues. I would vote for the following:
|
|
Maybe the issue is with the timeout. I am running a workflow locally now and, even though there are no jobs running, no new jobs are being launched. If I look on If I kill it, it says: and starts launching jobs again. |
WalkthroughThe recent changes involve a transition in the Snakemake codebase to utilize Changes
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
stopit, the package we use for the timeout cannot time while the GIL is hold (https://pypi.org/project/stopit/#threading-based-resources), so that it might wait for the solving to be completed. This suprises me though because I tested this before and it worked fine. One option to try is to use a different kind of timeout instead that can handle this without requiring the GIL to be released. Pypi lists a lot of packages. |
|
I can try using another package. I am considering either wrapt_timeout_decorator or async-timeout. Any preference? |
4f6eb6c to
cfea0b6
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (2)
- snakemake/path_modifier.py (1 hunks)
- snakemake/scheduler.py (2 hunks)
Additional context used
Ruff
snakemake/scheduler.py
264-264: Trailing comma missing
Add trailing comma
(COM812)
279-279: Trailing comma missing
Add trailing comma
(COM812)
Additional comments not posted (2)
snakemake/path_modifier.py (1)
Line range hint
13-13:
Consider the impact of removing debug logging.The removal of the debug logging statement in the
modifymethod reduces verbosity, which can improve performance. However, ensure that this does not hinder debugging efforts, especially in scenarios where path modification issues are common.Tools
Ruff
36-36: Missing return type annotation for public function
modify(ANN201)
36-36: Missing type annotation for
selfin method(ANN101)
36-36: Missing type annotation for function argument
path(ANN001)
36-36: Missing type annotation for function argument
property(ANN001)
snakemake/scheduler.py (1)
261-270: Ensure correctness of subsampling logic.The subsampling mechanism uses
random.sample()to limit the number of jobs. Verify that the environment variableSNAKEMAKE_SOLVER_MAX_JOBSis correctly set and that the subsampling logic does not inadvertently exclude critical jobs.Tools
Ruff
264-264: Trailing comma missing
Add trailing comma
(COM812)
There was a problem hiding this comment.
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (3)
- setup.cfg (1 hunks)
- snakemake/scheduler.py (3 hunks)
- test-environment.yml (1 hunks)
Additional context used
Ruff
snakemake/scheduler.py
617-617: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
Additional comments not posted (4)
test-environment.yml (1)
10-10: Dependency update approved.The replacement of
stopitwithasync-timeoutaligns with the transition to asynchronous programming. This change is appropriate for enhancing non-blocking operations.setup.cfg (1)
58-58: Dependency update approved.The replacement of
stopitwithasync-timeoutis consistent with the shift towards asynchronous programming. This update should enhance timeout handling.snakemake/scheduler.py (2)
495-495: Update import statements for asynchronous timeout handling.The import of
timeoutandTimeoutErrorfromasync_timeoutreplaces the previousstopitimports. This change is necessary for transitioning to asynchronous timeout management.
615-617: Approve asynchronous timeout handling inget_temp_sizes_gb.The use of
async with timeout(10)effectively handles timeouts in an asynchronous context, aligning with the PR's objectives.Tools
Ruff
617-617: Local variable
eis assigned to but never usedRemove assignment to unused variable
e(F841)
There was a problem hiding this comment.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (1)
- snakemake/scheduler.py (3 hunks)
Additional context used
Ruff
snakemake/scheduler.py
616-616: Missing return type annotation for private function
_solve_ilp_timeoutAdd return type annotation:
None(ANN202)
616-616: Missing type annotation for function argument
self(ANN001)
616-616: Missing type annotation for function argument
prob(ANN001)
616-616: Missing type annotation for function argument
secs(ANN001)
624-624: Trailing comma missing
Add trailing comma
(COM812)
Additional comments not posted (1)
snakemake/scheduler.py (1)
495-495: LGTM!The transition to asynchronous timeout handling using
async_timeoutis well-implemented.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (3)
- setup.cfg (1 hunks)
- snakemake/scheduler.py (4 hunks)
- test-environment.yml (1 hunks)
Additional context used
Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not suggest to add trailing commas.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Ruff
snakemake/scheduler.py
495-495:
wrapt_timeout_decorator.timeoutimported but unusedRemove unused import:
wrapt_timeout_decorator.timeout(F401)
643-643: Undefined name
timeout(F821)
Additional comments not posted (3)
test-environment.yml (1)
10-10: Dependency change approved.The replacement of
stopitwithwrapt_timeout_decoratoraligns with the PR objectives to improve timeout management. Ensure that the new package is compatible with the rest of the dependencies.setup.cfg (1)
58-58: Dependency change approved.The replacement of
stopitwithwrapt_timeout_decoratoraligns with the PR objectives to improve timeout management. Ensure that the new package is compatible with the rest of the dependencies.snakemake/scheduler.py (1)
615-622: Exception handling approved.The transition from
TimeoutExceptiontoTimeoutErroraligns with the new asynchronous timeout strategy. Ensure that all related code paths are tested.
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Files selected for processing (2)
- snakemake/scheduler.py (4 hunks)
- test-environment.yml (2 hunks)
Additional context used
Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not suggest to add trailing commas.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Additional comments not posted (4)
test-environment.yml (1)
71-71: Dependency update approved.The replacement of
stopitwithwrapt_timeout_decoratoraligns with the PR objectives to improve timeout management.snakemake/scheduler.py (3)
642-643: Import change approved.The import of
wrapt_timeout_decoratoris appropriate for the new asynchronous timeout management strategy.
614-618: Exception handling update approved.The update to handle
TimeoutErroris consistent with the use ofwrapt_timeout_decorator.
644-644: Decorator usage approved.The
@timeout(10)decorator is correctly applied to enforce a timeout on the_solve_ilpmethod.
wrapt seems like a good choice, as async-timeout maybe makes things unnecessarily complex because it is async, also it might have the same GIL issues. |
|
@johanneskoester I looked a bit more into the |
There was a problem hiding this comment.
Actionable comments posted: 0
Outside diff range, codebase verification and nitpick comments (2)
snakemake/scheduler.py (2)
613-623: Consider parameterizing the timeout value.The timeout value of 10 seconds is hardcoded in the call to
_solve_ilp. Consider making this a configurable parameter to allow for flexibility in different scenarios.- status = self._solve_ilp(prob, time_limit=10) + status = self._solve_ilp(prob, time_limit=self.workflow.scheduling_settings.ilp_timeout)Ensure that
ilp_timeoutis defined in the scheduling settings.
Line range hint
640-663: Improve handling of thePATHenvironment variable.The manipulation of the
PATHenvironment variable could be made safer and clearer by using a context manager to ensure it is restored even if an exception occurs.from contextlib import contextmanager @contextmanager def temporary_path_prepend(path): old_path = os.environ["PATH"] os.environ["PATH"] = f"{path}:{old_path}" try: yield finally: os.environ["PATH"] = old_path # Usage in _solve_ilp with temporary_path_prepend(self.workflow.scheduling_settings.solver_path): solver = ( pulp.getSolver(self.workflow.scheduling_settings.ilp_solver) if self.workflow.scheduling_settings.ilp_solver else pulp.apis.LpSolverDefault )
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- setup.cfg (1 hunks)
- snakemake/scheduler.py (5 hunks)
- test-environment.yml (1 hunks)
Files skipped from review due to trivial changes (2)
- setup.cfg
- test-environment.yml
Additional context used
Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
|
There was a problem hiding this comment.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- snakemake/scheduler.py (6 hunks)
Files skipped from review as they are similar to previous changes (1)
- snakemake/scheduler.py
Also an interesting idea indeed. One should evaluate whether the greedy or the suboptimal solution is preferable. For now, let us merge this in order to get the fix out. |
Ah, now I get what you meant. Indeed, the solution provided here is perfect already. We use the internal timeout functionality of pulp but still use greedy to solve the scheduling problem if the solution is not optimal. Nice! |
🤖 I have created a release *beep* *boop* --- ## [8.19.1](v8.19.0...v8.19.1) (2024-09-04) ### Bug Fixes * fix issues with misinterpretation of max-jobs-per-timespan and max-jobs-per-seconds ([#3067](#3067)) ([d82453b](d82453b)) * pip deployment path ([#3062](#3062)) ([bf9305b](bf9305b)) * return empty set if rate limiter at max ([#3060](#3060)) ([4e59963](4e59963)) * use wrapt_timeout_decorator, instead of stopit ([#2938](#2938)) ([3b64e41](3b64e41)) * Wrong linenumbers reported when linting ([#2985](#2985)) ([3a8bd36](3a8bd36)) ### Documentation * update `doc-environment.yml` file and Documentation Setup documentation ([#3058](#3058)) ([a540a2e](a540a2e)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
|
I got a question about this... is it recommended to use |
|
With this latest patch, I'd say to not use |



There are some issues of the solver stalling (or not behaving as expected) when running very large DAGs (e.g. #871, #1003, #1374, #1620, #2354).
To minimize those issues, I was wondering if it would make sense for the solver/scheduler to evaluate just a subset (randomly sampled) of the jobs waiting to run.
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
stopitwithtimeout_decoratorto support modern asynchronous handling.