feat: subsample jobs to speed-up scheduler#3112
Conversation
WalkthroughThe pull request introduces modifications to the Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
snakemake/scheduler.py (1)
276-278: Consider logging subsampling at 'info' levelCurrently, subsampling information is logged at debug level. To make users aware when subsampling affects job scheduling, consider logging this information at info level.
Apply this diff to change the logging level:
- logger.debug( + logger.info( f"Ready subsampled jobs: {len(needrun)} (out of {n_total_needrun})" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- snakemake/resources.py (0 hunks)
- snakemake/scheduler.py (1 hunks)
💤 Files with no reviewable changes (1)
- snakemake/resources.py
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake/scheduler.py (1)
266-280: Consider future performance optimizationsThe current implementation significantly improves performance for large workflows. In the future, you might consider additional optimizations:
- Implement adaptive subsampling based on the workflow size and available resources.
- Explore parallel processing for job selection in extremely large workflows.
- Consider caching job priorities to reduce sorting overhead in subsequent scheduling rounds.
These suggestions are not urgent but could be considered for future enhancements.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- snakemake/scheduler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🔇 Additional comments (4)
snakemake/scheduler.py (4)
272-278: LGTM: Efficient job subsampling implementationThe job subsampling logic is well-implemented. It efficiently sorts jobs by priority and selects a subset when necessary. The use of
sorted()with a key function is a good practice for performance. The debug logging provides valuable information for troubleshooting.
279-280: LGTM: Consistent logging for job countThis code segment maintains consistent logging whether job subsampling occurs or not. This is good practice for maintaining clear and informative debug output.
266-280: Overall: Good implementation of job subsampling for large workflowsThe introduced changes effectively address the PR objective of speeding up the scheduler for workflows with a large number of jobs. The implementation is non-intrusive, maintains existing functionality, and provides flexibility through the use of an environment variable. This approach allows for easy testing and configuration in different environments.
266-271:⚠️ Potential issueConsider adding error handling for invalid environment variable values
The current implementation assumes that the
SNAKEMAKE_SOLVER_MAX_JOBSenvironment variable, if set, will always be a valid integer. It's a good practice to add error handling to provide a clear message to the user if an invalid value is provided.Consider applying this diff to handle invalid values:
try: solver_max_jobs = int( os.environ.get("SNAKEMAKE_SOLVER_MAX_JOBS", sys.maxsize) ) +except ValueError: + raise ValueError( + "Invalid value for SNAKEMAKE_SOLVER_MAX_JOBS. It must be an integer." + )Likely invalid or redundant comment.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
snakemake/scheduler.py (1)
279-281: Consider including 'solver_max_jobs' in the debug log message for clarityIncluding the value of
solver_max_jobsin the debug log message can aid in understanding and debugging the subsampling behavior.Apply this diff to enhance the log message:
logger.debug( - f"Ready subsampled jobs: {len(needrun)} (out of {n_total_needrun})" + f"Ready subsampled jobs: {len(needrun)} (out of {n_total_needrun}), " + f"solver_max_jobs: {solver_max_jobs}" )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- snakemake/scheduler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/scheduler.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🔇 Additional comments (1)
snakemake/scheduler.py (1)
268-281: Subsampling ready jobs by priority is correctly implementedThe logic for subsampling ready jobs based on priority is well-implemented, which should enhance the scheduler's performance when handling a large number of jobs.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
snakemake/cli.py (1)
1422-1431: LGTM! Consider a minor improvement to the help text.The addition of the
--scheduler-subsampleargument is well-implemented and aligns with the PR objectives. It provides a useful option for optimizing scheduler performance on large DAGs.Consider adding a brief example to the help text to make it even more user-friendly. For instance:
help="Number of jobs to be considered for scheduling. If number of ready " "jobs is greater than this value, this number of jobs is randomly chosen " "for scheduling; if number of ready jobs is lower, this option has no " "effect. This can be useful on very large DAGs, where the scheduler can " -"take some time selecting which jobs to run.", +"take some time selecting which jobs to run. Example: --scheduler-subsample 1000",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- snakemake/cli.py (2 hunks)
- snakemake/scheduler.py (2 hunks)
- snakemake/settings/types.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- snakemake/scheduler.py
🧰 Additional context used
📓 Path-based instructions (2)
snakemake/cli.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.snakemake/settings/types.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🔇 Additional comments (4)
snakemake/cli.py (2)
2145-2147: LGTM! Proper integration of the new scheduler subsample parameter.The new
subsampleparameter has been correctly added to theSchedulingSettingsclass and is properly populated with the value fromargs.scheduler_subsample. This ensures that the command-line argument is effectively passed to the scheduler.
Line range hint
1-2484: Overall, excellent implementation of the scheduler subsample feature.The changes to
snakemake/cli.pyare focused, minimal, and well-integrated into the existing codebase. The new--scheduler-subsampleargument is properly added to the parser and correctly passed to the scheduler settings. This implementation aligns well with the PR objectives and should effectively address the performance issues in workflows with a large number of jobs.snakemake/settings/types.py (2)
302-302: Addition of thesubsampleattribute looks good.The
subsampleattribute is appropriately added to theSchedulingSettingsclass with the correct type annotation and default value.
320-321:⚠️ Potential issueCorrect the typo in attribute name 'greedyness' to 'greediness'.
In the
_checkmethod, the attributeself.greedynessis misspelled. This typo will lead to anAttributeErrorat runtime sinceself.greedinessis the correct attribute name.Apply this diff to fix the typo:
def _check(self): - if not (0 < self.greedyness <= 1.0): + if not (0 < self.greediness <= 1.0): raise ApiError("greediness must be >0 and <=1") if self.subsample is not None: if self.subsample <= 0: raise ApiError("subsample must be >0")Likely invalid or redundant comment.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- snakemake/cli.py (2 hunks)
- snakemake/settings/types.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
snakemake/cli.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.snakemake/settings/types.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🔇 Additional comments (4)
snakemake/settings/types.py (1)
302-302: LGTM: New attributesubsampleadded.The addition of the
subsampleattribute is consistent with the docstring description and properly typed asOptional[int]with a default value ofNone.snakemake/cli.py (3)
1422-1431: New CLI argument added for scheduler subsampling.A new command-line argument
--scheduler-subsamplehas been added to allow users to set the number of jobs to be considered for scheduling. This can be useful for large DAGs where the scheduler might take a long time to select jobs.The implementation looks correct and includes:
- Appropriate type (int)
- Default value (None)
- Clear help text explaining the purpose and effect of the argument
2145-2146: SchedulingSettings updated to include new subsample parameter.The
SchedulingSettingsclass has been updated to include the newsubsampleparameter, which corresponds to the--scheduler-subsampleCLI argument.This change ensures that the new CLI argument is properly passed to the scheduling settings and can be used by the scheduler.
Line range hint
1422-2146: Overall assessment: Well-implemented feature for scheduler subsampling.The changes introduce a new CLI argument
--scheduler-subsampleand update theSchedulingSettingsto support job subsampling during the scheduling process. This feature can be particularly useful for large workflows where the scheduler might take a long time to select jobs.The implementation is clean, consistent with the existing code style, and provides clear documentation for users. It adds flexibility to the Snakemake CLI without introducing unnecessary complexity.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
snakemake/settings/types.py (1)
292-294: Improve docstring formatting for better readability.The docstring for the
SchedulingSettingsclass has some formatting issues. Please consider the following improvements:
- Ensure consistent indentation for parameter descriptions.
- Remove the extra quotation mark at the end of the
subsampledescription.- Use consistent punctuation at the end of each parameter description.
Apply this diff to improve the formatting:
- Set the greediness of scheduling. This value, between 0 and 1, determines how careful jobs are selected for execution. The default value (0.5 if prioritytargets are used, 1.0 else) provides the best speed and still acceptable scheduling quality. + Set the greediness of scheduling. This value, between 0 and 1, determines how carefully jobs are selected for execution. + The default value (0.5 if prioritytargets are used, 1.0 else) provides the best speed and still acceptable scheduling quality. subsample: - Set the number of jobs to be considered for scheduling. If number of ready jobs is greater than this value, this number of jobs is randomly chosen for scheduling; if number of ready jobs is lower, this option has no effect. This can be useful on very large DAGs, where the scheduler can take some time selecting which jobs to run." + Set the number of jobs to be considered for scheduling. If the number of ready jobs is greater than this value, + this number of jobs is randomly chosen for scheduling; if the number of ready jobs is lower, this option has no effect. + This can be useful on very large DAGs, where the scheduler can take some time selecting which jobs to run.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- snakemake/settings/types.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/settings/types.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🔇 Additional comments (1)
snakemake/settings/types.py (1)
302-302: LGTM: Newsubsampleattribute added.The addition of the
subsampleattribute is consistent with the docstring changes and provides a useful feature for improving scheduling performance in large DAGs.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
snakemake/settings/types.py (1)
292-294: Remove extra quotation mark and improve clarity in docstring.The docstring for
subsamplecontains an extra quotation mark at the end. Additionally, the description can be slightly improved for clarity.Apply this diff to fix the issues:
subsample: - Set the number of jobs to be considered for scheduling. If number of ready jobs is greater than this value, this number of jobs is randomly chosen for scheduling; if number of ready jobs is lower, this option has no effect. This can be useful on very large DAGs, where the scheduler can take some time selecting which jobs to run." + Set the number of jobs to consider for scheduling. If the number of ready jobs exceeds this value, a random subset of jobs equal to `subsample` is selected for scheduling. If the number of ready jobs is lower, this option has no effect. This can be useful on very large DAGs, where the scheduler can take significant time deciding which jobs to run.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- snakemake/settings/types.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
snakemake/settings/types.py (1)
Pattern
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
🪛 Ruff
snakemake/settings/types.py
320-321: Use a single
ifstatement instead of nestedifstatements(SIM102)
🔇 Additional comments (2)
snakemake/settings/types.py (2)
302-302: LGTM: Newsubsampleattribute added.The addition of the
subsampleattribute as an optional integer with a default value ofNoneis well-implemented. This aligns with the PR objectives of enhancing scheduler performance for large DAGs by allowing subsampling of jobs for scheduling.
Line range hint
292-322: Summary: Well-implementedsubsamplefeature for enhancing scheduler performance.The changes in this file successfully implement the
subsamplefeature for enhancing scheduler performance on large DAGs. The new attribute is properly documented, typed, and validated. The minor suggestions provided will further improve code clarity and consistency.Overall, these changes align well with the PR objectives and should significantly improve scheduling time for workflows with a large number of jobs.
🤖 I have created a release *beep* *boop* --- ## [8.24.0](v8.23.2...v8.24.0) (2024-10-21) ### Features * subsample jobs to speed-up scheduler ([#3112](#3112)) ([e10feef](e10feef)) ### Documentation * addition of interction/visualization/reporting tutorial ([#3159](#3159)) ([1d94bd1](1d94bd1)) * fix tutorial step numbering ([2d7b9e9](2d7b9e9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

I am running a workflow with ~700k jobs and, at each given time, there are around 230k jobs ready to be run. The initial building of the DAG is quite slow (~2h, but I'll leave that for another PR 😄), but the main issue is that the scheduler takes a lot of time deciding the next jobs to be submitted.
In my case, all jobs are quite fast and similar in terms of resources, so the cluster is idle most of the time. The greedy scheduler is considerably faster, but still too slow.
The ILP should switch to the greedy after 10s, but it sometimes ignores the timeout (coin-or/Cbc#487) and it has been reported being quite slow instantiating large problems (coin-or/pulp#749). In my case, the ILP runs for 60s (the pulp file is 100Mb) before switching to greedy. Apart from that, and specially on slow file systems, the scheduler can still be quite slow checking all temp and input files.
Here, I propose sampling ready jobs, so that only a subset of jobs (instead of all ready jobs) are evaluated by the scheduler. In my tests, this greatly reduces the scheduler time:
QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).Summary by CodeRabbit
Summary by CodeRabbit
New Features
--scheduler-subsampleto optimize job scheduling by limiting the number of jobs considered for execution.Bug Fixes
Refactor