Avoid sorting large stacks in order#3298
Merged
mrocklin merged 1 commit intodask:masterfrom Mar 19, 2018
Merged
Conversation
When performning task ordering we sort tasks based on the number of dependents/dependencies they have. This is critical to low-memory processing. However, sometimes individual tasks have millions of dependencies, for which an n*log(n) sort adds significant overhead. In these cases we give up on sorting, and just hope that the tasks are well ordered naturally (such as is often the case in Python 3.6+ due to sorted dicts and the natural ordering that exists when constructing common graphs) See pangeo-data/pangeo#150 (comment) for a real-world case
Member
Author
|
@jcrist any thoughts or concerns on this? |
jcrist
reviewed
Mar 19, 2018
|
|
||
| stack = [k for k, v in dependents.items() if not v] | ||
| stack = sorted(stack, key=dependencies_key) | ||
| if len(stack) < 10000: |
Member
There was a problem hiding this comment.
How did you come up with these numbers?
Member
Author
There was a problem hiding this comment.
In practice numbers around 1-16 seem to be common and good to sort. Numbers like 100k seem to be bad to sort. I chose something in the middle.
Member
There was a problem hiding this comment.
Fine by me. Avoiding sorting in expensive situations seems fine, and the numbers are sufficiently high that they're unlikely to affect complicated graph structures that might benefit more from sorting.
eriknw
added a commit
to eriknw/dask
that referenced
this pull request
Dec 6, 2019
Avoid sorting or taking the min when there are many, many edges. This respects the use case here: dask#3298 Minor performance improvements.
2 tasks
jcrist
pushed a commit
that referenced
this pull request
Jan 14, 2020
#5646) * Redo `dask.order.order`. Fix #5584. Use structural info, not key names. This is a substantial rewrite of `dask.order.order`, but the goals remain the same and many previous lessons were taken into consideration. The new version relies less on the key name by using more metrics and using a different strategy for walking up and down the DAG. Performance appears to be about the same (sometimes a litle faster, sometimes a little slower, and never a lot slower). I still need to wrap up some cosmetics (doc strings, code comments, etc). @TomAugspurger also suggested I add some benchmarks to dask-benchmarks. * run black * Clean up: update docstrings, code comments, and some performance tweaks * Remove commented out line in test * Add test that regressed on master. Avoid sorting or taking the min when there are many, many edges. This respects the use case here: #3298 Minor performance improvements. * Fix failing test and add regression test. Re-add `total_dependencies` in `dependencies_key`. * Don't leave any dangling single nodes in `order`. Also, some performance tweaks. * Run black (correct version?) * Fix typo in doctest * Fix typo in doctest * Improve docstring of `graph_metrics`; also, detect and raise if cycle exists * oops. test cycle detection in `order` with non-string keys * Pre-compute `initial_stack_key` in `order` (for performance)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When performning task ordering we sort tasks based on the
number of dependents/dependencies they have. This is critical to
low-memory processing.
However, sometimes individual tasks have millions of dependencies,
for which an n*log(n) sort adds significant overhead. In these cases
we give up on sorting, and just hope that the tasks are well ordered
naturally (such as is often the case in Python 3.6+ due to sorted
dicts and the natural ordering that exists when constructing common
graphs)
See pangeo-data/pangeo#150 (comment)
for a real-world case
flake8 daskdocs/source/changelog.rstfor all changesand one of the
docs/source/*-api.rstfiles for new API