Avoid sorting large stacks in order by mrocklin · Pull Request #3298 · dask/dask

mrocklin · 2018-03-19T14:13:14Z

When performing task ordering we sort tasks based on the
number of dependents/dependencies they have. This is critical to
low-memory processing.

However, sometimes individual tasks have millions of dependencies,
for which an n*log(n) sort adds significant overhead. In these cases
we give up on sorting and just hope that the tasks are already well
ordered (as is often the case in Python 3.6+, where dicts preserve
insertion order and common graphs are constructed in a natural order).

See pangeo-data/pangeo#150 (comment)
for a real-world case.
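The idea described above can be sketched outside of dask as follows. This is a minimal illustration, not the code in dask/order.py: `initial_stack`, `SORT_THRESHOLD`, and the simplified sort key are hypothetical, and the 10000 cutoff is simply the value that appears in the diff under review.

```python
# Hypothetical sketch of threshold-gated sorting; not dask's implementation.
SORT_THRESHOLD = 10000  # cutoff value taken from the diff in this PR


def initial_stack(dependents, dependencies, threshold=SORT_THRESHOLD):
    """Return the tasks nothing depends on, sorted only when that is cheap."""
    # Keys with no dependents. Dict iteration follows insertion order
    # (guaranteed in Python 3.7+, an implementation detail of CPython 3.6),
    # so this list keeps the order in which the graph was built.
    stack = [k for k, v in dependents.items() if not v]
    if len(stack) < threshold:
        # Small stack: sorting is cheap. The key here is an assumption:
        # number of dependencies, then the key's string form as a tie-breaker.
        stack = sorted(stack, key=lambda k: (len(dependencies[k]), str(k)))
    # Large stack: skip the sort and rely on construction order.
    return stack


# Tiny graph: "c" depends on "a" and "b"; "d" depends on "a".
dependencies = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"a"}}
dependents = {"a": {"c", "d"}, "b": {"c"}, "c": set(), "d": set()}

small = initial_stack(dependents, dependencies)
unsorted_stack = initial_stack(dependents, dependencies, threshold=1)
```

With the default threshold the two-element stack is sorted; lowering the threshold to 1 forces the "too large" path, which returns the keys in dict order.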

Tests added / passed
Passes flake8 dask
Fully documented, including docs/source/changelog.rst for all changes
and one of the docs/source/*-api.rst files for new API

mrocklin · 2018-03-19T17:06:42Z

@jcrist any thoughts or concerns on this?

jcrist · 2018-03-19T17:27:26Z

dask/order.py


    stack = [k for k, v in dependents.items() if not v]
-    stack = sorted(stack, key=dependencies_key)
+    if len(stack) < 10000:


jcrist: How did you come up with these numbers?

mrocklin: Totally arbitrary

mrocklin: In practice numbers around 1-16 seem to be common and good to sort. Numbers like 100k seem to be bad to sort. I chose something in the middle.

jcrist: Fine by me. Avoiding sorting in expensive situations seems fine, and the numbers are sufficiently high that they're unlikely to affect complicated graph structures that might benefit more from sorting.
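To get a feel for the sizes discussed in this thread, the cost of sorting can be measured directly. The following is a rough sketch: `time_sort` is a hypothetical helper, the random tuple keys only stand in for real ordering keys, and timings vary by machine, so no numbers are quoted here.

```python
import random
import timeit


def time_sort(n, repeat=3):
    """Best-of-`repeat` wall time in seconds for sorting n tuple keys."""
    rng = random.Random(0)  # fixed seed so the input is reproducible
    # Stand-in keys: (small integer, string), loosely like a sort key
    # made of a dependency count plus a key name.
    data = [(rng.randrange(100), str(i)) for i in range(n)]
    return min(timeit.repeat(lambda: sorted(data), number=1, repeat=repeat))


# Sizes mentioned in the discussion: a typical small stack, the cutoff
# from the diff, and a size reported as too expensive to sort.
timings = {n: time_sort(n) for n in (16, 10_000, 100_000)}
for n, seconds in timings.items():
    print(f"n={n:>7}: {seconds:.6f} s")
```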

#5646)

* Redo `dask.order.order`. Fix #5584. Use structural info, not key names.

  This is a substantial rewrite of `dask.order.order`, but the goals remain the same and many previous lessons were taken into consideration. The new version relies less on the key name by using more metrics and using a different strategy for walking up and down the DAG.

  Performance appears to be about the same (sometimes a little faster, sometimes a little slower, and never a lot slower).

  I still need to wrap up some cosmetics (doc strings, code comments, etc). @TomAugspurger also suggested I add some benchmarks to dask-benchmarks.
* run black
* Clean up: update docstrings, code comments, and some performance tweaks
* Remove commented out line in test
* Add test that regressed on master.

  Avoid sorting or taking the min when there are many, many edges. This respects the use case here: #3298

  Minor performance improvements.
* Fix failing test and add regression test. Re-add `total_dependencies` in `dependencies_key`.
* Don't leave any dangling single nodes in `order`. Also, some performance tweaks.
* Run black (correct version?)
* Fix typo in doctest
* Fix typo in doctest
* Improve docstring of `graph_metrics`; also, detect and raise if cycle exists
* oops. test cycle detection in `order` with non-string keys
* Pre-compute `initial_stack_key` in `order` (for performance)

mrocklin force-pushed the order-sorted branch from 7bbab4c to 4352fe5 March 19, 2018 14:15

mrocklin mentioned this pull request Mar 19, 2018

Why is xarray.to_zarr slow sometimes? pangeo-data/pangeo#150

Closed

jcrist reviewed Mar 19, 2018


mrocklin merged commit f577913 into dask:master Mar 19, 2018

mrocklin deleted the order-sorted branch March 19, 2018 21:05

eriknw added a commit to eriknw/dask that referenced this pull request Dec 6, 2019

Add test that regressed on master.

911efd5

Avoid sorting or taking the min when there are many, many edges. This respects the use case here: dask#3298

Minor performance improvements.

eriknw mentioned this pull request Dec 6, 2019

Redo dask.order.order. Fix #5584. Use structural info, not key names #5646

Merged

2 tasks

fjetter mentioned this pull request Sep 26, 2023

[WIP] Fixes for dask.order - Remove change of tactical goal in single dep path #10505

Closed

mrocklin merged 1 commit into dask:master from mrocklin:order-sorted
