Skip to content

Avoid sorting large stacks in order#3298

Merged
mrocklin merged 1 commit intodask:masterfrom
mrocklin:order-sorted
Mar 19, 2018
Merged

Avoid sorting large stacks in order#3298
mrocklin merged 1 commit intodask:masterfrom
mrocklin:order-sorted

Conversation

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 19, 2018

When performning task ordering we sort tasks based on the
number of dependents/dependencies they have. This is critical to
low-memory processing.

However, sometimes individual tasks have millions of dependencies,
for which an n*log(n) sort adds significant overhead. In these cases
we give up on sorting, and just hope that the tasks are well ordered
naturally (such as is often the case in Python 3.6+ due to sorted
dicts and the natural ordering that exists when constructing common
graphs)

See pangeo-data/pangeo#150 (comment)
for a real-world case

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

When performning task ordering we sort tasks based on the
number of dependents/dependencies they have.  This is critical to
low-memory processing.

However, sometimes individual tasks have millions of dependencies,
for which an n*log(n) sort adds significant overhead.  In these cases
we give up on sorting, and just hope that the tasks are well ordered
naturally (such as is often the case in Python 3.6+ due to sorted
dicts and the natural ordering that exists when constructing common
graphs)

See pangeo-data/pangeo#150 (comment)
for a real-world case
@mrocklin
Copy link
Member Author

@jcrist any thoughts or concerns on this?


stack = [k for k, v in dependents.items() if not v]
stack = sorted(stack, key=dependencies_key)
if len(stack) < 10000:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you come up with these numbers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally arbitrary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice numbers around 1-16 seem to be common and good to sort. Numbers like 100k seem to be bad to sort. I chose something in the middle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me. Avoiding sorting in expensive situations seems fine, and the numbers are sufficiently high that they're unlikely to affect complicated graph structures that might benefit more from sorting.

@mrocklin mrocklin merged commit f577913 into dask:master Mar 19, 2018
@mrocklin mrocklin deleted the order-sorted branch March 19, 2018 21:05
eriknw added a commit to eriknw/dask that referenced this pull request Dec 6, 2019
Avoid sorting or taking the min when there are many, many edges.
This respects the use case here: dask#3298

Minor performance improvements.
jcrist pushed a commit that referenced this pull request Jan 14, 2020
#5646)

* Redo `dask.order.order`.  Fix #5584.  Use structural info, not key names.

This is a substantial rewrite of `dask.order.order`, but the goals remain the
same and many previous lessons were taken into consideration.  The new version
relies less on the key name by using more metrics and using a different
strategy for walking up and down the DAG.

Performance appears to be about the same (sometimes a litle faster, sometimes
a little slower, and never a lot slower).

I still need to wrap up some cosmetics (doc strings, code comments, etc).

@TomAugspurger also suggested I add some benchmarks to dask-benchmarks.

* run black

* Clean up: update docstrings, code comments, and some performance tweaks

* Remove commented out line in test

* Add test that regressed on master.

Avoid sorting or taking the min when there are many, many edges.
This respects the use case here: #3298

Minor performance improvements.

* Fix failing test and add regression test.

Re-add `total_dependencies` in `dependencies_key`.

* Don't leave any dangling single nodes in `order`.  Also, some performance tweaks.

* Run black (correct version?)

* Fix typo in doctest

* Fix typo in doctest

* Improve docstring of `graph_metrics`; also, detect and raise if cycle exists

* oops.  test cycle detection in `order` with non-string keys

* Pre-compute `initial_stack_key` in `order` (for performance)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants