Avoid string comparison in order key by mrocklin · Pull Request #3056 · dask/dask

mrocklin · 2018-01-10T17:51:47Z

TODO: this needs tests

Tests added / passed
Passes flake8 dask
Fully documented, including docs/source/changelog.rst for all changes
and one of the docs/source/*-api.rst files for new API

Fixes dask#3055 TODO: this needs tests

jcrist · 2018-01-10T18:41:25Z

dask/order.py


    def key(x):
-        return ndeps.get(x, 0), str(x)
+        return ndeps.get(x, 0), id(type(x)), str(x)


I'm not sure how this solves the issue? The id of the type is random, so this won't be consistent across multiple runs. It also doesn't solve comparing non-identically-typed tuples. I might be misunderstanding something though.

Perhaps instead, something like:

In [43]: from functools import total_ordering In [44]: @total_ordering ...: class Key(object): ...: def __init__(self, k): ...: self.k = k ...: def __lt__(self, other): ...: try: ...: return self.k < other.k ...: except Exception: ...: return str(self.k) < str(other.k) ...: def __eq__(self, other): ...: try: ...: return self.k == other.k ...: except Exception: ...: return str(self.k) == str(other.k) ...: In [45]: def key(x): ...: return Key((ndeps.get(x, 0), x)) ...: In [46]: ndeps = {} In [47]: key((k, 2)) < key((k, 10)) Out[47]: True In [48]: key((k, 12)) > key((k, 10)) Out[48]: True In [49]: key((k, 12)) > key((k, k)) Out[49]: False In [50]: key((k, 12)) < key((k, k)) Out[50]: True

Whoops, my intent was

return ndeps.get(x, 0), id(type(x)), x

str(x) was originally designed to avoid comparisons between different types that would fail on Python 3. The id/type check is just another way to avoid cross type comparisons.

This still doesn't fix that though, as id(type(x)) will be the same for both ('foo', 1) and ('foo', 'bar'), but comparing these valid keys will error.

I think my suggestion above is better for two reasons:

Definitive ordering between keys of different types. Sorting by id of builtin types isn't random (although I think this is an implementation detail), but it shouldn't be relied on. id(str) > id(tuple) just happens to be true.

Works for tuples of varying types without enforcing our convention of (str, int, int, ....). This is not a requirement in the spec, and I don't think it should be.

It's also not terribly slower than your solution.

In [31]: class key(object): ...: __slots__ = ('k',) ...: def __init__(self, k): ...: self.k = ndeps.get(k, 0), k ...: def __lt__(self, other): ...: try: ...: return self.k < other.k ...: except Exception: ...: return str(self.k) < str(other.k) ...: In [32]: def key2(x): ...: return ndeps.get(x, 0), id(type(x)), x ...: In [33]: ndeps = {} In [34]: keys = [('foobar', i) for i in range(10000)] In [35]: import random In [36]: random.shuffle(keys) In [37]: %time sorted(keys, key=key)[:2] CPU times: user 49.7 ms, sys: 968 µs, total: 50.7 ms Wall time: 50.4 ms Out[37]: [('foobar', 0), ('foobar', 1)] In [38]: %time sorted(keys, key=key2)[:2] CPU times: user 19.5 ms, sys: 92 µs, total: 19.6 ms Wall time: 20 ms Out[38]: [('foobar', 0), ('foobar', 1)] In [39]: a = ('foo', 'bar') In [40]: b = ('foo', 2) In [41]: sorted([a, b], key=key) Out[41]: [('foo', 'bar'), ('foo', 2)] In [42]: sorted([a, b], key=key2) # Fails due to type differences --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-42-5064b83b5956> in <module>() ----> 1 sorted([a, b], key=key2) TypeError: unorderable types: int() < str()

Indeed, that seems sensible to me (or at least I'm not yet able to come up with a better solution :)).

In regards to performance this change isn't trivial. It looks like this is costing around 3us per task.
If our target is 200 us per task then this isn't huge, but also isn't nothing. I think it's still a good change to make, given that it solves problems that have arisen in the wild, I just wanted to provide some context around the performance numbers.

mrocklin · 2018-01-12T15:40:28Z

I'm resolving some additional things here with string comparison. It looks like our tests were to easy to pass because we used strings 'a', 'b', 'c', ... in a way that suggested the proper order.

Also change tests to support randomizing the string names

jcrist

Overall this looks good to me. Left a few comments about the tests.

jcrist · 2018-01-12T21:10:29Z

dask/tests/test_order.py

+if random.random() < 0.5:
+    a, b, c, d, e = 'abcde'
+else:
+    a, b, c, d, e = 'abcde'[::-1]


This seems dangerous to me - tests that depend on random behavior are harder to reproduce on failure, and harder to debug.

Also, having a == 'a' sometimes, and other times a == 'e' is confusing as a reader.

A few ideas:

Always assign values in reverse order - since we sort by key names, the order function must be correct based on other factors.

Build test graphs in a way that we know sorting by keyname alone won't pass the test.

How would you feel about making the order switching explicit with pytest.parametrize? I don't trust our ability to craft the graphs such that keyname won't matter. I want to ensure that things work both ways regardless of future changes.

Sounds reasonable to me.

jcrist · 2018-01-12T21:11:12Z

dask/tests/test_order.py

+    dsk = dict(w.__dask_graph__())
+    o = order(dsk)
+    L = [o[k] for k in w.__dask_keys__()]
+    assert sorted(L) == L or sorted(L) == L[::-1]


I assume this is due to the random behavior above? In either case, I'd rather this was a consistent test result, as it'll make it easier to debug in the future.

Not related to the randomization above, I just didn't care what order it came in as long as it was sorted. Happy to take off the second piece.

Avoid string comparison in order key

b0bb20f

Fixes dask#3055 TODO: this needs tests

jcrist reviewed Jan 10, 2018

View reviewed changes

mrocklin added 3 commits January 10, 2018 14:20

add test

6ecc6c3

remove str from key function

d88b15a

add StrCompariable to order

521cdce

mrocklin force-pushed the order-types branch from 2a52be7 to 2702eb0 Compare January 12, 2018 18:50

Redesign order to prioritize on ndependents, then total dependencies

630b14d

Also change tests to support randomizing the string names

mrocklin force-pushed the order-types branch from 2702eb0 to 630b14d Compare January 12, 2018 19:14

jcrist reviewed Jan 12, 2018

View reviewed changes

mrocklin added 3 commits January 12, 2018 17:40

parametrize order tests based on string order

28be2fb

fix doctests

5140bda

[skip ci] add changelog

7b2af45

mrocklin merged commit d7db87b into dask:master Jan 14, 2018

mrocklin deleted the order-types branch January 14, 2018 18:45

eriknw mentioned this pull request Dec 6, 2019

Redo dask.order.order. Fix #5584. Use structural info, not key names #5646

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Avoid string comparison in order key#3056

Avoid string comparison in order key#3056
mrocklin merged 8 commits intodask:masterfrom
mrocklin:order-types

mrocklin commented Jan 10, 2018 •

edited

Loading

Uh oh!

jcrist Jan 10, 2018

Uh oh!

mrocklin Jan 10, 2018

Uh oh!

mrocklin Jan 10, 2018

Uh oh!

jcrist Jan 10, 2018 •

edited

Loading

Uh oh!

mrocklin Jan 10, 2018

Uh oh!

mrocklin commented Jan 12, 2018

Uh oh!

jcrist left a comment

Uh oh!

jcrist Jan 12, 2018

Uh oh!

mrocklin Jan 12, 2018

Uh oh!

jcrist Jan 12, 2018

Uh oh!

jcrist Jan 12, 2018

Uh oh!

mrocklin Jan 12, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mrocklin commented Jan 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jcrist Jan 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrocklin commented Jan 12, 2018

Uh oh!

jcrist left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mrocklin commented Jan 10, 2018 •

edited

Loading

jcrist Jan 10, 2018 •

edited

Loading