Avoid string comparison in order key#3056
Conversation
Fixes dask#3055 TODO: this needs tests
dask/order.py
Outdated
|
|
||
| def key(x): | ||
| return ndeps.get(x, 0), str(x) | ||
| return ndeps.get(x, 0), id(type(x)), str(x) |
There was a problem hiding this comment.
I'm not sure how this solves the issue? The id of the type is random, so this won't be consistent across multiple runs. It also doesn't solve comparing non-identically-typed tuples. I might be misunderstanding something though.
Perhaps instead, something like:
In [43]: from functools import total_ordering
In [44]: @total_ordering
...: class Key(object):
...: def __init__(self, k):
...: self.k = k
...: def __lt__(self, other):
...: try:
...: return self.k < other.k
...: except Exception:
...: return str(self.k) < str(other.k)
...: def __eq__(self, other):
...: try:
...: return self.k == other.k
...: except Exception:
...: return str(self.k) == str(other.k)
...:
In [45]: def key(x):
...: return Key((ndeps.get(x, 0), x))
...:
In [46]: ndeps = {}
In [47]: key((k, 2)) < key((k, 10))
Out[47]: True
In [48]: key((k, 12)) > key((k, 10))
Out[48]: True
In [49]: key((k, 12)) > key((k, k))
Out[49]: False
In [50]: key((k, 12)) < key((k, k))
Out[50]: TrueThere was a problem hiding this comment.
Whoops, my intent was
return ndeps.get(x, 0), id(type(x)), x
There was a problem hiding this comment.
str(x) was originally designed to avoid comparisons between different types that would fail on Python 3. The id/type check is just another way to avoid cross type comparisons.
There was a problem hiding this comment.
This still doesn't fix that though, as id(type(x)) will be the same for both ('foo', 1) and ('foo', 'bar'), but comparing these valid keys will error.
I think my suggestion above is better for two reasons:
-
Definitive ordering between keys of different types. Sorting by id of builtin types isn't random (although I think this is an implementation detail), but it shouldn't be relied on.
id(str) > id(tuple)just happens to be true. -
Works for tuples of varying types without enforcing our convention of
(str, int, int, ....). This is not a requirement in the spec, and I don't think it should be.
It's also not terribly slower than your solution.
In [31]: class key(object):
...: __slots__ = ('k',)
...: def __init__(self, k):
...: self.k = ndeps.get(k, 0), k
...: def __lt__(self, other):
...: try:
...: return self.k < other.k
...: except Exception:
...: return str(self.k) < str(other.k)
...:
In [32]: def key2(x):
...: return ndeps.get(x, 0), id(type(x)), x
...:
In [33]: ndeps = {}
In [34]: keys = [('foobar', i) for i in range(10000)]
In [35]: import random
In [36]: random.shuffle(keys)
In [37]: %time sorted(keys, key=key)[:2]
CPU times: user 49.7 ms, sys: 968 µs, total: 50.7 ms
Wall time: 50.4 ms
Out[37]: [('foobar', 0), ('foobar', 1)]
In [38]: %time sorted(keys, key=key2)[:2]
CPU times: user 19.5 ms, sys: 92 µs, total: 19.6 ms
Wall time: 20 ms
Out[38]: [('foobar', 0), ('foobar', 1)]
In [39]: a = ('foo', 'bar')
In [40]: b = ('foo', 2)
In [41]: sorted([a, b], key=key)
Out[41]: [('foo', 'bar'), ('foo', 2)]
In [42]: sorted([a, b], key=key2) # Fails due to type differences
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-42-5064b83b5956> in <module>()
----> 1 sorted([a, b], key=key2)
TypeError: unorderable types: int() < str()There was a problem hiding this comment.
Indeed, that seems sensible to me (or at least I'm not yet able to come up with a better solution :)).
In regards to performance this change isn't trivial. It looks like this is costing around 3us per task.
If our target is 200 us per task then this isn't huge, but also isn't nothing. I think it's still a good change to make, given that it solves problems that have arisen in the wild, I just wanted to provide some context around the performance numbers.
|
I'm resolving some additional things here with string comparison. It looks like our tests were to easy to pass because we used strings |
Also change tests to support randomizing the string names
jcrist
left a comment
There was a problem hiding this comment.
Overall this looks good to me. Left a few comments about the tests.
dask/tests/test_order.py
Outdated
| if random.random() < 0.5: | ||
| a, b, c, d, e = 'abcde' | ||
| else: | ||
| a, b, c, d, e = 'abcde'[::-1] |
There was a problem hiding this comment.
This seems dangerous to me - tests that depend on random behavior are harder to reproduce on failure, and harder to debug.
Also, having a == 'a' sometimes, and other times a == 'e' is confusing as a reader.
A few ideas:
- Always assign values in reverse order - since we sort by key names, the
orderfunction must be correct based on other factors. - Build test graphs in a way that we know sorting by keyname alone won't pass the test.
There was a problem hiding this comment.
How would you feel about making the order switching explicit with pytest.parametrize? I don't trust our ability to craft the graphs such that keyname won't matter. I want to ensure that things work both ways regardless of future changes.
dask/tests/test_order.py
Outdated
| dsk = dict(w.__dask_graph__()) | ||
| o = order(dsk) | ||
| L = [o[k] for k in w.__dask_keys__()] | ||
| assert sorted(L) == L or sorted(L) == L[::-1] |
There was a problem hiding this comment.
I assume this is due to the random behavior above? In either case, I'd rather this was a consistent test result, as it'll make it easier to debug in the future.
There was a problem hiding this comment.
Not related to the randomization above, I just didn't care what order it came in as long as it was sorted. Happy to take off the second piece.
Fixes #3055
TODO: this needs tests
flake8 daskdocs/source/changelog.rstfor all changesand one of the
docs/source/*-api.rstfiles for new API