Skip to content

Avoid string comparison in order key#3056

Merged
mrocklin merged 8 commits intodask:masterfrom
mrocklin:order-types
Jan 14, 2018
Merged

Avoid string comparison in order key#3056
mrocklin merged 8 commits intodask:masterfrom
mrocklin:order-types

Conversation

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Jan 10, 2018

Fixes #3055

TODO: this needs tests

  • Tests added / passed
  • Passes flake8 dask
  • Fully documented, including docs/source/changelog.rst for all changes
    and one of the docs/source/*-api.rst files for new API

Fixes dask#3055

TODO: this needs tests
dask/order.py Outdated

def key(x):
return ndeps.get(x, 0), str(x)
return ndeps.get(x, 0), id(type(x)), str(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this solves the issue? The id of the type is random, so this won't be consistent across multiple runs. It also doesn't solve comparing non-identically-typed tuples. I might be misunderstanding something though.

Perhaps instead, something like:

In [43]: from functools import total_ordering

In [44]: @total_ordering
    ...: class Key(object):
    ...:     def __init__(self, k):
    ...:         self.k = k
    ...:     def __lt__(self, other):
    ...:         try:
    ...:             return self.k < other.k
    ...:         except Exception:
    ...:             return str(self.k) < str(other.k)
    ...:     def __eq__(self, other):
    ...:         try:
    ...:             return self.k == other.k
    ...:         except Exception:
    ...:             return str(self.k) == str(other.k)
    ...:

In [45]: def key(x):
    ...:     return Key((ndeps.get(x, 0), x))
    ...:

In [46]: ndeps = {}

In [47]: key((k, 2)) < key((k, 10))
Out[47]: True

In [48]: key((k, 12)) > key((k, 10))
Out[48]: True

In [49]: key((k, 12)) > key((k, k))
Out[49]: False

In [50]: key((k, 12)) < key((k, k))
Out[50]: True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, my intent was

return ndeps.get(x, 0), id(type(x)), x

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str(x) was originally designed to avoid comparisons between different types that would fail on Python 3. The id/type check is just another way to avoid cross type comparisons.

Copy link
Member

@jcrist jcrist Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't fix that though, as id(type(x)) will be the same for both ('foo', 1) and ('foo', 'bar'), but comparing these valid keys will error.

I think my suggestion above is better for two reasons:

  • Definitive ordering between keys of different types. Sorting by id of builtin types isn't random (although I think this is an implementation detail), but it shouldn't be relied on. id(str) > id(tuple) just happens to be true.

  • Works for tuples of varying types without enforcing our convention of (str, int, int, ....). This is not a requirement in the spec, and I don't think it should be.

It's also not terribly slower than your solution.

In [31]: class key(object):
    ...:     __slots__ = ('k',)
    ...:     def __init__(self, k):
    ...:         self.k = ndeps.get(k, 0), k
    ...:     def __lt__(self, other):
    ...:         try:
    ...:             return self.k < other.k
    ...:         except Exception:
    ...:             return str(self.k) < str(other.k)
    ...:

In [32]: def key2(x):
    ...:     return ndeps.get(x, 0), id(type(x)), x
    ...:

In [33]: ndeps = {}

In [34]: keys = [('foobar', i) for i in range(10000)]

In [35]: import random

In [36]: random.shuffle(keys)

In [37]: %time sorted(keys, key=key)[:2]
CPU times: user 49.7 ms, sys: 968 µs, total: 50.7 ms
Wall time: 50.4 ms
Out[37]: [('foobar', 0), ('foobar', 1)]

In [38]: %time sorted(keys, key=key2)[:2]
CPU times: user 19.5 ms, sys: 92 µs, total: 19.6 ms
Wall time: 20 ms
Out[38]: [('foobar', 0), ('foobar', 1)]

In [39]: a = ('foo', 'bar')

In [40]: b = ('foo', 2)

In [41]: sorted([a, b], key=key)
Out[41]: [('foo', 'bar'), ('foo', 2)]

In [42]: sorted([a, b], key=key2)  # Fails due to type differences
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-5064b83b5956> in <module>()
----> 1 sorted([a, b], key=key2)

TypeError: unorderable types: int() < str()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that seems sensible to me (or at least I'm not yet able to come up with a better solution :)).

In regards to performance this change isn't trivial. It looks like this is costing around 3us per task.
If our target is 200 us per task then this isn't huge, but also isn't nothing. I think it's still a good change to make, given that it solves problems that have arisen in the wild, I just wanted to provide some context around the performance numbers.

@mrocklin
Copy link
Member Author

I'm resolving some additional things here with string comparison. It looks like our tests were to easy to pass because we used strings 'a', 'b', 'c', ... in a way that suggested the proper order.

Also change tests to support randomizing the string names
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me. Left a few comments about the tests.

if random.random() < 0.5:
a, b, c, d, e = 'abcde'
else:
a, b, c, d, e = 'abcde'[::-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems dangerous to me - tests that depend on random behavior are harder to reproduce on failure, and harder to debug.

Also, having a == 'a' sometimes, and other times a == 'e' is confusing as a reader.

A few ideas:

  • Always assign values in reverse order - since we sort by key names, the order function must be correct based on other factors.
  • Build test graphs in a way that we know sorting by keyname alone won't pass the test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you feel about making the order switching explicit with pytest.parametrize? I don't trust our ability to craft the graphs such that keyname won't matter. I want to ensure that things work both ways regardless of future changes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me.

dsk = dict(w.__dask_graph__())
o = order(dsk)
L = [o[k] for k in w.__dask_keys__()]
assert sorted(L) == L or sorted(L) == L[::-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is due to the random behavior above? In either case, I'd rather this was a consistent test result, as it'll make it easier to debug in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the randomization above, I just didn't care what order it came in as long as it was sorted. Happy to take off the second piece.

@mrocklin mrocklin merged commit d7db87b into dask:master Jan 14, 2018
@mrocklin mrocklin deleted the order-types branch January 14, 2018 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants