Skip to content

HLG: get_dependencies() of single keys#6699

Merged
jrbourbeau merged 4 commits intodask:masterfrom
madsbk:hlg_deps_per_key
Oct 8, 2020
Merged

HLG: get_dependencies() of single keys#6699
jrbourbeau merged 4 commits intodask:masterfrom
madsbk:hlg_deps_per_key

Conversation

@madsbk
Copy link
Contributor

@madsbk madsbk commented Oct 2, 2020

Fixes #6694 by letting get_dependencies() work on single keys instead of all keys in the Layer.

  • Tests added / passed
  • Passes black dask / flake8 dask

@madsbk madsbk changed the title Hlg deps per key HLG: get_dependencies() of single keys Oct 2, 2020
@madsbk
Copy link
Contributor Author

madsbk commented Oct 2, 2020

@sofroniewn thanks for reporting the performance issue. Can I ask you to try this PR, it should fix the issue?

@madsbk madsbk mentioned this pull request Oct 2, 2020
2 tasks
@sofroniewn
Copy link

Hi @madsbk, I just tested this PR and the performance regressions have been fixed!

I ran my test script

import dask.array as da
import numpy as np
import time


data = da.random.random(
    size=(100_000, 1000, 1000), chunks=(1, 1000, 1000)
)

idxs = [(0,), (50_000,), (99_999,)]

t0 = time.time()
reduced_data = np.min([np.min(data[idx]) for idx in idxs])
t1 = time.time()
print(t1 - t0)

And this is back at around ~0.19 ms, which matches 2.27.0 and is much better than the 3.4 seconds it was on 2.28.0

I'm not so familiar with dask testing practices or policies, but this seems as good a time and place as any to ask if the project has any timing based tests? In napari we use the pytest.mark.timeout(n) decorator in a few places to catch major performance regressions - see this test, which was very helpful for us identifying this performance regression - we saw it on our CI right away on the 2.28.0 release, and I'm wondering if some rough tests like that might be helpful for dask too?

Thanks again for the quick turn around on the fix!!

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @madsbk! This is in

@jrbourbeau jrbourbeau merged commit bd18820 into dask:master Oct 8, 2020
@TomAugspurger
Copy link
Member

TomAugspurger commented Oct 13, 2020

Just FYI, the benchmark at https://pandas.pydata.org/speed/dask/#array.Slicing.time_slice_int_tail (source) did pick up this regression.

It looks like with this fix we're still about 2.5x slower on that benchmark compared to 2.30.0. I'm unsure if that's worth worrying about.

Edit: I'm going to open a new issue for these. A bit easier to track.

kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
Updates `get_dependencies()` to work on single keys instead of all keys in the `Layer`
@madsbk madsbk deleted the hlg_deps_per_key branch February 16, 2021 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2.28.0 performance related issues

4 participants