Add more ordering diagnostics to dask.visualize#7992

Merged
jsignell merged 5 commits into dask:main from eriknw:ordering_diagnostics
Nov 2, 2021

Conversation

@eriknw
Member

eriknw commented Aug 4, 2021

I added these to help investigate #7929

The names could probably be better. This also needs to be documented and tested, but I thought I'd share, because, you know, pretty plots! I think all options here are potentially useful. Anything else we might want to show?

Speaking of pretty plots, I'll share some tomorrow. Good night!

  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

eriknw added 3 commits August 3, 2021 23:18
I added these to help investigate dask#7929

The names could probably be better.  This also needs documented and tested,
but I thought I'd share, because, you know, pretty plots!  I think all options
here are potentially useful.  Anything else we might want to show?

Speaking of pretty plots, I'll share some tomorrow.  Good night!
Also, update so that visualize `color="pressure"` includes memory usage when the task
is run (shown on the function) and when the data is released (shown on the data).
@eriknw
Member Author

eriknw commented Aug 4, 2021

As promised, here are a few graphs. These show memory pressure--the number of dependencies that are held in memory when a task is run (shown in the function circle) and when the data of a task is released (shown in the data rectangle).

Here's a simple example to get you started:

graph1

Note that all of these were created with color="order-pressure" and cmap="plasma". The label on each function shows the order number and the memory pressure. To show only the memory pressure, use color="pressure".

Another one that is slightly more complicated:

graph2

And another:

graph3

And a more complicated one:

graph4

And the example from #7929 (main branch)

graph5

And the same one using PR 7929:

graph5-pr

To compare the two graphs above more directly, it would be nice to have them on the same scale. So, we can now pass maxval= to visualize, which changes the previous graph to:

graph5-pr2

I think this is pretty nice. I don't know why I didn't whip this up long ago!
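
For readers who want the idea in code, here is a minimal pure-Python sketch of the memory-pressure count described above. It is not the code from this PR: `pressure_when_run`, the toy graph, and both orderings are made up for illustration, and it assumes tasks run one at a time and that an output is dropped as soon as its last consumer has run.

```python
def pressure_when_run(dependencies, ordering):
    """Map each task to the number of outputs held just before it runs.

    Illustrative only: assumes tasks run one at a time in `ordering` and
    that an output is dropped as soon as its last consumer has run.
    """
    # How many not-yet-run tasks still need each output.
    remaining = {task: 0 for task in dependencies}
    for deps in dependencies.values():
        for dep in deps:
            remaining[dep] += 1

    held = set()
    pressure = {}
    for task in ordering:
        pressure[task] = len(held)
        held.add(task)
        for dep in dependencies[task]:
            remaining[dep] -= 1
            if remaining[dep] == 0:
                held.discard(dep)
        if remaining[task] == 0:
            # Nothing consumes this output, so treat it as released right away.
            held.discard(task)
    return pressure


# Hypothetical toy graph: two pairs of inputs, each pair reduced, then combined.
graph = {
    "a1": [], "a2": [], "a": ["a1", "a2"],
    "b1": [], "b2": [], "b": ["b1", "b2"],
    "out": ["a", "b"],
}
finish_a_first = ["a1", "a2", "a", "b1", "b2", "b", "out"]
load_everything_first = ["a1", "a2", "b1", "b2", "a", "b", "out"]

good = pressure_when_run(graph, finish_a_first)
bad = pressure_when_run(graph, load_everything_first)

# One shared upper bound lets both orderings be judged on the same scale.
shared_max = max(max(good.values()), max(bad.values()))
print(good, bad, shared_max)
```

In this toy case the first ordering never holds more than three outputs when a task starts, while the second holds four when "a" runs. Putting both on one scale is the same motivation as passing maxval= in the comparison above.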

@ncclementi
Member

@eriknw Checking in here, is this PR still WIP, or is it in a state ready for review?

@eriknw
Member Author

eriknw commented Sep 23, 2021

Thanks for checking! Still WIP, and not forgotten. Feature-wise, it could be reviewed.

TODO:

  • docstring
  • tests

@eriknw
Member Author

eriknw commented Oct 27, 2021

Okay, I think this is ready. Naming and describing things was somewhat challenging.

I chose the diagnostics that I think will be most useful when trying to understand what dask.order is doing, especially when it does something poorly. Let me walk through an example where we investigate sub-optimal ordering (sigh).

Let's consider the Dask graph from da.arange(N, chunks=1).cumsum(0, method='blelloch'). Here's the familiar visualization with color="order", cmap="autumn":

example_order
We can see that the first half (the red at the bottom) is ordered well, but there may be some issues with the second half (yellow and orange nodes are intermixed).

Let's look at age, color="age", cmap="plasma":

example_age
Indigo colors (most of the graph) are good, but we can clearly see that the data from some nodes (reddish and orangish) are held in memory longer than desired. But, why?
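
As a rough illustration of what "age" measures, here is a small pure-Python sketch. It is an assumption-laden toy, not the implementation in dask.order: the helper `ages`, the graph, and the orderings are hypothetical, and age is taken to be the position of a node's last consumer minus the node's own position.

```python
def ages(dependencies, ordering):
    """Map each task to how many steps its output is held in memory.

    Illustrative only: the age is taken to be the position of the last
    consumer minus the position of the task itself (0 if nothing consumes it).
    """
    position = {task: i for i, task in enumerate(ordering)}
    last_use = dict(position)  # a task with no consumers is released at once
    for task, deps in dependencies.items():
        for dep in deps:
            last_use[dep] = max(last_use[dep], position[task])
    return {task: last_use[task] - position[task] for task in ordering}


# Hypothetical toy graph: a chain x0 -> x1 -> x2 -> x3, plus an extra input
# "e" that only the final task x3 needs.
graph = {"x0": [], "x1": ["x0"], "x2": ["x1"], "e": [], "x3": ["x2", "e"]}

too_early = ages(graph, ["e", "x0", "x1", "x2", "x3"])
just_in_time = ages(graph, ["x0", "x1", "x2", "e", "x3"])
print(too_early["e"], just_in_time["e"])
```

Running "e" first keeps its output around for four steps; running it right before x3 keeps it for one. A node with a large age in the plot is in the same situation as "e" in the first ordering.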

Here we look at how many more outputs are held after the lifetime of each node, color="memoryincreases":

example_memoryincreases
Large values may indicate nodes that should have run later. Indeed, here, the yellow nodes in the upper left are clearly run far too soon.

Similarly, we can look at how many fewer outputs are held after the lifetime of each node, color="memorydecreases":

example_memorydecreases
Large values may indicate nodes that should have run sooner, which is indeed the case for the yellow nodes in the middle.

Now that we have an idea of what's going on, let's look at a more complicated, but very informative visualization--color="memorypressure". This one indicates how many outputs are held when a node is run (the circle in the diagram) and when the output of the node is released (the rectangle in the diagram).

example_memorypressure

This tells the same story as the previous diagrams, but it does so differently.

This last diagram shows how many dependencies are released when a node is run, color="freed".
example_freed

This can be a nice view, because it also shows patterns differently. It is clear that something is amiss in the upper left portion, because there are many tasks that don't release dependencies (but should) when they are run.
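
To make the "freed" count concrete, here is a minimal pure-Python sketch. Again, this is only an illustration under stated assumptions and not the PR's code: `freed_when_run` and the toy graph are hypothetical, and a dependency is counted as released by whichever task consumes it last in the given ordering.

```python
def freed_when_run(dependencies, ordering):
    """Map each task to the number of dependencies it releases when it runs.

    Illustrative only: a dependency counts as released by the last task
    (in `ordering`) that consumes it.
    """
    # How many not-yet-run tasks still need each output.
    remaining = {task: 0 for task in dependencies}
    for deps in dependencies.values():
        for dep in deps:
            remaining[dep] += 1

    freed = {}
    for task in ordering:
        count = 0
        for dep in dependencies[task]:
            remaining[dep] -= 1
            if remaining[dep] == 0:
                count += 1
        freed[task] = count
    return freed


# Hypothetical toy graph: "s" is shared by three tasks that feed one reduction.
graph = {"s": [], "t1": ["s"], "t2": ["s"], "t3": ["s"], "u": ["t1", "t2", "t3"]}
freed = freed_when_run(graph, ["s", "t1", "t2", "t3", "u"])
print(freed)
```

Here t1 and t2 free nothing because "s" is still needed, t3 frees "s", and "u" frees all three of its inputs. A long run of zeros among tasks that do have dependencies is the kind of pattern the plot makes visible.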

So, where does dask.order go wrong? Here's a closeup view of the middle section with color="order":

example_order

Either node 151 (the yellow node in the middle) should have run immediately after node 81, which it didn't, or node 112 should have run after node 81 (to work towards running node 117).

@eriknw eriknw marked this pull request as ready for review October 27, 2021 00:34
@eriknw
Member Author

eriknw commented Oct 27, 2021

I forgot a plot in the example investigation in the previous post. The second value returned by dask.order.diagnostics is the number of outputs held over time. Here's an example:

import dask
import dask.array as da
import pandas as pd
import hvplot.pandas

A = da.arange(33, chunks=1).cumsum(0, method='blelloch')
info, num_in_memory = dask.order.diagnostics(dict(dask.base.collections_to_dsk([A])))
df = pd.DataFrame(
    {
        'time': list(range(len(num_in_memory))),
        'Num in memory': num_in_memory
    }
)
# rasterize=True is nice for very large graphs
df.hvplot(x='time', y='Num in memory', rasterize=False)

bokeh_plot (4)

num_in_memory is a list, and is very useful, which is why I have it as a separate return value. For example, I used it here: #7583 (comment). Strictly speaking, it doesn't need to be a return value, since it can be reconstructed from the data in info above. For example:

assert num_in_memory == [
    val.num_data_when_run for val in sorted(info.values(), key=lambda x: x.order)
]

I still like it as a separate return value to make it easier to use.

@eriknw eriknw changed the title WIP: Add more ordering diagnostics to dask.visualize Add more ordering diagnostics to dask.visualize Oct 27, 2021
@jsignell
Member

jsignell commented Nov 2, 2021

This is a big diagnostic improvement and I love the narrative that you go through in #7992 (comment) - can we make that into a blog post or a how-to?

@jsignell jsignell merged commit 89d93a8 into dask:main Nov 2, 2021
@quasiben
Member

quasiben commented Nov 2, 2021

+1 to writing up as a blogpost

@eriknw
Member Author

eriknw commented Nov 2, 2021

Aw, thanks for the kind words!

Yeah, I can write a blog post from this, but you must know how much it pains me to show off dask.order not doing well 😛 .

I'm about to go on vacation (hooray!), so it'll be at least a few weeks before I can get around to it.

@jsignell
Member

jsignell commented Nov 3, 2021

no worries, anytime is a good time :) it can be framed as how to understand or debug.
