
add task counts to HighLevelGraph and Layer html reprs#8589

Merged
jsignell merged 3 commits into dask:main from kori73:task-count on Feb 18, 2022

Conversation

@kori73
Contributor

@kori73 kori73 commented Jan 19, 2022

I am simply passing the len. Is this okay?

Example:

arr = da.ones(shape=(100, 100), chunks=[5, 10])
arr2 = da.ones(shape=(100, 100))
res = arr.dot(arr2)

[Screenshot from 2022-01-20: HTML repr showing the new task counts]

@GPUtester
Collaborator

Can one of the admins verify this patch?

@quasiben
Member

add to allowlist

info = {
    "layer_type": type(self).__name__,
    "is_materialized": self.is_materialized(),
    "tasks (unoptimized)": f"{len(self)}",
Member


I think this will be number of layers not number of tasks. Right @ian-r-rose?

Collaborator


I believe this is Layer.__len__, not HighLevelGraph.__len__, so it should be the right length for this part.

Collaborator

@ian-r-rose ian-r-rose left a comment


Thanks @kori73! I actually think that calling len() on HighLevelGraph Layers is not quite as safe as we would hope.

With the current design, __len__() gives the length of the materialized graph. But a lot of the time we don't actually know that length until it is materialized, as specific culling operations can dramatically change the length. So calling __len__() can force materialization before we are ready.

See, e.g., the implementation for the BroadcastJoinLayer, which constructs the full task graph upon calling __len__():

dask/dask/layers.py

Lines 919 to 936 in 4228dc7

@property
def _dict(self):
    """Materialize full dict representation"""
    if hasattr(self, "_cached_dict"):
        return self._cached_dict
    else:
        dsk = self._construct_graph()
        self._cached_dict = dsk
        return self._cached_dict

def __getitem__(self, key):
    return self._dict[key]

def __iter__(self):
    return iter(self._dict)

def __len__(self):
    return len(self._dict)

The thing that should be safe to call is get_output_keys(), which is explicitly meant to not force materialization:

@abc.abstractmethod
def get_output_keys(self) -> AbstractSet:
    """Return a set of all output keys

    Output keys are all keys in the layer that might be referenced by
    other layers.

    Classes overriding this implementation should not cause the layer
    to be materialized.

    Returns
    -------
    keys: AbstractSet
        All output keys
    """
    return self.keys()  # this implementation will materialize the graph

This has a slightly different meaning from the materialized graph length, in that it is restricted only to keys that might be of interest to other layers, rather than intermediate results or other private-ish tasks.

So I'd recommend tweaking the goal of this PR to either:

  1. Only show __len__ if the graph is materialized.
  2. Show the length of get_output_keys() instead (and maybe change the label from tasks (unoptimized) to something like number of outputs).

I'd probably prefer the second, as it's closer to the heart of what HighLevelGraphs are meant to do, which is reason about inputs/outputs without forcing a full graph construction.
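The trade-off between the two options can be sketched with a toy layer. LazyLayer here is a hypothetical stand-in for a dask Layer, not dask's actual class:

```python
# Toy sketch of the trade-off: __len__() forces graph construction,
# while get_output_keys() stays cheap. Illustrative only, not dask code.
class LazyLayer:
    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions
        self.materialized = False

    def _construct_graph(self):
        # Expensive step: builds one task per partition.
        self.materialized = True
        return {(self.name, i): ("work", i) for i in range(self.npartitions)}

    def __len__(self):
        # Option 1: counting tasks forces full materialization.
        return len(self._construct_graph())

    def get_output_keys(self):
        # Option 2: output keys come from metadata alone.
        return {(self.name, i) for i in range(self.npartitions)}

layer = LazyLayer("x", 4)
n_outputs = len(layer.get_output_keys())  # cheap, graph not built
assert not layer.materialized
n_tasks = len(layer)                      # forces materialization
assert layer.materialized
```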


@kori73
Contributor Author

kori73 commented Jan 21, 2022

Thanks a lot for the detailed explanation @ian-r-rose! I suspected that this might not be the right approach without knowing exactly why (I was thinking it might force some unwanted computation). I will update the PR according to your suggestion.

@kori73
Contributor Author

kori73 commented Jan 22, 2022

I have updated to use:

  - for layers: get_output_keys
  - for graph: get_all_external_keys, which seems to call get_output_keys for each layer.
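As a rough sketch of what that aggregation does (illustrative names and shapes, not dask's actual implementation), the graph-level count is the union of each layer's output keys:

```python
# Hypothetical sketch of aggregating per-layer output keys into a
# graph-level key count; not dask's real get_all_external_keys.
def all_external_keys(layers):
    keys = set()
    for layer_name, npartitions in layers.items():
        # Mirrors a layer's get_output_keys(): one (name, i) key per partition.
        keys |= {(layer_name, i) for i in range(npartitions)}
    return keys

# Toy graph: layer name -> number of output partitions.
layers = {"ones-a": 4, "ones-b": 2, "dot": 8}
n_keys = len(all_external_keys(layers))  # 4 + 2 + 8 = 14 keys
```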

Updated example:

[Screenshot from 2022-01-22: updated HTML repr showing key counts]

@bryanwweber
Contributor

@ian-r-rose Can you take another look here when you get a chance?

@jsignell jsignell added enhancement Improve existing functionality or make things work better highlevelgraph Issues relating to HighLevelGraphs. needs review Needs review from a contributor. labels Feb 4, 2022
Collaborator

@ian-r-rose ian-r-rose left a comment


So sorry for the slow review @kori73! I'm happy with this implementation as it is, I just have some musings about how best to describe these keys, which I think are probably confusing to a lot of users. I'd be curious to hear your thoughts.

<h3 style="margin-bottom: 0px;">HighLevelGraph</h3>
<p style="color: var(--jp-ui-font-color2, #5D5851); margin-bottom:0px;">
{{ type }} with {{ layers | length }} layers.
{{ type }} with {{ layers | length }} layers and {{ n_outputs }} outputs.
Collaborator


I'm finding the term "outputs" to be a bit ambiguous. As used here, it's all the "output" keys for each individual layer. But I could also imagine it meaning the number of output keys for the last layer, that is, the number of outputs for the whole computation, discarding intermediate keys.

We could, of course, include both. But since users are also able to inspect the number of outputs in the last layer already, maybe we could just say something like "and {{ n_outputs }} keys from all layers". Or maybe it's okay as-is. What do you think @kori73 ?

Contributor Author


I totally agree that outputs sounds like the output keys for the last layer. keys from all layers would be much less ambiguous.

@kori73
Contributor Author

kori73 commented Feb 17, 2022

No problem at all @ian-r-rose! I really appreciate the time you are all putting in on the reviews.

I also have another concern unrelated to this discussion. I just want to make sure that we are not doing anything that will harm the user experience. According to #8570 get_all_external_keys could cause memory problems (i.e. generation of billions of keys).

In this case we might be creating very large collections just to find their lengths. Also, I'm guessing we are creating them twice: once in layer.get_output_keys and again in hlg.get_all_external_keys.

In the case of SimpleShuffleLayer, this is what we do:

dask/dask/layers.py

Lines 425 to 427 in bd8e8dc

def get_output_keys(self):
    return {(self.name, part) for part in self.parts_out}

Instead we could have just used self.parts_out. So, do you think the current implementation is okay as it is or should we reconsider?
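The point can be illustrated with a minimal stand-in for the shuffle layer (a hypothetical class, not dask's SimpleShuffleLayer): the key count is already available from metadata, so building the key set just to measure it is wasted work.

```python
# Hypothetical stand-in for a shuffle-like layer; parts_out already
# knows how many output partitions there are.
class ShuffleLikeLayer:
    def __init__(self, name, parts_out):
        self.name = name
        self.parts_out = parts_out

    def get_output_keys(self):
        # Allocates one tuple per partition just so the repr can count them.
        return {(self.name, part) for part in self.parts_out}

layer = ShuffleLikeLayer("shuffle", range(100_000))
# The counts agree, but the second form avoids building 100,000 tuples.
assert len(layer.get_output_keys()) == len(layer.parts_out)
```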

@ian-r-rose
Collaborator

> I also have another concern unrelated to this discussion. I just want to make sure that we are not doing anything that will harm the user experience. According to #8570 get_all_external_keys could cause memory problems (i.e. generation of billions of keys).

At least for get_all_external_keys(), the set is cached, so it should be safe to call it more than once:

return self._all_external_keys

I would say that the intention of get_all_external_keys() is that it is fairly cheap to call. So using it here would not be a misuse of the API. But, as you point out, there are some issues with the design of HighLevelGraph algorithms that make it not as cheap to manipulate as intended. To my mind, that's something that will need to be addressed in the short term, though I don't think we need to do it here. Any refactoring of the highlevelgraph cull operation will need to have something that allows you to reason about how "big" each graph is, and it should be easy to update the repr here should we need to.

TL;DR, I think this is fine, and would probably be easy to update if we need to
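The caching behavior described above can be sketched as follows (illustrative names, not dask's code): the expensive union is computed on the first call and reused afterwards.

```python
# Sketch of the caching pattern: the key set is computed once,
# then later calls return the cached result. Not dask's actual class.
class CachedGraph:
    def __init__(self, layer_keys):
        self.layer_keys = layer_keys  # list of per-layer key sets
        self.compute_calls = 0

    def get_all_external_keys(self):
        if not hasattr(self, "_all_external_keys"):
            self.compute_calls += 1  # expensive union happens only once
            self._all_external_keys = set().union(*self.layer_keys)
        return self._all_external_keys

g = CachedGraph([{("a", 0), ("a", 1)}, {("b", 0)}])
first = g.get_all_external_keys()
second = g.get_all_external_keys()  # served from cache
assert first is second
assert g.compute_calls == 1
```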

Collaborator

@ian-r-rose ian-r-rose left a comment


Thanks @kori73!

@jsignell jsignell removed the needs review Needs review from a contributor. label Feb 18, 2022
@jsignell jsignell merged commit 2ed4545 into dask:main Feb 18, 2022
@jsignell
Member

Thank you @kori73 and @ian-r-rose for your work on this!!


Labels

enhancement Improve existing functionality or make things work better highlevelgraph Issues relating to HighLevelGraphs.


Development

Successfully merging this pull request may close these issues.

Show length of layer in HLG JupyterLab repr

6 participants