
add task counts to HighLevelGraph and Layer html reprs#8589

Merged
jsignell merged 3 commits into dask:main from kori73:task-count on Feb 18, 2022

Conversation

@kori73
Contributor

@kori73 kori73 commented Jan 19, 2022

I am simply passing the len. Is this okay?

Example:

arr = da.ones(shape=(100, 100), chunks=[5, 10])
arr2 = da.ones(shape=(100, 100))
res = arr.dot(arr2)

[Screenshot from 2022-01-20: HTML repr showing the new task counts]

@GPUtester
Collaborator

Can one of the admins verify this patch?

@quasiben
Member

add to allowlist

info = {
    "layer_type": type(self).__name__,
    "is_materialized": self.is_materialized(),
    "tasks (unoptimized)": f"{len(self)}",
Member


I think this will be number of layers not number of tasks. Right @ian-r-rose?

Collaborator


I believe this is Layer.__len__, not HighLevelGraph.__len__, so it should be the right length for this part.

Collaborator

@ian-r-rose ian-r-rose left a comment


Thanks @kori73! I actually think that calling len() on HighLevelGraph Layers is not quite as safe as we would hope.

With the current design, __len__() gives the length of the materialized graph. But a lot of the time we don't actually know that length until it is materialized, as specific culling operations can dramatically change the length. So calling __len__() can force materialization before we are ready.

See, e.g., the implementation for the BroadcastJoinLayer, which constructs the full task graph upon calling __len__():

dask/dask/layers.py

Lines 919 to 936 in 4228dc7

@property
def _dict(self):
    """Materialize full dict representation"""
    if hasattr(self, "_cached_dict"):
        return self._cached_dict
    else:
        dsk = self._construct_graph()
        self._cached_dict = dsk
        return self._cached_dict

def __getitem__(self, key):
    return self._dict[key]

def __iter__(self):
    return iter(self._dict)

def __len__(self):
    return len(self._dict)

The thing that should be safe to call is get_output_keys(), which is explicitly meant to not force materialization:

@abc.abstractmethod
def get_output_keys(self) -> AbstractSet:
    """Return a set of all output keys

    Output keys are all keys in the layer that might be referenced by
    other layers.

    Classes overriding this implementation should not cause the layer
    to be materialized.

    Returns
    -------
    keys: AbstractSet
        All output keys
    """
    return self.keys()  # this implementation will materialize the graph

This has a slightly different meaning from the materialized graph length, in that it is restricted only to keys that might be of interest to other layers, rather than intermediate results or other private-ish tasks.

So I'd recommend tweaking the goal of this PR to either:

  1. Only show __len__ if the graph is materialized.
  2. Show the length of get_output_keys() instead (and maybe change the label from tasks (unoptimized) to something like number of outputs).

I'd probably prefer the second, as it's closer to the heart of what HighLevelGraphs are meant to do, which is reason about inputs/outputs without forcing a full graph construction.
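The trade-off between the two options can be sketched with a toy layer. LazyLayer here is a hypothetical stand-in for a dask Layer, not dask's actual class:

```python
# Toy sketch of the trade-off: __len__() forces graph construction,
# while get_output_keys() stays cheap. Illustrative only, not dask code.
class LazyLayer:
    def __init__(self, name, npartitions):
        self.name = name
        self.npartitions = npartitions
        self.materialized = False

    def _construct_graph(self):
        # Expensive step: builds one task per partition.
        self.materialized = True
        return {(self.name, i): ("work", i) for i in range(self.npartitions)}

    def __len__(self):
        # Option 1: counting tasks forces full materialization.
        return len(self._construct_graph())

    def get_output_keys(self):
        # Option 2: output keys come from metadata alone.
        return {(self.name, i) for i in range(self.npartitions)}

layer = LazyLayer("x", 4)
n_outputs = len(layer.get_output_keys())  # cheap, graph not built
assert not layer.materialized
n_tasks = len(layer)                      # forces materialization
assert layer.materialized
```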


@kori73
Contributor Author

kori73 commented Jan 21, 2022

Thanks a lot for the detailed explanation @ian-r-rose! I suspected that this might not be the right approach without knowing exactly why (I was thinking it might force some unwanted computation). I will update the PR according to your suggestion.

@kori73
Contributor Author

kori73 commented Jan 22, 2022

I have updated to use:

  - for layers: get_output_keys
  - for graph: get_all_external_keys, which seems to call get_output_keys for each layer.
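As a rough sketch of what that aggregation does (illustrative names and shapes, not dask's actual implementation), the graph-level count is the union of each layer's output keys:

```python
# Hypothetical sketch of aggregating per-layer output keys into a
# graph-level key count; not dask's real get_all_external_keys.
def all_external_keys(layers):
    keys = set()
    for layer_name, npartitions in layers.items():
        # Mirrors a layer's get_output_keys(): one (name, i) key per partition.
        keys |= {(layer_name, i) for i in range(npartitions)}
    return keys

# Toy graph: layer name -> number of output partitions.
layers = {"ones-a": 4, "ones-b": 2, "dot": 8}
n_keys = len(all_external_keys(layers))  # 4 + 2 + 8 = 14 keys
```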

Updated example:

[Screenshot from 2022-01-22: updated HTML repr showing key counts]

@bryanwweber
Contributor

@ian-r-rose Can you take another look here when you get a chance?

@jsignell jsignell added enhancement Improve existing functionality or make things work better highlevelgraph Issues relating to HighLevelGraphs. needs review Needs review from a contributor. labels Feb 4, 2022
Collaborator

@ian-r-rose ian-r-rose left a comment


So sorry for the slow review @kori73! I'm happy with this implementation as it is, I just have some musings about how best to describe these keys, which I think are probably confusing to a lot of users. I'd be curious to hear your thoughts.

<h3 style="margin-bottom: 0px;">HighLevelGraph</h3>
<p style="color: var(--jp-ui-font-color2, #5D5851); margin-bottom:0px;">
{{ type }} with {{ layers | length }} layers.
{{ type }} with {{ layers | length }} layers and {{ n_outputs }} outputs.
Collaborator


I'm finding the term "outputs" to be a bit ambiguous. As used here, it's all the "output" keys for each individual layer. But I could also imagine it meaning the number of output keys for the last layer, that is, the number of outputs for the whole computation, discarding intermediate keys.

We could, of course, include both. But since users are also able to inspect the number of outputs in the last layer already, maybe we could just say something like "and {{ n_outputs }} keys from all layers". Or maybe it's okay as-is. What do you think @kori73 ?

Contributor Author


I totally agree that outputs sounds like the output keys for the last layer. keys from all layers would be much less ambiguous.

@kori73
Contributor Author

kori73 commented Feb 17, 2022

No problem at all @ian-r-rose! I really appreciate the time you are all putting in on the reviews.

I also have another concern unrelated to this discussion. I just want to make sure that we are not doing anything that will harm the user experience. According to #8570 get_all_external_keys could cause memory problems (i.e. generation of billions of keys).

In this case we might be creating very large collections just to find their lengths. Also, I'm guessing we are creating them twice: once in layer.get_output_keys and again in hlg.get_all_external_keys.

In the case of SimpleShuffleLayer, this is what we do:

dask/dask/layers.py

Lines 425 to 427 in bd8e8dc

def get_output_keys(self):
    return {(self.name, part) for part in self.parts_out}

Instead we could have just used self.parts_out. So, do you think the current implementation is okay as it is or should we reconsider?
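The point can be illustrated with a minimal stand-in for the shuffle layer (a hypothetical class, not dask's SimpleShuffleLayer): the key count is already available from metadata, so building the key set just to measure it is wasted work.

```python
# Hypothetical stand-in for a shuffle-like layer; parts_out already
# knows how many output partitions there are.
class ShuffleLikeLayer:
    def __init__(self, name, parts_out):
        self.name = name
        self.parts_out = parts_out

    def get_output_keys(self):
        # Allocates one tuple per partition just so the repr can count them.
        return {(self.name, part) for part in self.parts_out}

layer = ShuffleLikeLayer("shuffle", range(100_000))
# The counts agree, but the second form avoids building 100,000 tuples.
assert len(layer.get_output_keys()) == len(layer.parts_out)
```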

@ian-r-rose
Collaborator

> I also have another concern unrelated to this discussion. I just want to make sure that we are not doing anything that will harm the user experience. According to #8570 get_all_external_keys could cause memory problems (i.e. generation of billions of keys).

At least for get_all_external_keys(), the set is cached, so it should be safe to call it more than once:

return self._all_external_keys

I would say that the intention of get_all_external_keys() is that it is fairly cheap to call. So using it here would not be a misuse of the API. But, as you point out, there are some issues with the design of HighLevelGraph algorithms that make it not as cheap to manipulate as intended. To my mind, that's something that will need to be addressed in the short term, though I don't think we need to do it here. Any refactoring of the highlevelgraph cull operation will need to have something that allows you to reason about how "big" each graph is, and it should be easy to update the repr here should we need to.

TL;DR, I think this is fine, and would probably be easy to update if we need to
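The caching behavior described above can be sketched as follows (illustrative names, not dask's code): the expensive union is computed on the first call and reused afterwards.

```python
# Sketch of the caching pattern: the key set is computed once,
# then later calls return the cached result. Not dask's actual class.
class CachedGraph:
    def __init__(self, layer_keys):
        self.layer_keys = layer_keys  # list of per-layer key sets
        self.compute_calls = 0

    def get_all_external_keys(self):
        if not hasattr(self, "_all_external_keys"):
            self.compute_calls += 1  # expensive union happens only once
            self._all_external_keys = set().union(*self.layer_keys)
        return self._all_external_keys

g = CachedGraph([{("a", 0), ("a", 1)}, {("b", 0)}])
first = g.get_all_external_keys()
second = g.get_all_external_keys()  # served from cache
assert first is second
assert g.compute_calls == 1
```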

Collaborator

@ian-r-rose ian-r-rose left a comment


Thanks @kori73!

@jsignell jsignell removed the needs review Needs review from a contributor. label Feb 18, 2022
@jsignell jsignell merged commit 2ed4545 into dask:main Feb 18, 2022
@jsignell
Member

Thank you @kori73 and @ian-r-rose for your work on this!!


Labels

enhancement Improve existing functionality or make things work better highlevelgraph Issues relating to HighLevelGraphs.


Development

Successfully merging this pull request may close these issues.

Show length of layer in HLG JupyterLab repr

6 participants