add task counts to HighLevelGraph and Layer html reprs#8589
add task counts to HighLevelGraph and Layer html reprs#8589
Conversation
|
Can one of the admins verify this patch? |
|
add to allowlist |
dask/highlevelgraph.py
Outdated
| info = { | ||
| "layer_type": type(self).__name__, | ||
| "is_materialized": self.is_materialized(), | ||
| "tasks (unoptimized)": f"{len(self)}", |
There was a problem hiding this comment.
I think this will be number of layers not number of tasks. Right @ian-r-rose?
There was a problem hiding this comment.
I believe this is Layer.__len__, not HighLevelGraph.__len__, so it should be the right length for this part.
ian-r-rose
left a comment
There was a problem hiding this comment.
Thanks @kori73! I actually think that calling len() on HighLevelGraph Layers is not quite as safe as we would hope.
With the current design, __len__() gives the length of the materialized graph. But a lot of the time we don't actually know that length until it is materialized, as specific culling operations can dramatically change the length. So calling __len__() canl force materialization before we are ready.
See, e.g., the implementation for the BroadcastJoinLayer, which constructs the full task graph upon calling __len__():
Lines 919 to 936 in 4228dc7
The thing that should be safe to call is get_output_keys(), which is explicitly meant to not force materialization:
Lines 94 to 109 in 4228dc7
This has a slightly different meaning from the materialized graph length, in that it is restricted only to keys that might be of interest to other layers, rather than intermediate results or other private-ish tasks.
So I'd recommend tweaking the goal of this PR to either:
- Only show
__len__if the graph is materialized. - Show the length of
get_output_keys()instead (and maybe change the label fromtasks (unoptimized)to something likenumber of outputs.
I'd probably prefer the second, as it's closer to the heart of what HighLevelGraphs are meant to do, which is reason about inputs/outputs without forcing a full graph construction.
dask/highlevelgraph.py
Outdated
| info = { | ||
| "layer_type": type(self).__name__, | ||
| "is_materialized": self.is_materialized(), | ||
| "tasks (unoptimized)": f"{len(self)}", |
There was a problem hiding this comment.
I believe this is Layer.__len__, not HighLevelGraph.__len__, so it should be the right length for this part.
|
Thanks a lot for the detailed explanation @ian-r-rose! I was suspecting that this might not be the right approcach without knowing why exactly (was thinking that this might force some unwanted computation). I will update the PR according to your suggestion. |
|
@ian-r-rose Can you take another look here when you get a chance? |
ian-r-rose
left a comment
There was a problem hiding this comment.
So sorry for the slow review @kori73! I'm happy with this implementation as it is, I just have some musings about how best to describe these keys, which I think are probably confusing to a lot of users. I'd be curious to hear your thoughts.
| <h3 style="margin-bottom: 0px;">HighLevelGraph</h3> | ||
| <p style="color: var(--jp-ui-font-color2, #5D5851); margin-bottom:0px;"> | ||
| {{ type }} with {{ layers | length }} layers. | ||
| {{ type }} with {{ layers | length }} layers and {{ n_outputs }} outputs. |
There was a problem hiding this comment.
I'm finding the term "outputs" to be a bit ambiguous. As used here, it's all the "output" keys for each individual layer. But I could also imagine it being meaning the number of output keys for the last layer. That is to say, the number of outputs for the whole computation, discarding intermediate keys.
We could, of course, include both. But since users are also able to inspect the number of outputs in the last layer already, maybe we could just say something like "and {{ n_outputs }} keys from all layers". Or maybe it's okay as-is. What do you think @kori73 ?
There was a problem hiding this comment.
I totally agree that outputs sounds like the output keys for the last layer. keys from all layers would be much less ambiguous.
|
No problem at all @ian-r-rose! I really appreciate the time you are all putting in on the reviews. I also have another concern unrelated to this discussion. I just want to make sure that we are not doing anything that will harm the user experience. According to #8570 In this case we might be creating very large collections just to find their lenghts. Also, I'm guessing we are creating them twice, one in the In the case of Lines 425 to 427 in bd8e8dc Instead we could have just used |
At least for Line 772 in bd8e8dc I would say that the intention of TL;DR, I think this is fine, and would probably be easy to update if we need to |
|
Thank you @kori73 and @ian-r-rose for your work on this!! |

pre-commit run --all-filesI am simply passing the len. Is this okay?
Example: