Fix optimize_blockwise bug for duplicate dependency names #8542
gjoseph92
left a comment
I don't understand blockwise optimization at all, so these comments are very much from the hip.
dask/blockwise.py (outdated):

    new_axes = inputs[root].new_axes
    concatenate = inputs[root].concatenate
    dsk = dict(inputs[root].dsk)
    seen = defaultdict(int)
Do we need to track this globally? Or could we start fresh at each layer we process (is that the `for i, (dep, ind) in enumerate(indices)` loop)?
I had thought the problem was specifically when one layer used the same input name multiple times, but with different iteration orders (indices), not just that the same name occurred in the whole graph multiple times. So maybe this is too wide of a net? Or maybe it doesn't matter?
If this is the case, then maybe rather than a counter, we could actually just have use_dep = f"{dep}_{ind}", i.e. identify dependencies internally not just by unique name, but by unique combo of name and indexing pattern?
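That suggestion can be sketched with a hypothetical helper (the function name is illustrative, not dask's actual code): keying fused dependencies by the name + index pattern keeps distinct uses of the same layer apart, while identical name/index combos still collide on purpose.

```python
# Sketch of the "identify by name + index" idea from the review
# (helper name is hypothetical, not part of dask).
def make_internal_key(dep, ind):
    # "b" used with indices "ij" vs "jk" must map to distinct keys,
    # while the same name/index combo maps to the same key
    return f"{dep}_{''.join(ind)}" if ind is not None else dep

assert make_internal_key("b", "ij") != make_internal_key("b", "jk")
assert make_internal_key("b", "ij") == make_internal_key("b", "ij")
assert make_internal_key("lit", None) == "lit"  # literals have no indices
```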
I was wondering the same thing. Currently sitting in a REPL trying to see if I can make a name/index pair that should be combined with this, but is not.
I agree that we can probably move all the logic into the same dep-processing step. This exact suggestion is one of the reasons the PR was still marked as draft. We may also want to avoid "duplicating" the k-v pair if the name+index combination matches exactly.
Actually, scratch my last comment... A single `for i, (dep, ind) in enumerate(indices):` loop does not actually cover the repeated-dep case that we need. The outer `while changed:` loop actually constructs the fused subgraph in a depth-first way (not in a breadth-first, layer-by-layer order). This is why `dep` may be the same across two distinct passes, and we want to make sure we do not reuse the first index seen for that name for all subsequent occurrences.
dask/blockwise.py (outdated):

    if _ind:
        # We only append a "counting" label for
        # `ind` cases after the first occurrence
Were you able to find a case where both the dep and the ind were matching and the previous implementation separated them? I was not, unfortunately.
Hmm - I was thinking da.add would capture this case, but apparently it does not. In that case, the simpler defaultdict(int) solution may be preferable for now.
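The `defaultdict(int)` scheme under discussion can be sketched in isolation (the helper name is illustrative, not dask's actual code): every occurrence of a repeated dependency name after the first gets a numeric suffix, so repeated deps no longer collapse onto one key.

```python
from collections import defaultdict

# Counter-based disambiguation sketch: repeated dependency names get a
# numeric suffix on every occurrence after the first.
seen = defaultdict(int)

def unique_name(dep):
    suffix = seen[dep]
    seen[dep] += 1
    return f"{dep}_{suffix}" if suffix else dep

assert unique_name("b") == "b"    # first occurrence keeps its name
assert unique_name("b") == "b_1"  # later occurrences are disambiguated
assert unique_name("c") == "c"    # independent names are unaffected
```

Note the trade-off relative to the index-based naming: this scheme separates every repeat, even when the name *and* index pattern match exactly.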
@crusaderky - Are all these recent
This is #8549 :
This looks like a regression in this PR:
This may be true, but I probably won't bother "debugging" the minor changes here until I see Windows/Python-3.7 passing in other PRs :)
dask/blockwise.py (outdated):

    # Change new_dsk key to match use_dep
    if dep != use_dep and dep in new_dsk:
        new_dsk[use_dep] = new_dsk.pop(dep)
Is it okay that we're changing the name of the key in the dsk? If dep == root, I could see this being a problem, because the resulting Blockwise's output key wouldn't be in its dsk.
It may be that it's impossible for this case to happen for root, but I think we should at least prove and assert that.
We are only changing the name of an internal key that we are adding to the subgraph during fusion (note that dsk is the internal subgraph for the root blockwise layer). The name of the root layer we are fusing into does not change here. However, I see your concern and can add an explicit assertion that this is not happening.
I'm pretty sure that if dep == root and dep != use_dep, this will break the SubgraphCallable that ultimately gets produced, because the outkey (which is root) won't exist in the dsk.
Another observation: if dep == root, then by definition no other keys should depend on dep, so just leaving it in the dict under dep instead of use_dep should be fine. Basically this would be if dep != root and dep != use_dep and dep in new_dsk.
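A minimal sketch of that guarded rename (`rename_dep` is a hypothetical helper, not a dask function): internal keys get moved to their disambiguated alias, but the root/output key is never renamed, since the output key must remain present in the dsk.

```python
# Guarded key-rename sketch: rename an internal subgraph key to its
# disambiguated alias, but never touch the root/output key.
def rename_dep(new_dsk, dep, use_dep, root):
    if dep != root and dep != use_dep and dep in new_dsk:
        new_dsk[use_dep] = new_dsk.pop(dep)
    return new_dsk

dsk = {"b": ("add", "_0", "_1")}
# internal dep gets renamed to its alias
assert rename_dep(dict(dsk), "b", "b_1", root="c") == {"b_1": ("add", "_0", "_1")}
# the root key is left alone, so the output key stays in the dsk
assert rename_dep(dict(dsk), "b", "b_1", root="b") == dsk
```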
dask/blockwise.py (outdated):

    use_dep = dep + f"_{seen[dep]}" if seen[dep] else dep
    seen[dep] += 1
Suggested change:

    - use_dep = dep + f"_{seen[dep]}" if seen[dep] else dep
    - seen[dep] += 1
    + use_dep = f"{dep}_{''.join(ind)}"
Still wondering whether we actually need the seen set, or whether we can just include the indices in the name in every case. As far as I understand, the key names in this dsk are an implementation detail of Blockwise and could be changed to whatever as long as the dask is still valid (is this correct??).
Making this change on your branch makes your test_optimize_blockwise_duplicate_dependency pass, along with all other Array tests besides dask/array/tests/test_atop.py::test_rewrite. The reason test_rewrite fails is that its cases include handwritten dasks of the expected output, so obviously the key names there no longer match. However, I don't know if those key names matter from a correctness standpoint, or if we could just adjust those tests. (The fact that they're not failing with this branch just indicates to me that none of those cases involve this situation of a repeated dep, so they're incomplete anyway.)
I'd guess the only case where the key names matter is with output, which (as I mentioned in the other comment) probably needs to be addressed here too.
Right - We could certainly add an enumeration in every key that is added to the root subgraph during fusion. This should result in "correct" behavior, and changing the tests should be fine in my opinion. With that said, this approach won't really simplify the code much.
In this very complicated code, I'll take any simplification I can get :)
It also is slightly different behavior, and I'm not sure which one we want. In my version, two repeated dep names with the same (user-given?) indices will still trample each other. Maybe this is good, because they should be considered identical? Or maybe we want to consider every layer unique, even if the names and indices match. I'm really not sure.
@gjoseph92 - Thanks for the helpful review here! I revised the PR to align with the "append the index info every time" approach. It makes the PR larger, but the logic is a bit simpler. Let me know if you had something else in mind, or if you have further thoughts/suggestions :)
gjoseph92
left a comment
Nice, I feel like this is slightly easier to intuit. I'm still not sure my suggestion was the right approach though compared to what you originally did! I really don't know if matching dep-ind combos should be considered distinct or equivalent.
dask/blockwise.py:

    indices.append(index)
    new_dsk = subs(inputs[dep].dsk, sub)

    # Change new_dsk key to match local_dep
Just still wondering about this. Can we be guaranteed that the dsk contains no references to the key dep? I just don't know. In practice I think the graphs are totally formulaic, but in principle couldn't the dsk of a Blockwise be any arbitrary valid graph?
Would we want to do this renaming via subs or something just to be sure we're catching everything? In the same way we use subs because we're renaming the blockwise tokens?
Just still wondering about this. Can we be guaranteed that the dsk contains no references to the key dep? I just don't know.
I'm definitely not sure I understand what you are asking here. Maybe you could share a hypothetical graph with the problem you are thinking of?
In practice I think the graphs are totally formulaic, but in principle couldn't the dsk of a Blockwise be any arbitrary valid graph?
I am pretty sure the answer is no. At this point dsk is not a SubgraphCallable yet, and so it's guaranteed to be pretty simple (i.e. {<layer-name>: (<function>, <one-or-more-blockwise-tokens>)}).
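For illustration, a toy version of that simple subgraph shape, evaluated by hand (the values are made up): one layer-name key mapping to a task whose arguments are blockwise tokens `_0`, `_1`, ... standing in for the layer's inputs.

```python
from operator import add

# Toy example of the "pretty simple" subgraph shape described above:
# {<layer-name>: (<function>, <one-or-more-blockwise-tokens>)}
dsk = {"c": (add, "_0", "_1")}

# Evaluate by hand: bind concrete values to the tokens.
func, *args = dsk["c"]
values = {"_0": 1, "_1": 2}
assert func(*(values[a] for a in args)) == 3
```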
Would we want to do this renaming via subs or something just to be sure we're catching everything? In the same way we use subs because we're renaming the blockwise tokens?
When we rename blockwise tokens, we are modifying the values (and not the keys) in the dictionary. In this case, we are only changing a single key. I am not really understanding what would be missed here.
My worry was basically that the dsk could in theory include references to the name of a dependency inside its tasks:
Lines 327 to 329 in 1a1abfd
Notice the second sentence. We would have just changed the name of that input, without changing its name within the tasks.
However, what I missed was that the dsk gets evaluated inside a SubgraphCallable, so its scope, essentially, is restricted. It's impossible for it to reference keys outside of itself.
And the other thing I missed is that that docstring is basically wrong: the dsk cannot reference the inputs as keys. It has to use special (undocumented) keys generated by blockwise_token.
Put together, those facts mean that the only keys it's possible for dsk to reference are _0, _1, etc. And those keys are already rewritten during the optimization process. If the dsk uses the name of one of its inputs literally, instead of the _N argument for it, it'll fail whether or not it gets optimized.
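A simplified stand-in (not dask's actual SubgraphCallable) can illustrate the restricted-scope point: when the subgraph is evaluated, only the `_N` input tokens and the subgraph's own internal keys are resolvable, so a task that referenced an outside input by its literal name could never succeed.

```python
from operator import mul

# Minimal stand-in for scoped subgraph evaluation: only the "_N" token
# keys (bound to call arguments) and the subgraph's own keys are visible.
def run_subgraph(dsk, outkey, inkeys, args):
    scope = dict(zip(inkeys, args))

    def evaluate(key):
        if key in scope:
            return scope[key]
        # KeyError here if `key` names something outside the subgraph
        func, *rest = dsk[key]
        return func(*(evaluate(r) if isinstance(r, str) else r for r in rest))

    return evaluate(outkey)

assert run_subgraph({"out": (mul, "_0", "_1")}, "out", ["_0", "_1"], (3, 4)) == 12
```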
"At this point dsk is not a SubgraphCallable yet, and so it's guaranteed to be pretty simple (i.e. {<layer-name>: (<function>, <one-or-more-blockwise-tokens>)})"

I would not say dsk is guaranteed to be simple. (Counterexample: the Blockwise layer returned by rewrite_blockwise! Its graph is rather complicated.) If you happen to use the blockwise.blockwise function to make your Blockwise layer instance, then yes the graph will look like what you said, but the Blockwise class itself can accept any dsk of arbitrary complexity. There's no documentation or validation saying otherwise, so I'd say there's nothing in the contract about the dsk. This should be tightened up a lot: for example, the docs should say that the dsk must use blockwise tokens as its input keys, and that it must use the key corresponding to output as its result.
Ah - Okay. Thanks for clarifying your thoughts here. I think I agree with everything you said :)
    [
        [
            (b, "jk", {b: (add, _0, _1)}, [(a, "jk"), (2, None)]),
            (c, "ijk", {c: (add, _0, _1)}, [(b, "ij"), (b, "jk")]),
Should you also test repeating the same name-indices combination as an input? Like

    (x, "ijk", {c: (add, _0, _1, _2)}, [(b, "ij"), (b, "jk"), (b, "ij")]),

or also

    (y, "ijk", {c: (sub, _0, _1)}, [(b, "ij"), (b, "jk")]),

(so b_i_j and b_j_k show up as deps multiple times in the list of inputs)
I added another test case, but I'm not sure if it covers the case you are thinking of here.
Thanks again for the reviews here @gjoseph92 - Sorry that I didn't see your review from yesterday until this evening!

I feel pretty confident that it makes sense to always encode the index along with the layer name when it is fused into the subgraph of another layer. Also, I think it is pretty clear that two instances of the same dep-ind combination should be considered equivalent within the context of the same subgraph (which is what we are considering here). The problem that motivated this PR was when two distinct dep-ind combinations were erroneously being treated as equivalent.
gjoseph92
left a comment
Okay, I've played with this a lot and I'm confident this change doesn't introduce any new problems.
Also, I think it is pretty clear that two instances of the same dep-ind combination should be considered equivalent within the context of the same subgraph
Having thought about this more, I agree. The dep names should already be distinct identifiers, and the indices are normalized by the time they reach this point.
@jrbourbeau we think this is ready, might be nice to get in the release?
rerun tests
As this is a relatively important bug fix, and @gjoseph92 @rjzamora are confident about the changes in this PR, let's go ahead and include it. @rjzamora, would you like the honor of pressing the green button?
Thanks @rjzamora and @gjoseph92!
This PR adds a possible solution to #8535

It seems that the `rewrite_blockwise` phase of the current `optimize_blockwise` implementation can lose important information when a `Blockwise` layer depends on two other `Blockwise` layers with the same name (i.e. the same layer). For example, if `matmul` is used to multiply a matrix with itself, the same matrix projection may be used for both inputs to the binary operation (rather than the proper ij,jk projections). This will happen if the duplicate input is also a `Blockwise` layer (like `X*2`), because the original `indices` information in the `matmul` layer will be lost when the `X*2` result is fused into the root subgraph.

This PR proposes that we avoid this problem by appending a "counting" label to duplicated dependency names when they are fused into the root subgraph. This ensures that operations like `matmul(a, b)` will always store a distinct key-value pair for `a` and `b` in the subgraph.

pre-commit run --all-files

cc @ian-r-rose
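A pure-NumPy sketch (not dask code) of why the lost index information matters: contracting the two operands over the proper ij,jk indices is a genuinely different computation from erroneously reusing the same index pattern for both inputs, which is what the bug effectively did.

```python
import numpy as np

x = np.arange(4.0).reshape(2, 2)
a = x * 2  # stands in for the duplicated Blockwise-style input

# Correct behavior: the two uses of `a` get distinct index patterns.
correct = np.einsum("ij,jk->ik", a, a)   # proper matmul contraction

# Bug analogue: both inputs erroneously share the same index pattern,
# which collapses the contraction into an elementwise-style operation.
collapsed = np.einsum("ij,ij->ij", a, a)

assert np.allclose(correct, a @ a)
assert not np.allclose(correct, collapsed)
```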