JIT: guard DifferentiableGraph node #49433
Conversation
💊 CI failures summary and remediations (as of commit 0b1c880; more details on the Dr. CI page):
🚧 2 fixed upstream failures: These were probably caused by upstream breakages that were already fixed.
Please rebase on the
Force-pushed from 3c73921 to 3752916.
eellison left a comment
Cool, looks great, thanks for the PR!
I think we need to figure out when we're specializing the graph, because we might be doing it too generally here without guarding to make sure that the types we have set are guaranteed to be executed.
cc @Krovatkin as well for review
runPreAutodiffPassPipeline(copy);

if (needsGradientInProfilingMode(copy->block())) {
  RemoveProfileNodesAndSpecializeTypes(copy);
Hmm, we're specializing the types of the graph here without guaranteeing that they will be seen. This could lead to invalid peepholes in the forward of the differentiable graph. Could we try to rewrite this without removing the profiling nodes here? The only property that we're guaranteeing stays consistent is the gradient of the inputs.
I think the flow we want is:
- CreateAutodiffSubgraphs
- Specialize the inputs of CreateAutodiffSubgraphs on their requires_grad property. This might require going into the DifferentiableGraph and aggregating the grad property based on uses.
- Add a Type Guard to the diff graph. In the fallback function, we probably want to inline the differentiable graph, since we are running it only if gradients have changed, and we may not need to differentiate at all.
- Next, when we differentiate the graph, it would be nice if we're not adding excess gradients, since we know which inputs require grad. (This doesn't have to be done in this PR.)
It is sort of annoying to have to keep the profile nodes in the graph, but I don't know if there's a better solution here.
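For illustration, a minimal eager-mode sketch (the function and shapes are made up) of the hazard being discussed: the requires_grad seen during profiling can differ from what later calls pass in, so a specialization baked in without a guard would not be safe.

    import torch

    # Hypothetical toy function; the point is only the requires_grad flip across calls.
    @torch.jit.script
    def f(x, y):
        return (x * y + y).sum()

    a = torch.randn(4, 4)                       # early calls: requires_grad == False
    b = torch.randn(4, 4)
    for _ in range(5):                          # profiling runs record "no grad needed"
        f(a, b)

    a2 = torch.randn(4, 4, requires_grad=True)  # a later call flips requires_grad
    f(a2, b).backward()                         # the executor must still produce a correct backward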
I had the following in mind:
- CreateAutodiffSubgraphs
- Specialize the inputs of DifferentiableGraphs
- Add a guard based on the input requiresGrad.
My thinking was that the fallback function doesn't contain a differentiable graph, only the nodes that have been moved inside the subgraph, so it would be good to use as is.
    copy,
    getAutodiffSubgraphInlining() ? autodiffSubgraphInlineThreshold : 1);
RemoveProfilingNodes(copy);
replaceFallbackGraphWithFallbackFunction(copy->block());
Hmm, how does this work? What do we want the graph to look like?

if TypeCheck:
    DifferentiableGraph
else:
    FallBack(...)
Within the Fallback, we probably don't want it to be a DifferentiableGraph, right, because the gradient properties are different? We might need to inline the DifferentiableGraph in the FallBack graph, if I'm reading this right.
I was thinking that the fallback graph is inserted by insertTypeGuard using the subgraph attached to the guarded node (i.e. the DifferentiableGraph), so it does not include the DifferentiableGraph itself. Then, here, replaceFallbackGraphWithFallbackFunction replaces the fallback graph with the fallback function. (And that gets an executor, so it can then do its own profiling + optimization.)
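As a side note, the structure the executor ends up running can be inspected from Python; a small sketch (whether a TypeCheck-guarded DifferentiableGraph actually shows up depends on the executor settings and the build):

    import torch

    @torch.jit.script
    def f(x, y):
        return torch.tanh(x + y) * y

    x = torch.randn(4, 4, requires_grad=True)
    y = torch.randn(4, 4, requires_grad=True)
    for _ in range(3):   # give the profiling executor a chance to specialize and guard
        f(x, y)

    # dumps the graph that was actually executed, guards and fallbacks included
    print(torch.jit.last_executed_optimized_graph())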
Thank you for the review! I'll update the PR tomorrow morning. For some reason, in the CI the requiresGrad of the type seems to be nullopt rather than false.
I updated the type guarding to pull the type from the first profile node inside the differentiable graph to the outside and to leave the profile node alone; I hope this works.
Force-pushed from a972819 to cd26ec4.
When the output of a differentiable graph is the input to another, the second graph has no profile node to take the requires_grad information from. Note that this is also a problem for the fusers. On the other hand, it depends on unmergeable differentiable graphs, which might not be too common.

@eellison I must admit I'm a bit dense, but this has me thinking again that it might be worth exploring specializing types before creating the autodiff subgraphs -- then we wouldn't have the problem. Could you elaborate your concerns? Or we might remove and re-instrument the things inside the differentiable graph. That would have us do another set of profiling runs, but would likely work out nicely. I will probably revisit this once I have thought a bit more about incremental profiling.
Codecov Report
@@ Coverage Diff @@
## master #49433 +/- ##
===========================================
- Coverage 80.57% 41.03% -39.54%
===========================================
Files 1883 500 -1383
Lines 204369 67395 -136974
===========================================
- Hits 164662 27658 -137004
- Misses 39707 39737 +30
// defined
return TensorType::get()->withRequiresGrad(t->requiresGrad());
});
if (all_requires_grad_defined) {
It's somewhat common for LSTMs (in a loop) to have tensor values whose requires_grad attribute flips from false (in the first iteration) to true (in the subsequent iterations). This behaviour would result in all_requires_grad_defined being set to false. The other probable scenario would be when we switch from training to eval/test (i.e. true to false).
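For illustration, a minimal eager-mode sketch of that flip (module and sizes are made up): the initial hidden state does not require grad, but from the second iteration on it is produced from parameters that do.

    import torch

    cell = torch.nn.LSTMCell(8, 16)   # the cell's parameters require grad
    x = torch.randn(5, 3, 8)          # (seq_len, batch, input_size)
    h = torch.zeros(3, 16)            # initial hidden state: requires_grad == False
    c = torch.zeros(3, 16)

    for t in range(x.size(0)):
        print(t, h.requires_grad)     # False in iteration 0, True afterwards
        h, c = cell(x[t], (h, c))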
I think we could consider the following options:
a) enable re-profiling if we aren't sure, so we will profile a steady state (this would cost us an extra call in a loop, to always go through a fallback)
b) gamble and assume that requires_grad will be true. This strategy works out very nicely for both the LSTM and the eval/train scenarios. We would take the fallback in the first iterations, since requires_grad would be false, but then we would be running the optimized path (no fallbacks). Similarly, we spend more time in training than in eval, so incurring the fallback penalty in the test/eval runs is preferable.
c) peel the first iteration, so the profiler only sees requires_grad=true. We kinda tried it, but very reluctantly.
d) do nothing and lose out on perf, since we won't fuse in non-diff regions as long as there is even a single requires_grad=true anywhere in the graph.
Thoughts, @t-vi @eellison @jjsjann123?
Oha, indeed! I would go for b). If that doesn't work out, I'd probably go for actually querying the ProfilingRecord for which configuration is the most commonly seen.
This will apply to completely unseen things, too, though...
I also kinda like b). It seems to be a reasonable heuristic; we could always revise it later.
This will apply to completely unseen things, too, though...
That's true. We could check for unseen things, as their types are just TensorType::get().
I'd probably go for actually querying the configuration in the ProfilingRecord for which is the most commonly seen.
I'm actually working on this in a separate PR. However, the initial version would aggregate across different invocations of a function rather than different iterations of a loop.
that's true. we could check for unseen things as their types are just TensorType::get().
I wonder whether we might just have a "seen" bit on the profiling node. I've been thinking about incremental profiling (where we leave profiling nodes for the bits we don't have records of and re-run the optimization once we hit them) and will probably play with it one of these days.
Yea, I think we should do b) but filter out TensorType::get() (unseen tensors). Incremental profiling could have value, I'm not sure, but it is probably out of scope for this PR.
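(For reference, a tiny sketch of what "unseen" means here: before any profiling run, a scripted function's tensor values carry only the unspecialized Tensor type, i.e. TensorType::get().)

    import torch

    @torch.jit.script
    def f(x):
        return x + 1

    # before f has ever been run, the input is annotated only as "Tensor"
    print(f.graph)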
OK, I'll return to taking out graphs with TensorType::get() for now and look into incremental profiling. Thank you for the insightful comments!
Finally, it seems that the CI is happy, and I'm reasonably happy, too.
Krovatkin left a comment
Looks good. I had a few fairly minor comments that would be nice to fix.
bool shouldProfileNode(const Node* node) {
  std::lock_guard<std::mutex> guard(mutex_);
  if (isDifferentiable(node)) {
Could you please add a comment here? I presume we need this to minimize the number of nodes whose requires_grad isn't set because they aren't being profiled?
if (n->kind() == prim::profile) {
  GRAPH_DEBUG(
      "setting input ", i, " to type ", *n->ty(attr::profiled_type));
  dni->setType(n->ty(attr::profiled_type));
Is there a reason we can't just set the requiresGrad attribute vs. the complete type, i.e. TensorType::get()->withRequiresGrad(n->ty(attr::profiled_type)->requiresGrad())?
But then, if requiresGrad is undefined (i.e. the case you mentioned), we would end up with something indistinguishable from an unseen tensor. So I'd probably leave it as is for now.
GRAPH_DEBUG(
    "setting input ", i, " to type ", *n->ty(attr::profiled_type));
dni->setType(n->ty(attr::profiled_type));
} else if (dni->node()->kind() == prim::DifferentiableGraph) {
Yeah. Eventually, I think the best fix would be to 1) re-profile the graph inside the differentiable graph (or do we do that already when using the differentiable graph operator?) or 2) apply the type annotations before creating the autodiff graphs. I understood from @eellison that this isn't good to do at the moment, but I haven't understood well enough what is unsafe about it to see whether that could be fixed.
1) is easy enough - see "switching from Code to GraphExecutor for gradient forward" #47434.
2) is a little weird because it gets away from our "fusers decide what to guard on" logic, and I'm also a little worried about proving which properties in the differentiable graph are guaranteed. For instance, if an op takes a NumberType, the output type could change if in one iteration the input is an Integer and in the next a Float. I'm not saying that's particularly common, but it's just one example where strictly guarding on Tensor inputs does not guarantee consistent tensor types.
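A concrete instance of that NumberType point, as a quick Python check: with the same tensor input, the output dtype changes depending on whether the scalar argument is an integer or a float.

    import torch

    x = torch.arange(3)          # int64 tensor
    print((x * 2).dtype)         # torch.int64: an integer scalar keeps the integer dtype
    print((x * 2.0).dtype)       # torch.float32: a float scalar promotes the output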
facebook-github-bot left a comment
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Cool cool, LGTM. The comments I made, you / @Krovatkin / I can do in a follow-up; it doesn't have to be in this PR. Thanks for doing this, I know it was a ton of work and you improved some infrastructure that was lacking along the way!
@jjsjann123 it would be good to try to speed up the layer norm autodiff with this in place.
@@ -0,0 +1,17 @@
import torch
There's a test/jit/test_profiler.py; this test might be better placed there.
int64_t i = 0;
std::string s = "";
double f = 0.0;
TypePtr ty;
| GRAPH_DEBUG("Optimizing diff node ", idx, " in ", *copy); | ||
| if (!guardDifferentiableGraph(dnode)) { | ||
| // if we cannot guard (because of inputs without profiling information), | ||
| // we re-inline the subgraph and remove the differentiable node |
It would be nice if we just didn't guard on inputs without profiling information, instead of disabling differentiable graphs entirely, maybe as a follow-up... I'm not sure this is actually relevant in practice, though.
for (size_t i = 0; i < gi.size(); i++) {
  auto ty = gi[i]->type()->cast<TensorType>();
  if (ty) {
    auto n = gi[i]->uses().at(0).user;
I suppose with DifferentiableGraphs we're always fusing just a block, so either all or none of the uses will have been executed.
  }
}
if (all_inputs_seen) {
  // we may have seen both true and false for requires_grad. In this case
Hmm, when would this happen exactly? I'm a little confused here. I guess this is the case where the first LSTM loop iteration doesn't require grad?
For an LSTM, in the first timestep the hidden state usually doesn't require gradients, while the inputs would.
}
GRAPH_DEBUG("After guardDifferentiableGraph:\n", *copy);
auto diff_graph = std::move(dnode->g(attr::Subgraph));
Gradient gradient = differentiate(diff_graph);
Once we re-profile within differentiable graphs, we can remove the profile nodes before differentiating, so we don't compute unnecessary gradients. I'm not sure we're doing that now.
See pytorch/torch/csrc/jit/runtime/autodiff.cpp, line 396 (at 22a34bc). (Not for this PR, as a follow-up.)
// ideally we would set this up for re-profiling
UpdateDifferentiableGraphRequiresGrad(
    dnode->g(attr::Subgraph), c10::nullopt);
SubgraphUtils::unmergeSubgraph(dnode);
UpdateDifferentiableGraphRequiresGrad(dnode->g(attr::Subgraph), c10::nullopt);
replaceBlockWithFallbackGraph(SubgraphUtils::getSubgraph(dnode)->block());
SubgraphUtils::unmergeSubgraph(dnode);
Hey @t-vi, sorry things are moving a little slowly during the holidays! We're trying to merge this ~soon~, however @Krovatkin noticed a 3-4% slowdown on fastrnns. We're looking at whether we can make this faster by replacing the generic TypeCheck node with a check that just looks at requires_grad, and by replacing the unoptimized block with a FallBack path. Any other ideas welcome! As you pointed out, there is some overhead with the extra executor; I wonder if there's any way we could inline fallback graphs after we've finished optimizing them 🤔 I also don't know why we're instantiating the interpreter state on each forward -
Thank you for catching this. But do we think the type check is the problem or the fallback?
I am slowly looking into this. I added a specialized check here, #49929, but I didn't get a chance to measure it yet (some setup issues with a benchmarking machine I usually use). I'll measure it and report back. If my patch doesn't help, I'll add counters to confirm that we are indeed hitting more bailouts.
Do you know whether the std is the std of the measurements or adjusted for the number of runs, btw? You shouldn't have to fix my PR.
Here's the raw CSV data. If I remember correctly, the slowdown was greater than both stds.

To give a little context: I started to look into it as a follow-up to your PR, when I thought we were ready to merge as-is and before I ran the numbers. This was just too similar to my other effort to speed up TypeChecks in general.
@t-vi specialized checks seem to help more often than not (in most runs we get most of the perf back), although this machine is pretty noisy: https://docs.google.com/spreadsheets/d/1UCzYlHFnGo_G097aq0sSbmYAwd1Go_ouyGpKamCmerI/edit?usp=sharing
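For what it's worth, a minimal sketch of how such a comparison can be timed with torch.utils.benchmark (the workload here is a made-up stand-in, not the fastrnns benchmark); it reports the median together with the spread of the individual measurements:

    import torch
    import torch.utils.benchmark as benchmark

    @torch.jit.script
    def step(x, h):
        return torch.tanh(x @ h)

    x = torch.randn(64, 512)
    h = torch.randn(512, 512, requires_grad=True)

    timer = benchmark.Timer(stmt="step(x, h)", globals={"step": step, "x": x, "h": h})
    print(timer.blocked_autorange(min_run_time=1.0))   # median, IQR, number of measurements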
@t-vi could you please rebase your PR and I'll try merging ASAP! Sorry for the delay!
@Krovatkin will do! Thank you!
This adds guarding for DifferentiableGraph nodes. Fixes: pytorch#49299
facebook-github-bot left a comment
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@Krovatkin merged this pull request in ea087e2.
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes pytorch/pytorch#49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.

Pull Request resolved: pytorch/pytorch#49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296