JIT: guard DifferentiableGraph node #49433
Conversation
💊 CI failures summary and remediations (as of commit 0b1c880; more details on the Dr. CI page):
🚧 2 fixed upstream failures: These were probably caused by upstream breakages that were already fixed.
Please rebase on the
Force-pushed from 3c73921 to 3752916.
eellison left a comment
Cool, looks great, thanks for the PR!
I think we need to figure out when we're specializing the graph, because we might be doing it too generally here without guarding to make sure that the types we have set are guaranteed to be executed.
cc @Krovatkin as well for review
runPreAutodiffPassPipeline(copy);

if (needsGradientInProfilingMode(copy->block())) {
  RemoveProfileNodesAndSpecializeTypes(copy);
Hmm, we're specializing the types of the graph here without guaranteeing that they will be seen. This could lead to invalid peepholes in the forward of the differentiable graph. Could we try to rewrite this without removing the profiling nodes here? The only property that we're guaranteeing stays consistent is the gradient of the inputs.
I think the flow we want is:
- CreateAutodiffSubgraphs
- Specialize the inputs of CreateAutodiffSubgraphs on their requires_grad property. This might require going into the DifferentiableGraph and aggregating the grad property based on uses.
- Add a Type Guard to the diff graph. In the fallback function, we probably want to inline the differentiable graph, since we are running it only if gradients have changed, and we may not need to differentiate at all.
- Next, when we differentiate the graph, it would be nice if we're not adding excess gradients, since we know which inputs require grad. (This doesn't have to be done in this PR.)
It is sort of annoying to have to keep the profile nodes in the graph, but I don't know if there's a better solution here.
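For illustration, a minimal eager-mode sketch (the function and shapes are made up) of the hazard being discussed: the requires_grad seen during profiling can differ from what later calls pass in, so a specialization baked in without a guard would not be safe.

    import torch

    # Hypothetical toy function; the point is only the requires_grad flip across calls.
    @torch.jit.script
    def f(x, y):
        return (x * y + y).sum()

    a = torch.randn(4, 4)                       # early calls: requires_grad == False
    b = torch.randn(4, 4)
    for _ in range(5):                          # profiling runs record "no grad needed"
        f(a, b)

    a2 = torch.randn(4, 4, requires_grad=True)  # a later call flips requires_grad
    f(a2, b).backward()                         # the executor must still produce a correct backward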
I had the following in mind:
- CreateAutodiffSubgraphs
- Specialize the inputs of DifferentiableGraphs
- Add a guard based on the input requiresGrad.
My thinking was that the fallback function doesn't contain a differentiable graph, only the nodes that have been moved inside the subgraph, so it would be good to use as is.
    copy,
    getAutodiffSubgraphInlining() ? autodiffSubgraphInlineThreshold : 1);
RemoveProfilingNodes(copy);
replaceFallbackGraphWithFallbackFunction(copy->block());
Hmm, how does this work? What do we want the graph to look like?

if TypeCheck:
    DifferentiableGraph
else:
    FallBack(...)
Within the Fallback, we probably don't want it to be a DifferentiableGraph, right, because the gradient properties are different? We might need to inline the DifferentiableGraph in the FallBack graph, if I'm reading this right.
I was thinking that the fallback graph is inserted by insertTypeGuard using the subgraph attached to the guarded node (i.e. the DifferentiableGraph), so it does not include the DifferentiableGraph itself. Then, here, replaceFallbackGraphWithFallbackFunction replaces the fallback graph with the fallback function. (And that gets an executor, so it can then do its own profiling + optimization.)
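As a side note, the structure the executor ends up running can be inspected from Python; a small sketch (whether a TypeCheck-guarded DifferentiableGraph actually shows up depends on the executor settings and the build):

    import torch

    @torch.jit.script
    def f(x, y):
        return torch.tanh(x + y) * y

    x = torch.randn(4, 4, requires_grad=True)
    y = torch.randn(4, 4, requires_grad=True)
    for _ in range(3):   # give the profiling executor a chance to specialize and guard
        f(x, y)

    # dumps the graph that was actually executed, guards and fallbacks included
    print(torch.jit.last_executed_optimized_graph())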
Thank you for the review! I'll update the PR tomorrow morning. For some reason, in the CI the requiresGrad of the type seems to be nullopt rather than false.
I updated the type guarding to pull the type from the first profile node inside the differentiable graph to the outside and to leave the profile node alone; I hope this works.
Force-pushed from a972819 to cd26ec4.
When the output of a differentiable graph is the input to another, the second graph has no profile node to take the requires_grad information from. Note that this is also a problem for the fusers. On the other hand, it depends on unmergeable differentiable graphs, which might not be too common.

@eellison I must admit I'm a bit dense, but this has me thinking again that it might be worth exploring specializing types before creating the autodiff subgraphs -- then we wouldn't have the problem. Could you elaborate your concerns? Or we might remove and re-instrument the things inside the differentiable graph. That would have us do another set of profiling runs, but would likely work out nicely. I will probably revisit this once I have thought a bit more about incremental profiling.
Codecov Report
@@ Coverage Diff @@
## master #49433 +/- ##
===========================================
- Coverage 80.57% 41.03% -39.54%
===========================================
Files 1883 500 -1383
Lines 204369 67395 -136974
===========================================
- Hits 164662 27658 -137004
- Misses 39707 39737 +30
// defined
return TensorType::get()->withRequiresGrad(t->requiresGrad());
});
if (all_requires_grad_defined) {
It's somewhat common for LSTMs (in a loop) to have tensor values whose requires_grad attribute flips from false (in the first iteration) to true (in the subsequent iterations). This behaviour would result in all_requires_grad_defined being set to false. The other probable scenario would be when we switch from training to eval/test (i.e. true to false).
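For illustration, a minimal eager-mode sketch of that flip (module and sizes are made up): the initial hidden state does not require grad, but from the second iteration on it is produced from parameters that do.

    import torch

    cell = torch.nn.LSTMCell(8, 16)   # the cell's parameters require grad
    x = torch.randn(5, 3, 8)          # (seq_len, batch, input_size)
    h = torch.zeros(3, 16)            # initial hidden state: requires_grad == False
    c = torch.zeros(3, 16)

    for t in range(x.size(0)):
        print(t, h.requires_grad)     # False in iteration 0, True afterwards
        h, c = cell(x[t], (h, c))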
I think we could consider the following options:
a) enable re-profiling if we aren't sure, so we will profile a steady state (this would cost us an extra call in a loop, to always go through a fallback)
b) gamble and assume that requires_grad will be true. This strategy works out very nicely for both the LSTM and the eval/train scenarios. We would take the fallback in the first iterations, since requires_grad would be false, but then we would be running the optimized path (no fallbacks). Similarly, we spend more time in training than in eval, so incurring the fallback penalty in the test/eval runs is preferable.
c) peel the first iteration, so the profiler only sees requires_grad=true. We kinda tried it, but very reluctantly.
d) do nothing and lose out on perf, since we won't fuse in non-diff regions as long as there is even a single requires_grad=true anywhere in the graph.
Thoughts, @t-vi @eellison @jjsjann123?
Oha, indeed! I would go for b). If that doesn't work out, I'd probably go for actually querying the ProfilingRecord for which configuration is the most commonly seen.
This will apply to completely unseen things, too, though...
I also kinda like b). It seems to be a reasonable heuristic; we could always revise it later.
This will apply to completely unseen things, too, though...
That's true. We could check for unseen things, as their types are just TensorType::get().
I'd probably go for actually querying the configuration in the ProfilingRecord for which is the most commonly seen.
I'm actually working on this in a separate PR. However, the initial version would aggregate across different invocations of a function rather than different iterations of a loop.
that's true. we could check for unseen things as their types are just TensorType::get().
I wonder whether we might just have a "seen" bit on the profiling node. I've been thinking about incremental profiling (where we leave profiling nodes for the bits we don't have records of and re-run the optimization once we hit them) and will probably play with it one of these days.
Yea, I think we should do b) but filter out TensorType::get() (unseen tensors). Incremental profiling could have value, I'm not sure, but it is probably out of scope for this PR.
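(For reference, a tiny sketch of what "unseen" means here: before any profiling run, a scripted function's tensor values carry only the unspecialized Tensor type, i.e. TensorType::get().)

    import torch

    @torch.jit.script
    def f(x):
        return x + 1

    # before f has ever been run, the input is annotated only as "Tensor"
    print(f.graph)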
OK, I'll return to taking out graphs with TensorType::get() for now and look into incremental profiling. Thank you for the insightful comments!
Finally, it seems that the CI is happy, and I'm reasonably happy, too.
Krovatkin left a comment
Looks good. I had a few fairly minor comments that would be nice to fix.
bool shouldProfileNode(const Node* node) {
  std::lock_guard<std::mutex> guard(mutex_);
  if (isDifferentiable(node)) {
Could you please add a comment here? I presume we need this to minimize the number of nodes whose requires_grad isn't set because they aren't being profiled?
if (n->kind() == prim::profile) {
  GRAPH_DEBUG(
      "setting input ", i, " to type ", *n->ty(attr::profiled_type));
  dni->setType(n->ty(attr::profiled_type));
Is there a reason we can't just set the requiresGrad attribute vs. the complete type, i.e. TensorType::get()->withRequiresGrad(n->ty(attr::profiled_type)->requiresGrad())?
But then, if requiresGrad is undefined (i.e. the case you mentioned), we would end up with something indistinguishable from an unseen tensor. So I'd probably leave it as is for now.
GRAPH_DEBUG(
    "setting input ", i, " to type ", *n->ty(attr::profiled_type));
dni->setType(n->ty(attr::profiled_type));
} else if (dni->node()->kind() == prim::DifferentiableGraph) {
Yeah. Eventually, I think the best fix would be to 1) re-profile the graph inside the differentiable graph (or do we do that already when using the differentiable graph operator?) or 2) apply the type annotations before creating the autodiff graphs. I understood from @eellison that this isn't good to do at the moment, but I haven't understood well enough what is unsafe about it to see whether that could be fixed.
1) is easy enough - see "switching from Code to GraphExecutor for gradient forward" #47434.
2) is a little weird because it gets away from our "fusers decide what to guard on" logic, and I'm also a little worried about proving which properties in the differentiable graph are guaranteed. For instance, if an op takes a NumberType, the output type could change if in one iteration the input is an Integer and in the next a Float. I'm not saying that's particularly common, but it's just one example where strictly guarding on Tensor inputs does not guarantee consistent tensor types.
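A concrete instance of that NumberType point, as a quick Python check: with the same tensor input, the output dtype changes depending on whether the scalar argument is an integer or a float.

    import torch

    x = torch.arange(3)          # int64 tensor
    print((x * 2).dtype)         # torch.int64: an integer scalar keeps the integer dtype
    print((x * 2.0).dtype)       # torch.float32: a float scalar promotes the output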
facebook-github-bot left a comment
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Cool cool, LGTM. The comments I made, you / @Krovatkin / I can do in a follow-up; it doesn't have to be in this PR. Thanks for doing this, I know it was a ton of work and you improved some infrastructure that was lacking along the way!
@jjsjann123 it would be good to try to speed up the layer norm autodiff with this in place.
@@ -0,0 +1,17 @@
import torch
There's a test/jit/test_profiler.py; this test might be better placed there.
int64_t i = 0;
std::string s = "";
double f = 0.0;
TypePtr ty;
| GRAPH_DEBUG("Optimizing diff node ", idx, " in ", *copy); | ||
| if (!guardDifferentiableGraph(dnode)) { | ||
| // if we cannot guard (because of inputs without profiling information), | ||
| // we re-inline the subgraph and remove the differentiable node |
It would be nice if we just didn't guard on inputs without profiling information, instead of disabling differentiable graphs entirely, maybe as a follow-up... I'm not sure this is actually relevant in practice, though.
for (size_t i = 0; i < gi.size(); i++) {
  auto ty = gi[i]->type()->cast<TensorType>();
  if (ty) {
    auto n = gi[i]->uses().at(0).user;
I suppose with DifferentiableGraphs we're always fusing just a block, so either all or none of the uses will have been executed.
  }
}
if (all_inputs_seen) {
  // we may have seen both true and false for requires_grad. In this case
Hmm, when would this happen exactly? I'm a little confused here. I guess this is the case where the first LSTM loop iteration doesn't require grad?
For an LSTM, in the first timestep the hidden state usually doesn't require gradients, while the inputs would.
}
GRAPH_DEBUG("After guardDifferentiableGraph:\n", *copy);
auto diff_graph = std::move(dnode->g(attr::Subgraph));
Gradient gradient = differentiate(diff_graph);
Once we re-profile within differentiable graphs, we can remove the profile nodes before differentiating, so we don't compute unnecessary gradients. I'm not sure we're doing that now.
See pytorch/torch/csrc/jit/runtime/autodiff.cpp, line 396 (at 22a34bc). (Not for this PR, as a follow-up.)
// ideally we would set this up for re-profiling
UpdateDifferentiableGraphRequiresGrad(
    dnode->g(attr::Subgraph), c10::nullopt);
SubgraphUtils::unmergeSubgraph(dnode);
UpdateDifferentiableGraphRequiresGrad(dnode->g(attr::Subgraph), c10::nullopt);
replaceBlockWithFallbackGraph(SubgraphUtils::getSubgraph(dnode)->block());
SubgraphUtils::unmergeSubgraph(dnode);
Hey @t-vi, sorry things are moving a little slowly during the holidays! We're trying to merge this ~soon~, however @Krovatkin noticed a 3-4% slowdown on fastrnns. We're looking at whether we can make this faster by replacing the generic TypeCheck node with a check that just looks at requires_grad, and by replacing the unoptimized block with a FallBack path. Any other ideas welcome! As you pointed out, there is some overhead with the extra executor; I wonder if there's any way we could inline fallback graphs after we've finished optimizing them 🤔 I also don't know why we're instantiating the interpreter state on each forward -
Thank you for catching this. But do we think the type check is the problem or the fallback?
I am slowly looking into this. I added a specialized check here, #49929, but I didn't get a chance to measure it yet (some setup issues with a benchmarking machine I usually use). I'll measure it and report back. If my patch doesn't help, I'll add counters to confirm that we are indeed hitting more bailouts.
Do you know whether the std is the std of the measurements or adjusted for the number of runs, btw? You shouldn't have to fix my PR.
Here's the raw CSV data. If I remember correctly, the slowdown was greater than both stds.

To give a little context: I started to look into it as a follow-up to your PR, when I thought we were ready to merge as-is and before I ran the numbers. This was just too similar to my other effort to speed up TypeChecks in general.
@t-vi specialized checks seem to help more often than not (in most runs we get most of the perf back), although this machine is pretty noisy: https://docs.google.com/spreadsheets/d/1UCzYlHFnGo_G097aq0sSbmYAwd1Go_ouyGpKamCmerI/edit?usp=sharing
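For what it's worth, a minimal sketch of how such a comparison can be timed with torch.utils.benchmark (the workload here is a made-up stand-in, not the fastrnns benchmark); it reports the median together with the spread of the individual measurements:

    import torch
    import torch.utils.benchmark as benchmark

    @torch.jit.script
    def step(x, h):
        return torch.tanh(x @ h)

    x = torch.randn(64, 512)
    h = torch.randn(512, 512, requires_grad=True)

    timer = benchmark.Timer(stmt="step(x, h)", globals={"step": step, "x": x, "h": h})
    print(timer.blocked_autorange(min_run_time=1.0))   # median, IQR, number of measurements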
@t-vi could you please rebase your PR and I'll try merging ASAP! Sorry for the delay!
@Krovatkin will do! Thank you!
This adds guarding for DifferentiableGraph nodes. Fixes: pytorch#49299
facebook-github-bot left a comment
@Krovatkin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@Krovatkin merged this pull request in ea087e2.
Summary:
This adds guarding for DifferentiableGraph nodes in order to not depend on
Also bailing out on required gradients for the CUDA fuser.
Fixes pytorch/pytorch#49299
I still need to look into a handful of failing tests, but maybe it can be a discussion basis.

Pull Request resolved: pytorch/pytorch#49433

Reviewed By: ngimel

Differential Revision: D25681374

Pulled By: Krovatkin

fbshipit-source-id: 8e7be53a335c845560436c0cceeb5e154c9cf296