[Dynamo] Refine CPU fallback for TD+XLA #4935
seanlatias wants to merge 14 commits into pytorch:master from …
Conversation
…have XLAData or IR

```cpp
XLA_CHECK(false) << "_check_tensor_need_materialization "
                    "currently does not handle XLATensor without XLAData and IR";
need_materialization.push_back(true);
```
I think the right thing to do is to check `xtensor->CurrentTensorData()`; if it is not nullptr, then
`need_materialization.push_back(false);`
since we don't need to execute a computation for an XLATensor that has tensor_data (tensor data is a CPU at::Tensor whose value we already know). We should leave the last else branch to throw an error, to catch any future unhandled case.
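A minimal sketch of the suggested branch order, assuming the loop variable `xtensor` and the `need_materialization` vector from the diff above (not the actual implementation):

```cpp
if (xtensor->CurrentTensorData() != nullptr) {
  // The tensor already holds CPU data (an at::Tensor whose value we know),
  // so no computation needs to be executed to materialize it.
  need_materialization.push_back(false);
} else {
  // Keep the last branch as an error to catch any future unhandled case.
  XLA_CHECK(false) << "_check_tensor_need_materialization "
                      "currently does not handle XLATensor without XLAData and IR";
}
```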
```python
@dynamo.optimize("torchxla_trace_once")
def fn_fallback(M, mat1, mat2):
  A = torch.cummin(M, 1)
```
Is cummin supposed to fall back?
Yes, I have checked that cummin is not supported. One problem with this kind of testing is that once we support the currently unsupported ops, we might need to update the test. I'm thinking maybe I can create a custom op that will never be supported, to avoid this problem.
For the scope of this PR, I think you can just leave a TODO comment. We can track that kind of specialized unit test in a separate issue, if that sounds okay.
```python
mat1 = torch.randn(5, 10, device=xm.xla_device())
mat2 = torch.randn(5, 10, device=xm.xla_device())

res = fn_fallback(M, mat1, mat2)
```
I think we should:

- Check that `res` matches the eager result.
- Check the counter (try `met.short_metrics_report()` and you should see the fallback counter? Though maybe the fallback happens in PyTorch and then we don't even see it). I actually don't know what to expect, but if there is a fallback I assume `ExecuteTime` will be 2 instead of 1. You also need to run `fn_fallback` once to let it compile, then clear the counters and rerun to see `ExecuteTime` be 2 instead of 1.

For more counter-related stuff, check https://github.com/pytorch/xla/blob/master/torch_xla/debug/metrics.py
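A sketch of the suggested counter check, assuming the `fn_fallback` test above (the expected count of 2 is the reviewer's hypothesis, not a confirmed value):

```python
import torch_xla.debug.metrics as met

# First call: triggers tracing and compilation.
fn_fallback(M, mat1, mat2)

# Clear counters/metrics, then rerun; if a fallback splits the graph,
# we would expect two XLA executions instead of one.
met.clear_all()
res = fn_fallback(M, mat1, mat2)
print(met.short_metrics_report())
execute_count = met.metric_data('ExecuteTime')[0]  # (count, total, samples)
assert execute_count == 2
```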
If we don't see the mentioned PyTorch counter, we can also create/use our own metric when we collect the fallback nodes -- it could be useful for debugging/testing.
```python
mat1 = torch.randn(2, 3, device=xm.xla_device())
mat2 = torch.randn(3, 3, device=xm.xla_device())

res = fn_fallback(M, mat1, mat2, 0.5)
```
wonjoo-wj left a comment
Thanks, @seanlatias! It seems like some existing dynamo tests are failing, mostly due to metrics. The errors should be reproducible locally:

```
python test/dynamo/test_dynamo.py DynamoInferenceBasicTest.test_resnet18
```
```python
new_node = partitioned_graph.graph.call_function(
    extract_internal(fused_module), node.args, None)
node.replace_all_uses_with(new_node)
partitioned_graph.graph.erase_node(node)
```
Wondering why we need this `partitioned_graph.graph.erase_node(node)`? Can we do `graph.eliminate_dead_code()` like https://github.com/pytorch/pytorch/blob/0d66db1b2a9470a50d930308dbffda017500b80b/torch/_prims/nvfuser_executor.py#L465, or is this something different?
Here I'm explicitly removing the old node (the call to the submodule), which I replace with a new node (the call to the optimized function). `eliminate_dead_code()` goes through the entire graph and checks whether each node is still used. Ideally my old node is no longer used, so the two approaches should lead to the same result. However, when I tried `eliminate_dead_code()`, it ran into some errors when trying to eliminate the old node.
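For context, a self-contained torch.fx example of the replace-then-erase pattern being discussed (a toy module, not the dynamo bridge code):

```python
import torch
from torch import fx

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

gm = fx.symbolic_trace(Toy())
for node in list(gm.graph.nodes):
    if node.op == 'call_function' and node.target is torch.relu:
        with gm.graph.inserting_after(node):
            new_node = gm.graph.call_function(torch.sigmoid, node.args)
        node.replace_all_uses_with(new_node)
        # erase_node() requires the node to have no remaining users;
        # eliminate_dead_code() would instead sweep the whole graph for
        # unused nodes, which should be equivalent when the replacement
        # left the old node truly unused.
        gm.graph.erase_node(node)
gm.recompile()
print(gm.code)
```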
Do you know why sometimes …

Ok, because I see that in the current dynamo bridge implementation there are two places that increase the compile counter (xla/torch_xla/core/dynamo_bridge.py, lines 207 and 279 at 8db3f89), while the test expects the compile count to be just one (xla/test/dynamo/test_dynamo.py, line 82 at 8db3f89).
The metric comparison failure seems like it has to do with how the […]. Let me also try to build this branch locally to confirm.
Thanks Wonjoo. But the …

On second thought, I think it makes sense for my approach to introduce extra …
For the optimizer test, the number is way off because of an issue with the partitioner. I have created a PR (pytorch/pytorch#100195) to solve it.
Thanks for the investigation, Sean. Nice, that makes a lot of sense to me. Let's update the […]. Also, if you could add this finding to comments in our dynamo tests explaining this change in compile/execute time, that'd be great.
Thanks for opening the PR, I've just triggered the CI run on it. LGTM, but let's see if the reviewer has any feedback. Also, to make this PR build/test based off that PyTorch PR, we can add a PyTorch pin to this PR. If you add a file …
Thanks Wonjoo. Do you know how I can request a review in PyTorch, or whom I should tag?

@seanlatias I pinged Sherlock offline, we can help you find a reviewer for that PR.
Had a quick sync with @seanlatias offline. Oddly, it seems like some fallback works on the master branch if we just remove the assertion in our dynamo bridge at https://github.com/pytorch/xla/blob/master/torch_xla/core/dynamo_bridge.py#L226-L230. Using an example similar to the one in our original RFCs (#4742 and pytorch/pytorch#93601), the code below runs without any errors on master: […] And with the metrics report, I can see that the […]. Previously, this would fail with an error like […]. cc @JackCaoG, am I missing something here? It appears the fallback that wasn't working before is now working on master. Thanks @seanlatias for the investigation, let me know if I'm missing any details here.

Just to provide more details: I have also tried using the same fix (i.e., simply removing the check and the assertion) on the Dynamo test suite. So far, 3 HuggingFace models and 4 timm models that used to fail because of CPU fallback now work correctly (by using […]). Also, in the original issue (pytorch/pytorch#93601), the root cause was actually the assertion failure from invoking the CPU fallback.
||
| cpu_res = fn_fallback(M, mat1, mat2) | ||
| xla_res = dynamo_fn(xla_M, xla_mat1, xla_mat2) | ||
|
|
Can you use something similar to https://github.com/pytorch/xla/blob/master/torch_xla/core/dynamo_bridge.py#L57 to check that there are no fallback ops in this case?
If this works, the fallback should happen on the PyTorch CPU side and not trigger a fallback counter on the pytorch/xla end.
We should also check that `xla::cummin` is not in the counters, in case we lower this op in the future.
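A sketch of what those assertions could look like inside the test method above (the `aten::`/`xla::` counter-name prefixes follow torch_xla's convention for fallback and lowered ops):

```python
import torch_xla.debug.metrics as met

met.clear_all()
xla_res = dynamo_fn(xla_M, xla_mat1, xla_mat2)
# If the fallback happened on the PyTorch CPU side, pytorch/xla should not
# have recorded any aten::* fallback counters for this graph.
self.assertFalse(any(name.startswith('aten::') for name in met.counter_names()))
# Guard against a future lowering silently changing this test's meaning.
self.assertNotIn('xla::cummin', met.counter_names())
```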
```python
cpu_res = fn_fallback(M, mat1, mat2, 0.5)
xla_res = dynamo_fn(M, mat1, mat2, 0.5)

self.assertTrue(torch.allclose(cpu_res, xla_res.cpu()))
```
Ditto, we should add the counter check. Did we also handle the invalid operand shape case in this PR? I thought we only handled ops that we don't lower today.
This PR now actually handles the invalid shape/operand case because of how the updated FallBackNodeCollector works. The FallBackNodeCollector now executes each node of the input module, so it can figure out which ops fall back (regardless of whether an op falls back because it is simply not implemented in XLA or because specific shapes/operands are not supported).
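A minimal sketch of that idea, using `torch.fx.Interpreter` plus the fallback counters (illustrative only; the real FallBackNodeCollector in `dynamo_bridge.py` may differ in details):

```python
import torch
from torch import fx
import torch_xla.debug.metrics as met

class FallBackNodeCollector(fx.Interpreter):
    """Executes each node and records those that hit a CPU fallback."""

    def __init__(self, module):
        super().__init__(module)
        self.fallback_ops = []

    def run_node(self, n: fx.Node):
        met.clear_all()
        result = super().run_node(n)
        # torch_xla records CPU fallbacks under counters named "aten::<op>",
        # whether the op is unlowered or just unsupported for these
        # particular shapes/operands.
        if any(name.startswith('aten::') for name in met.counter_names()):
            self.fallback_ops.append(n)
        return result
```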
```python
cpu_res = fn_fallback(M, mat1, mat2, 0.5)
xla_res = dynamo_fn(M, mat1, mat2, 0.5)

self.assertTrue(torch.allclose(cpu_res, xla_res.cpu()))
```
For this one we should also check the metrics for `ExecuteTime` (you need to clear the metrics first). We are expecting to see 2 executions, since the fallback happens in the middle.
```diff
   none_remover.add_nones(result)
-  return result
+  if len(result) == 1:
+    return result[0]
```
@seanlatias, I was hoping to push directly to the PR/branch, but it seems like I can't since it's a local forked branch. Are you able to push directly to #5000? I just rebased this PR with the latest master and fixed some conflicts.
@wonjoolee95 yeah, I can do that. But before doing so, I'm wondering whether this approach is still valid, because we probably only need to remove the fallback checking and things will just work.
@wonjoolee95 maybe let me do this: we can create an env var that determines whether we should use explicit CPU fallback (my method) or go through the original flow with the checking removed. The default will be the original flow with the checking removed. How does that sound to you? If you agree, I can add the logic and push it to your branch. A rough sketch of what I mean is below.
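A rough sketch of the proposed switch (the variable name `XLA_EXPLICIT_CPU_FALLBACK` and the two helper functions are hypothetical, not an agreed API):

```python
import os

# Hypothetical flag: "1" selects the explicit CPU-fallback partitioning path;
# anything else keeps the original flow with the fallback check removed.
USE_EXPLICIT_FALLBACK = os.environ.get('XLA_EXPLICIT_CPU_FALLBACK', '0') == '1'

def compile_fx_module(gm, example_inputs):
    if USE_EXPLICIT_FALLBACK:
        return partition_and_compile(gm, example_inputs)  # hypothetical helper
    return compile_whole_graph(gm, example_inputs)        # hypothetical helper
```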
@seanlatias, I had a quick sync with Jack last week regarding the existing behavior and how the fallback works if we just remove the assertion. Based on our discussion and understanding, this shouldn't be the case -- it shouldn't work, as dynamo expects a single hash representing a graph that is entirely executable by XLA, and if this graph includes an unsupported op, it cannot be entirely executable by XLA. However, as shown in #4935 (comment), I can still reproduce the behavior of unsupported ops working on the master branch if the assertion is simply removed. Let me do a quick experiment and post my findings on why this might be happening. So let's wait on implementing the flag you mentioned. Also, let's move the discussion to the new PR #5000.
@wonjoolee95 Sounds good. Will do.
Closing this and moving to #5000.
In this PR, we refined the CPU fallback mechanism originally implemented by @wonjoolee95 and made the following changes:

1. Fix the checking for XLA support. We use `torch.fx.Interpreter` to execute each node of the input module and use `torch_xla.debug.metrics` to check whether that node goes through CPU fallback. If it does, the node is not supported by XLA.
2. Fix the fallback mechanism (see the sketch after this list):
   2.1. Partition the graph according to the results from (1), which produces a new graph whose subgraphs contain only ops supported by XLA.
   2.2. Compile each subgraph into a compiled function call.
   2.3. Replace each subgraph with its compiled function call.
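A condensed sketch of step 2 using PyTorch's `CapabilityBasedPartitioner` (illustrative; the actual code in `torch_xla/core/dynamo_bridge.py` may differ), with the fallback ops coming from a collector like the one sketched earlier in this thread:

```python
import torch
from torch import fx
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import OperatorSupport

class XlaOperatorSupport(OperatorSupport):
    """Marks a node supported iff it was not observed to fall back."""

    def __init__(self, fallback_ops):
        super().__init__()
        self.fallback_ops = set(fallback_ops)

    def is_node_supported(self, submodules, node: fx.Node) -> bool:
        return node not in self.fallback_ops

def partition_for_xla(gm: fx.GraphModule, fallback_ops):
    # 2.1: group XLA-supported ops into fused submodules.
    partitioner = CapabilityBasedPartitioner(
        gm, XlaOperatorSupport(fallback_ops), allows_single_node_partition=True)
    partitions = partitioner.propose_partitions()
    partitioned = partitioner.fuse_partitions(partitions)
    # 2.2/2.3: each fused_* submodule would then be compiled, and its
    # call_module node swapped for a call to the compiled function,
    # as in the erase_node snippet discussed above.
    return partitioned
```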
With the above changes, we can even cover cases where a combination of operator and operand is not supported.
cc: @JackCaoG @alanwaketan