Add nvfuser support for prims.copy_to#84545

Draft
IvanYashchuk wants to merge 30 commits into pytorch:main from IvanYashchuk:nvfuser-copy-to

Conversation

@IvanYashchuk
Collaborator

@IvanYashchuk IvanYashchuk commented Sep 5, 2022

I use nvFuser's aliasOutputToInput here and since it implicitly adds outputs to the fusion, I need to drop those within Python.

Now we can lower the batch_norm implementation from torch._decomp to nvprims (see test_batch_norm_forward_nvprims).

cc @EikanWang @jgong5 @wenzhe-nrv @sanchitintel @kevinstephano @jjsjann123 @ezyang @mruberry @ngimel @lezcano @fdrocha @peterbell10

@facebook-github-bot
Contributor

facebook-github-bot commented Sep 5, 2022

✅ No Failures (0 Pending)

As of commit 8fe12d5 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Sep 5, 2022
Collaborator

@jjsjann123 jjsjann123 left a comment


I think we need to discuss more on what to expect from copy_to in upstream.

We won't be able to generically support copy_to in its wildest form in an fx graph. IIRC, the last time we discussed this with upstream about their functionalization pass, we agreed that copy_to would be applied sparingly and carefully.

We really only care about the case where we are doing a spot update of the running stats for BN, and I think that's what we should focus on. Maybe not necessarily in this PR, but we definitely need more checks and to be more explicit about when we would take copy_to into an nvfuser graph.


//! Specialized Record Functor for recording the removal of outputs.

template <class OutputType>
Collaborator

Is the template class here necessary? We are only using removeOutputRecord with NvfTensorView in this PR.

Collaborator Author

Right, it's not needed.

Collaborator Author

Removing outputs is not needed in this PR anymore; I removed this code.


def _is_func_unsupported_nvfuser(torch_function_mode, func, args, kwargs):
-    with torch.overrides.enable_torch_function_mode(
+    with torch.no_grad(), torch.overrides.enable_torch_function_mode(
Collaborator

Is no_grad here needed only because we added copy_to and that somehow messed up gradients? Or is this just a patch for an existing bug?

Collaborator Author

It's a patch for an existing bug for which I need to file an issue.

Collaborator Author

The problem was that the ATen function updates the running mean and var in-place without the autograd graph connected, and it always expects "requires_grad" to be False for these arguments; the decomposition was not doing that. Fixed in ca2c176.
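To illustrate the constraint described above, here is a minimal sketch (a hypothetical helper, not the PR's actual decomposition) of how batch norm running stats are updated in-place with autograd disabled, matching ATen's expectation that they have requires_grad=False:

```python
import torch

# Minimal sketch, not the PR's actual decomposition: batch norm running
# stats are updated in-place under no_grad, so the update is not tracked
# by autograd and running_mean/running_var keep requires_grad=False.
def update_running_stats(running_mean, running_var, batch_mean, batch_var,
                         momentum=0.1):
    with torch.no_grad():
        running_mean.copy_((1 - momentum) * running_mean + momentum * batch_mean)
        running_var.copy_((1 - momentum) * running_var + momentum * batch_var)
    return running_mean, running_var
```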

dest = torch::jit::fuser::cuda::set(source);
// aliasOutputToInput implicitly adds the output to the fusion
// adding it in this path simplifies the logic upstream
fd.addOutput(source);
Collaborator

aliasOutputToInput adds dest to the outputs, while in the false branch we are adding source to the outputs. That doesn't sound consistent with the comment above.

NVM, I was dumb... it makes sense, we are adding source to the outputs in both cases.

auto source = fd.getFusionState(args.at(1))->as<NvfTensorView>();

if (dest->isFusionInput()) {
fd.fusionPtr()->aliasOutputToInput(source, dest);
Collaborator

I think this is the place where we actually need to perform a copy:

auto tmp = xxxx::set(source); 
fd.fusionPtr()->aliasOutputToInput(tmp, dest);

I saw something on the Python side doing this, but I'm pretty uncomfortable having this logic separated.

Collaborator Author

Yes, I also thought about this and tried that. Something was wrong; I don't remember exactly what, so I should try again.

This approach is also better because we wouldn't need to conditionally remove tensors from the fusion outputs and add them back later in the correct order, since source is now completely disconnected.

Collaborator Author

auto tmp = torch::jit::fuser::cuda::set(source); is now used. My initial mistake was to use tmp = set(source) only for the dest->isFusionInput() path; it becomes difficult to drop this output in the "else" branch.

# If source is a fusion input, we need to place an operation
# before the copy_to so that the copy is actually performed
if _is_node_in_input(gm, source_node):
source = fd.ops.set(source)
Collaborator

So this is the Python side where we are making the copy? The condition here looks suspicious to me: shouldn't we be checking whether the destination value is an input, instead of the source?

Collaborator Author

"set" is not necessary for the in-place update of destination; any operation on the source tensor is enough, see for example func2 in test_copy_to.

This code path is a workaround for the case when no operations were defined to produce source: for the in-place update to be realized we need to insert some operation, and set seems the most natural choice. The case when there are no producers of source is when source is an input to the fusion/graph.
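The rule described above can be sketched with a toy graph model (ToyFusion and its methods are illustrative stand-ins, not the real nvFuser FusionDefinition API): a copy only materializes if source has a producer op, so a raw fusion input gets an identity-like set op inserted first.

```python
# Toy sketch of the workaround above; these names are hypothetical
# stand-ins, not the real nvFuser FusionDefinition API.
class ToyFusion:
    def __init__(self):
        self.inputs = []   # raw fusion inputs (tensors with no producer op)
        self.ops = []      # recorded ops

    def define_input(self, name):
        self.inputs.append(name)
        return name

    def op_set(self, src):
        # identity-like op: gives `src` a producer so the copy is realized
        out = f"set({src})"
        self.ops.append(out)
        return out

def prepare_copy_source(fd, source):
    # Any producing op would do; `set` is the most natural choice.
    if source in fd.inputs:
        return fd.op_set(source)
    return source
```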

Collaborator Author

I removed this code, there's no special handling on the Python side anymore.

# (it was implicitly added by the copy_to op)
# it will be added back later using correct expected order
if _is_node_in_output(gm, source_node) and not _is_node_in_input(
gm, source_node
Collaborator

Same here, shouldn't the second source_node be destination node instead?

Collaborator Author

When _is_node_in_input(gm, source_node) == True we hit the other path where we create a new temporary source; since it's temporary, there's no way it could be marked as an output in the graph.
Here we only remove from the nvFusion outputs the source that was marked as an output inside aliasOutputToInput and that is then going to be marked as an output here:

out = FusionInterpreter(gm).run(*nv_args)
flat_out, unflatten_spec = tree_flatten(out)
for o in flat_out:
fd.add_output(o)

Collaborator Author

I removed this code, there's no special handling on the Python side anymore.

@pytorch-bot

pytorch-bot bot commented Sep 12, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84545

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit dc529fb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


 return tree_unflatten(
-    fusion.execute(concrete_fusion_inputs),  # type: ignore[has-type]
+    fusion.execute(concrete_fusion_inputs)[drop_output_count:],  # type: ignore[has-type]
Collaborator

Are the implicit extra outputs guaranteed to be at the beginning?

Collaborator Author

Yes, the order of adding outputs is respected and the real outputs we need to return are added last here:

for o in flat_out:
fd.add_output(o)
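In plain Python terms (a stand-in illustration, not the real executor code), this ordering invariant means the implicit outputs occupy the leading positions, so a slice recovers exactly the real outputs:

```python
# Stand-in illustration of the output-ordering invariant: outputs added
# implicitly by aliasOutputToInput come first, the real outputs are
# appended last, so dropping the first `drop_output_count` entries of
# the execution result leaves only the real outputs, in order.
def real_outputs(all_outputs, drop_output_count):
    return all_outputs[drop_output_count:]
```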

@IvanYashchuk
Collaborator Author

There would be a conflict with #84626: the tests implemented here would stop working.

@jjsjann123
Collaborator

There would be a conflict with #84626: the tests implemented here would stop working.

#84626 is merged now. I'll get a patch for this PR.

@jjsjann123
Collaborator

I'm seeing the failing test. Working on a quick patch.

b_sin = b.sin()
a1 = a.copy_(b_sin)
a1_sin = a1.sin()
a2 = a1.copy_(a1_sin)
Collaborator

This looks scary.

Currently codegen has no handling for this kind of race condition. Having a single buffer used for both reads and writes in different operations is not something we support. We should remove this test.

Collaborator

Removing the test is easy, but what happens here? Does codegen generate a kernel for this fusion, or does it fall back?

Collaborator Author

Yes, the codegen generates kernels for this fusion. There are no fallbacks at the nvFuser level; it's either more segments or an error. Since we're using executor="strictly_nvfuser" there's no fallback at the Python level either.

Codegen generates two segments in this case:
Segmented_Fusion Dump: -- fusion segments:
Segmented_Fusion{
groups:
g{0, 1}

g{2, 3}

edges:

group details:
g{(pointwise)
inputs:
T0_g[ iS0{i0}, iS1{i1} ] float
T1_g[ iS2{i3}, iS3{i4} ] float
outputs:
T2_g[ iS4{i3}, iS5{i4} ] float
T3_g[ iS6{i3}, iS7{i4} ] float


T2_g[ iS4{i3}, iS5{i4} ]
   = sinf(T1_g[ iS2{i3}, iS3{i4} ]);
T3_g[ iS6{i3}, iS7{i4} ]
   = T2_g[ iS4{i3}, iS5{i4} ];
}

g{(pointwise)
inputs:
T0_g[ iS0{i0}, iS1{i1} ] float
outputs:
T4_g[ iS8{i0}, iS9{i1} ] float
T5_g[ iS10{i0}, iS11{i1} ] float


T4_g[ iS8{i0}, iS9{i1} ]
   = sinf(T0_g[ iS0{i0}, iS1{i1} ]);
T5_g[ iS10{i0}, iS11{i1} ]
   = T4_g[ iS8{i0}, iS9{i1} ];
}

} //Segmented_Fusion
This corresponds to two kernels:
======= Codegen output for kernel: kernel1 =======

__global__ void kernel1(Tensor<float, 2> T1, Tensor<float, 2> T0, Tensor<float, 2> T7, Tensor<float, 2> T3) {
  int i71;
  i71 = (((nvfuser_index_t)blockIdx.x) * 128) + ((nvfuser_index_t)threadIdx.x);
  if ((i71 < (T1.size[0] * T1.size[1]))) {
    float T6[1];
    T6[0] = 0;
    T6[0]
       = T1[i71];
    float T2[1];
    T2[0]
       = sinf(T6[0]);
    float T8[1];
    T8[0]
       = T2[0];
    T7[i71]
       = T8[0];
    float T9[1];
    T9[0]
       = T2[0];
    T3[i71]
       = T9[0];
  }
}

======================================


======= Codegen output for kernel: kernel2 =======

__global__ void kernel2(Tensor<float, 2> T0, Tensor<float, 2> T7, Tensor<float, 2> T5) {
  int i69;
  i69 = (((nvfuser_index_t)blockIdx.x) * 128) + ((nvfuser_index_t)threadIdx.x);
  if ((i69 < (T0.size[0] * T0.size[1]))) {
    float T6[1];
    T6[0] = 0;
    T6[0]
       = T0[i69];
    float T4[1];
    T4[0]
       = sinf(T6[0]);
    float T8[1];
    T8[0]
       = T4[0];
    T7[i69]
       = T8[0];
    float T9[1];
    T9[0]
       = T4[0];
    T5[i69]
       = T9[0];
  }
}

======================================

self.assertEqual(out[0], a)

a = torch.empty(3, 3, device='cuda')
self.assertEqual(out[0], func(a, b)[0])
Collaborator

I also feel the test is too relaxed and sends a misleading message about what codegen aliasing supports.

We should be explicit that alias support right now is VERY limited. In the test example here, we should check that the fusion maintains consistent behavior: we should check matching results on all outputs, as well as identical aliases among outputs and inputs.

Collaborator

Didn't nvfuser give up on aliasing inputs and outputs? (We had this discussion in transpose PR)

Collaborator Author

Changed:

-            self.assertEqual(out[0], func(a, b)[0])
+            self.assertEqual(out, func(a, b))

And added a comparison of storage (it should be the same).
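A minimal sketch of such a storage comparison (the assumed shape of the check, not the test's exact code): after an in-place copy_, the returned tensor should alias the destination's memory.

```python
import torch

# Sketch of the storage check mentioned above (assumed form, not the
# test's exact code): after an in-place copy_, the result should alias
# the destination tensor's underlying allocation.
def shares_storage(a, b):
    # data_ptr() is the start address of the tensor's data in memory
    return a.data_ptr() == b.data_ptr()
```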

Collaborator

Didn't nvfuser give up on aliasing inputs and outputs? (We had this discussion in transpose PR)

Yeah, we sorta did.
Our assumption is that the functionalization pass will resolve this for us and we do not need to "return" outputs with the correct alias.
Since the test here doesn't use the functionalization pass, I'm trying to keep the tests cleaner.

for (const auto out_i : c10::irange(kernel->outputs().size())) {
// TODO: FIX this short-cut where we trivially forward inputs to outputs
if (kernel->outputs()[out_i]->isFusionInput()) {
TORCH_INTERNAL_ASSERT(false, "trivial input forwarding NOT IMPLEMENTED");
Collaborator Author

Changes added via IvanYashchuk#3

pytorchmergebot pushed a commit that referenced this pull request Oct 3, 2022
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (#81191) and no in-place copy support (#84545).

Pull Request resolved: #85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
@facebook-github-bot
Contributor

/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details.

This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.

@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 4, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (pytorch/pytorch#81191) and no in-place copy support (pytorch/pytorch#84545).

Pull Request resolved: pytorch/pytorch#85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
jjsjann123 pushed a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (pytorch/pytorch#81191) and no in-place copy support (pytorch/pytorch#84545).

Pull Request resolved: pytorch/pytorch#85562
Approved by: https://github.com/kevinstephano, https://github.com/ngimel
@github-actions
Contributor

github-actions bot commented Dec 3, 2022

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Dec 3, 2022
@github-actions github-actions bot closed this Jan 2, 2023
@IvanYashchuk IvanYashchuk reopened this Jan 10, 2023
@IvanYashchuk IvanYashchuk marked this pull request as draft January 10, 2023 07:51

Labels

cla signed module: nvfuser module: primTorch no-stale oncall: jit Add this issue/PR to JIT oncall triage queue open source release notes: jit release notes category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

6 participants