[AMP] Support XLA:TPU #96370
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/96370
Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 366470b.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
TORCH_LIBRARY_IMPL(_, AutocastXLA, m) {
  m.fallback(torch::CppFunction::makeFallthrough());
}
```
@cowanmeg all of these dispatcher registrations probably don't have to live in core - can we move them into the pytorch/xla repo?
Moved into pytorch/xla. Note: I moved the CastPolicy enum into autocast_mode.h so it could be included.
```cpp
// Naughtily, AutocastCUDA is also being used for XLA. In the terminal state,
// it probably should get its own Autocast key
AutocastXLA,
// AutocastXLA is only being used for TPUs. XLA GPUs continue to use AutocastCUDA.
```
@cowanmeg can you describe how this works a bit more - what's the UX here? Is the user expected to use `torch.cuda.autocast()` when using XLA with GPUs, and `torch.xla.autocast()` when using TPUs?
Correct. Updated the summary for clarity.
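For reference, a minimal sketch of the two entry points (assumes an XLA-capable environment with torch_xla installed; `xm.xla_device()` is the standard torch_xla device helper, and the Linear model here is just a placeholder):

```python
import torch
import torch_xla.core.xla_model as xm  # standard torch_xla device helper

device = xm.xla_device()  # TPU or GPU, depending on the environment
model = torch.nn.Linear(4, 4).to(device)
inp = torch.randn(2, 4).to(device)

# XLA:GPU — keep using the CUDA autocast context (casts eligible ops to float16):
with torch.cuda.amp.autocast():
    out = model(inp)

# XLA:TPU — use the new 'xla' device type (casts eligible ops to bfloat16):
with torch.amp.autocast('xla'):
    out = model(inp)
```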
```cpp
struct AutocastContext {
  bool gpu_enabled = false;
  bool cpu_enabled = false;
  bool xla_enabled = false;
```
cc @davidberard98 - do you mind reviewing the JIT changes in this file? I'm not too familiar with them.
looks good other than the two other comments (static runtime & bc-breaking)
Decided to take the JIT changes out since they're not used often in pytorch/xla.
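For context, a minimal sketch of the TorchScript autocast path those changes touched (hedged: `torch._C._jit_set_autocast_mode` is the existing private toggle for the JIT autocast pass; the dropped changes would have threaded an `xla_enabled` flag through this context):

```python
import torch

# Enable the TorchScript autocast pass (existing private toggle).
torch._C._jit_set_autocast_mode(True)

@torch.jit.script
def scripted_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # The JIT pass rewrites this region using the AutocastContext flags
    # (gpu_enabled/cpu_enabled today; the dropped changes added xla_enabled).
    with torch.cuda.amp.autocast(enabled=True):
        return torch.mm(a, b)
```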
```cpp
const auto cpu_enabled = p_node->Input(2).toBool();
const auto cuda_dtype = p_node->Input(3).toScalarType();
const auto cpu_dtype = p_node->Input(4).toScalarType();
const auto xla_enabled = p_node->Input(3).toBool();
```
@tenpercent can you take a look at this? should we just leave static runtime out?
```diff
   variants: function

-- func: _autocast_to_reduced_precision(Tensor(a) self, bool cuda_enabled, bool cpu_enabled, ScalarType cuda_dtype, ScalarType cpu_dtype) -> Tensor(a)
+- func: _autocast_to_reduced_precision(Tensor(a) self, bool cuda_enabled, bool cpu_enabled, bool xla_enabled, ScalarType cuda_dtype, ScalarType cpu_dtype, ScalarType xla_dtype) -> Tensor(a)
```
I think this is BC-breaking? Not very familiar with how to do this, but I think we'd need to add upgraders for JIT, right?
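For reference, a sketch of the pre-change 5-argument op as I read it (the BC concern: serialized TorchScript models bake in this schema, so widening it to 7 arguments would need a JIT upgrader to rewrite old call sites):

```python
import torch

x = torch.randn(4, 4)  # float32 on CPU

# Old 5-arg schema: (cuda_enabled, cpu_enabled, cuda_dtype, cpu_dtype).
# With cpu_enabled=True, a float32 CPU tensor is downcast to cpu_dtype.
y = x._autocast_to_reduced_precision(False, True, torch.float16, torch.bfloat16)
print(y.dtype)  # torch.bfloat16
```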
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase and merge by leaving the following comment on this PR:

Details for Dev Infra team: raised by workflow job.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
I think this failing inductor test is unrelated:
@cowanmeg hmm, I don't think I see that failure in CI on the main branch (https://hud.pytorch.org/), and it's a bit hard to tell immediately whether it's flaky/unrelated, since that test exercises E2E logic and appears to be running with autocast enabled. Can you try rebasing?
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 3 mandatory check(s) failed. Dig deeper by viewing the failures on hud.
I think the BC lint failure is a cancellation?
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
With pytorch/xla#5148, pytorch/xla#4740.

With these changes:

- XLA:GPU users should use `torch.cuda.amp.autocast()` for AMP with float16
- XLA:TPU users should use `torch.amp.autocast('xla')` for AMP with bfloat16

cc @mcarilli @ptrblck @leslie-fang-intel @jgong5