
Allow running with bfloat16 on XLA:GPU autocast #5598

Merged: yeounoh merged 19 commits into master from spmd_amp_gpu on Sep 18, 2023
Conversation

@yeounoh (Contributor) commented Sep 16, 2023

This follows up on #5570, which enabled XLA autocast with the bfloat16 type on XLA:GPU but restricted/cast bfloat16 down to float16 and float32. Supporting bfloat16 on eligible HW platforms should bring performance improvements for XLA:GPU. This is tested with the same set of ops that torch.autocast('cuda') uses for float16 autocasting.
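For context, a minimal usage sketch of what this change is meant to enable; the toy model and data below are hypothetical placeholders, not taken from the PR, and this assumes an XLA:GPU setup where torch.cuda.is_available() is False:

    import torch
    import torch_xla.core.xla_model as xm
    from torch_xla.amp import autocast

    device = xm.xla_device()                      # XLA:GPU device
    model = torch.nn.Linear(8, 8).to(device)     # hypothetical toy model
    data = torch.randn(4, 8, device=device)

    # With bfloat16 supported on the hardware and torch.cuda.is_available() False,
    # autocast should now route through the `xla` backend with bfloat16 instead of
    # being forced down to float16.
    with autocast(device, dtype=torch.bfloat16):
        out = model(data)
    xm.mark_step()                                # materialize the lazy graph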

Comment thread on torch_xla/amp/autocast_mode.py (Outdated)

    # XLA:GPU with bfloat16 should run on `xla` backend
    # unless torch.autocast is compiled with cuda.
    backend = 'xla'
    self._cuda_bfloat16 = True
@baoleai (Contributor) commented Sep 18, 2023

Does PyTorch not support bfloat16 when running on the xla backend?
I'm still confused about what happens when dtype is set to bfloat16:
When torch.cuda.is_available() returns False, torch_xla uses the XLA:GPU backend; however, in this patch it seems to fall back to the GPU backend.
Similarly, when torch.cuda.is_available() returns True, torch_xla uses the GPU backend, but the dtype is still forced to float16 instead of bfloat16.

@yeounoh (Contributor, Author) commented Sep 18, 2023

Hi @baoleai, good catch -- I meant to do what the error message says. We don't need to force torch.float16 on the cuda backend.

Thanks, will update the PR.

@yeounoh (Contributor, Author):

OK, let me go over these -- let me know if I am missing anything.

  1. "When torch.cuda.is_available() returns False, torch_xla uses the XLA:GPU backend. However, in this patch it seems to fall back to use the GPU backend."

When cuda is not available (False), we go into this code path:

    if xr.is_bf16_supported() and not torch.cuda.is_available():
        # XLA:GPU with bfloat16 should run on `xla` backend
        # unless torch.autocast is compiled with cuda.
        backend = 'xla'
        self._cuda_bfloat16 = True

We use the xla backend for autocast.

  2. "Similarly, when torch.cuda.is_available() returns True, torch_xla uses the GPU backend, but the dtype is still forced to float16 instead of bfloat16."

This is the part I need to address: instead of else we want elif {cuda is not available}, to implement the intention laid out in the error message.

@yeounoh (Contributor, Author) commented Sep 18, 2023

So, something like this (commit 50c1d2f1c598c2ec4ada8f7177861cc05056ff36):

    if dtype is None:
        dtype = torch.float16
    elif dtype == torch.bfloat16 and not torch.cuda.is_available():
        if xr.is_bf16_supported():
            # XLA:GPU with bfloat16 should run on `xla` backend
            # unless torch.autocast is compiled with cuda.
            backend = 'xla'
            self._cuda_bfloat16 = True
        else:
            # This has been the default behavior for the unsupported bfloat16 dtype.
            dtype = torch.float16
            error_message = "In XLA:GPU autocast, but bfloat16 is not supported on this HW.\n"
            error_message += ("Using the default cuda autocast dtype float16.")

Comment thread on torch_xla/amp/autocast_mode.py (Outdated)

    self._xla_device = xm.xla_device_hw(device)
    if self._xla_device == 'GPU':
        backend = 'cuda'
        self._cuda_bfloat16 = False
Collaborator:

I am very confused about what this _cuda_bfloat16 actually means. Do you mind adding a comment above line 28 to explain all the possible combinations and the expected behaviors?

@yeounoh (Contributor, Author):

+1, realized that this variable should be called _xla_bfloat16. Let me add a brief comment, too.

@yeounoh (Contributor, Author):

Something like this:

    self._xla_bfloat16 = False  # True if xla backend with bfloat16 dtype.
    if dtype is None:
        dtype = torch.float16
    elif dtype == torch.bfloat16 and not torch.cuda.is_available():
        if xr.is_bf16_supported():
            # XLA:GPU with bfloat16 should run on `xla` backend
            # unless torch.autocast is compiled with cuda.
            backend = 'xla'
            self._xla_bfloat16 = True
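For readers following the thread, here is a hedged restatement of the branches under discussion as a standalone helper. select_autocast_config is a hypothetical name and this is a sketch of the intended logic (including the cuda-available case described later in the thread), not the PR's actual code:

    import torch
    import torch_xla.runtime as xr

    def select_autocast_config(dtype):
        """Sketch: choose (backend, dtype, xla_bfloat16) for XLA:GPU autocast."""
        xla_bfloat16 = False
        backend = 'cuda'              # default backend on XLA:GPU, per the diff above
        if dtype is None:
            dtype = torch.float16     # default cuda autocast dtype
        elif dtype == torch.bfloat16 and not torch.cuda.is_available():
            if xr.is_bf16_supported():
                # bfloat16 without a cuda-enabled torch build: use the xla backend.
                backend = 'xla'
                xla_bfloat16 = True
            else:
                # Hardware cannot do bfloat16: fall back to float16 (with a warning).
                dtype = torch.float16
        # else: torch.cuda.is_available() is True -- plain cuda autocast keeps the
        # requested dtype (bfloat16 is no longer forced down to float16).
        return backend, dtype, xla_bfloat16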

@yeounoh (Contributor, Author):

commit dc9224336af6cf6313e60d197dab47323e8e509d (HEAD -> spmd_amp_gpu, origin/spmd_amp_gpu)
Author: Yeounoh Chung <yeounoh@google.com>
Date:   Mon Sep 18 12:02:15 2023 -0700

    Rename _cuda_bfloat16 to _xla_bfloat16 since it is set when xla backend is used for bfloat16.
 

@baoleai (Contributor):

Another confusing question is why there is special treatment only for the case where torch.cuda.is_available() is False and dtype=bfloat16 -- is it because XLA:GPU itself doesn't support bfloat16? When torch.cuda.is_available() is True there is no special treatment for bfloat16, yet the following is still used:

    torch.set_autocast_xla_enabled(self.prev)
    torch.set_autocast_xla_dtype(self.prev_dtype)

@yeounoh (Contributor, Author):

Hi @baoleai, XLA:GPU supports bfloat16, but we were using the cuda backend for autocast even when torch.cuda.is_available() was false. Instead we want to use the xla backend.

When torch.cuda.is_available() is true, we use the cuda backend for autocast, and since we are entering/exiting the autocast context via torch_xla.amp.autocast, we still need to set:

    torch.set_autocast_xla_enabled(self.prev)
    torch.set_autocast_xla_dtype(self.prev_dtype)

Comment on lines +77 to +80

    if self._xla_bfloat16:
        torch.set_autocast_enabled(self._enabled)
Collaborator:

If _xla_bfloat16, shouldn't we use set_autocast_xla_enabled instead?

@yeounoh (Contributor, Author) commented Sep 18, 2023

It will be set by torch.autocast; we need to set both, since we are wrapping and calling torch.autocast. So xla autocast is enabled whenever one uses torch_xla.amp.autocast, and the generic torch autocast is also enabled because we call torch.autocast with the cuda or xla backend.
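To illustrate the explanation above, a rough sketch of how the enter/exit path could save and restore both flags on the xla-backend bfloat16 path. This is not the PR's exact code, and the is_/get_autocast_xla_* getters are an assumption, chosen to mirror the setters quoted in the thread:

    import torch

    class XlaBf16AutocastSketch:
        """Hedged illustration: manage both the generic and xla autocast state."""

        def __enter__(self):
            # Save previous state so __exit__ can restore it.
            self._prev_enabled = torch.is_autocast_enabled()
            self._prev_xla_enabled = torch.is_autocast_xla_enabled()   # assumed getter
            self._prev_xla_dtype = torch.get_autocast_xla_dtype()      # assumed getter
            # torch_xla.amp.autocast wraps torch.autocast, so on this path both the
            # xla-specific flags and the generic autocast flag are enabled.
            torch.set_autocast_xla_enabled(True)
            torch.set_autocast_xla_dtype(torch.bfloat16)
            torch.set_autocast_enabled(True)
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            # Restore everything that __enter__ changed.
            torch.set_autocast_enabled(self._prev_enabled)
            torch.set_autocast_xla_enabled(self._prev_xla_enabled)
            torch.set_autocast_xla_dtype(self._prev_xla_dtype)
            return False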

Collaborator:

Can we leave a comment here? It is really confusing to see that, if we are using xla_bf16, we need to set the upstream autocast to enabled.

Comment on lines +88 to +93

    if self._xla_bfloat16:
        torch.set_autocast_enabled(self.prev)
Collaborator:

same question

@yeounoh (Contributor, Author):

ditto

@JackCaoG (Collaborator) left a review comment:

LGTM once the comments are added. Chatted with @yeounoh offline; I am going to approve to unblock.

@yeounoh (Contributor, Author) commented Sep 18, 2023

The GPU tests have already passed. Merging after adding the comments.

@yeounoh merged commit b214c10 into master on Sep 18, 2023
