
Support TPU v4 with new PyTorch/XLA TPU runtime #1393

Merged

sgugger merged 8 commits into huggingface:main from will-cromar:wcromar/pjrt-v4 on May 8, 2023

Conversation

@will-cromar
Contributor

I've been working on migrating PyTorch/XLA from our legacy XRT runtime to PJRT. We have detailed documentation on the differences and changes here: https://github.com/pytorch/xla/blob/master/docs/pjrt.md
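
For readers new to the runtime switch, selecting PJRT versus the legacy XRT runtime happens through environment variables before torch_xla initializes. A minimal sketch (not part of this PR's diff; it assumes the PJRT_DEVICE variable described in the pjrt.md doc linked above):

```python
# Minimal sketch of runtime selection, assuming the PJRT_DEVICE variable
# documented in pjrt.md. The legacy XRT runtime was configured through
# XRT_TPU_CONFIG instead. Not part of this PR's diff.
import os

os.environ["PJRT_DEVICE"] = "TPU"  # use the new PJRT runtime on a TPU VM

import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # e.g. xla:0
```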

This PR collects all of the fixes that I've made so far to support XRT and PJRT interchangeably through Accelerate. In order of significance:

  1. Update the implementation of synchronize_rng_types to use xm.collective_broadcast to broadcast the RNG state tensor. Collective operations should in general be called from the main thread of each process to avoid unpredictable behavior in XLA, but our MpDeviceLoader calls accelerate.DataLoaderShard's __iter__ (which synchronizes the RNG) from all of the preloading threads. To ensure the RNG is synchronized exactly once from the main thread, synchronize it in MpDeviceLoaderWrapper's __iter__ instead.
  2. In general, you should call xm.mark_step to finish any remaining steps before checkpointing. MpDeviceLoader is responsible for calling xm.mark_step at the beginning of each new step and at the end of the dataset iterator, so if you checkpoint in the middle of iteration, replica 0 won't reach the mark_step at the beginning of the next iteration before it tries to checkpoint. To avoid making the user call mark_step themselves, call it for them on replica 0 before checkpointing in save_state (see the sketch after this list).
  3. Use XLA's all_gather in gather instead of mesh_reduce + torch.cat (also shown in the sketch after this list). With PJRT, rendezvous and mesh_reduce both use XLA collective ops to broadcast pickled data (docs) and move it back to the CPU. Batching all of the recursive all_gather calls together and calling xm.mark_step once reduces the number of transfers between host and device.
  4. Don't override XLA_USE_BF16 when mixed_precision is not set. This isn't really mixed precision so much as a mechanism to silently convert all torch.float tensors to BF16 on the TPU. Real mixed-precision support in PT/XLA is still a WIP, and we can circle back here when it's stable.
  5. Remove obsolete/unused _mp_fn in test script.
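
To make items 2 and 3 concrete, here is a minimal sketch of the two patterns. It is not the PR's actual diff; gather_for_tpu and save_state_tpu are simplified stand-ins for the corresponding Accelerate helpers:

```python
import torch
import torch_xla.core.xla_model as xm


def gather_for_tpu(tensor: torch.Tensor) -> torch.Tensor:
    # Item 3: a single on-device collective instead of mesh_reduce + torch.cat.
    # xm.all_gather concatenates the tensor from every replica along dim 0
    # without pickling data and round-tripping it through the host.
    return xm.all_gather(tensor, dim=0)


def save_state_tpu(state: dict, path: str) -> None:
    # Item 2: flush pending lazy XLA operations so replica 0 is not asked to
    # serialize tensors whose values have not been materialized yet.
    xm.mark_step()
    if xm.get_ordinal() == 0:  # replica 0 writes the checkpoint
        cpu_state = {k: v.cpu() if torch.is_tensor(v) else v for k, v in state.items()}
        torch.save(cpu_state, path)
```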

Tested:

Accelerate will not work on TPU v2 and v3 with this PR, because both of them use multithreading due to TPU design constraints. I'll follow up with the remaining fixes for TPU v2 and v3 in #1385.

cc @sgugger @JackCaoG

@will-cromar marked this pull request as ready for review on May 5, 2023 at 21:15
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 6, 2023

The documentation is not available anymore as the PR was closed or merged.

Collaborator

@sgugger left a comment


Thanks for adding support for TPU v4! All the changes LGTM! Can you just run make style on your branch so that the quality check passes?

@will-cromar requested a review from sgugger on May 8, 2023 at 17:05
@sgugger merged commit 145fca5 into huggingface:main on May 8, 2023
@sgugger
Collaborator

sgugger commented May 8, 2023

Thanks again!

@JackCaoG

JackCaoG commented May 8, 2023

Thanks both, this is super exciting!

