
[SPMD] Expose apply_backward_optimization_barrier #7477

Merged
alanwaketan merged 2 commits into master from alanwaketan/spmd
Jun 26, 2024

Conversation

@alanwaketan
Collaborator

Summary:
This PR exposes apply_backward_optimization_barrier to the spmd namespace.

Test Plan:
N.A.
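
For context, a minimal usage sketch of what this exposes (the tiny layer stack is an illustrative assumption; the call pattern mirrors the downstream test referenced later in this thread):

```python
import torch.nn as nn
import torch_xla.distributed.spmd as xs

# Tiny placeholder module list standing in for a real decoder stack (assumption).
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

# After this change, the helper is importable from the spmd namespace and
# wraps each layer's backward pass in an XLA optimization barrier.
for layer in layers:
    xs.apply_backward_optimization_barrier(layer)
```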

@alanwaketan
Collaborator Author

Thanks, Jack!

@alanwaketan alanwaketan merged commit 6894a08 into master Jun 26, 2024
@alanwaketan alanwaketan deleted the alanwaketan/spmd branch June 26, 2024 01:15
@bhavya01
Collaborator

bhavya01 commented Jul 17, 2024

@alanwaketan Should we also merge this change into the 2.4 release? I saw a test failure with the latest 2.4 wheel:

```
[2024-07-17, 16:47:21 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
  File "/home/ml-auto-solutions/transformers/examples/pytorch/language-modeling/run_clm.py", line 873, in <module>
    main()
  File "/home/ml-auto-solutions/transformers/examples/pytorch/language-modeling/run_clm.py", line 644, in main
[2024-07-17, 16:47:21 UTC] {logging_mixin.py:150} WARNING -     xs.apply_backward_optimization_barrier(model.model.layers[i])
AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'apply_backward_optimization_barrier'
```

@JackCaoG
Collaborator

Which model failed?

@bhavya01
Collaborator

bhavya01 commented Jul 17, 2024

llama2-train-spmd

Still looking into why this passed in the previous test run

@bhavya01
Collaborator

I realized that this is failing because of pytorch-tpu/transformers@ccf5b15

I replaced torch_xla.experimental.xla_sharding with torch_xla.distributed.spmd in the test, and the 2.4 release doesn't expose apply_backward_optimization_barrier through the latter. The test passes locally with the fix.
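
For anyone hitting the same AttributeError, a sketch of the import paths involved (the old path is shown for comparison; the hasattr check is just an illustrative way to confirm the gap on a given wheel):

```python
# Namespace used by the test before pytorch-tpu/transformers@ccf5b15
# (old experimental path, shown for comparison):
#   import torch_xla.experimental.xla_sharding as xs
#
# Namespace the test uses now; on the 2.4 wheel this module does not
# expose apply_backward_optimization_barrier, hence the AttributeError.
import torch_xla.distributed.spmd as xs

print(hasattr(xs, "apply_backward_optimization_barrier"))  # False on the 2.4 wheel
```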

@alanwaketan
Collaborator Author

Sure. Can I still backport things? I might have 2-3 PRs that still need to be backported.

