Update numerical verification for SPMD Linear checkpointing #9113
sdasgup3 wants to merge 1 commit into pytorch:master
Conversation
I see that both results come from runs on TPU, one with gradient checkpointing and one without. The results should be exactly the same. Generally, some tolerance is expected when running on different hardware, but in this case I expected them to be exactly identical. I think we should take a closer look at this problem. We might find something is wrong with the way we checkpoint.
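For reference, a minimal sketch of the kind of comparison being described, using plain `torch.utils.checkpoint` rather than the actual SPMD test: train a tiny linear model twice on one device, once with gradient checkpointing and once without, and check whether the gradients match bit-for-bit. The model, shapes, and seed below are illustrative, not the test's.

```python
import torch
import torch.utils.checkpoint as cp

torch.manual_seed(0)
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

def run(use_checkpoint: bool) -> torch.Tensor:
    model.zero_grad()
    if use_checkpoint:
        # Recompute the forward pass during backward instead of storing activations.
        out = cp.checkpoint(model, x, use_reentrant=False)
    else:
        out = model(x)
    out.sum().backward()
    return model.weight.grad.clone()

grad_plain = run(use_checkpoint=False)
grad_ckpt = run(use_checkpoint=True)

# On the same hardware one might expect these to match bit-for-bit:
print(torch.equal(grad_plain, grad_ckpt))
print((grad_plain - grad_ckpt).abs().max())
```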
This has been a long-lived bug, and Bhavya initially expressed concern about the deviation between grad checkpointing and not. More recently, though, he is okay with merging this fix and opening a new issue to further investigate the numerical equivalence of checkpointing vs. not. I have set this PR to automerge once the tests pass. Thanks, guys.
As I look at this bug a bit further, I think it's actually okay to allow some tolerance and not expect exact equivalence between grad checkpointing and not. With grad checkpointing vs. not, you could imagine XLA reordering some operations, leading to small rounding differences in the final bits of the weights. The best place to measure tolerance would be on the weights and activations themselves; by the time you get to the loss, you can imagine a decent amount of movement.
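A toy illustration of that amplification argument (not code from this PR): a tiny relative perturbation per element stays tiny element-wise, but grows once a reduction such as a loss accumulates it over many elements.

```python
import torch

torch.manual_seed(0)
w = torch.rand(1_000_000, dtype=torch.float64)
w_perturbed = w * (1 + 1e-7)  # a tiny relative perturbation per element

print((w - w_perturbed).abs().max().item())        # per-element: about 1e-7
print((w.sum() - w_perturbed.sum()).abs().item())  # after reduction: about 5e-2
```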
This PR is stuck in CI/CD and the creator is no longer working on it. Closing; replaced by #9404.
This PR tracks an issue where an internal TPU CI is failing on v5p hardware. A specific test fails with assertion failures at test_train_spmd_linear_model.py#L49 and test_train_spmd_linear_model.py#L51, with maximum absolute differences of 0.0042718649 and 0.0000191778, respectively. The fix here is to update the corresponding atols.
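The diff itself is not shown in this thread, so the following is a hedged sketch only: the change presumably loosens the atol in the two failing assertions to sit just above the observed maxima. The helper, variable names, and exact atol values below are assumptions for illustration, not the test's actual code.

```python
import torch

def assert_allclose(baseline: torch.Tensor, checkpointed: torch.Tensor, atol: float):
    # Report the observed max absolute difference, mirroring the CI failure message.
    max_diff = (baseline - checkpointed).abs().max().item()
    assert torch.allclose(baseline, checkpointed, atol=atol), (
        f"max absolute difference {max_diff} exceeds atol={atol}")

# Hypothetical updated tolerances, chosen just above the reported differences:
# assert_allclose(loss_no_ckpt, loss_ckpt, atol=5e-3)    # covers 0.0042718649
# assert_allclose(grads_no_ckpt, grads_ckpt, atol=5e-5)  # covers 0.0000191778
```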