Reenable the distributed checkpointing test #8424
Merged
Conversation
tengyifei approved these changes on Dec 2, 2024
rpsilva-aws pushed a commit to rpsilva-aws/xla that referenced this pull request on Dec 6, 2024
This is a follow-up to #8386.
In the previous PR I found that, somehow during fallback, PyTorch will try to update an existing XLATensor with a CPU tensor of a different shape. In that case we need to remove the sharding spec, otherwise there will be a shape mismatch. However, I found that in the distributed checkpoint path we swap the existing XLATensor with the CPU tensor, and there it seems like we want to keep the sharding spec.
@jonb377 one concern I have is that the test only covers the single-host case. I feel like in an actual multi-host case the CPU tensor would have a different (sharded) shape than the sharding spec? I am not sure if we have such a test somewhere. Even if we clear the sharding spec, after a torch_xla.sync() the tensor will be moved to the device, but most likely replicated. I am a bit worried that I am breaking distributed checkpointing here.
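For reference, a hedged way to observe what the tensor ends up with after the sync, continuing the sketch above; `_get_xla_sharding_spec` is an internal helper, so treat this as a debugging assumption rather than part of the test:

```python
import torch_xla
import torch_xla.core.xla_model as xm

xm.mark_step()  # torch_xla.sync() on newer releases
# An empty string means no explicit sharding is attached; '{replicated}'
# would indicate the tensor was replicated after the spec was cleared.
print(torch_xla._XLAC._get_xla_sharding_spec(t))
```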