[DCP] Always flatten mapping even if no tensors present by fegin · Pull Request #125335 · pytorch/pytorch

fegin · 2024-05-01T21:11:34Z

Stack from ghstack (oldest at bottom):

Summary:
Right now DCP only flattens a mapping (e.g., a dict) if that mapping contains tensor objects. This behavior is odd because users may save different non-tensor objects on different ranks. Without flattening the mappings, we may lose these non-tensor objects. One use case is the dataloader state_dict.

We may also want to do the same for lists and tuples, but that would cause extra pickling, so we don't do it for now.
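To make the difference concrete, here is a minimal sketch of the two flattening rules described in the summary. It is not the DCP implementation: `FakeTensor`, `contains_tensor`, and `flatten` are hypothetical names used only for illustration, and the sketch handles mappings only (lists and tuples are left alone, matching the note above).

```python
from typing import Any, Dict, Mapping


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""


def contains_tensor(value: Any) -> bool:
    """Return True if value is a FakeTensor or a mapping holding one at any depth."""
    if isinstance(value, FakeTensor):
        return True
    if isinstance(value, Mapping):
        return any(contains_tensor(v) for v in value.values())
    return False


def flatten(state: Mapping[str, Any], always_flatten: bool, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested mappings into dotted keys.

    always_flatten=False mimics the rule described as the previous behavior:
    only descend into a mapping that contains a tensor.
    always_flatten=True mimics the rule proposed here: descend into every mapping.
    """
    flat: Dict[str, Any] = {}
    for key, value in state.items():
        path = f"{prefix}{key}"
        descend = isinstance(value, Mapping) and (always_flatten or contains_tensor(value))
        if descend:
            flat.update(flatten(value, always_flatten, prefix=f"{path}."))
        else:
            # Stored as a single object under one key.
            flat[path] = value
    return flat


state = {"model": {"weight": FakeTensor()}, "dataloader": {"step": 3, "epoch": 1}}
old_keys = sorted(flatten(state, always_flatten=False))
new_keys = sorted(flatten(state, always_flatten=True))
print(old_keys)  # ['dataloader', 'model.weight']
print(new_keys)  # ['dataloader.epoch', 'dataloader.step', 'model.weight']
```

Under the first rule the whole `dataloader` dict is a single entry; under the second rule each of its fields gets its own key.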

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC

[ghstack-poisoned]

pytorch-bot · 2024-05-01T21:11:37Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125335

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit ffcee65 with merge base 746da87:

NEW FAILURE - The following job has failed:

periodic / linux-focal-rocm6.0-py3.8 / test (distributed, 1, 2, linux.rocm.gpu) (gh)
distributed/_tensor/test_attention.py::RingAttentionTest::test_ring_attention_compile_attention_fn1

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

periodic / win-vs2019-cuda11.8-py3 / test (default, 1, 4, windows.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
profiler\test_profiler.py::TestProfiler::test_basic_chrome_trace
periodic / win-vs2019-cuda11.8-py3 / test (default, 4, 4, windows.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


LucasLLC

lgtm

wz337

LGTM

fegin · 2024-05-07T17:06:15Z

@pytorchbot merge -f "The failing tests are not related"

pytorchmergebot · 2024-05-07T17:08:32Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined. Pull Request resolved: #125336 Approved by: https://github.com/LucasLLC ghstack dependencies: #125333, #125501, #125334, #125335

…125337) Summary: Fixes #122792 state_dict includes only persistent buffers, while named_buffers() would include non_persistent buffers. Pull Request resolved: #125337 Approved by: https://github.com/awgu ghstack dependencies: #125333, #125501, #125334, #125335, #125336


* [DSD] Correctly handle _extra_state (#125336) Summary: distributed_state_dict should not try to use `getattr` to get `_extra_state` as this is not well-defined. Pull Request resolved: #125336 Approved by: https://github.com/LucasLLC ghstack dependencies: #125333, #125501, #125334, #125335 * lint * lint --------- Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>


…125337) (#127219) * [DSD] Fix to remove non_persistent buffer in distributed state dict (#125337) Summary: Fixes #122792 state_dict includes only persistent buffers, while named_buffers() would include non_persistent buffers. Pull Request resolved: #125337 Approved by: https://github.com/awgu ghstack dependencies: #125333, #125501, #125334, #125335, #125336 * lintrunner * lint --------- Co-authored-by: Chien-Chin Huang <chienchin@fb.com> Co-authored-by: Andrey Talman <atalman@fb.com>

… before 2.4

The original DCP does not flatten all containers, which can cause issues. #125335 intends to solve this by flattening all dictionaries. Unfortunately, it breaks checkpoints that were saved before 2.4. This also exposes some issues in DCP:

1. DCP should record a version in the metadata.
2. DCP should have a nice way to load an old state_dict.
3. DCP should unflatten all containers (map, list), not just maps.

This PR only addresses issue 2 to unblock users. Issues 1 and 3 need to be addressed in the future.

ghstack-source-id: 1adbc53
Pull Request resolved: #134158



… before 2.4 (#134158)

The original DCP does not flatten all containers, which can cause issues. #125335 intends to solve this by flattening all dictionaries. Unfortunately, it breaks checkpoints that were saved before 2.4. This also exposes some issues in DCP:

1. DCP should record a version in the metadata.
2. DCP should have a nice way to load an old state_dict.
3. DCP should unflatten all containers (map, list), not just maps.

This PR only addresses issue 2 to unblock users. Issues 1 and 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes #133923
Pull Request resolved: #134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
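One way to picture the backward-compatibility problem: a checkpoint written under the older rule stores a tensor-free dict under a single key, while a loader that always flattens asks for per-field keys. The sketch below falls back to the older key layout when the newer one is not found. This is an assumption for illustration only, not the logic of #134158; `flatten`, `has_no_tensor`, and `plan_load_keys` are hypothetical names, and `bytes` stands in for a tensor.

```python
from typing import Any, Callable, Dict, Mapping, Set


def flatten(state: Mapping[str, Any], keep_whole: Callable[[Any], bool], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested mappings into dotted keys; keep_whole(value) marks mappings stored as one object."""
    flat: Dict[str, Any] = {}
    for key, value in state.items():
        path = f"{prefix}{key}"
        if isinstance(value, Mapping) and not keep_whole(value):
            flat.update(flatten(value, keep_whole, f"{path}."))
        else:
            flat[path] = value
    return flat


def has_no_tensor(value: Any) -> bool:
    """Simplified older rule: a mapping without tensors is stored whole (bytes stands in for a tensor)."""
    if isinstance(value, Mapping):
        return all(has_no_tensor(v) for v in value.values())
    return not isinstance(value, bytes)


def plan_load_keys(state: Mapping[str, Any], checkpoint_keys: Set[str]) -> Dict[str, Any]:
    """Prefer the always-flatten layout; fall back to the older layout if its keys match the checkpoint."""
    current = flatten(state, keep_whole=lambda _: False)
    if set(current) <= checkpoint_keys:
        return current
    legacy = flatten(state, keep_whole=has_no_tensor)
    if set(legacy) <= checkpoint_keys:
        return legacy
    missing = sorted(set(current) - checkpoint_keys)
    raise KeyError(f"keys not found in checkpoint: {missing}")


state = {"model": {"weight": b"\x00"}, "dataloader": {"step": 0}}
old_checkpoint_keys = {"model.weight", "dataloader"}       # layout of a pre-change checkpoint
new_checkpoint_keys = {"model.weight", "dataloader.step"}  # layout after always flattening

legacy_plan = plan_load_keys(state, old_checkpoint_keys)
current_plan = plan_load_keys(state, new_checkpoint_keys)
print(sorted(legacy_plan))   # ['dataloader', 'model.weight']
print(sorted(current_plan))  # ['dataloader.step', 'model.weight']
```

Recording a version in the checkpoint metadata (issue 1 above) would remove the need to guess the layout from the keys.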


Update

bd0421d

[ghstack-poisoned]

pytorch-bot bot added the module: distributed_checkpoint and oncall: distributed labels May 1, 2024

fegin added the ciflow/periodic and ciflow/trunk labels May 1, 2024

fegin requested review from LucasLLC and wz337 May 1, 2024 21:32

fegin mentioned this pull request May 1, 2024

Use stateful dataloader to checkpoint data iteration order and token buffer pytorch/torchtitan#279

Merged

Update

d08a80d

[ghstack-poisoned]

fegin changed the title ~~[DCP] Always unflatten containers even if no tensors present~~ [DCP] Always unflatten mapping even if no tensors present May 1, 2024

fegin changed the title ~~[DCP] Always unflatten mapping even if no tensors present~~ [DCP] Always flatten mapping even if no tensors present May 2, 2024

Update

ffcee65

[ghstack-poisoned]

fegin mentioned this pull request May 3, 2024

[DSD] Improve the performance of distributed state_dict #125501

Closed

LucasLLC approved these changes May 6, 2024


wz337 approved these changes May 7, 2024


pytorchmergebot added the merging label May 7, 2024

pytorchmergebot added the Merged label May 7, 2024

pytorchmergebot closed this in 6f1e3a6 May 7, 2024

pytorchmergebot removed the merging label May 7, 2024

mvpatel2000 mentioned this pull request May 17, 2024

[DSD] Correctly handle _extra_state (#125336) #126567

Merged

antoinebrl mentioned this pull request May 27, 2024

[DSD] Fix to remove non_persistent buffer in distributed state dict (#125337) #127219

Merged

github-actions bot deleted the gh/fegin/232/head branch June 7, 2024 01:55

bigning mentioned this pull request Aug 19, 2024

[Distributed Checkpointing][torch2.4] torch 2.4 can't load a checkpointing saved by torch2.3 #133923

Closed

fegin mentioned this pull request Aug 21, 2024

[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 #134158

Closed

fegin wants to merge 3 commits into gh/fegin/232/base from gh/fegin/232/head


4 participants
