[DCP] Avoid in-place update and deepcopy during dedupe #149320

Closed
saumishr wants to merge 1 commit into pytorch:main from saumishr:export-D71245218

Conversation

Contributor

@saumishr saumishr commented Mar 17, 2025

Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

Control job with the deepcopy regression:
- First save: ~24.8s
- Global step latency: ~7-8s

Test job with the new fix to avoid deepcopy:
- First save: ~21s
- Global step latency: ~2s
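The idea behind the fix can be illustrated with a minimal sketch (not the actual DCP planner code; `SavePlan`, `WriteItem`, and `dedup_save_plans` here are simplified stand-ins for the planner types): instead of deepcopying every plan and mutating it in place, the dedupe pass first decides which item indices each plan keeps, then constructs new plan objects so unchanged data is shared rather than copied.

```python
# Hedged sketch of a copy-free dedupe pass. SavePlan/WriteItem are
# simplified stand-ins, not the real torch.distributed.checkpoint types.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WriteItem:
    index: str  # stands in for MetadataIndex


@dataclass(frozen=True)
class SavePlan:
    items: tuple


def dedup_save_plans(all_plans: list) -> list:
    # Assign each duplicated index to exactly one plan (the first that
    # contains it), tracking per-plan keep-sets instead of mutating plans.
    seen = set()
    plan_to_keep = [set() for _ in all_plans]
    for plan_idx, plan in enumerate(all_plans):
        for item in plan.items:
            if item.index not in seen:
                seen.add(item.index)
                plan_to_keep[plan_idx].add(item.index)
    # Build new plans rather than deepcopying and editing in place:
    # dataclasses.replace makes a shallow copy with only `items` swapped,
    # so all other fields are shared, not duplicated per FQN.
    return [
        replace(plan, items=tuple(i for i in plan.items if i.index in keep))
        for plan, keep in zip(all_plans, plan_to_keep)
    ]
```

With many FQNs the deepcopy cost scales with total plan size on every save, while this construction only allocates the filtered item tuples, which matches the latency drop reported above.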

Test Plan:

buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner

https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

cc @LucasLLC @pradeepfn @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149320

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 35c4f4a with merge base 6c7d841:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (checkpoint) labels Mar 17, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D71245218

@saumishr saumishr requested review from meetv18 and pradeepfn March 17, 2025 14:53
@meetv18 meetv18 added the oncall: distributed checkpointing label and removed the oncall: distributed label Mar 17, 2025
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 17, 2025
Contributor

nit:

Suggested change:
- plan_to_item_indexes: list[set[MetadataIndex]] = [set(item.index for item in plan.items) for plan in all_plans]
+ plan_to_item_indices: list[set[MetadataIndex]] = [set(item.index for item in plan.items) for plan in all_plans]

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 17, 2025

saumishr added a commit to saumishr/pytorch that referenced this pull request Mar 17, 2025
Summary:
Pull Request resolved: pytorch#149320

Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with the deepcopy regression:
- https://fburl.com/scuba/dai_modelstore/ac5so0i0
- First save: ~24.8s
- Global step latency: ~7-8s
- https://fburl.com/scuba/pytorch_dcp_logging/kdmwiemk

#### Test job with the new fix to avoid deepcopy:
- https://fburl.com/scuba/dai_modelstore/aqi93x3a
- First save: ~21s
- Global step latency: ~2s
- https://fburl.com/scuba/pytorch_dcp_logging/w7muzr84

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Reviewed By: MeetVadakkanchery

Differential Revision: D71245218
@saumishr saumishr force-pushed the export-D71245218 branch 2 times, most recently from 407ae55 to 072a3f8 Compare March 17, 2025 18:03


@facebook-github-bot
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

ciflow/trunk, fb-exported, Merged, oncall: distributed checkpointing, oncall: distributed, release notes: distributed (checkpoint)