[DCP] Avoid in-place update and deepcopy during dedupe #149320

Closed
saumishr wants to merge 1 commit into pytorch:main from saumishr:export-D71245218

Conversation

Contributor

@saumishr saumishr commented Mar 17, 2025

Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

Control job with the deepcopy regression:
- First save: ~24.8s
- Global step latency: ~7-8s

Test job with the new fix to avoid deepcopy:
- First save: ~21s
- Global step latency: ~2s
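The idea behind the fix can be illustrated with a minimal sketch (not the actual DCP planner code; `SavePlan`, `WriteItem`, and `dedup_save_plans` here are simplified stand-ins for the planner types): instead of deepcopying every plan and mutating it in place, the dedupe pass first decides which item indices each plan keeps, then constructs new plan objects so unchanged data is shared rather than copied.

```python
# Hedged sketch of a copy-free dedupe pass. SavePlan/WriteItem are
# simplified stand-ins, not the real torch.distributed.checkpoint types.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WriteItem:
    index: str  # stands in for MetadataIndex


@dataclass(frozen=True)
class SavePlan:
    items: tuple


def dedup_save_plans(all_plans: list) -> list:
    # Assign each duplicated index to exactly one plan (the first that
    # contains it), tracking per-plan keep-sets instead of mutating plans.
    seen = set()
    plan_to_keep = [set() for _ in all_plans]
    for plan_idx, plan in enumerate(all_plans):
        for item in plan.items:
            if item.index not in seen:
                seen.add(item.index)
                plan_to_keep[plan_idx].add(item.index)
    # Build new plans rather than deepcopying and editing in place:
    # dataclasses.replace makes a shallow copy with only `items` swapped,
    # so all other fields are shared, not duplicated per FQN.
    return [
        replace(plan, items=tuple(i for i in plan.items if i.index in keep))
        for plan, keep in zip(all_plans, plan_to_keep)
    ]
```

With many FQNs the deepcopy cost scales with total plan size on every save, while this construction only allocates the filtered item tuples, which matches the latency drop reported above.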

Test Plan:

buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner

https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

cc @LucasLLC @pradeepfn @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o


pytorch-bot bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149320

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 35c4f4a with merge base 6c7d841:

UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (checkpoint) labels Mar 17, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D71245218

@saumishr saumishr requested review from meetv18 and pradeepfn March 17, 2025 14:53
@meetv18 meetv18 added the oncall: distributed checkpointing label and removed the oncall: distributed label Mar 17, 2025
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Mar 17, 2025
Contributor

nit:

Suggested change:
- plan_to_item_indexes: list[set[MetadataIndex]] = [set(item.index for item in plan.items) for plan in all_plans]
+ plan_to_item_indices: list[set[MetadataIndex]] = [set(item.index for item in plan.items) for plan in all_plans]

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Mar 17, 2025

saumishr added a commit to saumishr/pytorch that referenced this pull request Mar 17, 2025
Summary:
Pull Request resolved: pytorch#149320

Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with the deepcopy regression:
- https://fburl.com/scuba/dai_modelstore/ac5so0i0
- First save: ~24.8s
- Global step latency: ~7-8s
- https://fburl.com/scuba/pytorch_dcp_logging/kdmwiemk

#### Test job with the new fix to avoid deepcopy:
- https://fburl.com/scuba/dai_modelstore/aqi93x3a
- First save: ~21s
- Global step latency: ~2s
- https://fburl.com/scuba/pytorch_dcp_logging/w7muzr84

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Reviewed By: MeetVadakkanchery

Differential Revision: D71245218
@saumishr saumishr force-pushed the export-D71245218 branch 2 times, most recently from 407ae55 to 072a3f8 Compare March 17, 2025 18:03


@facebook-github-bot
Contributor

@pytorchbot merge -i

(Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

ciflow/trunk, fb-exported, Merged, oncall: distributed checkpointing, oncall: distributed, release notes: distributed (checkpoint)