fully_shard load state_dict #90945
rohan-varma wants to merge 9 commits into gh/rohan-varma/627/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90945
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 298681f. This comment was automatically generated by Dr. CI and updates every 15 minutes.
    [test_name_mapping[str(s)] if s is not None else "none" for s in args]
)

def _broadcast_state_dict(rank, state_dict):
not strictly needed right now but will be used in composable tests.
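A minimal single-process sketch of what a helper like `_broadcast_state_dict` accomplishes: every rank ends up with an identical copy of the source rank's state dict. Real code would use torch.distributed collectives (e.g. `broadcast_object_list`); the function and variable names here (`broadcast_state_dict`, `rank_dicts`) are hypothetical, chosen only to simulate the effect without a process group:

```python
import copy

def broadcast_state_dict(rank_dicts, src_rank=0):
    """Simulate broadcasting src_rank's state_dict to every rank.

    rank_dicts maps rank -> local state_dict. Each rank receives an
    independent deep copy of the source dict, mirroring what a real
    distributed broadcast would leave behind on each process.
    """
    src = rank_dicts[src_rank]
    return {rank: copy.deepcopy(src) for rank in rank_dicts}

# Rank 1 starts out-of-sync; after the "broadcast" it matches rank 0.
ranks = {0: {"w": [1.0, 2.0]}, 1: {"w": [0.0, 0.0]}}
synced = broadcast_state_dict(ranks)
```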
awgu left a comment:
LGTM! There are a few to-dos you left for yourself. Feel free to address those before landing.
@skip_if_lt_x_gpu(2)
def test_state_dict_save_load_flow(self):
    """
    E2E test of save + load with rank0_only + CPU offload for TransformerWithSharedParams
In the future, will this test include different state dict types and subtest the different configs?
    buffers, buffer_dtypes, fsdp_state.compute_device
if buffers:
    mixed_precision_enabled_for_buffers = (
        fsdp_state._mixed_precision_enabled_for_buffers() if not _is_composable(fsdp_state)
To-do: We can make _mixed_precision_enabled_for_buffers() not be a method of FullyShardedDataParallel so this doesn't need an if/else here. We would be able to just check fsdp_state.mixed_precision.buffer_dtype is not None -- flexible whether that lives in its own function or is written inline every time.
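The suggested check can be sketched as follows. `MixedPrecision` and `FSDPState` here are simplified stand-ins for the real FSDP config/state objects, just to show why a plain attribute check works for both wrapped and composable paths without an if/else:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MixedPrecision:
    # In real FSDP this would be a torch.dtype; a string suffices here.
    buffer_dtype: Optional[str] = None

@dataclass
class FSDPState:
    mixed_precision: MixedPrecision

def mixed_precision_enabled_for_buffers(fsdp_state: FSDPState) -> bool:
    # The reviewer's suggestion: instead of a method on the wrapper class,
    # inspect the shared state directly, so composable and wrapped FSDP
    # take the same code path.
    return fsdp_state.mixed_precision.buffer_dtype is not None
```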
Ensures that load_state_dict for fully_shard works:
- Don't add back the FSDP prefix
- Small fix to ensure the mixed precision check for buffers works

Follow ups:
- state_dict_type does not work, blocking rank0_only and CPU offload as well as other state dict implementations
- No testing when wrapped with AC, using mixed precision, integration with distributed checkpoint, etc.

[ghstack-poisoned]
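The "don't add back the FSDP prefix" point can be illustrated with a small sketch: keys saved from wrapper-style FSDP carry a module prefix that composable fully_shard keys must not have. The helper name `strip_fsdp_prefix` is hypothetical, and the prefix string is an assumption for illustration rather than the PR's actual implementation:

```python
# Assumed prefix for illustration; wrapper-style FSDP inserts a wrapper
# module into the hierarchy, so its parameter keys carry an extra segment.
FSDP_PREFIX = "_fsdp_wrapped_module."

def strip_fsdp_prefix(state_dict, prefix=FSDP_PREFIX):
    """Drop the wrapper prefix from any key that carries it.

    Composable fully_shard keeps the module tree unchanged, so its
    state_dict keys should match the plain nn.Module keys -- neither
    saving nor loading should add the prefix back.
    """
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state_dict.items()
    }
```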
        param.zero_()
if zero_buffers:
    for buffer in model.buffers():

ctx = FSDP.summon_full_params(model) if summon_full else suppress()
Should we include any to-do or issue for following up on this? Or, could you remind me what the current status on this is?
yeah, let me file an issue summarizing it.
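The `ctx = ... if summon_full else suppress()` line in the test above is an instance of a general pattern: conditionally entering a real context manager, falling back to a no-op otherwise. A sketch of that pattern, where `summon_full_params` is a hypothetical stand-in for `FSDP.summon_full_params` and `nullcontext` is used instead of the test's `suppress()` to make the no-op intent explicit:

```python
from contextlib import contextmanager, nullcontext

@contextmanager
def summon_full_params(model):
    # Hypothetical stand-in: real FSDP would all-gather full parameters
    # on entry and reshard them on exit.
    yield model

def maybe_summon(model, summon_full):
    # Pick a real context manager when summon_full is set, otherwise a
    # no-op context that simply passes the model through.
    return summon_full_params(model) if summon_full else nullcontext(model)
```

`contextlib.suppress()` (with no exception types) works as a no-op too, which is what the test used; `nullcontext` additionally lets the `with ... as` target carry a value.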
@pytorchbot merge -f "CI passed"

Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#90945
Approved by: https://github.com/awgu
ghstack-source-id: 1fd8b50