
[FSDP] Move the sharded_state_dict logic to the post hook to avoid OOM#82613

Closed
fegin wants to merge 2 commits into gh/fegin/20/base from gh/fegin/20/head

Conversation

@fegin
Contributor

@fegin fegin commented Aug 1, 2022

Stack from ghstack (oldest at bottom):

The original implementation put the call to `_summon_full_params()` inside `state_dict()`. However, because `state_dict()` is recursive, `_summon_full_params()` also behaves recursively even when `recurse` is set to `False`. This PR moves the logic into the post hook to solve the OOM issue.

Differential Revision: [D38329396](https://our.internmc.facebook.com/intern/diff/D38329396/)

[ghstack-poisoned]
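To make the memory argument in the description concrete, here is a minimal pure-Python sketch (not the real FSDP internals; `Module`, `tree_size`, and the peak counter are illustrative stand-ins) of why summoning full params inside a recursive `state_dict()` materializes the whole tree at once, while doing it in a per-module post hook with `recurse=False` only materializes one module at a time:

```python
# Toy model of peak memory under the two placements of the gather logic.
class Module:
    def __init__(self, name, size, children=()):
        self.name, self.size = name, size
        self.children = list(children)

    def tree_size(self):
        # Total parameter size of this module and all descendants.
        return self.size + sum(c.tree_size() for c in self.children)

    # Variant A: summon inside state_dict(). Because state_dict() recurses,
    # the summon at the root effectively covers the whole subtree, so peak
    # memory is the full tree size.
    def state_dict_inline(self, peak=None):
        if peak is None:
            peak = [0]
        peak[0] = max(peak[0], self.tree_size())  # whole subtree resident
        for c in self.children:
            c.state_dict_inline(peak)
        return peak[0]

    # Variant B: summon in a post hook with recurse=False. Each hook only
    # materializes its own module's parameters, so peak memory is the
    # largest single module, not the whole tree.
    def state_dict_hooked(self, peak=None):
        if peak is None:
            peak = [0]
        for c in self.children:
            c.state_dict_hooked(peak)
        peak[0] = max(peak[0], self.size)  # only this module resident
        return peak[0]

root = Module("root", 4, [Module("a", 8), Module("b", 2, [Module("c", 6)])])
inline_peak = root.state_dict_inline()  # full tree: 4 + 8 + 2 + 6
hooked_peak = root.state_dict_hooked()  # largest single module
```

Under this toy model, the inline variant peaks at the whole tree size while the hooked variant peaks at the largest single module, which is the OOM difference the PR targets.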
@facebook-github-bot
Contributor

facebook-github-bot commented Aug 1, 2022


❌ 1 New Failures, 5 Pending

As of commit d0f7f00 (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (default, 1, 4, linux.4xlarge.nvidia.gpu) (1/1)

Step: "Test" (full log | diagnosis details)

2022-08-02T23:24:14.1286781Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 1266, in set_rng_seed
2022-08-02T23:24:14.1287168Z     torch.manual_seed(seed)
2022-08-02T23:24:14.1287626Z   File "/opt/conda/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
2022-08-02T23:24:14.1288001Z     torch.cuda.manual_seed_all(seed)
2022-08-02T23:24:14.1288479Z   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
2022-08-02T23:24:14.1288848Z     _lazy_call(cb, seed_all=True)
2022-08-02T23:24:14.1289321Z   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 156, in _lazy_call
2022-08-02T23:24:14.1289650Z     callable()
2022-08-02T23:24:14.1290078Z   File "/opt/conda/lib/python3.10/site-packages/torch/cuda/random.py", line 111, in cb
2022-08-02T23:24:14.1290455Z     default_generator.manual_seed(seed)
2022-08-02T23:24:14.1290800Z RuntimeError: CUDA error: an illegal memory access was encountered
2022-08-02T23:24:14.1291273Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2022-08-02T23:24:14.1291726Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2022-08-02T23:24:14.1291942Z 
2022-08-02T23:24:14.1292773Z ----------------------------------------------------------------------
2022-08-02T23:24:14.1293098Z Ran 20054 tests in 4855.161s
2022-08-02T23:24:14.1293266Z 
2022-08-02T23:24:14.1293433Z FAILED (errors=1, skipped=3552, expected failures=246)
2022-08-02T23:24:14.1293640Z 
2022-08-02T23:24:14.1293763Z Generating XML reports...
2022-08-02T23:24:16.2159459Z Generated XML report: test-reports/python-unittest/test_ops/TEST-TestCommonCUDA-20220802220318.xml

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@facebook-github-bot facebook-github-bot added the cla signed and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels Aug 1, 2022
fegin added a commit that referenced this pull request Aug 1, 2022
The original implementation put the call to `_summon_full_params()` inside `state_dict()`. However, because `state_dict()` is recursive, `_summon_full_params()` also behaves recursively even when `recurse` is set to `False`. This PR moves the logic into the post hook to solve the OOM issue.

Differential Revision: [D38329396](https://our.internmc.facebook.com/intern/diff/D38329396/)

ghstack-source-id: 163196066
Pull Request resolved: #82613
Contributor

@rohan-varma rohan-varma left a comment


Thanks for the fix! It would be great to test it on a use case where full state dict fails but this one succeeds.

state_dict[fqn] = init_from_local_shards(
    local_shards, param.size(), process_group=self.process_group
)  # type: ignore[assignment]
state_dict.pop(f"{prefix}{FLAT_PARAM}")
Contributor

I'm assuming that this is removing the key checkpointed by the super().state_dict() call?

Contributor Author


Yes
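The snippet and exchange above can be sketched in plain Python (hypothetical names: `FLAT_PARAM`, `make_sharded_entry`, and `sharded_post_hook` are illustrative stand-ins, not FSDP's actual API; `make_sharded_entry` replaces the real `init_from_local_shards()`): the post hook drops the flat-parameter key that `super().state_dict()` checkpointed and adds one sharded entry per original parameter.

```python
FLAT_PARAM = "_flat_param"  # stand-in for FSDP's flat-parameter key

def make_sharded_entry(local_shard, full_size):
    # Stand-in for init_from_local_shards(): records the local shard plus
    # the global size instead of building a real ShardedTensor.
    return {"local_shard": local_shard, "full_size": full_size}

def sharded_post_hook(state_dict, prefix, param_fqns, local_shards, full_sizes):
    # Add one entry per original parameter, keyed by its fully qualified name.
    for fqn in param_fqns:
        state_dict[prefix + fqn] = make_sharded_entry(
            local_shards[fqn], full_sizes[fqn]
        )
    # Remove the key that super().state_dict() checkpointed.
    state_dict.pop(f"{prefix}{FLAT_PARAM}")
    return state_dict

sd = {"layer1._flat_param": [1, 2, 3, 4]}
out = sharded_post_hook(
    sd, "layer1.", ["weight"], {"weight": [1, 2]}, {"weight": 4}
)
```

After the hook runs, `"layer1._flat_param"` is gone and `"layer1.weight"` holds the local shard plus the global size, mirroring the pop-then-insert pattern in the diff.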

elif self._state_dict_type == StateDictType.LOCAL_STATE_DICT:
elif (
    self._state_dict_type == StateDictType.LOCAL_STATE_DICT or
    self._state_dict_type == StateDictType.SHARDED_STATE_DICT
):
Contributor

So it seems that, for the sharded state dict, calling state_dict is not meant to do much: we remove the checkpointed FLAT_PARAM and add new entries to the state_dict for the sharded original parameters.

If this is the case, could we just remove the super().state_dict() calls, recurse ourselves, and call the post hook?

Contributor Author

We need the recursive calls for 1) constructing the correct prefix and 2) calling the post hooks in reverse order. We could do this ourselves, but it is better to reuse the state_dict logic. The only thing sharded_state_dict does not need is the detach, but that should not cause much overhead.
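The two benefits of reusing the recursion can be shown with a simplified model of `nn.Module.state_dict()` (this is a pure-Python sketch, not the real implementation; the tuple-based module tree is an illustrative stand-in): the recursion both builds the dotted key prefix and, because each module's post hook runs after its children's, fires the hooks bottom-up.

```python
def state_dict(module_tree, prefix="", destination=None, hook_order=None):
    # module_tree is (name, [param names], [child trees]).
    if destination is None:
        destination, hook_order = {}, []
    name, params, children = module_tree
    # The recursion threads the dotted prefix down to every key.
    for pname in params:
        destination[prefix + pname] = f"value-of-{prefix}{pname}"
    for child in children:
        state_dict(child, prefix + child[0] + ".", destination, hook_order)
    # Post hook: by this point every child's hook has already fired,
    # so hooks run in bottom-up (reverse) order.
    hook_order.append(prefix or name)
    return destination, hook_order

tree = ("root", ["w"], [("a", ["w"], []), ("b", ["w"], [])])
sd, order = state_dict(tree)
```

Here `sd` gets the correctly prefixed keys `"w"`, `"a.w"`, `"b.w"`, and `order` shows the hooks firing for the children before the root, which is exactly what the reply above relies on.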

    "not be SUMMON_FULL_PARAMS."
)
with self._summon_full_params(recurse=False, writeback=False):
    for fqn, _, _ in self._param_fqns:
Contributor

Wouldn't `named_parameters()` also give us just the parameter names we need?

Contributor Author

named_parameters() will give us more parameters than we need. We would have to use the recursive named_parameters(), and that yields extra parameters: 1) parameters in children FSDP modules and 2) parameters that are ignored by FSDP.
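A toy illustration of that reply (all structures here are hypothetical stand-ins for FSDP's bookkeeping; the tuple-based module tree and `managed_fqns` set are not the real API): a recursive `named_parameters()` on a wrapper also yields its children's and ignored parameters, so the hook has to work from a pre-filtered FQN list like `_param_fqns` instead.

```python
def named_parameters(module, prefix=""):
    # module is (name, [own param names], [child trees]); recursive, like
    # nn.Module.named_parameters() with recurse=True.
    name, own_params, children = module
    for p in own_params:
        yield prefix + p
    for child in children:
        yield from named_parameters(child, prefix + child[0] + ".")

# An outer wrapper with one managed param, one FSDP-ignored param, and a
# nested FSDP-wrapped child that manages its own param.
outer = ("outer", ["weight", "ignored_weight"],
         [("inner_fsdp", ["weight"], [])])

all_names = list(named_parameters(outer))

# The outer wrapper's hook should only touch the parameters it manages,
# e.g. what FSDP tracks in _param_fqns for this wrapper:
managed_fqns = {"weight"}
own_only = [n for n in all_names if n in managed_fqns]
```

The recursive walk returns three names, but only one belongs to the outer wrapper itself, which is why filtering by the wrapper's own FQN list is needed.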

…to avoid OOM"

The original implementation put the call to `_summon_full_params()` inside `state_dict()`. However, because `state_dict()` is recursive, `_summon_full_params()` also behaves recursively even when `recurse` is set to `False`. This PR moves the logic into the post hook to solve the OOM issue.

Differential Revision: [D38329396](https://our.internmc.facebook.com/intern/diff/D38329396/)

[ghstack-poisoned]
fegin added a commit that referenced this pull request Aug 2, 2022
Pull Request resolved: #82613

The original implementation put the call to `_summon_full_params()` inside `state_dict()`. However, because `state_dict()` is recursive, `_summon_full_params()` also behaves recursively even when `recurse` is set to `False`. This PR moves the logic into the post hook to solve the OOM issue.
ghstack-source-id: 163330033

Differential Revision: [D38329396](https://our.internmc.facebook.com/intern/diff/D38329396/)
@fegin
Contributor Author

fegin commented Aug 3, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Contributor

github-actions bot commented Aug 3, 2022

Hey @fegin.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 4, 2022
#82613) (#82613)

Summary:
The original implementation put the call to `_summon_full_params()` inside `state_dict()`. However, because `state_dict()` is recursive, `_summon_full_params()` also behaves recursively even when `recurse` is set to `False`. This PR moves the logic into the post hook to solve the OOM issue.

Pull Request resolved: #82613
Approved by: https://github.com/rohan-varma

Test Plan:
contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/b750c10fbe288a201e623e89473bd7ea0f485d56

Original Phabricator Test Plan:
CI

Reviewed By: rohan-varma

Differential Revision: D38329396

Pulled By: fegin

fbshipit-source-id: 2f560f9b7ba73ad515987a65a684076f605a7635
@facebook-github-bot facebook-github-bot deleted the gh/fegin/20/head branch August 7, 2022 14:18

Labels

cla signed · Merged · oncall: distributed (Add this issue/PR to distributed oncall triage queue)
