[FSDP][7/N] Support `replicate` in `fully_shard` by awgu · Pull Request #91044 · pytorch/pytorch

awgu · 2022-12-16T23:05:09Z

Stack from ghstack:

- #91044 [FSDP][7/N] Support replicate in fully_shard
- #90959 [FSDP][6/N] Add note explaining idioms for _FSDPState traversal
- #90874 [FSDP][5/N] Add manual "wrapping" support for fully_shard
- #90871 [FSDP][4/N] Refactor func to share state/init handle attrs
- #90945 fully_shard load state_dict

This PR supports nesting replicate in fully_shard.

The PR achieves this by treating replicate-annotated modules as ignored modules. This means that all submodules in the replicate-annotated module's subtree are ignored, including nested fully_shard-annotated modules, which is the desired behavior.

This PR reworks some tree traversal.

One end goal is for state._handles to follow the same order for both the wrapper and composable paths. This implies that _get_fsdp_handles() returns the same value for both paths.

The helper function _get_fully_sharded_module_to_states() now follows a left-to-right DFS from each fully sharded module instead of a BFS. The left-to-right DFS follows .modules() order.
The composable auto "wrap" initialization function _init_param_handles_from_module() follows the reverse left-to-right DFS order. As noted in the code comments, this initialization order is a valid reverse topological sort, but it differs from the wrapper path. This is the only difference with respect to initialization order through the entire process.

mod: Module(
    submod1: Submodule()
    submod2: Submodule(
        subsubmod: Subsubmodule(),
    ),
)

For left-to-right DFS, the order is mod, submod1, submod2, subsubmod. (For context, right-to-left DFS would be mod, submod2, subsubmod, submod1. In other words, the left-to-right vs. right-to-left corresponds to .children() vs. reversed(.children()) respectively.) Then, reverse left-to-right DFS is subsubmod, submod2, submod1, mod, which is a valid initialization order. However, the wrapper auto wrap initialization order would be submod1, subsubmod, submod2, mod since it directly follows a left-to-right DFS and initializes as a part of the recursive DFS logic.

At the end of _init_param_handles_from_module(), we reverse the newly populated state._handles, so this is the reverse reverse left-to-right DFS order, which is equivalent to the left-to-right DFS order. Thus, state._handles has the same order for both paths.
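The orderings above can be checked with a minimal sketch that does not depend on PyTorch. The `Node` class and the helper names are hypothetical stand-ins for `nn.Module` and the FSDP internals, and the wrapper path is modeled here as a post-order traversal, which is an assumption that happens to reproduce the order quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for an nn.Module with named children.
    name: str
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node, reverse_children: bool = False) -> List[str]:
    """Visit a node before its children (left-to-right unless reversed)."""
    order = [node.name]
    children = reversed(node.children) if reverse_children else node.children
    for child in children:
        order.extend(preorder(child, reverse_children))
    return order


def postorder(node: Node) -> List[str]:
    """Visit a node after its children; models init inside a recursive DFS."""
    order: List[str] = []
    for child in node.children:
        order.extend(postorder(child))
    order.append(node.name)
    return order


mod = Node("mod", [Node("submod1"), Node("submod2", [Node("subsubmod")])])

left_to_right = preorder(mod)
right_to_left = preorder(mod, reverse_children=True)
# Composable path: initialize in reverse left-to-right DFS order.
reverse_left_to_right = list(reversed(left_to_right))
# Wrapper path (assumed post-order): children are initialized before parents.
wrapper_init_order = postorder(mod)
# Reversing the composable init order again recovers left-to-right DFS order.
handles_order = list(reversed(reverse_left_to_right))

print(left_to_right)
print(reverse_left_to_right)
print(wrapper_init_order)
```

Both `reverse_left_to_right` and `wrapper_init_order` place every child before its parent, which is what makes each of them a valid initialization order even though they differ.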

Another goal is for _get_fsdp_states() to not traverse into any submodule that is annotated with an API that is not compatible with fully_shard (e.g. replicate). To achieve this while preserving that _get_fsdp_states() follows .modules() order, we again use a left-to-right DFS.

The DFSs may look strange because I implemented them non-recursively, which requires an explicit stack.
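As a rough illustration of the non-recursive traversal, the sketch below walks a toy tree with an explicit stack and skips any subtree rooted at an incompatible node. The `Node` class, the `incompatible` flag, and `dfs_skip_incompatible()` are hypothetical names for this example only; they are not the helpers in `_traversal_utils.py`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    # Hypothetical stand-in for an nn.Module.
    name: str
    children: List["Node"] = field(default_factory=list)
    # Hypothetical flag meaning "annotated with an API such as replicate".
    incompatible: bool = False


def dfs_skip_incompatible(root: Node) -> List[str]:
    """Left-to-right DFS using a stack, pruning incompatible subtrees."""
    visited: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.incompatible:
            # Do not visit this node or anything beneath it.
            continue
        visited.append(node.name)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return visited


tree = Node(
    "mod",
    [
        Node("submod1"),
        Node("submod2", [Node("subsubmod")], incompatible=True),
        Node("submod3"),
    ],
)
order = dfs_skip_incompatible(tree)
print(order)
```

Pushing the children in reverse is the part that tends to look odd: a stack pops the most recently pushed item, so reversing is what keeps the visit order left-to-right.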

test_get_fully_sharded_module_to_states() in test_utils.py checks the traversal order of _get_fully_sharded_module_to_states().
test_policy() in test_fully_shard.py checks the traversal order returned by _get_fsdp_handles().

Due to a circular dependency issue, we must move the graph/tree traversal helpers to their own file _traversal_utils.py, and any usages must import the entire file like import torch.distributed.fsdp._traversal_utils as traversal_utils instead of from torch.distributed.fsdp._traversal_utils import ....

The cycle comes from the fact that the traversals require _composable(), which requires _get_registry() from composable/contract.py, which when imported, imports composable/fully_shard.py, which requires the traversals.
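The effect of the two import styles can be reproduced with a toy cycle. The sketch below writes three throwaway modules to a temporary directory; the module names (`bad_*`, `good_*`) and function bodies are hypothetical and only mimic the shape of the dependency chain described above, not the real PyTorch files.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_cycle(directory: Path, prefix: str, fully_shard_import: str, call: str) -> None:
    """Write three toy modules that import each other in a cycle."""
    sources = {
        # Traversal helpers need the registry from the contract module.
        f"{prefix}_traversal.py": f"""
            from {prefix}_contract import get_registry

            def get_states():
                return [name + "_state" for name in get_registry()]
        """,
        # Importing the contract module pulls in fully_shard.
        f"{prefix}_contract.py": f"""
            import {prefix}_fully_shard

            def get_registry():
                return ["fully_shard"]
        """,
        # fully_shard needs the traversal helpers, which closes the cycle.
        f"{prefix}_fully_shard.py": f"""
            {fully_shard_import}

            def fully_shard():
                return {call}
        """,
    }
    for filename, source in sources.items():
        (directory / filename).write_text(textwrap.dedent(source))


with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        build_cycle(Path(tmp), "bad", "from bad_traversal import get_states", "get_states()")
        build_cycle(
            Path(tmp),
            "good",
            "import good_traversal as traversal_utils",
            "traversal_utils.get_states()",
        )
        importlib.invalidate_caches()

        # `from ... import name` needs the name to exist at import time, but
        # the traversal module is only partially initialized at that point.
        try:
            importlib.import_module("bad_traversal")
            from_import_failed = False
        except ImportError:
            from_import_failed = True

        # Importing the whole module only binds the module object; the
        # attribute lookup is deferred until fully_shard() is called.
        importlib.import_module("good_traversal")
        result = importlib.import_module("good_fully_shard").fully_shard()
    finally:
        sys.path.remove(tmp)

print(from_import_failed, result)
```

The whole-module import works because only the module object has to exist when the cycle is hit; the attribute is looked up later, once every module in the cycle has finished executing.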

[ghstack-poisoned]

pytorch-bot · 2022-12-16T23:05:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91044

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 570c2e1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


awgu · 2022-12-17T23:26:42Z

test/distributed/fsdp/test_fsdp_overlap.py

@@ -93,9 +93,9 @@ def world_size(self):
    def _dist_train(self):
        rank = self.rank
        world_size = self.world_size


This test is disabled in CI, but I found that it is broken when running locally. This change fixes the test.


awgu · 2022-12-20T16:47:15Z

@pytorchbot merge

pytorchmergebot · 2022-12-20T16:49:05Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

[FSDP][7/N] Support replicate in fully_shard

0efa85d

[ghstack-poisoned]

awgu requested review from H-Huang, kwen2501, mrshenli, pritamdamania87, rohan-varma, wanchaol and zhaojuanmao as code owners December 16, 2022 23:05

This was referenced Dec 16, 2022

[FSDP][4/N] Refactor func to share state/init handle attrs #90871

Closed

[FSDP][5/N] Add manual "wrapping" support for fully_shard #90874

Closed

pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Dec 16, 2022

awgu mentioned this pull request Dec 16, 2022

[FSDP][6/N] Add note explaining idioms for _FSDPState traversal #90959

Closed

awgu added the topic: improvements topic category label Dec 16, 2022

yhcharles self-requested a review December 16, 2022 23:07

Update on "[FSDP][7/N] Support replicate in fully_shard"

bdaa8d0

[ghstack-poisoned]

Update on "[FSDP][7/N] Support replicate in fully_shard"

67529c4

[ghstack-poisoned]

awgu pushed a commit that referenced this pull request Dec 17, 2022

[FSDP][7/N] Support replicate in fully_shard

8c8831a

ghstack-source-id: 4dc6b99 Pull Request resolved: #91044

Update on "[FSDP][7/N] Support replicate in fully_shard"

a1a8f56

This PR needs to be rebased to get the newly landed PRs. [ghstack-poisoned]

Update on "[FSDP][7/N] Support replicate in fully_shard"

062a9d1

This PR needs to be rebased to get the newly landed PRs. I will update with a proper PR summary before requesting for review. [ghstack-poisoned]

Update on "[FSDP][7/N] Support replicate in fully_shard"

ef19ced

This PR needs to be rebased to get the newly landed PRs. I will update with a proper PR summary before requesting for review. [ghstack-poisoned]

Update on "[FSDP][7/N] Support replicate in fully_shard"

1e124d6

This PR needs to be rebased to get the newly landed PRs. I will update with a proper PR summary before requesting for review. [ghstack-poisoned]

awgu commented Dec 17, 2022


Update on "[FSDP][7/N] Support replicate in fully_shard"

f22d810

This PR needs to be rebased to get the newly landed PRs. I will update with a proper PR summary before requesting for review. [ghstack-poisoned]

Update on "[FSDP][7/N] Support replicate in fully_shard"

28c5b84

This PR needs to be rebased to get the newly landed PRs. I will update with a proper PR summary before requesting for review. [ghstack-poisoned]

awgu pushed a commit that referenced this pull request Dec 19, 2022

[FSDP][7/N] Support replicate in fully_shard

d8bccb1

ghstack-source-id: 960f26e Pull Request resolved: #91044

awgu pushed a commit that referenced this pull request Dec 19, 2022

[FSDP][7/N] Support replicate in fully_shard

2f777ca

ghstack-source-id: 0393d1a Pull Request resolved: #91044

awgu pushed a commit that referenced this pull request Dec 20, 2022

[FSDP][7/N] Support replicate in fully_shard

37cb3b1

ghstack-source-id: 4b369f3 Pull Request resolved: #91044

awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 20, 2022

pytorchmergebot added the Merged label Dec 20, 2022

pytorchmergebot closed this in aec09ee Dec 20, 2022

This was referenced Dec 20, 2022

[FSDP] Re-support model dtype change after FSDP init #91192

Closed

[FSDP] Test use_orig_params=True, no_sync(), mixed precision #91193

Closed

[FSDP][Easy] Fix context manager syntax #91410

Closed

This was referenced Jan 5, 2023

[FSDP] Do not clean FQNs even for use_orig_params=True #91767

Closed

[PoC][FSDP] Async reduce-scatter #91865

Closed

facebook-github-bot deleted the gh/awgu/283/head branch June 8, 2023 15:32

awgu wants to merge 17 commits into gh/awgu/283/base from gh/awgu/283/head


3 participants
