[DSD] Correctly handle shared parameters for optimizer state_dict by fegin · Pull Request #128685 · pytorch/pytorch

fegin · 2024-06-14T06:26:01Z

Stack from ghstack (oldest at bottom):

See the discussion in #128076

The current implementation of `set_optimizer_state_dict()` assumes that every FQN returned by `_get_fqns()` exists in the optimizer state_dict. That does not hold when the model has shared parameters: in that case, only one FQN of the shared parameters appears in the optimizer state_dict. This PR addresses the issue.
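To make the failure mode concrete, here is a minimal pure-Python sketch of the lookup problem. It is not the code from this PR: `load_param_states`, `param_fqns`, and `saved_state` are hypothetical names, and skipping a parameter with no saved entry is an assumption made for illustration only.

```python
from typing import Any, Dict, List, Set


def load_param_states(
    param_fqns: List[Set[str]],
    saved_state: Dict[str, Any],
) -> Dict[int, Any]:
    """Map each parameter index to its saved optimizer state.

    param_fqns[i] holds every FQN that refers to parameter i; a shared
    parameter has more than one. saved_state is keyed by FQN and contains
    only one of those names per shared parameter.
    """
    loaded: Dict[int, Any] = {}
    for index, fqns in enumerate(param_fqns):
        # Do not require every FQN to be present; use whichever ones are.
        present = sorted(fqn for fqn in fqns if fqn in saved_state)
        if not present:
            # Assumption: a parameter with no saved state is skipped.
            continue
        loaded[index] = saved_state[present[0]]
    return loaded


# Hypothetical tied weights: one tensor reachable under two names.
param_fqns = [{"embed.weight", "lm_head.weight"}, {"norm.weight"}]
saved_state = {"embed.weight": {"step": 3}, "norm.weight": {"step": 3}}
result = load_param_states(param_fqns, saved_state)
```

A strict `saved_state[fqn]` lookup over every name in `param_fqns[0]` would raise `KeyError` on `"lm_head.weight"`, which is the kind of assumption the description says no longer holds.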

Differential Revision: D58573487

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC @MeetVadakkanchery @mhorowitz

pytorch-bot · 2024-06-14T06:26:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128685

❌ 1 New Failure, 6 Unrelated Failures

As of commit de9c20e with merge base 73ba432:

NEW FAILURE - The following job has failed:

periodic / win-vs2019-cuda11.8-py3 / build (gh)
Process completed with exit code 1.

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

periodic / linux-focal-cuda12.4-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (gh) (similar failure)
'Test'
periodic / linux-focal-rocm6.1-py3.8 / test (distributed, 2, 2, linux.rocm.gpu) (gh) (similar failure)
'Test'
trunk / win-vs2019-cpu-py3 / build (gh) (matched win rule in flaky-rules.json)
The process cannot access the file 'C:\actions-runner\_work\_actions\pytorch\pytorch\main\functorch\examples\maml_omniglot' because it is being used by another process.

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness
pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge) (gh) (trunk failure)
/var/lib/jenkins/workspace/xla/torch_xla/csrc/BUILD:31:17: Compiling torch_xla/csrc/elementwise.cpp failed: (Exit 1): gcc failed: error executing command (from target //torch_xla/csrc:tensor) /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 417 arguments skipped)
trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (trunk failure)
inductor/test_flex_attention.py::TestFlexAttention::test_fw_bw_graph_correctness

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-06-14T06:26:13Z