Fix tied weight embeddings fails to load state dict #128076
j316chuck wants to merge 3 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128076
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 3 Unrelated Failures as of commit 996783b with merge base a7c5968.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Please seek CI approval before scheduling CIFlow labels
I'm still trying to understand why the test was broken, but I don't think this is an appropriate fix to land. We should fix …
@fegin you can reproduce this test failure on a 2-GPU instance. Make sure to replace what we have here in …
Concretely, I think the difference is: … does not contain tied weight embeddings. However, your function … does contain tied weight embeddings. This yields the error in the test: …
@j316chuck It would be helpful if you could point out how this tied embedding weight is initialized. I'm not sure what "tied" means for the embedding weight. As I mentioned in the issue, …
@j316chuck To rephrase my question another way: you will need a unit test in PyTorch, not in Composer, to get this PR landed.
@fegin can you help me add this unit test? I ran into a lot of lint/PR issues when trying to commit directly into PyTorch. I believe I linked you the Composer unit test that you can pattern match off of, fwiw. Here is how we initialize the model in Composer from HF: … A tied weight embedding layer is one where the input embedding weights are also tied to the output layer of the LLM. This layer is double counted in …. Here's an example of tied weight embeddings: …
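To make the tying concrete, here is a minimal, hypothetical sketch (not the actual Composer/HF model; the class and sizes are invented for illustration) of a language model whose output head shares its weight tensor with the input embedding:

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy LM with tied input/output embeddings (hypothetical sizes)."""
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        # Tie the weights: the output projection reuses the embedding matrix.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, ids):
        return self.lm_head(self.tok_emb(ids))

model = TinyLM()
# Both modules point at the same Parameter object.
print(model.lm_head.weight is model.tok_emb.weight)  # True
```

Because the two modules reference a single `Parameter`, `model.parameters()` and `model.named_parameters()` yield that tensor only once, which is what makes tied models behave differently from untied ones in state_dict handling.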
We definitely want to cherry pick this into 2.4rc as it's a pretty nasty regression (TensorEngine stops working). |
@j316chuck I can reproduce the issue with the model definition you provided. However, #125336 is correct. It just surfaces an issue of shared parameters not being properly handled in distributed state_dict. This PR would hide the issue again, which may become a problem in the future. I'll submit a PR to fix the issue.
@j316chuck Please check if #128685 fixes the issue. |
[DSD] Correctly handle shared parameters for optimizer state_dict (#128685)

* Fixes #128011

See the discussion in #128076. The current implementation of `set_optimizer_state_dict()` assumes that all the fqns returned by `_get_fqns()` must exist in the optimizer state_dict. This is not true if the model has shared parameters. In such a case, only one fqn of the shared parameters will appear in the optimizer state_dict. This PR addresses the issue.

Differential Revision: [D58573487](https://our.internmc.facebook.com/intern/diff/D58573487/)

Pull Request resolved: #128685
Approved by: https://github.com/LucasLLC

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC MeetVadakkanchery mhorowitz [ghstack-poisoned]
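As a minimal illustration of why only one fqn shows up for a shared parameter (a sketch under assumed module names, not the PR's actual test), `named_parameters()` and the optimizer's parameter list both deduplicate a tensor that is registered under two modules:

```python
import torch
import torch.nn as nn

class Shared(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.out = nn.Linear(4, 10, bias=False)
        self.out.weight = self.emb.weight  # shared parameter

model = Shared()
# named_parameters() removes duplicates by default, so the shared tensor
# appears under a single fqn ('emb.weight'); 'out.weight' is absent.
print([name for name, _ in model.named_parameters()])

# The optimizer therefore also tracks the shared tensor once, so its
# state_dict holds state for only one of the two fqns that map to it.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
print(len(opt.param_groups[0]["params"]))  # 1
```

This is why code that expects every fqn returned by fqn resolution to exist in the optimizer state_dict breaks on tied-weight models.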
Closed in favor of #128685
[DSD] Correctly handle shared parameters for optimizer state_dict (#128685) (#129252) (cherry picked from commit 1a52791)
Cherry pick #128685 was picked, hence demilestoning this.
Fixes #128011
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC