
Bugfix for generation with an early-stopping process#32641

Closed
ojh31 wants to merge 3 commits into huggingface:main from ojh31:oskar/track-finish-position

Conversation


@ojh31 ojh31 commented Aug 12, 2024

What does this PR do?

Fixes a bug where we get a shape mismatch in the forward call to a model that uses key-value caching when one of the processes finishes generation early.

Fixes #32603

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ojh31 ojh31 changed the title Bugfix for generation with an early-stopping process [WIP] Bugfix for generation with an early-stopping process Aug 12, 2024
@ojh31 ojh31 changed the title [WIP] Bugfix for generation with an early-stopping process Bugfix for generation with an early-stopping process Aug 13, 2024

ojh31 commented Aug 13, 2024

Failing tests seem to be related to versioning issues with the CircleCI runners, so I am marking this as ready for review anyway.

@ojh31 ojh31 marked this pull request as ready for review August 13, 2024 00:01
@amyeroberts
Contributor

@ojh31 Could you try rebasing on main? This should resolve the failing tests.

cc @gante for review


gante commented Aug 13, 2024

Hi @ojh31 👋 Thank you for opening #32603 and this PR!

I see, multi-gpu + new cache changes likely brought problems. I may request changes to this PR, but let me first sync with our transformers multi-device master!


@SunMarc are the current shape-related problems of FSDP + generate on your radar? Is the snippet in the original issue the intended usage for FSDP + generate?

This PR fixes the issue by removing the continue when synced_gpus=True, which makes me believe the original cause is a mismatch between the cache size (which gets updated in forward) and its indexing tensor cache_positions (which does NOT get updated because of the continue)
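The suspected mismatch can be illustrated with a minimal pure-Python sketch (no torch; all names here are illustrative, not the actual transformers internals): under synced_gpus every process keeps calling forward, which grows the KV cache, but a finished process that hits the `continue` skips the bookkeeping that advances its position index, so the two drift apart.

```python
# Pure-Python sketch of the hypothesized bug: the forward call grows the
# cache every step, but a finished process that hits `continue` skips
# advancing its position index, so cache length and index drift apart.

def generate_steps(num_steps: int, finish_at: int, buggy_continue: bool):
    cache_len = 0       # grows inside every forward call (synced_gpus keeps calling forward)
    cache_position = 0  # index bookkeeping, updated after the forward call
    for step in range(num_steps):
        cache_len += 1  # forward() always runs
        if step >= finish_at and buggy_continue:
            continue    # early-stopped process skips the update below
        cache_position += 1
    return cache_len, cache_position

# With the buggy continue, the index lags behind the cache: (5, 3).
assert generate_steps(5, finish_at=3, buggy_continue=True) == (5, 3)
# Without it, they stay in sync: (5, 5).
assert generate_steps(5, finish_at=3, buggy_continue=False) == (5, 5)
```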


SunMarc commented Aug 13, 2024

Hey @ojh31, usually for generation we need to first unwrap the model. This is what is done in TRL, for example: https://github.com/huggingface/trl/blob/54f806b6ffdfa49f584340aec18d079a58a3a342/trl/trainer/online_dpo_trainer.py#L536. Can you try that and see if you still get the error? Also, did it work on past versions of transformers?

@gante, we never had issues with FSDP + generate in the past. I'll investigate a bit! If we have this issue in FSDP, we should also have it with DeepSpeed.

@ojh31 ojh31 force-pushed the oskar/track-finish-position branch from 186c93a to 112c100 Compare August 13, 2024 16:45

ojh31 commented Aug 13, 2024

Hey @ojh31, usually for generation we need to first unwrap the model. This is what is done in TRL, for example: https://github.com/huggingface/trl/blob/54f806b6ffdfa49f584340aec18d079a58a3a342/trl/trainer/online_dpo_trainer.py#L536. Can you try that and see if you still get the error? Also, did it work on past versions of transformers?

Thanks for taking a look at this @SunMarc! I get the same error modifying the last block as follows:

from contextlib import contextmanager

import torch
from accelerate import Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import PreTrainedModel


@contextmanager
def unwrap_model_for_generation(model: PreTrainedModel, accelerator: Accelerator):
    yield accelerator.unwrap_model(model)


for batch_input_ids, batch_attention_mask in dataloader:
    with torch.no_grad():
        model.forward(input_ids=batch_input_ids, attention_mask=batch_attention_mask)
    with FSDP.summon_full_params(model, recurse=False):
        with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
            outputs = unwrapped_model.generate(
                input_ids=batch_input_ids,
                attention_mask=batch_attention_mask,
                tokenizer=tokenizer,
                synced_gpus=True,
            )

I get the same error on transformers 4.42.4 and 4.44.0. However, I just randomly tried 4.35 and that did not hit the same error!


ojh31 commented Aug 13, 2024

git bisect finds that bd5091d is the breaking commit
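A self-contained demo of the `git bisect run` workflow used above, on a throwaway repo (everything here is illustrative: the three commits and the `repro.sh` script stand in for the real transformers history and FSDP repro, which exits non-zero once the bug appears):

```shell
# Demo of the git-bisect workflow on a throwaway repo. repro.sh stands in
# for the real repro script: it exits non-zero once the (simulated) bug
# is present, so `git bisect run` can find the first bad commit.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

printf 'exit 0\n' > repro.sh && git add repro.sh && git commit -qm "good: works"
printf 'exit 1\n' > repro.sh && git commit -aqm "bad: introduces bug"
git commit --allow-empty -qm "later commit, still bad"

git bisect start HEAD HEAD~2   # bad = HEAD, good = two commits back
git bisect run sh repro.sh     # reports the first bad commit
git bisect reset
```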


SunMarc commented Aug 14, 2024

Wow, thanks for testing and finding the commit that created this issue @ojh31! Do you have any insights regarding this commit and the error that we are facing, @gante? Is this linked to your intuition of why this is not working?

@qiuosier

I am hitting the same issue with DeepSpeed ZeRO-3 on 4.44.0. For me, the mismatch is between the attention_mask (which is still updated after the continue) and the key_states/value_states (updated during forward).
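The DeepSpeed variant of the mismatch follows the same pattern and can be sketched in plain Python (names illustrative, no torch or deepspeed): the attention_mask is extended every step, while the key/value states of a process that stops participating via `continue` stop growing, so a later forward call sees two different sequence lengths.

```python
# Plain-Python sketch of the DeepSpeed ZeRO-3 symptom: attention_mask
# length keeps growing every step, but the KV-cache length of a process
# that early-exits via `continue` stops growing at some step.

def simulate_lengths(num_steps: int, stop_growing_cache_at: int):
    mask_len, cache_len = 1, 1
    for step in range(num_steps):
        mask_len += 1  # attention_mask is extended every generation step
        if step < stop_growing_cache_at:
            cache_len += 1  # key/value states only grow while participating
    return mask_len, cache_len

mask_len, cache_len = simulate_lengths(6, stop_growing_cache_at=3)
assert (mask_len, cache_len) == (7, 4)  # lengths have diverged: shape mismatch
```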


gante commented Sep 2, 2024

(see this comment in the original issue to avoid parallel discussions 🤗 )


gante commented Oct 11, 2024

With #34095, this PR should not be needed 🤗

@gante gante closed this in #34095 Oct 12, 2024

gante commented Oct 12, 2024

(please reopen and ping me if the original issue is not fixed :) )



Development

Successfully merging this pull request may close these issues.

Shape mismatch when generating with multiple processes

5 participants