
Bugfix for generation with an early-stopping process#32641

Closed
ojh31 wants to merge 3 commits into huggingface:main from ojh31:oskar/track-finish-position

Conversation


@ojh31 ojh31 commented Aug 12, 2024

What does this PR do?

Fixes a bug where we get a shape mismatch in the forward call to a model that uses key-value caching when one of the processes finishes generation early.

Fixes #32603

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ojh31 ojh31 changed the title Bugfix for generation with an early-stopping process [WIP] Bugfix for generation with an early-stopping process Aug 12, 2024
@ojh31 ojh31 changed the title [WIP] Bugfix for generation with an early-stopping process Bugfix for generation with an early-stopping process Aug 13, 2024

ojh31 commented Aug 13, 2024

Failing tests seem to be related to versioning issues with the CircleCI runners, so I am marking this as ready for review anyway.

@ojh31 ojh31 marked this pull request as ready for review August 13, 2024 00:01
@amyeroberts
Contributor

@ojh31 Could you try rebasing on main? This should resolve the failing tests.

cc @gante for review


gante commented Aug 13, 2024

Hi @ojh31 👋 Thank you for opening #32603 and this PR!

I see, multi-gpu + new cache changes likely brought problems. I may request changes to this PR, but let me first sync with our transformers multi-device master!


@SunMarc are the current shape-related problems of FSDP + generate on your radar? Is the snippet in the original issue the intended usage for FSDP + generate?

This PR fixes the issue by removing the continue when synced_gpus=True, which makes me believe the original cause is a mismatch between the cache size (which gets updated in forward) and its indexing tensor cache_positions (which does NOT get updated because of the continue)
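The suspected mismatch can be illustrated with a minimal pure-Python sketch (no torch; all names here are illustrative, not the actual transformers internals): under synced_gpus every process keeps calling forward, which grows the KV cache, but a finished process that hits the `continue` skips the bookkeeping that advances its position index, so the two drift apart.

```python
# Pure-Python sketch of the hypothesized bug: the forward call grows the
# cache every step, but a finished process that hits `continue` skips
# advancing its position index, so cache length and index drift apart.

def generate_steps(num_steps: int, finish_at: int, buggy_continue: bool):
    cache_len = 0       # grows inside every forward call (synced_gpus keeps calling forward)
    cache_position = 0  # index bookkeeping, updated after the forward call
    for step in range(num_steps):
        cache_len += 1  # forward() always runs
        if step >= finish_at and buggy_continue:
            continue    # early-stopped process skips the update below
        cache_position += 1
    return cache_len, cache_position

# With the buggy continue, the index lags behind the cache: (5, 3).
assert generate_steps(5, finish_at=3, buggy_continue=True) == (5, 3)
# Without it, they stay in sync: (5, 5).
assert generate_steps(5, finish_at=3, buggy_continue=False) == (5, 5)
```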


SunMarc commented Aug 13, 2024

Hey @ojh31, usually for generation we need to first unwrap the model. This is what is done in TRL, for example: https://github.com/huggingface/trl/blob/54f806b6ffdfa49f584340aec18d079a58a3a342/trl/trainer/online_dpo_trainer.py#L536. Can you try that and see if you still get the error? Also, did it work on past versions of transformers?

@gante, we never had issues with FSDP + generate in the past. I'll investigate a bit! If we have this issue in FSDP, we should also have it with DeepSpeed.

@ojh31 ojh31 force-pushed the oskar/track-finish-position branch from 186c93a to 112c100 Compare August 13, 2024 16:45

ojh31 commented Aug 13, 2024

Hey @ojh31, usually for generation we need to first unwrap the model. This is what is done in TRL, for example: https://github.com/huggingface/trl/blob/54f806b6ffdfa49f584340aec18d079a58a3a342/trl/trainer/online_dpo_trainer.py#L536. Can you try that and see if you still get the error? Also, did it work on past versions of transformers?

Thanks for taking a look at this @SunMarc! I get the same error modifying the last block as follows:

from contextlib import contextmanager

import torch
from accelerate import Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import PreTrainedModel


@contextmanager
def unwrap_model_for_generation(model: PreTrainedModel, accelerator: Accelerator):
    yield accelerator.unwrap_model(model)


for batch_input_ids, batch_attention_mask in dataloader:
    with torch.no_grad():
        model.forward(input_ids=batch_input_ids, attention_mask=batch_attention_mask)
    with FSDP.summon_full_params(model, recurse=False):
        with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
            outputs = unwrapped_model.generate(
                input_ids=batch_input_ids,
                attention_mask=batch_attention_mask,
                tokenizer=tokenizer,
                synced_gpus=True,
            )

I get the same error on transformers 4.42.4 and 4.44.0. However, I just randomly tried 4.35 and that did not hit the same error!


ojh31 commented Aug 13, 2024

git bisect finds that bd5091d is the breaking commit
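A self-contained demo of the `git bisect run` workflow used above, on a throwaway repo (everything here is illustrative: the three commits and the `repro.sh` script stand in for the real transformers history and FSDP repro, which exits non-zero once the bug appears):

```shell
# Demo of the git-bisect workflow on a throwaway repo. repro.sh stands in
# for the real repro script: it exits non-zero once the (simulated) bug
# is present, so `git bisect run` can find the first bad commit.
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

printf 'exit 0\n' > repro.sh && git add repro.sh && git commit -qm "good: works"
printf 'exit 1\n' > repro.sh && git commit -aqm "bad: introduces bug"
git commit --allow-empty -qm "later commit, still bad"

git bisect start HEAD HEAD~2   # bad = HEAD, good = two commits back
git bisect run sh repro.sh     # reports the first bad commit
git bisect reset
```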


SunMarc commented Aug 14, 2024

Wow, thanks for testing and finding the commit that created this issue @ojh31! Do you have any insights regarding this commit and the error that we are facing, @gante? Is this linked to your intuition of why this is not working?

@qiuosier

I am hitting the same issue with DeepSpeed ZeRO-3 on 4.44.0. For me, the mismatch is between the attention_mask (which is still updated after the continue) and the key_states/value_states (updated during forward).
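The DeepSpeed variant of the mismatch follows the same pattern and can be sketched in plain Python (names illustrative, no torch or deepspeed): the attention_mask is extended every step, while the key/value states of a process that stops participating via `continue` stop growing, so a later forward call sees two different sequence lengths.

```python
# Plain-Python sketch of the DeepSpeed ZeRO-3 symptom: attention_mask
# length keeps growing every step, but the KV-cache length of a process
# that early-exits via `continue` stops growing at some step.

def simulate_lengths(num_steps: int, stop_growing_cache_at: int):
    mask_len, cache_len = 1, 1
    for step in range(num_steps):
        mask_len += 1  # attention_mask is extended every generation step
        if step < stop_growing_cache_at:
            cache_len += 1  # key/value states only grow while participating
    return mask_len, cache_len

mask_len, cache_len = simulate_lengths(6, stop_growing_cache_at=3)
assert (mask_len, cache_len) == (7, 4)  # lengths have diverged: shape mismatch
```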


gante commented Sep 2, 2024

(see this comment in the original issue to avoid parallel discussions 🤗 )


gante commented Oct 11, 2024

With #34095, this PR should not be needed 🤗

@gante gante closed this in #34095 Oct 12, 2024

gante commented Oct 12, 2024

(please reopen and ping me if the original issue is not fixed :) )



Development

Successfully merging this pull request may close these issues.

Shape mismatch when generating with multiple processes

5 participants