Prepend bos token to Blip generations #29642

Merged
gante merged 5 commits into huggingface:main from zucchini-nlp:gen_max_length
Mar 21, 2024
Conversation

@zucchini-nlp
Member

What does this PR do?

Fixes a failing test reported in our internal Slack. In the previous PR #28994, we removed BOS initialization when embeds are passed as input to generate, and initialized with an empty placeholder instead. As a temporary workaround, I suggest artificially prepending the BOS token after generation.
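As a minimal, framework-free sketch of that workaround (the function name is illustrative, not a transformers API; plain lists stand in for token-id tensors):

```python
# Illustrative sketch of the workaround described above: after generate()
# returns sequences whose BOS was replaced by an empty placeholder, prepend
# the BOS token id to every returned sequence. Names are hypothetical.
def prepend_bos(sequences, bos_token_id):
    """Prepend bos_token_id to each generated sequence (lists of token ids)."""
    return [[bos_token_id] + list(seq) for seq in sequences]

generated = [[5, 17, 2], [9, 4, 2]]
print(prepend_bos(generated, bos_token_id=0))  # [[0, 5, 17, 2], [0, 9, 4, 2]]
```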

For the long term, I suggest we reconsider how we handle multimodal LLMs in generation, since we will likely have more of these models in the future. I analyzed the models we have in more detail, and what I want to do is use the current Llava-style workflow for generation, but with one change. We will need a method similar to "prepare_inputs_for_generation", responsible for taking inputs in all modalities and concatenating them in the correct way. The method would be called in generate while preparing model inputs. I think calling "concat_modalities" from within generate would make it easier for us:

  • not to worry about mismatched attention and input lengths or lost padded attention masks, since "concat" will already give us the correct multimodal length. All we do later is simply append one more id per new token
  • to do assisted decoding with this workflow, which I have been trying to make work for a while
  • to always have the prompt as part of the generated text, which is intuitive for decoder-only models. I guess all VLMs can be thought of as decoder-only
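The proposed hook could be sketched roughly like this (pure Python, with lists standing in for embedding tensors; the name "concat_modalities" and its signature come only from the idea above, not from an existing transformers method):

```python
# Rough sketch of the proposed "concat_modalities" hook: merge per-modality
# embeddings and attention masks so generate() works on one sequence of the
# correct multimodal length. Lists stand in for embedding tensors here.
def concat_modalities(image_embeds, text_embeds, text_attention_mask):
    embeds = list(image_embeds) + list(text_embeds)  # image tokens first
    # image positions are never padding, so their mask entries are all 1
    attention_mask = [1] * len(image_embeds) + list(text_attention_mask)
    return embeds, attention_mask
```

With the concatenation done once up front, generate() only needs to append one more position per new token, with no modality-specific length bookkeeping.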

I made a board for visual explanation here and a proof-of-concept that it works here.

@gante I want your thoughts on this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@gante gante left a comment

Regarding the diff in this PR: it can work as a temporary post-generation fix, if we decrement all set length arguments (max_length, max_new_tokens, min_length, min_new_tokens) by 1 to ensure the final length stays consistent with the user's parameterization
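That bookkeeping could look like this (a hypothetical helper, not part of the PR's diff):

```python
# Sketch of the length bookkeeping described above: if a BOS token is
# prepended after generation, decrement every user-set length argument by 1
# so the final sequence length matches the user's parameterization.
LENGTH_KEYS = ("max_length", "max_new_tokens", "min_length", "min_new_tokens")

def adjust_length_kwargs(generation_kwargs):
    for key in LENGTH_KEYS:
        if generation_kwargs.get(key) is not None:
            generation_kwargs[key] -= 1
    return generation_kwargs

print(adjust_length_kwargs({"max_length": 25, "min_new_tokens": 5}))
# {'max_length': 24, 'min_new_tokens': 4}
```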

Comment on lines +872 to +873
# max_length for BLIP includes prompt length, so we need more than 20 to reach EOS here
predictions = model.generate(**inputs, max_length=25)
Contributor

Let's use the max_new_tokens argument, which prevents prompt length-based issues :)
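To illustrate the difference with a toy calculation (not transformers code): max_length budgets the prompt plus new tokens, while max_new_tokens budgets the new tokens directly.

```python
# Toy illustration of why max_new_tokens avoids prompt-length issues:
# with max_length, the room left for new tokens shrinks as the prompt grows.
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    if max_new_tokens is not None:
        return max_new_tokens
    return max(max_length - prompt_len, 0)

print(new_token_budget(5, max_length=25))       # 20
print(new_token_budget(15, max_length=25))      # 10 -- shrinks with the prompt
print(new_token_budget(15, max_new_tokens=20))  # 20 -- prompt-independent
```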

Member Author

@zucchini-nlp zucchini-nlp Mar 15, 2024

This uncovered a new problem with BLIP 🙃 Apparently the code was not working properly with max_new_tokens.

That was fixed in PR #29585 , so this should be merged only after #29585 is merged and main is rebased.

EDIT: turns out it was a slow test, so the CI is green now

@gante
Contributor

gante commented Mar 14, 2024

Regarding the generate changes for VLMs: we actually should go beyond that! The key idea is to rewrite generate in a way that it becomes more composable, with a few blocks expected to be overwritten at a model level (like the function you're proposing). Let's chat about that offline, to figure out the requirements and make a plan, so we can open a GH issue with the roadmap for it :)

Contributor

@gante gante left a comment

LGTM 👍

(ping me again when this PR's dependency gets merged/approved, so we can merge both)

@gante gante requested a review from amyeroberts March 18, 2024 15:15
Contributor

@amyeroberts amyeroberts left a comment

Thanks for digging into this and fixing!

The only requests before merging are:

  • Run slow model tests
  • Make sure there are generation tests which cover this behaviour. Just skimming test_modeling_instructblip.py, it doesn't look like GenerationTesterMixin is used

zucchini-nlp and others added 2 commits March 19, 2024 13:26
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@zucchini-nlp
Member Author

@amyeroberts

  1. Yes, all BLIP tests are passing now.
  2. Do you mean a test for the BOS token? Maybe we do not really need to check for that, since this PR is actually fixing a failing test. That means there is a test, even though it doesn't check the BOS explicitly.

@amyeroberts
Contributor

@zucchini-nlp

  • OK, just to confirm, that's for BLIP-2 and InstructBLIP? It would be surprising if there's no effect on the integration tests for BLIP-2, as this is adding a new behaviour.
  • No, not necessarily a specific BOS test (although that would be nice). What I meant is to make sure generation behaviour is as expected. For each model, there's specific integration tests which call .generate, but the GenerationTesterMixin isn't used, which I'm concerned means we don't have proper coverage

@zucchini-nlp
Member Author

@amyeroberts

  1. Yes, only these two models. It is not adding new behavior: it was the default behavior for BLIP before a previous PR broke it, apparently unnoticed because we did not run slow tests on VLMs. So this PR fixes it.
  2. Oh, I see now. Agreed, I do not know why it was not added. I will check whether it works with GenerationTesterMixin, and otherwise write custom tests.

@zucchini-nlp
Member Author

I added GenerationTesterMixin for BLIP. But it turns out none of the VLMs had tests for generation, because they take both text and pixel values as input, while the Mixin is not ready for that. Fortunately, BLIP is running correctly now, because BLIP can generate from text only if no image is passed.

@gante , I think we can update generation tests more, to support model kwargs as input. Then check that all generative models have the GenerationTesterMixin. I can do it in a separate PR 😄

@gante
Contributor

gante commented Mar 20, 2024

@gante , I think we can update generation tests more, to support model kwargs as input. Then check that all generative models have the GenerationTesterMixin. I can do it in a separate PR 😄

👍 Instead of the GenerationTesterMixin's current _get_input_ids_and_config, we could have an input preparation function like the ModelTesterMixin's _prepare_for_class -- depending on the auto class mapping, prepare the correct inputs for the model, to then feed to generate. Looking forward to that rework @zucchini-nlp !
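A rough sketch of that dispatch idea (the names below are hypothetical, not transformers APIs): prepare the generate() inputs according to the model's category, the way _prepare_for_class adapts test inputs per auto class.

```python
# Hypothetical sketch of the suggested input-preparation helper: pick the
# generate() inputs a model needs based on its (auto-class-like) category,
# instead of assuming text-only inputs. Names here are illustrative.
def prepare_generation_inputs(model_category, inputs_dict):
    if model_category == "vision2seq":  # VLMs also need pixel values
        keys = ("input_ids", "attention_mask", "pixel_values")
    else:  # text-only decoder models
        keys = ("input_ids", "attention_mask")
    return {k: inputs_dict[k] for k in keys if k in inputs_dict}
```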

Contributor

@amyeroberts amyeroberts left a comment

Thanks for working on this fix!

Multimodal generation tests sound great ❤️ Looking forward to the future PR!

@gante gante merged commit b469ebc into huggingface:main Mar 21, 2024