
Enable interleaved image-image generation w/ Chameleon & Anole#31919

Closed
leloykun wants to merge 19 commits into huggingface:chameleon from leloykun:fc--anole

Conversation

@leloykun
Contributor

@leloykun leloykun commented Jul 11, 2024

What does this PR do?

  • Adds modelling for the VQVAE decoder & also includes it in the conversion step.
  • Adds support for decoding the BPE tokens -> discrete image tokens -> pixel values -> PIL object
  • Reimplements Chameleon's FSM to be more compatible with Transformers and Outlines (for structured generation)
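The decode path in the second bullet (BPE tokens -> discrete image tokens -> pixel values) can be pictured with a toy sketch. Everything below (the lookup table, the codebook, the function names) is an illustrative stand-in, not the PR's actual API:

```python
# Toy sketch of the token -> image decoding path, NOT the real Chameleon API.
# Hypothetical mapping from BPE vocabulary ids to VQVAE codebook indices.
BPE_TO_IMG = {513: 0, 514: 1, 515: 2}

def bpe_to_image_tokens(bpe_ids):
    """Map generated BPE ids into the discrete image-token (codebook) space."""
    return [BPE_TO_IMG[i] for i in bpe_ids if i in BPE_TO_IMG]

def decode_to_pixels(image_tokens, codebook):
    """Stand-in for the VQVAE decoder: look each token up in the codebook."""
    return [codebook[t] for t in image_tokens]

# Toy codebook mapping each image token to an RGB value.
codebook = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255)}
pixels = decode_to_pixels(bpe_to_image_tokens([513, 999, 515]), codebook)
# Non-image ids (999 here) are skipped; the result would then be wrapped
# into a PIL Image in the real pipeline.
```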

Links:

(partially) Implements # (issue)

@ArthurZucker @zucchini-nlp @JoyBoy-Su

@leloykun leloykun marked this pull request as draft July 12, 2024 05:58
@leloykun leloykun changed the title Add config for enabling image generation w/ Chameleon & Anole Enabling image generation w/ Chameleon & Anole Jul 12, 2024
@leloykun leloykun changed the title Enabling image generation w/ Chameleon & Anole Enable image generation w/ Chameleon & Anole Jul 12, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment


This looks super interesting! I don't think we have any image generation models in transformers, right @amyeroberts?

Member

@zucchini-nlp zucchini-nlp left a comment


Indeed super cool to see an image generation model in Transformers. Thanks for the PR!

I didn't really go through the changes, but one thing to consider is to move image-postprocessing logic into processors

@amyeroberts
Contributor

@ArthurZucker Indeed - this would be a great addition to the library!

> I didn't really go through the changes, but one thing to consider is to move image-postprocessing logic into processors

@zucchini-nlp This should be handled in a similar way to how we do post-processing with tokenizers: if it involves post processing of images, then the logic should be available with the image processor, but we can then post-process with the processor. For example, most processors will have a batch_decode method.

@leloykun leloykun mentioned this pull request Jul 14, 2024
@leloykun leloykun changed the title Enable image generation w/ Chameleon & Anole Enable interleaved image-image generation w/ Chameleon & Anole Jul 15, 2024
@leloykun leloykun marked this pull request as ready for review July 15, 2024 14:36
@leloykun
Contributor Author

leloykun commented Jul 15, 2024

Hi @zucchini-nlp & @amyeroberts! I've already moved some of the image-postprocessing steps to the processor, but I left the token -> pixel value decoding step in the model because it uses the VQVAE model. This makes the interface a bit weirder, though:

  1. Feed prompt, pixel values, & other inputs to model and get a sequence of tokens
  2. Preprocess tokens into text and image segments
  3. Pass the image segments back to the model to get pixel values
  4. Pass text segments and pixel values to processor to get final results
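The four steps above can be sketched end-to-end with toy stand-ins; the `ToyModel`/`ToyProcessor` classes and their method names below are purely illustrative, not the PR's interface:

```python
# Toy sketch of the four-step interface described above, NOT the real API.

class ToyModel:
    def generate(self, prompt):
        # Pretend generation: strings stand in for text tokens, ints for image tokens.
        return ["a", "cat", 7, 8, "done"]

    def decode_image_tokens(self, segment):
        # Stand-in for the VQVAE token -> pixel-value decoding.
        return [t * 10 for t in segment]

class ToyProcessor:
    def split_segments(self, tokens):
        # Step 2: split the generated sequence into text and image segments.
        text = [t for t in tokens if isinstance(t, str)]
        image = [t for t in tokens if isinstance(t, int)]
        return text, ([image] if image else [])

    def postprocess(self, text_segments, pixel_segments):
        # Step 4: combine text and decoded pixels into the final result.
        return {"text": " ".join(text_segments), "images": pixel_segments}

def generate_interleaved(model, processor, prompt):
    tokens = model.generate(prompt)                                  # step 1
    text, image_segments = processor.split_segments(tokens)          # step 2
    pixels = [model.decode_image_tokens(s) for s in image_segments]  # step 3
    return processor.postprocess(text, pixels)                       # step 4

out = generate_interleaved(ToyModel(), ToyProcessor(), "draw a cat")
```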

What do you guys think would be a better way to do this?

Contributor Author

@leloykun leloykun left a comment


@amyeroberts @zucchini-nlp @ArthurZucker

This should now be ready for review!

Comment on lines 88 to 116
Contributor Author


I've also re-implemented Chameleon's FSM to make it more compatible with Transformers & Outlines (for structured generation). Chameleon uses this FSM to dynamically switch between text gen & image gen. And without it, Chameleon's performance pretty much collapses.

I believe the interleaved-text-image mode is closest to the original implementation & it does seem to work as expected, but the conversion may have introduced bugs. I need help to test it further!
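The mode-switching behavior the FSM provides can be illustrated with a minimal sketch. The token ids and the 4-token image length below are toy values (real Chameleon images span far more tokens), and the class is not the PR's implementation:

```python
# Minimal toy FSM that switches between text and image generation modes.
# BOI/EOI ids and IMAGE_LEN are illustrative, not Chameleon's real values.
BOI = 100          # hypothetical begin-of-image token id
IMAGE_LEN = 4      # toy image length; real images use many more tokens

class InterleavedFSM:
    def __init__(self):
        self.state = "text"
        self.image_tokens_left = 0

    def step(self, token_id):
        """Advance the FSM on each generated token and return the new mode."""
        if self.state == "text" and token_id == BOI:
            # A begin-of-image token flips us into image mode for a fixed span.
            self.state = "image"
            self.image_tokens_left = IMAGE_LEN
        elif self.state == "image":
            self.image_tokens_left -= 1
            if self.image_tokens_left == 0:
                # Image span complete: fall back to text mode.
                self.state = "text"
        return self.state

fsm = InterleavedFSM()
states = [fsm.step(t) for t in [5, BOI, 1, 2, 3, 4, 6]]
```

In generation, the current state would decide which half of the vocabulary gets its logits masked at each step.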

Member

@zucchini-nlp zucchini-nlp left a comment


Thanks for working on this, I left a few comments.

The main issue is the FSM-constrained generation. I see that we need it for better generation quality, but I'm wondering if we can do it another way.

Regarding the weird workflow where one needs to generate ids, decode them into pixel values, and then post-process into an image: IMO it can be wrapped in a custom generate() within the modeling code, similarly to what MusicGen does. We could call super().generate(), then check generation_config.multimodal_mode and apply the needed post-processing steps, which include extracting the image segment and decoding it to pixel values. However, we need to think about how to return image pixels in interleaved mode, as the base GenerateDecoderOnlyOutput doesn't allow it.
@gante here as well, WDYT about it?
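The custom generate() wrapping suggested above might look roughly like this toy sketch. The class, the `multimodal_mode` attribute, the marker ids, and both helper methods are hypothetical stand-ins:

```python
# Toy sketch of wrapping generation + image decoding in a custom generate().
# All names, marker ids (100/101), and return shapes here are hypothetical.

class ChameleonGenerateSketch:
    def __init__(self, multimodal_mode="text-only"):
        self.multimodal_mode = multimodal_mode

    def _base_generate(self, prompt):
        # Stand-in for super().generate(): a fixed token sequence with an
        # image segment delimited by toy BOI (100) / EOI (101) markers.
        return [1, 2, 100, 7, 8, 101]

    def decode_image_tokens(self, segment):
        # Stand-in for decoding image tokens to pixel values via the VQVAE.
        return [t * 2 for t in segment]

    def generate(self, prompt):
        ids = self._base_generate(prompt)
        if self.multimodal_mode == "text-only":
            return {"sequences": ids}
        # Extract the tokens between the BOI/EOI markers and decode them.
        start, end = ids.index(100), ids.index(101)
        pixels = self.decode_image_tokens(ids[start + 1:end])
        return {"sequences": ids, "pixel_values": pixels}

out = ChameleonGenerateSketch("interleaved").generate("hi")
```

The open question from the comment above still applies: a real implementation would need an output class that can carry pixel values alongside the generated sequences.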

Also, we need to add tests for Chameleon as an image generation model 😄

Comment on lines 1027 to 1045
Member


We might want to refactor this and the Encoder modules to split them into smaller blocks, following #31534 (comment)

Comment on lines 1 to 14
Member


Thanks for the detailed explanation. This actually belongs in transformers/generation/logits_process.py.

I am not sure about the general design of the FSM, however, as it is a lot of code to maintain and is tailored to the Chameleon case only. I remember we had some discussions about making outlines part of the library. Pinging @gante on this.

Contributor


Custom generation code is extremely costly to maintain given our current bandwidth, and should only be done in very high impact models (like whisper). As such, my suggestion here is to NOT add this generation code to transformers.

We can, however, add it to non-actively maintained parts of the library. For instance, it could go into the examples/research_projects folder, where we can pin a transformers version and don't worry about maintenance. We can also link to that research project on the model card, for further visibility.

If this FSM pattern pops up in more VLMs, we can then add it to the main transformers body.

WDYT? 🤗

Member


We discussed this internally on Slack. Given the maintenance cost of the FSM and its probably limited usage beyond Chameleon, we want to find another way to support interleaved generation.

@leloykun am I right that without the FSM the generation completely fails, or is it only about the quality of generation? If the latter, what's the difference in performance between the two versions?

Contributor Author


@zucchini-nlp, unfortunately, it seems that Chameleon can't do interleaved generation by itself without the FSM. It just produces gibberish and can't generate images properly. So yes, if we remove the FSM, we'll have to turn off interleaved generation.

But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func, preferably with a LogitsProcessor for consistency.

Contributor


Just chiming in here, you could implement this specific part in Outlines easily for now. The FSM logic is heavy on maintenance; the only viable alternative would be importing this from Outlines but that’s a longer discussion 🙂

Member


Hmm, in that case it would be better to support the text-only & image-only modes in the library, and add in the docs that for interleaved generation one can install and use Outlines processors.

> But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func and preferably w/ a LogitsProcessor for consistency

Yes, totally agreed. Now that you've mentioned it, I remembered we have this processor that sets some ids to "-inf". We can use that for Chameleon

class SuppressTokensLogitsProcessor(LogitsProcessor):
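For illustration, the masking behavior that processor provides boils down to setting the suppressed ids to -inf before sampling. Here is a dependency-free toy version (the id values are hypothetical, and the real processor operates on tensors of scores):

```python
# Toy, list-based version of token suppression: set the scores of the
# suppressed ids to -inf so they can never be sampled. In Chameleon's case,
# the suppressed ids would be the other modality's vocabulary range.

def suppress_tokens(scores, suppress_ids):
    """Return a copy of `scores` with the given token ids masked to -inf."""
    masked = list(scores)
    for i in suppress_ids:
        masked[i] = float("-inf")
    return masked

# e.g. in text-only mode, mask the (hypothetical) image-token ids 2 and 3.
masked = suppress_tokens([0.1, 0.5, 0.9, 0.2], suppress_ids=[2, 3])
```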

@gante
Contributor

gante commented Jul 16, 2024

> However, we need to think about how to return image pixels in interleaved-mode, as the base GenerateDecoderOnlyOutput doesn't allow it. @gante here as well, WDYT about it?

In these cases we usually overload the base class in the modeling file! Example

@zucchini-nlp

@zucchini-nlp zucchini-nlp deleted the branch huggingface:chameleon July 17, 2024 05:41
@zucchini-nlp
Member

@leloykun Oops, it seems this PR targeted the chameleon branch. I merged the Chameleon PR and the branch was deleted.

Can you please reopen it into main?

