
Enable interleaved image-image generation w/ Chameleon & Anole#31919

Closed
leloykun wants to merge 19 commits into huggingface:chameleon from leloykun:fc--anole

Conversation

@leloykun
Contributor

@leloykun leloykun commented Jul 11, 2024

What does this PR do?

  • Adds modelling for the VQVAE decoder & also includes it in the conversion step.
  • Adds support for decoding the BPE tokens -> discrete image tokens -> pixel values -> PIL object
  • Reimplements Chameleon's FSM to be more compatible with Transformers and Outlines (for structured generation)
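The decode path in the second bullet (BPE tokens -> discrete image tokens -> pixel values) can be pictured with a toy sketch. Everything below (the lookup table, the codebook, the function names) is an illustrative stand-in, not the PR's actual API:

```python
# Toy sketch of the token -> image decoding path, NOT the real Chameleon API.
# Hypothetical mapping from BPE vocabulary ids to VQVAE codebook indices.
BPE_TO_IMG = {513: 0, 514: 1, 515: 2}

def bpe_to_image_tokens(bpe_ids):
    """Map generated BPE ids into the discrete image-token (codebook) space."""
    return [BPE_TO_IMG[i] for i in bpe_ids if i in BPE_TO_IMG]

def decode_to_pixels(image_tokens, codebook):
    """Stand-in for the VQVAE decoder: look each token up in the codebook."""
    return [codebook[t] for t in image_tokens]

# Toy codebook mapping each image token to an RGB value.
codebook = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255)}
pixels = decode_to_pixels(bpe_to_image_tokens([513, 999, 515]), codebook)
# Non-image ids (999 here) are skipped; the result would then be wrapped
# into a PIL Image in the real pipeline.
```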

Links:

(partially) Implements # (issue)

@ArthurZucker @zucchini-nlp @JoyBoy-Su

@leloykun leloykun marked this pull request as draft July 12, 2024 05:58
@leloykun leloykun changed the title Add config for enabling image generation w/ Chameleon & Anole Enabling image generation w/ Chameleon & Anole Jul 12, 2024
@leloykun leloykun changed the title Enabling image generation w/ Chameleon & Anole Enable image generation w/ Chameleon & Anole Jul 12, 2024
Collaborator

@ArthurZucker ArthurZucker left a comment


This looks super interesting! I don't think we have any image generation models in transformers, right @amyeroberts?

Member

@zucchini-nlp zucchini-nlp left a comment


Indeed super cool to see an image generation model in Transformers. Thanks for the PR!

I didn't really go through the changes, but one thing to consider is to move image-postprocessing logic into processors

@amyeroberts
Contributor

@ArthurZucker Indeed - this would be a great addition to the library!

> I didn't really go through the changes, but one thing to consider is to move image-postprocessing logic into processors

@zucchini-nlp This should be handled in a similar way to how we do post-processing with tokenizers: if it involves post processing of images, then the logic should be available with the image processor, but we can then post-process with the processor. For example, most processors will have a batch_decode method.

@leloykun leloykun mentioned this pull request Jul 14, 2024
@leloykun leloykun changed the title Enable image generation w/ Chameleon & Anole Enable interleaved image-image generation w/ Chameleon & Anole Jul 15, 2024
@leloykun leloykun marked this pull request as ready for review July 15, 2024 14:36
@leloykun
Contributor Author

leloykun commented Jul 15, 2024

Hi @zucchini-nlp & @amyeroberts! I've already moved some of the image-postprocessing steps to the processor, but I left the token -> pixel value decoding step in the model because it uses the VQVAE model. This makes the interface a bit weirder, though:

  1. Feed prompt, pixel values, & other inputs to model and get a sequence of tokens
  2. Preprocess tokens into text and image segments
  3. Pass the image segments back to the model to get pixel values
  4. Pass text segments and pixel values to processor to get final results
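The four steps above can be sketched end-to-end with toy stand-ins; the `ToyModel`/`ToyProcessor` classes and their method names below are purely illustrative, not the PR's interface:

```python
# Toy sketch of the four-step interface described above, NOT the real API.

class ToyModel:
    def generate(self, prompt):
        # Pretend generation: strings stand in for text tokens, ints for image tokens.
        return ["a", "cat", 7, 8, "done"]

    def decode_image_tokens(self, segment):
        # Stand-in for the VQVAE token -> pixel-value decoding.
        return [t * 10 for t in segment]

class ToyProcessor:
    def split_segments(self, tokens):
        # Step 2: split the generated sequence into text and image segments.
        text = [t for t in tokens if isinstance(t, str)]
        image = [t for t in tokens if isinstance(t, int)]
        return text, ([image] if image else [])

    def postprocess(self, text_segments, pixel_segments):
        # Step 4: combine text and decoded pixels into the final result.
        return {"text": " ".join(text_segments), "images": pixel_segments}

def generate_interleaved(model, processor, prompt):
    tokens = model.generate(prompt)                                  # step 1
    text, image_segments = processor.split_segments(tokens)          # step 2
    pixels = [model.decode_image_tokens(s) for s in image_segments]  # step 3
    return processor.postprocess(text, pixels)                       # step 4

out = generate_interleaved(ToyModel(), ToyProcessor(), "draw a cat")
```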

What do you guys think would be a better way to do this?

Contributor Author

@leloykun leloykun left a comment


@amyeroberts @zucchini-nlp @ArthurZucker

This should now be ready for review!

Comment on lines 88 to 116
Contributor Author


I've also re-implemented Chameleon's FSM to make it more compatible with Transformers & Outlines (for structured generation). Chameleon uses this FSM to dynamically switch between text gen & image gen. And without it, Chameleon's performance pretty much collapses.

I believe the interleaved-text-image mode is closest to the original implementation & it does seem to work as expected, but the conversion may have introduced bugs. I need help to test it further!
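The mode-switching behavior the FSM provides can be illustrated with a minimal sketch. The token ids and the 4-token image length below are toy values (real Chameleon images span far more tokens), and the class is not the PR's implementation:

```python
# Minimal toy FSM that switches between text and image generation modes.
# BOI/EOI ids and IMAGE_LEN are illustrative, not Chameleon's real values.
BOI = 100          # hypothetical begin-of-image token id
IMAGE_LEN = 4      # toy image length; real images use many more tokens

class InterleavedFSM:
    def __init__(self):
        self.state = "text"
        self.image_tokens_left = 0

    def step(self, token_id):
        """Advance the FSM on each generated token and return the new mode."""
        if self.state == "text" and token_id == BOI:
            # A begin-of-image token flips us into image mode for a fixed span.
            self.state = "image"
            self.image_tokens_left = IMAGE_LEN
        elif self.state == "image":
            self.image_tokens_left -= 1
            if self.image_tokens_left == 0:
                # Image span complete: fall back to text mode.
                self.state = "text"
        return self.state

fsm = InterleavedFSM()
states = [fsm.step(t) for t in [5, BOI, 1, 2, 3, 4, 6]]
```

In generation, the current state would decide which half of the vocabulary gets its logits masked at each step.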

Member

@zucchini-nlp zucchini-nlp left a comment


Thanks for working on this, I left a few comments.

The main issue is the FSM-constrained generation. I see that we need it for better generation quality, but I'm wondering if we can do it another way.

Regarding the weird workflow where one needs to generate ids, decode them into pixel values, and then post-process into an image: IMO it can be wrapped in a custom generate() within the modeling code, similarly to what MusicGen does. We could call super().generate(), then check generation_config.multimodal_mode and apply the needed post-processing steps, which include extracting the image segment and decoding it to pixel values. However, we need to think about how to return image pixels in interleaved mode, as the base GenerateDecoderOnlyOutput doesn't allow it.
@gante here as well, WDYT about it?
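The custom generate() wrapping suggested above might look roughly like this toy sketch. The class, the `multimodal_mode` attribute, the marker ids, and both helper methods are hypothetical stand-ins:

```python
# Toy sketch of wrapping generation + image decoding in a custom generate().
# All names, marker ids (100/101), and return shapes here are hypothetical.

class ChameleonGenerateSketch:
    def __init__(self, multimodal_mode="text-only"):
        self.multimodal_mode = multimodal_mode

    def _base_generate(self, prompt):
        # Stand-in for super().generate(): a fixed token sequence with an
        # image segment delimited by toy BOI (100) / EOI (101) markers.
        return [1, 2, 100, 7, 8, 101]

    def decode_image_tokens(self, segment):
        # Stand-in for decoding image tokens to pixel values via the VQVAE.
        return [t * 2 for t in segment]

    def generate(self, prompt):
        ids = self._base_generate(prompt)
        if self.multimodal_mode == "text-only":
            return {"sequences": ids}
        # Extract the tokens between the BOI/EOI markers and decode them.
        start, end = ids.index(100), ids.index(101)
        pixels = self.decode_image_tokens(ids[start + 1:end])
        return {"sequences": ids, "pixel_values": pixels}

out = ChameleonGenerateSketch("interleaved").generate("hi")
```

The open question from the comment above still applies: a real implementation would need an output class that can carry pixel values alongside the generated sequences.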

Also, we need to add tests for Chameleon as an image generation model 😄

Comment on lines 1027 to 1045
Member


We might want to refactor this and the Encoder modules to split them into smaller blocks, following #31534 (comment)

Comment on lines 1 to 14
Member


Thanks for the detailed explanation. This actually belongs in transformers/generation/logits_process.py.

I am not sure about the general design of the FSM, however, as it is a lot of code to maintain and is tailored to the Chameleon case only. I remember we had some discussions about making outlines part of the library. Pinging @gante on this.

Contributor


Custom generation code is extremely costly to maintain given our current bandwidth, and should only be done in very high impact models (like whisper). As such, my suggestion here is to NOT add this generation code to transformers.

We can, however, add it to non-actively maintained parts of the library. For instance, it could go into the examples/research_projects folder, where we can pin a transformers version and don't worry about maintenance. We can also link to that research project on the model card, for further visibility.

If this FSM pattern pops up in more VLMs, we can then add it to the main transformers body.

WDYT? 🤗

Member


We discussed this internally on Slack. Given the maintenance cost of the FSM and its probably limited usage beyond Chameleon, we want to find another way to support interleaved generation.

@leloykun am I right that without the FSM the generation completely fails, or is it only about the quality of generation? If the latter, what's the difference in performance between the two versions?

Contributor Author


@zucchini-nlp, unfortunately, it seems that Chameleon can't do interleaved generation by itself without the FSM. It just produces gibberish and can't generate images properly. So yes, if we remove the FSM, we'll have to turn off interleaved generation.

But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func, preferably with a LogitsProcessor for consistency.

Contributor


Just chiming in here, you could implement this specific part in Outlines easily for now. The FSM logic is heavy on maintenance; the only viable alternative would be importing this from Outlines but that’s a longer discussion 🙂

Member


Hmm, in that case it would be better to support the text-only & image-only modes in the library, and add in the docs that for interleaved generation one can install and use Outlines processors.

> But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func and preferably w/ a LogitsProcessor for consistency

Yes, totally agreed. Now that you've mentioned it, I remembered we have this processor that sets some ids to "-inf". We can use that for Chameleon

class SuppressTokensLogitsProcessor(LogitsProcessor):
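For illustration, the masking behavior that processor provides boils down to setting the suppressed ids to -inf before sampling. Here is a dependency-free toy version (the id values are hypothetical, and the real processor operates on tensors of scores):

```python
# Toy, list-based version of token suppression: set the scores of the
# suppressed ids to -inf so they can never be sampled. In Chameleon's case,
# the suppressed ids would be the other modality's vocabulary range.

def suppress_tokens(scores, suppress_ids):
    """Return a copy of `scores` with the given token ids masked to -inf."""
    masked = list(scores)
    for i in suppress_ids:
        masked[i] = float("-inf")
    return masked

# e.g. in text-only mode, mask the (hypothetical) image-token ids 2 and 3.
masked = suppress_tokens([0.1, 0.5, 0.9, 0.2], suppress_ids=[2, 3])
```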

@gante
Contributor

gante commented Jul 16, 2024

> However, we need to think about how to return image pixels in interleaved-mode, as the base GenerateDecoderOnlyOutput doesn't allow it. @gante here as well, WDYT about it?

In these cases we usually overload the base class in the modeling file! Example

@zucchini-nlp

@zucchini-nlp zucchini-nlp deleted the branch huggingface:chameleon July 17, 2024 05:41
@zucchini-nlp
Member

@leloykun Oops, it seems this PR targeted the chameleon branch. I merged the Chameleon PR and the branch was deleted.

Can you please reopen it into main?

