Enable interleaved image-image generation w/ Chameleon & Anole #31919
leloykun wants to merge 19 commits into huggingface:chameleon from
Conversation
ArthurZucker
left a comment
This looks super interesting! I don't think we have any image generation models in transformers, right @amyeroberts?
zucchini-nlp
left a comment
Indeed super cool to see an image generation model in Transformers. Thanks for the PR!
I didn't really go through the changes, but one thing to consider is to move image-postprocessing logic into processors
@ArthurZucker Indeed - this would be a great addition to the library!
@zucchini-nlp This should be handled in a similar way to how we do post-processing with tokenizers: if it involves post processing of images, then the logic should be available with the image processor, but we can then post-process with the processor. For example, most processors will have a
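For illustration, the delegation pattern described above could look roughly like this (toy stand-in classes; all class and method names here are hypothetical, not the actual Chameleon API):

```python
class ToyImageProcessor:
    """Stand-in for an image processor that owns the post-processing logic."""
    def postprocess(self, pixel_values):
        # e.g. un-normalize from [-1, 1] back to [0, 1]
        return [(v + 1.0) / 2.0 for v in pixel_values]

class ToyProcessor:
    """Stand-in for a processor that forwards to its image processor,
    the same way a processor's batch_decode forwards to the tokenizer."""
    def __init__(self, image_processor):
        self.image_processor = image_processor

    def postprocess(self, pixel_values):
        # thin wrapper: the real logic lives on the image processor
        return self.image_processor.postprocess(pixel_values)
```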
Hi @zucchini-nlp & @amyeroberts! I've already moved some of the image-postprocessing steps to the processor. But I left the token -> pixel value decoding step in the model because it uses the vqvae model. This makes the interface a bit weirder though:
What do you think would be a better way to do this?
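To make the "weird" three-step interface concrete, here is a rough sketch with toy stand-in classes (`decode_image_tokens` and the other names are hypothetical, not the PR's actual API):

```python
class ToyVQVAE:
    """Stand-in for the VQ-VAE that maps image-token ids to pixel values."""
    def decode(self, image_token_ids):
        # real model: codebook lookup + conv decoder; here, a trivial mapping
        return [t / 10 for t in image_token_ids]

class ToyModel:
    def __init__(self):
        self.vqvae = ToyVQVAE()

    def generate(self, prompt_ids):
        # pretend the model appends a few image tokens after the prompt
        return list(prompt_ids) + [7, 8, 9]

    def decode_image_tokens(self, image_token_ids):
        # kept in the model because it needs the vqvae weights
        return self.vqvae.decode(image_token_ids)

class ToyProcessor:
    def postprocess_pixel_values(self, pixel_values):
        # moved to the processor: clamp / un-normalize / convert to PIL, etc.
        return [min(max(v, 0.0), 1.0) for v in pixel_values]

model, processor = ToyModel(), ToyProcessor()
ids = model.generate([1, 2, 3])                      # 1) generate token ids
pixels = model.decode_image_tokens(ids[3:])          # 2) VQ-VAE decode
image = processor.postprocess_pixel_values(pixels)   # 3) post-process
```

The awkward part is step 2: it cannot live on the processor because only the model holds the vqvae weights.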
leloykun
left a comment
@amyeroberts @zucchini-nlp @ArthurZucker
This should now be ready for review!
I've also re-implemented Chameleon's FSM to make it more compatible with Transformers & Outlines (for structured generation). Chameleon uses this FSM to dynamically switch between text gen & image gen. And without it, Chameleon's performance pretty much collapses.
I believe the interleaved-text-image mode is closest to the original implementation, and it does seem to work as expected, but the conversion may have introduced bugs. I'd appreciate help testing it further!
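To illustrate the mechanism, here is a minimal sketch of the kind of FSM involved (the sentinel id, the image-token count, and the codebook-id range are all made up for illustration; the real Chameleon FSM in this PR is more involved):

```python
BOI = 100               # hypothetical begin-of-image sentinel id
IMAGE_TOKEN_COUNT = 4   # real Chameleon images use a fixed-length token block

class InterleavedFSM:
    """Tracks whether generation is in text mode or inside an image block."""
    TEXT, IMAGE = "text", "image"

    def __init__(self, image_len=IMAGE_TOKEN_COUNT):
        self.state = self.TEXT
        self.remaining = 0
        self.image_len = image_len

    def is_allowed(self, token_id):
        # inside an image block, only image-codebook ids (here: >= 200) are legal
        if self.state == self.IMAGE:
            return token_id >= 200
        return True

    def step(self, token_id):
        # advance the state after each sampled token
        if self.state == self.TEXT and token_id == BOI:
            self.state, self.remaining = self.IMAGE, self.image_len
        elif self.state == self.IMAGE:
            self.remaining -= 1
            if self.remaining == 0:
                self.state = self.TEXT
```

During decoding, `is_allowed` would be applied as a logits mask before sampling and `step` after each sampled token.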
zucchini-nlp
left a comment
Thanks for working on this, I left a few comments.
The main issue is the FSM-constrained generation. I see that we need it for better generation quality, but I'm wondering if we can do it another way.
Regarding the weird workflow where one needs to generate ids, decode them into pixel values, and then post_process into an image: IMO it can be wrapped in a custom generate() within the modeling code, similarly to what MusicGen does. We could call super().generate(), then check the generation_config.multimodal_mode and apply the needed post-processing steps, which include extracting the image segment and decoding it to pixel values. However, we need to think about how to return image pixels in interleaved mode, as the base GenerateDecoderOnlyOutput doesn't allow it.
@gante here as well; WDYT about it?
Also, we need to add tests for Chameleon as an image generation model 😄
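One piece of the wrapped-generate() idea above is mechanical enough to sketch: locating the image segments in the generated ids so they can be decoded to pixel values afterwards (the boi/eoi ids are placeholders, not Chameleon's actual special-token ids):

```python
def split_image_segments(ids, boi, eoi):
    """Return (start, end) index pairs of token runs between boi/eoi markers."""
    segments, start = [], None
    for i, token in enumerate(ids):
        if token == boi:
            start = i + 1
        elif token == eoi and start is not None:
            segments.append((start, i))
            start = None
    return segments

# A custom generate() could then do, schematically:
#   ids = super().generate(...)
#   if generation_config.multimodal_mode != "text":
#       for start, end in split_image_segments(ids, boi, eoi):
#           pixel_values = self.decode_image_tokens(ids[start:end])
```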
src/transformers/models/chameleon/convert_chameleon_weights_to_hf.py
src/transformers/models/chameleon/image_processing_chameleon.py
We might want to refactor this and the Encoder modules to split them into smaller blocks, following #31534 (comment)
Thanks for the detailed explanation. This actually belongs in transformers/generation/logits_processors.py.
I am not sure about the general design of the FSM, however, as it is a lot of code to maintain and is tailored to the Chameleon case only. I remember we had some discussions about making Outlines part of the library. Pinging @gante on this
Custom generation code is extremely costly to maintain given our current bandwidth, and should only be added for very high-impact models (like Whisper). As such, my suggestion here is to NOT add this generation code to transformers.
We can, however, add it to non-actively maintained parts of the library. For instance, it could go into the examples/research_projects folder, where we can pin a transformers version and don't worry about maintenance. We can also link to that research project on the model card, for further visibility.
If this FSM pattern pops up in more VLMs, we can then add it to the main transformers body.
WDYT? 🤗
We discussed this internally on slack. So, given the maintenance cost of FSM and probably not so wide usage beyond Chameleon, we want to find another way to support interleaved-generation.
@leloykun am I right that without the FSM the generation completely fails, or is it only about the quality of generation? If it's the latter, what's the difference in performance between the two versions?
@zucchini-nlp, unfortunately, it seems that Chameleon can't do interleaved generation by itself without the FSM. It just produces gibberish & can't generate images properly. So yeah, if we remove the FSM, we'll have to turn off interleaved generation.
But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func and preferably w/ a LogitsProcessor for consistency
Just chiming in here, you could implement this specific part in Outlines easily for now. The FSM logic is heavy on maintenance; the only viable alternative would be importing this from Outlines but that’s a longer discussion 🙂
Hmm, in that case it would be better to support only the text-only & image-only modes in the library, and add in the docs that for interleaved generation one can install and use the Outlines processors.
> But for the text-only & image-only modes, we'd just need to mask the logits for the other modality. However, I think we should do this outside of the _forward func and preferably w/ a LogitsProcessor for consistency
Yes, totally agreed. Now that you've mentioned it, I remembered we have this processor that sets some ids to "-inf". We can use that for Chameleon
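The core of that processor's masking step is simple; here is a pure-Python sketch of the idea (not the library's implementation, which operates on torch tensors):

```python
def suppress_tokens(scores, suppressed_ids):
    """Set the scores of the suppressed token ids to -inf so they can't be sampled."""
    out = list(scores)
    for token_id in suppressed_ids:
        out[token_id] = float("-inf")
    return out

# text-only mode: suppress all image-codebook ids;
# image-only mode: suppress all text ids instead.
```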
@leloykun Oops, it seems this PR was opened into the Chameleon branch. I merged the Chameleon PR and the branch was deleted. Can you please reopen it into
What does this PR do?
VQVAE decoder & also includes it in the conversion step.
Links:
(partially) Implements # (issue)
@ArthurZucker @zucchini-nlp @JoyBoy-Su