Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@add_start_docstrings(
    "The Emu3 Text Model which consists of transformer with self attention layers.",
    EMU3_START_DOCSTRING,
)
class Emu3TextModel(Emu3PreTrainedModel):
    config_class = Emu3TextConfig
Adding LlamaModel to bases messes up the auto-generated modeling file by adding new classes like Emu3TextAttention and so on, while we have Emu3Attention
I think this is ready for review. @ArthurZucker, will you be reviewing, or is there anyone I can tag for an initial review? Btw, the repo consistency tests will fail because the modular doesn't import

You can tag @Cyrilvallez!
Cyrilvallez
left a comment
Thanks a lot, great work! With the new modular version #34487, I think we can still improve a bit! Should be merged very soon, but this is already very nice imo if you don't want to wait 🤗
ArthurZucker
left a comment
Waiting for the updates regarding @Cyrilvallez 's PR, will review again once updated
heh, is something wrong with code owners?

Yeah, it seems like it automatically tags all code owners depending on the files touched/created... @ArthurZucker, it would be nice not to tag that many people at once.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
ArthurZucker
left a comment
Nice, thanks for iterating! My only comment is that I have not personally looked closely enough at MIMI or the VQVAE from Chameleon; you would know better, but the more standard the better!
A few nits but good to go IMO.
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
It would be nice to have some expected outputs!
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Let's merge 🚀
* model can convert to HF and be loaded back
* nit
* works in single batch generation but hallucinates
* use the image tokens
* add image generation
* now it works
* add tests
* update
* add modulare but it doesn't work for porting docstring :(
* skip some tests
* add slow tests
* modular removed the import?
* guess this works
* update
* update
* fix copies
* fix test
* fix copies
* update
* docs
* fix tests
* last fix tests?
* pls
* repo consistency
* more style
* style
* remove file
* address comments
* tiny bits
* update after the new modular
* fix tests
* add one more cond in check attributes
* decompose down/up/mid blocks
* allow static cache generation in VLMs
* nit
* fix copies
* Update docs/source/en/model_doc/emu3.md (Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>)
* fix VAE upsampling
* Update src/transformers/models/emu3/modular_emu3.py (Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>)
* address comments
* state overwritten stuff explicitly
* fix copies
* add the flag for flex attn

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
What does this PR do?
As per the title. The code can generate text in single-batch scenarios, but the generated text doesn't match the input image. For batched generation, it seems the original implementation doesn't support it either, mostly because image features from the processor are returned with different shapes (a smart resize to preserve as much of the original image size as possible). We can try padding similar to llava-next, but I am not sure it will just work; I'll contact the authors.
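As a rough illustration of the llava-next-style padding idea mentioned above, here is a minimal sketch with made-up data; `pad_image_features` is a hypothetical helper, not part of the actual processor API:

```python
# Hypothetical sketch: pad variable-length image-feature sequences to a common
# length so they can be stacked into one batch, plus a mask marking real rows.
# Plain Python lists stand in for tensors; a real implementation would use torch.

def pad_image_features(features, pad_value=0.0):
    """features: list of [seq_len x hidden] nested lists; returns (padded, mask)."""
    max_len = max(len(f) for f in features)
    hidden = len(features[0][0])
    padded, mask = [], []
    for f in features:
        pad_rows = max_len - len(f)
        # append pad_rows rows of pad_value so every sequence has max_len rows
        padded.append(f + [[pad_value] * hidden] * pad_rows)
        # 1 marks a real feature row, 0 marks padding
        mask.append([1] * len(f) + [0] * pad_rows)
    return padded, mask

# two images whose smart-resized features have different sequence lengths
feats = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]]
padded, mask = pad_image_features(feats)
# every padded sequence now has length 3; mask[0] == [1, 0, 0]
```

Whether the padded positions can simply be masked out of attention, or need special handling in the VQ token stream, is exactly the open question above.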
TODO:
* rename `extra-0` to smth like `<image>`

And for image generation: