add attention_mask and position_ids in assisted model #26892
gante merged 14 commits into huggingface:main from
Conversation
Hi @jiqing-feng 👋 I agree in principle with the changes that you are proposing, but you probably need to make a few changes to get our CI green :)

Hi @gante . I use

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
gante left a comment:
Added a few nits -- after those are addressed, we're ready to merge :)
Hi @gante . Would you please review it again? Thx!
gante left a comment:
Thank you for iterating! 💛
src/transformers/generation/utils.py (outdated)

    else:
        input_ids_len = assistant_inputs["input_ids"].shape[-1]

    if input_ids_len not in (0, 1):

Suggested change:
-    if input_ids_len not in (0, 1):
+    if input_ids_len not in (1, 2):
@jiqing-feng Ah, actually I have two requests before asking for the green light of a core maintainer:
Hi @gante . I tested it on my CPU device since the GPU is unavailable to me. The new branch is a little faster (around 3%) than the main branch. The test script is as follows; feel free to test it on both GPU and CPU.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

prompt = "Speculative decoding is"
checkpoint = "bigscience/bloom-7b1"
assistant_checkpoint = "bigscience/bloom-560m"
device = "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True).to(device)
generation_kwargs = {"do_sample": False, "max_new_tokens": 64, "temperature": 1.0, "top_p": 1.0, "num_beams": 1}
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True).to(device)

for i in range(5):
    start = time.time()
    outputs = model.generate(**inputs, assistant_model=assistant_model, **generation_kwargs)
    end = time.time()
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(f"Assistant decoding latency per token is {(end - start) / new_tokens * 1000} ms")
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
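The per-token latency computation in the loop above can be factored into a small helper. This is a hypothetical convenience function, not part of the PR or of transformers; the name `latency_per_token_ms` is made up for illustration:

```python
import time

def latency_per_token_ms(generate_fn, prompt_len):
    """Time a zero-argument generation callable and return ms per new token.

    generate_fn must return the full generated sequence (prompt included),
    so the new-token count is its length minus prompt_len.
    """
    start = time.time()
    output_len = len(generate_fn())
    elapsed = time.time() - start
    new_tokens = output_len - prompt_len
    return elapsed / new_tokens * 1000.0
```

For the script above one would pass something like `lambda: model.generate(...)[0]` (the first row of the batch, so `len()` gives the sequence length) together with `inputs["input_ids"].shape[-1]`.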
Hi @gante . Could you have a look at this? Thx!
Hi @jiqing-feng Running on my end ( i.e. the newly generated masks that are appended must be created on the same device as the existing mask :)
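The device issue described above can be sketched as follows. This is an assumption about the shape of the fix, not the PR's exact code: when the attention mask is extended by one position per generated token, the new column should be allocated on the existing mask's device instead of a hard-coded one:

```python
import torch

# Stand-in for the existing attention mask (could live on CPU or GPU).
attention_mask = torch.ones(1, 5)

# new_ones() inherits both device and dtype from the existing mask,
# so CPU and GPU runs both work without an explicit device argument.
new_column = attention_mask.new_ones(attention_mask.shape[0], 1)
attention_mask = torch.cat([attention_mask, new_column], dim=-1)
```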
Hi @gante . Would you please try it again? It should be fixed, and I also tested it on an A100; the results and performance are exactly the same. BTW, the failing test seems unrelated to my changes.
@jiqing-feng perfect, all works well on my end. Two related notes:

👉 you will need to rebase your changes to fix both issues, but only after the PR linked above gets merged. You may get minor rebase issues due to 2., but they should be trivial to fix. After that is done, I'll tag a core maintainer for a final quick check :)
Hi @gante . I removed
🤦 my apologies, you're absolutely right. In that case, rebasing to get the CI green is all you need to do. Tagging a core maintainer for a quick final check :) |
gante left a comment:
Good to go, thank you for iterating with me 💛
(Note: results also validated on my end, no slowdown nor generative performance drop)
Hi @gante . I see the PR you mentioned has been merged and my PR is already up to date, but some of the CI jobs are still red.
@jiqing-feng There were some unexpected failures because of new package releases - thankfully not related to this PR! They should now be resolved on main - rebasing should fix them here.
Yes, I meant to add a test to the CI runs. It looks like it should be tested in tests/generation/test_utils.py - but I'll let @gante confirm

(woops, wrong button)
@amyeroberts not sure if we can test this feature reliably: there are no output differences, since assisted generation always outputs what the main model dictates, and this PR only modifies the assistant model's inputs to be more aligned with the main model's. What we should see on average is a higher speedup with masked inputs, as the assistant model will receive the same inputs and thus has a higher chance of matching the main model, but that is far from guaranteed for all calls. A speed test would be very flaky 🤔
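The intuition above can be made concrete with a toy model (my own sketch, not transformers code): if each drafted token independently matches the main model with probability p, and the assistant drafts k tokens per call, the expected number of tokens committed per main-model forward pass is the geometric sum 1 + p + p² + … + pᵏ. Better-aligned assistant inputs raise p, which raises the expected speedup, but any single call can still accept zero draft tokens, which is why a speed test is flaky:

```python
def expected_tokens_per_call(p: float, k: int) -> float:
    """Expected tokens committed per main-model forward pass.

    p: probability that a drafted token matches the main model's choice
    k: number of tokens the assistant drafts per call
    Returns the geometric sum 1 + p + p**2 + ... + p**k.
    """
    return sum(p ** i for i in range(k + 1))

# With no matches we fall back to 1 token per call; with perfect
# matches we commit all k drafts plus the bonus token.
print(expected_tokens_per_call(0.0, 5))  # → 1.0
print(expected_tokens_per_call(1.0, 5))  # → 6.0
```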
@gante I understand - I wasn't clear enough before. Really all I was looking for is to make sure that this can be safely used with different assistant models, i.e. can I pass in a decoder-only model? How about an encoder-decoder one? So not speed or values, just the API.
@amyeroberts we do have Mixin tests (e.g.), so any issue regarding the API should have been caught there :)
@gante Sweet - in that case it's all good 👍 Re the failing tests - there are some PRs due to be merged which should (hopefully, this time) resolve the issues we've been having
amyeroberts left a comment:
Thanks again for adding!
Hi, @gante @amyeroberts . All CI checks are green. I think it is time to merge : )
@jiqing-feng thank you for iterating with us and making
@amyeroberts @jiqing-feng There are currently some unexpected CI failures caused by
Hi, @VsonicV . Sorry for the failed CI. It is weird that I can successfully run pytest in my local repo (which is updated to origin/main). I see that your CI failed at
Hi, @jiqing-feng, thanks for the quick check. Exactly the same thing happened to me: I can run
I submitted a new PR, and all CI passed. Would you apply my PR and see if the CI is ok? Furthermore, it is worth trying to update your repo by merging origin/main and pushing the updates to rerun the CI.
@jiqing-feng Hi, thanks for this prompt fix! I will rebase my PR and re-do the CI checks after your new PR is merged. Fingers crossed!
This PR broke speculative decoding for Whisper, can we maybe revert it for now?
This reverts commit 184f60d.

Issue reported here: https://huggingface.co/openai/whisper-large-v3/discussions/20

…ngface#27523)
* Revert "add attention_mask and position_ids in assisted model (huggingface#26892)"
  This reverts commit 184f60d.
* more debug
Hi @gante
Do you think that we should also add `assistant_attention_mask` and `assistant_position_ids` in `assisted_decoding`? I see that the original model has `attention_mask` and `position_ids` (in most models) in the model inputs, but the assistant model has no such inputs. If you think it is okay to align the inputs of the original model and the assistant model, maybe we can find a more elegant way to integrate it. Thx!
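The alignment the opening question describes can be sketched with the standard pattern for deriving position ids from a padded attention mask (this is the common transformers idiom, shown here as an assumption about what "aligning the inputs" means, not the PR's exact diff):

```python
import torch

# Left-padded batch: positions 0-1 are padding, 2-4 are real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])

# Cumulative sum over real tokens gives 0-based positions; padding
# slots get a harmless placeholder value (1) since they are masked out.
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
print(position_ids)  # → tensor([[1, 1, 0, 1, 2]])
```

Feeding the assistant model the same mask-derived positions as the main model is what gives it a higher chance of drafting the tokens the main model would pick.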