Fix prepare + apply by jmamou · Pull Request #7 · keyboardAnt/transformers

jmamou · 2024-12-08T15:41:33Z

No description provided.

jmamou · 2024-12-09T14:37:57Z

The last 2 commits include:

simplify suppress_tokens
refactor AssistantToTargetTranslator to avoid moving tensors to cpu
fix _prepare_assistant_input_ids of USD
fix logits_processors bug: logits_processors was called after sampling assistant token ids.
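
To illustrate the ordering bug in the last item, here is a minimal pure-Python sketch, not the PR's code. Plain lists stand in for logits tensors, a greedy pick stands in for sampling, and the helper names `suppress` and `greedy_pick` are made up for this example. It only shows why a processor that runs after the token is chosen cannot influence the choice.

```python
import math

def suppress(logits, suppressed_ids):
    """Return a copy of `logits` with the given token ids masked to -inf."""
    return [(-math.inf if i in suppressed_ids else x) for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Pick the index of the highest logit (a stand-in for sampling)."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [0.1, 2.5, 1.7, 0.3]  # toy assistant logits over a 4-token vocabulary
suppressed = {1}               # token id 1 must never be proposed

# Buggy order: choose first, process afterwards. The choice ignores the mask.
buggy_token = greedy_pick(logits)
_ = suppress(logits, suppressed)

# Fixed order: process first, then choose from the processed logits.
fixed_token = greedy_pick(suppress(logits, suppressed))
```

With the buggy order the suppressed token is still proposed; with the fixed order the next-best allowed token is chosen instead.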

jmamou · 2024-12-09T14:46:03Z

@gauravjain14 this PR addresses huggingface#35029 (comment)

gauravjain14 · 2024-12-09T17:09:20Z

@jmamou
To try this, do I need to apply the changes on top of #6?

jmamou · 2024-12-09T18:41:32Z

No, just check out the fix_prepare branch.

gauravjain14

Overall, the changes look good to me. I was able to run the failing test cases and they seem to have been resolved with this.

src/transformers/generation/candidate_generator.py

keyboardAnt

Thanks @jmamou! It's good news that @gauravjain14's tests pass for this PR. I added some questions and minor comments, mostly about simplifying the implementation.

src/transformers/generation/utils.py

src/transformers/generation/candidate_generator.py

keyboardAnt · 2024-12-10T19:34:11Z

src/transformers/generation/candidate_generator.py

+            if i > 0:
+                self._prev_assistant_ids = self._prev_assistant_ids[:, :-i]
+            assistant_input_ids = torch.cat([self._prev_assistant_ids, assistant_new_ids], dim=-1)
+        assistant_input_ids = assistant_input_ids.to(torch.int)

According to the documentation, cat operates on arrays of the same type. Wdyt about ensuring that self._prev_assistant_ids and assistant_new_ids are already of torch.int type?

Do you mean adding the following before cat?

self._prev_assistant_ids = self._prev_assistant_ids.to(torch.int)
assistant_new_ids = assistant_new_ids.to(torch.int)

Wdyt about ensuring we only assign torch.int to self._prev_assistant_ids and assistant_new_ids in the first place—so that we never need to cast them into torch.int?

we get all the IDs from the tokenizer and their type is int. Do you think that it is necessary to ensure they are of int type?
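
As a side note on the hunk quoted in this thread, the `if i > 0` guard matters because slicing with `[:-0]` returns an empty sequence rather than the whole one. The sketch below uses plain Python lists in place of tensors. The function name `extend_assistant_ids` and the parameter `num_discard` (playing the role of `i`) are hypothetical, and how `i` is computed in the PR is not shown here.

```python
def extend_assistant_ids(prev_ids, new_ids, num_discard):
    """Drop the last `num_discard` previous ids, then append the new ids.

    The guard is required: prev_ids[:-0] is prev_ids[:0], i.e. an empty list,
    so slicing unconditionally would throw away the whole history.
    """
    if num_discard > 0:
        prev_ids = prev_ids[:-num_discard]
    return prev_ids + new_ids

trimmed = extend_assistant_ids([1, 2, 3, 4], [9], num_discard=2)
untouched = extend_assistant_ids([1, 2, 3, 4], [9], num_discard=0)
pitfall = [1, 2, 3, 4][:-0]  # what an unguarded slice would keep
```

Since the lists here hold Python ints throughout, no cast is needed before concatenation; with tensors, the dtype question raised above still applies.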

src/transformers/generation/candidate_generator.py

keyboardAnt

I'm somewhat puzzled by the target_vocab_size argument. 👀

src/transformers/generation/logits_process.py

src/transformers/generation/candidate_generator.py

keyboardAnt

with model microsoft/Phi-3-medium-128k-instruct
len(target_tokenizer.get_vocab()) = 32011
while config.vocab_size= 32064

Where/why do we set config.vocab_size = 32064 if we know that len(target_tokenizer.get_vocab()) = 32011?

jmamou · 2024-12-12T10:04:02Z

We don't set it. It is part of the model config:
https://huggingface.co/microsoft/Phi-3-medium-128k-instruct/blob/main/config.json#L169

I suppose that some models pad their vocabulary size for efficiency: 32064 is a multiple of 64, which is a power of 2.

Another example: Qwen/Qwen2-0.5B-Instruct

Relevant discussion: https://huggingface.co/microsoft/phi-1_5/discussions/29
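
The arithmetic behind that guess can be checked with a short sketch. Whether the model authors actually padded the vocabulary this way is an assumption; the code only confirms that rounding 32011 up to a multiple of 64 gives 32064. The helper names `pad_to_multiple` and `unmapped_ids` are hypothetical, and the toy vocabulary is invented for illustration.

```python
def pad_to_multiple(size, multiple=64):
    """Round `size` up to the nearest multiple of `multiple`."""
    return -(-size // multiple) * multiple

def unmapped_ids(vocab, config_vocab_size):
    """Ids the model can score but the tokenizer's vocab never maps to."""
    return sorted(set(range(config_vocab_size)) - set(vocab.values()))

# Numbers quoted above for microsoft/Phi-3-medium-128k-instruct.
padded = pad_to_multiple(32011)

# Toy example: a 3-token vocab with a config vocab size of 5.
toy_vocab = {"a": 0, "b": 1, "c": 2}
toy_unmapped = unmapped_ids(toy_vocab, config_vocab_size=5)
```

Ids such as those in `toy_unmapped` have logits but no token string, which is the case the commit "handle self.config.vocab_size > len(target_tokenizer.get_vocab())" has to deal with.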

keyboardAnt

@jmamou, thanks for the clarification about the vocabulary sizes. I’ve added a few comments. My main concern is the suggested breaking change in SuppressTokensLogitsProcessor.

src/transformers/generation/logits_process.py

tests/generation/test_configuration_utils.py

src/transformers/generation/utils.py

src/transformers/generation/candidate_generator.py

jmamou · 2024-12-15T10:13:51Z

The original implementation of SuppressTokensLogitsProcessor was buggy and not optimal. Please explain your concern.

keyboardAnt · 2024-12-15T15:11:11Z

My concern is that such a change might cause failures for users of Hugging Face Transformers who call SuppressTokensLogitsProcessor while expecting the existing API. Changing the API would require these users to adjust their current implementations.

Another option is to extend the API of the existing class without breaking it or to create an entirely new class.

jmamou · 2024-12-15T17:48:07Z

Sorry for the misunderstanding ... I was not aware that SuppressTokensLogitsProcessor was already part of HF transformers.

jmamou · 2024-12-15T18:24:59Z

I opt for the second option of creating a new class.

keyboardAnt · 2024-12-15T18:48:04Z

Sounds good. Bugs in the existing SuppressTokensLogitsProcessor will then no longer be relevant for USD and can be reported to Hugging Face or fixed in separate PRs (not urgent).
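
The "new class" option agreed on here can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the Transformers implementation: `ExistingSuppressor` is a stand-in for a public class whose constructor must not change, and `KeepOnlySuppressor` with its `keep_tokens` and `filter_value` parameters is a hypothetical new class with different semantics.

```python
import math

class ExistingSuppressor:
    """Stand-in for an existing public class; its signature stays untouched."""

    def __init__(self, suppress_tokens):
        self.suppress_tokens = set(suppress_tokens)

    def __call__(self, scores):
        # Mask only the listed token ids.
        return [(-math.inf if i in self.suppress_tokens else s)
                for i, s in enumerate(scores)]

class KeepOnlySuppressor:
    """Hypothetical new class: mask every id that is NOT in `keep_tokens`."""

    def __init__(self, keep_tokens, filter_value=-math.inf):
        self.keep_tokens = set(keep_tokens)
        self.filter_value = filter_value

    def __call__(self, scores):
        return [(s if i in self.keep_tokens else self.filter_value)
                for i, s in enumerate(scores)]

scores = [0.5, 0.6, 0.7]
old_result = ExistingSuppressor([1])(scores)  # existing callers keep working
new_result = KeepOnlySuppressor([1])(scores)  # new behavior lives in a new name
```

Because the old constructor and call signature are unchanged, code written against the existing class is unaffected by the addition.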

gauravjain14 · 2024-12-15T19:54:33Z

What is the expectation on the generation_mode being ASSISTED_GENERATION when speculative decoding with different tokenizers is enabled?

transformers/src/transformers/generation/utils.py

Line 2165 in 1ed1de2

if generation_mode == GenerationMode.ASSISTED_GENERATION:

When I run this script (https://gist.github.com/gauravjain14/19edce088b1f1e7b5dc9ace684e53f8d) with do_sample=True, the first call into generate has generation_mode=GenerationMode.ASSISTED_GENERATION, but subsequent calls have generation_mode=GenerationMode.GREEDY_SEARCH.

Is this expected? @jmamou, @keyboardAnt?

jmamou · 2024-12-16T08:20:34Z

The generation_mode of the target is GenerationMode.ASSISTED_GENERATION, while the generation_mode of the assistant model should be GenerationMode.SAMPLE (do_sample=True) or GenerationMode.GREEDY_SEARCH (do_sample=False).
That is why you get GenerationMode.ASSISTED_GENERATION for the first call to generate (self is the target). For the subsequent generate calls (self is the assistant) you should get GenerationMode.SAMPLE, until the target's generate call for validation.
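
The rule described in this reply can be written down as a small decision function. This is a deliberately simplified sketch: the `Mode` enum and `pick_mode` are hypothetical names, and the real mode selection in the library takes more settings into account (beam search, for example), which are ignored here.

```python
from enum import Enum

class Mode(Enum):
    ASSISTED_GENERATION = "assisted_generation"
    SAMPLE = "sample"
    GREEDY_SEARCH = "greedy_search"

def pick_mode(has_assistant, do_sample):
    """Simplified rule: an assistant model implies assisted generation;
    otherwise do_sample decides between sampling and greedy search."""
    if has_assistant:
        return Mode.ASSISTED_GENERATION
    return Mode.SAMPLE if do_sample else Mode.GREEDY_SEARCH

target_mode = pick_mode(has_assistant=True, do_sample=True)      # self is the target
assistant_mode = pick_mode(has_assistant=False, do_sample=True)  # self is the assistant
greedy_mode = pick_mode(has_assistant=False, do_sample=False)
```

Under this rule, seeing greedy search in the assistant's calls would suggest that do_sample=True did not reach the assistant's generation settings, which is one thing to check in the script above.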

keyboardAnt · 2024-12-16T23:04:35Z

@jmamou, please hit the 'Re-request Review' button when you're ready.

keyboardAnt

LGTM.

keyboardAnt · 2024-12-17T23:03:02Z

@jmamou, it seems like the changes fail the CI tests. Do they pass for you locally?

jmamou · 2024-12-18T12:49:39Z

After resolving the conflicts and fixing the tests, the remaining failing test does not seem to be related to USD: https://app.circleci.com/pipelines/github/huggingface/transformers/113897/workflows/98283892-64b7-4e14-b8a3-8f7da8f9aa61/jobs/1523429?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary



jmamou added 4 commits December 8, 2024 06:16

fix prepare + apply

6097a8d

move to cpu

71562fc

simplity suppress_tokens

3b4e9da

fix bugs and refacatoring

1dcdae4

device move

10d1e56

jmamou requested a review from keyboardAnt December 9, 2024 15:04

gauravjain14 approved these changes Dec 9, 2024


gauravjain14 mentioned this pull request Dec 9, 2024

[WIP] drafting a fix - cropping the kv cache #6

Closed

jmamou added 2 commits December 10, 2024 04:09

handle self.config.vocab_size > len(target_tokenizer.get_vocab())

f9a260f

no need to normalize in candidate_generator

0d3310d

keyboardAnt reviewed Dec 10, 2024


jmamou added 2 commits December 11, 2024 03:12

address Nadav's comments + minor

98cd50b

optimize device move + SuppressTokensLogitsProcessor

8260624

jmamou mentioned this pull request Dec 12, 2024

Add unittests for Universal Assisted generation #8

Merged

keyboardAnt reviewed Dec 12, 2024


keyboardAnt reviewed Dec 12, 2024


jmamou added 5 commits December 12, 2024 04:06

AssistantToTargetTranslator, SuppressTokensLogitsProcessor and tokenizers mapping improvements

ff7977e

padding size

38d81b1

padding improvement

6a7d3b3

fix and simplify get_target_logits

e4e53b9

renaming in get_target_logits

a19a9de

keyboardAnt requested changes Dec 13, 2024


jmamou added 2 commits December 15, 2024 03:35

minor

c4e4186

add filter_value and suppress_tokens_id

0ec0788

style + rename

200f7a0

jmamou added 7 commits December 16, 2024 04:36

remove TODO

95bfa2c

restore original SelectTokensLogitsProcessor with modification

1cbc871

fix style

4a94849

fix _update_past_and_masks and optimize code

f1b6b08

remove assistant_vocab_size arg

df68533

fix attention_mask

35e354a

call _prepare_attention_mask also if not has_past_key_values

a558bd0

jmamou added 2 commits December 17, 2024 01:47

handling attention mask for first generation

5c3ad58

comment

811a4e5

jmamou requested a review from keyboardAnt December 17, 2024 10:33

jmamou added 3 commits December 17, 2024 02:37

restore test

2dcc9ed

remove SelectTokensLogitsProcessor

f2be0da

_update_past_and_masks implementation for USD

83b8250

keyboardAnt approved these changes Dec 17, 2024


keyboardAnt merged commit 9d4d9f9 into usd Dec 17, 2024

keyboardAnt deleted the fix_prepare branch December 17, 2024 22:52
