Universal Speculative Decoding CandidateGenerator#35029
gante merged 112 commits into huggingface:main from
Conversation
Hi @keyboardAnt - I checked out your changes and did the following evaluations to obtain speed-up when using. The following is the summary -
Dataset Used -
Case 1:
Case 2:
Case 3:
A couple of things that remain to be observed -
Running more evaluation with This only occurs when the assistant model and the target model have different tokenizers - at least, that's what I have consistently observed so far. Any idea what could be causing this? I am running these evaluations on a 4xT4 system with about 64GB of memory. The models I am using are
@gauravjain14
@gauravjain14
seems to be related to #22546 (comment)
We have run USD with
@jmamou - I’m replying to your question here. Our evaluation so far has focused solely on single-threaded inference, but I don’t see a strong reason to restrict the code to single-threading. The thread-safe lock implementation helps prevent race conditions during multi-threaded execution and is considered standard. @gante, do you happen to know if there are any multithreading use cases today? I know there was no multiprocessing support as of August 2024 (#32864). If there aren’t any actual use cases, do you think removing this thread-safe locking functionality would make sense?
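For context, the lock-guarded pattern under discussion looks roughly like this. This is a minimal standalone sketch with illustrative names, not the PR's actual code:

```python
import threading

class SharedTranslatorState:
    """Illustrative sketch of lock-guarded shared state; the class and
    method names here are hypothetical, not taken from the PR."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def put(self, key, value):
        # the lock serializes concurrent writers, preventing race conditions
        with self._lock:
            self._cache[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._cache.get(key, default)
```

Dropping the lock is safe as long as the object is only ever used from a single thread, which matches the single-threaded inference evaluated above.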
* remove threading
* fix logits_processor
* fix test device
* move AssistantToTargetTranslator
* fixup
@keyboardAnt In general, we avoid threading in the core library whenever possible. Not to say that these issues are not fixable, but since we lack the capacity to handle existing issues, we're trying to prevent code changes that are likely to cause issues in the future 🤗
@gante
gante
left a comment
Looks good to me :)
Two minor nits and I'm happy to approve and merge:
- See comment below
- Missing an integration test in tests/generation/test_candidate_generator: with temperature set to nearly 0, USD should match vanilla sampling (make sure to use a small model; it can even be a pair of dummy models like hf-internal-testing/tiny-random-gpt2 and hf-internal-testing/tiny-random-MistralForCausalLM, since the actual test doesn't matter)
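The property that suggested test relies on can be illustrated with a plain softmax: as the temperature approaches 0, sampling collapses onto the argmax token, so a sampled USD run should match greedy decoding. A self-contained sketch of the idea (standalone Python, not the transformers test itself):

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    # divide the logits by the temperature; a near-zero temperature makes
    # the distribution collapse onto the argmax token
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.5, 2.0, 3.4]
probs = softmax_with_temperature(logits, temperature=1e-3)

greedy = max(range(len(logits)), key=logits.__getitem__)
sampled = random.choices(range(len(probs)), weights=probs)[0]
assert sampled == greedy  # sampling is effectively deterministic here
```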
gante
left a comment
LGTM, let's make CI green and merge 🤗
@gante
📄 ICML oral (top 1%): Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies, https://arxiv.org/abs/2502.05202 - Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Gaurav Jain, Moshe Wasserblat, David Harel
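The "lossless" in the paper's title refers to the standard speculative-decoding verification rule, which leaves the target distribution exact. A sketch of that rule over the (target) vocabulary - the standard algorithm, not code from this PR:

```python
import random

def verify_draft_token(token, target_probs, draft_probs, rng=random.random):
    """Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the normalized residual max(0, p - q). This keeps the
    output distribution exactly equal to the target model's (lossless)."""
    p, q = target_probs[token], draft_probs[token]
    accept_prob = min(1.0, p / q) if q > 0 else 0.0
    if rng() < accept_prob:
        return token, True  # draft token accepted
    # rejected: resample from the residual distribution max(0, p - q)
    residual = [max(0.0, pi - qi) for pi, qi in zip(target_probs, draft_probs)]
    z = sum(residual)
    weights = [r / z for r in residual] if z > 0 else target_probs
    return random.choices(range(len(target_probs)), weights=weights)[0], False
```

When the draft and target distributions coincide, the acceptance probability is 1 and the drafted token is always kept.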
This PR is a collaborative effort with @jmamou and @gauravjain14. This PR supersedes #34760 and builds upon #35009.
This PR is open for initial review, although some areas are still under development.
What does this PR do?
This PR introduces the UniversalSpeculativeDecodingGenerator class, enabling speculative decoding for assistants with slightly different tokenizers. The key addition is two logits processors (LogitsProcessor) that ensure the assistant generates tokens exclusively from the target vocabulary, maintaining alignment and preserving the target distribution without altering the verification method. Theoretically, it is agnostic to the do_sample choice. This avoids issues like #32867 and #33534 and sets the stage for advanced universal speculative decoding techniques (which we are currently researching and have not yet published).
Motivation and Context
This update resolves prior inconsistencies in speculative decoding caused by misaligned vocabularies. Key benefits include:
This PR is a step toward advancements in Universal Assisted Generation, in collaboration with @danielkorat, @orenpereg, @mosheber, @jmamou, @gante, @lewtun, and @MosheWasserb.
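The vocabulary-alignment idea above - constraining the assistant to tokens that exist in the target vocabulary - can be sketched as a simple logits processor. This is an illustrative standalone version: the `assistant_to_target_ids` mapping and the class name are hypothetical, and the PR's actual processors differ in detail:

```python
NEG_INF = float("-inf")

class TargetVocabularyMask:
    """Suppress assistant tokens that have no counterpart in the target
    vocabulary by setting their logits to -inf before sampling."""

    def __init__(self, assistant_to_target_ids):
        # assistant token ids that can be translated into the target vocab
        self.allowed = set(assistant_to_target_ids)

    def __call__(self, scores):
        return [s if i in self.allowed else NEG_INF
                for i, s in enumerate(scores)]

# usage: only mapped assistant tokens remain candidates for drafting
mask = TargetVocabularyMask({0: 17, 2: 42, 3: 7})
masked = mask([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because unmapped tokens get probability 0 after softmax, every drafted token can be translated to a target-vocabulary id, so the target's verification step runs unchanged.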
Related
Issues:
PRs:
test_generated_length_assisted_generation #34935 - please review and merge
Dependencies
AssistedCandidateGenerator for Improved Modularity and Reusability #35009 first.
Before Submitting Checklist
Who can review?
@gante / @ArthurZucker / @zucchini-nlp