[BugFix] Fix initialization of draft model. #29319
tlrmchlsmth merged 3 commits into vllm-project:main
Conversation
…ulations. Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
Code Review
This pull request addresses a bug related to the initialization of draft models with Mixture of Experts (MoE) layers, ensuring compatibility with the high-throughput backend. The fix involves correctly calling prepare_communication_buffer_for_model on the draft model. My review includes a suggestion to make the condition for this call more robust to prevent potential NoneType errors during model loading.
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru>
dc0f71d to 68cab0e
vllm/v1/worker/gpu_model_runner.py
Outdated
```python
if (drafter := getattr(self, "drafter", None)) and (
    drafter_model := getattr(drafter, "model", None)
):
    prepare_communication_buffer_for_model(drafter_model)
```
I think we need a way to share the All2All state between self.model and the drafter -- this may duplicate state
@bnellnm @varun-sundar-rabindranath could you take a look?
From an offline conversation: the DeepEP buffers will be cached, so this won't involve any extra state for those All2All backends. This might not be the case for the FlashInfer All2Alls -- @pavanimajety, would you know if this causes any overhead in that case?
I think we should go ahead and land this for now to get MTP working on main
We cache the all2all handles for DeepEP high-throughput here.
This, however, hashes on some model and DP/EP properties like hidden_size, num_local_experts, and num_global_experts. This means that if the draft model's properties differ from the base model's, which is likely, we will create a new all2all handle. But this is necessary.
I think this is good to land to unbreak main.
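The caching behavior described above can be sketched with a handle cache keyed on the relevant properties. The function name and key choice here are illustrative assumptions, not vLLM's actual API; in vLLM the cached object would hold real DeepEP buffers:

```python
from functools import lru_cache

# Hypothetical stand-in for the DeepEP handle cache: handles are keyed
# on model/parallel properties, so a draft model whose shapes differ
# from the base model's gets its own handle, while identical shapes
# reuse the cached one.
@lru_cache(maxsize=None)
def get_all2all_handle(hidden_size: int, num_local_experts: int,
                       num_global_experts: int) -> tuple:
    # Real code would allocate communication buffers here; we return a
    # token tuple so cache hits can be observed via object identity.
    return ("handle", hidden_size, num_local_experts, num_global_experts)

base = get_all2all_handle(7168, 8, 256)   # first allocation
same = get_all2all_handle(7168, 8, 256)   # cache hit, same object
draft = get_all2all_handle(2048, 8, 256)  # different key, new handle
```

This matches the comment: preparing the draft model creates one extra handle only when its properties differ from the base model's.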
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Thanks for the fix @halyavin!
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Andrey Khalyavin <halyavin@yandex-team.ru> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
This initialization is needed to make MTP for DeepSeek V3 work with the high-throughput backend.
The DeepSeek V3 draft model also has a MoE layer. The method `DeviceCommunicatorBase.prepare_communication_buffer_for_model` calls the `FusedMoEMethodBase.init_prepare_finalize` method, which in turn sets the `fused_experts` field of the layer. During MTP calculation without this field, the `FusedMoE.forward_impl` method sees that the `using_modular_kernel` property is false, sets the `do_naive_dispatch_combine` flag, and as a consequence calls `get_ep_group().dispatch()`. But the `dispatch` method is not implemented in the `DeepEPHTAll2AllManager` class, which throws an exception.
Calling `prepare_communication_buffer_for_model` on the draft model makes this exception go away and makes MTP work.
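The failure path described above can be sketched as follows. The classes are heavily simplified stand-ins for the real vLLM components (the real `forward_impl` does far more), but the control flow mirrors the description: an unset `fused_experts` field routes the forward pass to naive dispatch, which the high-throughput manager does not implement:

```python
class DeepEPHTAll2AllManager:
    """Simplified stand-in: the HT backend has no naive dispatch."""
    def dispatch(self, *args):
        raise NotImplementedError("naive dispatch not supported")

class FusedMoELayer:
    """Simplified stand-in for FusedMoE."""
    def __init__(self):
        # Set by init_prepare_finalize() via
        # prepare_communication_buffer_for_model(); None until then.
        self.fused_experts = None

    @property
    def using_modular_kernel(self):
        return self.fused_experts is not None

    def forward_impl(self, ep_group):
        if not self.using_modular_kernel:
            # do_naive_dispatch_combine path: raises on the HT backend.
            return ep_group.dispatch()
        return "modular-kernel path"
```

Without the fix, the draft model's layer stays in the `fused_experts is None` state and every MTP step hits the `NotImplementedError`; after preparation, the modular-kernel path is taken instead.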