There are a bunch of MTP PRs, but they don't seem to be making much progress and seem to be EAGLE-3 focused. I'm also watching this diffusion draft approach closely.
For anyone interested in DFlash, please take a look at this PR: #22105. It's now coming to llama.cpp :)
Just saw this new repo on r/LocalLLaMA and it seems promising. They (z-lab) claim a decent speedup even over EAGLE-3, which is still a draft PR here. The idea is basically to use a diffusion draft model instead of an autoregressive one to speed up drafting. The stats vary, of course, but for Qwen3.5-35B-A3B, for example, they claim a 2-2.8x speedup over normal token generation, with an acceptance length that mostly beats even MTP. Other models do even better at times, like Qwen3.5-9B with up to 3.5x, as can be seen in their repo and on Hugging Face. I'm not sure if the devs here think it's worth it right now with EAGLE-3 already having a draft PR, so I've created this discussion just to bring it up (;
The repo is available here, and the already available draft models can be found here.
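
For anyone who hasn't looked at speculative decoding before, here is a rough toy sketch of why drafting a whole block at once can help. To be clear, this is not DFlash's actual code and not anything from llama.cpp: `draft_block`, `target_next`, and the "models" themselves are made-up stand-ins that just count modulo 10, and verification is plain greedy matching. The only point is the control flow: the drafter proposes `block_size` tokens in one shot (as a diffusion drafter would, instead of one token per call), and the target keeps the longest matching prefix plus one token of its own.

```python
from typing import Callable, List, Tuple

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'big' model: greedy next token is simply last + 1 (mod 10)."""
    return (ctx[-1] + 1) % 10


def draft_block(ctx: List[Token], block_size: int) -> List[Token]:
    """Hypothetical block drafter: proposes block_size tokens at once.

    It imitates the target but is deliberately wrong whenever the correct
    token would be 3, so some drafts get rejected.
    """
    block = []
    for i in range(block_size):
        tok = (ctx[-1] + i + 1) % 10
        block.append(0 if tok == 3 else tok)
    return block


def verify_block(prefix: List[Token], draft: List[Token],
                 target: Callable[[List[Token]], Token]) -> List[Token]:
    """Greedy verification of one drafted block.

    A real implementation would score every draft position in a single
    batched target forward pass; the toy calls the target per position.
    """
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in draft:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target(ctx))  # whole block accepted: one bonus token
    return accepted


def speculative_generate(prompt: List[Token], n_new: int,
                         block_size: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return the number of draft/verify rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_new:
        draft = draft_block(out, block_size)
        out.extend(verify_block(out, draft, target_next))
        rounds += 1
    return out[:len(prompt) + n_new], rounds


def plain_generate(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    tokens, rounds = speculative_generate([0], n_new=12, block_size=4)
    print(tokens, "in", rounds, "verify rounds")
```

The output is identical to plain greedy decoding with the target, but it takes fewer verify rounds than tokens generated. How many tokens get accepted per round (the acceptance length) is what the claimed speedups hinge on, together with how cheap the drafter is to run.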