```python
        return self.forward_decode(positions, hidden_states, kv_cache,
                                   attn_metadata)

    def forward_prefill(
```
Does FlashInfer have a prefill kernel?
Nice job! I wonder how you handle the MLA prefill kernel, since the FlashInfer library only provides an MLA decode kernel, not a prefill one.
@liangzelang this PR performs the regular up-projection to turn MLA into MHA for prefill.
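A minimal sketch of that up-projection idea (all names, shapes, and weights below are illustrative, not taken from this PR): the cached latent `c_kv` is projected back into per-head K/V so a standard MHA prefill kernel can run, while only decode uses the MLA kernel.

```python
import numpy as np

# Illustrative dimensions (hypothetical, much smaller than real models).
num_heads, kv_lora_rank, head_dim, seq_len = 4, 16, 8, 10
rng = np.random.default_rng(0)

# Per-head up-projection weights for keys and values (random stand-ins).
W_uk = rng.standard_normal((num_heads, kv_lora_rank, head_dim))
W_uv = rng.standard_normal((num_heads, kv_lora_rank, head_dim))

def up_project(c_kv):
    """Expand the compressed latent [seq_len, kv_lora_rank] into
    full per-head K and V of shape [num_heads, seq_len, head_dim]."""
    k = np.einsum("sr,hrd->hsd", c_kv, W_uk)
    v = np.einsum("sr,hrd->hsd", c_kv, W_uv)
    return k, v

c_kv = rng.standard_normal((seq_len, kv_lora_rank))
k, v = up_project(c_kv)
print(k.shape, v.shape)  # (4, 10, 8) (4, 10, 8)
```

After this expansion, prefill attention is ordinary multi-head attention over `k` and `v`; the compressed `c_kv` is what actually lives in the KV cache.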
Signed-off-by: simon-mo <simon.mo@hey.com>
Update:
Then it will be ready for review.
I think the accuracy issues of the FlashInfer kernel might be related to this:
Co-authored-by: cennn <2523403608@qq.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Thanks to @cennn, the accuracy issue has been partially identified. We are now at the point where the kernel generates coherent output. However, accuracy is still lower than that of the MHA implementation.
Future work
cc @LucasWilkinson thank you!
-> #12528
Status (12/05/2024):
Currently, I have implemented the MLA KV cache format and utilized FlashInfer's MLA decode kernel, with correct output. Throughput for a sample case already goes from 10.47 rps to 18.5 rps. The PR is still very messy and lacks a proper design, but we have demonstrated space savings and speedup.
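A back-of-the-envelope illustration of where the space savings come from (the dimensions below are illustrative DeepSeek-V2-style numbers, not measurements from this PR): an MHA-style cache stores full per-head K and V per token, while the MLA cache stores only the compressed latent plus a shared decoupled-RoPE key.

```python
# Illustrative per-token, per-layer KV cache element counts.
num_heads = 128
qk_head_dim = 192      # 128 "nope" dims + 64 rope dims per head
v_head_dim = 128
kv_lora_rank = 512     # width of the compressed KV latent
rope_dim = 64          # decoupled RoPE key, shared across heads

mha_elems_per_token = num_heads * (qk_head_dim + v_head_dim)  # 40960
mla_elems_per_token = kv_lora_rank + rope_dim                 # 576

print(mha_elems_per_token / mla_elems_per_token)  # roughly 71x smaller cache
```

Under these assumptions the compressed cache is dramatically smaller per token, which is what enables both the memory savings and the throughput gain from larger batches.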
Some todos:
- Figure out CUDA graph issue. Will just opt it out for now (also turn off chunked prefill).
- `--disable-mla` and `DISABLE_MLA`

Some out of scope:
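For the opt-out todo above, one hedged sketch of how a `--disable-mla` CLI flag and a `DISABLE_MLA` environment variable might gate the MLA path (the flag and variable names come from the todo; the gating logic and function name are my assumptions, not the PR's implementation):

```python
import os

def mla_enabled(disable_mla_flag: bool) -> bool:
    """Hypothetical gate: MLA is off if either the CLI flag
    or the DISABLE_MLA environment variable disables it."""
    if disable_mla_flag:
        return False
    if os.environ.get("DISABLE_MLA", "0").lower() in ("1", "true"):
        return False
    return True

os.environ["DISABLE_MLA"] = "1"
print(mla_enabled(False))  # False: the env var alone disables MLA
```

Having both toggles lets users opt out at launch time (`--disable-mla`) or without touching the command line (`DISABLE_MLA=1`), which is handy while the CUDA graph issue is unresolved.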