```python
        return self.forward_decode(positions, hidden_states, kv_cache,
                                   attn_metadata)

    def forward_prefill(
```
Does FlashInfer have a prefill kernel?
Nice job! I wonder how you handle the MLA prefill kernel, since the FlashInfer library only provides an MLA decode kernel, not a prefill one.
@liangzelang this PR performs the regular up-projection to turn MLA into MHA for prefill.
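A minimal sketch of that up-projection idea (all names, shapes, and weights below are illustrative, not taken from this PR): the cached latent `c_kv` is projected back into per-head K/V so a standard MHA prefill kernel can run, while only decode uses the MLA kernel.

```python
import numpy as np

# Illustrative dimensions (hypothetical, much smaller than real models).
num_heads, kv_lora_rank, head_dim, seq_len = 4, 16, 8, 10
rng = np.random.default_rng(0)

# Per-head up-projection weights for keys and values (random stand-ins).
W_uk = rng.standard_normal((num_heads, kv_lora_rank, head_dim))
W_uv = rng.standard_normal((num_heads, kv_lora_rank, head_dim))

def up_project(c_kv):
    """Expand the compressed latent [seq_len, kv_lora_rank] into
    full per-head K and V of shape [num_heads, seq_len, head_dim]."""
    k = np.einsum("sr,hrd->hsd", c_kv, W_uk)
    v = np.einsum("sr,hrd->hsd", c_kv, W_uv)
    return k, v

c_kv = rng.standard_normal((seq_len, kv_lora_rank))
k, v = up_project(c_kv)
print(k.shape, v.shape)  # (4, 10, 8) (4, 10, 8)
```

After this expansion, prefill attention is ordinary multi-head attention over `k` and `v`; the compressed `c_kv` is what actually lives in the KV cache.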
Signed-off-by: simon-mo <simon.mo@hey.com>
Update:
Then it will be ready for review.
I think the accuracy issues of the FlashInfer kernel might be related to this:
Co-authored-by: cennn <2523403608@qq.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Thanks to @cennn, the accuracy issue has been partially identified. We are now at the point where the kernel generates coherent output. However, accuracy is still lower than that of the MHA implementation.
Future work
cc @LucasWilkinson thank you!
-> #12528
Status (12/05/2024):
Currently, I have implemented the MLA KV cache format and utilized FlashInfer's MLA decode kernel, with correct output. Throughput for a sample case already goes from 10.47 rps to 18.5 rps. The PR is still very messy and lacks a proper design, but we have demonstrated space savings and speedup.
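A back-of-the-envelope illustration of where the space savings come from (the dimensions below are illustrative DeepSeek-V2-style numbers, not measurements from this PR): an MHA-style cache stores full per-head K and V per token, while the MLA cache stores only the compressed latent plus a shared decoupled-RoPE key.

```python
# Illustrative per-token, per-layer KV cache element counts.
num_heads = 128
qk_head_dim = 192      # 128 "nope" dims + 64 rope dims per head
v_head_dim = 128
kv_lora_rank = 512     # width of the compressed KV latent
rope_dim = 64          # decoupled RoPE key, shared across heads

mha_elems_per_token = num_heads * (qk_head_dim + v_head_dim)  # 40960
mla_elems_per_token = kv_lora_rank + rope_dim                 # 576

print(mha_elems_per_token / mla_elems_per_token)  # roughly 71x smaller cache
```

Under these assumptions the compressed cache is dramatically smaller per token, which is what enables both the memory savings and the throughput gain from larger batches.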
Some todos:
- Figure out CUDA graph issue. Will just opt it out for now (also turn off chunked prefill).
- `--disable-mla` and `DISABLE_MLA`

Some out of scope:
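For the opt-out todo above, one hedged sketch of how a `--disable-mla` CLI flag and a `DISABLE_MLA` environment variable might gate the MLA path (the flag and variable names come from the todo; the gating logic and function name are my assumptions, not the PR's implementation):

```python
import os

def mla_enabled(disable_mla_flag: bool) -> bool:
    """Hypothetical gate: MLA is off if either the CLI flag
    or the DISABLE_MLA environment variable disables it."""
    if disable_mla_flag:
        return False
    if os.environ.get("DISABLE_MLA", "0").lower() in ("1", "true"):
        return False
    return True

os.environ["DISABLE_MLA"] = "1"
print(mla_enabled(False))  # False: the env var alone disables MLA
```

Having both toggles lets users opt out at launch time (`--disable-mla`) or without touching the command line (`DISABLE_MLA=1`), which is handy while the CUDA graph issue is unresolved.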