
[DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer #25438

Merged
LucasWilkinson merged 18 commits into vllm-project:main from gjc0824:dcp-gqa-flashinfer on Nov 14, 2025

Conversation

gjc0824 (Contributor) commented on Sep 23, 2025

Purpose

This PR adds Decode Context Parallel (DCP) support for GQA, following PR #23734 and PR #24864. The current implementation is based on the FlashInfer attention backend.

Before computation, FlashInfer inserts the current query's KV into the cache. Each query then attends to both its own KV and the context KV on the local device, with the log-sum-exp (LSE) used to correct the attention outputs.
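For reference, the standard LSE correction for merging two partial attention results (one computed over the local KV, one over the context KV) can be written as follows; this is the generic formula, not a verbatim excerpt from this PR:

$$
o = \frac{e^{l_1 - m}\, o_1 + e^{l_2 - m}\, o_2}{e^{l_1 - m} + e^{l_2 - m}}, \qquad
l = m + \log\!\left(e^{l_1 - m} + e^{l_2 - m}\right), \qquad
m = \max(l_1, l_2)
$$

where $(o_1, l_1)$ and $(o_2, l_2)$ are the partial outputs with their LSEs; subtracting $m$ keeps the exponentials numerically stable.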

  • In the prefill/partial-prefill stage, a custom mask is added to support the interleaved KV cache with FlashInfer. The example below shows the stored KV layout and the rank-0 mask (a short sketch of the interleaved layout follows after this list).
q_lens = 8, total_lens = 25, group_size = 4, local_rank = 0

stored kv cache
rank0: 0 4 8 12 16 20 24
rank1: 1 5 9 13 17 21
rank2: 2 6 10 14 18 22
rank3: 3 7 11 15 19 23

rank0 custom mask
q\kv    0      4      8     12    16      20      24
17   True,  True,  True,  True,  True,  False, False
18   True,  True,  True,  True,  True,  False, False
19   True,  True,  True,  True,  True,  False, False
20   True,  True,  True,  True,  True,  True,  False
21   True,  True,  True,  True,  True,  True,  False
22   True,  True,  True,  True,  True,  True,  False
23   True,  True,  True,  True,  True,  True,  False
24   True,  True,  True,  True,  True,  True,  True
  • In the decode stage, this PR follows the DCP decode approach from MLA, i.e., all-gather Q and LSE, then correct the attention output before performing the reduce-scatter.
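As a small sketch of the interleaved KV layout used in the example above (illustrative standalone code, not the PR's implementation): rank r stores every dcp_world_size-th token starting at global index r, so its local KV length for a sequence of total_lens tokens is (total_lens - 1 - r) // dcp_world_size + 1 whenever total_lens > r.

```python
import torch

def local_kv_token_ids(total_lens: int, dcp_rank: int, dcp_world_size: int) -> torch.Tensor:
    """Global token indices stored on this DCP rank for one sequence."""
    return torch.arange(dcp_rank, total_lens, dcp_world_size)

def local_kv_len(total_lens: int, dcp_rank: int, dcp_world_size: int) -> int:
    """Number of KV entries held locally by this DCP rank."""
    if total_lens <= dcp_rank:
        return 0
    return (total_lens - 1 - dcp_rank) // dcp_world_size + 1

# Reproduces the stored-KV layout above: total_lens = 25, group_size = 4.
assert local_kv_token_ids(25, 0, 4).tolist() == [0, 4, 8, 12, 16, 20, 24]
assert [local_kv_len(25, r, 4) for r in range(4)] == [7, 6, 6, 6]
```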

Test Plan

Qwen/Qwen3-235B-A22B

export VLLM_ATTENTION_BACKEND='FLASHINFER'
vllm serve Qwen/Qwen3-235B-A22B --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --decode-context-parallel-size 2
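For reference, one way to reproduce the gsm8k numbers below against the served endpoint is lm-eval's OpenAI-compatible local-completions backend; the exact command used by the authors is not shown in the PR, so the flags below are only a plausible example:

lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-235B-A22B,base_url=http://localhost:8000/v1/completions \
  --tasks gsm8k --num_fewshot 5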

Test Result

  • gsm8k eval
dcp=1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8578|±  |0.0068|
|     |       |strict-match    |     5|exact_match|↑  |0.8415|±  |0.0071|

dcp=2
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8613|±  |0.0067|
|     |       |strict-match    |     5|exact_match|↑  |0.8469|±  |0.0070|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist bot left a comment


Code Review

This pull request introduces Decode Context Parallel (DCP) support for Grouped-Query Attention (GQA) with the FlashInfer backend, which is a valuable enhancement for distributed inference performance. The changes are comprehensive, covering configuration validation, modifications to the attention backend to support DCP-specific logic like query head gathering and LSE-based output correction, and the implementation of a custom attention mask for prefills. The addition of tests for a GQA model using the new functionality is also a great inclusion. The overall implementation is well-executed. I have a couple of suggestions to enhance code quality by addressing a dynamically assigned attribute and removing duplicated code.

github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gjc0824 force-pushed the dcp-gqa-flashinfer branch 9 times, most recently from 540c862 to b9e9b41, on September 24, 2025
continue
K = ((rightmost - r) // p) + 1
j = torch.arange(K)
t = torch.arange(Q)
Collaborator:

nit: we generally avoid single-character variable names; they're OK if there is a supporting comment. Can you please add comments explaining what the mask looks like and how it is constructed?

Contributor (author):

Thank you for your review. We have added comments with mask examples and an explanation of the algorithm, along with the vectorization improvements.

torch.int64).tolist()
r = self.dcp_rank
p = self.dcp_world_size
for i in range(num_prefills):
Collaborator:

nit: is there a way we can vectorize this loop or replace it with a Triton kernel? Ideally we avoid Python loops, as they can be very slow and create GPU bubbles.

Contributor (author):

Thank you for your valuable review. We have vectorized the num_prefills loop to avoid GPU bubbles. Looking forward to your further review.

Contributor (author):

if self.dcp_world_size > 1:
    # init custom mask for the interleaved kv cache
    # |-------total_lens----------|
    # |--context_lens--|--q_lens--|
    # Example: dcp_size=2, dcp_rank=0
    # For a SINGLE prefill seq, q_lens=3, total_lens=5
    # k_lens on RANK1 is (5 - 1 - 0) // 2 + 1 = 3
    # mask.shape = [q_lens, k_lens] = [3,3]
    # mask [[True, True, False],
    #       [True, True, False],
    #       [True, True, True]]
    dcp_rank = self.dcp_rank
    dcp_size = self.dcp_world_size

    q_lens = (qo_indptr_cpu[1:] - qo_indptr_cpu[:-1]).to(
            dtype=torch.int64, device=self.device)
    total_lens = seq_lens_cpu[prefill_start:prefill_start +
                num_prefills].to(dtype=torch.int64,
                device=self.device)
    context_lens = total_lens - q_lens
    # max indices for global sequences
    max_indices = total_lens - 1
    # if max_indices are smaller than dcp_rank,
    # current rank has no kv cache, is invalid,
    # the mask is skipped
    valid = (max_indices >= dcp_rank)
    assert torch.any(valid), "There is no valid sequence"

    # local kv lens on current dcp_rank
    k_lens = torch.div(max_indices - dcp_rank, 
                        dcp_size, 
                        rounding_mode="floor") + 1
    k_lens = torch.where(
        valid,
        k_lens,
        torch.zeros_like(k_lens))
    # vectorize operation
    # obtain the max length of all prefill reqs
    max_q = int(q_lens[valid].max().item())
    max_k = int(k_lens[valid].max().item())
    # generate local q and k indices
    q_indices = torch.arange(max_q, device=self.device)
    k_indices = torch.arange(max_k, device=self.device)
    # valid q and k indices of each req
    valid_q = valid[:, None] & \
        (q_indices[None, :] < q_lens[:, None])
    valid_k = valid[:, None] & \
        (k_indices[None, :] < k_lens[:, None])
    # where global q_indices >= global k_indices,
    # the mask is True
    # global q_indices = context_lens + local q_indices
    # global k_indices = local k_indices * dcp_size + dcp_rank
    # ====> local k_indices must be smaller than or equal to k_upper
    # k_upper = (context_lens + local q_indices - dcp_rank) // dcp_size
    k_upper = torch.div(
        context_lens[:, None] + q_indices - dcp_rank,
        dcp_size, rounding_mode="floor")
    k_upper = torch.where(
            valid_q,
            torch.clamp(k_upper, min=-1),
            k_upper.new_full(k_upper.shape, -1))
    mask = (k_indices[None, None, :] <= k_upper[:, :, None]) \
            & (k_upper[:, :, None] >= 0)
    valid_positions = valid_q[:, :, None] & valid_k[:, None, :]
    # flashinfer backend needs flattened format
    custom_mask = torch.masked_select(mask, valid_positions)
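As a quick sanity check of the k_upper formula against the earlier example (total_lens = 25, q_lens = 8, dcp_size = 4, dcp_rank = 0): for the first query row (global index 17, local q index 0, context_lens = 17), k_upper = (17 + 0 - 0) // 4 = 4, so local k indices 0..4 (global tokens 0, 4, 8, 12, 16) are visible, matching the five True entries in that row. A tiny standalone check (illustrative only, not part of the PR):

```python
import torch

dcp_rank, dcp_size = 0, 4
q_lens, total_lens = 8, 25
context_lens = total_lens - q_lens                    # 17
k_len = (total_lens - 1 - dcp_rank) // dcp_size + 1   # 7 local KV tokens on rank 0

q_idx = torch.arange(q_lens)                          # local query indices 0..7
k_idx = torch.arange(k_len)                           # local KV indices 0..6
k_upper = (context_lens + q_idx - dcp_rank) // dcp_size
mask = k_idx[None, :] <= k_upper[:, None]

# First and last rows match the example mask above (5 and 7 visible KV entries).
assert mask[0].sum().item() == 5 and mask[-1].sum().item() == 7
```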

LucasWilkinson (Collaborator):

Apologies for the delayed review! Left a couple of nits; overall it's looking pretty good though.


mergify bot commented Oct 9, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gjc0824.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 9, 2025
gjc0824 (Contributor, author) commented on Oct 14, 2025

> Apologies for the delayed review! Left a couple of nits; overall it's looking pretty good though.

Hi @LucasWilkinson. Could you re-review this PR and give the final sign-off? Thanks!

@gjc0824 gjc0824 reopened this Oct 14, 2025
@github-project-automation github-project-automation bot moved this from Done to To Triage in gpt-oss Issues & Enhancements Oct 14, 2025

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gjc0824.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

LucasWilkinson (Collaborator) left a comment


Apologies for the delay! Overall looks pretty good so far but I think we should land #26696 first (seems more important and this can build on that), thoughts?


self.num_qo_heads = self.model_config.get_num_attention_heads(
self.vllm_config.parallel_config
try:

gjc0824 (Contributor, author) commented on Nov 9, 2025:

Sorry for the delay. We were blocked by #26696, which affects our local KV lengths. Now that #26696 has been merged, we can continue to improve this work.
Compared to the previous commit, we have made significant improvements and adopted an implementation similar to #24864. The main reason for this change is that we found that introducing a custom mask greatly slows down the prefill_wrapper.run() operator (2 ms -> 10 ms when seq_len = 32k). To avoid the custom mask, we split the prefill-stage computation into context and new tokens:

  # |---------- context_len ----------|--- query_len ---|
  # |----------- context -------------|-- new tokens ---|
  • For new tokens, the query can attend to the KV with causal=True without any additional communication within the DCP group.
  • For the context, the KV is distributed across the DCP ranks and no causal mask is required. We follow #24864, i.e., all-gather Q and LSE, then correct the attention output before performing the reduce-scatter (see the numerical sketch below).
    This implementation saves memory by splitting the KV cache across ranks without noticeably impairing performance.
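As a self-contained numerical sketch of why this context/new-token split is exact (illustrative code, not the PR's implementation; the DCP all-gather and reduce-scatter only distribute the context part across ranks, the merge math is the same): attending the queries to the context KV without a causal mask and to the new-token KV with a causal mask, then merging the two partial results with their LSEs, reproduces full causal attention.

```python
import torch

torch.manual_seed(0)
D, ctx_len, q_len = 16, 5, 3
q = torch.randn(q_len, D)
k = torch.randn(ctx_len + q_len, D)
v = torch.randn(ctx_len + q_len, D)

def attn(q, k, v, mask):
    scores = (q @ k.T) / D ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    lse = torch.logsumexp(scores, dim=-1)
    return torch.softmax(scores, dim=-1) @ v, lse

# Reference: full causal attention over context + new tokens.
full_mask = torch.arange(ctx_len + q_len) <= (torch.arange(q_len)[:, None] + ctx_len)
ref, _ = attn(q, k, v, full_mask)

# Part 1: context KV, no causal mask needed (all context precedes every query).
out_ctx, lse_ctx = attn(q, k[:ctx_len], v[:ctx_len],
                        torch.ones(q_len, ctx_len, dtype=torch.bool))
# Part 2: new-token KV, causal mask within the new tokens.
causal = torch.arange(q_len) <= torch.arange(q_len)[:, None]
out_new, lse_new = attn(q, k[ctx_len:], v[ctx_len:], causal)

# LSE merge of the two partial results (same correction as in the formula above).
m = torch.maximum(lse_ctx, lse_new)
w_ctx, w_new = torch.exp(lse_ctx - m), torch.exp(lse_new - m)
merged = (out_ctx * w_ctx[:, None] + out_new * w_new[:, None]) / (w_ctx + w_new)[:, None]

assert torch.allclose(merged, ref, atol=1e-5)
```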

block_table_tensor = common_attn_metadata.block_table_tensor

if self.dcp_world_size > 1:
seq_lens_np = seq_lens_np // self.dcp_world_size + (
Collaborator:

should we land #26696 first and then update this to use the dcp_local_seq_lens computed in the model runner?

gjc0824 (Contributor, author), Nov 9, 2025:

Sure. We can obtain the local seq_lens via get_dcp_local_seq_lens here.

if self.dcp_world_size > 1:
prefill_query = get_dcp_group().all_gather(
prefill_query.contiguous(), dim=1
)
Collaborator:

nit: I guess this is fine but I guess the name "decode context parallel" is falling apart a bit here 😞

Contributor (author):

This issue did exist in the previous implementation, but in the current version the additional DCP communication during the prefill phase only occurs when context tokens are present, so I think the name is still suitable.

],
"bigcode/gpt_bigcode-santacoder": [
CPTestSettings.detailed(),
CPTestSettings.detailed(tp_base=2),
Contributor:

I think it's better to keep the default backend for CI.

Contributor (author):

Sorry for the delay; we were blocked by #26696, which affects our local KV lengths. Now that it has been merged, we can continue to improve this work.
Sure, we should keep the default backend. Thanks!

class CPTestOptions(NamedTuple):
multi_node_only: bool
load_format: str | None = None
attn_backend: str = "FLASH_ATTN"
Contributor:

MLA can't use "FLASH_ATTN" backend, so the default value should not be set.

Contributor (author):

Sure. We have improved it. Thanks for your comment.

gjc0824 and others added 7 commits on November 9, 2025.
fixed_split_size=self.prefill_fixed_split_size,
disable_split_kv=self.disable_split_kv,
)
kv_query_indptr_cpu = qo_indptr_cpu.clone()
Collaborator:

why is a clone needed? can't we just do kv_indptr=qo_indptr_cpu.to(self.device)

Contributor (author):

Yes. It is not needed. We verified that removing it does not affect the model precision.

assert not isinstance(attn_metadata.prefill_wrapper, dict)
attn_metadata.prefill_wrapper.plan(
qo_indptr_cpu.to(self.device),
paged_kv_indptr_cpu.to(self.device),
Collaborator:

Why add .to(self.device)? Last I checked, FlashInfer prefers CPU tensors; otherwise we can get D2H copies in the plan: #21137

Contributor (author):

This was necessary in our last implementation with the custom mask, otherwise a device error would occur in the BatchPrefillWithPagedKVCacheWrapper. Now we can freely remove it.

BatchPrefillWithPagedKVCacheWrapper | BatchPrefillWithRaggedKVCacheWrapper,
]
| None
) = None
Collaborator:

This type signature is complicated and repeated a lot 😞; maybe we could make our own wrapper shim? Like:

class BatchDCPPrefillWrapper:
    self._new_tokens: BatchPrefillWithRaggedKVCacheWrapper
    self._context: BatchPrefillWithPagedKVCacheWrapper

    def plan(...):
        self._new_tokens.plan(...)
        self._context.plan(...)

    def run(...):
        new = self._new_tokens.run(...)
        context = self._context.run(...)
        return merge_attn_states(new, context)

then we can make this prefill_wrapper: BatchPrefillWithPagedKVCacheWrapper | BatchDCPPrefillWrapper

thoughts?

gjc0824 (Contributor, author), Nov 11, 2025:

Thanks for your valuable comment. We added the new BatchDCPPrefillWrapper class for a more concise wrapper. The main improvement is here.

@@ -679,24 +756,61 @@ def build(
attn_metadata.max_q_len_prefill = int(query_lens_prefill.max().item())

if not attn_metadata.prefill_use_trtllm:
Collaborator:

do we need to force prefill_use_trtllm to False when DCP is enabled?

Contributor (author):

TRT-LLM attention cannot return LSE, so it is not supported with DCP. We now disable it directly in vllm/utils/flashinfer.py.


mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gjc0824.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gjc0824 and others added 3 commits on November 11, 2025.
LucasWilkinson (Collaborator) left a comment


Overall it's looking much better, thank you! What is the issue with block interleave_size > 1?

# Decode context parallel is not supported
if dcp_world_size > 1:
logger.warning_once(
"Trtllm not support lse, please use flash attention or FlashInfer backend."
Collaborator:

"Trtllm does not support returning LSE and as a result does not support DCP, reverting to FlashInfer"

Contributor (author):

Thanks! We updated the warning message for a clearer explanation.

pisceskkk (Contributor) commented on Nov 12, 2025

> What is the issue with block interleave_size > 1?

When handling contexts for chunked prefill, we split the contexts into chunks based on the workspace. However, the recent refactoring of the reorg_kvcache function for adapting to interleave_size > 1 overlooked this aspect, resulting in incorrect chunk sizes being extracted. I have submitted a new PR, #28526, to address this issue.

LucasWilkinson (Collaborator) left a comment


LGTM; thanks for the cleanups!

(please resolve conflicts)


mergify bot commented Nov 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gjc0824.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, nvidia, qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge / full CI is needed), v1

Projects

Status: Done

4 participants