[Feature] Support Decode Context Parallel (DCP) for MLA #23734
youkaichao merged 30 commits into vllm-project:main
Conversation
Code Review
This pull request introduces Context Parallelism (CP) support for MLA inference, which is a significant feature enhancement. The changes are extensive, touching configuration, parallel state management, scheduling, KV cache, and attention backends. The implementation seems well-thought-out, with new CUDA kernels for CP-specific operations and corresponding Python wrappers and tests. The end-to-end tests comparing CP with TP are a good validation strategy.
My review found one critical bug in the cuda_communicator.py file, where a reduce_scatter operation was using a potentially non-contiguous tensor, which could lead to incorrect results. The provided patch correctly fixes this issue by making the tensor contiguous before the collective call. The rest of the implementation for context parallelism appears solid.
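The contiguity pitfall the review refers to can be illustrated in isolation. Collective ops such as reduce_scatter assume a contiguous input buffer, but a sliced or transposed view of a tensor may not be contiguous in memory; calling .contiguous() materializes a safe copy first. This is a standalone sketch of the general pattern, not the actual cuda_communicator.py code:

```python
import torch

# A transposed view shares storage with the original tensor but has a
# non-contiguous memory layout -- unsafe to pass directly to collectives
# like torch.distributed.reduce_scatter.
x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
view = x.t()
print(view.is_contiguous())   # False

# Materialize a contiguous copy before handing it to the collective.
safe = view.contiguous()
print(safe.is_contiguous())   # True
assert torch.equal(safe, view)  # same values, safe memory layout
```

The same values flow through either way; the fix only changes the memory layout handed to the communication kernel.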
youkaichao left a comment
thanks for the great work!
As discussed, there can be two types of CP: CP for prefill (where the world size is enlarged by CP) and CP for decode (where the world size does not change with CP). If possible, let's denote the current PR's option as decode-context-parallel-size / dcp_size to leave room for prefill CP in the future.
@youzhedian to accelerate the review and merge (especially CI testing), maybe we can split the kernel-side changes into a separate PR and get it merged first. Then follow-up PRs can use pre-compiled wheels from that PR, with much faster CI testing.
I've just come across this PR adding
#23791 as suggested
Cool, thanks for taking this on! I think this can be done without any GPU model runner changes; I was working on a prototype but unfortunately it got backburnered for a few months 😞. Anyway, just sharing here as an alternative solution that doesn't require as many changes to the core code but is potentially more susceptible to imbalance (it's not fully functional yet).
…#23734) Signed-off-by: hongchao <hongchao@msh.team> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: hongchao <hongchao@msh.team> Co-authored-by: youkaichao <youkaichao@gmail.com>
Is this PR compatible with #22668? @youzhedian
@youzhedian I noticed that in
Thanks for the contribution! I noticed that the documentation mentions both MLA-CP and MLA-CP-FAST in relation to the V1 Engine. I was wondering: does the current vLLM implementation correspond to MLA-CP or MLA-CP-FAST? From my reading of the code, it appears to be the non-FAST version, specifically because the q_b_proj, kv_b_proj, and o_proj weights are TP-sharded rather than replicated. Could you confirm if my understanding is correct?
This PR adds Decode Context Parallel (DCP) support for MLA inference, fully compatible with chunked prefill and APC.
You can enable DCP with --decode-context-parallel-size / -dcp xxx (only the FlashMLA backend is supported for now). tp_size needs to be divisible by dcp_size: because the world size does not change with DCP, it simply reuses the GPUs of the TP group and splits one TP group into tp_size // dcp_size DCP groups.

This DCP implementation stores the KV cache in an interleaved style: the KV cache for the token whose token_idx is i is always stored on the GPU whose dcp_rank equals i % dcp_world_size.

deepseek-ai/DeepSeek-V2-Lite-Chat gsm8k eval:
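The placement rules above can be sketched in a few lines of plain Python. The function names are illustrative, not vLLM's API, and the assumption that each DCP group is a contiguous run of ranks within the TP group is mine, not confirmed by the PR:

```python
def dcp_rank_for_token(token_idx: int, dcp_world_size: int) -> int:
    """Interleaved KV-cache placement: token i's KV cache lives on
    rank i % dcp_world_size."""
    return token_idx % dcp_world_size


def split_tp_into_dcp_groups(tp_size: int, dcp_size: int) -> list[list[int]]:
    """Split one TP group of tp_size ranks into tp_size // dcp_size
    DCP groups of dcp_size ranks each. Assumes (hypothetically) that
    groups are contiguous runs of ranks."""
    assert tp_size % dcp_size == 0, "tp_size must be divisible by dcp_size"
    return [list(range(start, start + dcp_size))
            for start in range(0, tp_size, dcp_size)]


print(dcp_rank_for_token(5, dcp_world_size=2))        # -> 1
print(split_tp_into_dcp_groups(tp_size=8, dcp_size=2))
# -> [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With tp_size=8 and dcp_size=2, the world size stays at 8 GPUs; the TP group is simply carved into four 2-rank DCP groups, and consecutive tokens alternate between the two ranks of each group.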
For more info, please refer to the introduction Doc.
Future work (these items will be tackled in follow-up PRs; community contributions are warmly welcomed):