[Deepseek v3.2] Support indexer prefill chunking#25999
simon-mo merged 7 commits into vllm-project:main from
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
```python
total_seq_lens = 0
if total_seq_lens > 0:
    chunk_seq_ids.append((reqs_start, len(seq_lens_cpu)))
print("chunk_seq_ids", chunk_seq_ids,
```
self reminder: remove this print
Code Review
This pull request introduces prefill chunking for the Deepseek v3.2 indexer to manage memory usage by splitting large prefill batches. The changes involve refactoring metadata structures to support chunks and adding logic to split prefill requests based on sequence length.
I've found a critical logic issue in the split_prefill_chunks function that could lead to creating chunks larger than the specified limit, which would cause an assertion failure. I've provided a suggested fix for this. Additionally, there's a debug print statement that should be removed.
Regarding your note about not observing chunking in your test run, this is likely because the max_prefill_buffer_size (currently max_model_len * 2) is larger than the total sequence length of your test case. You could temporarily lower this value to verify that the chunking logic is triggered correctly with the fix.
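A back-of-the-envelope check of that explanation (all numbers here are illustrative, not the actual vLLM configuration values):

```python
max_model_len = 4096                         # illustrative value
max_prefill_buffer_size = 2 * max_model_len  # the default used in this PR

seq_lens = [1024, 2048, 512]                 # a modest test batch
total = sum(seq_lens)
print(total > max_prefill_buffer_size)  # chunking does not trigger

# Temporarily lowering the buffer size makes the splitter kick in:
lowered_buffer = 2048
print(total > lowered_buffer)           # chunking would trigger
```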
```python
chunk_seq_ids = []
total_seq_lens = 0
for i in range(reqs_start, len(seq_lens_cpu)):
    total_seq_lens += seq_lens_cpu[i]
    if total_seq_lens > max_prefill_buffer_size:
        chunk_seq_ids.append((reqs_start, i))
        reqs_start = i
        total_seq_lens = 0
if total_seq_lens > 0:
    chunk_seq_ids.append((reqs_start, len(seq_lens_cpu)))
```
There's a critical logic error in the chunk splitting implementation. The current logic resets total_seq_lens to 0 after creating a chunk, but fails to account for the sequence length of the request that triggered the split. This causes the next chunk to potentially include too many requests, exceeding max_prefill_buffer_size and triggering the assertion in build_one_prefill_chunk.
For example, with seq_lens_cpu = [1000, 1000, 1000] and max_prefill_buffer_size = 1500, the first chunk correctly contains only the first request. However, the second chunk would be created with the remaining two requests, having a total length of 2000, which is over the limit.
Here is a corrected implementation that properly handles chunk creation and sequence length accounting:
Current:

```python
chunk_seq_ids = []
total_seq_lens = 0
for i in range(reqs_start, len(seq_lens_cpu)):
    total_seq_lens += seq_lens_cpu[i]
    if total_seq_lens > max_prefill_buffer_size:
        chunk_seq_ids.append((reqs_start, i))
        reqs_start = i
        total_seq_lens = 0
if total_seq_lens > 0:
    chunk_seq_ids.append((reqs_start, len(seq_lens_cpu)))
```

Suggested:

```python
chunk_seq_ids = []
total_seq_lens = 0
chunk_start_idx = reqs_start
for i in range(reqs_start, len(seq_lens_cpu)):
    seq_len = seq_lens_cpu[i]
    if total_seq_lens > 0 and total_seq_lens + seq_len > max_prefill_buffer_size:
        chunk_seq_ids.append((chunk_start_idx, i))
        chunk_start_idx = i
        total_seq_lens = seq_len
    else:
        total_seq_lens += seq_len
if total_seq_lens > 0:
    chunk_seq_ids.append((chunk_start_idx, len(seq_lens_cpu)))
```
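To sanity-check the suggestion, here is a standalone sketch (plain Python lists standing in for the seq_lens_cpu tensor) run on the failing example from the comment above:

```python
def split_prefill_chunks(seq_lens, max_prefill_buffer_size, reqs_start=0):
    # Corrected splitter: close the current chunk *before* adding a
    # request that would overflow it, so the triggering request's
    # length is carried into the next chunk instead of being lost.
    chunk_seq_ids = []
    total_seq_lens = 0
    chunk_start_idx = reqs_start
    for i in range(reqs_start, len(seq_lens)):
        seq_len = seq_lens[i]
        if total_seq_lens > 0 and total_seq_lens + seq_len > max_prefill_buffer_size:
            chunk_seq_ids.append((chunk_start_idx, i))
            chunk_start_idx = i
            total_seq_lens = seq_len
        else:
            total_seq_lens += seq_len
    if total_seq_lens > 0:
        chunk_seq_ids.append((chunk_start_idx, len(seq_lens)))
    return chunk_seq_ids

# seq_lens = [1000, 1000, 1000] with a 1500 buffer: every chunk now
# stays within the limit, whereas the buggy version produced a second
# chunk of 2000 tokens.
print(split_prefill_chunks([1000, 1000, 1000], 1500))  # [(0, 1), (1, 2), (2, 3)]
```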
```python
print("chunk_seq_ids", chunk_seq_ids,
      seq_lens_cpu[chunk_seq_ids[0][0]:])
```
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Code Review
This pull request introduces prefill chunking for the Deepseek v3.2 indexer to manage memory usage by splitting large prefill requests into smaller chunks. The core logic is implemented in vllm/v1/attention/backends/mla/indexer.py with a new split_prefill_chunks function and modifications to the metadata builder.
My review identifies a critical bug in the split_prefill_chunks function that can lead to incorrect behavior with empty chunks, especially when a single request exceeds the buffer size. I've also pointed out a problematic assertion in build_one_prefill_chunk that would cause crashes in such scenarios. I've provided suggestions to fix these issues to ensure the chunking logic is robust.
```python
def split_prefill_chunks(seq_lens_cpu: torch.Tensor,
                         max_prefill_buffer_size: int,
                         reqs_start: int) -> list[tuple[int, int]]:
    """
    Split the prefill chunks into a list of tuples of (reqs_start, reqs_end)
    such that the total sequence length of each chunk is less than the
    maximum prefill buffer size.

    Args:
        seq_lens_cpu: The sequence lengths of the prefill requests.
        max_prefill_buffer_size: The maximum prefill buffer size.
        reqs_start: The start index of the prefill requests.

    Returns:
        A list of tuples of (reqs_start, reqs_end).
    """
    chunk_seq_ids = []
    total_seq_lens = 0
    for i in range(reqs_start, len(seq_lens_cpu)):
        cur_seq_len = seq_lens_cpu[i].item()
        total_seq_lens += cur_seq_len
        if total_seq_lens > max_prefill_buffer_size:
            chunk_seq_ids.append((reqs_start, i))
            reqs_start = i
            total_seq_lens = cur_seq_len
    if total_seq_lens > 0:
        chunk_seq_ids.append((reqs_start, len(seq_lens_cpu)))
    return chunk_seq_ids
```
The current implementation of split_prefill_chunks has a bug when handling requests that are larger than max_prefill_buffer_size. It can lead to creating empty chunks and incorrect chunking, which will cause issues downstream. For example, with seq_lens_cpu=torch.tensor([10]), max_prefill_buffer_size=5, and reqs_start=0, the function returns [(0, 0), (0, 1)], where the first chunk is empty.
A corrected implementation should handle this edge case gracefully, typically by placing the large request in its own chunk. It would also be beneficial to add a test case for this scenario.
Here is a suggested implementation that fixes this issue:
```python
def split_prefill_chunks(seq_lens_cpu: torch.Tensor,
                         max_prefill_buffer_size: int,
                         reqs_start: int) -> list[tuple[int, int]]:
    """
    Split the prefill chunks into a list of tuples of (reqs_start, reqs_end)
    such that the total sequence length of each chunk is less than the
    maximum prefill buffer size.

    Args:
        seq_lens_cpu: The sequence lengths of the prefill requests.
        max_prefill_buffer_size: The maximum prefill buffer size.
        reqs_start: The start index of the prefill requests.

    Returns:
        A list of tuples of (reqs_start, reqs_end).
    """
    chunk_seq_ids = []
    total_seq_lens = 0
    for i in range(reqs_start, len(seq_lens_cpu)):
        cur_seq_len = seq_lens_cpu[i].item()
        if total_seq_lens > 0 and total_seq_lens + cur_seq_len > max_prefill_buffer_size:
            chunk_seq_ids.append((reqs_start, i))
            reqs_start = i
            total_seq_lens = 0
        total_seq_lens += cur_seq_len
    if total_seq_lens > 0:
        chunk_seq_ids.append((reqs_start, len(seq_lens_cpu)))
    return chunk_seq_ids
```
@heheda12345 I think the bot has a point; we should probably assert that there are no empty chunks
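A quick standalone check of this suggestion (plain lists instead of the tensor, with the proposed no-empty-chunk assertion added):

```python
def split_prefill_chunks(seq_lens, max_prefill_buffer_size, reqs_start=0):
    # Suggested fix: only split once the current chunk is non-empty, so
    # a single request larger than the buffer gets its own chunk rather
    # than producing an empty one.
    chunk_seq_ids = []
    total_seq_lens = 0
    for i in range(reqs_start, len(seq_lens)):
        cur_seq_len = seq_lens[i]
        if total_seq_lens > 0 and total_seq_lens + cur_seq_len > max_prefill_buffer_size:
            chunk_seq_ids.append((reqs_start, i))
            reqs_start = i
            total_seq_lens = 0
        total_seq_lens += cur_seq_len
    if total_seq_lens > 0:
        chunk_seq_ids.append((reqs_start, len(seq_lens)))
    assert all(start < end for start, end in chunk_seq_ids), "empty chunk"
    return chunk_seq_ids

# Edge case from the review: a request bigger than the whole buffer.
print(split_prefill_chunks([10], 5))       # oversized request in its own chunk: [(0, 1)]
print(split_prefill_chunks([3, 4, 2], 5))  # [(0, 1), (1, 2), (2, 3)]
```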
```python
token_start = query_start_loc_cpu[reqs_start].item()
token_end = query_start_loc_cpu[reqs_end].item()
total_seq_lens = seq_lens_cpu[reqs_start:reqs_end].sum()
assert total_seq_lens < self.max_prefill_buffer_size
```
This assertion will fail if a single prefill request has a sequence length greater than or equal to max_prefill_buffer_size. The chunking logic should be able to handle such cases by putting the large request into its own chunk, but this assertion prevents that and will cause a crash.
As you noted in the PR description ("I don't have a trace that total_seq_lens < self.max_prefill_buffer_size"), this is indeed a problematic check. It should be removed to allow processing of single large requests.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
chunk_seq_ids = []
total_seq_lens = 0
for i in range(reqs_start, len(seq_lens_cpu)):
    cur_seq_len = seq_lens_cpu[i].item()
    total_seq_lens += cur_seq_len
    if total_seq_lens > max_prefill_buffer_size:
        chunk_seq_ids.append((reqs_start, i))
```
Prevent prefill chunks from equaling buffer size
The new split_prefill_chunks only starts a new chunk when the accumulated length is strictly greater than max_prefill_buffer_size. However, build_one_prefill_chunk asserts total_seq_lens < self.max_prefill_buffer_size, so a chunk whose total length is exactly the buffer size slips through the splitter and immediately triggers the assertion. Two prefill requests with length max_model_len each will produce a chunk of exactly 2 * max_model_len (the configured buffer), causing an assertion failure and request crash. The split predicate should use >= or the assertion should accept equality so that valid workloads do not abort.
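The boundary case is easy to reproduce in isolation. A standalone sketch (plain lists for seq_lens_cpu, an illustrative max_model_len) shows that the two-request workload lands on exactly the buffer size, which only a relaxed `<=` check accepts:

```python
def split_prefill_chunks(seq_lens, max_prefill_buffer_size, reqs_start=0):
    # Splitter as in the PR: a new chunk starts only on strict overflow (>).
    chunk_seq_ids = []
    total_seq_lens = 0
    for i in range(reqs_start, len(seq_lens)):
        total_seq_lens += seq_lens[i]
        if total_seq_lens > max_prefill_buffer_size:
            chunk_seq_ids.append((reqs_start, i))
            reqs_start = i
            total_seq_lens = seq_lens[i]
    if total_seq_lens > 0:
        chunk_seq_ids.append((reqs_start, len(seq_lens)))
    return chunk_seq_ids

max_model_len = 4096            # illustrative value
buffer = 2 * max_model_len      # the configured prefill buffer
seq_lens = [max_model_len, max_model_len]
chunks = split_prefill_chunks(seq_lens, buffer)
for start, end in chunks:
    total = sum(seq_lens[start:end])
    assert total <= buffer  # relaxed check passes; `total < buffer` would not
print(chunks)  # [(0, 2)] -- one chunk of exactly `buffer` tokens
```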
@codex review

To use Codex here, create a Codex account and connect to github.

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def build_one_prefill_chunk(self, reqs_start, reqs_end,
                            query_start_loc_cpu, seq_lens_cpu,
                            block_table):
    prefill_query_start_loc = query_start_loc_cpu[
        reqs_start:reqs_end + 1] - query_start_loc_cpu[reqs_start]
    cu_seqlen_ks, cu_seqlen_ke = kv_spans_from_batches(
        prefill_query_start_loc, seq_lens_cpu[reqs_start:reqs_end],
        self.device)
    token_start = query_start_loc_cpu[reqs_start].item()
    token_end = query_start_loc_cpu[reqs_end].item()
    total_seq_lens = seq_lens_cpu[reqs_start:reqs_end].sum()
    assert total_seq_lens <= self.max_prefill_buffer_size
    cu_seq_lens = torch.cat([
        torch.zeros(1, dtype=torch.int32),
        seq_lens_cpu[reqs_start:reqs_end].cumsum(dim=0)
    ]).to(torch.int32).to(self.device)
    return DeepseekV32IndexerPrefillChunkMetadata(
        cu_seqlen_ks=cu_seqlen_ks,
        cu_seqlen_ke=cu_seqlen_ke,
        cu_seq_lens=cu_seq_lens,
        total_seq_lens=total_seq_lens,
        block_table=block_table[reqs_start:reqs_end],
```
Store prefill chunk length as int, not tensor
build_one_prefill_chunk keeps total_seq_lens as the tensor returned by seq_lens_cpu[reqs_start:reqs_end].sum() and places it in DeepseekV32IndexerPrefillChunkMetadata. The downstream indexer (sparse_attn_indexer in deepseek_v2.py) uses this field to size temporary buffers (torch.empty([chunk.total_seq_lens, …])). Because the metadata now contains a 0‑D tensor rather than a Python int, those allocations will raise TypeError: empty(): argument 'size' must be tuple of ints whenever a prefill chunk is processed. Convert total_seq_lens to an int (e.g. .item()) before storing it.
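The fix is a one-line conversion at metadata-build time. A minimal sketch of the intended behavior (standalone, with made-up sizes rather than the actual vLLM metadata):

```python
import torch

seq_lens_cpu = torch.tensor([3, 5, 2], dtype=torch.int32)
reqs_start, reqs_end = 0, 2

# seq_lens_cpu[...].sum() is a 0-D tensor; convert it to a plain Python
# int with .item() before storing it in the chunk metadata, so that
# downstream buffer allocations receive an int size.
total_seq_lens = int(seq_lens_cpu[reqs_start:reqs_end].sum().item())
assert isinstance(total_seq_lens, int)

head_dim = 4  # made-up size for illustration
buf = torch.empty([total_seq_lens, head_dim])
print(total_seq_lens, tuple(buf.shape))  # 8 (8, 4)
```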
LucasWilkinson left a comment:
Make sense to me! I think this is pretty good solution, nice work!
I think the bot was right about that one comment: an assert could be helpful (even though I know it currently would never be hit, it would be nice to have in case someone mucks around with get_max_prefill_buffer_size).
- vllm-project/vllm#25896
- vllm-project/vllm#27205
- vllm-project/vllm#27204
- vllm-project/vllm#27431
- chat_utils: fix resolve_chat_template_kwargs duplication
- vllm-project/vllm#27556
- vllm-project/vllm#25996
- requirements/rocm.txt: pin triton==3.3.0 (from build requirements)
- Dockerfile*.ubi: bump base image tag to 9.6-1760340988
- Dockerfile*.ubi: pre-download tiktoken tokenizers (o200k_base) (https://issues.redhat.com/browse/INFERENG-2959)
- Dockerfile.ubi: add missing `cuda-cudart-devel` package, required for deepgeemm JITs
- vllm-project/vllm#25999
- vllm-project/vllm#26416

Related: neuralmagic/nm-cicd#313
Purpose
Split the prefill into multiple steps, with each step containing a subset of the prefill requests. With this approach, we can avoid the large intermediate output caused by gathering the KV cache.
Test Plan
20-shot GSM8K.
Test Result
No crashes, and I verified that some chunking occurred.