[NVIDIA] disable chunked prefix cache when dp is used by kaixih · Pull Request #9861 · sgl-project/sglang

kaixih · 2025-09-01T08:18:17Z

This PR works around an accuracy issue reported in #9806.

The chunked prefix cache appears not to work correctly with data parallelism (DP), so we disable it for now to recover accuracy.

gemini-code-assist

Summary of Changes

This pull request introduces a targeted fix to resolve an accuracy degradation observed when the chunked prefix cache interacts with data parallelism. By conditionally disabling the cache under these specific circumstances, the system's accuracy is restored, ensuring reliable model performance.

Highlights

Chunked Prefix Cache Disablement: The chunked prefix cache is now disabled when data parallelism (DP) is in use (i.e., dp_size > 1). This change is a workaround to address an accuracy issue identified in issue #9806.

gemini-code-assist

Code Review

This pull request disables the chunked prefix cache when data parallelism is used, which is a workaround for an accuracy issue. The logic is correct. I've suggested a small refactoring to combine the two conditional statements that disable the cache into a single if/elif block. This improves code clarity and conciseness.

gemini-code-assist · 2025-09-01T08:19:26Z

        if not self.use_mla_backend:
            server_args.disable_chunked_prefix_cache = True

+        if self.dp_size > 1:
+            logger.info("Disable chunked prefix cache when dp size > 1.")
+            server_args.disable_chunked_prefix_cache = True


These two if statements both set server_args.disable_chunked_prefix_cache to True. They can be combined into an if/elif structure for better readability and to avoid a redundant assignment.

Suggested change

-        if not self.use_mla_backend:
-            server_args.disable_chunked_prefix_cache = True
-        if self.dp_size > 1:
-            logger.info("Disable chunked prefix cache when dp size > 1.")
-            server_args.disable_chunked_prefix_cache = True
+        if not self.use_mla_backend:
+            server_args.disable_chunked_prefix_cache = True
+        elif self.dp_size > 1:
+            logger.info("Disable chunked prefix cache when dp size > 1.")
+            server_args.disable_chunked_prefix_cache = True
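A minimal sketch comparing the two forms, assuming hypothetical stand-ins (`ServerArgs`, `disable_two_ifs`, `disable_if_elif`, `forms_agree`) rather than the actual sglang classes. Under these assumptions both forms leave the flag with the same final value for every input combination; the one observable difference is that the `elif` form skips the log message when the MLA backend is not in use.

```python
import logging
from dataclasses import dataclass
from itertools import product

logger = logging.getLogger(__name__)


@dataclass
class ServerArgs:
    # Hypothetical stand-in for the real server arguments object.
    disable_chunked_prefix_cache: bool = False


def disable_two_ifs(use_mla_backend: bool, dp_size: int, server_args: ServerArgs) -> bool:
    # Form used in the PR diff: two independent checks.
    if not use_mla_backend:
        server_args.disable_chunked_prefix_cache = True
    if dp_size > 1:
        logger.info("Disable chunked prefix cache when dp size > 1.")
        server_args.disable_chunked_prefix_cache = True
    return server_args.disable_chunked_prefix_cache


def disable_if_elif(use_mla_backend: bool, dp_size: int, server_args: ServerArgs) -> bool:
    # Form suggested in the review: the second check runs only if the first did not match.
    if not use_mla_backend:
        server_args.disable_chunked_prefix_cache = True
    elif dp_size > 1:
        logger.info("Disable chunked prefix cache when dp size > 1.")
        server_args.disable_chunked_prefix_cache = True
    return server_args.disable_chunked_prefix_cache


def forms_agree() -> bool:
    # Compare the final flag value over a small grid of inputs.
    return all(
        disable_two_ifs(mla, dp, ServerArgs()) == disable_if_elif(mla, dp, ServerArgs())
        for mla, dp in product([True, False], [1, 2, 8])
    )
```

Whether the missing log line matters depends on how much the `dp size > 1` message is relied on when diagnosing a non-MLA deployment.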

Fridge003 · 2025-09-01T08:24:04Z

@kaixih The accuracy issue has only been reported on the flashinfer backend, so maybe we should disable the chunked prefix cache only for flashinfer backend + DP. On the fa3 backend it should work as before, so there is no need to disable it for DP.

kaixih · 2025-09-02T02:15:20Z

> @kaixih The accuracy issue has only been reported on the flashinfer backend, so maybe we should disable the chunked prefix cache only for flashinfer backend + DP. On the fa3 backend it should work as before, so there is no need to disable it for DP.

Sure. I tested it and found that the accuracy issue also occurs with trtllm_mla and cutlass_mla, in addition to the default flashinfer backend. I’ve updated the check accordingly.
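The narrowed condition described in this reply can be sketched as follows. This is only an illustration: `AFFECTED_BACKENDS` and `should_disable_chunked_prefix_cache` are hypothetical names, the backend strings are the ones mentioned in this thread, and the actual check in sglang may be structured differently. The merged commit title also mentions Blackwell, which this sketch does not model.

```python
# Backends for which the accuracy issue was reported in this thread.
# The set name and the function below are hypothetical, not sglang APIs.
AFFECTED_BACKENDS = frozenset({"flashinfer", "trtllm_mla", "cutlass_mla"})


def should_disable_chunked_prefix_cache(attention_backend: str, dp_size: int) -> bool:
    """Return True when data parallelism is combined with an affected backend."""
    return dp_size > 1 and attention_backend in AFFECTED_BACKENDS
```

With this shape, fa3 with DP keeps the chunked prefix cache enabled, matching the request in the review comment.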

Fridge003

LGTM

kaixih · 2025-09-04T04:23:07Z

@zhyncs Can we merge?

disable prefix cache when dp is used

ea21d46

kaixih requested review from Ying1123, hnyls2002, ispobock, merrymercy and zhyncs as code owners September 1, 2025 08:18

gemini-code-assist Bot reviewed Sep 1, 2025

kaixih mentioned this pull request Sep 1, 2025

[Bug] Accuracy drop for nvidia/DeepSeek-R1-0528-FP4 with dp attention #9806

Closed

5 tasks

gemini-code-assist Bot reviewed Sep 1, 2025

Fridge003 reviewed Sep 1, 2025

Comment thread python/sglang/srt/model_executor/model_runner.py Outdated

kaixih added 2 commits September 2, 2025 02:08

Address comments

3d2fab7

Lint

f317f98

Fridge003 approved these changes Sep 3, 2025

hnyls2002 merged commit 90dfe3d into sgl-project:main Sep 6, 2025
104 of 113 checks passed

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

[NVIDIA] disable chunked prefix cache when dp and blackwell is used (sgl-project#9861)

402c2b7

wenscarl added a commit to wenscarl/sglang that referenced this pull request Sep 9, 2025

temp fix

5fcba1d

Revert "[NVIDIA] disable chunked prefix cache when dp and blackwell is used (sgl-project#9861)" This reverts commit 90dfe3d.

lifuhuang pushed a commit that referenced this pull request Sep 10, 2025

[NVIDIA] disable chunked prefix cache when dp and blackwell is used (#9861)

7c48535

wenscarl mentioned this pull request Sep 15, 2025

enable prefix cache with dp #10459

Merged

4 tasks

hnyls2002 merged 3 commits into sgl-project:main from kaixih:disable_prefix_cache_for_dp

3 participants
