Enable IndexCache for DeepSeek V3.2 #21405
Conversation
Summary of Changes (Gemini Code Assist): This pull request integrates the IndexCache mechanism into DeepSeek V3.2 models within the SGLang framework. The primary goal is to enhance inference performance by skipping redundant top-k index computations in attention layers, using a technique ported from THUDM. This optimization yields notable improvements in throughput and latency while maintaining accuracy, and is supported by new unit tests covering different distributed processing configurations.
Code Review
This pull request introduces an "IndexCache" mechanism for the DeepSeekV2 model, enabling certain attention layers to reuse topk_indices from previous layers instead of recomputing them. This behavior is controlled by skip_topk and next_skip_topk flags, configurable via index_topk_freq or index_topk_pattern in the model's configuration. The changes involve modifying attention forward methods to accept and return topk_indices, updating decoder layers to propagate these indices, and adding a new test file to validate the functionality and performance of the IndexCache across different configurations. A minor issue was noted regarding a typo in an arXiv reference year within the deepseek_v2.py file.
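The per-layer skip schedule implied by index_topk_freq can be sketched as follows. This is a minimal illustration reusing the flag names from the summary above; the helper itself is hypothetical, not the actual SGLang implementation:

```python
def build_skip_topk_flags(num_layers: int, index_topk_freq: int) -> list:
    """Return per-layer flags: False = compute topk_indices in this layer,
    True = reuse the indices cached by the most recent computing layer."""
    flags = []
    for layer_idx in range(num_layers):
        # Every index_topk_freq-th layer recomputes; the rest skip.
        flags.append(layer_idx % index_topk_freq != 0)
    return flags

# Example: with freq=3, layers 0, 3, 6 compute; the others reuse the cache.
print(build_skip_topk_flags(8, 3))
# → [False, True, True, False, True, True, False, True]
```

An index_topk_pattern option, as described in the review, would generalize this by letting the config spell out the compute/skip flags per layer explicitly instead of deriving them from a fixed frequency.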
Thanks for the support! Will review it soon.
Hi, could you also provide results at traffic request rates of 1, 5, 10, and 20? Thanks!
I adapted this codebase for NPU, and also validated the corresponding accuracy. The CEval accuracy score is 0.9198, and the E2E latency shows about 1.2x speedup. For details, please refer to PR #21502. |
Nice advice, thanks.
@jinyouzhi Have you tried how it works with MTP/CP? Also, how this feature affects harder tests like GPQA or AIME25? |
The review comment below refers to this test registration in the diff:

```python
from sglang.test.test_utils import ModelLaunchSettings
from sglang.test.tool_call_test_runner import ToolCallTestParams

register_cuda_ci(est_time=5400, suite="nightly-8-gpu-common", nightly=True)
```
Please don't add such a huge nightly test for index_cache.
Rather, add a single per-commit test similar to test/registered/8-gpu-models/test_deepseek_v32_basic.py
FYI.
@jinyouzhi Which GPU are you testing on?
H20x8
This works on GLM-5-FP8, and I think it can serve as a tutorial in the SGLang CookBook for users attempting their own deployment, though we cannot guarantee zero precision loss. |
/tag-and-rerun-ci

/rerun-test test_kimi_linear_models.py

/rerun-stage stage-c-test-8-gpu-h200

/rerun-stage stage-c-test-4-gpu-b200

✅ Triggered

✅ Triggered

/rerun-stage stage-c-test-deepep-8-gpu-h200

✅ Triggered
The index cache feature makes forward_core return a tuple (hidden_states, topk_indices) for NSA models. The TBO overlapped-operations path (op_core) didn't unpack this tuple, causing `'tuple' object has no attribute 'shape'` errors in DeepEP tests with DeepSeek V3.2.

Fix:
- Unpack the tuple in op_core, discarding topk_indices (the TBO path doesn't propagate it between layers)
- Fall back to computing topk when prev_topk_indices is None (i.e., in TBO mode) even if skip_topk is set, to avoid using None as topk indices

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
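The two fixes in the commit message can be sketched in code. op_core, forward_core, and the helper names here are stand-ins for the real SGLang functions (assumptions, not the actual implementation); only the tuple-handling pattern is what the fix illustrates:

```python
def op_core(forward_core, hidden_states):
    # Sketch of the TBO path: forward_core may return either a bare
    # tensor or, for NSA models, a (hidden_states, topk_indices) tuple.
    out = forward_core(hidden_states)
    if isinstance(out, tuple):
        # The TBO path does not propagate topk_indices between layers,
        # so unpack the tuple and discard the indices.
        hidden_states, _topk_indices = out
    else:
        hidden_states = out
    return hidden_states


def select_topk_indices(skip_topk, prev_topk_indices, compute_topk):
    # Even when skip_topk is set, fall back to recomputing if there are
    # no cached indices (prev_topk_indices is None in TBO mode), rather
    # than passing None downstream as topk indices.
    if skip_topk and prev_topk_indices is not None:
        return prev_topk_indices
    return compute_topk()
```

Usage-wise, a layer with skip_topk=True behaves identically to before whenever a cache entry exists, and silently degrades to the non-skipping path under TBO.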
Hello, I understand that the larger the index_topk_freq setting is (i.e., the more S layers there are), the greater the impact on output accuracy. Is that correct? @jinyouzhi |
Yes |
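To make that tradeoff concrete, here is a small illustration (assuming, per the discussion above, that every index_topk_freq-th layer recomputes its indices while the rest reuse the cache; the 60-layer count is arbitrary):

```python
def skip_fraction(num_layers: int, freq: int) -> float:
    """Fraction of layers that reuse cached topk indices for a given
    index_topk_freq. More reuse means more speedup, but also a larger
    potential impact on output accuracy."""
    computing_layers = len(range(0, num_layers, freq))  # layers that recompute
    return (num_layers - computing_layers) / num_layers

for freq in (2, 3, 4):
    print(f"freq={freq}: {skip_fraction(60, freq):.0%} of layers skip")
```

With freq=2 half the layers skip; by freq=4 three quarters do, which is why raising the setting trades accuracy for speed.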
Motivation
fix #21286
Modifications
Accuracy Tests
gsm8k with DeepSeek-V3.2-Exp with this PR:
main
Benchmarking and Profiling
Throughput improved ~ +6.4%
TTFT improved ~ -5.4%
TPOT improved ~ -5.5%
this PR:
main
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci