
[DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache#19148

Merged
BBuf merged 2 commits into sgl-project:main from yuan-luo:support_nsa_fuse_store_k_cache
Feb 26, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Feb 22, 2026

Motivation

In DeepSeek V3.2, after the Indexer produces keys in bf16 (shape roughly (N, 128)), they need to be written into NSA's index_k_with_scale_buffer. The previous implementation used two steps:

  1. Quantization: act_quant(key, ...) converts bf16 keys into k_fp8 (FP8(E4M3) key bytes, 128 dims) and k_scale (a per-token FP32 scale; NSA uses one scale for the whole 128-d block).
  2. Store into cache: token_to_kv_pool.set_index_k_scale_buffer(layer_id, loc, k_fp8, k_scale) writes k_fp8 and k_scale into the paged index_k_with_scale_buffer using out_cache_loc.

This path requires at least two kernel launches (quant + store). Under CUDA Graph / multi-stream execution, launch and sync overhead becomes more noticeable.
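
For reference, the two-step path can be sketched in plain NumPy. This is a simplified emulation, not the real implementation: the actual act_quant and set_index_k_scale_buffer operate on GPU tensors and emit real FP8 bytes, and the function names below with a `_ref` suffix are illustrative stand-ins.

```python
import numpy as np

FP8_MAX = 448.0  # max representable magnitude of FP8 E4M3

def act_quant_ref(key):
    """Step 1: per-token quantization -- one FP32 scale per 128-d row."""
    absmax = np.abs(key).max(axis=-1)            # (N,)
    scale = np.maximum(absmax, 1e-4) / FP8_MAX   # (N,) per-token FP32 scale
    k_fp8 = key / scale[:, None]                 # values now in FP8 range (rounding elided)
    return k_fp8.astype(np.float32), scale.astype(np.float32)

def store_ref(k_cache, scale_cache, loc, k_fp8, k_scale):
    """Step 2: scatter the quantized rows and scales at out_cache_loc."""
    k_cache[loc] = k_fp8
    scale_cache[loc] = k_scale
```

Each of these two functions corresponds to one kernel launch in the old path, which is exactly the overhead the fused kernel removes.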

This PR introduces a JIT-compiled CUDA kernel that fuses quantization and store. Inside the kernel, it:

  1. performs a per-token absmax reduction
  2. computes the FP32 scale (max(1e-4, absmax) / FP8_MAX)
  3. quantizes to FP8(E4M3) and packs
  4. computes the page/offset from loc and writes K (128 B) + scale (4 B) into the same paged buffer in one pass

Inspired by @DarkSharpness
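
The four steps above can be emulated per row in NumPy. This is a sketch only: the real kernel does all rows in one CUDA launch with a warp-level absmax reduction and writes 128 FP8 bytes plus a 4-byte FP32 scale per token into a single paged byte buffer; here two arrays indexed by (page, slot) stand in for that buffer, and `fused_store_ref` is an illustrative name.

```python
import numpy as np

FP8_MAX = 448.0   # FP8 E4M3 max magnitude
PAGE_SIZE = 64    # tokens per page -- the fused kernel requires page_size == 64

def fused_store_ref(key, k_cache, scale_cache, loc):
    """Emulate the fused kernel's per-token work in one pass."""
    for row, token_loc in zip(key, loc):
        absmax = float(np.abs(row).max())               # 1. per-token absmax reduction
        scale = max(1e-4, absmax) / FP8_MAX             # 2. FP32 scale
        q = np.clip(row / scale, -FP8_MAX, FP8_MAX)     # 3. quantize (FP8 rounding elided)
        page, slot = divmod(int(token_loc), PAGE_SIZE)  # 4. page/offset from loc
        k_cache[page, slot] = q                         #    write K + scale together
        scale_cache[page, slot] = scale
```

Because the scale is computed and the result written inside the same loop body, there is no intermediate k_fp8/k_scale tensor and no second launch.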

Before PR:
image

After PR:
image

Performance improved slightly; more testing to follow.

GSM8K shows no accuracy drop.

Server:
➜  sglang_dev3 python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja

Client:
➜  sglang git:(main) python3 benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 128 --num-shots 8 --port 30000
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:19<00:00, 10.46it/s]
Accuracy: 0.975
Invalid: 0.000
Latency: 19.125 s
Output throughput: 1038.949 token/s

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the key cache population mechanism within the DeepSeek V3.2 architecture. By consolidating the quantization of BF16 keys and their subsequent storage into a single, JIT-compiled CUDA kernel, the change aims to minimize computational overhead and improve overall inference performance. The implementation includes robust fallback logic, ensuring system stability even if the optimized kernel cannot be utilized.

Highlights

  • Fused Quantization and Storage Kernel: Introduced a new JIT-compiled CUDA kernel that fuses the quantization of BF16 keys to FP8 and the storage of these keys along with their scales into the NSA index K-cache. This combines what was previously two separate kernel launches into a single, optimized operation.
  • Performance Optimization: The fusion of quantization and storage into one kernel aims to reduce kernel launch and synchronization overhead, particularly beneficial under CUDA Graph or multi-stream execution, leading to slight performance improvements.
  • Conditional Kernel Usage with Fallback: The system now conditionally uses the new fused JIT kernel if it can be successfully loaded and specific conditions (CUDA, page_size=64, non-fnuz) are met. Otherwise, it gracefully falls back to the original two-step quantization and storage process.
  • DeepSeek V3.2 Integration: This change specifically addresses the key population process in DeepSeek V3.2, where BF16 keys are converted and stored in NSA's index_k_with_scale_buffer.
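
The conditional-usage-with-fallback logic described above might look roughly like the sketch below. The stub bodies are stand-ins for the real sglang functions (only the names can_use_nsa_fused_store, fused_store_index_k_cache, act_quant, and set_index_k_scale_buffer come from this PR; argument lists and guard plumbing are paraphrased for illustration).

```python
# Stubs standing in for the real sglang functions (illustration only).
calls = []

def can_use_nsa_fused_store():
    return True  # real version: True only if the JIT kernel loaded successfully

def fused_store_index_k_cache(key, buf, loc):
    calls.append("fused")

def act_quant(key):
    calls.append("quant")
    return key, 1.0

def set_index_k_scale_buffer(loc, k_fp8, k_scale):
    calls.append("store")

def store_index_k_cache(key, buf, loc, *, is_cuda, is_fp8_fnuz, page_size):
    """Prefer the fused JIT kernel (CUDA, page_size == 64, non-fnuz FP8);
    otherwise fall back to the original two-step quant + store path."""
    if can_use_nsa_fused_store() and is_cuda and not is_fp8_fnuz and page_size == 64:
        fused_store_index_k_cache(key, buf, loc)      # single fused launch
    else:
        k_fp8, k_scale = act_quant(key)               # fallback: quantize...
        set_index_k_scale_buffer(loc, k_fp8, k_scale) # ...then store
```

The point of the guard is that an environment where the JIT kernel cannot compile (or a non-default page size) silently keeps the proven two-kernel path.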


Changelog
  • python/sglang/jit_kernel/csrc/nsa/fused_store_index_cache.cuh
    • Added a new CUDA kernel fused_store_indexer_cache to perform per-token absmax reduction, FP32 scale computation, FP8(E4M3) quantization and packing, and direct storage into a paged buffer.
    • Included helper device functions for FP8 clipping and packing.
    • Defined FusedStoreCacheIndexerKernel to wrap the CUDA kernel for TVM FFI integration, handling tensor verification and kernel launch parameters.
  • python/sglang/jit_kernel/fused_store_index_cache.py
    • Added a new Python module to provide JIT-compiled CUDA kernel wrappers.
    • Implemented _jit_nsa_fused_store_module to load the fused_store_indexer_cache CUDA kernel.
    • Introduced can_use_nsa_fused_store to check for successful kernel loading and cache the result.
    • Provided fused_store_index_k_cache as the main entry point, handling tensor shape normalization, contiguity, and dtype assertions before calling the JIT kernel.
  • python/sglang/srt/layers/attention/nsa/nsa_indexer.py
    • Imported can_use_nsa_fused_store and fused_store_index_k_cache for the new fused kernel functionality.
    • Modified _forward_cuda_k_only to conditionally use the new fused_store_index_k_cache or fall back to the original two-step process for key storage.
    • Added a new private method _store_index_k_cache to encapsulate the logic for storing NSA indexer K cache, prioritizing the fused JIT kernel under specific conditions.
    • Updated forward_cuda to utilize the new _store_index_k_cache method for key storage in both dual-stream and single-stream execution paths, replacing the previous direct act_quant and set_index_k_scale_buffer calls.

@yuan-luo yuan-luo changed the title [WIP][DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache Feb 22, 2026
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a JIT-compiled CUDA kernel to fuse quantization and storage of the NSA indexer K cache, aiming to reduce kernel launch overhead and improve performance. The changes include a new CUDA kernel, a Python wrapper for JIT compilation, and modifications to nsa_indexer.py to use this new fused kernel.

My review has identified a critical race condition in the new CUDA kernel where multiple threads in a warp attempt to write to the same memory location. I've also found some opportunities for code cleanup by removing unused functions and a redundant buffer fetch. Addressing these points will improve the correctness and maintainability of the new implementation.

Comment thread python/sglang/jit_kernel/csrc/nsa/fused_store_index_cache.cuh
Comment thread python/sglang/jit_kernel/csrc/nsa/fused_store_index_cache.cuh Outdated
Comment thread python/sglang/srt/layers/attention/nsa/nsa_indexer.py Outdated
@yuan-luo yuan-luo requested a review from ispobock February 22, 2026 09:27
@yuan-luo yuan-luo changed the title [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache [WIP][DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache Feb 22, 2026
@yuan-luo yuan-luo force-pushed the support_nsa_fuse_store_k_cache branch from 9c6ddb2 to 75bca33 Compare February 22, 2026 10:23
@yuan-luo yuan-luo changed the title [WIP][DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache [DeepSeek-V3.2][JIT-kernel] Support nsa fuse store indexer k cache Feb 22, 2026
@yuan-luo
Collaborator Author

#11989

@yuan-luo
Collaborator Author

/rerun-failed-ci

@yuan-luo yuan-luo force-pushed the support_nsa_fuse_store_k_cache branch 2 times, most recently from 055d29f to 781acbd Compare February 22, 2026 13:07
@yuan-luo
Collaborator Author

/rerun-failed-ci

6 similar comments
@DarkSharpness
Collaborator

Can we add PDL support for this kernel? I'm not sure if this will bring performance improvement.

Comment thread python/sglang/jit_kernel/csrc/nsa/fused_store_index_cache.cuh
/// NOTE: 132 = 128 + 4
constexpr int64_t kPageBytes = 132 << kPageBits;

// each warp handles 128 elements, 1 warp, each block handles multiple rows
Collaborator


Suggested change
// each warp handles 128 elements, 1 warp, each block handles multiple rows
// each warp handles 128 elements, each block handles multiple rows

Collaborator Author


done.


# Fast path: JIT fused store (CUDA, page_size=64, non-fnuz)
if can_use_nsa_fused_store() and _is_cuda and (not _is_fp8_fnuz):
if forward_batch.token_to_kv_pool.page_size == 64:
Collaborator


Can we merge those two if statements into one?

Collaborator Author


done.

layer_id=layer_id
)
fused_store_index_k_cache(key, buf, forward_batch.out_cache_loc)
else:
Collaborator


The can_use_nsa_fused_store func has a fallback now; why do we need another fallback here?

Collaborator Author


The can_use_nsa_fused_store func itself doesn't have a fallback.
There are two branches in forward_cuda, and both need a separate fallback:

  1. fast path (seqlen < 2048): it skips the topk computation and calls _forward_cuda_k_only --> fused_store_index_k_cache
  2. normal path: it performs the topk computation and calls _store_index_k_cache()

Comment thread python/sglang/jit_kernel/csrc/nsa/fused_store_index_cache.cuh
@yuan-luo
Collaborator Author

Can we add PDL support for this kernel? I'm not sure if this will bring performance improvement.

Addressed and refactored code.

@yuan-luo yuan-luo force-pushed the support_nsa_fuse_store_k_cache branch from f63bef3 to c4d59ad Compare February 23, 2026 14:21
@Fridge003
Collaborator

@yuan-luo Can you please test the result of gpqa and aime25 as shown here: https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

@Fridge003
Collaborator

Fridge003 commented Feb 23, 2026

Also, can you please test some extreme workloads (e.g. 128k input), to make sure it doesn't crash due to any IMA-like errors (although with int64 out cache loc this shouldn't happen)?

Comment thread python/sglang/srt/layers/attention/nsa/nsa_indexer.py Outdated
@Fridge003
Collaborator

Can we add a test for this JIT kernel?

@yuan-luo
Collaborator Author

yuan-luo commented Feb 24, 2026

@yuan-luo Can you please test the result of gpqa and aime25 as shown here: https://docs.sglang.io/basic_usage/deepseek_v32.html#accuracy-test-with-gpqa-diamond

gpqa result:

➜  sglang_dev3 git:(support_nsa_fuse_store_k_cache) ✗ python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa \
  --num-examples 198 --max-tokens 128000 --repeat 1 \
  --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
ChatCompletionSampler initialized with self.system_message=None self.temperature=1.0 self.max_tokens=128000 self.reasoning_effort=None self.extra_body={'chat_template_kwargs': {'thinking': True}}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 198/198 [08:55<00:00,  2.70s/it]
Total latency: 535.045 s
Score: 0.823
[METRIC] gpqa_score=0.8232323232323232 labels={"model": "deepseek-ai/DeepSeek-V3.2-Exp", "eval": "gpqa"}
[METRIC] gpqa_latency=535.0447993390262 labels={"model": "deepseek-ai/DeepSeek-V3.2-Exp", "eval": "gpqa"}
Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.html
{'chars': np.float64(1464.7676767676767), 'chars:std': np.float64(345.69935932254305), 'score:std': np.float64(0.38147197173296354), 'score': np.float64(0.8232323232323232)}
Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.json

@yuan-luo
Collaborator Author

yuan-luo commented Feb 24, 2026

AIME25 result:

Server:
➜  python git:(support_nsa_fuse_store_k_cache) ✗ python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ../examples/chat_template/tool_chat_template_deepseekv32.jinja

Client:
➜  bench_script bash bench_aime.sh
[09:08:03] WARNING  Cluster config is not specified. Running locally without containers. Only a subset of features is supported and you're responsible for installing any required dependencies. It's recommended to run `ns setup`  cluster.py:354
                    to define appropriate configs!
[09:08:03] INFO     Optional environment variable GEMINI_API_KEY not found in user environment; skipping.                                                                                                                            cluster.py:255
           INFO     Optional environment variable OPENAI_API_KEY not found in user environment; skipping.                                                                                                                            cluster.py:255
           INFO     Optional environment variable NVIDIA_API_KEY not found in user environment; skipping.                                                                                                                            cluster.py:255
           INFO     Optional environment variable AZURE_OPENAI_API_KEY not found in user environment; skipping.                                                                                                                      cluster.py:255
           INFO     Adding optional environment variable HF_TOKEN from default factory                                                                                                                                               cluster.py:251
           INFO     Optional environment variable NGC_API_KEY not found in user environment; skipping.                                                                                                                               cluster.py:255
           INFO     Optional environment variable WANDB_API_KEY not found in user environment; skipping.                                                                                                                             cluster.py:255
           INFO     Adding a task with commands:                                                                                                                                                                                         exp.py:523
           INFO     Adding optional environment variable NEMO_SKILLS_SANDBOX_PORT from cluster config                                                                                                                                cluster.py:232
           INFO     Not running from a git repo, trying to upload installed package. Make sure there are no extra files in /usr/local/lib/python3.12/dist-packages/nemo_skills/*                                                    packager.py:203
           INFO     Main command(s): export HYDRA_FULL_ERROR=1 && export PYTHONPATH=$PYTHONPATH:/nemo_run/code && cd /nemo_run/code && python -m nemo_skills.dataset.prepare  aime25 --parallelism 20 --retries 3                        exp.py:604
nemo-run/0 Preparing aime25 (attempt 1/4)
Starting AIME25 evaluation with model deepseek-ai/DeepSeek-V3.2-Exp on port 30000 using backend sglang...
[09:08:17] INFO     Starting evaluation job                                                                                                                                                                                             eval.py:475
           INFO     Extra arguments that will be passed to the underlying script: ++chat_template_kwargs.thinking=true ++inference.temperature=1.0 ++inference.top_p=0.95 ++inference.tokens_to_generate=64000                          eval.py:476
[09:08:17] INFO     Adding a task with commands:                                                                                                                                                                                         exp.py:523
           INFO     Adding optional environment variable NEMO_SKILLS_SANDBOX_PORT from cluster config                                                                                                                                cluster.py:232
           INFO     Not running from a git repo, trying to upload installed package. Make sure there are no extra files in /usr/local/lib/python3.12/dist-packages/nemo_skills/*                                                    packager.py:203
           INFO     Main command(s): export PYTHONPATH=$PYTHONPATH:/nemo_run/code && cd /nemo_run/code && (  export HYDRA_FULL_ERROR=1 && python -m nemo_skills.inference.generate  ++skip_filled=True ++input_file=/nemo_run/code/nemo_skills/dataset/aime25/test.jsonl ++output_file=nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/output-rs0.jsonl ++inference.random_seed=0 ++inference.temperature=0.7 ++inference.top_k=-1 ++inference.top_p=0.95 ++prompt_config=generic/math ++eval_type=math ++eval_config.split=test ++server.base_url=http://localhost:30000/v1 ++server.model=deepseek-ai/DeepSeek-V3.2-Exp ++server.server_type=sglang ++chat_template_kwargs.thinking=true ++inference.temperature=1.0 ++inference.top_p=0.95 ++inference.tokens_to_generate=64000 && touch nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/output-rs0.jsonl.done ) & (3 similar commands for output-rs1..rs3 with random_seed 1..3) & wait    exp.py:604
           INFO     Adding a task with commands:                                                                                                                                                                                         exp.py:523
           INFO     Not running from a git repo, trying to upload installed package. Make sure there are no extra files in /usr/local/lib/python3.12/dist-packages/nemo_skills/*                                                    packager.py:203
           INFO     Main command(s): python -m nemo_skills.pipeline.summarize_results nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results     --benchmarks aime25     --save_metrics_path                            exp.py:604
                    nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/metrics.json     --metric_type=math  --wandb_project=nemo-skills
nemo-run/0 2026-02-24 09:08:24 INFO  Config used: GenerationTaskConfig(input_file='/usr/local/lib/python3.12/dist-packages/nemo_skills/dataset/aime25/test.jsonl', output_file='nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/output-rs2.jsonl', prompt_config='generic/math', use_completions_api=False, tokenizer=None, chat_template_kwargs={'thinking': True}, prompt_format='ns', prompt_suffix='', system_message=None, code_tags=None, examples_type=None, server={'base_url': 'http://localhost:30000/v1', 'model': 'deepseek-ai/DeepSeek-V3.2-Exp', 'server_type': 'sglang'}, sandbox={}, wait_for_sandbox=False, start_assistant_response_key=None, inference=InferenceConfig(endpoint_type=<EndpointType.chat: 'chat'>, temperature=1.0, top_k=-1, top_p=0.95, min_p=0.0, random_seed=2, tokens_to_generate=64000, repetition_penalty=1.0, top_logprobs=None, timeout=14400, reasoning_effort=None, extra_body={}), max_samples=-1, skip_filled=True, max_concurrent_requests=512, num_chunks=None, chunk_id=None, add_generation_stats=True, count_prompt_tokens=False, generation_key='generation', async_position_key='_async_position', dry_run=False, code_execution=False, total_code_executions_in_prompt=None, override_max_code_executions=False, stop_phrase=None, parallel_thinking=ParallelThinkingConfig(temperature=0.6, tokens_to_generate=None, parse_reasoning=False, parse_reasoning_solutions=True, end_reasoning_string='</think>', endpoint_type=<EndpointType.chat: 'chat'>, tokenizer=None, chat_template_kwargs={}, start_assistant_response_key=None, count_prompt_tokens=False, mode=None, genselect=GenSelectSpecificConfig(prompt_config='generic/genselect', regex='Judg[e]?ment: (\\d+)'), gensynthesis=GenSynthesisSpecificConfig(prompt_config='generic/gensynthesis', regex='<NEW_SOLUTION>(.*?)</NEW_SOLUTION>'), solution_length_cap=16384, window_size=8, solution_key='generation', filter_incomplete_solutions=True, generation_dir=None, num_initial_solutions=None), tool_modules=None, 
tool_overrides={}, schema_overrides={}, max_tool_calls=-1, parse_reasoning=False, end_reasoning_string='</think>', enable_litellm_cache=False, drop_content_types=['audio_url', 'input_audio'], enable_audio=False, enable_audio_chunking=True, audio_chunk_task_types=None, chunk_audio_threshold_sec=30, eval_type='math', eval_config={'split': 'test'}, structured_output=None)
nemo-run/0 2026-02-24 09:08:24 INFO  Prompt used: PromptConfig(user='Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{{}}.\n\n{examples}{problem}', system=None, code_tags=None, few_shot_examples=FewShotExamplesConfig(prefix='Here are some examples of problems and solutions you can refer to.\n\n', template='Problem:\n{problem}\n\nSolution:\n{solution}\n\n\n\n\n\n', suffix='Here is the problem you need to solve:\n', examples_type=None, retrieval_field=None, retrieval_file=None, retrieved_entries=10, retrieved_few_shots=5, randomize_retrieved_entries=False, max_retrieved_chars=100000000, max_retrieved_chars_field='reference_solution', retriever=None), image_field=None, image_position='before')
nemo-run/0 Waiting for the server to start at http://localhost:30000/v1
nemo-run/0 2026-02-24 09:08:24 INFO  Evaluator supports per-datapoint evals, will interleave evaluation with generation.
nemo-run/0 2026-02-24 09:08:24 INFO  Async loop is maintaining 512 generations in parallel. Use max_concurrent_requests to control the number of concurrent requests.
nemo-run/0 2026-02-24 09:08:24 WARNING  File `nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/output-rs2.jsonl-async` not found, starting from scratch
nemo-run/0 2026-02-24 09:08:24 INFO  Example prompt:
nemo-run/0 Data dictionary: {'id': 'aime25-0', 'problem': 'Find the sum of all integer bases  $b>9$  for which  $17_b$  is a divisor of  $97_b.$', 'expected_answer': '70', 'reference_solution': 'This means that  $a(b+7)=9b+7$  where  $a$  is a natural number. Rearranging we get  $(a-9)(b+7)=-56$ . Since  $b>9$ ,  $b=49,21$ . Thus the answer is  $49+21=\\boxed{70}$', '_async_position': 0}
nemo-run/0 Prompt: [{'role': 'user', 'content': 'Solve the following math problem. Make sure to put the answer (and only answer) inside \\boxed{}.\n\nFind the sum of all integer bases  $b>9$  for which  $17_b$  is a divisor of  $97_b.$'}]

nemo-run/0 Remaining generations: 100%|██████████| 30/30 [22:02<00:00, 44.08s/it]
nemo-run/0 Remaining generations: 100%|██████████| 30/30 [24:10<00:00, 48.34s/it]
nemo-run/0 Remaining generations: 100%|██████████| 30/30 [24:49<00:00, 49.66s/it]
nemo-run/0 Remaining generations: 100%|██████████| 30/30 [26:36<00:00, 53.20s/it]
nemo-run_1/0 ---------------------------------------- aime25 ----------------------------------------
nemo-run_1/0 evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
nemo-run_1/0 pass@1[avg-of-4] | 30          | 14033      | 1596        | 87.50% ± 3.19%   | 0.00%
nemo-run_1/0 majority@4       | 30          | 14033      | 1596        | 90.00%           | 0.00%
nemo-run_1/0 pass@4           | 30          | 14033      | 1596        | 93.33%           | 0.00%
nemo-run_1/0
nemo-run_1/0
nemo-run_1/0 Metrics are saved to nemo_skills_aime25_dsv32-fp8_output_sglang_20260224_090809/eval-results/aime25/metrics.json

@yuan-luo
Collaborator Author

> Can we add a test for this JIT kernel?

WIP.

@yuan-luo yuan-luo force-pushed the support_nsa_fuse_store_k_cache branch from c4d59ad to 350c76d on February 25, 2026 01:23
luoyuan.luo and others added 2 commits February 26, 2026 10:07
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
@yuan-luo yuan-luo force-pushed the support_nsa_fuse_store_k_cache branch from 350c76d to 8f6a1f3 on February 26, 2026 02:09
@BBuf BBuf merged commit 4e843f1 into sgl-project:main Feb 26, 2026
31 of 93 checks passed
@yuan-luo
Collaborator Author

> Can we add a test for this JIT kernel?

@Fridge003 #19389

klhhhhh pushed a commit to klhhhhh/sglang that referenced this pull request Feb 26, 2026
…gl-project#19148)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…gl-project#19148)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…gl-project#19148)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…gl-project#19148)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: DarkSharpness <76582120+darksharpness@users.noreply.github.com>