
Enables TRT-LLM backend to be used for target_verify #10281

Merged
zhyncs merged 12 commits into sgl-project:main from pranavm-nvidia:trtllm-target-verify on Sep 22, 2025

Conversation

@pranavm-nvidia
Collaborator

@pranavm-nvidia pranavm-nvidia commented Sep 10, 2025

Motivation

This change allows the TRT-LLM MLA backend to be used for the target_verify step in MTP, which should improve performance.

Modifications

  • Enables forward_extend in the TRT-LLM MLA backend to be used for the target_verify step in MTP.
  • Adds the KV cache update in forward_extend, which was previously missing (a minimal sketch of both changes follows below).
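
A minimal sketch of the idea, assuming the backend can branch on the batch's forward mode; the helper names and exact signatures (`save_kv_cache`, `_call_mla_decode_kernel`, the KV-pool call) are illustrative, not the exact code in this PR:

```python
# Illustrative only: forward_extend serving TARGET_VERIFY by (1) writing the new
# latent KV entries into the paged cache and (2) reusing the decode-style MLA
# kernel, since verification scores a fixed number of draft tokens per request.
def forward_extend(self, q, k, v, layer, forward_batch, save_kv_cache=True):
    if save_kv_cache:
        # Previously missing step: persist the freshly computed KV so later
        # decode/verify steps read it back from the paged pool.
        forward_batch.token_to_kv_pool.set_kv_buffer(
            layer, forward_batch.out_cache_loc, k, v
        )

    if forward_batch.forward_mode.is_target_verify():
        # Behaves like a short decode over seq_len + num_draft_tokens entries.
        return self._call_mla_decode_kernel(q, layer, forward_batch)

    # Regular prefill/extend path, unchanged (KV already written above).
    return super().forward_extend(q, k, v, layer, forward_batch, save_kv_cache=False)
```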

Accuracy Tests

GSM8k:

Details

Server Command:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Test:

$ python3 benchmark/gsm8k/bench_sglang.py --num-shots 5 --num-questions 1319 --parallel 512 --port 30000

Accuracy: 0.946
Invalid: 0.000
Latency: 25.739 s
Output throughput: 5051.383 token/s

GPQA-Diamond:

Details

Note: Please see my comment below with the MTP=off results; MTP=on does not appear to cause an accuracy drop compared to that baseline.

Server Command:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Results:

| Category | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|---|
| gpqa | pass@1 | 198 | 5414 | 329 | 70.20% | 0.00% |
| gpqa-Physics (general) | pass@1 | 19 | 4118 | 329 | 78.95% | 0.00% |
| gpqa-Organic Chemistry | pass@1 | 72 | 7983 | 326 | 44.44% | 0.00% |
| gpqa-Quantum Mechanics | pass@1 | 25 | 3521 | 266 | 100.00% | 0.00% |
| gpqa-Electromagnetism and Photonics | pass@1 | 6 | 2103 | 74 | 83.33% | 0.00% |
| gpqa-High-energy particle physics | pass@1 | 14 | 3768 | 299 | 92.86% | 0.00% |
| gpqa-Genetics | pass@1 | 4 | 4562 | 162 | 75.00% | 0.00% |
| gpqa-Astrophysics | pass@1 | 13 | 4758 | 215 | 92.31% | 0.00% |
| gpqa-Molecular Biology | pass@1 | 15 | 3711 | 306 | 80.00% | 0.00% |
| gpqa-Chemistry (general) | pass@1 | 20 | 5138 | 302 | 65.00% | 0.00% |
| gpqa-Relativistic Mechanics | pass@1 | 7 | 2545 | 113 | 85.71% | 0.00% |
| gpqa-Inorganic Chemistry | pass@1 | 1 | 6055 | 165 | 100.00% | 0.00% |
| gpqa-Optics and Acoustics | pass@1 | 1 | 2147 | 60 | 100.00% | 0.00% |
| gpqa-Condensed Matter Physics | pass@1 | 1 | 1042 | 32 | 100.00% | 0.00% |

Math-500

Details

Server Command:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Results:

| evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|
| pass@1 | 500 | 2284 | 370 | 94.60% | 1.80% |

Benchmarking and Profiling

FlashInfer vs. TRT-LLM MLA (MTP=on for both)

Details

Server commands:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend flashinfer --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Client:

python3 -m sglang.bench_serving --backend sglang --model "deepseek-ai/DeepSeek-R1" --num-prompts 256 --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency=256

Results:

| Metric | FlashInfer Result | TRT-LLM Result | % Difference (TRT-LLM vs FlashInfer) |
|---|---|---|---|
| Total input tokens | 262144 | 262144 | 0.00% |
| Total generated tokens | 262144 | 262144 | 0.00% |
| Total generated tokens (retokenized) | 261347 | 261314 | -0.01% |
| Request throughput (req/s) | 5.47 | 5.79 | +5.84% |
| Input token throughput (tok/s) | 5598.17 | 5934.04 | +6.00% |
| Output token throughput (tok/s) | 5598.17 | 5934.04 | +6.00% |
| Total token throughput (tok/s) | 11196.34 | 11868.09 | +6.00% |
| Concurrency | 225.74 | 222.14 | -1.59% |
| Accept length | 2.62 | 2.79 | +6.49% |
| Mean E2E Latency (ms) | 41290.86 | 38332.63 | -7.23% |
| Median E2E Latency (ms) | 42046.48 | 39026.21 | -7.19% |
| Mean TTFT (ms) | 4382.98 | 4294.74 | -2.01% |
| Median TTFT (ms) | 4384.86 | 4308.44 | -1.75% |
| P99 TTFT (ms) | 8171.60 | 7987.98 | -2.25% |
| Mean ITL (ms) | 36.08 | 33.27 | -7.80% |
| Median ITL (ms) | 28.33 | 24.54 | -13.37% |
| P95 ITL (ms) | 45.87 | 46.65 | +1.70% |
| P99 ITL (ms) | 89.37 | 88.16 | -1.35% |
| Max ITL (ms) | 7688.61 | 7519.31 | -2.20% |

TRT-LLM MLA MTP on vs. off (concurrency=1)

Details

Server commands:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Client:

python3 -m sglang.bench_serving --backend sglang --model "deepseek-ai/DeepSeek-R1" --num-prompts 64 --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency=1

Results:

| Metric | MTP=off | MTP=on | % Difference |
|---|---|---|---|
| Total input tokens | 65536 | 65536 | 0% |
| Total generated tokens | 65536 | 65536 | 0% |
| Total generated tokens (retokenized) | 65349 | 65372 | +0.04% |
| Request throughput (req/s) | 0.08 | 0.15 | +87.50% |
| Input token throughput (tok/s) | 82.72 | 151.53 | +83.23% |
| Output token throughput (tok/s) | 82.72 | 151.53 | +83.23% |
| Total token throughput (tok/s) | 165.44 | 303.07 | +83.32% |
| Concurrency | 1.00 | 1.00 | 0% |
| Accept length | - | 2.69 | N/A |
| Mean E2E Latency (ms) | 12378.26 | 6757.24 | -45.39% |
| Median E2E Latency (ms) | 12378.47 | 6738.44 | -45.54% |
| Mean TTFT (ms) | 102.32 | 97.48 | -4.73% |
| Median TTFT (ms) | 101.83 | 97.37 | -4.38% |
| P99 TTFT (ms) | 105.88 | 101.64 | -4.01% |
| Mean ITL (ms) | 12.00 | 6.51 | -45.75% |
| Median ITL (ms) | 12.00 | 5.77 | -51.92% |
| P95 ITL (ms) | 12.41 | 17.08 | +37.68% |
| P99 ITL (ms) | 12.55 | 17.78 | +41.78% |
| Max ITL (ms) | 16.64 | 21.41 | +28.70% |

TRT-LLM MLA MTP on vs. off (concurrency=8)

Details

Server commands:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Client:

python3 -m sglang.bench_serving --backend sglang --model "deepseek-ai/DeepSeek-R1" --num-prompts 64 --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency=8

Results:

| Metric | MTP=off | MTP=on | % Difference |
|---|---|---|---|
| Total input tokens | 65536 | 65536 | 0% |
| Total generated tokens | 65536 | 65536 | 0% |
| Total generated tokens (retokenized) | 65319 | 65361 | +0.06% |
| Request throughput (req/s) | 0.53 | 0.71 | +34.0% |
| Input token throughput (tok/s) | 547.30 | 729.13 | +33.2% |
| Output token throughput (tok/s) | 547.30 | 729.13 | +33.2% |
| Total token throughput (tok/s) | 1094.61 | 1458.27 | +33.3% |
| Concurrency | 8.00 | 7.70 | -3.75% |
| Accept length | - | 2.71 | N/A |
| Mean E2E Latency (ms) | 14964.81 | 10810.43 | -27.7% |
| Median E2E Latency (ms) | 14961.21 | 10750.30 | -28.1% |
| Mean TTFT (ms) | 240.68 | 143.85 | -40.2% |
| Median TTFT (ms) | 193.33 | 117.34 | -39.3% |
| P99 TTFT (ms) | 438.72 | 323.26 | -26.3% |
| Mean ITL (ms) | 14.39 | 10.43 | -27.5% |
| Median ITL (ms) | 14.36 | 8.96 | -37.6% |
| P95 ITL (ms) | 14.89 | 27.12 | +82.2% |
| P99 ITL (ms) | 15.05 | 30.33 | +101.6% |
| Max ITL (ms) | 260.28 | 478.39 | +83.8% |

TRT-LLM MLA MTP on vs. off (concurrency=16)

Details

Server commands:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Client:

python3 -m sglang.bench_serving --backend sglang --model "deepseek-ai/DeepSeek-R1" --num-prompts 64 --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency=16

Results:

| Metric | MTP=off | MTP=on | % Difference |
|---|---|---|---|
| Total input tokens | 65536 | 65536 | 0% |
| Total generated tokens | 65536 | 65536 | 0% |
| Total generated tokens (retokenized) | 65319 | 65356 | +0.06% |
| Request throughput (req/s) | 0.87 | 1.13 | +29.89% |
| Input token throughput (tok/s) | 894.61 | 1157.60 | +29.41% |
| Output token throughput (tok/s) | 894.61 | 1157.60 | +29.41% |
| Total token throughput (tok/s) | 1789.22 | 2315.20 | +29.37% |
| Concurrency | 15.99 | 14.71 | -8.00% |
| Accept length | - | 2.71 | N/A |
| Mean E2E Latency (ms) | 18307.75 | 13013.49 | -28.89% |
| Median E2E Latency (ms) | 18289.69 | 13071.94 | -28.48% |
| Mean TTFT (ms) | 287.83 | 179.31 | -37.70% |
| Median TTFT (ms) | 285.44 | 125.67 | -55.98% |
| P99 TTFT (ms) | 375.73 | 317.55 | -15.52% |
| Mean ITL (ms) | 17.61 | 12.55 | -28.71% |
| Median ITL (ms) | 17.61 | 10.33 | -41.33% |
| P95 ITL (ms) | 18.05 | 31.29 | +73.38% |
| P99 ITL (ms) | 18.21 | 53.58 | +194.29% |
| Max ITL (ms) | 207.73 | 685.71 | +230.10% |

TRT-LLM MLA MTP on vs. off (concurrency=256)

Details

Server commands:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800 --speculative-algorithm EAGLE --speculative-eagle-topk=1 --speculative-num-draft-tokens 4 --speculative-num-steps 3

Client:

python3 -m sglang.bench_serving --backend sglang --model "deepseek-ai/DeepSeek-R1" --num-prompts 256 --dataset-name random --random-input-len 1024 --random-output-len 1024 --random-range-ratio 1 --max-concurrency=256

Results:

| Metric | MTP=off | MTP=on | % Difference |
|---|---|---|---|
| Total input tokens | 262144 | 262144 | 0.00% |
| Total generated tokens | 262144 | 262144 | 0.00% |
| Total generated tokens (retokenized) | 261253 | 261314 | +0.02% |
| Request throughput (req/s) | 5.33 | 5.79 | +8.62% |
| Input token throughput (tok/s) | 5461.20 | 5934.04 | +8.67% |
| Output token throughput (tok/s) | 5461.20 | 5934.04 | +8.67% |
| Total token throughput (tok/s) | 10922.40 | 11868.09 | +8.65% |
| Concurrency | 255.80 | 222.14 | -13.17% |
| Accept length | - | 2.79 | N/A |
| Mean E2E Latency (ms) | 47962.98 | 38332.63 | -20.10% |
| Median E2E Latency (ms) | 47963.28 | 39026.21 | -18.62% |
| Mean TTFT (ms) | 4908.56 | 4294.74 | -12.50% |
| Median TTFT (ms) | 4936.11 | 4308.44 | -12.71% |
| P99 TTFT (ms) | 8532.59 | 7987.98 | -6.38% |
| Mean ITL (ms) | 42.09 | 33.27 | -20.93% |
| Median ITL (ms) | 38.45 | 24.54 | -36.17% |
| P95 ITL (ms) | 39.91 | 46.65 | +16.90% |
| P99 ITL (ms) | 40.35 | 88.16 | +118.52% |
| Max ITL (ms) | 7936.68 | 7519.31 | -5.24% |

Checklist

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Summary of Changes

Hello @pranavm-nvidia, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the TRT-LLM MLA backend with the target_verify step of Multi-Token Prediction (MTP). This enhancement aims to leverage the performance benefits of TRT-LLM for verification tasks, which is expected to improve overall system efficiency. The changes adapt the attention mechanism's forward pass and KV cache management to properly support this new operational mode.

Highlights

  • TRT-LLM MLA Backend Integration: The TRT-LLM MLA backend is now enabled for the target_verify step of Multi-Token Prediction (MTP), aiming to improve performance.
  • KV Cache Management: Code has been added to update the Key-Value (KV) cache within the forward_extend method, which was previously missing for certain operational modes, ensuring proper state management during verification.
  • Metadata Initialization Updates: The metadata initialization and replay functions (init_forward_metadata_capture_cuda_graph, init_forward_metadata_replay_cuda_graph, init_forward_metadata) have been updated to correctly handle the target_verify mode (see the sketch after this list).
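
For intuition, a hedged sketch of that metadata dispatch, assuming ForwardMode-style checks like is_target_verify()/is_decode_or_idle(); the `_build_decode_metadata` helper and the `num_draft_tokens` field are assumptions, not the exact code:

```python
# Illustrative only: TARGET_VERIFY batches receive decode-style metadata whose
# effective sequence lengths include the speculative draft tokens.
def init_forward_metadata(self, forward_batch):
    if forward_batch.forward_mode.is_decode_or_idle():
        seq_lens = forward_batch.seq_lens
    elif forward_batch.forward_mode.is_target_verify():
        # Each request scores num_draft_tokens extra positions against its prefix.
        seq_lens = forward_batch.seq_lens + self.num_draft_tokens
    else:
        # Prefill/extend keeps the existing metadata path.
        return super().init_forward_metadata(forward_batch)

    self.forward_metadata = self._build_decode_metadata(seq_lens, forward_batch)
```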

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enables the TRT-LLM backend for the target_verify step in multi-token prediction, which should improve performance. The changes correctly route target_verify to use a decode-like path and also fix a missing KV cache update in forward_extend. The implementation for target_verify is sound, but there is significant code duplication with the forward_decode method. I've suggested refactoring this duplicated logic into a shared helper method to improve maintainability. I also pointed out a misleading comment that should be updated for clarity.
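
As a rough illustration of the suggested refactor (the class and base names are taken from the files referenced in this thread, while `_run_mla_kernel` and `_invoke_trtllm_mla` are hypothetical names, not the actual code):

```python
# Illustrative only: hoist the duplicated kernel-invocation logic out of
# forward_decode and the target_verify branch of forward_extend.
class TRTLLMMLABackend(FlashInferMLAAttnBackend):
    def _run_mla_kernel(self, q, layer, forward_batch):
        # Single place that prepares block tables / workspace and calls the
        # TRT-LLM MLA decode kernel.
        metadata = self.forward_metadata
        return self._invoke_trtllm_mla(q, layer, metadata, forward_batch)

    def forward_decode(self, q, k, v, layer, forward_batch):
        return self._run_mla_kernel(q, layer, forward_batch)

    def forward_extend(self, q, k, v, layer, forward_batch):
        if forward_batch.forward_mode.is_target_verify():
            return self._run_mla_kernel(q, layer, forward_batch)
        return super().forward_extend(q, k, v, layer, forward_batch)
```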

Comment thread python/sglang/srt/layers/attention/trtllm_mla_backend.py
Comment thread python/sglang/srt/layers/attention/trtllm_mla_backend.py Outdated
Collaborator

@fzyzcjy fzyzcjy left a comment


Hi, could you please run (1) MATH-500 with 64k generation length 3 times and (2) GPQA-Diamond with 32k generation length at least 16 times (preferably 100 times), and paste the accuracy? I have seen cases where GSM8K looks good but these benchmarks drop accuracy.

cc @kaixih - could you please share your modification to my script which helps run these
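
If it helps, a minimal repeat-run driver could look like the sketch below; the command reuses the GSM8K script shown earlier only as a placeholder, so substitute the MATH-500 / GPQA-Diamond eval script you actually use and adjust the accuracy parsing to its output:

```python
import re
import statistics
import subprocess

# Placeholder command: swap in the MATH-500 / GPQA-Diamond benchmark script.
CMD = [
    "python3", "benchmark/gsm8k/bench_sglang.py",
    "--num-shots", "5", "--num-questions", "1319",
    "--parallel", "512", "--port", "30000",
]

def run_once() -> float:
    out = subprocess.run(CMD, capture_output=True, text=True, check=True).stdout
    # The GSM8K script prints a line like "Accuracy: 0.946".
    return float(re.search(r"Accuracy:\s*([0-9.]+)", out).group(1))

if __name__ == "__main__":
    accs = [run_once() for _ in range(16)]  # >=16 runs to average out randomness
    print(
        f"runs={len(accs)} mean={statistics.mean(accs):.4f} "
        f"stdev={statistics.stdev(accs):.4f} min={min(accs):.4f} max={max(accs):.4f}"
    )
```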

@pranavm-nvidia
Collaborator Author

GPQA benchmark with MTP=off:

Details

Server command:

python -m sglang.launch_server --model-path "deepseek-ai/DeepSeek-R1" --trust-remote-code --attention-backend trtllm_mla --page-size 64 --tp-size 8 --max-running-requests 512 --cuda-graph-max-bs 512 --host "0.0.0.0" --port 30000 --mem-fraction-static 0.70 --dist-timeout 1800

Results:

| Category | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
|---|---|---|---|---|---|---|
| gpqa | pass@1 | 198 | 6068 | 1017 | 68.69% | 1.01% |
| gpqa-Physics (general) | pass@1 | 19 | 3868 | 419 | 89.47% | 0.00% |
| gpqa-Organic Chemistry | pass@1 | 72 | 8860 | 1017 | 36.11% | 1.39% |
| gpqa-Quantum Mechanics | pass@1 | 25 | 3715 | 531 | 100.00% | 0.00% |
| gpqa-Electromagnetism and Photonics | pass@1 | 6 | 3333 | 207 | 100.00% | 0.00% |
| gpqa-High-energy particle physics | pass@1 | 14 | 4101 | 394 | 92.86% | 0.00% |
| gpqa-Genetics | pass@1 | 4 | 7689 | 427 | 50.00% | 0.00% |
| gpqa-Astrophysics | pass@1 | 13 | 5102 | 317 | 100.00% | 0.00% |
| gpqa-Molecular Biology | pass@1 | 15 | 5453 | 1017 | 73.33% | 6.67% |
| gpqa-Chemistry (general) | pass@1 | 20 | 5675 | 442 | 70.00% | 0.00% |
| gpqa-Relativistic Mechanics | pass@1 | 7 | 2493 | 196 | 85.71% | 0.00% |
| gpqa-Inorganic Chemistry | pass@1 | 1 | 5775 | 216 | 100.00% | 0.00% |
| gpqa-Optics and Acoustics | pass@1 | 1 | 2990 | 112 | 100.00% | 0.00% |
| gpqa-Condensed Matter Physics | pass@1 | 1 | 1168 | 44 | 100.00% | 0.00% |

FYI @elfiegg

@fzyzcjy
Collaborator

fzyzcjy commented Sep 11, 2025

Would be great to also try R1-0528 and get ~80% for this number.
Also need to repeat it e.g. >10 times since it has huge randomness.

image

@Qiaolin-Yu Qiaolin-Yu self-assigned this Sep 11, 2025
@fzyzcjy
Collaborator

fzyzcjy commented Sep 14, 2025

Looking forward to this PR. If the testing is hard to do, maybe merge first and I will test it in my case and report bugs if accuracy drops.

@zhyncs
Collaborator

zhyncs commented Sep 15, 2025

python3 test/srt/test_deepseek_v3_fp4_4gpu.py
[2025-09-15 03:06:57 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2694, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 344, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 96, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 260, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 397, in initialize
    self.init_device_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1899, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 497, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 668, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 657, in run_once
    logits_output_or_pp_proxy_tensors = forward(
                                        ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2490, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2354, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2083, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1193, in forward
    return self.forward_core(s)
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1257, in forward_core
    return self.forward_absorb_core(*inner_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1475, in forward_absorb_core
    attn_output = self.attn_mqa(
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 108, in forward
    return forward_batch.attn_backend.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 81, in forward
    return self.forward_extend(
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 568, in forward_extend
    return super().forward_extend(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 525, in forward_extend
    prefill_wrapper_paged = self.forward_metadata.prefill_wrapper
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'prefill_wrapper'

@zhyncs
Collaborator

zhyncs commented Sep 15, 2025

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-V3-0324-FP4 --tp 4 --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --quantization modelopt_fp4 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --device cuda --host 127.0.0.1 --port 8000
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
All deep_gemm operations loaded successfully!
W0915 03:30:59.104000 1372100 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:30:59.104000 1372100 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:sglang.srt.configs.model_config:modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING:sglang.srt.server_args:TensorRT-LLM MLA only supports page_size of 32 or 64, changing page_size from None to 64.
WARNING:sglang.srt.server_args:FlashInfer TRTLLM MoE is enabled. --disable-shared-experts-fusion is automatically set.
WARNING:sglang.srt.server_args:Overlap scheduler is disabled because of using eagle speculative decoding.
[2025-09-15 03:31:00] server_args=ServerArgs(model_path='nvidia/DeepSeek-V3-0324-FP4', tokenizer_path='nvidia/DeepSeek-V3-0324-FP4', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization='modelopt_fp4', quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.81, max_running_requests=48, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=202633197, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-customer-labels', tokenizer_metrics_allowed_customer_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, enable_trace=False, oltp_traces_endpoint='localhost:4317', api_key=None, served_model_name='nvidia/DeepSeek-V3-0324-FP4', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', load_watch_interval=0.1, prefill_round_robin_balance=False, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mla', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm='EAGLE', speculative_draft_model_path='nvidia/DeepSeek-V3-0324-FP4', speculative_draft_model_revision=None, speculative_num_steps=3, speculative_eagle_topk=1, speculative_num_draft_tokens=4, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_trtllm', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, 
enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_lmcache=False, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=True, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, enable_pdmux=False, sm_group_num=3, max_mamba_cache_size=None, mamba_ssm_dtype='float32', enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_cutedsl_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
[2025-09-15 03:31:00] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:00] Using default HuggingFace chat template with detected content format: string
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
W0915 03:31:08.765000 1372776 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:31:08.765000 1372776 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
W0915 03:31:08.931000 1372774 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:31:08.931000 1372774 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0915 03:31:09.082000 1372777 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:31:09.082000 1372777 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0915 03:31:09.088000 1372775 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:31:09.088000 1372775 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0915 03:31:09.093000 1372778 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0915 03:31:09.093000 1372778 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-15 03:31:09 TP2] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-15 03:31:09 TP0] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-15 03:31:09 TP1] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:09 TP3] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:10 TP2] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:10 TP0] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:10 TP0] MLA optimization is turned on. Use trtllm_mla backend.
[2025-09-15 03:31:10 TP0] Chunked prefix cache is turned on.
[2025-09-15 03:31:10 TP0] Init torch distributed begin.
[2025-09-15 03:31:10 TP3] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-15 03:31:10 TP1] modelopt_fp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-15 03:31:11 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-15 03:31:16 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-15 03:31:16 TP0] CUDA-fused xIELU not available (No module named 'xielu') – falling back to a Python version.
For CUDA xIELU (experimental), `pip install git+https://github.com/nickjbrowning/XIELU`
[2025-09-15 03:31:16 TP1] CUDA-fused xIELU not available (No module named 'xielu') – falling back to a Python version.
For CUDA xIELU (experimental), `pip install git+https://github.com/nickjbrowning/XIELU`
[2025-09-15 03:31:16 TP2] CUDA-fused xIELU not available (No module named 'xielu') – falling back to a Python version.
For CUDA xIELU (experimental), `pip install git+https://github.com/nickjbrowning/XIELU`
[2025-09-15 03:31:16 TP3] CUDA-fused xIELU not available (No module named 'xielu') – falling back to a Python version.
For CUDA xIELU (experimental), `pip install git+https://github.com/nickjbrowning/XIELU`
[2025-09-15 03:31:16 TP1] `cute.compile` CompileOptions: options=
[2025-09-15 03:31:16 TP1] Initializing CUTE_DSL DSL
[2025-09-15 03:31:16 TP1] jit_runner
[2025-09-15 03:31:16 TP1] jit_runner
[2025-09-15 03:31:16 TP0] `cute.compile` CompileOptions: options=
[2025-09-15 03:31:16 TP0] Initializing CUTE_DSL DSL
[2025-09-15 03:31:16 TP0] jit_runner
[2025-09-15 03:31:16 TP0] jit_runner
[2025-09-15 03:31:16 TP3] `cute.compile` CompileOptions: options=
[2025-09-15 03:31:16 TP3] Initializing CUTE_DSL DSL
[2025-09-15 03:31:16 TP3] jit_runner
[2025-09-15 03:31:16 TP3] jit_runner
[2025-09-15 03:31:16 TP2] `cute.compile` CompileOptions: options=
[2025-09-15 03:31:16 TP2] Initializing CUTE_DSL DSL
[2025-09-15 03:31:16 TP2] jit_runner
[2025-09-15 03:31:16 TP2] jit_runner
... (several hundred repeated "jit_runner" log lines from TP0-TP3 omitted) ...
[2025-09-15 03:31:17 TP0] Load weight begin. avail mem=176.25 GB
[2025-09-15 03:31:17 TP1] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-09-15 03:31:17 TP3] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-09-15 03:31:17 TP2] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-09-15 03:31:17 TP0] Detected nvfp4 checkpoint. Please note that the format is experimental and subject to change.
[2025-09-15 03:31:18 TP1] Using model weights format ['*.safetensors']
[2025-09-15 03:31:18 TP3] Using model weights format ['*.safetensors']
[2025-09-15 03:31:18 TP2] Using model weights format ['*.safetensors']
[2025-09-15 03:31:19 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/80 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: ... (intermediate progress lines elided) ...
Loading safetensors checkpoint shards: 100% Completed | 80/80 [01:45<00:00,  1.37s/it]
Loading safetensors checkpoint shards: 100% Completed | 80/80 [01:45<00:00,  1.31s/it]

[2025-09-15 03:33:08 TP0] Load weight end. type=DeepseekV3ForCausalLM, dtype=torch.bfloat16, avail mem=72.23 GB, mem usage=104.02 GB.
[2025-09-15 03:33:10 TP0] KV Cache is allocated. #tokens: 591872, KV size: 38.74 GB
[2025-09-15 03:33:10 TP0] Memory pool end. avail mem=33.39 GB
[2025-09-15 03:33:10 TP2] KV Cache is allocated. #tokens: 591872, KV size: 38.74 GB
[2025-09-15 03:33:10 TP3] KV Cache is allocated. #tokens: 591872, KV size: 38.74 GB
[2025-09-15 03:33:10 TP1] KV Cache is allocated. #tokens: 591872, KV size: 38.74 GB
[2025-09-15 03:33:11 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=32.80 GB
[2025-09-15 03:33:11 TP0] Capture cuda graph bs [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 40, 48]
Capturing batches (bs=48 avail_mem=32.68 GB):   0%|                                                                  | 0/22 [00:00<?, ?it/s][2025-09-15 03:33:12 TP1] Registering 0 cuda graph addresses
Capturing batches (bs=48 avail_mem=32.68 GB):   0%|                                                                  | 0/22 [00:00<?, ?it/s]
[2025-09-15 03:33:12 TP0] Registering 0 cuda graph addresses
[2025-09-15 03:33:12 TP2] Registering 0 cuda graph addresses
[2025-09-15 03:33:12 TP3] Registering 0 cuda graph addresses
[2025-09-15 03:33:12 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2694, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 344, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 96, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 260, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 397, in initialize
    self.init_device_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1899, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 497, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 668, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 657, in run_once
    logits_output_or_pp_proxy_tensors = forward(
                                        ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2490, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2354, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2083, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1193, in forward
    return self.forward_core(s)
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1257, in forward_core
    return self.forward_absorb_core(*inner_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1475, in forward_absorb_core
    attn_output = self.attn_mqa(
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 108, in forward
    return forward_batch.attn_backend.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 81, in forward
    return self.forward_extend(
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 568, in forward_extend
    return super().forward_extend(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_mla_backend.py", line 525, in forward_extend
    prefill_wrapper_paged = self.forward_metadata.prefill_wrapper
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'prefill_wrapper'
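For context, a tiny standalone repro of the failure shape (hypothetical class, not the actual sglang metadata): the target_verify/capture path reaches the FlashInfer-MLA extend code while forward_metadata is still unset, so the attribute access fails exactly as in the traceback above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeForwardMetadata:
    prefill_wrapper: Optional[object] = None   # never populated on the verify/capture path

def forward_extend(metadata: Optional[FakeForwardMetadata]):
    # The real code dereferences unconditionally, which is what raises when the
    # metadata is None during CUDA-graph capture.
    return metadata.prefill_wrapper

try:
    forward_extend(None)
except AttributeError as e:
    print("reproduced:", e)    # 'NoneType' object has no attribute 'prefill_wrapper'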

[2025-09-15 03:33:12 TP1] Scheduler hit an exception: (same traceback as the TP3 exception above, ending in AttributeError: 'NoneType' object has no attribute 'prefill_wrapper')

@zhyncs
Collaborator

zhyncs commented Sep 17, 2025

I still encountered this issue #10281 (comment) @pranavm-nvidia

@zhyncs
Collaborator

zhyncs commented Sep 17, 2025

@pranavm-nvidia @kushanam please merge latest main and fix the conflicts

@zhyncs
Collaborator

zhyncs commented Sep 17, 2025

python3 -m sglang.launch_server \
--model-path nvidia/DeepSeek-V3-0324-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
python3 -m sglang.launch_server \
--model-path nvidia/DeepSeek-V3-0324-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--prefill-attention-backend fa4 --speculative-attention-mode decode

We need to ensure that this PR does not break the above two commands. Thanks! @pranavm-nvidia @kushanam
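For instance, a minimal smoke check for either launch could look like the sketch below (it assumes the default port 30000 and the requests package; the model name is a placeholder, and lm_eval remains the real accuracy check):

import requests

URL = "http://127.0.0.1:30000/v1/chat/completions"   # same endpoint lm_eval targets

payload = {
    "model": "default",                               # placeholder; the server hosts a single model
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "max_tokens": 64,
}

resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])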

@zhyncs
Collaborator

zhyncs commented Sep 18, 2025

The latest one still doesn't work.

python3 -m sglang.launch_server \
--model-path nvidia/DeepSeek-V3-0324-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2351, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 2080, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1190, in forward
    return self.forward_core(s)
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1254, in forward_core
    return self.forward_absorb_core(*inner_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1472, in forward_absorb_core
    attn_output = self.attn_mqa(
                  ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/radix_attention.py", line 108, in forward
    return forward_batch.attn_backend.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 81, in forward
    return self.forward_extend(
           ^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 671, in forward_extend
    workspace_buffer=metadata.workspace,
                     ^^^^^^^^^^^^^^^^^^
AttributeError: 'TRTLLMMLADecodeMetadata' object has no attribute 'workspace'
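For reference, a standalone illustration of the mismatch (made-up class and field names; the real object is TRTLLMMLADecodeMetadata): the metadata built for the decode path simply does not carry the workspace field that the new extend path dereferences, so either a defensive lookup or an explicit field would be needed.

from dataclasses import dataclass
import torch

@dataclass
class DecodeMetadata:                     # stand-in for TRTLLMMLADecodeMetadata (no `workspace` field)
    block_kv_indices: torch.Tensor

def get_workspace(metadata) -> torch.Tensor:
    # Defensive sketch: allocate a scratch buffer when the field is absent instead of
    # raising AttributeError during CUDA-graph capture (the size here is illustrative).
    ws = getattr(metadata, "workspace", None)
    if ws is None:
        ws = torch.empty(128 * 1024 * 1024, dtype=torch.uint8)
    return ws

md = DecodeMetadata(block_kv_indices=torch.zeros(1, dtype=torch.int32))
print(get_workspace(md).numel())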

@zhyncs
Collaborator

zhyncs commented Sep 18, 2025

GSM8K accuracy is lower than expected.

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9158|±  |0.0076|
|     |       |strict-match    |     8|exact_match|↑  |0.8052|±  |0.0109|
python3 -m sglang.launch_server \
--model-path nvidia/DeepSeek-V3-0324-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4


pip3 install "lm_eval[api]"
lm_eval --model local-chat-completions --model_args model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 128 --apply_chat_template --num_fewshot 8

Pranav Marathe added 5 commits September 19, 2025 17:24
Enables `forward_extend` in the TRT-LLM MLA backend to be used
for `target_verify` in MTP.

Also adds code to update the KV cache in `forward_extend` which was
previously missing.
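To illustrate what "update the KV cache in forward_extend" means, here is a minimal sketch with made-up shapes and helper names (the real backend goes through the token-to-KV-pool API):

import torch

num_pages, page_size, kv_dim = 8, 64, 576            # illustrative sizes (page size 64 as in the runs above)
kv_buffer = torch.zeros(num_pages * page_size, kv_dim)

def save_kv_cache(kv_buffer: torch.Tensor, cache_loc: torch.Tensor, k: torch.Tensor) -> None:
    # Scatter the new tokens' latent KV into their assigned slots so that the
    # attention kernel reads up-to-date entries from the paged buffer.
    kv_buffer[cache_loc] = k

new_k = torch.randn(4, kv_dim)                        # e.g. 4 draft/verify tokens
cache_loc = torch.tensor([0, 1, 2, 3])                # slots assigned by the allocator
save_kv_cache(kv_buffer, cache_loc, new_k)
print(bool(kv_buffer[:4].abs().sum() > 0))            # True: the KV entries are now populated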
Comment thread python/sglang/srt/layers/attention/trtllm_mla_backend.py
@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

I wanted to run a quick test and hit the following launch error. May I know what your environment is? (I tried both the stable flashinfer and the latest main flashinfer.)

Details
[2025-09-22 01:30:36 TP0] Registering 0 cuda graph addresses
[2025-09-22 01:30:55 TP3] Registering 0 cuda graph addresses
[2025-09-22 01:30:55 TP1] Registering 0 cuda graph addresses
[2025-09-22 01:30:55 TP2] Registering 0 cuda graph addresses
[2025-09-22 01:30:55 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 391, in __init__
    self.capture()
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 509, in capture
    ) = self.capture_one_batch_size(bs, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 680, in capture_one_batch_size
    run_once()
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 669, in run_once
    logits_output_or_pp_proxy_tensors = forward(
                                        ^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 2520, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 2384, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 2113, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 1229, in forward
    return self.forward_core(s)
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 1293, in forward_core
    return self.forward_absorb_core(*inner_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/models/deepseek_v2.py", line 1505, in forward_absorb_core
    attn_output = self.attn_mqa(
                  ^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/layers/radix_attention.py", line 108, in forward
    return forward_batch.attn_backend.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/layers/attention/base_attn_backend.py", line 81, in forward
    return self.forward_extend(
           ^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/layers/attention/trtllm_mla_backend.py", line 617, in forward_extend
    raw_out = flashinfer.decode.trtllm_batch_decode_with_kv_cache_mla(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/flashinfer/flashinfer/decode.py", line 2331, in trtllm_batch_decode_with_kv_cache_mla
    run_func(
  File "/data/numa0/tom/venvs/sgl/lib/python3.12/site-packages/torch/_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error in function 'trtllm_paged_attention_launcher' at /data/numa0/tom/primary_synced/flashinfer/csrc/trtllm_fmha_kernel_launcher.cu:172: Missing TRTLLM-GEN kernel (decode): qkvLayout=2, maskType=0, kernelType=2, tileScheduler=0, multiCtasKvMode=1, headDimPerCtaV=512, headDimQk=576, headDimV=512, tileSizeKv=128, numTokensPerPage=64, maxNumHeadsQPerKvInCta=16, reuseSmemKForV=0, uses2CtaMma=0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/managers/scheduler.py", line 2819, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/managers/scheduler.py", line 354, in __init__
    self.tp_worker = TpWorkerClass(
                     ^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/managers/tp_worker.py", line 97, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 281, in __init__
    self.initialize(min_per_gpu_memory)
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 424, in initialize
    self.init_device_graphs()
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/model_runner.py", line 1801, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/numa0/tom/primary_synced/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 393, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Error in function 'trtllm_paged_attention_launcher' at /data/numa0/tom/primary_synced/flashinfer/csrc/trtllm_fmha_kernel_launcher.cu:172: Missing TRTLLM-GEN kernel (decode): qkvLayout=2, maskType=0, kernelType=2, tileScheduler=0, multiCtasKvMode=1, headDimPerCtaV=512, headDimQk=576, headDimV=512, tileSizeKv=128, numTokensPerPage=64, maxNumHeadsQPerKvInCta=16, reuseSmemKForV=0, uses2CtaMma=0
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

My full reproduction on B200

docker rm -f tom_ac6284

docker run -it -d --name tom_ac6284 --restart=always --gpus all --ipc=host --network=host --privileged --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN -v /root/.cache:/root/.cache -v /home/yineng/tom:/host_home lmsysorg/sglang:v0.5.3rc1 /bin/bash -c 'echo now sleep forever ; while true; do sleep 2; done'

docker exec -it tom_ac6284 /bin/zsh

pip install --force-reinstall flashinfer_python sgl-kernel

git remote add pranavm-nvidia-sglang https://github.com/pranavm-nvidia/sglang
git fetch pranavm-nvidia-sglang 0fa65cd360c5304e62e3c0d0ea9ee4da2b5ec69e
git checkout FETCH_HEAD
git log

CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server \
--model-path /dev/shm/DeepSeek-R1-0528-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

Results

RuntimeError: Error in function 'trtllm_paged_attention_launcher' at /usr/local/lib/python3.12/dist-packages/flashinfer/data/csrc/trtllm_fmha_kernel_launcher.cu:172: Missing TRTLLM-GEN kernel (decode): qkvLayout=2, maskType=0, kernelType=2, tileScheduler=0, multiCtasKvMode=1, headDimPerCtaV=512, headDimQk=576, headDimV=512, tileSizeKv=128, numTokensPerPage=64, maxNumHeadsQPerKvInCta=16, reuseSmemKForV=0, uses2CtaMma=0

P.S. The installed versions are correct:

➜  sglang git:(0fa65cd3) pip list | grep flashinfer
flashinfer-python         0.3.1
➜  sglang git:(0fa65cd3) pip list |grep kernel
sgl-kernel                0.3.11
➜  sglang git:(0fa65cd3) python
Python 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sglang
>>> sglang.__path__
['/sgl-workspace/sglang/python/sglang']
>>> exit()

@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

Hi @pranavm-nvidia @zhyncs, let me help fix the accuracy issues; my fix is WIP in https://github.com/fzyzcjy/sglang/tree/feat/ac6280

@fzyzcjy fzyzcjy requested a review from ping1jing2 as a code owner September 22, 2025 13:21
@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

Test command

# PR
SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 python3 -m sglang.launch_server \
--model-path /data/numa0/tom/downloaded_models/models--nvidia--DeepSeek-V3-0324-FP4/snapshots/d03662cdc34b56eab1315d4557e395e6b4944782 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

# baseline (no MTP)
SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 python3 -m sglang.launch_server \
--model-path nvidia/DeepSeek-V3-0324-FP4 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4

# test
while true; do lm_eval --model local-chat-completions --model_args model=gpt-oss,base_url=http://127.0.0.1:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 128 --apply_chat_template --num_fewshot 8; done

Results (the "strict-match" part in lm-eval)

  • baseline: 946,937,942,939,942,946,943,943,942,943,947,950,942,948,941,945,938,944,936,939
  • before commit: 89.5,88.8,88.3,88.9,89.5,89.4,89.0
  • after commit: 94.5,94.1,94.9,94.1,94.5,94.5,945,942,948,949,943,942,948,944

For more subtle checks, I will need to run GPQA on my GB200 setup later, but there should not be any big issues, since GSM8K looks fine.

Remarks

  • Loading the R1 FP4 model leads to the error in "Enables TRT-LLM backend to be used for target_verify" #10281 (comment), while the V3 FP4 model works fine. The direct cause is that R1 FP4 uses an fp8 KV dtype; I will look into it later when checking fp8 KV support (see the sketch after this list).
  • In the latest main, modifying metadata.max_seq_len in init_forward_metadata_replay_cuda_graph should, I think, have no effect and is a bug; I have not dug into or verified this yet.
  • In the latest main, there is an o_sf_scale=-1.0 (note the negative sign); I have not dug into why yet.
  • Shall we unify the code between decode and target_verify?
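Regarding the first bullet, a minimal sketch of the kind of dtype guard implied there (hypothetical function and flag names; only torch.float8_e4m3fn and the backend name come from this thread):

import torch

def check_kv_dtype(kv_cache_dtype: torch.dtype, verify_path_supports_fp8: bool) -> None:
    # Fail fast with a readable message instead of hitting a missing-kernel error
    # deep inside the attention launcher.
    if kv_cache_dtype == torch.float8_e4m3fn and not verify_path_supports_fp8:
        raise NotImplementedError(
            "trtllm_mla target_verify does not support an fp8 KV cache yet"
        )

check_kv_dtype(torch.bfloat16, verify_path_supports_fp8=False)      # OK
check_kv_dtype(torch.float8_e4m3fn, verify_path_supports_fp8=True)  # OK after the fp8 update below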

@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

cc @zhyncs shall we merge this first? (If I find subtle issues in the GB200 setup, I will post them separately.)

@fzyzcjy
Collaborator

fzyzcjy commented Sep 22, 2025

UPDATE: Support for fp8

test command

SGLANG_USE_CUTLASS_BACKEND_FOR_FP4_GEMM=1 python3 -m sglang.launch_server \
--model-path /data/numa0/tom/downloaded_models/models--nvidia--DeepSeek-V3-0324-FP4/snapshots/d03662cdc34b56eab1315d4557e395e6b4944782 \
--tp 4 --attention-backend trtllm_mla \
--moe-runner-backend flashinfer_trtllm \
--quantization modelopt_fp4 \
--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 \
--kv-cache-dtype fp8_e4m3

results (strict-match in lm-eval)

  • baseline: not tested
  • before change: error when starting server
  • after change: 94.3, 94.8, 94.5, 93.8, 94.9, 94.7, 94.2, 93.7, 93.1

So this looks fine as well.

@zhyncs zhyncs merged commit b1bb8e7 into sgl-project:main Sep 22, 2025
63 of 77 checks passed
v = v.view(-1, layer.tp_k_head_num, layer.v_head_dim)

if forward_batch.forward_mode.is_target_verify():
metadata = (

Note for technical debt: We can refactor the change to this function to be reused between forward_decode and target_verify. We can simply pass in ForwardMode and modify the seq_lens and q tensors.
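A rough sketch of that refactor idea (hypothetical helper name and shapes; the real code would live in trtllm_mla_backend.py and operate on the backend's metadata):

import torch

def prepare_q_and_seq_lens(q: torch.Tensor, seq_lens: torch.Tensor,
                           is_target_verify: bool, draft_tokens: int = 4):
    if is_target_verify:
        # q arrives as (bs, draft_tokens, head_dim): flatten the draft dimension and
        # extend seq_lens so the kernel sees draft_tokens new tokens per request.
        q = q.reshape(-1, q.shape[-1])
        seq_lens = seq_lens + draft_tokens
    else:
        # Plain decode: a single new token per request.
        seq_lens = seq_lens + 1
    return q, seq_lens

q, lens = prepare_q_and_seq_lens(torch.randn(2, 4, 512), torch.tensor([10, 20]),
                                 is_target_verify=True)
print(q.shape, lens)  # torch.Size([8, 512]) tensor([14, 24])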

HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Co-authored-by: Pranav Marathe <pranavm@ipp1-3309.ipp1a1.colossus.nvidia.com>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
BraveY pushed a commit to openanolis/sglang that referenced this pull request Oct 22, 2025
Merge branch sglang_public_tracker of git@code.alipay.com:Theta/SGLang.git into main
https://code.alipay.com/Theta/SGLang/pull_requests/342?tab=diff

Reviewed-by: 苏墨 <xuyongfei.xyf@antgroup.com>


* [router] minor code clean up in server startup (sgl-project#10470)
* [bugfix] fix typo (sgl-project#10471)
* [PD metrics] Add latency Histogram metrics of each stage for generate requests (sgl-project#8710)
* [CI] Fix runner for sgl-kernel (sgl-project#9887)
* fix(internvl): fix accuracy issue of normalization (sgl-project#10375)
* fix: gpt-oss streaming dropping normal content when tools are provided but not used (sgl-project#9657)
* model: support solar (sgl-project#8189)
* fix: resolve sgl-kernel ut (sgl-project#10476)
* [1/2] Speed up trtllm_mla attention backend (>10% e2e) (sgl-project#10473)
* Fix `--dataset-path` in `bench_one_batch_server` (sgl-project#10475)
* [Env] minimal version for organizing envs (sgl-project#10479)
* chore: bump v0.3.10 sgl-kernel (sgl-project#10478)
* [router] multi model registration fix (sgl-project#10481)
* [2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (sgl-project#10286)
* [Auto Sync] Update registry.py (20250915) (sgl-project#10484)
* [router] fix worker registration in multi model mode (sgl-project#10486)
* fix crash of DeepSeek-V3 update_weights_from_disk (sgl-project#8863)
* Temporay work-around for rocm 7.0.0 alpha with enabling data-parallel issue (sgl-project#10434)
* [Hicache] Evaluate Per-Round Metrics in Multiturn Bench (sgl-project#10203)
* [ModelOpt] Respect `kv_cache_quant_algo` in ModelOpt checkpoints (sgl-project#10336)
* Add Logprobs unit test with a loose threshold (sgl-project#10230)
* [router] add router db connector for responses api (sgl-project#10487)
* Remove wrong imports `from sglang.python` (sgl-project#10493)
* [router] fix router manager and router init in server (sgl-project#10499)
* Cache the result of `is_blackwell` platform check (sgl-project#10498)
* feat: update support for qwen3next model (sgl-project#10466)
* Minor fix lint introduced by sgl-project#10466 (sgl-project#10507)
* chore: upgrade sgl-kernel 0.3.10 (sgl-project#10500)
* Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (sgl-project#10491)
* Fix CI when sgl-kernel is changed but srt is not changed (sgl-project#10515)
* Support sgl-router parallel_batch in bench_one_batch_server (sgl-project#10506)
* [CPU] fix CPU backend sel. issue for Llama4 (sgl-project#10511)
* adjust import setuptools_rust (sgl-project#10524)
* Fix formatting in long code blocks (sgl-project#10528)
* skip vision_model for lora (sgl-project#10530)
* [2/2] Speed up trtllm_mla attention backend (sgl-project#10474)
* support using fa4 on deepseek on blackwell (sgl-project#9928)
* [Auto Sync] Update scheduler_profiler_mixin.py, rpd_utils.p... (20250916) (sgl-project#10494)
* [Auto Sync] Update activation.py, chunk_cache.py, utils.py (20250917) (sgl-project#10538)
* feat: add priority based scheduling with priority based request acceptance and preemption (sgl-project#8746)
* Fix decord dependency for aarch64 docker build (sgl-project#10529)
* enable prefix cache with dp (sgl-project#10459)
* [bugfix]hicache bench_long_context.py run failed (sgl-project#10523)
* Remove duplicated code (sgl-project#10545)
* CUDA Arch Independent (sgl-project#8813)
* [bench] Fix random seed in `bench_one_batch_server` (sgl-project#10548)
* [HiCache] Add tests for hicache storage mooncake backend (sgl-project#10171)
* [BugFix] Fix incorrect hidden_states_tensor in pd disaggregation + eagle (sgl-project#9976)
* fix: update dsv3 fp4 ut (sgl-project#10584)
* vlm: remove redundant d2h movement of mm feature tensors (sgl-project#9987)
* Enable trtllm mla prefix extend (sgl-project#10526)
* [ROCm] Fix fp8 quantization accuracy issue. (sgl-project#10558)
* [HICache] introduce evict policy (sgl-project#10190)
* PullRequest: 303 Revert "PullRequest: 291 for fa3 kvcache: revert github "convert mla kvcache to bfloat16""
* aiter v0.1.5.post2 (sgl-project#10563)
* [PD] Improve disaggregation common backend and refactor mooncake backend (sgl-project#10273)
* chore: upgrade mooncake 0.3.6 (sgl-project#10596)
* [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
* Scale kkt after reduction (sgl-project#10604)
* fix deepep assert when PD disaggregation == null (sgl-project#8274)
* [RL] Add destroy process group api (sgl-project#9979)
* Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
* update deepep version for qwen3-next deepep moe (sgl-project#10624)
* support qwen3-next-fp8 deepep (sgl-project#10622)
* Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
* [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
* Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
* feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
* Garbage collector regression in the online server (sgl-project#10621)
* [router] refactor worker to builder pattern 1/n (sgl-project#10628)
* refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
* [Feature] Speculative decoding support lookahead (sgl-project#9873)
* [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
* [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
* model support: Sarashina2VisionForCausalLM (sgl-project#10632)
* feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
* chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
* Hicache L3 backend mooncake optimization configuration reading method (sgl-project#10319)
* [router] refactor worker to builder pattern 2/n (sgl-project#10633)
* [Feature]feat(get_ip): unify get_ip_xxx (sgl-project#10081)
* [router] refactor worker to builder pattern 3/n (sgl-project#10647)
* [sgl-kernel] Support moe_sum_reduce cuda kernel (sgl-project#10321)
* [router] refactor worker to builder pattern 4/n (sgl-project#10650)
* Fix fast decode plan for flashinfer v0.4.0rc1 and upgrade sgl-kernel 0.3.11 (sgl-project#10634)
* [router] refactor worker to builder pattern 5/n (sgl-project#10653)
* [HiCacheStorage]support page_first_direct layout for generic set&get (sgl-project#10522)
* [router] preserve order of json params using preserve_order feature (sgl-project#10661)
* [router] refactor router and worker management 1/n (sgl-project#10664)
* fix: resolve sync issue (sgl-project#10668)
* [Auto Sync] Update .clang-format (20250919) (sgl-project#10670)
* [router] refactor router and worker management 2/n (sgl-project#10666)
* router-spec: Reorder `ChatCompletionRequest` and fix validation logic (sgl-project#10675)
* chore: cleanup docker image (sgl-project#10671)
* limit sgl-kernel causal conv1d to cuda only (sgl-project#10648)
* [Auto Sync] Update model_runner.py (20250920) (sgl-project#10679)
* [router] refactor router and worker management 2.5/n (sgl-project#10677)
* [1/2] Support deterministic inference with flashinfer attention backend (sgl-project#10645)
* [Auto Sync] Update deepseek_v2.py (20250920) (sgl-project#10683)
* chore: upgrade mooncake 0.3.6.post1 to fix gb200 dockerfile (sgl-project#10681)
* [Performance] Qwen3-Next: optimize causal_conv1d_fn triton kernel - up to 9% faster (sgl-project#10680)
* Replace os.environ in layernorm.py (sgl-project#10684)
* fix(disagg): fix sending KV cache in case of MLA for NIXL backend (sgl-project#10673)
* fix: update run_suite (sgl-project#10685)
* fix: remove awq_dequantize deps (sgl-project#10686)
* [Auto Sync] Update modelopt_quant.py (20250920) (sgl-project#10688)
* [Feature] Support deterministic inference with FA3 backend (sgl-project#10651)
* feat: update server args  (sgl-project#10696)
* Super tiny fix extra logs (sgl-project#10697)
* [3/4] Speed up CSGMV backend perf by 10% through dynamic chunking + kernel optimization  (sgl-project#10592)
* Update release-docs.yml (sgl-project#10706)
* Refactors radix cache for extra key support (sgl-project#10317)
* [Router]fix: fix get_load missing api_key (sgl-project#10385)
* fix: disable gpt-oss b200 ut (sgl-project#10716)
* Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (sgl-project#10714)
* [Auto Sync] Update deepseek_v2.py (20250922) (sgl-project#10717)
* Support deterministic inference with triton backend (sgl-project#10694)
* [deterministic inference] Move batch invariant pkg to sglang (sgl-project#10695)
* [2/2] Support deterministic inference for temperature > 0 (sgl-project#10678)
* [Ascend] codeowner updates for ascend related files (sgl-project#10699)
* [theta] Support custom multimodal sampling for qwen-vl
* revert e61d08c [theta] Support qwen-vl multimodal...
* PullRequest: 306 [theta] Support custom multimodal sampling for qwen-vl
* [4/4] Introduce CachedKernel to reduce CSGMV kernel launch overheads by 60% (sgl-project#10709)
* Convert FLASHINFER_WORKSPACE_SIZE to integer (sgl-project#10731)
* EPLB: prefer to use physical experts in the same node (sgl-project#9849)
* fix capture_bs when speculative decoding enabled (sgl-project#10730)
* Fix flaky logprobs test (sgl-project#10728)
* Fix CI TestChunkedSGMV (sgl-project#10737)
* [Docs, minor] Fix LLM doc matrix (sgl-project#10753)
* Add warnings and remove dependency for deterministic inference (sgl-project#10724)
* bugfix: Fix `get_worker_urls_for_model` in http/router.rs (sgl-project#10754)
* [router] refactor router and worker management 3/n (sgl-project#10727)
* [router] update ci so only execute benchmarks when labels are added (sgl-project#10757)
* Fix MTP MoE weight loading with NVFP4 target model. (sgl-project#10758)
* chore: bump sgl-kernel v0.3.12 (sgl-project#10732)
* [Generative Score API] Added test_scores_api.py to github CICD to run per commit (sgl-project#10755)
* refactor zero copy (sgl-project#10300)
* Fix multimodal registry and code sync scripts (sgl-project#10759)
* Enables TRT-LLM backend to be used for target_verify (sgl-project#10281)
* fix: kv events with tp > 1 (sgl-project#10541)
* [Auto Sync] Update flashattention_backend.py (20250922) (sgl-project#10762)
* [Feature] Add MLAProcess for DeepSeek MLA on NPU (sgl-project#10130)
* [Ascend] optimize Qwen-vl on Ascend (sgl-project#10556)
* [Ascend]optimize Qwen3 on Ascend (sgl-project#10574)
* [Auto Sync] Update configurer.py (20250923) (sgl-project#10765)
* [router] refactor router and worker management 4/n (sgl-project#10756)
* PullRequest: 310 Add the BailingMoEV3 model and its MLA support
* [router] remove pd router draining channel (sgl-project#10767)
* [router] fix logger type mismatch (sgl-project#10774)
* Use simulate acc len from `sglang.environ` (sgl-project#10771)
* Fix trtllm_mla slow concat kernel in MTP (sgl-project#10777)
* Move cached kernel to srt.utils (sgl-project#10776)
* feat: unify dockerfiles (sgl-project#10705)
* Introduce `FutureMap` (sgl-project#10715)
* chore: upgrade sgl-kernel 0.3.12 (sgl-project#10782)
* followup: clean up dockerfiles and release yamls  (sgl-project#10783)
* Clean up server args (sgl-project#10770)
* move `environ` into `sglang.srt` to avoid break SRT auto sync. (sgl-project#10791)
* Fix hicache mooncake backend CI (sgl-project#10792)
* [router] fix cache aware routing strategy and lock contention (sgl-project#10773)
* [router] responses api POST and GET with local storage (sgl-project#10581)
* model: support qwen3-vl series (sgl-project#10323)
* [fix][pd-disag]no need set next batch sampling info done in prefill (sgl-project#10259)
* [ROCm] Update aiter to v0.1.5.post3 (sgl-project#10812)
* [router] use dashmap for radix tree instead of hash for multi model (sgl-project#10814)
* router(grpc): Implement route for chat_cmpl endpoint (sgl-project#10761)
* fix ceval (sgl-project#10504)
* Remove duplicate code in qwen2 model (sgl-project#10540)
* [router] fix axum default body limit (sgl-project#10818)
* Fix latest main ci (sgl-project#10799)
* add tunning files for QWEN-3-NEXT (sgl-project#10794)
* [Auto Sync] Update protocol.py (20250923) (sgl-project#10820)
* fix: draft model IMA by overide max_positional_embeddings (sgl-project#10787)
* [Auto Sync] Update elementwise.py (20250923) (sgl-project#10823)
* [Auto Sync] Update simple_eval_common.py (20250923) (sgl-project#10824)
* [router] Support streaming for Openai Router Response api  (sgl-project#10822)
* [router] add auth middleware for api key auth (sgl-project#10826)
* [Auto Sync] Update load_config.py, model_config.py, configu... (20250923) (sgl-project#10825)
* Revert "[fix][pd-disag]no need set next batch sampling info done in prefill" (sgl-project#10828)
* Add CI timeout guidelines (sgl-project#10829)
* [theta] fix serving_tokenization.py
* feat: add cache_salt support to request (sgl-project#10718)
* fix bailing_moe with enable_dp_attention (sgl-project#10860)
* ci: free space on workers for build (sgl-project#10786)
* router-grpc: Support jinja chat template content format detection (sgl-project#10832)
* [router] select first healthy worker on proxied get requests (sgl-project#10827)
* chore: Initial support for input config files (sgl-project#10534)
* router-grpc: Add tools processing and other paramters for apply_chat_template (sgl-project#10877)
* [router] consolidate health endpoints and flush cache (sgl-project#10876)
* Restruct sgl-kernel benchmark (sgl-project#10861)
* [Bug] Fix Issue#10215 (sgl-project#10572)
* [router] consolidate worker get loads (sgl-project#10880)
* [router] Support Oracle DB(ATP) Data Connector (sgl-project#10845)
* [router] simplify tokenizer dev doc (sgl-project#10895)
* [Auto Sync] Update model_config.py (20250925) (sgl-project#10885)
* [ci feature] add ci monitor (sgl-project#10872)
* [HiCache] Cleaning the deprecated host memory state (sgl-project#10778)
* integrate AIBrix KVcache (sgl-project#10376)
* Add fuse_moe per-channel tune (sgl-project#10915)
* [router] consolidate worker load monitoring (sgl-project#10894)
* router: Fix constraint proto and `build_constraint` in grpc router (sgl-project#10881)
* Refactor kv_cache_scheme handling for quantization (sgl-project#10132)
* refactor: Move `grpc/client.rs` to `grpc_client/sglang_scheduler.rs` (sgl-project#10924)
* fix env flashinfer (sgl-project#10910)
* [minor] Remove deprecated function `get_ip` (sgl-project#10883)
* Rename customer label -> custom label (sgl-project#10899)
* [router] change log level to warning (sgl-project#10926)
* [router][refactor] Clean up protobuf fields (sgl-project#10923)
* Replace the Kimi-K2 generated tool call idx with history tool call count (sgl-project#10612)
* [ci] add ci-monitor workflow (sgl-project#10898)
* Remove pull_request trigger from CI monitor workflow (sgl-project#10932)
* router: Support parallel sampling num > 1 in grpc_server and non-stream handling (sgl-project#10929)
* Revert "Refactor kv_cache_scheme handling for quantization (sgl-project#10132)" (sgl-project#10935)
* Update CODEOWNERS to include JustinTong0323 in FC (sgl-project#10939)
* [PD-HiCache]: Support Async Offloading KVCache In Decode Side (sgl-project#10192)
* CI: Fix docker manifest build (sgl-project#10936)
* [router] update owners for router components (sgl-project#10927)
* Fuse write kv buffer into rope for qwen3 moe & bailing moe (sgl-project#10749)
* [router] add grpc client get and set (sgl-project#10955)
* [router]fix code owner syntax error (sgl-project#10956)
* [router] move grpc client from router to worker and builder (sgl-project#10958)
* [router] add move grpc worker management from router to worker manager (sgl-project#10960)
* [router] grpc router regular mode import cleanup (sgl-project#10963)
* [router] remove old/oudated/useless comments (sgl-project#10967)
* [router] remove old/oudated/useless comments across code base (sgl-project#10968)
* ci: fix rate-limit of huggingface with hf auth login (sgl-project#10947)
* Update label field comment to indicate deprecation (sgl-project#10970)
* Restruct gpu_memory_settings in a unify function and relax max_cuda_graph_bs (sgl-project#10372)
* ci: refactor nightly test (sgl-project#10495)
* refactor loading weights from remote instance coding format (sgl-project#10941)
* [router][grpc] Add helpfer functions for decoder in router.rs and fix specs (sgl-project#10971)
* Add simple docker file for B300 (sgl-project#10944)
* Ci monitor support performance (sgl-project#10965)
* [HiCache]: Support dynamic loading backends for hicache (sgl-project#10551)
* [Bugfix][Minor][Benchmark] Fix some bugs due to PR sgl-project#10495 (sgl-project#10982)
* [router][grpc] Support E2E non-stream chat completions (sgl-project#10980)
* fix: fp8 quantization failure of qwen 2.5 VL 7B model (sgl-project#10112)
* [Fix] RuntimeError: get_cfg Unsupported input_type:Float4_e2m1fn_x2 in using aiter-mxfp4-moe (sgl-project#10981)
* fix: make inference deterministic for large TP (sgl-project#10930)
* Add auth to get server info (sgl-project#10751)
* PullRequest: 315 bailingMoE: Fix deepep_mode keyerror
* Add support for topk metadata transferring for PD (sgl-project#10616)
* [PD] Extract the PP transfer layer calculate logic from Mooncake to Common backend (sgl-project#10565)
* Use jsonschema to constrain required or specific tool choice (sgl-project#10550)
* Fix profiler (sgl-project#10997)
* [router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (sgl-project#10995)
* [router] basic mcp support for openai router response api (sgl-project#10978)
* [router] fix chat template loading and tokenizer path (sgl-project#10999)
* Fix CI failure of TypeError: RotaryEmbedding.forward_cpu() got an unexpected keyword argument 'fused_set_kv_buffer_arg' (sgl-project#11009)
* [bugfix]Add empty_context import to two_batch_overlap.py (sgl-project#10964)
* prepare for sglang+verl (sgl-project#10555)
* [sgl-kernel] Optimize concat_mla_k kernel (sgl-project#10543)
* [HiCache] bug: fix mooncake store batch set v1 (sgl-project#11013)
* Fix FusedSetKVBufferArg  in RotaryEmbedding (sgl-project#11003)
* Update GLM-4.5 Model Doc (sgl-project#11017)
* [router] migrate to rust python module for pythonic parser (sgl-project#11033)
* fix: show failed models in nightly ci (sgl-project#10986)
* [router][tool call] Support normal content extraction before tool call (streaming) (sgl-project#11038)
* [router] add harmony tool parser base structure and interface (sgl-project#11036)
* Unify SGL Kernel Releases (sgl-project#10701)
* [1/2] Support FA4 for MHA Prefill in sgl-kernel (sgl-project#10940)
* fix: check if weights are already local before downloading (sgl-project#11015)
* [HiCacheStorage] mooncake store support page_first_direct layout (sgl-project#10591)
* [speculative decoding] rename lookahead to ngram (sgl-project#11010)
* Fix gemma 3 launch with `transformers:` the error: `AttributeError: 'TransformersForCausalLM' object has no attribute 'tp_size'` (sgl-project#9614)
* Fix sgl-kernel benchmark dead code  (sgl-project#11022)
* [router][tool call] Improve normal content extraction and error handling (non-stream) (sgl-project#11050)
* chore: upgrade cutedsl 4.2.1 (sgl-project#11054)
* [Ci Monitor] Auto uploaded performance data to sglang_ci_data repo (sgl-project#10976)
* chore: upgrade sgl-kernel 0.3.13 (sgl-project#11056)
* [router] add n to generate sampling params (sgl-project#11069)
* Use more general heuristics to set the default value of --mem-fraction-static (sgl-project#10975)
* [router][tool call] Separate `JsonParser` and `LlamaParser` (sgl-project#11073)
* Fix mem fraction static for nightly tests (sgl-project#11076)
* fix: fp8 mllama4 without vision modules being quantized (sgl-project#10611)
* [router] Use `get_pooled` in `process_single_choice` (sgl-project#11079)
* [router][grpc] Add logprobs support to router (sgl-project#11082)
* feat(reasoning): improve enable thinking from request (sgl-project#10875)
* [Profile] dump memory trace when cuda graph profile is enabled (sgl-project#11083)
* Remove hybrid_linear_attn attention backend and refactor attention registry (sgl-project#10816)
* [model] added support for w8a8int8 used by neuralmagic/Qwen2-0.5B-Ins… (sgl-project#9642)
* Enable optional FP32 compute for LM Head (sgl-project#10729)
* Update CODEOWNERS for attention/ascend_backend.py (sgl-project#11092)
* [router] grpc router generate endpoint support (sgl-project#11070)
* [router][tool call] Full support for ToolChoice (sgl-project#11085)
* Fix spec filter batch when target extend  (sgl-project#10991)
* [Fix] Resolve performance drop in speculative decoding aiter backend (sgl-project#11087)
* [Auto Sync] Update fused_moe_triton_config.py (20250930) (sgl-project#11099)
* chore: bump sgl-kernel v0.3.14 (sgl-project#11067)
* [router][grpc-server] Fix gRPC server shutdown (sgl-project#11094)
* Fix eagle radix cache (sgl-project#10846)
* [Eval] Add `--repeat` in `run_eval`  (sgl-project#11101)
* [CPU] Adding Memory Capacity Acquisition Functionality (sgl-project#11102)
* Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization (sgl-project#11081)
* Support Dots.ocr model (sgl-project#11071)
* [router][bugfix] Fix input_logprobs handling with None value and `logprob_start_len = -1` (sgl-project#11113)
* Feature/make PEFT adapter module format compatible (sgl-project#11080)
* fix: KimiK2Detector Improve tool call ID parsing with regex (sgl-project#10972)
* [router] add mcp list and mcp call in output array (sgl-project#11112)
* Organize spec-related data structures (sgl-project#10735)
* [AMD] Add Tilelang and Fast Hadamard Transform builds to Dockerfile.rocm (sgl-project#11114)
* [Auto Sync] Update base_grammar_backend.py, xgrammar_backen... (20250930) (sgl-project#11115)
* [Doc] Update multimodal language models documentation (sgl-project#11111)
* Quick Fix: fix Qwen3-VL launch failure caused by MRotaryEmbedding arg (sgl-project#10985)
* docker: x86 dev builds for hopper and blackwell (sgl-project#11075)
* Refactor AMD CI. (sgl-project#11128)
* feat: add fast_decode_plan from flashinfer, flashinfer to 0.4.0rc3 (sgl-project#10760)
* [HiCache]bug fix: fixed blank item in host_mem_release_queue (sgl-project#11005)
* [Feature] Add EIC as sglang HiCache Storage backend (sgl-project#10271)
* [HiCache] Configurable and Dynamic Prefetch Timeout (sgl-project#10512)
* [router] add pd service in grpc router for pd (sgl-project#11120)
* [router] Add multi-turn tool calling loop support for MCP integration (sgl-project#11143)
* Fix metrics and request tracing (TimeStats) (sgl-project#11123)
* Remove debug print statement from scheduler output (sgl-project#11145)
* Introduce CPU tensor as metadata to avoid blocking GPU kernel launch (sgl-project#10720)
* Fix ngram spec with page size > 1 (sgl-project#11135)
* [ROCm] To reduce the compiling time when using torch compile. (sgl-project#10559)
* Fix DeepSeek chunked prefill memory issue (sgl-project#11149)
* Clean up parallel_state.py (sgl-project#11148)
* Tiny improve dumper (sgl-project#11132)
* Tiny fix missing alt stream in nextn layer (sgl-project#10768)
* Fuse quantize and rope in trtllm_mla MTP (sgl-project#10779)
* Tiny detect slow ranks (sgl-project#10508)
* Remove unused pack `.item()` in paged allocator. (sgl-project#11156)
* Support dispatch low latency (sgl-project#10263)
* Support single batch overlap (sgl-project#10422)
* [router][grpc] Support tool call parser in streaming (sgl-project#11160)
* [model] Add mamba2 and Falcon-H1 support. (sgl-project#10988)
* Clean up ascend allocator (sgl-project#11152)
* fix cpp JIT compilation issue of ngram speculative decoding (sgl-project#10837)
* Tiny cleanup deepseek_v2.py (sgl-project#11163)
* Tiny fix ep_gather behavior different in CI (sgl-project#11130)
* Tiny remove duplicated code (sgl-project#11164)
* [proto] Add script to compile python protos (sgl-project#11171)
* Unify forward output datastructure (sgl-project#11124)
* [grpc] style fix for grpc compilation. (sgl-project#11175)
* Remove dp balance metadata and minimal token balance. (sgl-project#11170)
* Minor fixes for server_args, parallel_state, and test_deterministic.py (sgl-project#11159)
* fix: shouldn't include CUDA_ARCH 100 and 120 for cuda12.6.1 (sgl-project#11176)
* [router][grpc] Support streaming for v1/chat/completions (sgl-project#11179)
* Allow use of TRTLLM_MHA backend for hybrid attention on Blackwell (sgl-project#11138)
* Introduce naming convention in `io_struct` and base sglang io classes. (sgl-project#10133)
* [Generative Scores API] add performance tests to CICD  (sgl-project#10830)
* [1/n] Enable DCA CUDA graph capture (sgl-project#9537)
* [Fix] Update to v0.1.5.post4 and refine HIP attention backend selection (sgl-project#11161)
* [CI] Tee server logs to both file and stdout/stderr using PIPE (sgl-project#11185)
* fix: radix cache memory accounting (sgl-project#10637)
* Tiny add PD disaggregation + DP attention test (sgl-project#11167)
* [router] Streaming support for MCP Tool Calls in OpenAI Router (sgl-project#11173)
* [Feature] Option to save model weights to CPU when memory saver mode is enabled (sgl-project#10873)
* Add --thinking-mode to run_eval (sgl-project#11189)
* [hot-fix] Fix CI break which caused by adding `thinking_mode` in eval (sgl-project#11192)
* Tiny move files to utils folder (sgl-project#11166)
* Fix CUDA illegal memory access issues in speculative decoding (sgl-project#10892)
* Fix [test]: Env:SGLANG_TORCH_PROFILER_DIR for pytest. (sgl-project#10780)
* Optimize debug log position of PD abort request (sgl-project#11090)
* fix 3fs indices (sgl-project#10855)
* model: support starcoder2 (sgl-project#10609)
* [Test] Initialize mem_fraction_static in setUpClass to fix pytest VLM test crashes. (sgl-project#10859)
* fix xeon ci check (sgl-project#10838)
* fix qwen2 eagle3 runtime error (sgl-project#10517)
* [minor] fix the lint (sgl-project#11198)
* [Fix] Fix the bug of the calculation of base_gpu_id (dp offset) in data_parallel_controller.py (sgl-project#10741)
* [fix] missing prefix_lens_cpu init when p/d disaggregation (sgl-project#11196)
* fix self.enable_kv_cache_events (sgl-project#11178)
* [HICache]: Refactor HiCache CI (sgl-project#11011)
* fix sampling_seed handling when deterministic is enabled (sgl-project#11096)
* [fix] enable flashmla when using draft model P/D attention select (sgl-project#11012)
* [router] fix get load response parsing (sgl-project#11213)
* [router] add grpc router pd mode for chat and generate (sgl-project#11140)
* EAGLE cache fix for HiCache (sgl-project#11215)
* Add --max-new-tokens CLI flag for MMMU evaluation (sgl-project#11217)
* Add DeepSeek-V3.2 Tool Call Template (sgl-project#11063)
* Tiny `skip_sample` adjust (sgl-project#11225)
* [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (sgl-project#11194)
* Update `v1/responses` to be more OpenAI-compatible. (sgl-project#9624)
* chore: bump sgl-kernel v0.3.14.post1 (sgl-project#11137)
* Update DeepGEMM repository tag to specific commit (sgl-project#11229)
* [Feat] Support Torch Symm Mem AllReduce (sgl-project#10571)
* Refactor and optimize mooncake CI (sgl-project#11162)
* [Fix AMD CI] VRAM cleanup  (sgl-project#11174)
* Update transformers package version to 4.57.0 (sgl-project#11222)
* Remove gdrcopy check in ci_install_deepep.sh (sgl-project#11237)
* Rename runner labels (sgl-project#11228)
* [Auto Sync] Update io_struct.py (20251004) (sgl-project#11206)
* Create two new GH workflows to automatically bump SGLang and Kernel version (sgl-project#10996)
* Fix spec_utils.py (sgl-project#11247)
* ci: make find_local_hf_snapshot_dir more robust (sgl-project#11248)
* [quantization] Fix scale remapping for mllama4 (sgl-project#10042)
* [quantization] Enable aiter mxfp4 fused_moe for Quark (sgl-project#10048)
* Use cu128 for torch audio to fix some CI tests (sgl-project#11251)
* Bump torch_memory_saver 0.0.9rc2 (sgl-project#11252)
* update sgl kernel version to 0.3.14.post1 (sgl-project#11242)
* Update condition for sgl-kernel-benchmark-test (sgl-project#11254)
* feat: add shortcut detection for multimodal templates in Jinja format (sgl-project#11209)
* Improve bot release workflow (sgl-project#11240)
* Add flashmla and fast hadamard transform to Dockerfile (sgl-project#11235)
* Support DeepSeek V3.2 Exp (sgl-project#11061)
* chore: bump SGLang version to 0.5.3rc2 (sgl-project#11259)
* chore: bump SGLang version to 0.5.3 (sgl-project#11263)
* [theta] fix bailing v3
* [router] add ipv6 support across all components (sgl-project#11219)
* Remove env var warnings for release (sgl-project#11262)
* Enable native ModelOpt quantization support (1/3)  (sgl-project#7149)
* [router][tool call] Clean up redundant `detect_format` and `has_tool_markers` (sgl-project#11270)
* disable sm100 for FlashMLA and fast-hadamard-transform in cuda12.6.1 (sgl-project#11274)
* docker: add manifest to versioned docker releases (sgl-project#11268)
* [Bug] Fix incorrect assertion in FA4 and add UT. (sgl-project#11182)
* [router][grpc] Refine streaming processes (sgl-project#11277)
* Fix code sync scripts (sgl-project#11276)
* [Auto Sync] Update test_utils.py (20251006) (sgl-project#11280)
* Rename max_micro_batch_size -> pp_max_micro_batch_size (sgl-project#11279)
* Revert the AMD CI test timeout back to 1200s and split the 8-GPU DeepSeek job into two. (sgl-project#11238)
* Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components (sgl-project#11261)
* fix: correct scale parameter remapping logic in Llama4ForConditionalGeneration (sgl-project#11282)
* docs: update sgl-kernel README (sgl-project#11286)
* chore: bump sgl-kernel version to 0.3.15 (sgl-project#11281)
* [router][grpc] Fix proto3 default value mismatches and cleanup unused fields (sgl-project#11283)
* convert test_deterministic into unit tests (sgl-project#11095)
* Feature/longbench v2 evaluation utils (sgl-project#10949)
* [ci] fix pp test (sgl-project#11294)
* EAGLE cache fix for SWARadixCache (sgl-project#11231)
* Remove overlap thread (sgl-project#11210)
* [router] add reasoning and tool parser argument in router (sgl-project#11290)
* Remove sampling info events and overlap thread file (sgl-project#11300)
* Introduce future indices (sgl-project#11301)
* [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (sgl-project#11068)
* [Docs] [Router] Update Observability and Common Issues Section (sgl-project#11302)
* [router] add get server info and get model info in grpc server (sgl-project#11303)
* [router][grpc] Refactor chat template content format detection (sgl-project#11288)
* [Doc] HiCache Design Documents (sgl-project#11027)
* [Doc]: Best Practice for HICache (sgl-project#11001)
* [router] fix grpc connection conversion and add optimization (sgl-project#11305)
* [router][grpc] Fix sampling_params.stop_strs is None (sgl-project#11306)
* Update tool parser and related documentation (sgl-project#11223)
* [router][grpc] Fix error message format in grpc chat handler (sgl-project#11307)
* [quantization] Properly ignore quantization for layers excluded in quant_config (sgl-project#11205)
* [router] support Openai router conversation API CRUD (sgl-project#11297)
* [router][grpc] Fix request_id extraction when n > 1 (sgl-project#11311)
* [router] cleanup worker health check to return early (sgl-project#11310)
* [oai serving chat] Add argument `--sampling-defaults` and fix `ChatCompletionRequest` defaults (sgl-project#11304)
* Clean match_prefix and prepare_for_extend for mem cache V2 (sgl-project#11200)
* ci: unify the model launch method of nightly ci (sgl-project#11230)
* [Chore] Update xgrammar 0.1.24 -> 0.1.25 (sgl-project#10710)
* update sampling_params documentation with defaults (sgl-project#11315)
* Optimize copy_kv_cache for spec decoding (sgl-project#11126)
* Rename `ngram_utils` -> `ngram_info` (sgl-project#11316)
* [router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (sgl-project#11314)
* [Feature] Add /tokenize and /detokenize OpenAI compatible endpoints (sgl-project#9545)
* [8/N] MoE Refactor: deprecate `EPMoE` (sgl-project#11211)
* Skip weight loading in deepgemm compilation (sgl-project#11312)
* [2/2] Support MHA prefill with FlashAttention 4. (sgl-project#10937)
* [Doc] Update mooncake nvlink transport doc for PD disaggregation (sgl-project#11321)
* fix(decode): adjust ServerArgs import to explicit module path (sgl-project#11007)
* Support LoRA in bench_serving oai interface (sgl-project#11318)
* benchmark: enhance configurable multimodal benchmarking in bench_serving (sgl-project#9812)
* [CI] improve disaggregation CI. (sgl-project#11264)
* [theta] fix tokenization
* model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) (sgl-project#10909)
* [router] refactor generate to use new pipeline arch (sgl-project#11323)
* [router] improve reasoning parser lock and reduce req cloning (sgl-project#11336)
* [router][grpc] Cleanup debug logs in grpc_server and grpc_router (sgl-project#11340)
* [router] Fix all unused_qualifications (sgl-project#11341)
* [router] Support history management using conversation (sgl-project#11339)
* [router][grpc] Add dependencies in Cargo.toml to support chat template rendering (sgl-project#11342)
* fix: fix revision for sgl-flash-attn in sgl-kernel (sgl-project#11327)
* [Auto Sync] Update scheduler.py (20251009) (sgl-project#11350)
* [Generative Score API] Multi-Item scoring with custom attention mask. (sgl-project#10979)
* [router][grpc] disable health check generation and increase timeout (sgl-project#11353)
* [router] Refactor OpenAI router: split monolithic file and move location (sgl-project#11359)
* [router][lint] Add unused_qualifications to cargo lint warnings (sgl-project#11366)
* [DeepSeek-V3.2] Include indexer kv cache when estimating kv cache size (sgl-project#11309)
* PullRequest: 323 [theta] Error code normalization: 1) preprocessing of chat and completions requests uniformly returns 400; 2) multimodal load-data requests return standard HTTP error codes
* [router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (sgl-project#11373)
* add code pp support for nixl (sgl-project#11375)
* fix bench_serving mishandling of internal states (sgl-project#11376)
* PullRequest: 322 Support MTP and subclass BailingMoEV3AttentionMLA from DeepseekV2AttentionMLA
* [router][grpc] Replace fake health check with correct ones (sgl-project#11387)
* [router] change grpc client from mutable to clone (sgl-project#11394)
* chore: upgrade flashinfer 0.4.0 (sgl-project#11364)
* [router] conversation item API: create, retrieve and delete (sgl-project#11369)
* chore: bump SGLang version to 0.5.3.post1 (sgl-project#11324)
* move more files under srt/utils (sgl-project#11285)
* [grammar] Avoid server crash when grammar backend is None (sgl-project#11401)
* fix: fix gpu-proc affinity set incorrectly when pp_size > 1 (sgl-project#11389)
* [Bug Fix] prevent lora adapter from being loaded into LoRAManager if it is already loaded (sgl-project#11365)
* [CI] Refactor PD disaggregation test suite (sgl-project#11363)
* Replace pad with cat for better performance (sgl-project#11388)
* fix: reinstall torch in deps install (sgl-project#11414)
* feat(hicache): Support passing prefix keys for l3 store. (sgl-project#9045)
* fix file and object naming scheme in HiCacheNixl to avoid data corruption (sgl-project#10969)
* Dedicated toml files for CPU/XPU (sgl-project#10734)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (sgl-project#11144)
* chore: update pyproject (sgl-project#11420)
* PullRequest: 330 [theta] qwen-vl: support passing video as base64-encoded image frames, e.g. data:video/jpeg;base64,frame1_base64,frame2_base64,...,frameN_base64
* fix: fix video input for qwen3-vl (sgl-project#11361)
* perf: optimize qwen-vl with symm mem allreduce (sgl-project#11381)
* [HiCache] feat: add multi tenant with prefix tag (sgl-project#9256)
* [CI] Merge build-dev into workflow matrix (sgl-project#11345)
* Revert "perf: optimize qwen-vl with symm mem allreduce" (sgl-project#11436)
* Revert "fix: fix video input for qwen3-vl" (sgl-project#11437)
* Revert "Add metrics for speculative decoding (acceptance rate, average acceptance length)" (sgl-project#11433)
* [router] Fix ci nvcc not found error (sgl-project#11411)
* feat(mooncake): support GB suffix for global_segment_size  (sgl-project#10745)
* Separate allocation logic from scheduler (sgl-project#11313)
* [router] disable rate limiter by default (sgl-project#11435)
* [router] leverage RAII to actively cancel request during client disconnect (sgl-project#11399)
* [router][grpc] Consolidate parser checks for chat completions (sgl-project#11439)
* Reorder PD disagg CI tests (#11438)
* fix: Change dsv32 hack temporary path to use system temp directory (#11445)
* Fix batch invariant ops (#11368)
* [BugFix] test_mla_fp8.py fails on Cublas 12.9 (#11360)
* [DPSKv3.2] Rewrite nsa tilelang act_quant kernel to triton (#11450)
* Remove tilelang dependency in Dockerfile (#11455)
* Enable native ModelOpt quantization support (2/3) (#9991)
* Reland [1/2] Optimizations and refactors about quant kernel (#10312)
* Super tiny delete unused openai router in sgl-router (#11448)
* Adjust logits metadata init for target verify (#11467)
* [Documentation][Configuration] Server args and documentation of PD-Multiplexing. (#11427)
* Fix enable_v2 in int8 quant (#11470)
* [Fix] Fix split prefill with fa3. (#11428)
* fix stop when stream  (#11462)
* Add option to disable `any_whitespace` for `xgrammar` and `llguidance` backends. (#8919)
* PullRequest: 334 [theta] Fix various qwen3-vl bugs
* [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019)
* fix Xeon CI (#11454)
* [CI] Add nightly builds to dockerhub (#9804)
* [Feature] support regex strings as a stopping condition (#10635)
* Beta spec-overlap for EAGLE (#11398)
* Piecewise CUDA Graph Support & Torch Compile Backend (#10062)
* [Router]: Small Typo in a comment within tree.rs (#11489)
* chore: bump sgl-kernel version to 0.3.16 (#11476)
* [smol] [perf] Qwen3-VL in place op. (#11481)
* [chore][1/N] Avoid using default mutable parameters (#11478)
* [bugfix]: use correct causality condition for flashattention, flashinfer, and triton backends (#10172)
* [ perf ] Replace json-> orjson in hot path (#11221)
* [chore][2/N] Avoid using default mutable parameters (#11479)
* Fix the GPT function calling regex to allow dash in the name (#10577)
* bailingMoE: Fix Key error of deepep_mode (#11465)
* Fix CI break by express-laned PRs. (#11499)
* Move args from `global_config` to `environ` (#11332)
* move fla env check position (#11500)
* Temporarily remove b200 tests (#11501)
* Fix port conflicts in CI (#11497)
* temporarily remove b200 tests (#11502)
* Fix unit tests (#11503)
* Bugfix: Fix Type consistency for KV indices in SWARadixCache (#11452)
* doc: add doc for adding new models into nightly-ci (#11443)
* [CI] fix lint (#11509)
* Deprecate `global_server_args_dict` (#11331)
* chore: remove flashinfer cleanup cache (#11514)
* fix: revert temporarily remove b200 tests (#11515)
* [Fix] Improve longbench prompt and other logics (#11474)
* Sync changes on io_struct.py and deterministic ops (#11498)
* [lint] Fix the lint issue (#11516)
* Revert "Deprecate `global_server_args_dict`" (#11520)
* Improve dp attention port assignment scheme (#5889)
* [theta] rebase public/main 1013-2
* [router] openai router: support grok model (#11511)
* docs(router): add token-bucket rate limiting to the docs (#11485)
* [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
* Update DeepSeek-R1-FP4 default config on blackwell (#11512)
* [Fix]: add missing device attribute to ChunkCache (#11493)
* [Feature] Support mamba radix cache v0 (#11214)
* ci: improve nightly-ci (#11385)
* [CI monitor] Improve CI analyzer: fix job failure tracking and add CUDA-focused filtering (#11505)
* [HICache]: Support 3FS-Store with page_first_direct layout (#11460)
* Tiny fix test run estimated time (#11544)
* [Reland] perf: optimize qwen-vl with symm mem allreduce (#11457)
* [theta] rebase public/main 1013-5
* Deprecate `global_server_args_dict` (#11528)
* [theta] rebase public/main 1013-6
* [Fix] Add per_channel_quant parameter to MoE config functions (#11201)
* [router][ci] Add Nightly Release Workflow for SGLang Router (#11527)
* [router] allow tokenizer path to be dir (#11530)
* Remove `tp_worker.worker` (#11548)
* fix: fix video input for qwen3-vl (#11442)
* [NVIDIA] BUMP FA3 (#11444)
* [router][Fix] Include grpc reflection runtime dependency (#11419)
* Adjust overlap event loop (#11507)
* Move deep gemm related arguments to `sglang.srt.environ` (#11547)
* [router][grpc] Further delegate non-stream processing to `processing.rs`  (#11553)
* [router] allow user to specify chat template path (#11549)
* Minor: improve sampler & remove unused fields from model_config.py (#11531)
* [router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483)
* Add metrics for speculative decoding (acceptance rate, average acceptance length) (#11441)
* Fix DeepSeek-v3.2 default config (ValueError: not enough values to unpack (expected 4, got 3)) (#11557)
* [CI] Add Basic Test for DeepSeek V3.2 (#11308)
* [router][grpc] Add error handling to `generate_tool_constraints` (#11562)
* [NVIDIA] update pyproject.toml to support cu130 option (#11521)
* [CI Monitor] Ci monitor only deal with main branch in default (#11538)
* Tiny cleanup fp4 gemm calls (#11537)
* [router][grpc] Add `serve_grpc` to `launch_server` and log id for HealthCheck (#11564)
* [router] Add BRANCH_TYPE=local support to Dockerfile.router for local builds (#11571)
* [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534)
* chore: bump sgl-kernel version to 0.3.16.post1 (#11573)
* Fix accept rate in speculative decoding metrics (#11572)
* Compilation Folder Reset (#11539)
* [FEATURE] Add Profile Trace Merger for Distributed Traces (#11413)
* [DSv32] Use torch.compile for _get_logits_head_gate (#11565)
* Make DeepEP combine recv do not overlap (#11535)
* bench_serving support PD Disaggregation (#11542)
* Implement LRU eviction policy for LoRA adapters (#11041)
* PullRequest: 337 Support multimodal requests via the completions protocol
* Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582)
* chore: bump sgl-kernel version to 0.3.16.post2 (#11583)
* [Auto Sync] Update model_config.py (20251014) (#11580)
* Add fused_moe_triton config: triton_3_4_0/E=256,N=256,device_name=NVIDIA_B200.json (#11587)
* [router][protocols] Add Axum validate extractor and use it for `/v1/chat/completions` endpoint (#11588)
* [router] update generate spec to align with sgl io struct (#11591)
* [router] change worker api to async instead of sync (#11566)
* Update news section in README.md (#11598)
* [router] delete useless table content comment in spec (#11597)
* [router] allow router launch server to use grpc mode (#11600)
* [Docs] [Router]: Update sg-router doc on circuit breaker (#11449)
* [router] when given both local tokenizer and chat template, log all (#11601)
* [AMD CI] Add image and weights caching. (#11593)
* Update release-docker-dev.yml (#11603)
* Optimize Triton Draft Backend (#11556)
* Refactor spec decoding metrics calculation into separate `TokenizerManager` utility function (#11586)
* make radix cache deterministic (#10721)
* move eagle draft post process to cuda graph (#11434)
* Reduce one step decode for draft model. (#11561)
* [router] add py binding and readme for openai router and history backend (#11453)
* [theta] print load mm cost
* [theta] Bailing 4-head: support tp8
* [router] cleanup app context and move to startup (#11617)
* [router] add chang and keyang to sgl router author (#11620)
* use non_blocking h2d in ForwardBatch.prepare_mlp_sync_batch. (#11605)
* [router] update router readme to latest features (#11619)
* Fix log for chunked prefix cache (#11624)
* [Auto Sync] Update scheduler.py, server_args.py (20251014) (#11623)
* [Auto Sync] Update collector.py (20251014) (#11625)
* [Minor] Update xgrammar dependency (#11622)
* Update install.md (#11631)
* fix: Update SGL_KERNEL_VERSION to 0.3.15 (#11633)
* [router][grpc] add warm up to grpc server (#11627)
* Refactor kv cache free (#11351)
* [router] update router doc to latest features (#11639)
* fix: upgrade transformers to 4.57.1 (#11628)
* [router] add worker self discovery for metadata (#11638)
* [router] upgrade to 0.2.0 (#11642)
* [theta] qwen-vl: print timing cost
* [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP (#10423)
* [theta] qwen-vl: print timing cost
* [1/N]Support  DeepSeek-R1 w4a8 normal deepep (#8247)
* [Fix] Fix accuracy bug in CSGMV kernel caching key. (#11579)
* feat: add add_chunked_prefix_cache_attention_backend (#11636)
* Super tiny improve FA3 import error message (#11590)
* [BugFix][Qwen3-VL]: fix cu_seqlens in qwen3-vl  (#11458)
* [Doc] Update support matrix for attn and hybrid attn (#11293)
* Clean up some Qwen3-Next and deterministic code (#11585)
* docs: update sglang installation guide (#11659)
* [theta] Update ACI image and dependencies
* Tiny cleanup some eagle unused codes (#11660)
* Fix 1-step draft model forward (#11653)
* [tool call] Fix prev_tool_call_arr management in base_format_detector.py (#11367)
* [router] Fix response api related spec (#11621)
* Fix missing json imports in serving_responses.py (#11681)
* [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674)
* [sgl-kernel] Optimize gguf test (#11667)
* [router][grpc] Simplify model_id determination (#11684)
* [router] Refactor StopSequenceDecoder to Use Sequence for Incremental Decoding (#11676)
* chore: bump SGLang version to 0.5.3.post2 (#11680)
* [CI][XPU]enable sglang CI on Intel XPU (#9493)
* enable rmsnorm on XPU (#10248)
* Sync code and test CI; rename some env vars (#11686)
* docs: Add Contributor Covenant Code of Conduct (#11689)
* [theta] Dockerfile: add DeepGEMM compilation cache (needs periodic refresh 😂)
* [Mamba] Increase default mamba_full_memory_ratio to 0.9 (#11679)
* [PD] Add PD support for hybrid model (Qwen3-Next, DeepSeek V3.2 Exp) (#10912)
* [sgl-kernel] support hadamard (#11663)
* Fix missing a2a backend init of GLM4.5 MoE Block (#11692)
* Split test_intel_amx_attention_backend.py to pass CI of timeout (#11370)
@Fridge003
Collaborator

This PR causes a breakage when --disable-chunked-prefix-cache is added.
Now working on a fix.
