[Generative Score API] Optimization to Remove Decode. #8840

Merged

hnyls2002 merged 22 commits into sgl-project:main from sundar24295s:suramach/removeDecodeForScore on Aug 13, 2025

Conversation

@sundar24295s (Collaborator) commented Aug 6, 2025

🚀 Motivation

  • This PR is a follow-up to the Decoder-only Scoring API introduced in PR #6460, which was initially proposed and discussed in Issue #5973.
  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.

🔧 Modifications

  • Remove-Decode Optimization:
    Scoring workloads require only the log probabilities at the last token position and do not involve token generation. This PR removes the decode phase entirely for such workloads, resulting in significantly lower latency and improved throughput (a minimal sketch of the idea follows this list).
  • This optimization applies to all prefill-only workloads.
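
The sketch below is not the scheduler code itself; it assumes a Hugging Face-style model object, and the function and argument names are illustrative:

```python
import torch

def score_prefill_only(model, input_ids: torch.Tensor,
                       label_token_ids: list[int]) -> torch.Tensor:
    """Run a single prefill (extend) pass and read label-token probabilities
    at the last prompt position -- no decode step is ever launched."""
    with torch.no_grad():
        logits = model(input_ids).logits        # [batch, seq_len, vocab]
    last_logits = logits[:, -1, :]              # logits at the final prompt token
    probs = torch.softmax(last_logits, dim=-1)  # full-vocab distribution
    return probs[:, label_token_ids]            # [batch, num_labels]
```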

📈 Future Optimizations

This refactor lays the foundation for further improvements, including:

  • Memory Transfer Optimization:
    Move all CPU-GPU synchronization to the post-processing loop to better overlap run_batch with post-processing.

  • Multi-Item Scoring:
    Support scoring multiple items within a single prompt using custom attention masks via FlashInfer (see the mask sketch after this list).
    → See flashinfer-ai/flashinfer#1015
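
For intuition on the custom mask: with one prompt holding a shared query followed by several items, each item's tokens should attend to the query and to earlier tokens of the same item, but not to other items. A minimal sketch of such a mask in plain torch (illustrative names, not the FlashInfer API):

```python
import torch

def multi_item_mask(query_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean [total, total] mask where True means attention is allowed."""
    total = query_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared query prefix.
    mask[:query_len, :query_len] = torch.tril(
        torch.ones(query_len, query_len, dtype=torch.bool))
    start = query_len
    for n in item_lens:
        # Every item token sees the full query prefix...
        mask[start:start + n, :query_len] = True
        # ...and earlier tokens of its own item, but no other item.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask
```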

Accuracy

  • Scores before this change
```bash
jobuser [ /shared/user/repos3/sglang ]$ curl -X POST "http://localhost:30000/v1/score" -H "Content-Type: application/json" -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF"
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF","usage":null,"object":"scoring"}
```
  • Scores after this change
```bash
(sglang) jobuser [ /shared/user/repos3/sglang ]$ curl -X POST "http://localhost:30000/v1/score" -H "Content-Type: application/json" -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF"
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF","usage":null,"object":"scoring"}
```

Profiling

  • Updated Benchmark Scripts: Enhanced bench_score.py to properly test the new scoring pipeline
  • Before this change, for a single request we see both forward_extend and forward_decode.
(profiler screenshot: trace showing forward_extend followed by forward_decode)
  • After this change, we only see forward_extend.
(profiler screenshot: trace showing forward_extend only)

🧪 Benchmark Comparison: Qwen3-0.6B on H100 (CUDA 12.8)

Setup:

  • Model: Pruned Qwen3-0.6B
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70
  • Item Count: 10
  • Distribution: Poisson (see the arrival-process sketch after this list)
  • Transport: HTTP
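
This is not bench_score.py itself, just a sketch of the Poisson arrival process assumed above; `send_fn` stands in for a coroutine that issues one scoring request:

```python
import asyncio
import random

async def poisson_load(send_fn, rps: float = 70.0, duration_s: float = 120.0):
    """Fire requests with exponentially distributed inter-arrival times,
    i.e. a Poisson arrival process at the target RPS."""
    elapsed, tasks = 0.0, []
    while elapsed < duration_s:
        gap = random.expovariate(rps)  # mean inter-arrival = 1/rps seconds
        await asyncio.sleep(gap)
        elapsed += gap
        tasks.append(asyncio.create_task(send_fn()))
    await asyncio.gather(*tasks)
```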

[Todo]: Add numbers for the open-source Qwen3-0.6B.

Server Start:

```bash
(sglang) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 100000 --mem-fraction-static 0.3 --enable-tokenizer-batch-encode --disable-cuda-graph
```

Results

  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.
  • For batch size 10 and the same 300-token input, the table below compares the latency percentiles before and after the change.

🔍 Summary of Improvement

| Metric | Baseline | With Change | Improvement |
|---|---|---|---|
| Avg Response Time (ms) | 83.92 | 51.16 | ↓ 39% |
| P50 Latency (ms) | 51.77 | 39.10 | ↓ 24% |
| P90 Latency (ms) | 114.13 | 73.37 | ↓ 36% |
| P99 Latency (ms) | 621.66 | 391.96 | ↓ 37% |
| Single Request Latency (ms) | 33 | 20 | ↓ 39% |



sundar24295s marked this pull request as ready for review on August 6, 2025 06:57
@sundar24295s (Collaborator, Author) commented

The unit test failures are unrelated.

@hnyls2002 (Collaborator) commented

@sundar24295s LGTM. Just curious: do you reuse the logic from the reward/embedding models?

```python
model_worker_batch = batch.get_model_worker_batch()
```

@sundar24295s (Collaborator, Author) commented

@hnyls2002 There are some unrelated AMD-related CI/CD test failures; can you help take a look and merge the PR?

hnyls2002 merged commit a027a9b into sgl-project:main on Aug 13, 2025
97 of 100 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
@narutolhy (Contributor) commented Aug 18, 2025

Hi @sundar24295s, thank you very much for your work; this interface helps me a lot.

I plan to optimize this interface. What are your future plans? We may be able to cooperate. I have observed that although a request arrives at sglang as a whole, it is split into (query, item) pairs internally and sent to the scheduler through ZMQ. When I issued a request, I found that the scheduler may start forwarding before ZMQ has delivered all the pairs, so one batch gets split into many small batches. I plan to hand the batched requests to the scheduler together, which will also make Multi-Item Scoring easier to use. I am also very interested in the work of integrating Multi-Item Scoring into sglang. When do you plan to start development? If there is no plan to develop it in the short term, I can try to develop it. Thank you.

@sundar24295s
Copy link
Copy Markdown
Collaborator Author

Hey @narutolhy,

We noticed the same behavior: when sending a batch of 10 prompts, the tokenizer manager currently sends them one by one. Depending on how they arrive at the Scheduler process, the batch may get split (e.g., 2 + 8) during prefill.

I have a draft PR that changes this by sending the entire batch through a single ZMQ socket. I can put up a PR for review. The trade-off is that the entire batch would then be scheduled on the same DP (or GPU), so this approach is best applied on a case-by-case basis. Happy to hear your thoughts on this — we can collaborate further if you have other ideas.

As for MIS, I’m actively working on it. In MIS, though, we will construct a single prompt in the TokenizerManager, so the batch-send optimization will not improve it.
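
To make the batch-send idea concrete, here is a hypothetical sketch of pushing a whole batch through one ZMQ socket as a single multipart message (illustrative endpoint and names, not sglang's actual classes):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PUSH)
sock.connect("ipc:///tmp/scheduler")  # placeholder endpoint

def send_batch(serialized_reqs: list[bytes]) -> None:
    # One multipart message: the scheduler receives all (query, item)
    # pairs of the batch atomically instead of as interleaved singles.
    sock.send_multipart(serialized_reqs)
```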

@narutolhy (Contributor) commented

Hi @sundar24295s, awesome! On sending the entire batch through a single ZMQ socket, our current thinking is the same as yours: we will split the batches upstream, so there is no need to use DP in sglang.
I also found that CUDA graphs give good results in this scenario. I think we can collaborate on optimizing the MIS operator. My colleagues and I want to get involved; we could split it into several tasks and complete them together. What do you think?

@Rockdu (Contributor) commented Aug 21, 2025

Hi @sundar24295s, I'm @narutolhy's colleague. Your work is really amazing, and we have realized this feature will be very important for our long-term plans. We are very interested in collaborating. Maybe we can discuss the future plan and the split of work together; what do you think?

@sundar24295s (Collaborator, Author) commented

Hey @Rockdu / @narutolhy,
Let me set up a meeting during the week of Sept 8th regarding this. Let's continue in the internal Slack channel.
