[Generative Score API] Optimization to Remove Decode. #8840

Merged

hnyls2002 merged 22 commits into sgl-project:main from sundar24295s:suramach/removeDecodeForScore on Aug 13, 2025

Conversation

@sundar24295s (Collaborator) commented Aug 6, 2025

🚀 Motivation

  • This PR is a follow-up to the Decoder-only Scoring API introduced in PR #6460, which was initially proposed and discussed in Issue #5973.
  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.

🔧 Modifications

  • Remove-Decode Optimization:
    Scoring workloads require only the log probabilities at the last token position and do not involve token generation. This PR removes the decode phase entirely for such workloads, resulting in significantly lower latency and improved throughput (a minimal sketch of the idea follows this list).
  • This optimization applies to all prefill-only workloads.
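
The sketch below is not the scheduler code itself; it assumes a Hugging Face-style model object, and the function and argument names are illustrative:

```python
import torch

def score_prefill_only(model, input_ids: torch.Tensor,
                       label_token_ids: list[int]) -> torch.Tensor:
    """Run a single prefill (extend) pass and read label-token probabilities
    at the last prompt position -- no decode step is ever launched."""
    with torch.no_grad():
        logits = model(input_ids).logits        # [batch, seq_len, vocab]
    last_logits = logits[:, -1, :]              # logits at the final prompt token
    probs = torch.softmax(last_logits, dim=-1)  # full-vocab distribution
    return probs[:, label_token_ids]            # [batch, num_labels]
```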

📈 Future Optimizations

This refactor lays the foundation for further improvements, including:

  • Memory Transfer Optimization:
    Move all CPU-GPU synchronization to the post-processing loop to better overlap run_batch with post-processing.

  • Multi-Item Scoring:
    Support scoring multiple items within a single prompt using custom attention masks via FlashInfer (see the mask sketch after this list).
    → See flashinfer-ai/flashinfer#1015
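
For intuition on the custom mask: with one prompt holding a shared query followed by several items, each item's tokens should attend to the query and to earlier tokens of the same item, but not to other items. A minimal sketch of such a mask in plain torch (illustrative names, not the FlashInfer API):

```python
import torch

def multi_item_mask(query_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean [total, total] mask where True means attention is allowed."""
    total = query_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared query prefix.
    mask[:query_len, :query_len] = torch.tril(
        torch.ones(query_len, query_len, dtype=torch.bool))
    start = query_len
    for n in item_lens:
        # Every item token sees the full query prefix...
        mask[start:start + n, :query_len] = True
        # ...and earlier tokens of its own item, but no other item.
        mask[start:start + n, start:start + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool))
        start += n
    return mask
```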

Accuracy

  • Scores before this change
```bash
jobuser [ /shared/user/repos3/sglang ]$ curl -X POST "http://localhost:30000/v1/score" -H "Content-Type: application/json" -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF"
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF","usage":null,"object":"scoring"}
```
  • Scores after this change
```bash
(sglang) jobuser [ /shared/user/repos3/sglang ]$ curl -X POST "http://localhost:30000/v1/score" -H "Content-Type: application/json" -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Scaramento", "San Jose", "San Francisco"],
    "label_token_ids": [9454, 2753],
    "model": "/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF"
  }'
{"scores":[[4.234364670752685e-06,1.2348638303110892e-05],[7.162677269222586e-05,0.0003160321422383422],[0.0001203001937164321,0.00030480667807191645]],"model":"/shared/public/sharing/job-rank/kbehdin/f389cde308efd4dbb8d9-2025-06-06-18-31-30/best_model/epoch=0-step=498-HF","usage":null,"object":"scoring"}
```

Profiling

  • Updated Benchmark Scripts: Enhanced bench_score.py to properly test the new scoring pipeline
  • Before this change, for a single request we see both forward_extend and forward_decode.
(profiler screenshot: trace showing forward_extend followed by forward_decode)
  • After this change, we only see forward_extend.
(profiler screenshot: trace showing forward_extend only)

🧪 Benchmark Comparison: Qwen3-0.6B on H100 (CUDA 12.8)

Setup:

  • Model: Pruned Qwen3-0.6B
  • Prompt length: 300 tokens
  • Hardware: H100 GPU
  • Duration: 120s
  • Target RPS: 70
  • Item Count: 10
  • Distribution: Poisson (see the arrival-process sketch after this list)
  • Transport: HTTP
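
This is not bench_score.py itself, just a sketch of the Poisson arrival process assumed above; `send_fn` stands in for a coroutine that issues one scoring request:

```python
import asyncio
import random

async def poisson_load(send_fn, rps: float = 70.0, duration_s: float = 120.0):
    """Fire requests with exponentially distributed inter-arrival times,
    i.e. a Poisson arrival process at the target RPS."""
    elapsed, tasks = 0.0, []
    while elapsed < duration_s:
        gap = random.expovariate(rps)  # mean inter-arrival = 1/rps seconds
        await asyncio.sleep(gap)
        elapsed += gap
        tasks.append(asyncio.create_task(send_fn()))
    await asyncio.gather(*tasks)
```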

[Todo]: Add numbers for the open-source Qwen3-0.6B.

Server Start:

```bash
(sglang) jobuser [ /shared/user/repos3/sglang ]$ python -m sglang.launch_server --model-path /shared/public/sharing/suramach/Qwen3-0.6B --port 30000 --host 0.0.0.0 --chunked-prefill-size -1 --enable-torch-compile --dtype float16 --max-prefill-tokens 100000 --mem-fraction-static 0.3 --enable-tokenizer-batch-encode --disable-cuda-graph
```

Results

  • Achieved a 40% reduction in single-request latency for a 300-token input (120 query tokens + 180 item tokens), decreasing from 33 ms to 20 ms.
  • For batch size 10 and the same 300-token input, the table below compares the latency percentiles before and after the change.

🔍 Summary of Improvement

| Metric | Baseline | With Change | Improvement |
|---|---|---|---|
| Avg Response Time (ms) | 83.92 | 51.16 | ↓ 39% |
| P50 Latency (ms) | 51.77 | 39.10 | ↓ 24% |
| P90 Latency (ms) | 114.13 | 73.37 | ↓ 36% |
| P99 Latency (ms) | 621.66 | 391.96 | ↓ 37% |
| Single Request Latency (ms) | 33 | 20 | ↓ 39% |



sundar24295s marked this pull request as ready for review on August 6, 2025 06:57
@sundar24295s (Collaborator, Author) commented

The unit test failures are unrelated.

@hnyls2002 (Collaborator) commented

@sundar24295s LGTM. Just curious: do you reuse the logic from the reward/embedding models?

```python
model_worker_batch = batch.get_model_worker_batch()
```

@sundar24295s (Collaborator, Author) commented

@hnyls2002 There are some unrelated AMD-related CI/CD test failures; can you help take a look and merge the PR?

hnyls2002 merged commit a027a9b into sgl-project:main on Aug 13, 2025
97 of 100 checks passed
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
@narutolhy (Contributor) commented Aug 18, 2025

Hi @sundar24295s, thank you very much for your work; this interface helps me a lot.

I plan to optimize this interface. What are your future plans? We may be able to cooperate. I have observed that although a request arrives at sglang as a whole, it is split into (query, item) pairs internally and sent to the scheduler through ZMQ. When I issued a request, I found that the scheduler may start forwarding before ZMQ has delivered all the pairs, so one batch gets split into many small batches. I plan to hand the batched requests to the scheduler together, which will also make Multi-Item Scoring easier to use. I am also very interested in the work of integrating Multi-Item Scoring into sglang. When do you plan to start development? If there is no plan to develop it in the short term, I can try to develop it. Thank you.

@sundar24295s
Copy link
Copy Markdown
Collaborator Author

Hey @narutolhy,

We noticed the same behavior: when sending a batch of 10 prompts, the tokenizer manager currently sends them one by one. Depending on how they arrive at the Scheduler process, the batch may get split (e.g., 2 + 8) during prefill.

I have a draft PR that changes this by sending the entire batch through a single ZMQ socket. I can put up a PR for review. The trade-off is that the entire batch would then be scheduled on the same DP (or GPU), so this approach is best applied on a case-by-case basis. Happy to hear your thoughts on this — we can collaborate further if you have other ideas.

As for MIS, I’m actively working on it. In MIS, though, we will construct a single prompt in the TokenizerManager, so the batch-send optimization will not improve it.
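
To make the batch-send idea concrete, here is a hypothetical sketch of pushing a whole batch through one ZMQ socket as a single multipart message (illustrative endpoint and names, not sglang's actual classes):

```python
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PUSH)
sock.connect("ipc:///tmp/scheduler")  # placeholder endpoint

def send_batch(serialized_reqs: list[bytes]) -> None:
    # One multipart message: the scheduler receives all (query, item)
    # pairs of the batch atomically instead of as interleaved singles.
    sock.send_multipart(serialized_reqs)
```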

@narutolhy (Contributor) commented

Hi @sundar24295s, awesome! On sending the entire batch through a single ZMQ socket, our current thinking is the same as yours: we will split the batches upstream, so there is no need to use DP in sglang.
I also found that CUDA graphs give good results in this scenario. I think we can collaborate on optimizing the MIS operator. My colleagues and I want to get involved; we could split it into several tasks and complete them together. What do you think?

@Rockdu (Contributor) commented Aug 21, 2025

Hi @sundar24295s, I'm @narutolhy's colleague. Your work is really amazing, and we have realized this feature will be very important for our long-term plans. We are very interested in collaborating. Maybe we can discuss the future plan and the split of work together; what do you think?

@sundar24295s (Collaborator, Author) commented

Hey @Rockdu / @narutolhy,
Let me set up a meeting during the week of Sept 8th regarding this. Let's continue in the internal Slack channel.
