[Generative Score API] Optimization to Remove Decode #8840
hnyls2002 merged 22 commits into sgl-project:main from sundar24295s:suramach/removeDecodeForScore
Conversation
Unit test failures are unrelated.
@sundar24295s LGTM. Just curious: do you reuse the logic from the reward/embedding models? (See `sglang/python/sglang/srt/managers/scheduler.py`, line 1782 at commit 86a0be6.)
@hnyls2002 There are some unrelated AMD-related CI/CD test failures; can you take a look and merge the PR?
Hi @sundar24295s, thank you very much for your work; this interface helps me a lot.
Hey @narutolhy, we noticed the same behavior: when sending a batch of 10 prompts, the tokenizer manager currently sends them one by one. Depending on how they arrive at the Scheduler process, the batch may get split (e.g., 2 + 8) during prefill. I have a draft PR that changes this by sending the entire batch through a single ZMQ socket, and I can put it up for review; see the sketch below. The trade-off is that the entire batch is then scheduled on the same DP rank (i.e., the same GPU), so this approach is best applied case by case. Happy to hear your thoughts on this; we can collaborate further if you have other ideas. As for MIS, I'm actively working on it. In MIS, though, we will construct a single prompt from the TokenizerManager, so the …
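Roughly, the difference looks like this; a minimal pyzmq sketch, not sglang's actual TokenizerManager/Scheduler wiring (the endpoint and payload shape are made up for illustration):

```python
import pickle
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PUSH)
sock.connect("ipc:///tmp/scheduler_input")  # hypothetical endpoint

prompts = [f"prompt {i}" for i in range(10)]

# Today: one message per request. The scheduler may start prefill after
# only part of the batch has arrived, so the batch can split, e.g. 2 + 8.
for p in prompts:
    sock.send(pickle.dumps({"prompts": [p]}))

# Draft PR: one message for the whole batch. It arrives atomically and is
# scheduled together, but the entire batch then lands on a single DP rank.
sock.send(pickle.dumps({"prompts": prompts}))
```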
Hi @sundar24295s, awesome! Regarding sending the entire batch to a single ZMQ socket: our current thinking is the same as yours. We will split the batches upstream, so there is no need to use DP in sglang.
Hi @sundar24295s, I'm @narutolhy's colleague. Your work is really amazing, and we realized this feature will be very important for our long-term plan. We are very interested in collaborating. Maybe we can discuss the future plan and how to split the work together; what do you think?
Hey @Rockdu / @narutolhy, …
🚀 Motivation
Scoring workloads require only the log probabilities at the last token position and do not involve token generation, so the decode phase is pure overhead for them.
🔧 Modifications
This PR removes the decode phase entirely for scoring workloads, resulting in significantly lower latency and improved throughput. A hedged example request is shown below.
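For context, here is a sketch of a scoring call over HTTP. The `/v1/score` route and the field names (`query`, `items`, `label_token_ids`, `apply_softmax`) follow the sglang scoring docs as I understand them; treat the exact schema as an assumption and check it against your server version:

```python
import requests

# Score two candidate answers against a shared query in one request.
resp = requests.post(
    "http://localhost:30000/v1/score",
    json={
        "model": "Qwen/Qwen3-0.6B",
        "query": "Is the following review positive? Review: great product. Answer:",
        "items": [" Yes", " No"],
        # Token ids whose last-position probabilities are returned.
        # These ids are placeholders; look them up with your tokenizer.
        "label_token_ids": [9454, 2753],
        "apply_softmax": True,
    },
)
print(resp.json())
```

Because no tokens are generated, the server only needs the prefill (`forward_extend`) logits at the final position.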
📈 Future Optimizations
This refactor lays the foundation for further improvements, including:
- Memory Transfer Optimization: Move all CPU-GPU synchronization to the post-processing loop to better overlap `run_batch` with post-processing.
- Multi-Item Scoring: Support scoring multiple items within a single prompt using custom attention masks via FlashInfer (see flashinfer-ai/flashinfer#1015); a sketch of such a mask follows this list.
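To make the multi-item idea concrete, here is an illustrative mask construction in plain PyTorch (not FlashInfer's API): each item attends causally to the shared prefix and to itself, but never to the other items, so a single prefill can score every item:

```python
import torch

def multi_item_mask(prefix_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask: True means the position may attend."""
    total = prefix_len + sum(item_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Causal attention within the shared prefix.
    mask[:prefix_len, :prefix_len] = torch.tril(
        torch.ones(prefix_len, prefix_len, dtype=torch.bool)
    )
    start = prefix_len
    for n in item_lens:
        end = start + n
        mask[start:end, :prefix_len] = True  # every item sees the full prefix
        mask[start:end, start:end] = torch.tril(  # causal within the item
            torch.ones(n, n, dtype=torch.bool)
        )
        start = end
    return mask

# Shared prefix of 3 tokens, followed by two items of 2 tokens each.
print(multi_item_mask(3, [2, 2]).int())
```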
Accuracy
Profiling
- Updated `bench_score.py` to properly test the new scoring pipeline.
- Before this change, a scoring request ran both `forward_extend` and `forward_decode`; it now runs only `forward_extend`.
🧪 Benchmark Comparison: Qwen3-0.6B on H100 (CUDA 12.8)
Setup:
[Todo]: Add numbers for open source Qwen3-0.6B.
Server Start:
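A minimal launch sketch for Qwen3-0.6B; the benchmark's exact tuning flags are not shown here, so only the basics appear:

```bash
python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --port 30000
```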
Results
🔍 Summary of Improvement
Checklist