[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel #22427
Conversation
Code Review
This pull request introduces the ability to return pooled hidden states from the model's intermediate layers during scoring. This feature is integrated across various components, including the scoring API, protocol definitions, model pooling logic, and request/batch management. New tests validate this functionality for single-item and multi-item scoring, HTTP integration, and confirm that CausalLM models correctly reject requests for pooled hidden states. A high-severity bug was identified in the scheduler_output_processor_mixin.py file, where the logic for stacking pooled_hidden_states for pickling optimization can cause an IndexError if some requests in a batch do not return these states. The suggested fix is to pass the list directly to preserve the one-to-one mapping with requests.
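The stacking pitfall the review flags can be sketched in a few lines of plain Python (illustrative data and names, not the actual `scheduler_output_processor_mixin.py` code):

```python
# Some requests in a batch legitimately produce no pooled hidden states;
# their slot is None. One entry per request:
per_request_phs = [[0.1, 0.2], None, [0.3, 0.4]]

# Buggy approach: filtering + stacking drops the None slots, so row i no
# longer corresponds to request i, and indexing by request index can blow up.
stacked = [phs for phs in per_request_phs if phs is not None]
try:
    _ = stacked[2]  # request 2 exists, but stacked only has 2 rows
    index_error_raised = False
except IndexError:
    index_error_raised = True

# Suggested fix: pass the list through as-is, preserving the
# one-to-one mapping between rows and requests.
preserved = per_request_phs
assert preserved[1] is None        # request 1: explicitly no pooled states
assert preserved[2] == [0.3, 0.4]  # request 2: still the correct row
```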
kpham-sgl
left a comment
@sundar24295s is this new feature intended to be cudagraph-able in the future?
If not, can you disable it in can_run?
If yes, I think you need to let `capture_one_batch_size()` capture with `return_pooled_hidden_states=True`?
kpham-sgl
left a comment
Thanks for addressing all the comments!
/rerun-failed-ci
…ceClassification / RewardModel (sgl-project#22427)
…lay_prepare The shared replay_prepare bound from PiecewiseCudaGraphRunner reads self.capture_return_pooled_hidden_states (added upstream by the Score API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR with main hit AttributeError on first replay. Mirror PCG's initialization: not model_runner.is_generation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
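The fix this commit describes can be stubbed out as follows; "PCG"/"BCG" class bodies are reduced to the relevant attribute, and all names except `capture_return_pooled_hidden_states` and `is_generation` are illustrative:

```python
class ModelRunnerStub:
    is_generation = False  # e.g. an embedding / classification model

class PiecewiseCudaGraphRunnerStub:  # "PCG"
    def __init__(self, model_runner):
        # PCG already sets the flag (added by the Score API PR).
        self.capture_return_pooled_hidden_states = not model_runner.is_generation

    def replay_prepare(self):
        # The shared replay path reads the flag unconditionally, so any
        # runner that borrows this method must initialize it in __init__.
        return self.capture_return_pooled_hidden_states

class BCGStub(PiecewiseCudaGraphRunnerStub):
    def __init__(self, model_runner):
        # Before the fix this __init__ never set the attribute, so the first
        # replay raised AttributeError. Mirror PCG's initialization:
        self.capture_return_pooled_hidden_states = not model_runner.is_generation

flag = BCGStub(ModelRunnerStub()).replay_prepare()  # no AttributeError
```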
Add `return_pooled_hidden_states` to Scoring API for SequenceClassification / RewardModel

Summary
Adds the ability to extract raw transformer hidden states (before the task-specific classification/reward head) from the `/v1/score` endpoint. This is useful for downstream consumers that need the model's internal representation alongside the final scores, e.g. for distillation, interpretability, or secondary scoring pipelines.

The feature is gated behind a new `return_pooled_hidden_states: bool` parameter on the scoring request. When `false` (default), behavior is identical to before. When `true`, an additional `pooled_hidden_states` field is returned containing per-item hidden state vectors.

Supported on SequenceClassification and RewardModel architectures. Raises `ValueError` for CausalLM models.

Changes
Core pipeline (bottom → top)
- `layers/pooler.py`: Added a `pooled_hidden_states` field to `EmbeddingPoolerOutput`. Extracted a standalone `pool_hidden_states()` function for reuse. Updated `score_and_pool()` to conditionally capture pre-head hidden states for both standard and MIS paths.
- `model_executor/forward_batch_info.py`: Added `return_pooled_hidden_states: bool` to `ForwardBatch`, propagated from `ModelWorkerBatch`.
- `model_executor/piecewise_cuda_graph_runner.py`: Captures with `return_pooled_hidden_states=True`; replay ORs that flag for graph matching; passes it through on `static_forward_batch`.
- `configs/model_config.py`: Added an `is_cross_encoding_pooler_model()` checker and a `_cross_encoding_pooler_archs` list (BERT, XLM-R) for models that don't expose pre-head hidden states.
- `server_args.py`: `_handle_multi_item_scoring()` auto-disables CUDA graph and piecewise CUDA graph when `--multi-item-scoring-delimiter` is set (padded static `input_ids` cause spurious delimiter matches in `score_and_pool`).
- `managers/schedule_batch.py`: Added `return_pooled_hidden_states` to `Req`, `ScheduleBatch`, and `ModelWorkerBatch`. `ScheduleBatch.init_new` aggregates the flag via `any()`.
- `managers/scheduler.py`: Added `pooled_hidden_states` to `EmbeddingBatchResult` with `copy_to_cpu()` support.
- `managers/scheduler_output_processor_mixin.py`: Reads pooled hidden states from `EmbeddingBatchResult`, assigns them to `req.pooled_hidden_state`, and stacks/sends via `BatchEmbeddingOutput`.
- `managers/tp_worker.py`: `forward_batch_embedding` now returns the full `EmbeddingPoolerOutput` (not just embeddings).
- `managers/io_struct.py`: Added `return_pooled_hidden_states` to `EmbeddingReqInput` and `TokenizedEmbeddingReqInput`. Added `pooled_hidden_states` to `BatchEmbeddingOutput`. Fixed `__getitem__` propagation in both standard and cross-encoder paths.
- `managers/tokenizer_manager.py`: Propagates `return_pooled_hidden_states` during tokenization; includes `pooled_hidden_state` in the output dict.
- `managers/tokenizer_manager_score_mixin.py`: Added `pooled_hidden_states` to `ScoreResult`. Updated `score_request()`, `_process_single_item_scoring_results()`, and `_process_multi_item_scoring_results()` to thread the flag and collect PHS. Added an early `ValueError` for CausalLM + PHS and for BERT/XLM-R `CrossEncodingPooler` models that don't expose pre-head hidden states.
- `entrypoints/engine_score_mixin.py`: Added a `return_pooled_hidden_states` parameter to `score()` and `async_score()`.
- `entrypoints/openai/protocol.py`: Added `return_pooled_hidden_states` to `ScoringRequest` and `pooled_hidden_states` to `ScoringResponse`.
- `entrypoints/openai/serving_score.py`: Updated `score_request()`; converts PHS tensors to `List[List[float]]` for JSON serialization via `ORJSONResponse`.

Model files
- `models/qwen3_classification.py`: Uses `score_and_pool()`, which now handles PHS natively; no changes needed.
- `models/llama_reward.py`: `LlamaForSequenceClassification` pools hidden states then scores; conditionally returns PHS. `LlamaForSequenceClassificationWithNormal_Weights` calls `pool_hidden_states()` separately since it has custom scoring logic.
- `models/qwen2_rm.py`: Calls `pool_hidden_states()`.
- `models/gemma2_reward.py`
- `models/internlm2_reward.py`

Tests
- `test/registered/prefill_only/test_pooled_hidden_states.py`

Server Start Commands
Without Multi-Item Scoring (single-item mode)
```shell
source /workspace/venvs/sglang-repos/bin/activate
python -m sglang.launch_server \
  --model-path /shared/public/elr-models/sjsmodels/downloaded_models/l2-slmv6/base_model/ \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --max-prefill-tokens 100000 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --is-embedding
```

With Multi-Item Scoring (MIS delimiter mode)
CUDA graph behaviour
Without `--multi-item-scoring-delimiter`: CUDA graph and piecewise CUDA graph work normally for single-item scoring. Piecewise graphs for embedding/classification models are captured with `return_pooled_hidden_states=True` so the traced forward matches scoring requests that need PHS. When the batch does not request PHS, the extra output is stripped in `model_runner.forward_extend` to avoid redundant CPU copies.

With `--chunked-prefill-size -1`, `piecewise_cuda_graph_max_tokens` defaults to `-1` and the capture-size list is empty; pass `--piecewise-cuda-graph-max-tokens 8192` (or similar) to enable piecewise graphs.

With `--multi-item-scoring-delimiter`: `server_args.py` now auto-disables both CUDA graph and piecewise CUDA graph at startup (with a warning log). The padded static `input_ids` buffer used by CUDA graph replay causes spurious delimiter matches in `score_and_pool`, so MIS always runs without graphs.

CUDA / piecewise graph validation (manual, single-item mode)
Validated 2026-04-13 on NVIDIA H100 80GB, model `Qwen3ForSequenceClassification`, single-item scoring (no `--multi-item-scoring-delimiter`), port `30001`, with `--piecewise-cuda-graph-max-tokens 8192`.

- `disable_cuda_graph=False` in the logged `ServerArgs`
- Capture piecewise CUDA graph begin→end: time elapsed 7.90 s, mem usage 0.68 GB
- `/v1/score` without PHS: `object=scoring`, `pooled_hidden_states=null`, 12 label logits
- `/v1/score` with `return_pooled_hidden_states: true`: `pooled_hidden_states` length 1024 per item; head `[0.182..., 0.200..., -0.908...]`

Curl Tests & Output
Single-Item Scoring Mode
Test 1: Basic scoring (no pooled hidden states)
Output:
```json
{
  "scores": [
    [
      -2.6796875, -2.78125, 4.7890625, -2.166015625, -0.8828125, -3.216796875,
      -2.767578125, -2.935546875, 1.1982421875, -0.703125, 3.623046875, -2.7421875
    ]
  ],
  "pooled_hidden_states": null,
  "model": "test",
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 20,
    "completion_tokens": 0,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "object": "scoring"
}
```

Test 2: Single item with `return_pooled_hidden_states=true`

Output (summarized):
Test 3: Multiple items with `return_pooled_hidden_states=true`

Output:
Multi-Item Scoring (MIS) Mode
Test 4: MIS — multiple items with `return_pooled_hidden_states=true`

Output:
Test 5: MIS — without `return_pooled_hidden_states`

Output:
Test 6: MIS — single item with `return_pooled_hidden_states=true`

Output:
Test 7: MIS — many items (6) with `return_pooled_hidden_states=true`

Output:
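The curl-driven tests above can also be sketched as an offline Python check of the request and response shapes. Only `return_pooled_hidden_states`, `scores`, `pooled_hidden_states`, `object`, and `model` come from this PR; the `query`/`items` field names and all values are placeholders, not the verified `ScoringRequest` schema:

```python
import json

# Hypothetical request body a client would POST to /v1/score.
payload = {
    "model": "test",
    "query": "example query",                 # placeholder input
    "items": ["candidate A", "candidate B"],  # placeholder items to score
    "return_pooled_hidden_states": True,      # the new flag from this PR
}
body = json.dumps(payload)

# Sample response in the shape shown in the outputs above: "scores" is a list
# of per-item logit lists; "pooled_hidden_states" is null unless requested.
response = {
    "object": "scoring",
    "scores": [[-2.68, 4.79, 1.20], [3.62, -0.70, -2.74]],
    "pooled_hidden_states": [[0.182, 0.200, -0.908], [0.1, 0.2, 0.3]],
}
phs = response["pooled_hidden_states"]
assert phs is None or len(phs) == len(response["scores"])  # one vector per item
```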
Unit Test Results
Pooler unit tests (CPU only)
Pooled hidden states E2E tests (GPU)
Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`