[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel (#22427)

Merged
Qiaolin-Yu merged 11 commits into sgl-project:main from sundar24295s:suramach/pooledhiddenstates
Apr 15, 2026
Conversation

@sundar24295s (Collaborator) commented Apr 9, 2026

Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel

Summary

Adds the ability to extract raw transformer hidden states (before the task-specific classification/reward head) from the /v1/score endpoint. This is useful for downstream consumers that need the model's internal representation alongside the final scores — e.g., for distillation, interpretability, or secondary scoring pipelines.

The feature is gated behind a new return_pooled_hidden_states: bool parameter on the scoring request. When false (default), behavior is identical to before. When true, an additional pooled_hidden_states field is returned containing per-item hidden state vectors.

Supported on SequenceClassification and RewardModel architectures. Raises ValueError for CausalLM models.
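The architecture gating described above can be illustrated with a small standalone sketch. The helper name `validate_phs_request` and the error message are illustrative only; the real check lives in `tokenizer_manager_score_mixin.py`:

```python
# Hedged sketch of the gating semantics described above, not the actual
# SGLang code: PHS is accepted for classifier/reward architectures and
# rejected with ValueError for CausalLM checkpoints.
def validate_phs_request(model_arch: str, return_pooled_hidden_states: bool) -> None:
    """Reject PHS requests for architectures without a pre-head pooled state."""
    supported = ("SequenceClassification", "RewardModel")
    if return_pooled_hidden_states and not any(s in model_arch for s in supported):
        raise ValueError(
            f"return_pooled_hidden_states is not supported for {model_arch}; "
            "use a SequenceClassification or RewardModel checkpoint."
        )

validate_phs_request("Qwen3ForSequenceClassification", True)  # accepted
validate_phs_request("LlamaForCausalLM", False)               # default path, accepted
try:
    validate_phs_request("LlamaForCausalLM", True)            # rejected
except ValueError as e:
    print("rejected:", e)
```

With the flag left at its default of false, every architecture takes the pre-existing code path unchanged.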

Changes

Core pipeline (bottom → top)

| File | Change |
| --- | --- |
| `layers/pooler.py` | Added `pooled_hidden_states` field to `EmbeddingPoolerOutput`. Extracted a standalone `pool_hidden_states()` function for reuse. Updated `score_and_pool()` to conditionally capture pre-head hidden states for both the standard and MIS paths. |
| `model_executor/forward_batch_info.py` | Added `return_pooled_hidden_states: bool` to `ForwardBatch`, propagated from `ModelWorkerBatch`. |
| `model_executor/piecewise_cuda_graph_runner.py` | Captures embedding forwards with `return_pooled_hidden_states=True`; replay ORs that flag for graph matching; passes it through on `static_forward_batch`. |
| `configs/model_config.py` | Added an `is_cross_encoding_pooler_model()` checker and a `_cross_encoding_pooler_archs` list (BERT, XLM-R) for models that don't expose pre-head hidden states. |
| `server_args.py` | Added `_handle_multi_item_scoring()`, which auto-disables CUDA graph and piecewise CUDA graph when `--multi-item-scoring-delimiter` is set (padded static `input_ids` cause spurious delimiter matches in `score_and_pool`). |
| `managers/schedule_batch.py` | Added `return_pooled_hidden_states` to `Req`, `ScheduleBatch`, and `ModelWorkerBatch`. `ScheduleBatch.init_new` aggregates the flag via `any()`. |
| `managers/scheduler.py` | Added `pooled_hidden_states` to `EmbeddingBatchResult` with `copy_to_cpu()` support. |
| `managers/scheduler_output_processor_mixin.py` | Extracts PHS from `EmbeddingBatchResult`, assigns it to `req.pooled_hidden_state`, and stacks/sends it via `BatchEmbeddingOutput`. |
| `managers/tp_worker.py` | `forward_batch_embedding` now returns the full `EmbeddingPoolerOutput` (not just embeddings). |
| `managers/io_struct.py` | Added `return_pooled_hidden_states` to `EmbeddingReqInput` and `TokenizedEmbeddingReqInput`. Added `pooled_hidden_states` to `BatchEmbeddingOutput`. Fixed `__getitem__` propagation in both the standard and cross-encoder paths. |
| `managers/tokenizer_manager.py` | Propagates `return_pooled_hidden_states` during tokenization; includes `pooled_hidden_state` in the output dict. |
| `managers/tokenizer_manager_score_mixin.py` | Added `pooled_hidden_states` to `ScoreResult`. Updated `score_request()`, `_process_single_item_scoring_results()`, and `_process_multi_item_scoring_results()` to thread the flag and collect PHS. Added an early `ValueError` for CausalLM + PHS and for BERT/XLM-R `CrossEncodingPooler` models that don't expose pre-head hidden states. |
| `entrypoints/engine_score_mixin.py` | Added a `return_pooled_hidden_states` parameter to `score()` and `async_score()`. |
| `entrypoints/openai/protocol.py` | Added `return_pooled_hidden_states` to `ScoringRequest` and `pooled_hidden_states` to `ScoringResponse`. |
| `entrypoints/openai/serving_score.py` | Passes the flag to `score_request()`; converts PHS tensors to `List[List[float]]` for JSON serialization via `ORJSONResponse`. |
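The pooler changes at the bottom of this table can be summarized with a framework-free sketch. This is a hedged simplification of the `score_and_pool()` flow, not the real implementation: it uses plain lists instead of torch tensors, assumes last-token pooling and a linear head, and returns the pooled pre-head states only when the flag is set.

```python
# Hedged sketch of the score_and_pool() flow: pool a pre-head representation
# per request from a packed batch, apply the task head, and return the pooled
# states only when requested. Shapes and pooling strategy are illustrative.
from typing import List, Optional, Tuple

Vec = List[float]

def pool_hidden_states(hidden: List[Vec], seq_lens: List[int]) -> List[Vec]:
    """Last-token pooling per request over a packed batch of token states."""
    pooled, offset = [], 0
    for n in seq_lens:
        pooled.append(hidden[offset + n - 1])
        offset += n
    return pooled

def score_and_pool(
    hidden: List[Vec],
    seq_lens: List[int],
    head: List[Vec],  # [hidden_dim][num_labels] weights of the task head
    return_pooled_hidden_states: bool = False,
) -> Tuple[List[Vec], Optional[List[Vec]]]:
    pooled = pool_hidden_states(hidden, seq_lens)  # pre-head representation
    scores = [
        [sum(p[i] * head[i][j] for i in range(len(p))) for j in range(len(head[0]))]
        for p in pooled
    ]
    return scores, (pooled if return_pooled_hidden_states else None)
```

With the flag off, the second element is `None`, matching the `pooled_hidden_states: null` default shown in the curl tests below.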

Model files

| File | Change |
| --- | --- |
| `models/qwen3_classification.py` | Uses `score_and_pool()`, which now handles PHS natively; no changes needed. |
| `models/llama_reward.py` | `LlamaForSequenceClassification`: pools hidden states then scores; conditionally returns PHS. `LlamaForSequenceClassificationWithNormal_Weights`: calls `pool_hidden_states()` separately since it has custom scoring logic. |
| `models/qwen2_rm.py` | Scores all tokens then pools; conditionally returns PHS via `pool_hidden_states()`. |
| `models/gemma2_reward.py` | Pools then scores; conditionally returns PHS. |
| `models/internlm2_reward.py` | Pools then scores; conditionally returns PHS. |
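The two orderings in the table above (qwen2_rm scores every token and then pools; the reward models pool first and then score) agree whenever the head is applied pointwise per token and pooling merely selects tokens. A tiny sketch under those assumptions (names and shapes are illustrative, not the model code):

```python
# Hedged sketch: with a pointwise linear head and last-token pooling, the two
# orderings produce identical per-request scores, which is why both model
# styles can share the same PHS plumbing.
def last_token(xs, seq_lens):
    out, off = [], 0
    for n in seq_lens:
        out.append(xs[off + n - 1])
        off += n
    return out

def head(vec, w):  # dot-product head: hidden vector -> scalar score
    return sum(v * wi for v, wi in zip(vec, w))

hidden = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
seq_lens = [1, 2]
w = [2.0, 1.0]

pool_then_score = [head(v, w) for v in last_token(hidden, seq_lens)]
score_then_pool = last_token([head(v, w) for v in hidden], seq_lens)
print(pool_then_score == score_then_pool)  # True
```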

Tests

| File | Tests |
| --- | --- |
| `test/registered/prefill_only/test_pooled_hidden_states.py` | 20 tests across 4 classes: Engine single-item (8), Engine MIS (7), CausalLM rejection (2), HTTP integration (3). |

Server Start Commands

Without Multi-Item Scoring (single-item mode)

source /workspace/venvs/sglang-repos/bin/activate

python -m sglang.launch_server \
  --model-path /shared/public/elr-models/sjsmodels/downloaded_models/l2-slmv6/base_model/ \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --max-prefill-tokens 100000 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --is-embedding

With Multi-Item Scoring (MIS delimiter mode)

python -m sglang.launch_server \
  --model-path /shared/public/elr-models/sjsmodels/downloaded_models/l2-slmv6/base_model/ \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --max-prefill-tokens 100000 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --is-embedding \
  --multi-item-scoring-delimiter 151643

CUDA graph behaviour

Without --multi-item-scoring-delimiter: CUDA graph and piecewise CUDA graph work normally for single-item scoring. Piecewise graphs for embedding/classification models are captured with return_pooled_hidden_states=True so the traced forward matches scoring requests that need PHS. When the batch does not request PHS, the extra output is stripped in model_runner.forward_extend to avoid redundant CPU copies.

With --chunked-prefill-size -1, piecewise_cuda_graph_max_tokens defaults to -1 and the capture-size list is empty; pass --piecewise-cuda-graph-max-tokens 8192 (or similar) to enable piecewise graphs.

With --multi-item-scoring-delimiter: server_args.py now auto-disables both CUDA graph and piecewise CUDA graph at startup (with a warning log). The padded static input_ids buffer used by CUDA graph replay causes spurious delimiter matches in score_and_pool, so MIS always runs without graphs.
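The auto-disable rule can be sketched as follows. This is a hedged simplification with hypothetical field names; the real logic is `_handle_multi_item_scoring()` in `server_args.py`:

```python
import logging

logger = logging.getLogger("server_args")

def handle_multi_item_scoring(args: dict) -> dict:
    """Hedged sketch of the MIS auto-disable rule described above.

    CUDA-graph replay uses a padded static input_ids buffer, so stale padding
    could spuriously match the MIS delimiter in score_and_pool; the safe
    default is to run MIS without graphs. Field names are illustrative.
    """
    if args.get("multi_item_scoring_delimiter") is not None:
        if not (args.get("disable_cuda_graph") and args.get("disable_piecewise_cuda_graph")):
            logger.warning(
                "multi-item-scoring-delimiter is set: disabling CUDA graph "
                "and piecewise CUDA graph"
            )
        args["disable_cuda_graph"] = True
        args["disable_piecewise_cuda_graph"] = True
    return args
```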


CUDA / piecewise graph validation (manual, single-item mode)

Validated 2026-04-13 on NVIDIA H100 80GB, model Qwen3ForSequenceClassification, single-item scoring (no --multi-item-scoring-delimiter), port 30001, with --piecewise-cuda-graph-max-tokens 8192.

| Check | Result |
| --- | --- |
| Decode CUDA graph | Enabled (`disable_cuda_graph=False` in the logged `ServerArgs`) |
| Piecewise CUDA graph | Enabled; startup log: `Capture piecewise CUDA graph begin` ... `end. Time elapsed: 7.90 s. mem usage=0.68 GB.` |
| `/v1/score` without PHS | OK: `object=scoring`, `pooled_hidden_states=null`, 12 label logits |
| `/v1/score` with `return_pooled_hidden_states: true` | OK: `pooled_hidden_states` length 1024 per item; head `[0.182..., 0.200..., -0.908...]` |

Curl Tests & Output

Single-Item Scoring Mode

Server started without --multi-item-scoring-delimiter

Test 1: Basic scoring (no pooled hidden states)

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "model": "test"
  }' | python3 -m json.tool

Output:

{
    "scores": [
        [
            -2.6796875, -2.78125, 4.7890625, -2.166015625,
            -0.8828125, -3.216796875, -2.767578125, -2.935546875,
            1.1982421875, -0.703125, 3.623046875, -2.7421875
        ]
    ],
    "pooled_hidden_states": null,
    "model": "test",
    "usage": {
        "prompt_tokens": 20,
        "total_tokens": 20,
        "completion_tokens": 0,
        "prompt_tokens_details": null,
        "reasoning_tokens": 0
    },
    "object": "scoring"
}
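Downstream consumers can check the flag-off contract against a response like the one above. A minimal client-side sketch using only the stdlib and a trimmed copy of that response (the `scores` row is truncated for brevity):

```python
import json

# Trimmed copy of the /v1/score response shown above. Client code should
# treat pooled_hidden_states as Optional[List[List[float]]]: null when the
# flag is off, one vector per item when it is on.
sample = """
{
  "scores": [[-2.6796875, -2.78125, 4.7890625]],
  "pooled_hidden_states": null,
  "model": "test",
  "object": "scoring"
}
"""
resp = json.loads(sample)
assert resp["object"] == "scoring"
assert resp["pooled_hidden_states"] is None  # flag off -> no PHS
assert len(resp["scores"]) == 1              # one score row per item
```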

Test 2: Single item with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output (summarized):

scores: 1 item, 12 dims
pooled_hidden_states: 1 item, dim=1024
  first 5 values: [0.182007, 0.200928, -0.908691, 0.897949, 0.177002]

Test 3: Multiple items with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: 3 items, dim=1024
  item 0 first 5: [-0.031921, 0.95166, -0.897949, 1.463867, 0.424805]
  item 1 first 5: [0.387939, -1.735352, -0.908203, 2.292969, -0.062744]
  item 2 first 5: [0.132446, 0.162964, -0.894531, 0.879395, 0.12854]

Multi-Item Scoring (MIS) Mode

Server started with --multi-item-scoring-delimiter 151643

Test 4: MIS — multiple items with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: 3 items, dim=1024
  item 0 first 5: [-0.031921, 0.95166, -0.897949, 1.463867, 0.424805]
  item 1 first 5: [0.387939, -1.735352, -0.908203, 2.292969, -0.062744]
  item 2 first 5: [0.132446, 0.162964, -0.894531, 0.879395, 0.12854]

Test 5: MIS — without return_pooled_hidden_states

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]  (same as Test 4)
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: None

Test 6: MIS — single item with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California?",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

scores: 1 item, 12 dims
pooled_hidden_states: 1 item, dim=1024
  first 5 values: [0.212158, -0.395508, -0.897949, 1.944336, 0.267334]

Test 7: MIS — many items (6) with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this a US state capital?",
    "items": ["Sacramento", "New York", "Austin", "Miami", "Denver", "Olympia"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 6
  item 0: score_dims=12, phs_dim=1024
  item 1: score_dims=12, phs_dim=1024
  item 2: score_dims=12, phs_dim=1024
  item 3: score_dims=12, phs_dim=1024
  item 4: score_dims=12, phs_dim=1024
  item 5: score_dims=12, phs_dim=1024

Unit Test Results

Pooler unit tests (CPU only)

$ python test/registered/unit/layers/test_pooler_score_and_pool.py -v

TestScoreAndPool
  test_mis_batched_splits_per_request .................. ok
  test_mis_extracts_positions_before_delimiter ......... ok
  test_mis_falls_back_when_no_delimiters_in_input ...... ok
  test_mis_falls_back_when_not_prefill_only ............ ok
  test_mis_ignores_delimiter_at_position_zero .......... ok
  test_mis_returns_per_request_list .................... ok
  test_single_item_returns_scores ...................... ok
  test_single_item_scores_match_manual_computation ..... ok

----------------------------------------------------------------------
Ran 8 tests in 0.009s

OK

Pooled hidden states E2E tests (GPU)

$ python test/registered/prefill_only/test_pooled_hidden_states.py -v

TestPooledHiddenStatesEngine
  test_phs_count_matches_items .............. ok
  test_phs_deterministic .................... ok
  test_phs_none_when_not_requested .......... ok
  test_phs_on_cpu ........................... ok
  test_phs_returned_when_requested .......... ok
  test_phs_shape_is_consistent .............. ok
  test_phs_with_tokenized_inputs ............ ok
  test_scores_unaffected_by_phs_flag ........ ok

TestPooledHiddenStatesMISEngine
  test_mis_many_items ....................... ok
  test_mis_phs_are_tensors_on_cpu ........... ok
  test_mis_phs_count_matches_items .......... ok
  test_mis_phs_different_items_different_hs . ok
  test_mis_phs_none_when_not_requested ...... ok
  test_mis_scores_unaffected_by_phs_flag .... ok
  test_mis_single_item ...................... ok

TestPooledHiddenStatesCausalLMRejection
  test_causal_lm_rejects_phs ............... ok
  test_causal_lm_without_phs_still_works .... ok

TestPooledHiddenStatesHTTP
  test_phs_absent_when_not_requested ........ ok
  test_phs_in_response_json ................. ok
  test_phs_matches_item_count ............... ok

----------------------------------------------------------------------
Ran 20 tests in 78.881s

OK

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces the ability to return pooled hidden states from the model's intermediate layers during scoring. This feature is integrated across various components, including the scoring API, protocol definitions, model pooling logic, and request/batch management. New tests validate this functionality for single-item and multi-item scoring, HTTP integration, and confirm that CausalLM models correctly reject requests for pooled hidden states. A high-severity bug was identified in the scheduler_output_processor_mixin.py file, where the logic for stacking pooled_hidden_states for pickling optimization can cause an IndexError if some requests in a batch do not return these states. The suggested fix is to pass the list directly to preserve the one-to-one mapping with requests.

Comment thread python/sglang/srt/managers/scheduler_output_processor_mixin.py Outdated
@kpham-sgl (Collaborator) left a comment


@sundar24295s is this new feature intended to be cudagraph-able in the future?
If not, can you disable it in can_run?
If yes, I think you need to let capture_one_batch_size() capture with `return_pooled_hidden_states=True`?

Comment thread python/sglang/srt/entrypoints/openai/protocol.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/protocol.py Outdated
Comment thread python/sglang/srt/managers/tokenizer_manager_score_mixin.py Outdated
Comment thread python/sglang/srt/managers/tokenizer_manager_score_mixin.py
Comment thread python/sglang/srt/managers/schedule_batch.py
@sundar24295s (Collaborator, Author) replied:

> @sundar24295s is this new feature intended to be cudagraph-able in the future? If not, can you disable it in can_run? If yes, I think you need to let capture_one_batch_size() capture with `return_pooled_hidden_states=True`?

  • For Single Item scoring, updated the piecewise_cuda_graph_runner.
  • For Multi-Item scoring, I have disabled piecewise CUDA graph in ServerArgs. This will need more work and will be done in a separate PR. (MIS was added first, followed by PCG; we never actually made the two work together.)

@kpham-sgl (Collaborator) left a comment


Thanks for addressing all the comments!

@sundar24295s sundar24295s enabled auto-merge (squash) April 14, 2026 03:41
@sundar24295s (Collaborator, Author):

/rerun-failed-ci

@Qiaolin-Yu Qiaolin-Yu disabled auto-merge April 15, 2026 21:58
@Qiaolin-Yu Qiaolin-Yu merged commit 4927975 into sgl-project:main Apr 15, 2026
789 of 1040 checks passed
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
Oasis-Git added a commit to Oasis-Git/sglang that referenced this pull request Apr 23, 2026
…lay_prepare

The shared replay_prepare bound from PiecewiseCudaGraphRunner reads
self.capture_return_pooled_hidden_states (added upstream by the Score
API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR
with main hit AttributeError on first replay.

Mirror PCG's initialization: not model_runner.is_generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
