[Score API] Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel (#22427)

Merged
Qiaolin-Yu merged 11 commits into sgl-project:main from sundar24295s:suramach/pooledhiddenstates
Apr 15, 2026
Conversation

@sundar24295s (Collaborator) commented Apr 9, 2026

Add return_pooled_hidden_states to Scoring API for SequenceClassification / RewardModel

Summary

Adds the ability to extract raw transformer hidden states (before the task-specific classification/reward head) from the /v1/score endpoint. This is useful for downstream consumers that need the model's internal representation alongside the final scores — e.g., for distillation, interpretability, or secondary scoring pipelines.

The feature is gated behind a new return_pooled_hidden_states: bool parameter on the scoring request. When false (default), behavior is identical to before. When true, an additional pooled_hidden_states field is returned containing per-item hidden state vectors.

Supported on SequenceClassification and RewardModel architectures. Raises ValueError for CausalLM models.
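The architecture gating described above can be illustrated with a small standalone sketch. The helper name `validate_phs_request` and the error message are illustrative only; the real check lives in `tokenizer_manager_score_mixin.py`:

```python
# Hedged sketch of the gating semantics described above, not the actual
# SGLang code: PHS is accepted for classifier/reward architectures and
# rejected with ValueError for CausalLM checkpoints.
def validate_phs_request(model_arch: str, return_pooled_hidden_states: bool) -> None:
    """Reject PHS requests for architectures without a pre-head pooled state."""
    supported = ("SequenceClassification", "RewardModel")
    if return_pooled_hidden_states and not any(s in model_arch for s in supported):
        raise ValueError(
            f"return_pooled_hidden_states is not supported for {model_arch}; "
            "use a SequenceClassification or RewardModel checkpoint."
        )

validate_phs_request("Qwen3ForSequenceClassification", True)  # accepted
validate_phs_request("LlamaForCausalLM", False)               # default path, accepted
try:
    validate_phs_request("LlamaForCausalLM", True)            # rejected
except ValueError as e:
    print("rejected:", e)
```

With the flag left at its default of false, every architecture takes the pre-existing code path unchanged.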

Changes

Core pipeline (bottom → top)

| File | Change |
| --- | --- |
| `layers/pooler.py` | Added `pooled_hidden_states` field to `EmbeddingPoolerOutput`. Extracted a standalone `pool_hidden_states()` function for reuse. Updated `score_and_pool()` to conditionally capture pre-head hidden states for both the standard and MIS paths. |
| `model_executor/forward_batch_info.py` | Added `return_pooled_hidden_states: bool` to `ForwardBatch`, propagated from `ModelWorkerBatch`. |
| `model_executor/piecewise_cuda_graph_runner.py` | Captures embedding forwards with `return_pooled_hidden_states=True`; replay ORs that flag for graph matching; passes it through on `static_forward_batch`. |
| `configs/model_config.py` | Added an `is_cross_encoding_pooler_model()` checker and a `_cross_encoding_pooler_archs` list (BERT, XLM-R) for models that don't expose pre-head hidden states. |
| `server_args.py` | Added `_handle_multi_item_scoring()`, which auto-disables CUDA graph and piecewise CUDA graph when `--multi-item-scoring-delimiter` is set (padded static `input_ids` cause spurious delimiter matches in `score_and_pool`). |
| `managers/schedule_batch.py` | Added `return_pooled_hidden_states` to `Req`, `ScheduleBatch`, and `ModelWorkerBatch`. `ScheduleBatch.init_new` aggregates the flag via `any()`. |
| `managers/scheduler.py` | Added `pooled_hidden_states` to `EmbeddingBatchResult` with `copy_to_cpu()` support. |
| `managers/scheduler_output_processor_mixin.py` | Extracts PHS from `EmbeddingBatchResult`, assigns it to `req.pooled_hidden_state`, and stacks/sends it via `BatchEmbeddingOutput`. |
| `managers/tp_worker.py` | `forward_batch_embedding` now returns the full `EmbeddingPoolerOutput` (not just embeddings). |
| `managers/io_struct.py` | Added `return_pooled_hidden_states` to `EmbeddingReqInput` and `TokenizedEmbeddingReqInput`. Added `pooled_hidden_states` to `BatchEmbeddingOutput`. Fixed `__getitem__` propagation in both the standard and cross-encoder paths. |
| `managers/tokenizer_manager.py` | Propagates `return_pooled_hidden_states` during tokenization; includes `pooled_hidden_state` in the output dict. |
| `managers/tokenizer_manager_score_mixin.py` | Added `pooled_hidden_states` to `ScoreResult`. Updated `score_request()`, `_process_single_item_scoring_results()`, and `_process_multi_item_scoring_results()` to thread the flag and collect PHS. Added an early `ValueError` for CausalLM + PHS and for BERT/XLM-R `CrossEncodingPooler` models that don't expose pre-head hidden states. |
| `entrypoints/engine_score_mixin.py` | Added a `return_pooled_hidden_states` parameter to `score()` and `async_score()`. |
| `entrypoints/openai/protocol.py` | Added `return_pooled_hidden_states` to `ScoringRequest` and `pooled_hidden_states` to `ScoringResponse`. |
| `entrypoints/openai/serving_score.py` | Passes the flag to `score_request()`; converts PHS tensors to `List[List[float]]` for JSON serialization via `ORJSONResponse`. |
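The pooler changes at the bottom of this table can be summarized with a framework-free sketch. This is a hedged simplification of the `score_and_pool()` flow, not the real implementation: it uses plain lists instead of torch tensors, assumes last-token pooling and a linear head, and returns the pooled pre-head states only when the flag is set.

```python
# Hedged sketch of the score_and_pool() flow: pool a pre-head representation
# per request from a packed batch, apply the task head, and return the pooled
# states only when requested. Shapes and pooling strategy are illustrative.
from typing import List, Optional, Tuple

Vec = List[float]

def pool_hidden_states(hidden: List[Vec], seq_lens: List[int]) -> List[Vec]:
    """Last-token pooling per request over a packed batch of token states."""
    pooled, offset = [], 0
    for n in seq_lens:
        pooled.append(hidden[offset + n - 1])
        offset += n
    return pooled

def score_and_pool(
    hidden: List[Vec],
    seq_lens: List[int],
    head: List[Vec],  # [hidden_dim][num_labels] weights of the task head
    return_pooled_hidden_states: bool = False,
) -> Tuple[List[Vec], Optional[List[Vec]]]:
    pooled = pool_hidden_states(hidden, seq_lens)  # pre-head representation
    scores = [
        [sum(p[i] * head[i][j] for i in range(len(p))) for j in range(len(head[0]))]
        for p in pooled
    ]
    return scores, (pooled if return_pooled_hidden_states else None)
```

With the flag off, the second element is `None`, matching the `pooled_hidden_states: null` default shown in the curl tests below.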

Model files

| File | Change |
| --- | --- |
| `models/qwen3_classification.py` | Uses `score_and_pool()`, which now handles PHS natively; no changes needed. |
| `models/llama_reward.py` | `LlamaForSequenceClassification`: pools hidden states then scores; conditionally returns PHS. `LlamaForSequenceClassificationWithNormal_Weights`: calls `pool_hidden_states()` separately since it has custom scoring logic. |
| `models/qwen2_rm.py` | Scores all tokens then pools; conditionally returns PHS via `pool_hidden_states()`. |
| `models/gemma2_reward.py` | Pools then scores; conditionally returns PHS. |
| `models/internlm2_reward.py` | Pools then scores; conditionally returns PHS. |
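The two orderings in the table above (qwen2_rm scores every token and then pools; the reward models pool first and then score) agree whenever the head is applied pointwise per token and pooling merely selects tokens. A tiny sketch under those assumptions (names and shapes are illustrative, not the model code):

```python
# Hedged sketch: with a pointwise linear head and last-token pooling, the two
# orderings produce identical per-request scores, which is why both model
# styles can share the same PHS plumbing.
def last_token(xs, seq_lens):
    out, off = [], 0
    for n in seq_lens:
        out.append(xs[off + n - 1])
        off += n
    return out

def head(vec, w):  # dot-product head: hidden vector -> scalar score
    return sum(v * wi for v, wi in zip(vec, w))

hidden = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
seq_lens = [1, 2]
w = [2.0, 1.0]

pool_then_score = [head(v, w) for v in last_token(hidden, seq_lens)]
score_then_pool = last_token([head(v, w) for v in hidden], seq_lens)
print(pool_then_score == score_then_pool)  # True
```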

Tests

| File | Tests |
| --- | --- |
| `test/registered/prefill_only/test_pooled_hidden_states.py` | 20 tests across 4 classes: Engine single-item (8), Engine MIS (7), CausalLM rejection (2), HTTP integration (3). |

Server Start Commands

Without Multi-Item Scoring (single-item mode)

source /workspace/venvs/sglang-repos/bin/activate

python -m sglang.launch_server \
  --model-path /shared/public/elr-models/sjsmodels/downloaded_models/l2-slmv6/base_model/ \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --max-prefill-tokens 100000 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --is-embedding

With Multi-Item Scoring (MIS delimiter mode)

python -m sglang.launch_server \
  --model-path /shared/public/elr-models/sjsmodels/downloaded_models/l2-slmv6/base_model/ \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --max-prefill-tokens 100000 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --is-embedding \
  --multi-item-scoring-delimiter 151643

CUDA graph behaviour

Without --multi-item-scoring-delimiter: CUDA graph and piecewise CUDA graph work normally for single-item scoring. Piecewise graphs for embedding/classification models are captured with return_pooled_hidden_states=True so the traced forward matches scoring requests that need PHS. When the batch does not request PHS, the extra output is stripped in model_runner.forward_extend to avoid redundant CPU copies.

With --chunked-prefill-size -1, piecewise_cuda_graph_max_tokens defaults to -1 and the capture-size list is empty; pass --piecewise-cuda-graph-max-tokens 8192 (or similar) to enable piecewise graphs.

With --multi-item-scoring-delimiter: server_args.py now auto-disables both CUDA graph and piecewise CUDA graph at startup (with a warning log). The padded static input_ids buffer used by CUDA graph replay causes spurious delimiter matches in score_and_pool, so MIS always runs without graphs.
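The auto-disable rule can be sketched as follows. This is a hedged simplification with hypothetical field names; the real logic is `_handle_multi_item_scoring()` in `server_args.py`:

```python
import logging

logger = logging.getLogger("server_args")

def handle_multi_item_scoring(args: dict) -> dict:
    """Hedged sketch of the MIS auto-disable rule described above.

    CUDA-graph replay uses a padded static input_ids buffer, so stale padding
    could spuriously match the MIS delimiter in score_and_pool; the safe
    default is to run MIS without graphs. Field names are illustrative.
    """
    if args.get("multi_item_scoring_delimiter") is not None:
        if not (args.get("disable_cuda_graph") and args.get("disable_piecewise_cuda_graph")):
            logger.warning(
                "multi-item-scoring-delimiter is set: disabling CUDA graph "
                "and piecewise CUDA graph"
            )
        args["disable_cuda_graph"] = True
        args["disable_piecewise_cuda_graph"] = True
    return args
```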


CUDA / piecewise graph validation (manual, single-item mode)

Validated 2026-04-13 on NVIDIA H100 80GB, model Qwen3ForSequenceClassification, single-item scoring (no --multi-item-scoring-delimiter), port 30001, with --piecewise-cuda-graph-max-tokens 8192.

| Check | Result |
| --- | --- |
| Decode CUDA graph | Enabled (`disable_cuda_graph=False` in the logged `ServerArgs`) |
| Piecewise CUDA graph | Enabled; startup log: `Capture piecewise CUDA graph begin` ... `end. Time elapsed: 7.90 s. mem usage=0.68 GB.` |
| `/v1/score` without PHS | OK: `object=scoring`, `pooled_hidden_states=null`, 12 label logits |
| `/v1/score` with `return_pooled_hidden_states: true` | OK: `pooled_hidden_states` length 1024 per item; head `[0.182..., 0.200..., -0.908...]` |

Curl Tests & Output

Single-Item Scoring Mode

Server started without --multi-item-scoring-delimiter

Test 1: Basic scoring (no pooled hidden states)

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "model": "test"
  }' | python3 -m json.tool

Output:

{
    "scores": [
        [
            -2.6796875, -2.78125, 4.7890625, -2.166015625,
            -0.8828125, -3.216796875, -2.767578125, -2.935546875,
            1.1982421875, -0.703125, 3.623046875, -2.7421875
        ]
    ],
    "pooled_hidden_states": null,
    "model": "test",
    "usage": {
        "prompt_tokens": 20,
        "total_tokens": 20,
        "completion_tokens": 0,
        "prompt_tokens_details": null,
        "reasoning_tokens": 0
    },
    "object": "scoring"
}
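Downstream consumers can check the flag-off contract against a response like the one above. A minimal client-side sketch using only the stdlib and a trimmed copy of that response (the `scores` row is truncated for brevity):

```python
import json

# Trimmed copy of the /v1/score response shown above. Client code should
# treat pooled_hidden_states as Optional[List[List[float]]]: null when the
# flag is off, one vector per item when it is on.
sample = """
{
  "scores": [[-2.6796875, -2.78125, 4.7890625]],
  "pooled_hidden_states": null,
  "model": "test",
  "object": "scoring"
}
"""
resp = json.loads(sample)
assert resp["object"] == "scoring"
assert resp["pooled_hidden_states"] is None  # flag off -> no PHS
assert len(resp["scores"]) == 1              # one score row per item
```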

Test 2: Single item with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California? Answer Yes or No for each of the following options:",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output (summarized):

scores: 1 item, 12 dims
pooled_hidden_states: 1 item, dim=1024
  first 5 values: [0.182007, 0.200928, -0.908691, 0.897949, 0.177002]

Test 3: Multiple items with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: 3 items, dim=1024
  item 0 first 5: [-0.031921, 0.95166, -0.897949, 1.463867, 0.424805]
  item 1 first 5: [0.387939, -1.735352, -0.908203, 2.292969, -0.062744]
  item 2 first 5: [0.132446, 0.162964, -0.894531, 0.879395, 0.12854]

Multi-Item Scoring (MIS) Mode

Server started with --multi-item-scoring-delimiter 151643

Test 4: MIS — multiple items with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: 3 items, dim=1024
  item 0 first 5: [-0.031921, 0.95166, -0.897949, 1.463867, 0.424805]
  item 1 first 5: [0.387939, -1.735352, -0.908203, 2.292969, -0.062744]
  item 2 first 5: [0.132446, 0.162964, -0.894531, 0.879395, 0.12854]

Test 5: MIS — without return_pooled_hidden_states

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this city in California?",
    "items": ["Sacramento", "New York", "Los Angeles"],
    "label_token_ids": [9454, 2753],
    "model": "test"
  }'

Output:

num_items: 3
  item 0: scores=[-3.029, -3.297, 4.938, ...]  (same as Test 4)
  item 1: scores=[-2.832, -2.83, 6.535, ...]
  item 2: scores=[-2.627, -2.943, 5.727, ...]
pooled_hidden_states: None

Test 6: MIS — single item with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California?",
    "items": ["Sacramento"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

scores: 1 item, 12 dims
pooled_hidden_states: 1 item, dim=1024
  first 5 values: [0.212158, -0.395508, -0.897949, 1.944336, 0.267334]

Test 7: MIS — many items (6) with return_pooled_hidden_states=true

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Is this a US state capital?",
    "items": ["Sacramento", "New York", "Austin", "Miami", "Denver", "Olympia"],
    "label_token_ids": [9454, 2753],
    "return_pooled_hidden_states": true,
    "model": "test"
  }'

Output:

num_items: 6
  item 0: score_dims=12, phs_dim=1024
  item 1: score_dims=12, phs_dim=1024
  item 2: score_dims=12, phs_dim=1024
  item 3: score_dims=12, phs_dim=1024
  item 4: score_dims=12, phs_dim=1024
  item 5: score_dims=12, phs_dim=1024

Unit Test Results

Pooler unit tests (CPU only)

$ python test/registered/unit/layers/test_pooler_score_and_pool.py -v

TestScoreAndPool
  test_mis_batched_splits_per_request .................. ok
  test_mis_extracts_positions_before_delimiter ......... ok
  test_mis_falls_back_when_no_delimiters_in_input ...... ok
  test_mis_falls_back_when_not_prefill_only ............ ok
  test_mis_ignores_delimiter_at_position_zero .......... ok
  test_mis_returns_per_request_list .................... ok
  test_single_item_returns_scores ...................... ok
  test_single_item_scores_match_manual_computation ..... ok

----------------------------------------------------------------------
Ran 8 tests in 0.009s

OK

Pooled hidden states E2E tests (GPU)

$ python test/registered/prefill_only/test_pooled_hidden_states.py -v

TestPooledHiddenStatesEngine
  test_phs_count_matches_items .............. ok
  test_phs_deterministic .................... ok
  test_phs_none_when_not_requested .......... ok
  test_phs_on_cpu ........................... ok
  test_phs_returned_when_requested .......... ok
  test_phs_shape_is_consistent .............. ok
  test_phs_with_tokenized_inputs ............ ok
  test_scores_unaffected_by_phs_flag ........ ok

TestPooledHiddenStatesMISEngine
  test_mis_many_items ....................... ok
  test_mis_phs_are_tensors_on_cpu ........... ok
  test_mis_phs_count_matches_items .......... ok
  test_mis_phs_different_items_different_hs . ok
  test_mis_phs_none_when_not_requested ...... ok
  test_mis_scores_unaffected_by_phs_flag .... ok
  test_mis_single_item ...................... ok

TestPooledHiddenStatesCausalLMRejection
  test_causal_lm_rejects_phs ............... ok
  test_causal_lm_without_phs_still_works .... ok

TestPooledHiddenStatesHTTP
  test_phs_absent_when_not_requested ........ ok
  test_phs_in_response_json ................. ok
  test_phs_matches_item_count ............... ok

----------------------------------------------------------------------
Ran 20 tests in 78.881s

OK

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces the ability to return pooled hidden states from the model's intermediate layers during scoring. This feature is integrated across various components, including the scoring API, protocol definitions, model pooling logic, and request/batch management. New tests validate this functionality for single-item and multi-item scoring, HTTP integration, and confirm that CausalLM models correctly reject requests for pooled hidden states. A high-severity bug was identified in the scheduler_output_processor_mixin.py file, where the logic for stacking pooled_hidden_states for pickling optimization can cause an IndexError if some requests in a batch do not return these states. The suggested fix is to pass the list directly to preserve the one-to-one mapping with requests.

Comment thread python/sglang/srt/managers/scheduler_output_processor_mixin.py Outdated
@kpham-sgl (Collaborator) left a comment


@sundar24295s is this new feature intended to be cudagraph-able in the future?
If not, can you disable it in can_run?
If yes, I think you need to let capture_one_batch_size() capture with `return_pooled_hidden_states=True`?

Comment thread python/sglang/srt/entrypoints/openai/protocol.py Outdated
Comment thread python/sglang/srt/entrypoints/openai/protocol.py Outdated
Comment thread python/sglang/srt/managers/tokenizer_manager_score_mixin.py Outdated
Comment thread python/sglang/srt/managers/tokenizer_manager_score_mixin.py
Comment thread python/sglang/srt/managers/schedule_batch.py
@sundar24295s (Collaborator, Author) replied:

> @sundar24295s is this new feature intended to be cudagraph-able in the future? If not, can you disable it in can_run? If yes, I think you need to let capture_one_batch_size() capture with `return_pooled_hidden_states=True`?

  • For Single Item scoring, updated the piecewise_cuda_graph_runner.
  • For Multi-Item scoring, I have disabled piecewise CUDA graph in ServerArgs. This will need more work and will be done in a separate PR. (MIS was added first, followed by PCG; we never actually made the two work together.)

@kpham-sgl (Collaborator) left a comment


Thanks for addressing all the comments!

@sundar24295s sundar24295s enabled auto-merge (squash) April 14, 2026 03:41
@sundar24295s (Collaborator, Author):

/rerun-failed-ci

@Qiaolin-Yu Qiaolin-Yu disabled auto-merge April 15, 2026 21:58
@Qiaolin-Yu Qiaolin-Yu merged commit 4927975 into sgl-project:main Apr 15, 2026
789 of 1040 checks passed
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
Oasis-Git added a commit to Oasis-Git/sglang that referenced this pull request Apr 23, 2026
…lay_prepare

The shared replay_prepare bound from PiecewiseCudaGraphRunner reads
self.capture_return_pooled_hidden_states (added upstream by the Score
API PR sgl-project#22427). BCG's __init__ never set it, so CI merges of this PR
with main hit AttributeError on first replay.

Mirror PCG's initialization: not model_runner.is_generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
