
[Score API] Add SequenceClassification Model support#22118

Merged
hnyls2002 merged 4 commits into sgl-project:main from sundar24295s:suramach/addseqcls
Apr 8, 2026

Conversation

@sundar24295s
Collaborator

Add Scoring API support for SequenceClassification models

Summary

  • Extends the /v1/score endpoint to support SequenceClassification models (e.g., Qwen3ForSequenceClassification, Qwen2ForSequenceClassification, LlamaForClassification) in addition to the existing CausalLM path. For classification models, the scoring API returns pooled class logits from the model's classification head — no label_token_ids required.
  • Adds multi-item scoring (MIS) support for classification models via --multi-item-scoring-delimiter. All items are packed into a single sequence separated by a delimiter token and scored in one forward pass, with score_and_pool() extracting per-item scores at delimiter positions.
  • Adds comprehensive unit and E2E tests covering both single-item and multi-item classification scoring paths.
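The packing scheme used by multi-item scoring can be sketched at the token level. This is a minimal illustration only: `pack_multi_item` and the sample token IDs are hypothetical, not the actual sglang helper.

```python
# Hypothetical helper illustrating the MIS packing described above:
# query<delim>item1<delim>item2<delim>...
def pack_multi_item(query_ids, item_ids_list, delimiter_id):
    seq = list(query_ids)
    for item_ids in item_ids_list:
        seq.append(delimiter_id)   # delimiter after the query / previous item
        seq.extend(item_ids)
    seq.append(delimiter_id)       # trailing delimiter after the last item
    return seq

packed = pack_multi_item([1, 2, 3], [[10, 11], [20]], delimiter_id=151643)
# packed == [1, 2, 3, 151643, 10, 11, 151643, 20, 151643]
```

Because all items share one sequence, the query prefix is encoded once and every item is scored in the same forward pass.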

Changes

| File | Description |
| --- | --- |
| `python/sglang/srt/layers/pooler.py` | New `score_and_pool()` function: dynamically finds delimiter positions in `input_ids` via `get_global_server_args()`, pools logits at those positions for MIS, or falls back to normal pool-then-score for single-item |
| `python/sglang/srt/managers/tokenizer_manager_score_mixin.py` | Dual model-type dispatch: creates `EmbeddingReqInput` for classification models (vs `GenerateReqInput` for CausalLM). Handles MIS for both model types. Text inputs are tokenized separately, then combined at the token level to avoid boundary artifacts |
| `python/sglang/srt/models/qwen3_classification.py` | Forward method uses `score_and_pool()` |
| `python/sglang/srt/models/qwen2_classification.py` | Forward method uses `score_and_pool()` |
| `python/sglang/srt/models/llama_classification.py` | Forward method uses `score_and_pool()` |
| `python/sglang/srt/entrypoints/engine_score_mixin.py` | Docstrings updated for both CausalLM and SequenceClassification |
| `python/sglang/srt/entrypoints/http_server.py` | `/v1/score` docstring updated |
| `.codespellrc` | Added "MIS" (Multi-Item Scoring) to the codespell ignore list |
| `test/registered/unit/test_pooler_score_and_pool.py` | 6 CPU unit tests for `score_and_pool` (single-item, MIS, fallback paths) |
| `test/registered/core/test_score_classification.py` | 14 E2E tests across 3 classes: single-item classification, MIS classification, and MIS with 12 labels |
| `test/srt/test_multi_item_scoring.py` | Additional MIS integration tests |

Design

  • Single-item scoring: Each query + item pair is sent as a separate EmbeddingReqInput. The model's pooler extracts the last-token hidden state, the classification head produces logits, and those logits are returned as scores.
  • Multi-item scoring: When --multi-item-scoring-delimiter <token_id> is configured, all items are packed into one sequence: query<delim>item1<delim>item2<delim>.... The score_and_pool() function applies the classification head to ALL hidden states, then extracts logits at positions just before each delimiter — the same pattern CausalLM uses in LogitsProcessor.
  • Delimiter positions are found dynamically via (input_ids == delimiter_token).nonzero() at forward time, avoiding the need to thread delimiter indices through the scheduling pipeline.
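The design above can be sketched in a few lines of PyTorch. This is a simplified illustration of the idea, not the actual `score_and_pool()` implementation: the real function also handles the prefill-only check and batch boundaries, and which rows map to which items follows the packing convention (one delimiter also trails the query).

```python
import torch

def score_and_pool_sketch(hidden_states, input_ids, cls_head, delimiter_id):
    # Apply the classification head to ALL hidden states...
    logits = cls_head(hidden_states)                         # [seq_len, num_labels]
    # ...then find delimiter positions dynamically at forward time.
    delim_pos = (input_ids == delimiter_id).nonzero(as_tuple=True)[0]
    if delim_pos.numel() == 0:
        return logits[-1]          # single-item fallback: last-token logits
    return logits[delim_pos - 1]   # one logits row just before each delimiter

hidden = torch.randn(9, 8)                                  # 9 tokens, hidden size 8
head = torch.nn.Linear(8, 3)                                # 3 labels
ids = torch.tensor([1, 2, 3, 99, 10, 11, 99, 20, 99])       # delimiter token = 99
scores = score_and_pool_sketch(hidden, ids, head, delimiter_id=99)
# scores has one row of class logits per delimiter: shape (3, 3)
```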

Test plan

Unit tests (CPU, no model needed)

source /workspace/venvs/sglang-repos/bin/activate
python test/registered/unit/test_pooler_score_and_pool.py -v

All 6 tests pass:

test_mis_extracts_positions_before_delimiter ... ok
test_mis_falls_back_when_no_delimiters_in_input ... ok
test_mis_falls_back_when_not_prefill_only ... ok
test_mis_returns_list_of_scores ... ok
test_single_item_returns_scores ... ok
test_single_item_scores_match_manual_computation ... ok
----------------------------------------------------------------------
Ran 6 tests in 0.007s
OK

E2E tests (GPU required)

Uses json_model_override_args to override Qwen/Qwen3-0.6B to Qwen3ForSequenceClassification with a random classification head.
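One plausible way to build such an override string (illustrative only; the exact fields the test passes may differ, though `architectures` and `num_labels` are standard Hugging Face config keys):

```python
import json

# Repurpose a small CausalLM checkpoint as a 12-label classification model
# by overriding its config at load time.
override = {
    "architectures": ["Qwen3ForSequenceClassification"],
    "num_labels": 12,
}
arg = json.dumps(override)
# pass the resulting JSON string via json_model_override_args
```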

python test/registered/core/test_score_classification.py -v

All 14 tests pass:

test_basic_single_item ... ok
test_deterministic ... ok
test_label_token_ids_ignored ... ok
test_raw_logits_without_softmax ... ok
test_single_item_edge_case ... ok
test_tokenized_inputs ... ok
test_deterministic (MIS) ... ok
test_items_produce_distinct_scores ... ok
test_mis_basic ... ok
test_mis_many_items ... ok
test_mis_single_item ... ok
test_softmax_valid ... ok
test_many_items_distinct (12 labels) ... ok
test_many_labels_shape (12 labels) ... ok
----------------------------------------------------------------------
Ran 14 tests in 45.538s
OK

Manual server test — single-item scoring

Launch server with a SequenceClassification model:

python -m sglang.launch_server \
  --model-path <path-to-seq-cls-model> \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph

Single-item curl:

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California?",
    "items": ["Sacramento"],
    "model": "default"
  }' | python3 -m json.tool

Returns class logits directly (12 labels for this model):

{
    "scores": [
        [-0.44, 0.27, -0.85, -0.41, 0.69, 0.32, 0.59, 0.38, 0.17, 0.70, -0.07, -0.07]
    ],
    "prompt_tokens": 14
}
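Since the API returns raw class logits here, picking the predicted label is left to the client. A hypothetical post-processing step on the response above:

```python
# Argmax over the returned class logits (client-side, not part of the API).
scores = [-0.44, 0.27, -0.85, -0.41, 0.69, 0.32, 0.59, 0.38, 0.17, 0.70, -0.07, -0.07]
predicted = max(range(len(scores)), key=lambda i: scores[i])
# predicted == 9: label index 9 has the highest logit (0.70)
```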

Manual server test — multi-item scoring

Launch server with MIS delimiter:

python -m sglang.launch_server \
  --model-path <path-to-seq-cls-model> \
  --port 30000 --host 0.0.0.0 \
  --chunked-prefill-size -1 \
  --dtype float16 \
  --mem-fraction-static 0.5 \
  --disable-radix-cache \
  --disable-cuda-graph \
  --multi-item-scoring-delimiter 151643

Multi-item curl:

curl -s -X POST "http://localhost:30000/v1/score" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is the capital of California?",
    "items": ["Sacramento", "Los Angeles", "San Francisco"],
    "model": "default",
    "apply_softmax": true
  }' | python3 -m json.tool

Returns one probability distribution per item (all items scored in a single forward pass):

{
    "scores": [
        [0.05, 0.10, 0.03, 0.05, 0.15, 0.11, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
        [0.05, 0.10, 0.03, 0.05, 0.16, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
        [0.05, 0.10, 0.03, 0.05, 0.15, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07]
    ],
    "prompt_tokens": 26
}
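A quick client-side sanity check on the shape of this response (illustrative; the values are the 2-decimal rounded scores shown above, so they are checked for shape and non-negativity rather than exact sums):

```python
# apply_softmax=true should yield one row per item, each with num_labels entries.
rows = [
    [0.05, 0.10, 0.03, 0.05, 0.15, 0.11, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
    [0.05, 0.10, 0.03, 0.05, 0.16, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
    [0.05, 0.10, 0.03, 0.05, 0.15, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
]
ok = len(rows) == 3 and all(len(r) == 12 and min(r) >= 0 for r in rows)
```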

Pre-commit

SKIP=clang-format pre-commit run  # all hooks pass

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces support for SequenceClassification models within the scoring API, enabling both single-item and multi-item scoring (MIS) for these architectures. Key implementation details include a new `score_and_pool` utility in the pooler layer to extract hidden states at delimiter positions, and updates to the tokenizer manager to process pooled classification logits. The changes also integrate this logic into the Llama and Qwen classification models and add comprehensive test suites. Review feedback identifies a performance bottleneck in the pooler layer, caused by GPU-CPU synchronizations when iterating over sequence lengths, and suggests using `dim=-1` for softmax operations in the tokenizer manager to ensure robustness against varying tensor shapes.

Comment thread python/sglang/srt/layers/pooler.py Outdated
Comment thread python/sglang/srt/managers/tokenizer_manager_score_mixin.py Outdated
@hnyls2002 hnyls2002 merged commit 712c8c5 into sgl-project:main Apr 8, 2026
284 of 316 checks passed
@hnyls2002
Collaborator

Three test placement issues to fix before merge:

  1. Create test/registered/score/ and move score tests there. test_score_classification.py should not be in core/. Also move the existing test_score_api.py from core/ into score/ so all scoring tests live together.

  2. Move test/registered/unit/test_pooler_score_and_pool.py to test/registered/unit/layers/. Unit tests follow the source module hierarchy — score_and_pool is in sglang.srt.layers.pooler.

  3. Delete test/srt/test_multi_item_scoring.py. It has no CI registration, uses hand-rolled server lifecycle instead of sglang test fixtures, and its coverage overlaps with test_score_api.py and test_score_classification.py.

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 8, 2026
JustinTong0323 added a commit to JustinTong0323/sglang that referenced this pull request Apr 8, 2026
The file was added via merge from main (sgl-project#22118) but not registered
in the old test/srt/run_suite.py suite, causing a sanity check failure.
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026