[Score API] Add SequenceClassification Model support#22118
hnyls2002 merged 4 commits into sgl-project:main
Conversation
Code Review
This pull request introduces support for SequenceClassification models within the scoring API, enabling both single-item and multi-item scoring (MIS) for these architectures. Key implementation details include a new score_and_pool utility in the pooler layer to extract hidden states at delimiter positions and updates to the tokenizer manager to process pooled classification logits. The changes also integrate this logic into Llama and Qwen classification models and add comprehensive test suites. Review feedback identifies a performance bottleneck in the pooler layer caused by GPU-CPU synchronizations when iterating over sequence lengths and suggests using dim=-1 for softmax operations in the tokenizer manager to ensure robustness against varying tensor shapes.
Three test placement issues to fix before merge:
The file was added via merge from main (sgl-project#22118) but not registered in the old test/srt/run_suite.py suite, causing a sanity check failure.
Add Scoring API support for SequenceClassification models
Summary
- Extends the `/v1/score` endpoint to support `SequenceClassification` models (e.g., `Qwen3ForSequenceClassification`, `Qwen2ForSequenceClassification`, `LlamaForClassification`) in addition to the existing CausalLM path. For classification models, the scoring API returns pooled class logits from the model's classification head; no `label_token_ids` are required.
- Adds multi-item scoring (MIS) via `--multi-item-scoring-delimiter`: all items are packed into a single sequence separated by a delimiter token and scored in one forward pass, with `score_and_pool()` extracting per-item scores at delimiter positions.

Changes
- `python/sglang/srt/layers/pooler.py`: new `score_and_pool()` function. Dynamically finds delimiter positions in `input_ids` via `get_global_server_args()`, pools logits at those positions for MIS, and falls back to normal pool-then-score for single-item requests.
- `python/sglang/srt/managers/tokenizer_manager_score_mixin.py`: routes classification models to `EmbeddingReqInput` (vs `GenerateReqInput` for CausalLM) and handles MIS for both model types. Text inputs are tokenized separately, then combined at the token level to avoid boundary artifacts.
- `python/sglang/srt/models/qwen3_classification.py`: uses `score_and_pool()`.
- `python/sglang/srt/models/qwen2_classification.py`: uses `score_and_pool()`.
- `python/sglang/srt/models/llama_classification.py`: uses `score_and_pool()`.
- `python/sglang/srt/entrypoints/engine_score_mixin.py`
- `python/sglang/srt/entrypoints/http_server.py`: `/v1/score` docstring updated.
- `.codespellrc`
- `test/registered/unit/test_pooler_score_and_pool.py`: unit tests for `score_and_pool` (single-item, MIS, and fallback paths).
- `test/registered/core/test_score_classification.py`
- `test/srt/test_multi_item_scoring.py`

Design
- Single-item scoring: each `query + item` pair is sent as a separate `EmbeddingReqInput`. The model's pooler extracts the last-token hidden state, the classification head produces logits, and those logits are returned as scores.
- Multi-item scoring: when `--multi-item-scoring-delimiter <token_id>` is configured, all items are packed into one sequence: `query<delim>item1<delim>item2<delim>...`. The `score_and_pool()` function applies the classification head to ALL hidden states, then extracts logits at the positions just before each delimiter, the same pattern CausalLM uses in `LogitsProcessor`.
- Delimiter positions are found with `(input_ids == delimiter_token).nonzero()` at forward time, avoiding the need to thread delimiter indices through the scheduling pipeline.

Test plan
Unit tests (CPU, no model needed)
```shell
source /workspace/venvs/sglang-repos/bin/activate
python test/registered/unit/test_pooler_score_and_pool.py -v
```

All 6 tests pass.
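The single-item and MIS extraction paths those tests exercise can be sketched in plain Python. This is only an illustration of the pattern described under Design: the real `score_and_pool` operates on torch tensors, and the function name, signature, and position-to-item mapping below are assumptions for the sketch.

```python
def extract_item_scores(input_ids, logits, delimiter_id=None):
    """Pick per-item class logits out of a packed sequence.

    input_ids:    token ids (query<delim>item1<delim>item2<delim>...)
    logits:       one row of class logits per token position
    delimiter_id: the --multi-item-scoring-delimiter token id, or None
    """
    if delimiter_id is None:
        # Single-item fallback: last-token pooling.
        return [logits[-1]]
    # Mirrors (input_ids == delimiter_token).nonzero() at forward time.
    delim_positions = [i for i, t in enumerate(input_ids) if t == delimiter_id]
    rows = [logits[p - 1] for p in delim_positions]
    # Assume the first delimiter terminates the query, so its row is not an
    # item score; how the real code maps positions to items may differ.
    return rows[1:]


# Packed sequence: query [7, 8], delimiter 0, items [5] and [6].
ids = [7, 8, 0, 5, 0, 6, 0]
rows = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9], [1.0, 1.1], [1.2, 1.3]]
print(extract_item_scores(ids, rows, delimiter_id=0))  # [[0.6, 0.7], [1.0, 1.1]]
```

Locating delimiters at forward time keeps the scheduling pipeline unchanged, at the cost of the GPU-CPU synchronizations the review comment flags.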
E2E tests (GPU required)
Uses `json_model_override_args` to override `Qwen/Qwen3-0.6B` to `Qwen3ForSequenceClassification` with a random classification head. All 14 tests pass.
Manual server test — single-item scoring
Launch server with a SequenceClassification model:
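The original command did not survive extraction; a hypothetical launch of this shape, assuming sglang's standard launcher (the model path and port are placeholders):

```shell
# Placeholder model path and port; substitute a real SequenceClassification checkpoint.
python -m sglang.launch_server \
  --model-path /path/to/sequence-classification-model \
  --port 30000
```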
Single-item curl:
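The curl command itself was lost in extraction; a hypothetical request of this shape, assuming the `query`/`items` fields of the score API (the query and item text are illustrative):

```shell
# Hypothetical request; field values are illustrative.
curl -s http://localhost:30000/v1/score \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?", "items": ["Paris is the capital of France."]}'
```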
Returns class logits directly (12 labels for this model):
```json
{
  "scores": [
    [-0.44, 0.27, -0.85, -0.41, 0.69, 0.32, 0.59, 0.38, 0.17, 0.70, -0.07, -0.07]
  ],
  "prompt_tokens": 14
}
```

Manual server test — multi-item scoring
Launch server with MIS delimiter:
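Again hypothetical: the `--multi-item-scoring-delimiter` flag comes from this PR, while the model path, port, and delimiter token id are placeholders.

```shell
# Placeholder values; <delimiter_token_id> must be a token id valid for the model.
python -m sglang.launch_server \
  --model-path /path/to/sequence-classification-model \
  --multi-item-scoring-delimiter <delimiter_token_id> \
  --port 30000
```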
Multi-item curl:
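A hypothetical multi-item request with three items, matching the three score rows in the response below; `apply_softmax` is an assumption based on the probability-distribution output:

```shell
# Hypothetical request; field values are illustrative.
curl -s http://localhost:30000/v1/score \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the capital of France?", "items": ["Paris.", "London.", "Berlin."], "apply_softmax": true}'
```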
Returns one probability distribution per item (all items scored in a single forward pass):
```json
{
  "scores": [
    [0.05, 0.10, 0.03, 0.05, 0.15, 0.11, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
    [0.05, 0.10, 0.03, 0.05, 0.16, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07],
    [0.05, 0.10, 0.03, 0.05, 0.15, 0.10, 0.14, 0.11, 0.09, 0.15, 0.07, 0.07]
  ],
  "prompt_tokens": 26
}
```

Pre-commit
```shell
SKIP=clang-format pre-commit run  # all hooks pass
```

Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`