Feat/support rerank#6058

Merged
zhyncs merged 23 commits into sgl-project:main from woodx9:feat/support_rerank on Jun 16, 2025

Conversation

@woodx9
Contributor

@woodx9 woodx9 commented May 6, 2025

Motivation

Support the v1_rerank endpoint and cross-encoder models such as cross-encoder/ms-marco-MiniLM-L6-v2 and BAAI/bge-reranker-v2-m3. This follows up on #5577, though the endpoint has changed slightly.

Co-authored-by: DavidBao03 davidbao0304@gmail.com
Co-authored-by: Tushar-ml

Modifications

  1. Add the v1_rerank endpoint to the OpenAI-compatible server.
  2. Add a rerank endpoint to the engine.
  3. Support the BertForSequenceClassification and XLMRobertaForSequenceClassification architectures.
  4. Add unit tests for the v1_rerank endpoint and the cross-encoder models.

P.S. Documentation will come in a separate PR.

Demo of this work:

# another option for attention backend: triton
python -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 --disable-radix-cache --host 0.0.0.0 --port 7879 --is-embedding --chunked-prefill-size -1 --attention-backend torch_native --mem-fraction-static 0.5 --dtype float32
curl --location 'http://127.0.0.1:7879/v1/rerank' \
--header 'Content-Type: application/json' \
--data '
{
        "query": "what is panda?",
        "documents": ["hi"]
}'
[
  {
    "score": -8.179647445678711,
    "document": "hi",
    "index": 0,
    "meta_info": {
      "id": "4278d85101184013b22863d26c4ff5d0",
      "finish_reason": {"type": "length", "length": 0},
      "prompt_tokens": 10,
      "e2e_latency": 0.013120651245117188
    }
  }
]
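The same call from Python, as a minimal client sketch (assumes the server launched above on port 7879 and the requests package; field names are taken from the response shown here):

# Query /v1/rerank and print one line per ranked document.
import requests

resp = requests.post(
    "http://127.0.0.1:7879/v1/rerank",
    json={"query": "what is panda?", "documents": ["hi"]},
)
resp.raise_for_status()
for item in resp.json():  # the response body is a list, one entry per document
    print(item["index"], item["score"], item["document"])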

@DavidBao03
Contributor

cool!!

@DavidBao03
Contributor

@woodx9 BTW, have you ever tried using this to declare the co-authors instead of declaring them in the PR description?

@woodx9
Contributor Author

woodx9 commented May 7, 2025

> @woodx9 BTW, have you ever tried using this to declare the co-authors instead of declaring them in the PR description?

The committer will do this when squashing this PR's commits!

@woodx9 woodx9 mentioned this pull request May 7, 2025
@woodx9 woodx9 force-pushed the feat/support_rerank branch 2 times, most recently from 826289f to c1aed6a on May 11, 2025 06:26
@woodx9 woodx9 mentioned this pull request May 11, 2025
@Titan-p
Contributor

Titan-p commented May 20, 2025

LGTM. I've tested it locally.

@woodx9 woodx9 force-pushed the feat/support_rerank branch from 951f9ec to 3f62484 on June 8, 2025 17:33
@zhyncs zhyncs merged commit e30ef36 into sgl-project:main Jun 16, 2025
@lwabish

lwabish commented Jun 23, 2025

Appreciate your great work. I am trying to serve https://huggingface.co/Qwen/Qwen3-Reranker-8B but got a 400 response saying:

{
  "error": {
    "message": "1 validation error for RerankResponse\nscore\n  Input should be a valid number [type=float_type, input_value=[0.01141357421875, 0.0029...0625, 0.004486083984375], input_type=list]\n    For further information visit https://errors.pydantic.dev/2.11/v/float_type"
  }
}

sglang is run with the following docker compose file:

services:
  sglang:
    image: lmsysorg/sglang:v0.4.7.post1-cu124
    container_name: sglang
    volumes:
      - /root/xxxx/Qwen3-Reranker-8B/:/model
    network_mode: host
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /model
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native
      --trust-remote-code
      --host 0.0.0.0
      --port 30000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

My request:

curl --location 'http://127.0.0.1:30000/v1/rerank' \
--header 'Content-Type: application/json' \
--data '{
        "query": "what is panda?",
        "documents": ["hi"]
}'

Is this a model-specific problem (i.e. is Qwen3-Reranker-8B not supported)?

@Jimmy-L99
Contributor

Jimmy-L99 commented Jun 24, 2025

  sglang-rerank:
    image: lmsysorg/sglang:v0.4.7-cu124
    container_name: sglang-rerank
    volumes:
      - /root/xxx/model/Rerank:/models
      - /etc/localtime:/etc/localtime:ro
      - /usr/share/zoneinfo/Asia/Shanghai:/usr/share/zoneinfo/Asia/Shanghai:ro
    restart: always
    network_mode: host
    privileged: true
    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --disable-radix-cache
      --chunked-prefill-size -1
      --port xxx
      --is-embedding
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://xxx:xxx/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

I tried to bring up the bge-reranker-v2-m3 model, but got this error:
ValueError: XLMRobertaForSequenceClassification has no SGlang implementation and the Transformers implementation is not compatible with SGLang.

sglang version: 0.4.7.

Does this mean that this feature is not available in the current version and will be available in the next version?

@FeliceSchena

I think the PR has been merged, but since the docker image only goes up to 0.4.7, the feature is not available in that image.

@lwabish

lwabish commented Jun 24, 2025

This PR is included in git tag v0.4.7.post1, so the docker image lmsysorg/sglang:v0.4.7.post1-cu124 should have introduced this feature?

@lwabish

lwabish commented Jun 24, 2025

> I tried to bring up the bge-reranker-v2-m3 model, but got this error: ValueError: XLMRobertaForSequenceClassification has no SGlang implementation and the Transformers implementation is not compatible with SGLang. [...] Does this mean that this feature is not available in the current version and will be available in the next version?

@Jimmy-L99 Maybe try the lmsysorg/sglang:v0.4.7.post1-cu124 image and add the --attention-backend torch_native arg?
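Something like this, adapting your compose file above (untested sketch; keep your own paths, host, and port):

    image: lmsysorg/sglang:v0.4.7.post1-cu124
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --port xxx
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native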

@woodx9
Contributor Author

woodx9 commented Jun 24, 2025

Hi @lwabish, thanks a lot for your feedback. This rerank API is for encoder models only; the Qwen reranker is a decoder model, so you should use the score API from #5973. cc @Jimmy-L99

@Jimmy-L99
Contributor

@woodx9 Thanks for your answer. I then upgraded to sglang:0.4.8, with the command:

    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --port xxx
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native

Logs below:

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.02it/s]
[2025-06-25 15:10:37] Load weight end. type=XLMRobertaForSequenceClassification, dtype=torch.float16, avail mem=13.18 GB, mem usage=1.10 GB.
[2025-06-25 15:10:37] KV Cache is allocated. #tokens: 124370, K size: 5.69 GB, V size: 5.69 GB
[2025-06-25 15:10:37] Memory pool end. avail mem=1.60 GB
[2025-06-25 15:10:38] max_total_num_tokens=124370, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=8194, available_gpu_mem=1.58 GB
[2025-06-25 15:10:39] INFO:     Started server process [1]
[2025-06-25 15:10:39] INFO:     Waiting for application startup.
[2025-06-25 15:10:39] INFO:     Application startup complete.
[2025-06-25 15:10:39] INFO:     Uvicorn running on http://xxx:xxx (Press CTRL+C to quit)
[2025-06-25 15:10:40] INFO:     xxx:xxx - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-25 15:10:40] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-25 15:10:40] INFO:     xxx:xxx - "POST /encode HTTP/1.1" 200 OK
[2025-06-25 15:10:40] The server is fired up and ready to roll!
[2025-06-25 15:10:51] INFO:    xxx:xxx - "GET /health HTTP/1.1" 200 OK

It seems the KV cache uses the vast majority of the GPU memory; nvtop shows about 13,500 MB used in total.

When I use vllm:

    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: vllm serve /embedding_model/bge-m3
    command:
      --host xxx
      --port xxx
      --task embed
      --disable-log-requests

the model only uses 1620 MB.
The same situation also occurs with the embedding model (bge-m3). Given the parameter counts of these two models, they shouldn't occupy such a large amount of GPU memory.

@woodx9
Contributor Author

woodx9 commented Jun 25, 2025

Thank you for your feedback, @Jimmy-L99. With the --disable-radix-cache parameter we should not store the KV cache, and encoder models like the BGE reranker cannot reuse it anyway. I will look into why this is happening.
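In the meantime, lowering --mem-fraction-static (as in the demo command in the PR description) should cap how much memory the pool is allowed to take; an untested sketch, tune the fraction for your GPU:

# Cap the static memory pool so the KV cache doesn't take the whole card.
python3 -m sglang.launch_server --model-path /models/bge-reranker-v2-m3 --is-embedding \
  --disable-radix-cache --chunked-prefill-size -1 --attention-backend torch_native \
  --mem-fraction-static 0.3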
