Feat/support rerank#6058

Merged
zhyncs merged 23 commits into sgl-project:main from woodx9:feat/support_rerank on Jun 16, 2025

Conversation

@woodx9
Contributor

@woodx9 woodx9 commented May 6, 2025

Motivation

Support the v1_rerank endpoint and cross-encoder models such as cross-encoder/ms-marco-MiniLM-L6-v2 and BAAI/bge-reranker-v2-m3. This follows up on #5577, though the endpoint has changed slightly.

Co-authored-by: DavidBao03 davidbao0304@gmail.com
Co-authored-by: Tushar-ml

Modifications

  1. Add the v1_rerank endpoint to the OpenAI-compatible server.
  2. Add a rerank endpoint to the engine.
  3. Support the BertForSequenceClassification and XLMRobertaForSequenceClassification architectures.
  4. Add unit tests for the v1_rerank endpoint and the cross-encoder models.

P.S. Documentation will come in a separate PR.

Demo of this work:

# another option for attention backend: triton
python -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 --disable-radix-cache --host 0.0.0.0 --port 7879 --is-embedding --chunked-prefill-size -1 --attention-backend torch_native --mem-fraction-static 0.5 --dtype float32
curl --location 'http://127.0.0.1:7879/v1/rerank' \
--header 'Content-Type: application/json' \
--data '
{
        "query": "what is panda?",
        "documents": ["hi"]
}'
[
  {
    "score": -8.179647445678711,
    "document": "hi",
    "index": 0,
    "meta_info": {
      "id": "4278d85101184013b22863d26c4ff5d0",
      "finish_reason": {"type": "length", "length": 0},
      "prompt_tokens": 10,
      "e2e_latency": 0.013120651245117188
    }
  }
]
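The same call from Python, as a minimal client sketch (assumes the server launched above on port 7879 and the requests package; field names are taken from the response shown here):

# Query /v1/rerank and print one line per ranked document.
import requests

resp = requests.post(
    "http://127.0.0.1:7879/v1/rerank",
    json={"query": "what is panda?", "documents": ["hi"]},
)
resp.raise_for_status()
for item in resp.json():  # the response body is a list, one entry per document
    print(item["index"], item["score"], item["document"])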

@DavidBao03
Contributor

cool!!

@DavidBao03
Contributor

@woodx9 BTW, have you ever tried using this to declare the co-authors instead of declaring them in the PR description?

@woodx9
Contributor Author

woodx9 commented May 7, 2025

> @woodx9 BTW, have you ever tried using this to declare the co-authors instead of declaring them in the PR description?

The committer will do this when squashing this PR's commits!

@woodx9 woodx9 mentioned this pull request May 7, 2025
@woodx9 woodx9 force-pushed the feat/support_rerank branch 2 times, most recently from 826289f to c1aed6a on May 11, 2025 06:26
@woodx9 woodx9 mentioned this pull request May 11, 2025
@Titan-p
Contributor

Titan-p commented May 20, 2025

LGTM. I've tested it locally.

@woodx9 woodx9 force-pushed the feat/support_rerank branch from 951f9ec to 3f62484 on June 8, 2025 17:33
@zhyncs zhyncs merged commit e30ef36 into sgl-project:main Jun 16, 2025
@lwabish

lwabish commented Jun 23, 2025

Appreciate your great work. I am trying to serve https://huggingface.co/Qwen/Qwen3-Reranker-8B but got a 400 response saying:

{
  "error": {
    "message": "1 validation error for RerankResponse\nscore\n  Input should be a valid number [type=float_type, input_value=[0.01141357421875, 0.0029...0625, 0.004486083984375], input_type=list]\n    For further information visit https://errors.pydantic.dev/2.11/v/float_type"
  }
}

sglang is run with the following docker compose file:

services:
  sglang:
    image: lmsysorg/sglang:v0.4.7.post1-cu124
    container_name: sglang
    volumes:
      - /root/xxxx/Qwen3-Reranker-8B/:/model
    network_mode: host
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /model
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native
      --trust-remote-code
      --host 0.0.0.0
      --port 30000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

My request:

curl --location 'http://127.0.0.1:30000/v1/rerank' \
--header 'Content-Type: application/json' \
--data '{
        "query": "what is panda?",
        "documents": ["hi"]
}'

Is this a model-specific problem (i.e. is Qwen3-Reranker-8B not supported)?

@Jimmy-L99
Contributor

Jimmy-L99 commented Jun 24, 2025

  sglang-rerank:
    image: lmsysorg/sglang:v0.4.7-cu124
    container_name: sglang-rerank
    volumes:
      - /root/xxx/model/Rerank:/models
      - /etc/localtime:/etc/localtime:ro
      - /usr/share/zoneinfo/Asia/Shanghai:/usr/share/zoneinfo/Asia/Shanghai:ro
    restart: always
    network_mode: host
    privileged: true
    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --disable-radix-cache
      --chunked-prefill-size -1
      --port xxx
      --is-embedding
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://xxx:xxx/health || exit 1"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

I tried to bring up the bge-reranker-v2-m3 model, but got this error:
ValueError: XLMRobertaForSequenceClassification has no SGlang implementation and the Transformers implementation is not compatible with SGLang.

sglang version: 0.4.7.

Does this mean that this feature is not available in the current version and will be available in the next version?

@FeliceSchena

I think the PR has been merged, but since the docker image only goes up to 0.4.7, the feature is not available in that image.

@lwabish

lwabish commented Jun 24, 2025

This PR is included in git tag v0.4.7.post1, so the docker image lmsysorg/sglang:v0.4.7.post1-cu124 should have introduced this feature?

@lwabish

lwabish commented Jun 24, 2025

> I tried to bring up the bge-reranker-v2-m3 model, but got this error: ValueError: XLMRobertaForSequenceClassification has no SGlang implementation and the Transformers implementation is not compatible with SGLang. [...] Does this mean that this feature is not available in the current version and will be available in the next version?

@Jimmy-L99 Maybe try the lmsysorg/sglang:v0.4.7.post1-cu124 image and add the --attention-backend torch_native arg?
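Something like this, adapting your compose file above (untested sketch; keep your own paths, host, and port):

    image: lmsysorg/sglang:v0.4.7.post1-cu124
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --port xxx
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native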

@woodx9
Contributor Author

woodx9 commented Jun 24, 2025

Hi @lwabish, thanks a lot for your feedback. This rerank API is for encoder models only; the Qwen reranker is a decoder model, so you should use the score API from #5973. cc @Jimmy-L99

@Jimmy-L99
Contributor

@woodx9 Thanks for your answer. I then upgraded to sglang:0.4.8, with the command:

    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: python3 -m sglang.launch_server
    command: |
      --model-path /models/bge-reranker-v2-m3
      --host xxx
      --port xxx
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native

Logs below:

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.02it/s]
[2025-06-25 15:10:37] Load weight end. type=XLMRobertaForSequenceClassification, dtype=torch.float16, avail mem=13.18 GB, mem usage=1.10 GB.
[2025-06-25 15:10:37] KV Cache is allocated. #tokens: 124370, K size: 5.69 GB, V size: 5.69 GB
[2025-06-25 15:10:37] Memory pool end. avail mem=1.60 GB
[2025-06-25 15:10:38] max_total_num_tokens=124370, chunked_prefill_size=-1, max_prefill_tokens=16384, max_running_requests=4096, context_len=8194, available_gpu_mem=1.58 GB
[2025-06-25 15:10:39] INFO:     Started server process [1]
[2025-06-25 15:10:39] INFO:     Waiting for application startup.
[2025-06-25 15:10:39] INFO:     Application startup complete.
[2025-06-25 15:10:39] INFO:     Uvicorn running on http://xxx:xxx (Press CTRL+C to quit)
[2025-06-25 15:10:40] INFO:     xxx:xxx - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-25 15:10:40] Prefill batch. #new-seq: 1, #new-token: 8, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-06-25 15:10:40] INFO:     xxx:xxx - "POST /encode HTTP/1.1" 200 OK
[2025-06-25 15:10:40] The server is fired up and ready to roll!
[2025-06-25 15:10:51] INFO:    xxx:xxx - "GET /health HTTP/1.1" 200 OK

It seems the KV cache uses the vast majority of the GPU memory; nvtop shows about 13,500 MB used in total.

When I use vllm:

    environment:
      - CUDA_VISIBLE_DEVICES=1
    entrypoint: vllm serve /embedding_model/bge-m3
    command:
      --host xxx
      --port xxx
      --task embed
      --disable-log-requests

the model only uses 1620 MB.
The same situation also occurs with the embedding model (bge-m3). Given the parameter counts of these two models, they shouldn't occupy such a large amount of GPU memory.

@woodx9
Contributor Author

woodx9 commented Jun 25, 2025

Thank you for your feedback, @Jimmy-L99. With the --disable-radix-cache parameter we should not store the KV cache, and encoder models like the BGE reranker cannot reuse it anyway. I will look into why this is happening.
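In the meantime, lowering --mem-fraction-static (as in the demo command in the PR description) should cap how much memory the pool is allowed to take; an untested sketch, tune the fraction for your GPU:

# Cap the static memory pool so the KV cache doesn't take the whole card.
python3 -m sglang.launch_server --model-path /models/bge-reranker-v2-m3 --is-embedding \
  --disable-radix-cache --chunked-prefill-size -1 --attention-backend torch_native \
  --mem-fraction-static 0.3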
