Feat/support rerank #6058
Conversation
cool!!
LGTM. I've tested it locally.
Appreciate your great work. I am trying to serve https://huggingface.co/Qwen/Qwen3-Reranker-8B but got a response with status code 400.

sglang is run with the following docker compose file:

```yaml
services:
  sglang:
    image: lmsysorg/sglang:v0.4.7.post1-cu124
    container_name: sglang
    volumes:
      - /root/xxxx/Qwen3-Reranker-8B/:/model
    network_mode: host
    entrypoint: python3 -m sglang.launch_server
    command:
      --model-path /model
      --is-embedding
      --disable-radix-cache
      --chunked-prefill-size -1
      --attention-backend torch_native
      --trust-remote-code
      --host 0.0.0.0
      --port 30000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
```

My request:

```shell
curl --location 'http://127.0.0.1:30000/v1/rerank' \
  --header 'Content-Type: application/json' \
  --data '{
    "query": "what is panda?",
    "documents": ["hi"]
  }'
```

Is this a model-specific problem (i.e. is Qwen3-Reranker-8B not supported)?
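For reference, the same request can be built and sent from Python. This is a minimal sketch: the payload shape mirrors the curl call above, and the POST step is only shown in a comment since it requires a running server.

```python
import json

def build_rerank_payload(query: str, documents: list[str]) -> str:
    """Serialize the /v1/rerank request body used in the curl example."""
    return json.dumps({"query": query, "documents": documents})

body = build_rerank_payload("what is panda?", ["hi"])
# Send with e.g. the `requests` package:
#   requests.post("http://127.0.0.1:30000/v1/rerank",
#                 data=body,
#                 headers={"Content-Type": "application/json"})
print(body)
```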
I tried to run the bge-reranker-v2-m3 model, but an error occurred (sglang version: 0.4.7). Does this mean the feature is not available in the current version and will only land in the next one?
I think the PR has been merged, but since the docker image only goes up to 0.4.7, the feature is not included in that image.
This PR is included in git tag v0.4.7.post1, so the docker image lmsysorg/sglang:v0.4.7.post1-cu124 should already have this feature?
@Jimmy-L99 Maybe try using the lmsysorg/sglang:v0.4.7.post1-cu124 image and add an arg
Hi @lwabish, thanks a lot for your feedback. This rerank API is for encoder models only; the Qwen rerank model is a decoder model, so you should use the score API from #5973. cc @Jimmy-L99
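For context, decoder-style rerankers such as the Qwen3-Reranker family typically turn the model's judgment of a query/document pair into a score via the logits of a "yes" vs "no" answer token, which is why they go through a score API rather than the encoder rerank path. A rough sketch of that scoring step (not sglang's actual implementation; the function name is illustrative):

```python
import math

def yes_no_relevance(logit_yes: float, logit_no: float) -> float:
    """Softmax over the two candidate answer tokens; the probability
    assigned to "yes" serves as the relevance score."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)

print(yes_no_relevance(2.0, 0.0))
```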
@woodx9 Thanks for your answer. I then upgraded; logs are as below. When I use vLLM, the model only uses 1620 MB.
Thank you for your feedback, @Jimmy-L99. With the
Motivation
Support the v1_rerank endpoint and cross-encoder models such as cross-encoder/ms-marco-MiniLM-L6-v2 and BAAI/bge-reranker-v2-m3. This clarifies #5577, but the endpoint is changed slightly.
Co-authored-by: DavidBao03 davidbao0304@gmail.com
Co-authored-by: Tushar-ml
Modifications
PS: the doc will be in a separate PR.
Demo for this work:
```shell
# another option for attention backend: triton
python -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 --disable-radix-cache --host 0.0.0.0 --port 7879 --is-embedding --chunked-prefill-size -1 --attention-backend torch_native --mem-fraction-static 0.5 --dtype float32
```

Checklist
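Once the demo server is running, a client typically sorts the returned scores to recover a ranking over its documents. The helper below is a sketch of that client-side step; the exact response fields of the /v1/rerank endpoint may differ, so the field names here (`index`, `document`, `score`) are assumptions.

```python
def rank_documents(documents, scores):
    """Pair each document with its rerank score and sort by score in
    descending order, keeping the original index for reference."""
    pairs = sorted(
        ((score, idx, doc) for idx, (doc, score) in enumerate(zip(documents, scores))),
        reverse=True,
    )
    return [{"index": idx, "document": doc, "score": score} for score, idx, doc in pairs]

docs = ["pandas are bears", "hi", "the giant panda lives in China"]
scores = [0.91, 0.02, 0.87]  # hypothetical scores from the rerank endpoint
print(rank_documents(docs, scores))
```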