Add GLM-4v Multimodal Model support for SGLang#1641
sixsixcoder wants to merge 10 commits into sgl-project:main
Conversation
Wow, that's cool. Thank you and Zhipu AI for your contribution!
Thanks for the contribution.
When I execute this test file, an error occurs. Do you have any solution?
Can you share your command and more of the traceback? I can run it successfully on an H100.
It may be a problem with model registration, which leads to infinite recursion and then an error once GPU memory is exceeded. Where should I modify the model registration? Is my usage correct?
Your usage seems good.
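For context on the registration question: SGLang discovers model implementations by importing the modules under sglang/srt/models and reading the EntryClass symbol each file exports. A minimal sketch of that convention, where the class body is a placeholder rather than the real ChatGLM code:

```python
# Minimal sketch of SGLang's model-registration convention: the model loader
# scans sglang/srt/models and picks up the `EntryClass` exported by each file.
# The class body here is a placeholder, not the real ChatGLM implementation.
from torch import nn


class ChatGLMForCausalLM(nn.Module):
    def __init__(self, config=None):
        super().__init__()
        # Real code would build the transformer layers from `config` here.
        self.dummy = nn.Identity()

    def forward(self, x):
        return self.dummy(x)


# Without this line the loader cannot find the model, and a wrong value here
# can surface as confusing load-time errors.
EntryClass = ChatGLMForCausalLM
```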
The previous problem has been solved, but when I execute test_vision_openai_server.py, an error occurs.
It seems the model did not see the image and started to hallucinate. Did you pass the images in correctly?
Where does sglang receive and process multimodal input?
You can look at llava as an example (python/sglang/srt/models/llava.py, lines 156 to 167 at commit d19cc0b) and run https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py to understand the code path. You can also see some related PRs: #1551, #1546.
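At a high level, the pattern in those llava.py lines is: placeholder image tokens are expanded in the prompt, the vision tower encodes the image, and the resulting features are written into the input embeddings at the placeholder positions. A minimal sketch of that idea, with illustrative names and shapes rather than SGLang's exact API:

```python
import torch


def merge_image_features(
    input_ids: torch.Tensor,      # (seq_len,) token ids with placeholders at image positions
    inputs_embeds: torch.Tensor,  # (seq_len, hidden) embeddings from the text embedding layer
    image_features: torch.Tensor, # (num_image_tokens, hidden) vision encoder + projector output
    image_token_id: int,
) -> torch.Tensor:
    """Overwrite embeddings at image-placeholder positions with vision features."""
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    assert positions.numel() == image_features.shape[0], \
        "number of placeholders must match number of image tokens"
    inputs_embeds[positions] = image_features.to(inputs_embeds.dtype)
    return inputs_embeds
```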
What is a minimal example of running a multimodal model: receiving a prompt and an image, then performing inference?
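For reference, one minimal end-to-end path is to launch the server (for example with python -m sglang.launch_server --model-path <checkpoint> --port 30000) and then send an OpenAI-style chat request containing both text and an image. A sketch, where the port, model name, and image URL are assumptions:

```python
# Minimal client-side inference against a locally launched SGLang server.
# Assumes the server is already running on port 30000 with a multimodal
# checkpoint; the image URL below is a placeholder.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```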
Hi @sixsixcoder, the Qwen2-VL code has already been merged into the main branch. Its Triton kernels, implemented by @ispobock, can be reused in GLM-4v and are more efficient than the torch implementation. You may consider switching to them in this PR. Thanks!
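A hedged sketch of what that swap could look like; the module path and kernel name below are placeholders standing in for whichever Triton-backed implementation Qwen2-VL uses, not confirmed SGLang symbols:

```python
# Hypothetical wiring: prefer a shared Triton-backed vision attention if it is
# importable, otherwise fall back to a plain torch implementation. The import
# path and function name are placeholders, not real SGLang symbols.
import torch

try:
    from sglang.srt.layers.vision_attention_triton import vision_attention_triton  # placeholder
    _HAS_TRITON = True
except ImportError:
    _HAS_TRITON = False


def vision_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    if _HAS_TRITON:
        return vision_attention_triton(q, k, v)
    # Torch fallback: standard scaled dot-product attention.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```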
@sixsixcoder please rebase and add the test for GLM-4v. Thanks! |
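A rough sketch of what a GLM-4v entry in test/srt/test_vision_openai_server.py could look like; the helper names follow sglang's test utilities as commonly used, but the exact signatures and the checkpoint name should be treated as assumptions:

```python
import unittest

import openai

# These helpers exist in sglang's test utilities, though the exact signatures
# used here are assumptions.
from sglang.test.test_utils import DEFAULT_URL_FOR_TEST, popen_launch_server


class TestGLM4vVisionServer(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.base_url = DEFAULT_URL_FOR_TEST
        # Checkpoint name is an assumption.
        cls.process = popen_launch_server("THUDM/glm-4v-9b", cls.base_url, timeout=300)

    @classmethod
    def tearDownClass(cls):
        cls.process.terminate()

    def test_single_image(self):
        client = openai.Client(base_url=f"{self.base_url}/v1", api_key="EMPTY")
        response = client.chat.completions.create(
            model="default",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
                ],
            }],
            max_tokens=32,
        )
        self.assertTrue(len(response.choices[0].message.content) > 0)
```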
Motivation
Add GLM-4v support to SGLang. GLM-4v is a widely used multimodal model developed by THUDM, and we hope to adapt it to the excellent, fast serving framework SGLang.
Modifications
- Adapted the chatglm.py file from vLLM.
- Added the glm4 vision encoder in python/sglang/srt/models/glm4_vision_encoder.py (a rough sketch follows below).
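A condensed sketch of the shape of such a vision encoder: a Conv2d patch embedding, a transformer stack, and a projection into the language model's hidden size. Hyperparameters and layer details are illustrative, not the PR's exact code:

```python
import torch
from torch import nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each patch."""

    def __init__(self, in_channels: int = 3, hidden_size: int = 1792, patch_size: int = 14):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, hidden_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, num_patches, hidden)
        return self.proj(images).flatten(2).transpose(1, 2)


class Glm4VisionEncoderSketch(nn.Module):
    """Illustrative stand-in for glm4_vision_encoder.py, not the PR's code."""

    def __init__(self, hidden_size: int = 1792, llm_hidden_size: int = 4096,
                 num_layers: int = 2, num_heads: int = 16):
        super().__init__()
        self.patch_embed = PatchEmbedding(hidden_size=hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        # The real encoder uses a GLU-style merger; a Linear keeps the sketch short.
        self.projector = nn.Linear(hidden_size, llm_hidden_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.projector(self.blocks(self.patch_embed(images)))
```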