Update GemLite to support vLLM V1 #2199
there are some errors:

@jerryzh168 yes sorry I missed that one, should be fixed now.
jerryzh168 left a comment
Looks good. Could you test in vLLM as well to get a sense of the speedup? You can test an 8B model and compare against the baseline with benchmark_latency, I think, like this: https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#benchmark_latency

I think we need to re-export the models, I changed the The thing is, I am working on a different branch right now which would also change
* update to forward_functional()
* add 8-bit symmetric case
* ruff
* fix test
Updated the GemLite changes to make them compatible with vLLM V1.
I also corrected the unpacking, which should use the output feature size, and added symmetric A16W8 support, since the arguments already allowed it but it was not implemented.
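
For readers unfamiliar with the term, here is a minimal pure-Python sketch of what a symmetric A16W8 scheme generally means: weights are stored as 8-bit integers with one scale per output feature and no zero-point, while activations stay in 16-bit floating point. This is not the code from this PR or GemLite's implementation; the function names `quantize_symmetric_int8` and `linear_a16w8`, the per-row scaling, and the [-127, 127] range are assumptions made for illustration only.

```python
from typing import List, Tuple


def quantize_symmetric_int8(
    weight: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each output row to int8 with a single scale and no zero-point.

    Hypothetical helper for illustration; rows index output features.
    """
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Map the largest magnitude to 127; use 1.0 for all-zero rows
        # to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def linear_a16w8(
    x: List[float], q_rows: List[List[int]], scales: List[float]
) -> List[float]:
    """Compute y[o] = scales[o] * sum_i x[i] * q_rows[o][i].

    The activations x stay in floating point; only the weights are integers.
    """
    return [
        s * sum(xi * qi for xi, qi in zip(x, row))
        for row, s in zip(q_rows, scales)
    ]
```

Because the scheme is symmetric, dequantization is a single multiply per output feature (`q * scale`), which is why no zero-point tensor needs to be stored or passed to the kernel.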