Support using Int4PreshuffledTensor after loading #26066
💡 Codex Review
https://github.com/vllm-project/vllm/blob/ed36abce2e9fc2610de62da6ed187f4727a10302/vllm/model_executor/layers/quantization/torchao.py#L308-L317
CI is broken, can we take a look? |
oh OK, will fix |
Summary:
Int4PreshuffledTensor has the fastest int4 kernels in fbgemm, both for int4 weight only and for fp8 activation + int4 weight. However, we can't slice the tensor because of the preshuffling (and a slice has to preserve aliasing), so we use Int4Tensor (plain format), which can be sliced during loading, and convert the tensor to the preshuffled format after loading using the `torchao.prototype.tensor_conversion.api.convert_to_packed_tensor_based_on_current_hardware` function.

Test Plan:
pytest tests/quantization/test_torchao.py -k test_opt_125m_int4wo_model_running_preshuffled_kernel

For the test, we uploaded a plain int4 tensor checkpoint (https://huggingface.co/torchao-testing/opt-125m-Int4WeightOnlyConfig-v2-0.14.0.dev), load it in vLLM, and then check that the model is transformed to use Int4PreshuffledTensor before inference.

Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
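To illustrate why the order "slice in plain format, then convert" matters, here is a minimal toy sketch in plain Python. It does not call torchao or fbgemm: the `preshuffle` permutation, `load_shard`, and `process_after_loading` are hypothetical stand-ins invented for this example, and the real preshuffled layout is different. The only point is that a contiguous range of a shuffled buffer no longer corresponds to a row range of the original weight.

```python
# Toy illustration only. The permutation below is a made-up stand-in for
# preshuffling, NOT the real fbgemm layout, and none of these helpers exist
# in torchao or vLLM.

def preshuffle(rows):
    """Hypothetical packed layout: flatten the matrix and interleave its two halves."""
    flat = [v for row in rows for v in row]
    half = len(flat) // 2
    shuffled = []
    for a, b in zip(flat[:half], flat[half:]):
        shuffled.extend((a, b))
    return shuffled  # flat buffer that a kernel would consume


def load_shard(plain_rows, start, end):
    """Loading step: take a row range from the plain-format weight."""
    return plain_rows[start:end]


def process_after_loading(shard_rows):
    """Post-load step: convert the already-sliced shard to the packed layout."""
    return preshuffle(shard_rows)


plain = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# Order used in this PR: slice while still in plain format, then convert.
shard = load_shard(plain, 0, 2)
converted = process_after_loading(shard)

# Opposite order: convert the full weight first, then take a contiguous
# range of the shuffled buffer. The result mixes in values from rows 2-3.
full_shuffled = preshuffle(plain)
naive_slice = full_shuffled[: len(converted)]

print("slice then convert:", converted)
print("convert then slice:", naive_slice)
```

In the toy, `converted` contains exactly the values of rows 0 and 1, while `naive_slice` picks up values from the other rows, which is the kind of mismatch that makes slicing a preshuffled tensor unsafe during loading.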
@houseroad all checks have passed now, please merge |