Support using Int4PreshuffledTensor after loading #26066

Merged
mgoin merged 1 commit into vllm-project:main from jerryzh168:support-int4-preshuffle on Nov 4, 2025

@jerryzh168 jerryzh168 commented Oct 2, 2025

Contributor

Summary:
Int4PreshuffledTensor has the fastest int4 kernels in fbgemm, both for int4 weight-only and for fp8 activation + int4 weight. However, the tensor can't be sliced because of the preshuffling (and slice has to preserve aliasing), so we use Int4Tensor (plain format), which can be sliced during loading, and convert it to the preshuffled format after loading using the torchao.prototype.tensor_conversion.api.convert_to_packed_tensor_based_on_current_hardware function.
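To illustrate why the order "slice first, shuffle afterwards" matters, here is a minimal toy sketch in plain Python. It does not use torchao or fbgemm: `toy_preshuffle` is a hypothetical stand-in layout (column-major reordering) chosen only for illustration, and the real preshuffled layout is kernel-specific and different.

```python
def toy_preshuffle(rows):
    """Hypothetical stand-in for a kernel-specific layout.

    Stores the elements column by column instead of row by row.
    The real fbgemm preshuffle is NOT this; it is only a toy reordering.
    """
    return [row[c] for c in range(len(rows[0])) for row in rows]


# A 4x4 "weight" whose values 0..15 all fit in 4 bits.
weight = [[r * 4 + c for c in range(4)] for r in range(4)]

# Load-time sharding on the plain layout: this rank keeps rows 0 and 1.
shard = weight[:2]

# Convert after loading: shuffle only the rows this rank owns.
shard_shuffled = toy_preshuffle(shard)

# Shuffling the full weight first and then taking the first half of the
# buffer mixes in elements from rows 2 and 3, so it is not a valid shard.
full_shuffled = toy_preshuffle(weight)
naive_slice = full_shuffled[: len(shard_shuffled)]

owned = {v for row in shard for v in row}
foreign = [v for v in naive_slice if v not in owned]
```

In this toy, `foreign` is non-empty: slicing the shuffled buffer pulls in values that belong to another shard, which is the kind of problem the plain-format-then-convert flow avoids.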

Test Plan:
pytest tests/quantization/test_torchao.py -k test_opt_125m_int4wo_model_running_preshuffled_kernel
For the test, we uploaded a plain int4 tensor checkpoint https://huggingface.co/torchao-testing/opt-125m-Int4WeightOnlyConfig-v2-0.14.0.dev and load it in vLLM, then check that the model is transformed to use Int4PreshuffledTensor before inference.

Reviewers:

Subscribers:

Tasks:

Tags:

jerryzh168 force-pushed the support-int4-preshuffle branch from f02db41 to 34d63df on October 2, 2025 01:13
jerryzh168 marked this pull request as ready for review on October 3, 2025 23:39
jerryzh168 marked this pull request as draft on October 3, 2025 23:39
mergify Bot commented Oct 8, 2025

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jerryzh168.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@chatgpt-codex-connector


💡 Codex Review

https://github.com/vllm-project/vllm/blob/ed36abce2e9fc2610de62da6ed187f4727a10302/vllm/model_executor/layers/quantization/torchao.py#L308-L317
P1: Preserve weight metadata when converting to packed tensor

Replacing layer.weight with a fresh Parameter after calling convert_to_packed_tensor_based_on_current_hardware drops all of the attributes that were attached during create_weights (input_dim, output_dim, weight_loader, etc.) and also flips requires_grad back to True. Those attributes are used by the loader/reload path (for example gpu_model_runner.reload_weights expects weight_loader and input_dim to exist), so after the conversion any attempt to reload weights or reshard the tensor will raise an AttributeError or operate on the wrong dimensions. The new parameter should preserve the original metadata and requires_grad=False instead of creating a bare Parameter.
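The concern above can be sketched without torch. The following is a minimal illustration, not the code from this PR: `FakeParameter` is a hypothetical stand-in for `torch.nn.Parameter`, and `replace_preserving_metadata` is a hypothetical helper name. It shows one way to carry over attributes such as `input_dim`, `output_dim`, and `weight_loader` while forcing `requires_grad=False`.

```python
class FakeParameter:
    """Hypothetical stand-in for torch.nn.Parameter (data plus free-form attributes)."""

    def __init__(self, data, requires_grad=True):
        self.data = data
        self.requires_grad = requires_grad


def replace_preserving_metadata(old, new_data):
    """Build a new parameter around new_data, copying old's extra attributes.

    Hypothetical helper: `data` and `requires_grad` are set explicitly,
    everything else attached to `old` is copied over unchanged.
    """
    new = FakeParameter(new_data, requires_grad=False)
    for name, value in vars(old).items():
        if name not in ("data", "requires_grad"):
            setattr(new, name, value)
    return new


def dummy_weight_loader(param, loaded_weight):
    # Placeholder for a loader callback attached during weight creation.
    param.data = loaded_weight


# Attributes attached at creation time, as described in the review comment.
old_weight = FakeParameter([1, 2, 3, 4], requires_grad=False)
old_weight.input_dim = 1
old_weight.output_dim = 0
old_weight.weight_loader = dummy_weight_loader

# Pretend this reversed list is the converted (packed) tensor.
converted = list(reversed(old_weight.data))
new_weight = replace_preserving_metadata(old_weight, converted)
```

After the replacement, `new_weight` still exposes `weight_loader`, `input_dim`, and `output_dim`, so a later reload path that reads them would not hit an `AttributeError`. Whether the same attribute-copying approach is appropriate for real `torch.nn.Parameter` objects and tensor subclasses is an assumption that would need checking against the actual code.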


jerryzh168 force-pushed the support-int4-preshuffle branch 4 times, most recently from 27db8c4 to cfc8ea0 on October 27, 2025 17:49
houseroad added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 31, 2025
@houseroad

Collaborator

CI is broken, can we take a look?

@jerryzh168

Contributor Author

oh OK, will fix

jerryzh168 force-pushed the support-int4-preshuffle branch from 7eb3fa3 to 2749270 on October 31, 2025 20:41
Summary:
Int4PreshuffledTensor has the fastest int4 kernels in fbgemm, both for int4 weight-only and for
fp8 activation + int4 weight. However, the tensor can't be sliced because of the preshuffling
(and slice has to preserve aliasing), so we use Int4Tensor (plain format), which can be sliced
during loading, and convert it to the preshuffled format after loading using the
`torchao.prototype.tensor_conversion.api.convert_to_packed_tensor_based_on_current_hardware`
function.

Test Plan:
pytest tests/quantization/test_torchao.py -k test_opt_125m_int4wo_model_running_preshuffled_kernel
For the test, we uploaded a plain int4 tensor checkpoint https://huggingface.co/torchao-testing/opt-125m-Int4WeightOnlyConfig-v2-0.14.0.dev
and load it in vLLM, then check that the model is transformed to use Int4PreshuffledTensor before inference.

Reviewers:

Subscribers:

Tasks:

Tags:

Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
jerryzh168 force-pushed the support-int4-preshuffle branch from 2749270 to 43ffb17 on November 1, 2025 00:11
@jerryzh168

Contributor Author

@houseroad all checks have passed now, please merge

mgoin merged commit 03c4c4a into vllm-project:main on Nov 4, 2025
52 checks passed