Fix slicing and get_plain() in GemLite by mobicham · Pull Request #2288 · pytorch/ao

mobicham · 2025-06-02T14:49:04Z

Contributions

GemLite 4.7.0 uses FMA mode by default to improve dequantization performance (W_q * s + z instead of (W_q - z) * s), so get_plain() needs to be updated to be compatible with both formats.
Updated slicing to work directly on the packed data, since the older version, which relied on get_plain(), was causing issues in vLLM.
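
To illustrate why get_plain() has to distinguish the two formats, here is a minimal pure-Python sketch of both dequantization forms. It assumes that the FMA-mode zero-point is the folded value -z * s, which is the algebraic identity that makes the two forms agree; whether GemLite stores the zero-point exactly this way is an assumption, and the helper names are hypothetical.

```python
def dequant_sub_mul(w_q, scale, zero):
    """Classic form: (W_q - z) * s."""
    return [(q - zero) * scale for q in w_q]


def dequant_fma(w_q, scale, zero_fma):
    """FMA form: W_q * s + z', a single multiply-add per element."""
    return [q * scale + zero_fma for q in w_q]


def to_fma_zero(scale, zero):
    """Fold the zero-point so both forms agree: z' = -z * s (assumed convention)."""
    return -zero * scale


# 4-bit example values; scale and zero are chosen so float arithmetic is exact.
w_q = [0, 3, 8, 15]
scale, zero = 0.5, 8.0

ref = dequant_sub_mul(w_q, scale, zero)
fma = dequant_fma(w_q, scale, to_fma_zero(scale, zero))
assert ref == fma
```

Feeding an FMA-style zero-point into the classic formula (or the reverse) gives wrong weights, which is why the dequantization path has to branch on the mode.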

Notes

gemlite.set_kernel_caching(True) gives wrong output with torchao but not when using gemlite as a standalone module. Not sure why, but not being able to enable it would impact perf for batch-size=1 by up to 10 tokens/sec.

Tests

End-2-End test

https://gist.github.com/mobicham/54fed6f18bee590f615f18391b45b71e

Slicing Test

import torch, gemlite
from torchao.quantization import GemliteUIntXWeightOnlyConfig, quantize_
device = 'cuda:0'
dtype = torch.float16
gemlite.set_autotune("default")

torch.manual_seed(0)
in_features, out_features, group_size = 256, 512, 64

orig_shape = [out_features, in_features]
layer = torch.nn.Linear(in_features, out_features, bias=False, dtype=dtype, device=device)
layer.weight.data /= 10.
weight = layer.weight.data.clone()

quantize_(layer, GemliteUIntXWeightOnlyConfig(bit_width=4, group_size=group_size))

meta_args =  layer.weight.tensor_impl.gemlite_kwargs['meta_args']
W_group_mode = meta_args[10]

#Test matmul
####################################################################################
torch.manual_seed(0)
x = torch.randn((1, layer.in_features), device=device, dtype=dtype) / 10.
y_ref = x @ weight.T
y_gem  = layer(x)
err = (y_ref - y_gem).abs().mean()
assert err < 5e-3, "Dot product mismatch. " + str(err)

#Test slicing 
####################################################################################
def dequant(input_layer, in_features, orig_shape):
    int_data = input_layer.tensor_impl.packed_weight
    scale = input_layer.tensor_impl.scale
    zero_point = input_layer.tensor_impl.zero_point

    W_q = (
        gemlite.bitpack.unpack_over_rows(
            int_data, W_nbits=4, num_output_rows=in_features, dtype=torch.uint8
        )
        .T.contiguous()
        .view([-1, group_size])
    )

    s = scale.t().contiguous().view(-1, 1)
    z = zero_point.t().contiguous().view(-1, 1)

    if W_group_mode == 4:  # FMA
        W_deq = (W_q * s + z).view(orig_shape)
    else:
        W_deq = ((W_q - z) * s).view(orig_shape)

    return W_deq

W_r = dequant(layer.weight, layer.in_features, orig_shape) #~weight


# Slice each axis in half. narrow() takes (dim, start, length), so the third value is a length, not an end index.
for slice_axis, start, length in [(0, 0, 256), (0, 256, 256), (1, 0, 128), (1, 128, 128)]:
    layer_sliced = layer.weight.narrow(slice_axis, start, length)

    if slice_axis == 0:
        num_rows, out_shape = layer.in_features, (orig_shape[0] // 2, orig_shape[1])
    else:
        num_rows, out_shape = layer.in_features // 2, (orig_shape[0], orig_shape[1] // 2)

    W_slice = dequant(layer_sliced, num_rows, out_shape)

    W_slice_ref = W_r[start:start + length, :] if slice_axis == 0 else W_r[:, start:start + length]
    assert (W_slice_ref - W_slice).abs().mean() == 0, f"Slicing {start}:{start + length} along axis={slice_axis} is incorrect"
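
The idea of slicing the packed data directly, rather than dequantizing first, can be sketched in pure Python. This is only an illustration under stated assumptions: two 4-bit values are packed per byte across consecutive rows, and all function names are hypothetical. It is not GemLite's actual packing layout or API.

```python
ELEMS_PER_BYTE = 2  # two 4-bit values share one byte (assumed layout)


def pack_rows(matrix):
    """Pack each pair of consecutive rows into one row of bytes (low nibble first)."""
    assert len(matrix) % ELEMS_PER_BYTE == 0
    packed = []
    for r in range(0, len(matrix), ELEMS_PER_BYTE):
        lo, hi = matrix[r], matrix[r + 1]
        packed.append([(h << 4) | l for l, h in zip(lo, hi)])
    return packed


def unpack_rows(packed):
    """Inverse of pack_rows."""
    out = []
    for row in packed:
        out.append([b & 0xF for b in row])
        out.append([b >> 4 for b in row])
    return out


def narrow_packed_rows(packed, start, length):
    """Slice along the packed axis: indices shrink by the packing factor."""
    assert start % ELEMS_PER_BYTE == 0 and length % ELEMS_PER_BYTE == 0
    s = start // ELEMS_PER_BYTE
    return packed[s : s + length // ELEMS_PER_BYTE]


def narrow_packed_cols(packed, start, length):
    """Slice along the unpacked axis: indices are used as-is."""
    return [row[start : start + length] for row in packed]


W = [[(4 * r + c) % 16 for c in range(4)] for r in range(4)]
P = pack_rows(W)

# Slicing the packed data and then unpacking matches slicing the plain data.
assert unpack_rows(narrow_packed_rows(P, 2, 2)) == W[2:4]
assert unpack_rows(narrow_packed_cols(P, 1, 2)) == [row[1:3] for row in W]
```

The point of the sketch is that only the packed axis needs its start and length rescaled, and the slice boundaries must align with the packing factor. In the quantized case the scales and zero-points would need a matching slice as well.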

pytorch-bot · 2025-06-02T14:49:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2288

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit b2892ce with merge base 35ffb26:

NEW FAILURE - The following job has failed:

PR Label Check / Check PR Labels (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jerryzh168 · 2025-06-02T17:32:25Z

Could you incorporate the test into ao/test/dtypes/test_affine_quantized.py as well (line 363 in 8366465)?

def test_slice_gemlite(self, device, dtype):

mobicham · 2025-06-03T12:27:49Z

Updated the test and successfully tested on vLLM.

mobicham · 2025-06-03T16:04:47Z

In vLLM, I hit _same_metadata() issues with models that have lm_head quantized. It seems to me that ao's vLLM implementation doesn't support that?

* fix get_plain() with FMA mode
* update
* fix in_features/out_feature meta-data mismatch
* update gemlite slice test
* add packing_bitwidth support
* add packing_bitwidth support and cleanup
* update default gemlite layout
* cleanup
* fix symmetric use-case and relax _same_meta_data
* _copy() meta data
* fix (4,) in autoquant

mobicham added 2 commits June 2, 2025 12:26

fix get_plain() with FMA mode

36c0c25

update

5cc70e1

facebook-github-bot added the CLA Signed label Jun 2, 2025

mobicham marked this pull request as draft June 2, 2025 15:07

jerryzh168 approved these changes Jun 2, 2025

mobicham marked this pull request as ready for review June 2, 2025 17:44

mobicham and others added 3 commits June 3, 2025 10:42

Merge branch 'pytorch:main' into main

4c9dad8

fix in_features/out_feature meta-data mismatch

9ac689e

update gemlite slice test

bece806

mobicham added 2 commits June 3, 2025 17:55

add packing_bitwidth support

ba7b4f1

add packing_bitwidth support and cleanup

33e2bf6

mobicham requested a review from jerryzh168 June 3, 2025 18:22

mobicham and others added 6 commits June 4, 2025 09:32

update default gemlite layout

587ab10

cleanup

1cb7794

Merge branch 'pytorch:main' into main

2a31e9d

fix symmetric use-case and relax _same_meta_data

fc7ff50

fix symmetric use-case and relax _same_meta_data

75c13a5

_copy() meta data

2d66fb4

jerryzh168 reviewed Jun 4, 2025

Comment thread torchao/quantization/autoquant.py Outdated

mobicham and others added 2 commits June 4, 2025 15:39

fix (4,) in autoquant

eba10ad

Merge branch 'pytorch:main' into main

b2892ce

jerryzh168 added the topic: improvement label Jun 5, 2025

jerryzh168 merged commit 0640474 into pytorch:main Jun 5, 2025
19 of 20 checks passed

jerryzh168 merged 15 commits into pytorch:main from mobicham:main
