
Support dynamic activation quant for per-channel quantized matmul #7867

Merged
lsy323 merged 1 commit into master from lsiyuan/act-quant on Aug 20, 2024

Conversation

@lsy323 (Collaborator) commented Aug 16, 2024:

Needs #7863 to land first.

For dynamic activation quant, the quantized matmul will be:

The weight is quantized from `w: bf16[out_dim, in_dim]` to `w_int: int8[out_dim, in_dim]` and `w_scale: bf16[out_dim]`.

  1. Quantize the matmul input `x` with shape `bf16[bs, seq, in_dim]` to `x_int: int8[bs, seq, in_dim]` and `x_scale: bf16[bs, seq]`.
  2. Matmul `x_int` and `w_int` with int32 output dtype to avoid overflow: `matmul(x_int, w_int) -> matmul_out: int32[bs, seq, out_dim]`.
  3. Scale the matmul output with `w_scale` and `x_scale`: `final_out = matmul_out * w_scale * x_scale` (see the sketch below).
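
Below is a minimal eager-mode PyTorch sketch of these three steps, assuming symmetric per-token activation quantization; the helper name and the per-token scale layout (`keepdim=True` for broadcasting) are illustrative, not the PR's actual code:

```python
import torch

def dynamic_quant_matmul(x, w_int, w_scale):
    """Hypothetical sketch of steps 1-3 above (not the PR's implementation).

    x:       bf16[bs, seq, in_dim]  activation
    w_int:   int8[out_dim, in_dim]  per-channel quantized weight
    w_scale: bf16[out_dim]          per-channel weight scale
    """
    # 1. Dynamically quantize the activation per token (symmetric int8).
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0   # bf16[bs, seq, 1]
    x_scale = x_scale.clamp(min=1e-5)                      # guard against all-zero tokens
    x_int = torch.round(x / x_scale).clamp(-128, 127).to(torch.int8)
    # 2. Integer matmul, accumulated in int32 to avoid overflow.
    matmul_out = torch.matmul(x_int.to(torch.int32),
                              w_int.to(torch.int32).t())   # int32[bs, seq, out_dim]
    # 3. Rescale by both scales; int32 * bf16 promotes to bf16.
    return matmul_out * w_scale * x_scale                  # bf16[bs, seq, out_dim]
```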

Test
Added unit tests
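
A sketch of the kind of check such a unit test might make, reusing the hypothetical `dynamic_quant_matmul` above and comparing against the bf16 reference within an error threshold (the tolerances here are illustrative, not the PR's):

```python
import torch
import torch.nn.functional as F

bs, seq, in_dim, out_dim = 2, 16, 64, 32
x = torch.randn(bs, seq, in_dim, dtype=torch.bfloat16)
w = torch.randn(out_dim, in_dim, dtype=torch.bfloat16)

# Per-channel symmetric weight quantization, as in the description.
w_scale = w.abs().amax(dim=1) / 127.0    # bf16[out_dim]
w_int = torch.round(w / w_scale.unsqueeze(1)).clamp(-128, 127).to(torch.int8)

ref = F.linear(x, w)                     # bf16 reference matmul
out = dynamic_quant_matmul(x, w_int, w_scale)
# Illustrative threshold; the real tests pick their own tolerance.
assert torch.allclose(out.float(), ref.float(), rtol=0.1, atol=0.5)
```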

lsy323 force-pushed the lsiyuan/act-quant branch from 1783e1f to b05276a on August 16, 2024 03:44
lsy323 force-pushed the lsiyuan/act-quant branch from b05276a to ec98e8e on August 16, 2024 22:22
lsy323 requested a review from JackCaoG on August 16, 2024 22:22
lsy323 assigned miladm and lsy323 and unassigned miladm on Aug 16, 2024
lsy323 requested a review from miladm on August 16, 2024 22:35
@JackCaoG (Collaborator) commented:

Sorry, I might not have time for this one today; will try to look into it tomorrow.

Review thread on the following diff hunk:

```python
            x, w, (([-1], [-1]), ()), preferred_element_type=torch.int32)
    else:
        out = F.linear(x, w)
    out = out * scaler
```
Collaborator:
so the output dtype will be int32?

lsy323 (Author):
yes

> Matmul between `x_int` and `w_int` with int32 output dtype to avoid overflow: `matmul(x_int, w_int) -> matmul_out: int32[bs, seq, out_dim]`

lsy323 (Author):

The final output will be in bf16, since the bf16 scales multiply the int32 result.
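
A quick check of that promotion rule in plain PyTorch (shapes here are made up for illustration): multiplying an int32 tensor by bf16 scales yields bf16.

```python
import torch

matmul_out = torch.randint(-1000, 1000, (2, 4, 8), dtype=torch.int32)  # int32[bs, seq, out_dim]
w_scale = torch.rand(8, dtype=torch.bfloat16)                          # bf16[out_dim]
x_scale = torch.rand(2, 4, 1, dtype=torch.bfloat16)                    # bf16[bs, seq, 1]

final_out = matmul_out * w_scale * x_scale
print(final_out.dtype)  # torch.bfloat16
```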

@JackCaoG (Collaborator) left a review comment:

LGTM, do you need to run TPU CI on this PR?

@lsy323 (Author) commented Aug 20, 2024:

> LGTM, do you need to run TPU CI on this PR?

Right now it's not in TPU CI; the error threshold needs to be adjusted to pass on TPU. I can do that in a follow-up PR.

lsy323 merged commit 4bd2df1 into master on Aug 20, 2024
lsy323 deleted the lsiyuan/act-quant branch on August 20, 2024 17:57