
int8 dynamic prefill weight only decode #1436

Merged
jcaip merged 63 commits into main from jcaip/prefill-24-sparse-benchmarking
Dec 30, 2024

jcaip commented Dec 18, 2024

Contributor

This PR adds a weight_only_decode option to int8_dynamic_activation_int8_weight. When set, matmuls of shape (> 1, x) * (x, n) use dynamic quantization, and the batch_size=1 case uses weight-only quantization.
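To illustrate the dispatch described above, here is a minimal pure-Python sketch, not the torchao implementation. The helpers `quantize_rows` and `linear_int8` are hypothetical names, and the choices of symmetric per-row int8 scales and a weight stored as (n, x) in the style of `nn.Linear` are assumptions made for the example.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows(m: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in m:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def linear_int8(
    x: Matrix,
    w_q: List[List[int]],
    w_scales: List[float],
    weight_only_decode: bool = True,
) -> Tuple[Matrix, str]:
    """Compute x @ W^T with W stored as int8 rows plus per-row scales.

    Returns the output and the name of the path taken (for illustration).
    """
    if weight_only_decode and len(x) == 1:
        # batch_size == 1 (decode): keep activations in float and
        # dequantize the weights on the fly (weight-only quantization).
        out = [
            [sum(a * (q * s) for a, q in zip(row, w_row))
             for w_row, s in zip(w_q, w_scales)]
            for row in x
        ]
        return out, "weight_only"

    # More than one row (prefill): quantize activations at runtime,
    # accumulate in integers, then rescale (dynamic quantization).
    x_q, x_scales = quantize_rows(x)
    out = [
        [sum(a * b for a, b in zip(x_row, w_row)) * xs * ws
         for w_row, ws in zip(w_q, w_scales)]
        for x_row, xs in zip(x_q, x_scales)
    ]
    return out, "dynamic"


# Weight with n=2 output features and x=3 input features.
weight = [[1.0, -2.0, 0.5], [0.25, 0.75, -1.0]]
w_q, w_scales = quantize_rows(weight)

prefill_out, prefill_path = linear_int8([[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]], w_q, w_scales)
decode_out, decode_path = linear_int8([[1.0, 2.0, 3.0]], w_q, w_scales)
```

In torchao itself, the option would presumably be enabled by passing `weight_only_decode=True` to `int8_dynamic_activation_int8_weight` when quantizing the model; the exact call site is not shown in this PR description.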

It also updates generate.py to accept a text file as the prompt. We use this to demonstrate the prefill speedups with sh demo_summarize.sh.
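As a rough sketch of what reading the prompt from a text file could look like, the snippet below uses `argparse`. The flag name `--prompt-file` and the helpers `build_parser` and `resolve_prompt` are hypothetical; the actual argument added to generate.py is not named in this description.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Prompt loading sketch")
    parser.add_argument("--prompt", type=str, default="Hello, my name is",
                        help="Literal prompt text.")
    # Hypothetical flag: when given, the file contents replace --prompt.
    parser.add_argument("--prompt-file", type=Path, default=None,
                        help="Path to a text file containing the prompt.")
    return parser


def resolve_prompt(args: argparse.Namespace) -> str:
    """Return the prompt text, preferring the file when one is supplied."""
    if args.prompt_file is not None:
        return args.prompt_file.read_text(encoding="utf-8").strip()
    return args.prompt
```

A long document passed this way produces a long prefill, which is the case where the dynamic-quantization path applies.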

pytorch-bot Bot commented Dec 18, 2024


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1436

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b144a53 with merge base 567cb46:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Dec 18, 2024
jcaip changed the title from "Jcaip/prefill 24 sparse benchmarking" to "int8 dynamic prefill weight only decode" Dec 30, 2024
jcaip added the topic: improvement and topic: performance labels Dec 30, 2024
jcaip merged commit 52b6f4d into main Dec 30, 2024
amdfaa pushed a commit that referenced this pull request Jan 10, 2025

3 participants