support eval of float8_a1x128_w128x128 by vkuzo · Pull Request #3269 · pytorch/ao

vkuzo · 2025-10-31T18:35:24Z

Summary:

Adds support for the new float8 scaling recipe in the official eval
scripts used to generate accuracy numbers in the README.

For now, I am using this as a smoke test that the scaling is working on
a real model - it is. We can add official benchmark results after we
hook up the cuBLAS binding on H100, which should make the UEX of
running evals a lot better.
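For readers unfamiliar with the recipe name: `a1x128` and `w128x128` refer to the block shapes over which a scale is computed, 1x128 for activations and 128x128 for weights. The sketch below illustrates the idea of one absmax-derived scale per block in plain Python. It is not torchao's implementation: the function name `block_absmax_scales`, the `fp8_max=448.0` default (the largest finite float8 e4m3 value), and the `eps` floor are assumptions made for illustration only.

```python
from typing import List


def block_absmax_scales(
    matrix: List[List[float]],
    block_rows: int,
    block_cols: int,
    fp8_max: float = 448.0,  # assumed: largest finite float8 e4m3 value
    eps: float = 1e-12,  # assumed floor so an all-zero tile gets a nonzero scale
) -> List[List[float]]:
    """Return one scale per (block_rows x block_cols) tile of `matrix`."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("matrix shape must be divisible by the block shape")
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            absmax = max(
                abs(matrix[r][c])
                for r in range(r0, r0 + block_rows)
                for c in range(c0, c0 + block_cols)
            )
            # Dividing a tile by its scale maps its largest magnitude to fp8_max.
            row_scales.append(max(absmax, eps) / fp8_max)
        scales.append(row_scales)
    return scales


# Toy shapes: 2 tokens x 256 features, and a 128 x 256 weight.
activations = [[float((i + j) % 7 - 3) for j in range(256)] for i in range(2)]
weights = [[float((i * j) % 5 - 2) for j in range(256)] for i in range(128)]

act_scales = block_absmax_scales(activations, 1, 128)  # a1x128: one scale per row chunk
weight_scales = block_absmax_scales(weights, 128, 128)  # w128x128: one scale per tile
```

With these toy shapes the activations get a 2x2 grid of scales (one per token per 128-feature chunk), while the weight gets a 1x2 grid (one per 128x128 tile).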

Test Plan:

Smoke test on Llama-3.1-8B, accuracy looks good

```
// download checkpoint
with-proxy python scripts/download.py --hf_token {token} --repo_id meta-llama/Meta-Llama-3.1-8B

// prepare checkpoint
python scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/meta-llama/Meta-Llama-3.1-8B

// run bf16 eval on a single task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande'
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7426992896606156, 'acc_stderr,none': 0.012285989618865697}

// run float8 eval on the same task
with-proxy time python torchao/_models/llama/eval.py --checkpoint_path checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --tasks 'winogrande' --quantization float8_a1x128_w128x128 --compile
...
winogrande: {'alias': 'winogrande', 'acc,none': 0.7419100236779794, 'acc_stderr,none': 0.012298278833972477}
```
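To make "accuracy looks good" concrete, the two winogrande results can be compared against their reported standard errors. This is a rough sanity check rather than a formal test: the variable names are illustrative, and treating the two runs as independent is an assumption that likely overstates the noise, since both runs score the same examples.

```python
import math

# Numbers copied from the winogrande results in the test plan.
bf16_acc, bf16_stderr = 0.7426992896606156, 0.012285989618865697
fp8_acc, fp8_stderr = 0.7419100236779794, 0.012298278833972477

delta = bf16_acc - fp8_acc

# Assumption: the runs are treated as independent, so the variances add.
combined_stderr = math.sqrt(bf16_stderr**2 + fp8_stderr**2)

print(f"accuracy drop: {delta:.4f}")
print(f"drop / combined stderr: {delta / combined_stderr:.2f}")
```

The drop is well under one standard error of either run, which is consistent with using this as a smoke test rather than as an official benchmark number.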

vkuzo · 2025-10-31T18:35:25Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-10-31T18:35:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3269

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jainapurva · 2025-11-04T20:50:16Z

                model,
                Float8DynamicActivationFloat8WeightConfig(granularity=granularity),
            )
+        if quantization == "float8_a1x128_w128x128":


The evaluation framework for torchao has multiple scripts (torchao/_models/llama/eval.py and benchmarks/_models/eval_hf_models.py), which will need to be cleaned up as part of BE #3289. For now, I feel the quantization technique should also be added to the benchmarking framework here:

ao/benchmarks/microbenchmarks/utils.py, lines 153 to 155 in 01374eb:

```
def string_to_config(
    quantization: Optional[str], sparsity: Optional[str], **kwargs
) -> AOBaseConfig:
```

This will enable float8_a1x128_w128x128 in the torchao benchmarking module and allow running it on HF models.

Rest, LGTM!
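Since both the eval script and `string_to_config` dispatch on a quantization string, one way to keep the two in sync would be a small shared helper that extracts the block shapes from the recipe name. The sketch below is hypothetical: `BlockRecipe` and `parse_float8_block_recipe` are illustrative names that do not exist in torchao, and the mapping from the parsed shapes to an actual torchao config is deliberately left out.

```python
import re
from typing import NamedTuple, Tuple


class BlockRecipe(NamedTuple):
    """Hypothetical container for the block shapes encoded in a recipe name."""

    activation_block: Tuple[int, int]
    weight_block: Tuple[int, int]


# Matches names shaped like "float8_a1x128_w128x128".
_RECIPE_RE = re.compile(r"^float8_a(\d+)x(\d+)_w(\d+)x(\d+)$")


def parse_float8_block_recipe(name: str) -> BlockRecipe:
    """Parse a blockwise float8 recipe string into its two block shapes."""
    match = _RECIPE_RE.match(name)
    if match is None:
        raise ValueError(f"not a blockwise float8 recipe name: {name!r}")
    a_rows, a_cols, w_rows, w_cols = (int(g) for g in match.groups())
    return BlockRecipe((a_rows, a_cols), (w_rows, w_cols))


recipe = parse_float8_block_recipe("float8_a1x128_w128x128")
print(recipe.activation_block, recipe.weight_block)
```

Each call site could then build its own config object from the parsed shapes, so that adding a new block size would not require touching every string comparison.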

vkuzo added 27 commits October 29, 2025 04:05

990ef89 Update [ghstack-poisoned]
cce08f0 Update [ghstack-poisoned]
681277a Update [ghstack-poisoned]
26ade98 Update [ghstack-poisoned]
f76e10b Update [ghstack-poisoned]
6994e20 Update [ghstack-poisoned]
1aff468 Update [ghstack-poisoned]
f6fa134 Update [ghstack-poisoned]
1911212 Update [ghstack-poisoned]
9ec8ce1 Update [ghstack-poisoned]
57b8876 Update [ghstack-poisoned]
1161f7f Update [ghstack-poisoned]
c5be7c0 Update [ghstack-poisoned]
00c6bbb Update [ghstack-poisoned]
d40ec7c Update [ghstack-poisoned]
ce5a8eb Update [ghstack-poisoned]
be5a9bb Update [ghstack-poisoned]
6a3684b Update [ghstack-poisoned]
1d4a2f7 Update [ghstack-poisoned]
d28b0ae Update [ghstack-poisoned]
6c087b4 Update [ghstack-poisoned]
4de79c9 Update [ghstack-poisoned]
1938209 Update [ghstack-poisoned]
c4769a6 Update [ghstack-poisoned]
eb95772 Update [ghstack-poisoned]
526b741 Update [ghstack-poisoned]
22d1a14 Update [ghstack-poisoned]

meta-cla Bot added the "CLA Signed" label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) Oct 31, 2025

This was referenced Oct 31, 2025

add a_1_128_w_128_128 (DeepSeek) float8 scaling for inference #3257

Merged

add bias handling for a_1_128_w_128_128 float8 scaling #3259

Merged

Makes fallback float8 1x128 by 128x128 gemm output bfloat16 #3265

Merged

vkuzo requested review from andrewor14, jainapurva and jerryzh168 October 31, 2025 18:36

vkuzo added the "topic: for developers" label (Use this tag if this PR is mainly developer facing) Oct 31, 2025

vkuzo added 3 commits October 31, 2025 12:43

76671f9 Update [ghstack-poisoned]
4a29159 Update [ghstack-poisoned]
9a995b5 Update [ghstack-poisoned]

vkuzo added 2 commits October 31, 2025 12:44

c877d67 Update [ghstack-poisoned]
485ee80 Update [ghstack-poisoned]

vkuzo mentioned this pull request Nov 3, 2025

make float8 a1x128_w128x128 granularity serializeable #3279

Merged

cafe668 Update [ghstack-poisoned]

jainapurva reviewed Nov 4, 2025

jainapurva approved these changes Nov 4, 2025

2dacafc Update [ghstack-poisoned]

vkuzo changed the base branch from gh/vkuzo/160/head to main November 6, 2025 11:53

vkuzo merged commit a9f2dc1 into main Nov 6, 2025
45 of 55 checks passed

vkuzo merged 34 commits into main from gh/vkuzo/161/head
