[float8] add _auto_filter_for_recipe to float8 by danielvegamyhre · Pull Request #2410 · pytorch/ao

danielvegamyhre · 2025-06-18T21:50:58Z

Problem

float8 rowwise + vanilla TP in torchtitan had flat perf with respect to bfloat16.
The RCA in pytorch/torchtitan#1207 ("float8 rowwise vanilla TP low throughput") found that the attention.wk and attention.wv layers were so small that float8 rowwise conversion resulted in a significant slowdown (approx. 40%) for those linears, which nullified the perf benefits from fp8 rowwise conversion on larger linears.
This is because the default filter_fqns for float8 model conversion are fine for the fp8 tensorwise recipe, but bad for the float8 rowwise recipe.

Solution

This has been a footgun for various users as well (including Poolside), so I created an "auto filter" (#2410) which automatically filters Linears for a given float8 recipe. A Linear is excluded from float8 conversion if it meets any of the following criteria:

1. dims not divisible by 16 (hardware requirement for float8)
2. dim sizes below thresholds that will result in worse perf for that given recipe, using simple heuristics based on the linked recipe perf tables above.
3. fqn matches one of the user defined filter_fqns
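The three criteria can be sketched as a small filter factory. This is a minimal illustration and not the torchao implementation: `make_auto_filter`, `should_convert`, and the values in `ASSUMED_MIN_DIMS` are hypothetical names and placeholder thresholds chosen for the example, and it works on plain `(out_features, in_features)` integers instead of `nn.Linear` modules so it runs without torch.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-recipe minimum (out_features, in_features).
# Placeholder values for illustration only, NOT the thresholds used by torchao.
ASSUMED_MIN_DIMS: Dict[str, Tuple[int, int]] = {
    "tensorwise": (1024, 1024),
    "rowwise": (2048, 2048),
}


def make_auto_filter(
    recipe: str, filter_fqns: List[str]
) -> Callable[[str, int, int], bool]:
    """Return a predicate that is True when a linear layer should be converted."""
    min_out, min_in = ASSUMED_MIN_DIMS[recipe]

    def should_convert(fqn: str, out_features: int, in_features: int) -> bool:
        # Criterion 3: the fqn matches one of the user defined filter_fqns.
        if any(pattern in fqn for pattern in filter_fqns):
            return False
        # Criterion 1: both dims must be divisible by 16.
        if out_features % 16 != 0 or in_features % 16 != 0:
            return False
        # Criterion 2: dims below the (assumed) recipe-specific thresholds.
        if out_features < min_out or in_features < min_in:
            return False
        return True

    return should_convert


if __name__ == "__main__":
    rowwise_filter = make_auto_filter("rowwise", filter_fqns=["output"])
    # Example shapes are assumptions for illustration.
    for fqn, n, k in [
        ("layers.0.attention.wq", 4096, 4096),
        ("layers.0.attention.wk", 1024, 4096),
        ("output", 128256, 4096),
    ]:
        print(fqn, rowwise_filter(fqn, n, k))
```

Returning a closure keeps the recipe choice and the user's `filter_fqns` bound together, so the same predicate can be applied to every linear layer while walking a model.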

I integrated a PoC into torchtitan, and the auto filter improved fp8 rowwise perf in both a local Llama3 8b run and a Llama3 70b MAST run, compared to the default filter_fn we have now.

It prevents users from hitting this common footgun, while also preserving the flexibility to define their model-specific fqns.

Results

See pytorch/torchtitan#1207 for Llama3 70b results. TL;DR: filtering out wk and wv improves TPS by ~10% for vanilla TP and ~15% for async TP.

pytorch-bot · 2025-06-18T21:51:02Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2410

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ded0931 with merge base 101c039:

NEW FAILURE - The following job has failed:

Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
RuntimeError: Command docker exec -t 7a6582d45905d5961c8f1a9707510eacbc2e51d2fd1a01fa18dc0be83fab2b0c /exec failed with exit code 139

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2025-06-24T14:36:00Z

cc @vkuzo for review

vkuzo

looks good, can we make sure the name has an underscore and also add a test before landing?

Fixes #1207

## Problem

- float8 rowwise + vanilla TP in torchtitan had flat perf with respect to bfloat16 (see #1207).
- The RCA in #1207 found that the attention.wk and attention.wv layers were so small that float8 rowwise conversion resulted in an approx. 40% slowdown for those layers, which nullified the perf benefits from fp8 rowwise conversion on larger linears.
- This is because the default `filter_fqns` for float8 model conversion are fine for the fp8 tensorwise recipe, but bad for the float8 rowwise recipe.

## Solution

This has been a footgun for various users as well (including Poolside), so I created an "auto filter" (pytorch/ao#2410) which automatically filters Linears for a given float8 recipe, by checking for the following criteria:

1. dims not divisible by 16 (hardware requirement for float8)
2. dim sizes below thresholds that may result in worse perf **for that given recipe**, using simple heuristics based on the linked recipe perf tables above.
3. fqn matches one of the user defined `filter_fqns`

It prevents users from hitting this common footgun, while also preserving the flexibility to define their model-specific fqns.

## Results

Benchmarks show a ~10% TPS improvement for TP and ~15% TPS improvement for async TP (over bf16 TP baseline).

Llama3 70b on 256 H100s with FSDP=32, TP=8, torch.compile, full AC, local batch size 16:

- [bfloat16 baseline](https://fburl.com/mlhub/ji9smr5u) = ~597 TPS
- [fp8 rowwise WITH attention.wk, attention.wv converted](https://fburl.com/mlhub/cu4o6w5m) = ~600 TPS
- [fp8 rowwise WITHOUT attention.wk, attention.wv converted](https://fburl.com/mlhub/mgzz309o) = ~660 TPS
- [fp8 rowwise + async TP WITH attention.wk, attention.wv converted](https://fburl.com/mlhub/76q4mel9) = ~625 TPS
- [fp8 rowwise + async TP WITHOUT attention.wk, attention.wv converted](https://fburl.com/mlhub/6b07aa4d) = ~695 TPS
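As a quick sanity check on the headline percentages, the relative gains over the bf16 baseline can be recomputed from the TPS figures listed above. The figures are approximate (each is quoted with "~"), so the derived percentages are approximate too; the dictionary keys and the `pct_gain` helper are names made up for this sketch.

```python
# TPS figures as reported in the results list above (all approximate).
BASELINE_BF16_TPS = 597.0

reported_tps = {
    "rowwise, wk/wv converted": 600.0,
    "rowwise, wk/wv filtered": 660.0,
    "rowwise + async TP, wk/wv converted": 625.0,
    "rowwise + async TP, wk/wv filtered": 695.0,
}


def pct_gain(tps: float, baseline: float) -> float:
    """Relative throughput gain over the baseline, in percent."""
    return (tps - baseline) / baseline * 100.0


gains = {
    name: round(pct_gain(tps, BASELINE_BF16_TPS), 1)
    for name, tps in reported_tps.items()
}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: {gain:+.1f}% vs bf16")
```

With wk and wv converted, the gain over bf16 is under 1% for vanilla TP, which matches the "flat perf" described in the problem statement.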

* add auto_filter_for_recipe to float8
* lint
* address comments
* add tests

add auto_filter_for_recipe to float8

8561c64

facebook-github-bot added the CLA Signed label Jun 18, 2025

danielvegamyhre mentioned this pull request Jun 18, 2025

float8 rowwise vanilla TP low throughput pytorch/torchtitan#1207

Closed

lint

d448443

danielvegamyhre added the float8 and topic: new feature labels Jun 18, 2025

danielvegamyhre mentioned this pull request Jun 18, 2025

[float8] add _auto_filter_for_recipe for float8 training pytorch/torchtitan#1319

Merged

danielvegamyhre requested a review from vkuzo June 24, 2025 14:34

vkuzo approved these changes Jun 24, 2025

Comment thread torchao/float8/float8_linear_utils.py Outdated

Comment thread torchao/float8/float8_linear_utils.py

Comment thread torchao/float8/float8_linear_utils.py Outdated

Comment thread torchao/float8/float8_linear_utils.py Outdated

danielvegamyhre added 2 commits June 24, 2025 08:01

address comments

b133535

add tests

ded0931

danielvegamyhre force-pushed the auto_filter branch from ae04451 to ded0931 Compare June 24, 2025 15:48

danielvegamyhre changed the title ~~[float8] add auto_filter_for_recipe to float8~~ [float8] add _auto_filter_for_recipe to float8 Jun 24, 2025

danielvegamyhre merged commit 9eeb101 into main Jun 24, 2025
18 of 19 checks passed

danielvegamyhre mentioned this pull request Jun 26, 2025

[float8] add tests for float8 _auto_filter_for_recipe #2450

Merged

liangel-02 pushed a commit that referenced this pull request Aug 25, 2025

[float8] add _auto_filter_for_recipe to float8 (#2410)

04f7da7

danielvegamyhre merged 4 commits into main from auto_filter


3 participants
