
add performance and accuracy eval of flux-1.schnell #3502

Merged
vkuzo merged 6 commits into main from gh/vkuzo/186/head
Jan 6, 2026

Conversation

@vkuzo
Contributor

@vkuzo vkuzo commented Dec 17, 2025

Summary:

Adds performance and accuracy eval for the flux-1.schnell model. This is useful as diffusion models are a major use case for torchao, and before this PR we didn't have reproducible benchmarks for them.

Results, measured on a B200 machine:

| experiment | lpips_avg | time_s | speedup |
| --- | --- | --- | --- |
| bfloat16 (baseline) | - | 1.77 | - |
| float8_rowwise | 0.1714 | 1.54 | 1.15 |
| mxfp8 | 0.1747 | 1.47 | 1.20 |
| nvfp4 | 0.3081 | 1.32 | 1.34 |
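The speedup column follows directly from the e2e times: baseline (bfloat16) time divided by the quantized model's time. A quick sanity check of the arithmetic, with the numbers copied from the table above:

```python
# Speedup = baseline (bfloat16) e2e time / quantized e2e time.
# Times are copied from the results table above.
baseline_time_s = 1.77  # bfloat16 baseline

times_s = {"float8_rowwise": 1.54, "mxfp8": 1.47, "nvfp4": 1.32}
speedups = {name: round(baseline_time_s / t, 2) for name, t in times_s.items()}
print(speedups)
```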

Details:

  • For performance, we measure e2e time for single image generation, with torch.compile on and num_inference_steps=4. In future PRs we can tighten this up to align with https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/. For now I did not do any performance debugging.
  • For accuracy, we measure the LPIPS (https://github.com/richzhang/PerceptualSimilarity) score between the image generated by the baseline (bf16) and quantized model, averaged over the DrawBench (https://huggingface.co/datasets/sayakpaul/drawbench) dataset of 200 prompts.
  • We start with three supported quantization recipes: float8_rowwise, mxfp8, and nvfp4 (because I wrote this on a B200). We can expand to other recipes in future PRs as needed.
  • For selecting which layers of the model to quantize, I wrote a basic heuristic (don't quantize embeddings, etc.); this has not been validated with an accuracy study or sensitivity analysis.
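The layer-selection heuristic mentioned in the last bullet can be sketched as a simple name-based filter. This is an illustrative sketch only, not the PR's actual code; the skip list and the layer names are hypothetical:

```python
# Hypothetical sketch of a name-based layer-selection heuristic: quantize
# linear-like layers, but skip embeddings and output projections, which are
# typically the most quantization-sensitive. The skip list is illustrative.
SKIP_SUBSTRINGS = ("embed", "norm", "proj_out")

def should_quantize(fqn: str) -> bool:
    """Return True if the layer at this fully-qualified name should be quantized."""
    return not any(s in fqn for s in SKIP_SUBSTRINGS)

# Example fully-qualified module names (made up for illustration).
layers = [
    "time_text_embed.timestep_embedder.linear_1",
    "transformer_blocks.0.attn.to_q",
    "transformer_blocks.0.ff.net.0.proj",
    "proj_out",
]
selected = [fqn for fqn in layers if should_quantize(fqn)]
print(selected)  # only the attention and feed-forward projections survive
```

In the real benchmark this predicate would be passed as the `filter_fn` of torchao's `quantize_` API, so the same heuristic applies uniformly across recipes.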

How to run the e2e script:

# takes ~16 mins using 8 GPUs on a B200
benchmarks/quantization/eval_accuracy_and_perf_of_flux.sh
# full log: https://www.internalfb.com/phabricator/paste/view/P2093514733

Note: the script quality is not ideal; we can improve it in future PRs if that proves to be worth our time. The current code is good enough to check in and start reporting metrics.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@vkuzo
Contributor Author

vkuzo commented Dec 17, 2025

Stack from ghstack (oldest at bottom):

vkuzo added a commit that referenced this pull request Dec 17, 2025
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
ghstack-source-id: 25daf59
ghstack-comment-id: 3667066648
Pull-Request: #3502
@pytorch-bot

pytorch-bot Bot commented Dec 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3502

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 27b34e9 with merge base dd41e98:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Dec 17, 2025
[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Dec 19, 2025
ghstack-source-id: b1bb3d2
ghstack-comment-id: 3667066648
Pull-Request: #3502
@vkuzo vkuzo added the "module: not user facing" label (use this tag if you don't want this PR to show up in release notes) Dec 19, 2025
[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Dec 19, 2025
ghstack-source-id: 70a7b71
ghstack-comment-id: 3667066648
Pull-Request: #3502
[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Dec 22, 2025
ghstack-source-id: 551cd15
ghstack-comment-id: 3667066648
Pull-Request: #3502
[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Dec 22, 2025
ghstack-source-id: 58f5c33
ghstack-comment-id: 3667066648
Pull-Request: #3502
@vkuzo vkuzo changed the title from "[wip] flux eval" to "add performance and accuracy eval of flux-1.schnell" Dec 22, 2025
@sayakpaul
Contributor

This is good enough, actually.

> For performance, we measure e2e time for single image generation, with torch.compile on and num_inference_steps=4. In future PRs we can tighten this up to align with https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/. For now I did not do any performance debugging.

We shouldn't need basic performance debugging as the mentioned blog post already did that (such as ensuring no graph-breaks, recompilations, CPU<->GPU syncs, etc.). I think we could add the following context before the inference runs to ensure no graph breaks (as it's simple):
https://github.com/huggingface/diffusers/blob/1cdb8723b85f1b427031e390e0bd0bebfe92454e/tests/models/test_modeling_common.py#L2143C9-L2149C37

We can squeeze out more, but that would probably be intrusive. Also, note that we log the performance benchmarks, too: https://huggingface.co/datasets/diffusers/benchmarks. In the future, it could be great for us to pair up and consolidate this like we have done many times in the past :-)
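The no-graph-break guarantee being discussed can also be enforced directly at compile time. A minimal sketch (not the diffusers test code linked above; the module here is made up for illustration) using `torch.compile`'s `fullgraph` flag:

```python
import torch

# fullgraph=True makes torch.compile raise on any graph break instead of
# silently falling back to eager, turning "no graph breaks" into a hard
# check. backend="eager" keeps this example cheap; a real benchmark would
# use the default inductor backend.
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = TinyBlock()
compiled = torch.compile(model, fullgraph=True, backend="eager")
out = compiled(torch.randn(2, 4))
print(tuple(out.shape))
```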

@vkuzo
Contributor Author

vkuzo commented Jan 5, 2026

@sayakpaul ,

> We shouldn't need basic performance debugging as the mentioned blog post already did that (such as ensuring no graph-breaks, recompilations, CPU<->GPU syncs, etc.).

If I use the diffusers pipeline as-is, I get all of that for free except for QKV fusion, right? That's great!

> I think we could add the following context before the inference runs to ensure no graph breaks (as it's simple):

From the code link, it seems like no-graph-breaks is enforced in diffusers CI? Let me know if I got that right. If so, I'd rather trust diffusers and not check for it again here, to keep things simple.

> Also, note that we log the performance benchmarks, too

That's great! I do want something in torchao to help guide local development, and the goal for the benchmark in this PR is more "here is how different torchao quantization recipes compare to each other" than "push perf + accuracy to SOTA / catch regressions / etc." I'd definitely be happy to collaborate more on this. Where can we find you on Slack?

@sayakpaul
Contributor

> if I use the diffusers pipeline as-is I get all of that for free except for qkv fusion, right? that's great

QKV fusion is also supported:
https://github.com/huggingface/flux-fast/blob/0a1dcc91658f0df14cd7fce862a5c8842784c6da/utils/pipeline_utils.py#L389C9-L390C44

> from the code link, seems like no-graph-breaks is enforced in diffusers CI? Lmk if I got that right. If so, I'd rather trust diffusers and not check for it again here, to keep things simple.

Yup, that's correct.

> I'd definitely be happy to collaborate more on this, where can we find you on slack?

I think you can sync with @jerryzh168 / @supriyar on this. We have a fairly active collaboration channel on Slack :-)

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jan 6, 2026
ghstack-source-id: dd31816
ghstack-comment-id: 3667066648
Pull-Request: #3502
@vkuzo
Contributor Author

vkuzo commented Jan 6, 2026

> QKV fusion is also supported:

@sayakpaul , does that flux-fast code path get hit if I use diffusers pipeline to load the flux model family, or does it require the user to use flux-fast directly?

@vkuzo vkuzo merged commit 3955b6c into main Jan 6, 2026
59 checks passed
@sayakpaul
Contributor

> does that flux-fast code path get hit if I use diffusers pipeline to load the flux model family, or does it require the user to use flux-fast directly?

It's not flux-fast specific. It's implemented at the diffusers-level. You can use it on the Flux model family.

howardzhang-cv added a commit that referenced this pull request Mar 9, 2026
## Summary
- Added new benchmark for new low precision attention API
- Can compare a baseline and a test model across different backends (fa2, fa3, fa3_fp8, fa4, fa4_fp8)
- Uses the flux.1-schnell model, 4 inference steps, and DrawBench prompts
- Has options to control the number of prompts, torch.compile usage, warmup_iters, debug prompts, the number of inference steps, and RoPE fusion
- Following the guidelines of #3502

## Example Run
```
python benchmarks/prototype/attention/eval_flux_model.py --baseline fa3 --test fa3_fp8 --compile
```
