add performance and accuracy eval of flux-1.schnell #3502
Conversation
Stack from ghstack (oldest at bottom):
This is good enough, actually.
We shouldn't need basic performance debugging, as the mentioned blog post already did that (ensuring no graph breaks, recompilations, CPU<->GPU syncs, etc.). I think we could add the following context before the inference runs to ensure no graph breaks (as it's simple); see the sketch below. We can squeeze out more, but that would probably be intrusive. Also, note that we log the performance benchmarks too: https://huggingface.co/datasets/diffusers/benchmarks. In the future, it could be great for us to pair up and consolidate this like we have done many times in the past :-)
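The snippet from the original comment wasn't preserved in this thread; below is a minimal sketch of one way to enforce no graph breaks, assuming a diffusers `FluxPipeline` already loaded as `pipe` (the variable name is an assumption):

```python
import torch

# `pipe` is assumed to be an already-loaded diffusers FluxPipeline.
# fullgraph=True makes torch.compile raise an error on any graph break
# instead of silently falling back to eager, so a clean benchmark run
# doubles as a no-graph-breaks check.
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune", fullgraph=True
)
```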
if I use the diffusers pipeline as-is I get all of that for free except for qkv fusion, right? that's great
from the code link, seems like no-graph-breaks is enforced in diffusers CI? Lmk if I got that right. If so, I'd rather trust diffusers and not check for it again here, to keep things simple.
That's great! I do want something in torchao to help guide local development, and the goal for the benchmark in this PR is more "here is how different torchao quantization recipes compare to each other" and not "push perf + accuracy to SOTA / catch regressions / etc". I'd definitely be happy to collaborate more on this. Where can we find you on Slack?
QKV fusion is also supported:
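The code link from this comment wasn't preserved; in diffusers, QKV fusion can be enabled along these lines (a sketch, assuming `pipe` is a loaded `FluxPipeline`):

```python
# Sketch, assuming `pipe` is an already-loaded diffusers FluxPipeline.
# fuse_qkv_projections() merges the separate Q/K/V projections in each
# attention block into a single larger matmul.
pipe.transformer.fuse_qkv_projections()
pipe.vae.fuse_qkv_projections()
```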
Yup, that's correct.
I think you can sync with @jerryzh168 / @supriyar on this. We have a fairly active collaboration channel on Slack :-)
@sayakpaul, does that flux-fast code path get hit if I use the diffusers pipeline to load the flux model family, or does it require the user to use flux-fast directly?
It's not flux-fast specific; it's implemented at the diffusers level.
## Summary
- Added new benchmark for new low precision attention API
- Can set baseline and test models between different backends: (fa2, fa3, fa3_fp8, fa4, fa4_fp8)
- Uses flux.1-schnell model, 4 inference steps, DrawBench prompts
- Has options to control number of prompts, torch.compile usage, warmup_iters, using debug prompts, number of inference steps, rope fusion
- Following the guidelines of #3502

## Example Run
```
python benchmarks/prototype/attention/eval_flux_model.py --baseline fa3 --test fa3_fp8 --compile
```
Summary:
Adds performance and accuracy eval for the `flux-1.schnell` model. This is useful as diffusion models are a major use case for torchao, and before this PR we didn't have reproducible benchmarks for them.

Results, measured on a B200 machine:
Details:
- The benchmark runs with `torch.compile` on and `num_inference_steps=4`. In future PRs we can tighten this up to align with https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/. For now I did not do any performance debugging.
- Quantization recipes: `float8_rowwise`, `mxfp8`, `nvfp4` (because I wrote this on a B200). We can expand to other recipes in future PRs as needed.

How to run the e2e script:
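The script listing and exact command weren't preserved here. As a hedged sketch of what one of the recipes does inside such a script, the following applies the `float8_rowwise` recipe with torchao's `quantize_` API; the model id and variable names are assumptions, not the PR's actual code:

```python
import torch
from diffusers import FluxPipeline
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

# Assumed model id; load the pipeline in bf16 on GPU.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# float8_rowwise: dynamic float8 activations + float8 weights with per-row
# scales, swapped into the transformer's linear layers in place.
quantize_(
    pipe.transformer,
    Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()),
)

# Compile and run with the 4-step schnell setting from the description.
pipe.transformer = torch.compile(pipe.transformer, fullgraph=True)
image = pipe("a photo of a cat", num_inference_steps=4).images[0]
```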
Note: the script quality is not ideal; we can improve it in future PRs if it proves to be worth our time. The current code is good enough to check in and start reporting metrics.