[js/web] JSEP Attention & MultiHeadAttention #17742
Conversation
/azp run ONNX Runtime Web CI Pipeline

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Need to run
Yeah, it's been a while since I've submitted a new op and I forgot about that. Also forgot to run format after moving the tests from my working branch.

Please update the comments in https://github.com/microsoft/onnxruntime/blob/main/js/web/script/generate-webgpu-operator-md.ts#L10 since it's a partial implementation.
done |
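For context, the note above is about the script that generates the WebGPU operator table. A minimal sketch of what a "partial implementation" entry for these ops might look like; the map name and wording here are assumptions for illustration, not the actual contents of generate-webgpu-operator-md.ts:

```ts
// Hypothetical sketch: a map from op name to a partial-support note that gets
// rendered into the generated WebGPU operator markdown table. The real
// structure lives in js/web/script/generate-webgpu-operator-md.ts and may differ.
const PARTIAL_IMPLEMENTATION_NOTES: Record<string, string> = {
  Attention: 'partial: no past/present and no attention mask',
  MultiHeadAttention: 'partial: no inputs 5-7, packed QKV/KV, past/present or attention mask',
};
```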
/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).
Fixed the ONNX Runtime Web CI Pipeline error caused by an unused variable.
/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).
I've fixed the errors here and in the LayerNorm PR. Also, I've managed to load SDXL in the browser, so once I update the pipeline code I'll come back with updates to Attention (if it requires something not yet implemented) or start to annoy you with a 64-bit PR :)

Awesome if you can run SDXL. wasm64 might be a bit of a pain, but long term it's not avoidable and a bunch of people badly want it. I think fp16 together with the ONNX external data format will go a long way, but at some point we will need 64-bit.
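Since fp16 plus the ONNX external data format comes up here, a rough sketch of how a session with externally stored weights might be created. The externalData session option and both file names are assumptions for illustration (support depends on the onnxruntime-web release), not something confirmed by this PR:

```ts
import * as ort from 'onnxruntime-web';

// Sketch only: load an fp16 UNet exported with the ONNX external data format,
// so the .onnx graph stays small and the large weight tensors live in a
// separate file. File names and the externalData option are assumptions.
async function createUnetSession(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create('unet_fp16.onnx', {
    executionProviders: ['webgpu'],
    externalData: ['unet_fp16.onnx_data'],
  });
}
```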
/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).
I've already loaded it (both fp32 and fp16), but I'm having some issues with either the ONNX export or the pipeline code: I'm getting NaNs after the unet run. Most likely I'll have time to resolve it closer to next week.

Anyway, I can maintain my own package with a 64-bit build, since my goal is a diffusers.js library, not specific implementation details. But right now my dev branch and upstream have diverged a lot. If you agree to merge changes that support 64-bit flags, it would make everything much easier. I can file some separate PRs; just let me know what you would like to have in upstream and how we can keep it compatible.
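For debugging NaNs like the ones mentioned after the unet step, a tiny helper along these lines can help localize which output goes bad first; the function name is made up and it assumes the output has already been downloaded to the CPU as fp32:

```ts
import { Tensor } from 'onnxruntime-web';

// Illustrative helper: scan a float32 output tensor for NaN/Inf so a pipeline
// step (e.g. the unet run above) can report where non-finite values first appear.
// Assumes CPU-resident fp32 data; fp16 outputs would need decoding first.
function assertFinite(name: string, tensor: Tensor): void {
  const data = tensor.data as Float32Array;
  for (let i = 0; i < data.length; i++) {
    if (!Number.isFinite(data[i])) {
      throw new Error(`${name}: non-finite value ${data[i]} at index ${i}`);
    }
  }
}
```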
/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

/azp run Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 6 pipeline(s).

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-python-checks-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Android CI Pipeline

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).
Description
This is a narrow implementation of Attention/MultiHeadAttention; it does not support:
a. inputs 5-7 for MHA
b. packed QKV/KV
c. past/present
d. attention mask
But it works well for Stable Diffusion and can be extended later. It reduces VRAM usage because it fuses many ops into a few.
I've updated the demo at https://islamov.ai/stable-diffusion-webgpu/; it takes ~13 s for 1 image with 20 steps on an RTX 3090 Ti and about 25 s on an M1 Pro.
VRAM usage is about 8 GB if you don't use img2img.
Going to focus on SDXL now.

Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
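For readers who want to exercise these kernels from onnxruntime-web, a minimal usage sketch follows. Running on the WebGPU execution provider is what routes Attention/MultiHeadAttention through the new JSEP implementation (within the limits listed above); the model file name and input name are placeholders, and the exact import path may differ depending on the bundle you use:

```ts
import * as ort from 'onnxruntime-web';

// Minimal sketch: run a model containing a com.microsoft MultiHeadAttention op
// on the WebGPU execution provider. 'model_with_mha.onnx' and 'hidden_states'
// are placeholder names; the fused op is used transparently by the runtime,
// so no caller-side API changes are needed.
async function run(): Promise<void> {
  const session = await ort.InferenceSession.create('model_with_mha.onnx', {
    executionProviders: ['webgpu'],
  });
  const hiddenStates = new ort.Tensor('float32', new Float32Array(1 * 77 * 768), [1, 77, 768]);
  const results = await session.run({ hidden_states: hiddenStates });
  console.log(Object.keys(results));
}
```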