
Conversation

@tianleiwu (Contributor) commented Jan 9, 2025

### Description

The LayerNormalization spec supports broadcasting (tensors Scale and B should be unidirectionally broadcastable to tensor X):
https://onnx.ai/onnx/operators/onnx__LayerNormalization.html
However, the current implementation only allows the scale and bias shapes to be X.shape()[axis:].

Example of input tensors normalized with axis=2:

| X shape | Scale shape | B shape | Before | After |
| - | - | - | - | - |
| (B, S, D) | (D) | (D) | Supported | Supported |
| (B, S, D) | (1, 1, D) | (1, 1, D) | Supported | Supported |
| (B, S, D) | (B, 1, D) | (B, 1, D) | Not Supported | Supported |
| (B, S, D) | (1, S, D) | (1, S, D) | Not Supported | Supported |
| (B, S, D) | (B, S, D) | (B, S, D) | Not Supported | Supported |

Here we add limited support: axis=2; scale and bias have the same shape; scale, bias, and X have the same number of dimensions. This covers common use cases in LLM and vision models.
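
Below is a minimal NumPy sketch (not the onnxruntime kernel; the function name `layer_norm_broadcast` is hypothetical) of what the newly supported cases compute, assuming axis=2 and Scale/B with the same number of dimensions as X:

```python
import numpy as np

def layer_norm_broadcast(X, scale, bias, axis=2, epsilon=1e-5):
    # Normalize over the trailing axes starting at `axis`, then apply
    # Scale/B with NumPy broadcasting, which covers (D), (1, 1, D),
    # (B, 1, D), (1, S, D), and (B, S, D) from the table above.
    axes = tuple(range(axis, X.ndim))
    mean = X.mean(axis=axes, keepdims=True)
    var = X.var(axis=axes, keepdims=True)
    return (X - mean) / np.sqrt(var + epsilon) * scale + bias

B, S, D = 2, 4, 8
X = np.random.rand(B, S, D).astype(np.float32)
scale = np.random.rand(B, 1, D).astype(np.float32)  # per-batch scale, broadcast over S
bias = np.zeros((B, 1, D), dtype=np.float32)
print(layer_norm_broadcast(X, scale, bias).shape)  # (2, 4, 8)
```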

### Motivation and Context

Support Stable Diffusion 3.x and Flux models.

@tianleiwu tianleiwu marked this pull request as draft January 9, 2025 07:26
@tianleiwu tianleiwu marked this pull request as ready for review January 9, 2025 22:28
@tianleiwu tianleiwu requested a review from jiafatom January 9, 2025 22:29
@tianleiwu tianleiwu merged commit 73f5b0c into main Jan 11, 2025
98 checks passed
@tianleiwu tianleiwu deleted the tlwu/layer_norm_broadcast branch January 11, 2025 05:57
guschmue pushed a commit that referenced this pull request Jan 12, 2025
tianleiwu added a commit that referenced this pull request Jan 14, 2025
### Description

This change depends on the following PR:
- #23297

Optimize the ONNX pipeline for Stable Diffusion 3.x and Flux 1.0 models
(fp32 or fp16).
- [x] Update optimize_pipeline script
- [x] Update benchmark script
- [x] Update documentation for Stable Diffusion 3.x and Flux 1.0 models
- [x] Add graph optimizations for the MMDiT model
  - [x] FastGelu fusion
  - [x] RMSNorm fusion
  - [x] MultiHeadAttention fusion
- [x] Add graph optimizations for Flux transformer models
  - [x] MultiHeadAttention fusion
- [x] Update graph optimizations for T5
- [x] Add tests

Example: optimizing the Flux 1.0 Schnell pipeline with optimize_pipeline.py:
```
python optimize_pipeline.py -i ./flux1_schnell_onnx/fp32 -o ./flux1_schnell_onnx/fp16 --float16

  Optimize flux1_schnell_onnx/fp32/transformer/model.onnx ...
  Fused LayerNormalization: 115
  Fused SimplifiedLayerNormalization: 152
  Fused FastGelu: 76
  Fused MultiHeadAttention: 57
```
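
For reference, a similar optimization can also be driven from the Python API rather than the CLI script. The sketch below is an assumption about usage, not taken from this PR; in particular the model_type value "mmdit" is hypothetical, so check optimize_pipeline.py for the exact names registered for SD 3.x / Flux transformer models:

```python
from onnxruntime.transformers import optimizer

# Optimize one pipeline component; model_type="mmdit" is a placeholder value.
opt_model = optimizer.optimize_model(
    "flux1_schnell_onnx/fp32/transformer/model.onnx",
    model_type="mmdit",
    opt_level=0,
)
opt_model.convert_float_to_float16(keep_io_types=True)  # fp16 weights, fp32 graph I/O
opt_model.save_model_to_file("flux1_schnell_onnx/fp16/transformer/model.onnx")
```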

### H100 Benchmark Results

* GPU: NVIDIA H100 80GB HBM3
* Image Size: 1024x1024
* Batch Size: 1

Model | Steps | Precision | Engine | Latency (Seconds) | GPU Memory (MB)
-- | -- | -- | -- | -- | --
Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 8.198 | 37,603
Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 10.762 | 41,469
Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 10.891 | 43,545
Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 12.339 | 36,651
Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 0.775 | 37,857
Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 0.931 | 41,433
Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 0.939 | 43,809
Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 1.120 | 36,629
SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 7.466 | 32,217
SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 10.275 | 36,609
SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 10.283 | 36,729
SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 11.615 | 31,517
SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 3.240 | 21,143
SD 3.5 Medium | 50 | FP16+BF16 | Optimum (ORT) | 4.799 | 25,097
SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 4.838 | 25,109
SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 5.582 | 20,489

### A100 Benchmark Results

* GPU: A100-SXM4-80GB
* Image Size: 1024x1024
* Batch Size: 1

Model | Steps | Precision | Engine | Latency (Seconds) | GPU Memory (MB)
-- | -- | -- | -- | -- | --
Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (compile) | 17.593 | 37,723
Flux 1.0 Dev | 50 | FP16+BF16 | Optimum (ORT) | 21.918 | 41,348
Flux 1.0 Dev | 50 | FP16+FP32 | Optimum (ORT) | 22.060 | 44,860
Flux 1.0 Dev | 50 | BF16 | Torch 2.5.1 (eager) | 24.267 | 36,847
Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (compile) | 1.627 | 37,881
Flux 1.0 Schnell | 4 | FP16+BF16 | Optimum (ORT) | 1.884 | 41,537
Flux 1.0 Schnell | 4 | FP16+FP32 | Optimum (ORT) | 1.902 | 44,858
Flux 1.0 Schnell | 4 | BF16 | Torch 2.5.1 (eager) | 2.162 | 36,831
SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (compile) | 15.881 | 32,307
SD 3.5 Large | 50 | FP16+FP32 | Optimum (ORT) | 19.837 | 36,451
SD 3.5 Large | 50 | FP16+BF16 | Optimum (ORT) | 19.964 | 36,461
SD 3.5 Large | 50 | BF16 | Torch 2.5.1 (eager) | 22.477 | 31,513
SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (compile) | 6.476 | 21,341
SD 3.5 Medium | 50 | FP16+FP32 | Optimum (ORT) | 8.775 | 25,183
SD 3.5 Medium | 50 | BF16 | Torch 2.5.1 (eager) | 10.057 | 20,433

### Future Work

* Triton kernel for matrix multiplication and auto tuning.
* FP8/Int8 quantization

### Motivation and Context

SD 3.5 Architecture:

https://huggingface.co/stabilityai/stable-diffusion-3.5-medium/resolve/main/mmdit-x.png
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
tianleiwu pushed a commit that referenced this pull request Dec 11, 2025
…ion (#26613)

### Description
This PR adds full and spec-compliant broadcasting support to both
LayerNormalization and RMSNormalization.

Previously, onnxruntime supported only a partial set of broadcasting cases (based on the logic introduced in PR #23297). That implementation handled several common cases but did not cover all valid broadcasting scenarios.

This PR introduces a complete generic broadcasting path, following the
[ONNX specification
rules](https://github.com/onnx/onnx/blob/main/docs/Broadcasting.md).
The previous implementation is preserved as a fast-path and is still
used whenever the Scale/Bias shapes match directly.
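
As an illustration of the generic path (a Python sketch only; the actual implementation is the C++ kernel in layer_norm_helper.h / layer_norm_impl.cc, and broadcast_index is a hypothetical helper), each index into X maps to a Scale/Bias index by right-aligning the parameter shape and treating size-1 dimensions as repeated:

```python
def broadcast_index(x_index, param_shape):
    # Right-align param_shape against the index into X; size-1 dims map to 0.
    offset = len(x_index) - len(param_shape)
    return tuple(0 if param_shape[i] == 1 else x_index[offset + i]
                 for i in range(len(param_shape)))

# X of shape (2, 4, 8) with Scale of shape (4, 1): X[1, 3, 5] uses Scale[3, 0].
print(broadcast_index((1, 3, 5), (4, 1)))  # (3, 0)
```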

Main changes:

- Extended broadcasting logic in:
  - layer_norm_helper.h
  - layer_norm_impl.cc
- Added full support for all valid broadcasting configurations of Scale and Bias.
- Preserved the previous partial logic as a fast path for exact-match cases.
- Added comprehensive tests to:
  - layer_norm_op_test.cc
  - rms_norm_op_test.cc


### Motivation and Context
Before this fix, some valid ONNX broadcasting shapes were rejected in
LayerNormalization and RMSNormalization.
This PR brings the operators into full alignment with the ONNX
specification and fixes models that previously failed due to incomplete
broadcasting support.

Fixes #26432
Fixes #18184
Sumit2318 pushed a commit that referenced this pull request Jan 6, 2026