Add MLFloat16 support for LayerNormalization, SkipLayerNormalization#22063

Merged
amarin16 merged 20 commits into microsoft:main from amarin16:dev/amarin16/layer_norm
Sep 24, 2024

@amarin16
Contributor

amarin16 commented Sep 11, 2024

Add MLFloat16 support for:

  • LayerNormalization
  • SimplifiedLayerNormalization
  • SkipLayerNormalization
  • SkipSimplifiedLayerNormalization

There are existing LayerNormTest unit tests that cover the MLFloat16 functionality for LayerNormalization once MLFloat16 is registered (for example LayerNormTest.LayerNorm_Scale_Float16Input).

Similarly, there are unit tests such as SkipLayerNormTest.SkipLayerNormBatch1_Float16 that cover MLFloat16 inputs for SkipLayerNormalization.
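For readers unfamiliar with the operators listed above, here is a minimal sketch of the per-row math, assuming the arithmetic is carried out in float32 once the MLFloat16 inputs have been converted. `LayerNormRow` is a hypothetical helper written for illustration; it is not the ONNX Runtime kernel, and it assumes the simplified variant skips both the mean subtraction and the bias.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not ONNX Runtime code): normalizes one row in float32.
//   simplified == false: y = (x - mean) / sqrt(var + eps) * scale + bias
//   simplified == true:  y = x / sqrt(mean(x^2) + eps) * scale
std::vector<float> LayerNormRow(const std::vector<float>& x,
                                const std::vector<float>& scale,
                                const std::vector<float>& bias,  // may be empty
                                float epsilon,
                                bool simplified) {
  assert(!x.empty() && x.size() == scale.size());
  assert(bias.empty() || bias.size() == x.size());
  const float n = static_cast<float>(x.size());

  // Accumulate the sum and the sum of squares in a single pass.
  float sum = 0.0f;
  float sum_sq = 0.0f;
  for (float v : x) {
    sum += v;
    sum_sq += v * v;
  }

  // The simplified variant treats the mean as zero.
  const float mean = simplified ? 0.0f : sum / n;
  // var = E[x^2] - mean^2; with mean == 0 this is just the mean square.
  const float denom = std::sqrt(sum_sq / n - mean * mean + epsilon);

  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = (x[i] - mean) / denom * scale[i];
    if (!simplified && !bias.empty()) {
      y[i] += bias[i];
    }
  }
  return y;
}
```

The Skip variants would first add the `skip` tensor (and an optional bias) to the input element-wise and then apply the same normalization to the sum.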

@tianleiwu
Contributor

Could it be faster to use MlasConvertHalfToFloatBuffer to convert inputs?

@amarin16
Contributor Author

> Could it be faster to use MlasConvertHalfToFloatBuffer to convert inputs?

@tianleiwu From what I can see, the logic inside MLFloat16.ToFloat() is very similar to the logic inside MlasConvertHalfToFloatBuffer.

@tianleiwu
Contributor

tianleiwu commented Sep 11, 2024

> @tianleiwu From what I can see, the logic inside MLFloat16.ToFloat() is very similar to the one inside MlasConvertHalfToFloatBuffer

That's the slow path. The fast path uses an assembly kernel built on AVX_NE_CONVERT instructions, which might be faster. (It is not guaranteed, since there is extra I/O if we use temp buffers to hold the converted inputs. We may need to run a benchmark to see whether it helps.)
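To make the trade-off concrete, the following is a rough sketch of the scalar ("slow path") approach: each IEEE 754 half value is widened to float with bit manipulation, and the whole input is written to a temporary float buffer before normalization. Both function names are hypothetical and stand in for the real routines; the actual MLAS fast path is a vectorized assembly kernel and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative scalar conversion from IEEE 754 binary16 bits to a float.
float HalfBitsToFloat(std::uint16_t h) {
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
  std::uint32_t exp = (h >> 10) & 0x1Fu;
  std::uint32_t mant = h & 0x3FFu;
  std::uint32_t bits = 0;

  if (exp == 0x1Fu) {
    // Infinity or NaN: all-ones exponent, payload carried over.
    bits = sign | 0x7F800000u | (mant << 13);
  } else if (exp != 0) {
    // Normal number: rebias the exponent from 15 to 127 (difference of 112).
    bits = sign | ((exp + 112u) << 23) | (mant << 13);
  } else if (mant == 0) {
    // Signed zero.
    bits = sign;
  } else {
    // Subnormal half: shift until the implicit leading bit appears.
    exp = 113u;  // 127 - 15 + 1
    while ((mant & 0x400u) == 0) {
      mant <<= 1;
      --exp;
    }
    bits = sign | (exp << 23) | ((mant & 0x3FFu) << 13);
  }

  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical bulk conversion into a caller-provided temporary buffer.
// The extra write and later re-read of dst is the "extra I/O" mentioned above.
void ConvertHalfBufferToFloat(const std::uint16_t* src, float* dst, std::size_t count) {
  assert(count == 0 || (src != nullptr && dst != nullptr));
  for (std::size_t i = 0; i < count; ++i) {
    dst[i] = HalfBitsToFloat(src[i]);
  }
}
```

Whether converting up front into a temporary buffer beats converting element by element inside the normalization loop depends on the hardware and tensor size, which is why a benchmark is suggested.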

@tianleiwu
Contributor

Please fix the formatting as follows:

pip install -r requirements-lintrunner.txt
pip install lintrunner
lintrunner init
lintrunner -a

@tianleiwu
Contributor

/azp run Linux CPU CI Pipeline, Windows CPU CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@xadupre
Member

xadupre commented Sep 12, 2024

It seems to work. You still need to update the markdown pages for the documentation. You can generate them again or manually fix the differences by looking at the job output which makes a diff between the current version and the automatically generated one: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1491838&view=logs&j=7f366e99-16b2-52cc-e1ff-653af284e397&t=5c9cf234-f957-5fdc-9c37-89899bf73c0c&l=58.

amarin16 marked this pull request as ready for review September 12, 2024 20:41
@yufenglee
Member

How does the perf compare with the fp32 version?

amarin16 merged commit eb2506d into microsoft:main Sep 24, 2024
amarin16 added a commit that referenced this pull request Oct 15, 2024
- Added a microbenchmark for the `LayerNormalization` MLFloat16 support
added in #22063.
- Updated the `LayerNormalization` MLFloat16 implementation to improve
the latency.

```
----------------------------------------------------------------------------------------------
Original MLFloat16 support                                   Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time      15599 us        15625 us           47
BM_LayerNormalization<MLFloat16, float>/1/real_time      14714 us        14824 us           39
BM_LayerNormalization<MLFloat16, float>/1/real_time      14634 us        14688 us           50


----------------------------------------------------------------------------------------------
Updated MLFloat16 support                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time       7276 us         7254 us           84
BM_LayerNormalization<MLFloat16, float>/1/real_time       6820 us         6720 us           93
BM_LayerNormalization<MLFloat16, float>/1/real_time       6840 us         6882 us           84
```
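As a quick way to summarize the table above, the sketch below averages the three `real_time` samples of each run and takes their ratio. `Mean` and `Speedup` are illustrative helpers, not part of the benchmark code.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty set of timing samples.
double Mean(const std::vector<double>& samples) {
  assert(!samples.empty());
  return std::accumulate(samples.begin(), samples.end(), 0.0) /
         static_cast<double>(samples.size());
}

// Ratio of the mean latency before the change to the mean latency after it.
double Speedup(const std::vector<double>& before_us,
               const std::vector<double>& after_us) {
  return Mean(before_us) / Mean(after_us);
}
```

Calling `Speedup({15599, 14714, 14634}, {7276, 6820, 6840})` gives means of roughly 14982 us and 6979 us, or about a 2.15x reduction in latency for this benchmark configuration.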
guschmue pushed a commit that referenced this pull request Oct 18, 2024
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Aug 19, 2025
rohan11235813 pushed a commit to quadric-io/onnxruntime that referenced this pull request Sep 15, 2025