Add MLFloat16 support for LayerNormalization, SkipLayerNormalization #22063
amarin16 merged 20 commits into microsoft:main
Could it be faster to use `MlasConvertHalfToFloatBuffer` to convert inputs? |
@tianleiwu From what I can see, the logic inside |
That's the slow path. The fast path uses an assembly kernel built on AVX_NE_CONVERT instructions, which might be faster. (It is not guaranteed, since there is extra I/O if we use temp buffers to hold the cast inputs. We may need to run a benchmark to see whether it helps.) |
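To make the trade-off in this thread concrete, here is a minimal, hedged sketch of the "convert into a temp buffer, then normalize in float" approach. It is not the code from this PR and it does not call MLAS: `HalfBitsToFloat`, `ConvertHalfBufferSketch`, and `LayerNormRowFromHalf` are hypothetical names, the fp16 values are passed as raw `uint16_t` bit patterns rather than `MLFloat16`, and scale, bias, and the cast back to fp16 are omitted. The scalar conversion stands in for the portable slow path; a vectorized kernel would replace the loop in `ConvertHalfBufferSketch`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Scalar IEEE 754 binary16 -> binary32 conversion (portable "slow path").
inline float HalfBitsToFloat(std::uint16_t h) {
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
  const std::uint32_t exp = (h >> 10) & 0x1Fu;
  std::uint32_t mant = h & 0x3FFu;
  std::uint32_t bits = 0;
  if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
  } else if (exp != 0) {
    bits = sign | ((exp + 112u) << 23) | (mant << 13);  // normal: rebias 15 -> 127
  } else if (mant == 0) {
    bits = sign;  // signed zero
  } else {
    // Subnormal half: shift until the implicit leading 1 appears at bit 10.
    int shift = 0;
    while ((mant & 0x400u) == 0) {
      mant <<= 1;
      ++shift;
    }
    mant &= 0x3FFu;
    bits = sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (mant << 13);
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Hypothetical stand-in for a bulk conversion routine: one pass over the input.
inline void ConvertHalfBufferSketch(const std::uint16_t* src, float* dst, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) dst[i] = HalfBitsToFloat(src[i]);
}

// Normalizes one row: convert to a float temp buffer, then compute in float.
// Assumption: no scale/bias, and the result stays in float for simplicity.
inline std::vector<float> LayerNormRowFromHalf(const std::uint16_t* x, std::size_t n,
                                               float epsilon) {
  assert(x != nullptr && n > 0);
  std::vector<float> tmp(n);  // the extra buffer (and extra I/O) discussed above
  ConvertHalfBufferSketch(x, tmp.data(), n);

  float mean = 0.0f;
  for (float v : tmp) mean += v;
  mean /= static_cast<float>(n);

  float var = 0.0f;
  for (float v : tmp) var += (v - mean) * (v - mean);
  var /= static_cast<float>(n);

  const float inv_std = 1.0f / std::sqrt(var + epsilon);
  for (float& v : tmp) v = (v - mean) * inv_std;
  return tmp;
}
```

The temp buffer costs one extra write and read per element, which is why the comment above suggests benchmarking before assuming the bulk conversion wins.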
|
Please fix format like the following: |
|
/azp run Linux CPU CI Pipeline, Windows CPU CI Pipeline |
|
Azure Pipelines successfully started running 2 pipeline(s). |
|
SkipLayerNormTest.SkipLayerNormBatch1 unit test failed: |
Seems to be passing in the latest build, as well as locally |
|
It seems to work. You still need to update the markdown pages for the documentation. You can generate them again or manually fix the differences by looking at the job output which makes a diff between the current version and the automatically generated one: https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1491838&view=logs&j=7f366e99-16b2-52cc-e1ff-653af284e397&t=5c9cf234-f957-5fdc-9c37-89899bf73c0c&l=58. |
|
How does the perf compare with the fp32 version?
- Added a microbenchmark for the `LayerNormalization` MLFloat16 support added in #22063.
- Updated the `LayerNormalization` MLFloat16 implementation to improve the latency.

```
----------------------------------------------------------------------------------------------
Original MLFloat16 support                                   Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time      15599 us        15625 us           47
BM_LayerNormalization<MLFloat16, float>/1/real_time      14714 us        14824 us           39
BM_LayerNormalization<MLFloat16, float>/1/real_time      14634 us        14688 us           50
----------------------------------------------------------------------------------------------
Updated MLFloat16 support                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time       7276 us         7254 us           84
BM_LayerNormalization<MLFloat16, float>/1/real_time       6820 us         6720 us           93
BM_LayerNormalization<MLFloat16, float>/1/real_time       6840 us         6882 us           84
```
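As a rough way to read the benchmark output above, the sketch below averages the three reported wall-clock times for each variant and takes their ratio. `MeanMicros` and `Speedup` are hypothetical helper names, not part of the benchmark code, and three runs per variant is a small sample, so the result is only an estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Arithmetic mean of a set of timings, in microseconds.
inline double MeanMicros(const std::vector<double>& samples) {
  assert(!samples.empty());
  double sum = 0.0;
  for (double s : samples) sum += s;
  return sum / static_cast<double>(samples.size());
}

// Ratio of mean "before" time to mean "after" time; values above 1 mean faster.
inline double Speedup(const std::vector<double>& before, const std::vector<double>& after) {
  return MeanMicros(before) / MeanMicros(after);
}
```

Calling `Speedup({15599, 14714, 14634}, {7276, 6820, 6840})` with the `Time` column gives mean times of about 14982 us and 6979 us, a ratio of roughly 2.15x in favor of the updated implementation.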
Add `MLFloat16` support for:
- `LayerNormalization`
- `SimplifiedLayerNormalization`
- `SkipLayerNormalization`
- `SkipSimplifiedLayerNormalization`

There are existing `LayerNormTest` unit tests that cover the `MLFloat16` functionality for `LayerNormalization` once `MLFloat16` is registered (for example `LayerNormTest.LayerNorm_Scale_Float16Input`).

Similarly, there are unit tests such as `SkipLayerNormTest.SkipLayerNormBatch1_Float16` that cover `MLFloat16` inputs for `SkipLayerNormalization`.