Add microbenchmark for layer normalization and improve latency by amarin16 · Pull Request #22223 · microsoft/onnxruntime

amarin16 · 2024-09-25T20:05:50Z

Added a microbenchmark for the LayerNormalization MLFloat16 support added in #22063 (Add MLFloat16 support for LayerNormalization, SkipLayerNormalization).
Updated the LayerNormalization MLFloat16 implementation to reduce latency.

----------------------------------------------------------------------------------------------
Original MLFloat16 support                                   Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time      15599 us        15625 us           47
BM_LayerNormalization<MLFloat16, float>/1/real_time      14714 us        14824 us           39
BM_LayerNormalization<MLFloat16, float>/1/real_time      14634 us        14688 us           50


----------------------------------------------------------------------------------------------
Updated MLFloat16 support                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------
BM_LayerNormalization<MLFloat16, float>/1/real_time       7276 us         7254 us           84
BM_LayerNormalization<MLFloat16, float>/1/real_time       6820 us         6720 us           93
BM_LayerNormalization<MLFloat16, float>/1/real_time       6840 us         6882 us           84
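
As a rough way to read the two tables, the three "Time" values in each can be averaged and compared. The sketch below only does that arithmetic on the numbers copied from the tables; the helper names (`MeanMicros`, `Speedup`) are made up for illustration and are not part of the benchmark.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of wall-clock timings, in microseconds.
double MeanMicros(const std::vector<double>& runs) {
  assert(!runs.empty());
  return std::accumulate(runs.begin(), runs.end(), 0.0) /
         static_cast<double>(runs.size());
}

// "Time" column values copied from the two tables above.
const std::vector<double> kOriginalRunsUs = {15599.0, 14714.0, 14634.0};
const std::vector<double> kUpdatedRunsUs = {7276.0, 6820.0, 6840.0};

// Ratio of mean original time to mean updated time (above 1 means faster).
double Speedup() {
  return MeanMicros(kOriginalRunsUs) / MeanMicros(kUpdatedRunsUs);
}
```

With these inputs the means come out to roughly 14982 us before and 6979 us after, a speedup of a little over 2.1x. Three runs per configuration is a small sample, so this is an approximate figure.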

fajin-corp · 2024-10-11T22:01:32Z

please add a summary of benchmark results. before vs. after

fajin-corp · 2024-10-11T22:31:52Z

instead of converting fp16 to fp32, do you plan to implement fp16 kernels?

amarin16 · 2024-10-14T15:26:22Z

> instead of converting fp16 to fp32, do you plan to implement fp16 kernels?

that can be done in a separate PR
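
For context on the conversion being discussed: the approach in this PR widens fp16 values to fp32 and then computes in fp32. The sketch below is a portable, bit-level version of that widening step for IEEE 754 binary16 inputs. It is an illustration only, not the code used in the PR; `HalfBitsToFloat` and `ConvertHalfBuffer` are hypothetical names, and the fp16 values are assumed to be available as raw `uint16_t` bit patterns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Widens one IEEE 754 binary16 bit pattern to a binary32 float.
inline float HalfBitsToFloat(std::uint16_t h) {
  const std::uint32_t sign = (static_cast<std::uint32_t>(h) & 0x8000u) << 16;
  const std::uint32_t exp = (h >> 10) & 0x1Fu;
  std::uint32_t mant = h & 0x3FFu;
  std::uint32_t bits;
  if (exp == 0) {
    if (mant == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: shift until the implicit leading 1 appears.
      std::uint32_t shift = 0;
      while ((mant & 0x400u) == 0) {
        mant <<= 1;
        ++shift;
      }
      mant &= 0x3FFu;
      bits = sign | ((113u - shift) << 23) | (mant << 13);
    }
  } else if (exp == 31) {
    bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
  } else {
    // Normal number: rebias the exponent from 15 to 127.
    bits = sign | ((exp + 112u) << 23) | (mant << 13);
  }
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Converts a whole buffer in one pass so later loops can work on floats.
inline void ConvertHalfBuffer(const std::uint16_t* src, float* dst,
                              std::size_t count) {
  assert(src != nullptr && dst != nullptr);
  for (std::size_t i = 0; i < count; ++i) {
    dst[i] = HalfBitsToFloat(src[i]);
  }
}
```

Converting a buffer once up front, rather than element by element inside the inner loop, is the general idea behind commit messages such as "convert all inputs to float efficiently if needed". A native fp16 kernel, as suggested in the question, would skip this step entirely.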

### Description

The recent PR #22223 introduced 2 bugs in the implementation of CPU LayerNorm f16:

- Possible null pointer access for bias: `const TensorShape& bias_shape = bias->Shape();` will crash when `bias` does not exist. (Surprisingly, this does not seem to be covered by any test case.)
  - Fix: guard with a pointer check.
- A race condition inside `ComputeJob()`: `ComputeJob()` is dispatched to the thread pool and internally tries to modify `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_`, which are `std::unique_ptr`s and are not thread-safe.
  - Fix: move the modification of `LayerNormImpl::scale_fp32_` and `LayerNormImpl::bias_fp32_` out of `ComputeJob()` and into `LayerNormImpl::ComputeWithoutContext()`. A race condition may still be possible because `ConcurrentRunSupported` is set to `true` for the CPU EP, so an OrtMutex was added.

This should fix the recent flaky tests as well.
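
The two fixes described above can be sketched in isolation. This is a simplified stand-in, not the onnxruntime code: `LayerNormSketch` is a hypothetical class, `std::mutex` and `std::thread` stand in for OrtMutex and the thread pool, plain `float` vectors stand in for the converted fp16 tensors, and the per-row job only applies scale and bias rather than a full layer normalization. It also assumes every call on one instance passes the same scale and bias, as a kernel instance would.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class LayerNormSketch {
 public:
  // `bias` is optional and may be nullptr.
  // `x` holds rows * scale.size() values, row-major.
  std::vector<float> Compute(const std::vector<float>& x,
                             const std::vector<float>& scale,
                             const std::vector<float>* bias) {
    const float* scale_ptr = nullptr;
    const float* bias_ptr = nullptr;
    {
      // Shared buffers are filled once, under a lock, before any job starts.
      std::lock_guard<std::mutex> lock(mutex_);
      if (!scale_fp32_) {
        scale_fp32_ = std::make_unique<std::vector<float>>(scale);
      }
      if (bias != nullptr) {  // pointer check before dereferencing
        if (!bias_fp32_) {
          bias_fp32_ = std::make_unique<std::vector<float>>(*bias);
        }
        bias_ptr = bias_fp32_->data();
      }
      scale_ptr = scale_fp32_->data();
    }

    const std::size_t width = scale.size();
    const std::size_t rows = width == 0 ? 0 : x.size() / width;
    std::vector<float> y(x.size());
    std::vector<std::thread> jobs;
    for (std::size_t r = 0; r < rows; ++r) {
      jobs.emplace_back([&, r] {
        ComputeJob(x.data() + r * width, y.data() + r * width, width,
                   scale_ptr, bias_ptr);
      });
    }
    for (auto& job : jobs) job.join();
    return y;
  }

 private:
  // Runs concurrently; only reads shared data and writes its own output row.
  static void ComputeJob(const float* in, float* out, std::size_t width,
                         const float* scale, const float* bias) {
    for (std::size_t i = 0; i < width; ++i) {
      out[i] = in[i] * scale[i] + (bias != nullptr ? bias[i] : 0.0f);
    }
  }

  std::mutex mutex_;
  std::unique_ptr<std::vector<float>> scale_fp32_;
  std::unique_ptr<std::vector<float>> bias_fp32_;
};
```

The design point is that the parallel jobs receive read-only pointers and never touch the `std::unique_ptr` members, so the only writes to shared state happen once, under the lock, in the calling thread.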

Add microbenchmark for layer normalization

2b8cd17

github-advanced-security AI found potential problems Sep 25, 2024

Comment thread onnxruntime/test/onnx/microbenchmark/layer_normalization.cc Fixed

amarin16 added 7 commits September 25, 2024 14:25

fix warnings

0c89631

initialize test input data at compile time

bca13ca

remove unused specialization that fails on pipeline

680cf4f

fix build on linux

f0df526

convert all inputs to float efficiently if needed

87725c3

convert output buffer efficiently in layer_norm_impl

8aa80da

convert output buffer efficiently in skip_layer_norm

295d652

amarin16 changed the title ~~Add microbenchmark for layer normalization~~ Add microbenchmark for layer normalization and improve latency Sep 30, 2024

github-advanced-security AI found potential problems Sep 30, 2024

Comment thread onnxruntime/test/onnx/microbenchmark/layer_normalization.cc Fixed

amarin16 added 3 commits September 30, 2024 11:55

add inline and fix some lint issues

405a0a0

fix some lint errors

245f298

fix warning

f398b64

github-advanced-security AI found potential problems Sep 30, 2024

amarin16 added 14 commits September 30, 2024 22:50

maybe_unused

a483ca4

Fix bug

19d225a

separate MLFloat16 implementation in skip_layer_norm

05b5037

fix linter issues

ab2e5f2

fix precision warning

63e9644

cast

11eb7fb

separate implementation for MLFloat16 inside layer_norm_impl

46775a7

don't use vectors

fd904f6

reuse allocated arrays when possible

a41b802

make_unique instead of new

6aece95

Revert "make_unique instead of new" for latency

766c4b2

This reverts commit 6aece95.

lint

cb55d4b

fix bug

2895f37

fix bug

f93ccb7

amarin16 marked this pull request as ready for review October 2, 2024 17:09

amarin16 added 7 commits October 3, 2024 10:36

handle errors

4be0255

remove checks on tensor data

48ce979

remove try/catch due to -fno-exceptions

3d6b990

Prepack scale and bias in layer_norm_impl

f04aac0

Prepack skip, gamma, beta, bias in skip_layer_norm

1eaa63f

return void from ComputeJob

26ddc6c

lint

3231cff