Conversation
Maybe this should be a PR to pyodide-recipes? cc @ryanking13
I think it should be both in here and in pyodide-recipes. We have an OpenBLAS recipe in both repositories (because of scipy, and probably for numpy later).
This looks a little bit weird to me. Why is size 10000 slower than 50000 in the baseline?
Thanks for working on this @teddygood! The benchmark looks pretty impressive (except for the comment I wrote about the huge performance drop in the size-10000 case). I compared the sizes of the three versions, and the difference between the O3 and non-O3 builds is not that big. Since the non-O3 version showed better performance in your benchmark, I think we can use the non-O3 one.
We don't need to run benchmarks every time, nor do we need to keep three different versions. I am happy with the benchmark, and I think we can keep the SIMD-enabled non-O3 build and remove the others. According to caniuse, wasm SIMD has been supported in most major browsers for a long time, so I think we can use it as the default instead of keeping two different versions. WDYT @agriyakhetarpal?
I think the first few runs took longer due to initial setup and cold-start overhead, such as library loading and internal buffer preparation. So I modified the benchmark to discard the results of the first 5 runs and compute the average of the remaining iterations.

```
==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)   SIMD-O3 (ms)   Speedup SIMD   Speedup O3
------------------------------------------------------------------------------------------
10000        0.051281        0.032122    0.043608       1.60           1.18
50000        0.063757        0.055535    0.070220       1.15           0.91
307200       0.361577        0.337165    0.314840       1.07           1.15
921600       0.992375        0.924917    0.917122       1.07           1.08
2073600      2.021727        2.009305    2.036405       1.01           0.99
==========================================================================================
```
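The warmup-discard scheme described above (run a few iterations first, then average only the remaining runs) can be sketched as follows, with `numpy.dot` standing in for the `cblas_sdot` call. The function and parameter names here are illustrative, not the PR's actual benchmark script:

```python
import time
import numpy as np

def bench_dot(size, warmup=5, iters=30):
    """Time a float32 dot product, discarding the first `warmup` runs."""
    rng = np.random.default_rng(0)
    x = rng.random(size, dtype=np.float32)
    y = rng.random(size, dtype=np.float32)
    times = []
    for i in range(warmup + iters):
        t0 = time.perf_counter()
        x.dot(y)  # dispatches to the underlying BLAS sdot
        t1 = time.perf_counter()
        if i >= warmup:  # skip cold-start iterations
            times.append(t1 - t0)
    return sum(times) / len(times) * 1000.0  # mean time in ms

avg_ms = bench_dot(10_000)
```

Discarding the warmup runs keeps one-time costs (library loading, buffer allocation, JIT tiering in a wasm runtime) out of the reported average.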
Great! The test results now look more reliable to me. Now what you need to do is:
Thanks for the feedback! I was wondering if I should keep the benchmark package.
I don't think we need to keep it. We can reference this PR and copy the code when we need to run the benchmark again.
Thanks for the feedback! Then I’ll commit the refactored benchmark and remove it afterward. I’ll proceed with the next tasks.
Here are the results from the latest benchmark. Each benchmark performs 5 warmup runs, followed by averaging 30 runs for sdot and 20 runs for sgemm. This design helps eliminate noise from initial executions (e.g., library loading or other startup overhead), so the first 5 runs are excluded from the average.

```
==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)   SIMD-O3 (ms)   Speedup SIMD   Speedup O3
------------------------------------------------------------------------------------------
10000        0.044990        0.030754    0.031492       1.46           1.43
50000        0.057981        0.054054    0.082596       1.07           0.70
307200       0.332531        0.301561    0.303728       1.10           1.09
921600       0.922771        0.899521    0.919775       1.03           1.00
2073600      2.103796        2.137795    2.126294       0.98           0.99
==========================================================================================

==========================================================================================
cblas_sgemm Benchmark Results (Matrix-Matrix Multiplication)
==========================================================================================
Shape               Baseline (ms)   SIMD (ms)    SIMD-O3 (ms)   Speedup SIMD   Speedup O3
------------------------------------------------------------------------------------------
64x64 @ 64x64       0.195546        0.135287     0.131614       1.45           1.49
128x128 @ 128x128   0.977794        0.911027     0.946723       1.07           1.03
256x256 @ 256x256   7.840875        7.549144     7.513479       1.04           1.04
384x384 @ 384x384   27.185423       27.026663    28.363954      1.01           0.96
512x512 @ 512x512   62.704544       62.936317    63.067033      1.00           0.99
==========================================================================================
```
Yes, given that the size difference/increase is pretty reasonable, I am happy to use it as the default. Thanks @ryanking13 and @teddygood for the discussion and for working on this!
ryanking13 left a comment:
Thanks! Could you please open the same PR in pyodide/pyodide-recipes as well? Unfortunately, we have two copies of the same recipe for now.
Oh, the PR got merged! Thanks! I'll open the same PR in that repository as well.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
This PR builds on #5948 and adds a SIMD-enabled OpenBLAS along with a manual benchmark script to compare it with the original build.
- Built OpenBLAS with `-msimd128` to enable WebAssembly SIMD.
- Benchmarked `cblas_sdot` (vector dot product) and `cblas_sgemm` (matrix multiplication) across:
  - `libopenblas_og.so` (baseline, no SIMD)
  - `libopenblas.so` (SIMD enabled)
  - `libopenblas_simd_o3.so` (SIMD enabled, `-O3` build)

All SIMD builds were verified for numerical correctness across `cblas_sdot` and `cblas_sgemm` operations.

Benchmark result
Next steps / Discussion
While the SIMD-enabled OpenBLAS passes all existing tests and shows some improvements, I'd like to further discuss how we should handle the benchmark itself.
To avoid consuming CI resources, the benchmark is disabled by default and only runs manually when explicitly enabled.
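One common way to keep a benchmark disabled by default while still runnable on demand is to gate it behind an environment variable. The sketch below illustrates the pattern; the flag name `ENABLE_BENCHMARK` is hypothetical and not necessarily what this PR uses:

```python
import os

# Hypothetical opt-in flag name; the PR's actual mechanism may differ.
FLAG = "ENABLE_BENCHMARK"

def run_benchmark():
    """Run the timing loops only when explicitly enabled via the flag."""
    if os.environ.get(FLAG) != "1":
        print(f"benchmark skipped (set {FLAG}=1 to run)")
        return False
    # ... sdot/sgemm timing loops would go here ...
    return True

ran = run_benchmark()
```

Because the check happens at runtime, regular CI jobs never pay the benchmark's cost, and a developer can opt in with a single environment variable.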
You can manually execute the benchmark with the following command:
I feel this approach may not be the most efficient, since it relies on multiple separate packages. If there is a better approach, please feel free to give me feedback.
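As for the numerical-correctness verification mentioned in the description, a typical pattern is to compare each build's output against a higher-precision reference within a floating-point tolerance. A minimal NumPy sketch (illustrative only, using `numpy.dot` rather than the actual wasm builds):

```python
import numpy as np

def sdot_matches_reference(size, rtol=1e-4):
    """Compare a float32 dot product against a float64 reference."""
    rng = np.random.default_rng(42)
    x = rng.random(size, dtype=np.float32)
    y = rng.random(size, dtype=np.float32)
    # Reference accumulated in float64 to bound rounding error
    reference = np.dot(x.astype(np.float64), y.astype(np.float64))
    candidate = float(np.dot(x, y))  # stand-in for a SIMD build's cblas_sdot
    return bool(np.isclose(candidate, reference, rtol=rtol))

ok = sdot_matches_reference(10_000)
```

A relative tolerance is important here: SIMD and scalar builds may reorder floating-point accumulation, so bit-exact equality is not a reasonable expectation.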