
Feat/openblas wasm simd #5960

Merged
ryanking13 merged 10 commits into pyodide:main from teddygood:feat/openblas-wasm-simd
Oct 26, 2025

Conversation

@teddygood
Contributor

teddygood commented Oct 20, 2025

Description

This PR builds on #5948 and adds a SIMD-enabled OpenBLAS along with a manual benchmark script to compare it with the original build.

  • libopenblas: rebuilt using -msimd128 to enable WebAssembly SIMD.
  • test-openblas-simd: benchmark comparing cblas_sdot (vector dot product) and cblas_sgemm (matrix multiplication) across:
    • libopenblas_og.so (baseline, no SIMD)
    • libopenblas.so (SIMD enabled)
    • libopenblas_simd_o3.so (SIMD enabled, -O3 build)
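For context, the rebuild described above typically comes down to passing -msimd128 through the package's compile flags. A hypothetical Pyodide recipe fragment (illustrative only, not the actual diff in this PR) might look like:

```yaml
# Illustrative sketch: enable WebAssembly SIMD for the OpenBLAS build
# by adding -msimd128 to the recipe's C compiler flags.
build:
  cflags: |
    -msimd128
```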

All SIMD builds were verified for numerical correctness across cblas_sdot and cblas_sgemm operations.
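A correctness check of this kind can be sketched in Python using the scipy.linalg.blas wrappers as a stand-in for the raw cblas_sdot / cblas_sgemm calls exercised by the test package (the sizes and tolerances below are illustrative, not the ones used in this PR):

```python
# Sketch of a numerical-correctness check for the two BLAS routines
# benchmarked in this PR, comparing BLAS results against NumPy references.
import numpy as np
from scipy.linalg.blas import sdot, sgemm

rng = np.random.default_rng(0)
x = rng.random(10_000, dtype=np.float32)
y = rng.random(10_000, dtype=np.float32)

# Vector dot product: BLAS result vs. NumPy, within float32 tolerance.
assert np.isclose(sdot(x, y), np.dot(x, y), rtol=1e-3)

a = rng.random((128, 128), dtype=np.float32)
b = rng.random((128, 128), dtype=np.float32)

# Matrix multiply: sgemm computes alpha * a @ b.
assert np.allclose(sgemm(1.0, a, b), a @ b, rtol=1e-3)
print("correctness checks passed")
```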

Benchmark result

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07        
307200       0.319130        0.307377        0.319088        1.04         1.00        
921600       0.967350        0.939312        0.903440        1.03         1.07        
2073600      2.103415        2.184920        2.193283        0.96         0.96        
==========================================================================================
==========================================================================================
cblas_sgemm Benchmark Results (Matrix-Matrix Multiplication)
==========================================================================================
Shape                Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
128x128 @ 128x128    3.151062        1.394323        1.722989        2.26         1.83        
256x256 @ 256x256    7.911437        7.780833        7.797250        1.02         1.01        
384x384 @ 384x384    27.864609       27.847767       29.926025       1.00         0.93        
512x512 @ 512x512    62.550344       62.101500       65.827094       1.01         0.95        
==========================================================================================

Next steps / Discussion

While the SIMD-enabled OpenBLAS passes all existing tests and shows some improvements, I'd like to further discuss how we should handle the benchmark itself.

To avoid consuming CI resources, the benchmark is disabled by default and only runs manually when explicitly enabled.

RUN_BENCHMARKS = os.environ.get("PYODIDE_RUN_OPENBLAS_BENCH") == "1"

pytestmark = pytest.mark.skipif(
    not RUN_BENCHMARKS,
    reason="OpenBLAS benchmarks run only when PYODIDE_RUN_OPENBLAS_BENCH=1",
)

You can manually execute the benchmark with the following command:

PYODIDE_RUN_OPENBLAS_BENCH=1 python3 -m pytest -k "test_benchmark" -s packages/test-openblas-simd/test_benchmark.py

I feel this approach may not be the most efficient, since it relies on multiple separate packages. If there is a better approach, please feel free to give feedback.

@hoodmane
Member

Maybe this should be a PR to pyodide-recipes? cc @ryanking13

@ryanking13
Member

Maybe this should be a PR to pyodide-recipes? cc @ryanking13

I think it should be both in here and in pyodide-recipes. We have OpenBLAS recipe in both repositories (because of scipy and probably for numpy later).

@ryanking13
Member

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07  

This looks a little bit weird to me. Why is size 10000 slower than 50000 in the baseline?

@ryanking13
Member

Thanks for working on this @teddygood! The benchmark looks pretty impressive (except for the huge performance drop in the size-10000 case that I commented on).

I tried to compare the sizes of the three versions, and the difference between the -O3 and non-O3 versions is not that big. Since the non-O3 version showed better performance in your benchmark, I think we can use the non-O3 one.

Name                           Size
----                           ----
libopenblas_og.so              5.76 MB (5,901.64 KB)
libopenblas_simd_o3.so         6.22 MB (6,370.63 KB)
libopenblas-0.3.26.zip         6.25 MB (6,396.36 KB)
libopenblas-og-0.3.26.zip      5.76 MB (5,901.77 KB)
libopenblas-simd-o3-0.3.26.zip 6.22 MB (6,370.77 KB)
libopenblas.so                 6.25 MB (6,396.24 KB)

I feel this approach may not be the most efficient, since it relies on multiple separate packages. If there is a better approach, please feel free to give feedback.

We don't need to run benchmarks every time, nor do we need to keep the three different versions. I am happy with the benchmark, and I think we can keep the SIMD-enabled non-O3 one and remove the others. According to caniuse, wasm-simd has been supported in the major browsers for a long time, so I think we can use it as the default instead of keeping two different versions. WDYT @agriyakhetarpal?

@teddygood
Contributor Author

teddygood commented Oct 22, 2025

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07  

This looks a little bit weird to me. Why is size 10000 slower than 50000 in the baseline?

I think the first few runs took longer due to initial setup and cold start overhead, such as library loading and internal buffer preparation.

So I modified the benchmark to discard the results of the first 5 runs and compute the average of the remaining iterations.

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.051281        0.032122        0.043608        1.60         1.18        
50000        0.063757        0.055535        0.070220        1.15         0.91        
307200       0.361577        0.337165        0.314840        1.07         1.15        
921600       0.992375        0.924917        0.917122        1.07         1.08        
2073600      2.021727        2.009305        2.036405        1.01         0.99        
==========================================================================================

@ryanking13
Member

So I modified the benchmark to discard the results of the first 5 runs and compute the average of the remaining iterations.

Great! The test result now looks more reliable to me.

Now what you need to do is,

  1. Remove all different versions of libopenblas, keep the simd-enabled-non-o3 version (libopenblas).
  2. Push a commit with [scipy] included in the commit message. It will trigger the scipy test suite.
  3. See if the tests pass. If it passes I think we are good to go.

@teddygood
Contributor Author

Thanks for the feedback! I was wondering if I should keep the benchmark package.

@ryanking13
Member

ryanking13 commented Oct 24, 2025

Thanks for the feedback! I was wondering if I should keep the benchmark package.

I don't think we need to keep it. We can reference this PR and copy the code whenever we need to run the benchmark again.

@teddygood
Contributor Author

teddygood commented Oct 24, 2025

Thanks for the feedback! Then I’ll commit the refactored benchmark and remove it afterward. I’ll proceed with the next tasks.

Now what you need to do is,

Remove all different versions of libopenblas, keep the simd-enabled-non-o3 version (libopenblas).
Push a commit with [scipy] included in the commit message. It will trigger the scipy test suite.
See if the tests pass. If it passes I think we are good to go.

@teddygood
Contributor Author

teddygood commented Oct 24, 2025

Here are the results from the latest benchmark.

Each benchmark performs 5 warmup runs, followed by averaging 30 runs for sdot and 20 runs for sgemm. This design helps eliminate noise from initial executions (e.g., library loading or other startup overhead), so the first 5 runs are excluded from the average.

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.044990        0.030754        0.031492        1.46         1.43        
50000        0.057981        0.054054        0.082596        1.07         0.70        
307200       0.332531        0.301561        0.303728        1.10         1.09        
921600       0.922771        0.899521        0.919775        1.03         1.00        
2073600      2.103796        2.137795        2.126294        0.98         0.99        
==========================================================================================
==========================================================================================
cblas_sgemm Benchmark Results (Matrix-Matrix Multiplication)
==========================================================================================
Shape                Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
64x64 @ 64x64        0.195546        0.135287        0.131614        1.45         1.49        
128x128 @ 128x128    0.977794        0.911027        0.946723        1.07         1.03        
256x256 @ 256x256    7.840875        7.549144        7.513479        1.04         1.04        
384x384 @ 384x384    27.185423       27.026663       28.363954       1.01         0.96        
512x512 @ 512x512    62.704544       62.936317       63.067033       1.00         0.99        
==========================================================================================
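The warmup-then-average scheme described above can be sketched as follows (the harness and workload below are illustrative, not the exact code in the test package):

```python
# Minimal sketch of the timing scheme used in the benchmark: discard the
# first few runs (library loading, cold caches), then report the mean of
# the remaining iterations in milliseconds.
import statistics
import time


def bench(fn, *, warmup=5, runs=30):
    """Return mean wall time in ms over `runs`, after `warmup` discarded runs."""
    for _ in range(warmup):
        fn()  # warmup iterations are timed but thrown away
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples)


# Example workload standing in for a cblas_sdot call.
data = list(range(10_000))
mean_ms = bench(lambda: sum(a * b for a, b in zip(data, data)))
print(f"mean: {mean_ms:.3f} ms")
```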

@agriyakhetarpal
Member

I am happy with the benchmark, and I think we can keep the SIMD-enabled non-O3 one and remove the others. According to caniuse, wasm-simd has been supported in the major browsers for a long time, so I think we can use it as the default instead of keeping two different versions. WDYT @agriyakhetarpal?

Yes, given the size difference/increase is pretty reasonable, I am happy to use it as the default. Thanks @ryanking13 and @teddygood for the discussion and for working on this!

Member

ryanking13 left a comment


Thanks! Could you please open the same PR in pyodide/pyodide-recipes as well? Unfortunately, we have two copies of the same recipe for now.

@ryanking13 ryanking13 merged commit df94f6e into pyodide:main Oct 26, 2025
38 of 40 checks passed
@teddygood
Contributor Author

Oh, the PR got merged! Thanks! I'll open the same PR in pyodide/pyodide-recipes as well.

Drranny pushed a commit to Drranny/pyodide that referenced this pull request Feb 15, 2026
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
