
Feat/openblas wasm simd #5960

Merged
ryanking13 merged 10 commits into pyodide:main from teddygood:feat/openblas-wasm-simd
Oct 26, 2025

Conversation

@teddygood
Contributor

teddygood commented Oct 20, 2025

Description

This PR builds on #5948 and adds a SIMD-enabled OpenBLAS along with a manual benchmark script to compare it with the original build.

  • libopenblas: rebuilt using -msimd128 to enable WebAssembly SIMD.
  • test-openblas-simd: benchmark comparing cblas_sdot (vector dot product) and cblas_sgemm (matrix multiplication) across:
    • libopenblas_og.so (baseline, no SIMD)
    • libopenblas.so (SIMD enabled)
    • libopenblas_simd_o3.so (SIMD enabled, -O3 build)
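For context, the rebuild described above typically comes down to passing -msimd128 through the package's compile flags. A hypothetical Pyodide recipe fragment (illustrative only, not the actual diff in this PR) might look like:

```yaml
# Illustrative sketch: enable WebAssembly SIMD for the OpenBLAS build
# by adding -msimd128 to the recipe's C compiler flags.
build:
  cflags: |
    -msimd128
```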

All SIMD builds were verified for numerical correctness across cblas_sdot and cblas_sgemm operations.
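A correctness check of this kind can be sketched in Python using the scipy.linalg.blas wrappers as a stand-in for the raw cblas_sdot / cblas_sgemm calls exercised by the test package (the sizes and tolerances below are illustrative, not the ones used in this PR):

```python
# Sketch of a numerical-correctness check for the two BLAS routines
# benchmarked in this PR, comparing BLAS results against NumPy references.
import numpy as np
from scipy.linalg.blas import sdot, sgemm

rng = np.random.default_rng(0)
x = rng.random(10_000, dtype=np.float32)
y = rng.random(10_000, dtype=np.float32)

# Vector dot product: BLAS result vs. NumPy, within float32 tolerance.
assert np.isclose(sdot(x, y), np.dot(x, y), rtol=1e-3)

a = rng.random((128, 128), dtype=np.float32)
b = rng.random((128, 128), dtype=np.float32)

# Matrix multiply: sgemm computes alpha * a @ b.
assert np.allclose(sgemm(1.0, a, b), a @ b, rtol=1e-3)
print("correctness checks passed")
```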

Benchmark result

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07        
307200       0.319130        0.307377        0.319088        1.04         1.00        
921600       0.967350        0.939312        0.903440        1.03         1.07        
2073600      2.103415        2.184920        2.193283        0.96         0.96        
==========================================================================================
==========================================================================================
cblas_sgemm Benchmark Results (Matrix-Matrix Multiplication)
==========================================================================================
Shape                Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
128x128 @ 128x128    3.151062        1.394323        1.722989        2.26         1.83        
256x256 @ 256x256    7.911437        7.780833        7.797250        1.02         1.01        
384x384 @ 384x384    27.864609       27.847767       29.926025       1.00         0.93        
512x512 @ 512x512    62.550344       62.101500       65.827094       1.01         0.95        
==========================================================================================

Next steps / Discussion

While the SIMD-enabled OpenBLAS passes all existing tests and shows some improvements, I'd like to further discuss how we should handle the benchmark itself.

To avoid consuming CI resources, the benchmark is disabled by default and only runs manually when explicitly enabled.

RUN_BENCHMARKS = os.environ.get("PYODIDE_RUN_OPENBLAS_BENCH") == "1"

pytestmark = pytest.mark.skipif(
    not RUN_BENCHMARKS,
    reason="OpenBLAS benchmarks run only when PYODIDE_RUN_OPENBLAS_BENCH=1",
)

You can manually execute the benchmark with the following command:

PYODIDE_RUN_OPENBLAS_BENCH=1 python3 -m pytest -k "test_benchmark" -s packages/test-openblas-simd/test_benchmark.py

I feel this approach may not be the most efficient, since it relies on multiple separate packages. If there is a better approach, please feel free to give feedback.

@hoodmane
Member

Maybe this should be a PR to pyodide-recipes? cc @ryanking13

@ryanking13
Member

Maybe this should be a PR to pyodide-recipes? cc @ryanking13

I think it should be both in here and in pyodide-recipes. We have OpenBLAS recipe in both repositories (because of scipy and probably for numpy later).

@ryanking13
Member

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07  

This looks a little bit weird to me. Why is size 10000 slower than 50000 in the baseline?

@ryanking13
Member

Thanks for working on this @teddygood! The benchmark looks pretty impressive (except for the huge performance drop in the size-10000 case that I commented on).

I tried to compare the sizes of the three versions, and the difference between the -O3 and non-O3 versions is not that big. Since the non-O3 version showed better performance in your benchmark, I think we can use the non-O3 one.

Name                           Size
----                           ----
libopenblas_og.so              5.76 MB (5,901.64 KB)
libopenblas_simd_o3.so         6.22 MB (6,370.63 KB)
libopenblas-0.3.26.zip         6.25 MB (6,396.36 KB)
libopenblas-og-0.3.26.zip      5.76 MB (5,901.77 KB)
libopenblas-simd-o3-0.3.26.zip 6.22 MB (6,370.77 KB)
libopenblas.so                 6.25 MB (6,396.24 KB)

I feel this approach may not be the most efficient, since it relies on multiple separate packages. If there is a better approach, please feel free to give feedback.

We don't need to run benchmarks every time, nor do we need to keep the three different versions. I am happy with the benchmark, and I think we can keep the SIMD-enabled non-O3 one and remove the others. According to caniuse, wasm-simd has been supported in the major browsers for a long time, so I think we can use it as the default instead of keeping two different versions. WDYT @agriyakhetarpal?

@teddygood
Contributor Author

teddygood commented Oct 22, 2025

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.239973        0.041328        0.047072        5.81         5.10        
50000        0.074845        0.063175        0.070183        1.18         1.07  

This looks a little bit weird to me. Why is size 10000 slower than 50000 in the baseline?

I think the first few runs took longer due to initial setup and cold start overhead, such as library loading and internal buffer preparation.

So I modified the benchmark to discard the results of the first 5 runs and compute the average of the remaining iterations.

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.051281        0.032122        0.043608        1.60         1.18        
50000        0.063757        0.055535        0.070220        1.15         0.91        
307200       0.361577        0.337165        0.314840        1.07         1.15        
921600       0.992375        0.924917        0.917122        1.07         1.08        
2073600      2.021727        2.009305        2.036405        1.01         0.99        
==========================================================================================

@ryanking13
Member

So I modified the benchmark to discard the results of the first 5 runs and compute the average of the remaining iterations.

Great! The test result now looks more reliable to me.

Now what you need to do is,

  1. Remove all different versions of libopenblas, keep the simd-enabled-non-o3 version (libopenblas).
  2. Push a commit with [scipy] included in the commit message. It will trigger the scipy test suite.
  3. See if the tests pass. If it passes I think we are good to go.

@teddygood
Contributor Author

Thanks for the feedback! I was wondering if I should keep the benchmark package.

@ryanking13
Member

ryanking13 commented Oct 24, 2025

Thanks for the feedback! I was wondering if I should keep the benchmark package.

I don't think we need to keep it. We can reference this PR and copy the code whenever we need to run the benchmark again.

@teddygood
Contributor Author

teddygood commented Oct 24, 2025

Thanks for the feedback! Then I’ll commit the refactored benchmark and remove it afterward. I’ll proceed with the next tasks.

Now what you need to do is,

Remove all different versions of libopenblas, keep the simd-enabled-non-o3 version (libopenblas).
Push a commit with [scipy] included in the commit message. It will trigger the scipy test suite.
See if the tests pass. If it passes I think we are good to go.

@teddygood
Contributor Author

teddygood commented Oct 24, 2025

Here are the results from the latest benchmark.

Each benchmark performs 5 warmup runs, followed by averaging 30 runs for sdot and 20 runs for sgemm. This design helps eliminate noise from initial executions (e.g., library loading or other startup overhead), so the first 5 runs are excluded from the average.

==========================================================================================
cblas_sdot Benchmark Results (Vector Dot Product)
==========================================================================================
Size         Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
10000        0.044990        0.030754        0.031492        1.46         1.43        
50000        0.057981        0.054054        0.082596        1.07         0.70        
307200       0.332531        0.301561        0.303728        1.10         1.09        
921600       0.922771        0.899521        0.919775        1.03         1.00        
2073600      2.103796        2.137795        2.126294        0.98         0.99        
==========================================================================================
==========================================================================================
cblas_sgemm Benchmark Results (Matrix-Matrix Multiplication)
==========================================================================================
Shape                Baseline (ms)   SIMD (ms)       SIMD-O3 (ms)    Speedup SIMD Speedup O3  
------------------------------------------------------------------------------------------
64x64 @ 64x64        0.195546        0.135287        0.131614        1.45         1.49        
128x128 @ 128x128    0.977794        0.911027        0.946723        1.07         1.03        
256x256 @ 256x256    7.840875        7.549144        7.513479        1.04         1.04        
384x384 @ 384x384    27.185423       27.026663       28.363954       1.01         0.96        
512x512 @ 512x512    62.704544       62.936317       63.067033       1.00         0.99        
==========================================================================================
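The warmup-then-average scheme described above can be sketched as follows (the harness and workload below are illustrative, not the exact code in the test package):

```python
# Minimal sketch of the timing scheme used in the benchmark: discard the
# first few runs (library loading, cold caches), then report the mean of
# the remaining iterations in milliseconds.
import statistics
import time


def bench(fn, *, warmup=5, runs=30):
    """Return mean wall time in ms over `runs`, after `warmup` discarded runs."""
    for _ in range(warmup):
        fn()  # warmup iterations are timed but thrown away
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(samples)


# Example workload standing in for a cblas_sdot call.
data = list(range(10_000))
mean_ms = bench(lambda: sum(a * b for a, b in zip(data, data)))
print(f"mean: {mean_ms:.3f} ms")
```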

@agriyakhetarpal
Member

I am happy with the benchmark, and I think we can keep the SIMD-enabled non-O3 one and remove the others. According to caniuse, wasm-simd has been supported in the major browsers for a long time, so I think we can use it as the default instead of keeping two different versions. WDYT @agriyakhetarpal?

Yes, given the size difference/increase is pretty reasonable, I am happy to use it as the default. Thanks @ryanking13 and @teddygood for the discussion and for working on this!

Member

ryanking13 left a comment


Thanks! Could you please open the same PR in pyodide/pyodide-recipes as well? Unfortunately, we have two copies of the same recipe for now.

@ryanking13 ryanking13 merged commit df94f6e into pyodide:main Oct 26, 2025
38 of 40 checks passed
@teddygood
Contributor Author

Oh, the PR got merged! Thanks! I'll open the same PR in pyodide/pyodide-recipes as well.

Drranny pushed a commit to Drranny/pyodide that referenced this pull request Feb 15, 2026
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
