Skip to content

BLD: Add Accelerate support for macOS 13.3+#19816

Merged
rgommers merged 7 commits intoscipy:mainfrom
thalassemia:accelerate
Apr 15, 2024
Merged

BLD: Add Accelerate support for macOS 13.3+#19816
rgommers merged 7 commits intoscipy:mainfrom
thalassemia:accelerate

Conversation

@thalassemia
Copy link
Copy Markdown
Contributor

@thalassemia thalassemia commented Jan 6, 2024

I added a script (scipy/_build_utils/_generate_blas_wrapper.py) to generate C wrappers for Accelerate BLAS/LAPACK. These wrappers are compiled into a static library that other modules can link against with no code modifications (e.g. a call to dgemm in a Fortran module will be routed through a wrapper and turned into a call to dgemm$NEWLAPACK). I also added Accelerate support to the script (scipy/linalg/_generate_pyx.py) that generates the Cython for _fblas/_flapack using similar C wrappers. In the process of making these two changes, I included special logic to handle the G77 ABI wrappers, eliminating the need for additional Fortran wrappers.

Other changes:

Downsides

The wrapper functions are all in separate source files, allowing the linker to exclude unused symbols and keep binary sizes about the same as they currently are. Since there are ~1600 BLAS/LAPACK functions, this increases build time by a noticeable amount. I tried writing everything to a single file and compiling with -ffunction-sections -fdata-sections -Wl,-dead_strip, but this still resulted in unused symbols and larger binaries.

There is no way to mix and match BLAS/LAPACK libraries at the moment (also a challenge to support both LP64 and ILP64).

Future

This wrapper generation framework makes it pretty easy to add ILP64 support. If this PR seems OK in principle, I can make a separate PR detailing some quirks I encountered trying to implement ILP64 support with this as a base.

@github-actions github-actions bot added scipy.special scipy.io scipy.sparse.linalg scipy.linalg scipy.sparse scipy.optimize scipy.interpolate scipy.integrate scipy.odr scipy.spatial C/C++ Items related to the internal C/C++ code base Fortran Items related to the internal Fortran code base Meson Items related to the introduction of Meson as the new build system for SciPy labels Jan 6, 2024
@thalassemia thalassemia changed the title Accelerate ENH: Add Accelerate support for macOS 13.3+ Jan 6, 2024
@thalassemia thalassemia changed the title ENH: Add Accelerate support for macOS 13.3+ BLD: Add Accelerate support for macOS 13.3+ Jan 6, 2024
@rgommers
Copy link
Copy Markdown
Member

Thanks @thalassemia. Working on going through this once more and exercising it a bit.

the build times seem to be under control as evidenced by my M1 Accelerate GHA taking in 10 minutes while the M1 OpenBLAS test took 11 minutes.

Agreed. I see a 10%-15% increase in build time on my M1 Macbook Pro. Which I agree is acceptable. Tracing shows that the added targets are adding a ~20 sec load of small file compiles at the start of the build, which makes sense:

image

The build time increase will only occur with Accelerate, so that's all good.

@rgommers
Copy link
Copy Markdown
Member

I checked the binary size change: this approach is very lean, 54 kb extra due to the extra static libraries. Size grows from 21.78 MB to 21.84 MB for a local build. The macosx_arm64 wheels on PyPI are 30.3 MB, so this is (as expected) a major win - we can start shipping macosx_14_0_arm64 wheels that are almost a third smaller than the macosx_12_0_arm64 ones.

I could not figure out how to get the macOS linker to exclude unused symbols when consolidating all the wrappers into a single compilation target. I wonder if this might be possible on other platforms with other linkers (will explore in a future ILP64 PR).

I think it's possible with linker flags like -Wl,--gc-sections perhaps, but not in a portable way. Enabling LTO may help - see #16098 (comment).

@thalassemia
Copy link
Copy Markdown
Contributor Author

Good news! I think I found a way to put every wrapper in a single file and let the linker exclude unused symbols. Previously, the linker was forced to include otherwise unused symbols simply because all exported symbols are global by default. Now, by explicitly marking my wrapper symbols as hidden, the linker can remove any symbols that are not directly used within each individual module.

Importantly, this PR only adds the flags needed to make this work in macOS, since that is the only platform with Accelerate. In a future ILP64 PR using this framework, I'll need to add additional compiler-specific Meson flags to achieve the same result (-ffunction-sections -fdata-sections -Wl,--gc-sections for GCC and /Gy for MSVC)

Copy link
Copy Markdown
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, the linker was forced to include otherwise unused symbols simply because all exported symbols are global by default.

Regarding hidden symbols, it shouldn't be needed to define those manually or use __attribute__('hidden') directly I suspect. Instead, the gnu_symbol_visibility keyword to static_library can be used to do this automatically - and we wouldn't have to worry about things like Clang defining __GNU__. Did you consider this?

'SuperLU/SRC/zutil.c'
],
c_args: _superlu_lib_c_args,
dependencies: blas_dep,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That this worked at all before looks like a bug. It should have failed without this and USE_VENDOR_BLAS=1 set. The reason it doesn't is probably all the aliasing done in slu_Cnames.h.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this more closely, I think this line is unnecessary. Even if the BLAS dependency is missing when compiling the static library, it is included when compiling the extension module that links with this library. The same applies to the BLAS/LAPACK dependencies I added to the ARPACK static library. I'll clean those up for clarity.

@rgommers
Copy link
Copy Markdown
Member

Re gnu_symbol_visibility, I just realized that we should be using that anyway on all static libraries - see gh-20477.

@thalassemia
Copy link
Copy Markdown
Contributor Author

Did not know about the gnu_symbol_visibility option. That is a lot cleaner and works for me locally using either OpenBLAS or Accelerate. I updated my PR to use that.

Copy link
Copy Markdown
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now. Time to give this a go. Thanks a lot @thalassemia! And thanks @h-vetinari for all the review effort!

@h-vetinari
Copy link
Copy Markdown
Member

Congratulations @thalassemia and thanks for splitting various pieces off from this big effort! This definitely deserves a release note - if you like, please edit one into the current draft.

@andyfaff
Copy link
Copy Markdown
Contributor

Yes, congratulations @thalassemia, a big effort! Do we know how much the wheel size will reduce with not having to bundle openblas?

@thalassemia
Copy link
Copy Markdown
Contributor Author

thalassemia commented Apr 16, 2024

@andyfaff On my machine (M2, macOS 14.4.1, Python 3.11), 34edaf2 on main yields a wheel 23.1MB in size. On the same machine with OpenBLAS 0.3.26, the current 1.13.0 release tag (7dcd8c5) yields a wheel 31.8MB in size. This 27% reduction is close to what was reported earlier in this PR before I found a way to efficiently combine all the wrappers into a single source file.

Do you think it makes sense to have our macOS wheels use Accelerate by default? At the moment, Accelerate is only used if users build from source with -Dblas=accelerate.

@andyfaff
Copy link
Copy Markdown
Contributor

Do you think it makes sense to have our macOS wheels use Accelerate by default?

It's only useful for a certain mac version and above. Below a certain version I think there are issues with the library. I think the version is 14, but I can't remember off the top of my head.

This means two sets of wheels have to be made. One set for version > 10.X that bundles openblas and one set for version >14.x that doesn't. numpy is already doing this, but start from a lower macOS version.

@rgommers
Copy link
Copy Markdown
Member

Yes indeed, we should start shipping extra wheels built against Accelerate for macOS >=14.0 for the next release, just like is done for numpy 2.0.0. They will be smaller and faster.

@rgommers
Copy link
Copy Markdown
Member

This may be a good place to record that macOS 15 will include another update to Accelerate - it upgrade to LAPACK 3.11.0 with no change in symbol names. The extra LAPACK APIs will be available if the macOS deployment target is >= 15.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3rd party binaries Build issues Issues with building from source, including different choices of architecture, compilers and OS C/C++ Items related to the internal C/C++ code base CI Items related to the CI tools such as CircleCI, GitHub Actions or Azure Cython Issues with the internal Cython code base Meson Items related to the introduction of Meson as the new build system for SciPy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants