Skip to content

BUG: Prevent deadlock due to downstream importing NumPy in dlopen#31303

Merged
seberg merged 1 commit into
numpy:mainfrom
seberg:issue-31284
Apr 23, 2026
Merged

BUG: Prevent deadlock due to downstream importing NumPy in dlopen#31303
seberg merged 1 commit into
numpy:mainfrom
seberg:issue-31284

Conversation

@seberg

@seberg seberg commented Apr 22, 2026

Copy link
Copy Markdown
Member

Downstream should probably never import numpy (i.e. activate the C-API) during dlopen (rather than the Python module init function).

But if it does so, this check introduced a deadlock regression because dlopen has a recursive lock, but this breaks if the matmul is called and causes threads to spawn that also need to dlopen.

So, as a hot-fix, just only run the check on MacOS, since we are not currently aware of reports on other system (possible or not), although this did happen with certain OpenBLAS versions (and not just accelerate).

(I cannot fully rule out that there won't be regressions on some linux setups, but they were not yet reported. -- The "limit to 1 thread" solution is an alternative as well.)

EDIT: addresses gh-31284. Should be backported for 2.4.5

Downstream should probably never import numpy (i.e. activate the C-API)
during dlopen (rather than the Python module init function).

But if it does so, this check introduced a deadlock regression because
dlopen has a recursive lock, but this breaks if the matmul is called
and causes threads to spawn that also need to dlopen.

So, as a hot-fix, just only run the check on MacOS, since we are not
currently aware of reports on other system (possible or not), although
this did happen with certain OpenBLAS versions (and not just accelerate).

(I cannot fully rule out that there won't be regressions on some linux
setups, but they were not yet reported.)

Closes numpygh-31284
@mattip

mattip commented Apr 22, 2026

Copy link
Copy Markdown
Member

How do we know people won’t dlopen on macos?

@seberg

seberg commented Apr 22, 2026

Copy link
Copy Markdown
Member Author

I don't (unless Mac doesn't have such a mutex, but I'll doubt that). OTOH, we never had issues on Mac, where similar code always ran...

So basically, we can be confident that this cannot be a regression w.r.t. the deadlock (and at that point, I may lean towards considering this a downstream issue, as I think the init is technically at the wrong spot).

@rgommers might have a thought...

@rgommers rgommers added 00 - Bug 09 - Backport-Candidate PRs tagged should be backported labels Apr 22, 2026
@rgommers rgommers added this to the 2.5.0 Release milestone Apr 22, 2026

@rgommers rgommers left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @seberg. I agree this is safe. The only other consideration is whether this can/should be further reduced to macOS arm64, because there's no problem on x86-64. It's conceivable that some users are still running MKL 2023 on macOS x86-64, although the chance of that intersection with a package like csp seems remote. Optional before merge I think, I'm mappy either way.

How do we know people won’t dlopen on macos?

Given that we only have to worry about BLAS libraries here, I think we can rely on our knowledge that the SME problem is Accelerate-specific (🤞🏼 it stays that way) and dlopen usage is MKL-specific (that's very unlikely to change).

@seberg

seberg commented Apr 23, 2026

Copy link
Copy Markdown
Member Author

It's conceivable that some users are still running MKL 2023 on macOS x86-64, although the chance of that intersection with a package like csp seems remote.

We already have another MacOS bug check (although I think we could remove it). So I don't think there is any chance of regression.
(And beyond the fact that this was a regression here, I am still tempted to think that csp needs to clean up it's NumPy import pattern -- I am not sure we can 100% guarantee this in practice.)

I guess we have to try this and put this in for now. It seems unlikely a non Mac Arm systems run into this right now. And if it happens to start, we might have to say that the csp regression is the less bad one (or consider the thread limiting solution)...

@seberg

seberg commented Apr 23, 2026

Copy link
Copy Markdown
Member Author

(Ignoring those test failures, but it seems some compiler warnings might have changed? We may have to fix that up soon.)

@seberg seberg merged commit a253537 into numpy:main Apr 23, 2026
85 of 87 checks passed
@seberg seberg deleted the issue-31284 branch April 23, 2026 08:52
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 26, 2026
charris pushed a commit that referenced this pull request Apr 26, 2026
…1303)

Downstream should probably never import numpy (i.e. activate the C-API) during dlopen (rather than the Python module init function).

But if it does so, this check introduced a deadlock regression because dlopen has a recursive lock, but this breaks if the matmul is called and causes threads to spawn that also need to dlopen.

So, as a hot-fix, just only run the check on MacOS, since we are not currently aware of reports on other system (possible or not), although this did happen with certain OpenBLAS versions (and not just accelerate).

(I cannot fully rule out that there won't be regressions on some linux setups, but they were not yet reported. -- The "limit to 1 thread" solution is an alternative as well.)
charris added a commit that referenced this pull request Apr 26, 2026
BUG: Prevent deadlock due to downstream importing NumPy in dlopen (#31303)
MaanasArora pushed a commit to MaanasArora/numpy that referenced this pull request May 7, 2026
…mpy#31303)

Downstream should probably never import numpy (i.e. activate the C-API) during dlopen (rather than the Python module init function).

But if it does so, this check introduced a deadlock regression because dlopen has a recursive lock, but this breaks if the matmul is called and causes threads to spawn that also need to dlopen.

So, as a hot-fix, just only run the check on MacOS, since we are not currently aware of reports on other system (possible or not), although this did happen with certain OpenBLAS versions (and not just accelerate).

(I cannot fully rule out that there won't be regressions on some linux setups, but they were not yet reported. -- The "limit to 1 thread" solution is an alternative as well.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants