ENH: use more fine-grained critical sections in array coercion internals #30514
Conversation
jorenham
left a comment
Makes sense 👌
I picked some nits
Ping @kumaraditya303 - I'd appreciate your thoughts on this.

I added the backport candidate label and milestoned this for 2.4.1 because it fixes a scaling issue that people will run into. All these critical sections are no-ops on the GIL-enabled build too.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    }
    else {
        assert(depth != ndim);
        // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
Suggested change:

    - // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
    + // this macro takes *the argument* of PySequence_Fast, which is orig_obj;
mhvk
left a comment
Yes, generally would seem wise to do the critical sections as far down as possible.
Just one nit about a name, since there is a spelling error to correct anyway. But feel free to ignore.
seberg
left a comment
Makes sense to me, thanks. I think you can just merge, but maybe give me a ping if you want to backport it, since then it makes sense to make double sure there is no silly mistake.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    assert(depth != ndim);
    // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
    // not the object returned by PySequence_Fast, which is obj
    NPY_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(orig_obj);
This can still crash and produce data races if the sequence gets mutated but the size remains the same. Inside the loop, the items of the sequence are accessed using borrowed references (PySequence_Fast_GET_ITEM), so if the critical section gets suspended and another thread sets a new object at the current index, then it can crash with a use-after-free.
You should be able to reproduce this case with a test that tries to create an ndarray from a list whose size is fixed but that is being concurrently modified by another thread with list.__setitem__.
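A hedged sketch of the suggested reproducer (the helper name and the sizes are made up for illustration): one thread repeatedly builds an array from a shared list while another thread replaces items in place via list.__setitem__, keeping the size fixed. On a GIL build this should always pass; on a free-threaded build without a fix, the borrowed-reference access described above could misbehave.

```python
import threading
import numpy as np

def build_while_mutating(n_items=1000, n_builds=200):
    """Build arrays from a shared list while another thread swaps items.

    The list's length never changes; only list.__setitem__ is used,
    matching the scenario described in the review comment above.
    """
    shared = list(range(n_items))
    stop = threading.Event()

    def mutate():
        i = 0
        while not stop.is_set():
            shared[i % n_items] = -i  # same size, new objects
            i += 1

    t = threading.Thread(target=mutate)
    t.start()
    try:
        arr = None
        for _ in range(n_builds):
            arr = np.array(shared)
    finally:
        stop.set()
        t.join()
    return arr

result = build_while_mutating()
```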
To fix that would I need to use an API that returns a strong reference or could I just INCREF after getting the borrowed reference?
You can incref the item after getting the borrowed reference but it would cause a lot of reference counting contention. I don't think that would scale on free-threading builds.
I guess I'm trying to ask: how would you suggest fixing the issue?
You can try to further reduce the critical section and check the size after each iteration, so something like:
    for (int i = 0; i < original_length; i++) {
        NPY_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(orig_obj);
        if (/* length changed */) {
            /* raise size-changed exception */
        }
        PyObject *value = PySequence_Fast_GET_ITEM(obj, i);
        Py_INCREF(value);
        NPY_END_CRITICAL_SECTION_SEQUENCE_FAST();
        // rest of loop
    }
The strong reference is still needed, though, but I think with the reduced critical section it would not cause a lot of contention.
I implemented your suggestion and added a new test, but I was never able to trigger a race as you described. Do you have any idea why the test I modified in this PR doesn't trigger races using the old code that you said was buggy?
Either way, IMO it's way clearer when the critical section is as short as possible, like I have it here. I also think doing the length check every iteration makes sense - I think it may be possible even on the GIL-enabled build for an unlucky thread switch to trigger issues that would be caught by the new repeated check.
@seberg would you mind giving this another look to check the new changes?
numpy/_core/src/multiarray/ctors.c (outdated diff):

    }
    npy_intp orig_length = PyArray_DIMS(self)[0];
    for (npy_intp i = 0; i < orig_length; i++) {
        int critical_section_error = 0;
Critical sections are no-ops on default builds, and errors can happen there as well; I suggest naming it something generic like err.
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as tpe:
            tasks = [tpe.submit(read_arrs, b) for _ in range(4)]
            tasks.append(tpe.submit(replace_list_items, b))
You should be passing mutation_func here, no?
kumaraditya303
left a comment
LGTM
Rebased to fix the failing tests. @seberg are you OK with this getting backported? I think it'd still be nice to backport this even though we probably won't backport all the scaling fixes we're doing. This one fixes a new scaling issue in 2.4, so it's a regression fix in that sense.
seberg
left a comment
I am slightly curious how this affects the uncontested case of:
    l = [0] * 100_000
    %timeit np.array(l)
but I am happy to trust you that this seems the better trade-off. If the sequence doesn't contain trivial scalars, it certainly makes sense to unlock so other threads can access it (since I guess we have no RW-lock).
(I am still half wondering if we shouldn't just copy the sequences, but not worth trying here.)
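For reference, the uncontended benchmark asked about above can be run with the stdlib timeit module instead of the %timeit magic; the iteration count here is an arbitrary choice.

```python
import timeit
import numpy as np

l = [0] * 100_000
# Time the uncontended single-threaded conversion; no other thread
# touches the list, so any slowdown comes from per-call locking overhead.
per_call = timeit.timeit(lambda: np.array(l), number=20) / 20
print(f"np.array on a 100_000-element list: {per_call * 1e6:.1f} us per call")
```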
    return -1;
    }

    int ret;
Suggested change:

    - int ret;
    + int ret = -1;
or do you trust the compiler warnings on this one?
numpy/_core/src/multiarray/ctors.c (outdated diff):

    for (npy_intp i = 0; i < orig_length; i++) {
        int err = 0;
        // this macro takes *the argument* of PySequence_Fast, which is orig_seq;
        // not the object returned by PySequence_Fast, which is obj
You likely did, but can you explain why not obj? If PySequence_Fast did convert to tuple, we'll have our own object anyway?
(Ideally in the comment, I guess.)
Also, can't we assume that a fast-sequence will not change its size if we hold a critical section on it? (But then, I guess critical sections are not fool-proof, so maybe that is a silly thought.)
When handed a list, PySequence_Fast returns a proxy object with its own per-object lock. If we passed obj in, then we'd lock the PySequence_Fast proxy object and not the original list.
We could check for this in the CRITICAL_SECTION_SEQUENCE_FAST macros but then we'd be doing a check every time we enter the critical section. This is an optimization that I copied from CPython where these macros originally came from.
Also, can't we assume that a fast-sequence will not change its size if we hold a critical section on it? (But then, I guess critical sections are not fool-proof, so maybe that is a silly thought.)
It can't change while we're executing code in the body of the critical section. It also can't change at all if we never call back into the CPython C API.
Here we are calling back into the CPython C API, but we're using the low-level accessor macros which do no error checking and do not cause the critical section to be released. In principle a future version of Python could decide to change that but I doubt that would happen because the property that critical sections aren't released by low-level accessor macros is being used in lots of places.
The rule of thumb I have in my head is that a critical section can be released in exactly the same places where GIL switches are possible (e.g. if the interpreter enters a possibly blocking call). As long as we ensure there are no possibly blocking calls while we hold the critical section, then it behaves like a "real" lock and does not get released by code called inside the critical section body.
When handed a list, PySequence_Fast returns a proxy object with its own per-object lock
Aha, I really thought that it just returns the original list in that case, not a proxy that can be mutated by mutating the original list...
Well, I'll let you decide whether you want to explain that.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    else {
        Py_XSETREF(item_pyvalue, Py_NewRef(PySequence_Fast_GET_ITEM(obj, i)));
    }
    NPY_END_CRITICAL_SECTION_SEQUENCE_FAST();
I am a bit confused: is it really important for the relevant use-case to unlock the critical section on every iteration?
Yes, it's possible that the size gets changed by another thread while the critical section is released. It could happen any time outside of the critical section if the list is shared with other threads.
Sorry, my question was that I am surprised that contention of the list objects with multiple threads is actually a very important use-case (i.e. we can't just lock the whole iteration).
That's the way it works right now, but in a very silly way where we unconditionally acquire a critical section for lists and tuples. That's the critical section I'm deleting in PyArray_FromAny_int down on line 1571 below here.
I observed contention in the benchmark from #30494 - you can see it in the screenshot from #30494 (comment).
This PR fixes that contention because the critical sections are held for a much shorter period of time in this PR.
It also gives people an escape hatch if it does ever come up again: pass a tuple instead of a list.
OK, sorry for the noise, it wasn't (isn't) quite obvious why that script needs this use-case, but fine with it. The question would really be the opposite I guess: Would this make:
l = [0] * 100_000
%timeit np.array(l)
much slower in the uncontested case. And if it doesn't, all of this is moot anyway. If it is much slower, I suppose this may still be the more interesting thing (i.e. ensure we scale well on free-threading; at least for now that is hopefully how things are used).
On current main: [timings elided]
On this PR with the same interpreter: [timings elided]
On the GIL-enabled build, with current main: [timings elided]
And this PR: [timings elided]
So not nothing, but the impact is disproportionately worse on the free-threaded build. I think I can improve things here - let me try to experiment a little.
Well, to be fair, while very noticeable, it's also not a super bad slowdown if the alternative is terrible congestion in some other cases. (The GIL build doesn't matter here, of course.)
I posted an incorrect comment and then deleted it - sorry if you all got notifications.
I played around with refactoring to amortize costs and couldn't find a solution that runs faster than this. One thing we can do is document that using a tuple avoids the critical sections.
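The tuple escape hatch could be illustrated in the docs with a sketch like this (the sizes are arbitrary): since tuples are immutable, the conversion does not need to lock the input the way it may for a mutable list on free-threaded builds.

```python
import numpy as np

data = [0] * 1000
# A mutable list may be locked during conversion on free-threaded builds;
# passing an immutable tuple instead sidesteps that locking on the input.
arr_from_list = np.array(data)
arr_from_tuple = np.array(tuple(data))
assert arr_from_list.shape == arr_from_tuple.shape == (1000,)
```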
I opened #30597 as a longer-term followup task.
So one more data point is this: on another computer (this was M1), I see more slowdown, but also <10%. I think it really begs the question whether the optimization of not always converting to a tuple is even worthwhile (at least on free-threading). Admittedly, this probably disregards what happens when we have lists that are so large that they do not fit into the cache, but I am not sure we need to optimize for those. EDIT: OTOH, I guess the picture may be different for very small lists (although it seems likely that array creation itself is the lion's share then).
Good point. It'd be nice if there were a zero-copy way for a tuple to consume a list's entries.
@colesbury and I chatted and arrived at moving the critical section outside the for loop. We're still only locking the innermost list in a recursive list of lists because only the innermost critical section in the call stack is locked. The difference is that now we keep the lock held while we're processing the whole list. That avoids all the critical section churn, and converting a list to an ndarray has very little single-threaded overhead. I think this is OK to backport now. @seberg are you ok with that?
This seems the likely better trade-off to me, thanks. (Not that I care too much which way we go, happy to trust your guts.)
Silly nit/preference that I'll just apply.
And yes, this seems all simple enough to backport to me.
Thanks all for your help with this. @charris please feel free to ping me about the backport if there are issues.
ENH: use more fine-grained critical sections in array coercion internals (#30514)
Towards addressing #30494. Right now the critical section I remove here in PyArray_FromAny_int shows up in the profile for the script in that issue. I added the critical section in 5a031d9 and this partially reverts that change. On reflection, it's not a good idea to introduce a scaling bottleneck here in service of a sort of wonky thing to do: mutating the operand of np.array() while the array is being created. Instead, we should error in those cases - like we already do without the critical section!
I also needed to add new, more fine-grained critical sections in PyArray_DiscoverDTypeAndShape_Recursive and PyArray_AssignFromCache_Recursive to avoid data races due to use of the PySequence_Fast API. I also updated the tests for this to allow the cases affected by this to raise errors instead of succeeding, since that success relies on introducing scaling bottlenecks for valid read-only uses.
Here's a Samply profile output run using this PR on the script from #30494 - I no longer see a scaling bottleneck inside the array coercion routines: https://share.firefox.dev/44KLZxs