ENH: use more fine-grained critical sections in array coercion internals #30514
Conversation
jorenham
left a comment
Makes sense 👌
I picked some nits
Ping @kumaraditya303 - I'd appreciate your thoughts on this.

I added the backport candidate label and milestoned this for 2.4.1 because it fixes a scaling issue that people will run into. All these critical sections are no-ops on the GIL-enabled build too.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    }
    else {
        assert(depth != ndim);
        // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
Suggested change:

    - // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
    + // this macro takes *the argument* of PySequence_Fast, which is orig_obj;
mhvk
left a comment
Yes, generally would seem wise to do the critical sections as far down as possible.
Just one nit about a name, since there is a spelling error to correct anyway. But feel free to ignore.
seberg
left a comment
Makes sense to me, thanks. I think you can just merge, but maybe give me a ping if you want to backport it, since then it makes sense to make double sure there is no silly mistake.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    assert(depth != ndim);
    // this macro thakes *the argument* of PySequence_Fast, which is orig_obj;
    // not the object returned by PySequence_Fast, which is obj
    NPY_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(orig_obj);
This can still crash and produce data races if the sequence gets mutated but the size remains the same. Inside the loop, the items of the sequence are accessed using borrowed references (PySequence_Fast_GET_ITEM), so if the critical section gets suspended and another thread sets a new object at the current index, then it can crash with a use-after-free.
You should be able to reproduce this case with a test that tries to create an ndarray from a list whose size is fixed but that is being concurrently modified by another thread with list.__setitem__.
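A hedged sketch of the suggested reproducer (the helper name and the sizes are made up for illustration): one thread repeatedly builds an array from a shared list while another thread replaces items in place via list.__setitem__, keeping the size fixed. On a GIL build this should always pass; on a free-threaded build without a fix, the borrowed-reference access described above could misbehave.

```python
import threading
import numpy as np

def build_while_mutating(n_items=1000, n_builds=200):
    """Build arrays from a shared list while another thread swaps items.

    The list's length never changes; only list.__setitem__ is used,
    matching the scenario described in the review comment above.
    """
    shared = list(range(n_items))
    stop = threading.Event()

    def mutate():
        i = 0
        while not stop.is_set():
            shared[i % n_items] = -i  # same size, new objects
            i += 1

    t = threading.Thread(target=mutate)
    t.start()
    try:
        arr = None
        for _ in range(n_builds):
            arr = np.array(shared)
    finally:
        stop.set()
        t.join()
    return arr

result = build_while_mutating()
```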
To fix that would I need to use an API that returns a strong reference or could I just INCREF after getting the borrowed reference?
You can incref the item after getting the borrowed reference but it would cause a lot of reference counting contention. I don't think that would scale on free-threading builds.
I guess I'm trying to ask: how would you suggest fixing the issue?
You can try to further reduce the critical section and check the size after each iteration, so something like:
    for (int i = 0; i < original_length; i++) {
        NPY_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(orig_obj);
        if (/* length changed */) {
            /* raise size-changed exception */
        }
        PyObject *value = PySequence_Fast_GET_ITEM(obj, i);
        Py_INCREF(value);
        NPY_END_CRITICAL_SECTION_SEQUENCE_FAST();
        // rest of loop
    }
The strong reference is still needed, though, but I think with the reduced critical section it would not cause a lot of contention.
I implemented your suggestion and added a new test, but I was never able to trigger a race as you described. Do you have any idea why the test I modified in this PR doesn't trigger races using the old code that you said was buggy?
Either way, IMO it's way clearer when the critical section is as short as possible, like I have it here. I also think doing the length check every iteration makes sense - I think it may be possible even on the GIL-enabled build for an unlucky thread switch to trigger issues that would be caught by the new repeated check.
@seberg would you mind giving this another look to check the new changes?
numpy/_core/src/multiarray/ctors.c (outdated diff):

    }
    npy_intp orig_length = PyArray_DIMS(self)[0];
    for (npy_intp i = 0; i < orig_length; i++) {
        int critical_section_error = 0;
Critical sections are no-ops on default builds, and errors can happen there as well; I suggest naming it something generic like err.
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as tpe:
            tasks = [tpe.submit(read_arrs, b) for _ in range(4)]
            tasks.append(tpe.submit(replace_list_items, b))
You should be passing mutation_func here, no?
kumaraditya303
left a comment
LGTM
Rebased to fix the failing tests. @seberg are you OK with this getting backported? I think it'd still be nice to backport this even though we probably won't backport all the scaling fixes we're doing. This one fixes a new scaling issue in 2.4, so it's a regression fix in that sense.
seberg
left a comment
I am slightly curious how this affects the uncontested case of:
    l = [0] * 100_000
    %timeit np.array(l)
but I am happy to trust you that this seems the better trade-off. If the sequence doesn't contain trivial scalars, it certainly makes sense to unlock so other threads can access it (since I guess we have no RW-lock).
(I am still half wondering if we shouldn't just copy the sequences, but not worth trying here.)
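For reference, the uncontended benchmark asked about above can be run with the stdlib timeit module instead of the %timeit magic; the iteration count here is an arbitrary choice.

```python
import timeit
import numpy as np

l = [0] * 100_000
# Time the uncontended single-threaded conversion; no other thread
# touches the list, so any slowdown comes from per-call locking overhead.
per_call = timeit.timeit(lambda: np.array(l), number=20) / 20
print(f"np.array on a 100_000-element list: {per_call * 1e6:.1f} us per call")
```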
    return -1;
    }

    int ret;
Suggested change:

    - int ret;
    + int ret = -1;
or do you trust the compiler warnings on this one?
numpy/_core/src/multiarray/ctors.c (outdated diff):

    for (npy_intp i = 0; i < orig_length; i++) {
        int err = 0;
        // this macro takes *the argument* of PySequence_Fast, which is orig_seq;
        // not the object returned by PySequence_Fast, which is obj
You likely did, but can you explain why not obj? If PySequence_Fast did convert to tuple, we'll have our own object anyway?
(Ideally in the comment, I guess.)
Also, can't we assume that a fast-sequence will not change its size if we hold a critical section on it? (But then, I guess critical sections are not fool-proof, so maybe that is a silly thought.)
When handed a list, PySequence_Fast returns a proxy object with its own per-object lock. If we passed obj in, then we'd lock the PySequence_Fast proxy object and not the original list.
We could check for this in the CRITICAL_SECTION_SEQUENCE_FAST macros but then we'd be doing a check every time we enter the critical section. This is an optimization that I copied from CPython where these macros originally came from.
Also, can't we assume that a fast-sequence will not change its size if we hold a critical section on it? (But then, I guess critical sections are not fool-proof, so maybe that is a silly thought.)
It can't change while we're executing code in the body of the critical section. It also can't change at all if we never call back into the CPython C API.
Here we are calling back into the CPython C API, but we're using the low-level accessor macros which do no error checking and do not cause the critical section to be released. In principle a future version of Python could decide to change that but I doubt that would happen because the property that critical sections aren't released by low-level accessor macros is being used in lots of places.
The rule of thumb I have in my head is that a critical section can be released in exactly the same places where GIL switches are possible (e.g. if the interpreter enters a possibly blocking call). As long as we ensure there are no possibly blocking calls while we hold the critical section, then it behaves like a "real" lock and does not get released by code called inside the critical section body.
When handed a list, PySequence_Fast returns a proxy object with its own per-object lock
Aha, I really thought that it just returns the original list in that case, not a proxy that can be mutated by mutating the original list...
Well, I'll let you decide whether you want to explain that.
numpy/_core/src/multiarray/ctors.c (outdated diff):

    else {
        Py_XSETREF(item_pyvalue, Py_NewRef(PySequence_Fast_GET_ITEM(obj, i)));
    }
    NPY_END_CRITICAL_SECTION_SEQUENCE_FAST();
I am a bit confused: is it really important for the relevant use-case to unlock the critical section on every iteration?
Yes, it's possible that the size gets changed by another thread while the critical section is released. It could happen any time outside of the critical section if the list is shared with other threads.
Sorry, my question was that I am surprised that contention of the list objects with multiple threads is actually a very important use-case (i.e. we can't just lock the whole iteration).
That's the way it works right now, but in a very silly way where we unconditionally acquire a critical section for lists and tuples. That's the critical section I'm deleting in PyArray_FromAny_int down on line 1571 below here.
I observed contention in the benchmark from #30494 - you can see it in the screenshot from #30494 (comment).
This PR fixes that contention because the critical sections are held for a much shorter period of time in this PR.
It also gives people an escape hatch if it does ever come up again: pass a tuple instead of a list.
OK, sorry for the noise, it wasn't (isn't) quite obvious why that script needs this use-case, but fine with it. The question would really be the opposite I guess: Would this make:
l = [0] * 100_000
%timeit np.array(l)
much slower in the uncontested case. And if it doesn't, all of this is moot anyway. If it is much slower, I suppose this may still be the more interesting thing (i.e. ensure we scale well on free-threading; at least for now that is hopefully how things are used).
On current main: [timings elided]
On this PR with the same interpreter: [timings elided]
On the GIL-enabled build, with current main: [timings elided]
And this PR: [timings elided]
So not nothing, but the impact is disproportionately worse on the free-threaded build. I think I can improve things here - let me try to experiment a little.
Well, to be fair, while very noticeable, it's also not a super bad slowdown if the alternative is terrible congestion in some other cases. (The GIL build doesn't matter here, of course.)
I posted an incorrect comment and then deleted it - sorry if you all got notifications.
I played around with refactoring to amortize costs and couldn't find a solution that runs faster than this. One thing we can do is document that using a tuple avoids the critical sections.
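The tuple escape hatch could be illustrated in the docs with a sketch like this (the sizes are arbitrary): since tuples are immutable, the conversion does not need to lock the input the way it may for a mutable list on free-threaded builds.

```python
import numpy as np

data = [0] * 1000
# A mutable list may be locked during conversion on free-threaded builds;
# passing an immutable tuple instead sidesteps that locking on the input.
arr_from_list = np.array(data)
arr_from_tuple = np.array(tuple(data))
assert arr_from_list.shape == arr_from_tuple.shape == (1000,)
```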
I opened #30597 as a longer-term followup task.
So one more data point is this: on another computer (this was M1), I see more slowdown, but also <10%. I think it really begs the question whether the optimization of not always converting to a tuple is even worthwhile (at least on free-threading). Admittedly, this probably disregards what happens when we have lists that are so large that they do not fit into the cache, but I am not sure we need to optimize for those. EDIT: OTOH, I guess the picture may be different for very small lists (although it seems likely that array creation itself is the lion's share then).
Good point. It'd be nice if there were a zero-copy way for a tuple to consume a list's entries.
@colesbury and I chatted and arrived at moving the critical section outside the for loop. We're still only locking the innermost list in a recursive list of lists because only the innermost critical section in the call stack is locked. The difference is that now we keep the lock held while we're processing the whole list. That avoids all the critical section churn, and converting a list to an ndarray has very little single-threaded overhead. I think this is OK to backport now. @seberg are you ok with that?
This seems the likely better trade-off to me, thanks. (Not that I care too much which way we go, happy to trust your guts.)
Silly nit/preference that I'll just apply.
And yes, this seems all simple enough to backport to me.
Thanks all for your help with this. @charris please feel free to ping me about the backport if there are issues.
ENH: use more fine-grained critical sections in array coercion internals (#30514)
Towards addressing #30494. Right now the critical section I remove here in PyArray_FromAny_int shows up in the profile for the script in that issue. I added the critical section in 5a031d9 and this partially reverts that change. On reflection, it's not a good idea to introduce a scaling bottleneck here in service of a sort of wonky thing to do: mutating the operand of np.array() while the array is being created. Instead, we should error in those cases - like we already do without the critical section!
I also needed to add new, more fine-grained critical sections in PyArray_DiscoverDTypeAndShape_Recursive and PyArray_AssignFromCache_Recursive to avoid data races due to use of the PySequence_Fast API. I also updated the tests for this to allow the cases affected by this to raise errors instead of succeeding, since that success relies on introducing scaling bottlenecks for valid read-only uses.
Here's a Samply profile output run using this PR on the script from #30494 - I no longer see a scaling bottleneck inside the array coercion routines: https://share.firefox.dev/44KLZxs