BUG: fix race initializing legacy dtype casts#28290
Conversation
|
This is decently complicated but I added the backport candidate label because I think this is only possibly risky for the free-threaded build and it fixes an annoying Jax runtime crash on that build that they can't really do anything about. |
| if (PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls, | ||
| (PyObject *) to, Py_None) < 0) { | ||
| return NULL; | ||
| } |
There was a problem hiding this comment.
None gets inserted into to castingimpls later on before exiting the critical section so there's no need to do it here
| * Using this function outside of module initialization without holding a | ||
| * critical section on the castingimpls dict may lead to a race to fill the | ||
| * dict. Use PyArray_GetGastingImpl to lazily register casts at runtime | ||
| * safely. |
There was a problem hiding this comment.
I manually verified that this function is only used during module initialization or inside the critical section I added in this PR
There was a problem hiding this comment.
Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)
This comment was marked as outdated.
This comment was marked as outdated.
|
Pushed a fix that should make ASAN and TSAN CI output more helpful. Also, yay, TSAN CI found a race condition! |
8120070 to
8f53a12
Compare
|
Not sure why the 32 bit runners are unhappy, will have to come back to this. |
9b5289f to
137ca3c
Compare
|
It turns out the hangs and crashes on 32-bit runners were because of memory exhaustion, I think the latest push should pass all the CI. |
Not quite right. At some arbitrary thread count that depends on the precise details of the memory allocation patterns of threads, Python will raise a runtime error saying that it can't spawn any more threads when you ask it to spawn a thread. Normally the exception would bubble up properly. This time, the exception coming from The fix is to put the code working with the |
mhvk
left a comment
There was a problem hiding this comment.
Not sure I can review this properly, but it looks good.
| return NULL; | ||
| } | ||
| Py_RETURN_NONE; | ||
| PyErr_Clear(); |
seberg
left a comment
There was a problem hiding this comment.
Thanks Natahn! LGTM. The only behavior change, might be that we init the within castingimpl immediately now. That seems OK.
If this is annoying someone in practice, I am OK with backporting this.
I left two comments about code cleanup, but I don't feel strongly about them.
| pip install -r requirements/ci_requirements.txt | ||
| pip install -r requirements/test_requirements.txt | ||
| # xdist captures stdout/stderr, but we want the TSAN output | ||
| pip uninstall -y pytest-xdist |
There was a problem hiding this comment.
Didn't know xdist kicks in without -n, but seems good.
| if (!return_error && | ||
| PyDict_SetItem(NPY_DT_SLOTS(from)->castingimpls, | ||
| (PyObject *)to, res) < 0) { | ||
| return_error = 1; |
There was a problem hiding this comment.
I think it would be slightly nicer to keep this inside the res == NULL branch (if it isn't we don't need to set it again).
The first if could go into an else if (res == NULL) (that includes this).
If you are than OK with using Py_CLEAR(res) on this line, you can delete return_error (with the funny thing that the first if does nothing).
To me that would look nicer, I think, but happy to keep this.
There was a problem hiding this comment.
Thanks, it is better to use else if if you can't early return. I'd rather keep the return_error since it makes things a little more explicit in the weird coding style you have to use inside critical sections.
| * Using this function outside of module initialization without holding a | ||
| * critical section on the castingimpls dict may lead to a race to fill the | ||
| * dict. Use PyArray_GetGastingImpl to lazily register casts at runtime | ||
| * safely. |
There was a problem hiding this comment.
Thanks for checking. I guess we are fine as long as we enforce that casts from/to a new DType must be registered at the DType creation. (which is the enforced right now, since this is not public API)
| finally: | ||
| if n_submitted < max_workers and pass_barrier: | ||
| barrier.abort() |
There was a problem hiding this comment.
I think this may be clearer to write as:
except: # or BaseExecption if you prefer
if pass_barrier:
barrier.abort()
raise
and avoid counting the n_submitted above.
(but maybe I missed a subtlety?)
There was a problem hiding this comment.
(but maybe I missed a subtlety?)
bare except: and except BaseException is the sort of thing that linters mark as a code smell or a naive reader in the future would replace with except Exception (which is wrong). I liketry/finally because it matches the intention and uses a language feature that linters won't flag.
I can get rid of n_submitted and just check len(futures) though.
|
Looks like all the CI failures are unrelated network issues. |
|
The loongarch64 failure was legit, but unrelated. #28320 should fix it. |
|
Thanks Nathan. |
Fixes #28048
Locks
castingimplswith a critical section when thecastingimplscache is empty before writing to the dict.Because the critical section macros have braces, I refactored
PyArray_GetCastingImplinto three functions that call each other. Only the "middle" function needs the critical section, so that reduces the pain of not being able to early return in a critical section a little bit.I also added some comments because this took me quite a while to fully understand so future readers will hopefully be less confused.
Edit: there are also some unrelated fixes for the test infrastructure to avoid CI crashes and generate better debug output from CI jobs. Happy to do those separately if anyone requests.