BUG: Segfault in nditer buffer dealloc for Object arrays #18450
Conversation
Ouch :(! Thanks, great detective work/summary. I somewhat think the flag can get unset again, but can't think of when or why :). In any case, checking it here is definitely correct. Can you add the test? Your example triggers an "invalid read" in valgrind even on success, so it is sensitive enough in any case; apparently we never checked it before. I doubt the …
Force-pushed 24c4144 to 15c1933
Good tip about PyErr_Occurred and the API/GIL; that fixed the failures. Updated now to add the check for a Python error in the nditer loops I found elsewhere in numpy. The check was already done in some other places. I also added a simple test.
seberg left a comment
Thanks, this would be fine to put in as is. Planning to do so tomorrow, maybe with the nits changed.
I was considering adding tests for these paths (mainly to have them for the future). This also fails:
arr = np.arange(np.BUFSIZE * 10).reshape(10, -1).astype(str)
oarr = arr.astype(object); oarr[:, -1] = None
np.add(oarr[:, ::-1], arr[:, ::-1], where=np.ones(arr.shape, dtype=bool))
(The `where` is not necessary.)
But I have not yet managed to trigger incorrect behaviour due to the missing PyErr_Occurred() checks, although they are definitely right. I think it should be possible to get a SystemError due to calling invalid Python code with an exception set, but I can't find a way, so maybe it doesn't matter for now...
This is technically not necessary for the backport, since NumPy (also master) does not currently support object einsum or unsafe casts, I think.
Won't hurt. Remaining more or less in sync with master is sometimes helpful for future backports.
NIT: long line.
numpy/core/src/umath/ufunc_object.c
Nit: the definition is shadowed by `needs_api` in the other branch; you could remove the `int` there, although we have been using local definitions a lot more, with slightly differing opinions.
numpy/core/tests/test_nditer.py
- # This test would trigger valgrind "invalid read" before the bugfix.
+ # This test would trigger valgrind "uninitialized read" before the bugfix.
Whoops.
Force-pushed 15c1933 to ea77464
Force-pushed ea77464 to 5d864e8
Fixed up nits. I agree it's hard to see a real case where missing the `PyErr_Occurred` check matters.
Thanks Allan! I think it is fine to backport with the additional checks, but if anyone wants to be particularly minimal, I am happy to open a backport PR with just the nditer one.
@seberg I've done the backport.
    strides_copy, innerloopdata);

    if (needs_api && PyErr_Occurred()) {
        break;
    }
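For context, here is a user-level sketch of what the `needs_api` check guards: an exception raised by Python code inside an object-dtype inner loop must stop the iteration and propagate out of the ufunc call. The `Boom` class below is purely illustrative and not from the PR:

```python
import numpy as np

# Illustrative only: a Python-level view of the error path the C check guards.
# An exception raised inside the object inner loop must stop iteration and
# propagate out of the ufunc call rather than being swallowed.
class Boom:
    def __add__(self, other):
        raise ValueError("boom")

a = np.array([Boom(), Boom(), Boom()], dtype=object)
try:
    np.add(a, a)
except ValueError as exc:
    print("propagated:", exc)  # the inner-loop error reaches the caller
```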
This `break` needs to be a `goto finish` or so. Pandas is running into this in their CI. I will look into it tomorrow at the latest; the only tricky thing is that it might be nice to have a test.
xref: pandas-dev/pandas#40158
I'm going to check it out now and make a test+PR, unless you already have it done
Thanks, I thought a test would be more complicated for some reason. I think the iternext technically also should be checked, although I am willing to bet it is not a regression.
I get a segfault in 1.20.0 and in master involving nditer and object arrays. It doesn't seem to happen in 1.18. The cause seems to be some combination of incomplete error handling in nditer (a known problem which @seberg has been working on fixing over the past few months), as well as mis-detection of buffer non-initialization.
This PR has two changes:
1. Change `npyiter_clear_buffers` to avoid clearing unused buffers, which is the main fix. I am not 100% sure on correctness, since I'm not yet sure of the lifetime of validity of the `NPY_OP_ITFLAG_USINGBUFFER` flag: can it get set/unset between the call to `npyiter_copy_to_buffers` (where it is set) and `npyiter_clear_buffers` (which does the check)?
2. Add checks for `PyErr_Occurred` in ufunc inner-loop code. While I think there is a problem with fall-through errors, I am not sure if this is the right fix: are there performance implications? Also, if this is a correct change, I think I'd need to do it elsewhere in the file where nditer is used.

Reproducing code example:
This code should return an error, `TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'`; however, it often segfaults. The segfault depends on malloc'ing some memory with uninitialized non-null bytes, so you may need to vary the number 17000 two or three times to trigger a different malloc. The second array should be a 0d (and not 1d) object array. The segfaulting code was recently touched in #17029 (see `npyiter_clear_buffers` in the backtrace below).

Debug Info + Discussion
What appears to happen in the example code:

1. `None` is converted to a 0d object array. Buffers of 'O' dtype of size 8192 are created for both of the inputs and the one output. (Sidenote: can we update nditer to avoid creating a 64k buffer for the 0d array (op 1), which we never even use?)
2. The `innerloop` fails due to an invalid Python operation (`float * None`), and sets PyErr.
3. The iterator (`npyiter_buffered_iternext`) copies the output buffer to the output array (no problem); this clears `NPY_OP_ITFLAG_USINGBUFFER` for that op.
4. The copy goes through `_aligned_contig_to_contig_cast`. After successfully doing the copy, this function checks `PyErr_Occurred`, which incorrectly triggers due to the previously set error from `innerloop`.
5. The iterator then tries to clear its buffers (`npyiter_clear_buffers`) and return failure.
6. `npyiter_clear_buffers` tries to zero out op 1's buffer, which involves decref'ing each element since it is Object dtype. However, this buffer was never initialized, so depending on malloc it contains random addresses, so this often segfaults.

GDB traceback:
Additionally, this is the output with `NPY_IT_DBG_TRACING` and `NPY_UF_DBG_TRACING` turned on:

It doesn't show it, but in gdb I can see the segfault happens when clearing op 1's buffer.
NumPy/Python version information:
1.20.0 and master