BUG: Segfault in nditer buffer dealloc for Object arrays #18450

ahaldane · 2021-02-20T18:17:35Z

I get a segfault in 1.20.0 and in master involving nditer and object arrays. It doesn't seem to happen in 1.18. The cause seems to be some combination of incomplete error handling in nditer ( known problem which @seberg seems to have been working on fixing a few months ago), as well as mis-detection of buffer non-initialization.

This PR has two changes:

It adds a check in npyiter_clear_buffers to avoid clearing unused buffers, which is the main fix. I am not 100% on correctness since I'm not yet sure of the lifetime of validity of the NPY_OP_ITFLAG_USINGBUFFER flag: Can it get set/unset between the call to npyiter_copy_to_buffers (where it is set) and npyiter_clear_buffers (which does the check)?
Second, it adds a check to PyErr_Occurred in ufunc inner-loop code. While I think there is a problem with fall-through errors, I am not sure if this is the right fix: Are there performance implications? Also, if this is a correct change, I think I'd need to do it elsewhere in the file where nditer is used.

Reproducing code example:

>>> np.zeros((17000, 2), dtype='f4') * None

This code should return an error, TypeError: unsupported operand type(s) for *: 'float' and 'NoneType', however it often segfaults. The segfault depends on malloc'ing some memory with uninitualized non-null bytes, so you may need to vary the number 17000 two or three times to trigger a different malloc. The second array should be a 0d (and not 1d) object array. The segfaulting code was recently touched in #17029 (see npyiter_clear_buffers in backtrace below).

Debug Info + Discussion

What appears to happen in the example code:

the ufunc sets up an nditer with buffers to carry out the multiplication op. The None is converted to a 0d object array. Buffers of 'O' dtype of size 8192 are created for both of the inputs and the one output. (sidenote: can we update nditer to avoid creating a 64k buffer for the 0d array (op 1) which we never even use?)
The ufunc does:

        do {
            innerloop(dataptr, count_ptr, stride, innerloopdata);
        } while (iternext(iter));

in this loop, the ufunc innerloop fails due to an invalid python operation (float*None), and sets PyErr
"iternext" in the while condition (npyiter_buffered_iternext) copies the output buffer to the output array (no problem)
"iternext" prepares to copy the next chunk from the input arrays into buffers. However, it determines that the 0d array does not need to use its buffer, so does not zero out that buffer, and unsets NPY_OP_ITFLAG_USINGBUFFER for that op.
"iternext" copies op 0 input into its buffer using _aligned_contig_to_contig_cast. After successfully doing the copy, this function checks if PyErr_Occurred, which incorrectly triggers due to the previously set error from innerloop.
"iternext" now thinks that something went wrong with the input->buffer copy, so tries to clear the buffers (npyiter_clear_buffers) and return failure.
When clearing the buffers, npyiter_clear_buffers tries to zero-out op 1's buffer, which involves decref'ing each element since it is Object dtype. However, this buffer was never initialized, so depending on malloc has random addresses, so this often segfaults.

GDB traceback:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
PyArray_Item_XDECREF (data=<optimized out>, 
    descr=0x7ffff584b300 <OBJECT_Descr>)
    at numpy/core/src/multiarray/refcount.c:102
102	        Py_XDECREF(temp);
(gdb) bt
#0  PyArray_Item_XDECREF (data=<optimized out>, 
    descr=0x7ffff584b300 <OBJECT_Descr>)
    at numpy/core/src/multiarray/refcount.c:102
#1  0x00007ffff559b066 in npyiter_clear_buffers (
    iter=iter@entry=0x555556088680)
    at numpy/core/src/multiarray/nditer_api.c:2659
#2  0x00007ffff5592fe5 in npyiter_buffered_iternext (iter=0x555556088680)
    at numpy/core/src/multiarray/nditer_templ.c.src:331
#3  0x00007ffff575a8d0 in iterator_loop (ufunc=ufunc@entry=0x7ffff58b66d0, 
    op=op@entry=0x7fffffffbc30, dtype=dtype@entry=0x7fffffffa980, 
    order=order@entry=NPY_KEEPORDER, buffersize=buffersize@entry=8192, 
    arr_prep=<optimized out>, full_args=..., 
    innerloop=0x7ffff562c430 <PyUFunc_OO_O>, 
    innerloopdata=0x7ffff7d42df0 <PyNumber_Multiply>, op_flags=0x7fffffffa880)
    at numpy/core/src/umath/ufunc_object.c:1535
#4  0x00007ffff5760f82 in execute_legacy_ufunc_loop (ufunc=0x7ffff58b66d0, 
    trivial_loop_ok=trivial_loop_ok@entry=0, op=op@entry=0x7fffffffbc30, 
    dtypes=dtypes@entry=0x7fffffffa980, order=NPY_KEEPORDER, buffersize=8192, 
    arr_prep=0x7fffffffaa80, full_args=..., op_flags=0x7fffffffa880)
    at numpy/core/src/umath/ufunc_object.c:1702

Additionally, this is the output with NPY_IT_DBG_TRACING and NPY_UF_DBG_TRACING turned on:

Evaluating ufunc multiply
Getting arguments
Finding inner loop
input types:
dtype('O') dtype('O') 
output types:
dtype('O') 
Executing legacy inner loop
iterator loop
Iterator: Checking casting for operand 0
op: dtype('float32'), iter: dtype('O')
Iterator: Setting NPY_OP_ITFLAG_CAST because the types aren't equivalent
Iterator: Checking casting for operand 1
op: dtype('O'), iter: dtype('O')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('O')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 16
Iterator: Copying inputs to buffers
Iterator: Buffer requires init, memsetting to 0
Iterator: Copying operand 0 to buffer (8192 items)
Any buffering needed: 1
Iterator: Finished copying inputs to buffers (buffered size is 8192)
iterator loop count 8192
Iterator: Copying buffers to outputs
Iterator: Freeing refs and zeroing buffer of operand 0
Iterator: Finished copying buffers to outputs
Iterator: Copying inputs to buffers
Iterator: Buffer requires init, memsetting to 0
Iterator: Copying operand 0 to buffer (8192 items)
zsh: segmentation fault (core dumped)  ./runtests.py -g --ipython

It doesn't show it, but I in gdb I can see the segfault happens when clearing op 1's buffer

NumPy/Python version information:

1.20.0 and master

seberg · 2021-02-20T18:45:06Z

Ouch :(! Thanks, great detective work/summary. I somewhat think the flag can get unset again, but can't think of when or why :). In any case checking it here is definitely correct.

Can you add the test? Your example triggers an "invalid read" in valgrind even on success, so that is sensitive enough in any case, apparently we did never check it.

I doubt the PyErr_Occurred() check is slow in the grand scheme of things, of course you could have cases where the inner-loop is called for 2 items at a time, and I am not sure if it is measurable in such cases.

The PyErr_Occurred() is better there, I did not bother adding it figuring that for the moment it just stays identically broken (and the near future is to return errors explicitly, as casting does now). However, you need to put it behind a check for whether the iter requires API (i.e. whether we may have release the API). I am not certain that PyErr_Occurred() is safe if the GIL is released.

ahaldane · 2021-02-21T04:09:47Z

Good tip about PyErr_Occurred and the API/GIL, that fixed the failures.

Updated now to add the check for a python error in the nditer loops I found elsewhere in numpy. The check was already done in some other places. I also added a simple test.

seberg

Thanks, would be fine to just put in as is. Planning to do so tomorrow, maybe with the nits changed.

I was considering adding tests for these paths (mainly to have them for the future). This also fails:

arr = np.arange(np.BUFSIZE * 10).reshape(10, -1).astype(str)
oarr = arr.astype(object); oarr[:, -1] = None
np.add(oarr[:, ::-1], arr[:, ::-1], where=np.ones(arr.shape, dtype=bool))

(The where is not necessary).

But, I did not yet manage to somehow trigger incorrect behaviour due to the missing PyErr_Occurred() checks, although they are definitely right. I think it should be possible to somehow get a SystemError due to calling invalid Python code with an exception set, but I can't find a way, so maybe it doesn't matter now...

seberg · 2021-02-21T21:42:17Z

numpy/core/src/multiarray/einsum.c.src

This is technically not necessary for backport, since NumPy (also master), does currently not even support object einsum or unsafe casts I think.

Won't hurt. Remaining more or less in sync with master is sometimes helpful for future backports.

seberg · 2021-02-21T21:42:32Z

numpy/core/src/multiarray/nditer_api.c

NIT: long line.

seberg · 2021-02-21T21:47:05Z

numpy/core/src/umath/ufunc_object.c

Nit: the definition is overshadowed by needs_api in the other branch, could remove the int there, although we have been using local definitions a lot more, with slightly differing opinions.

seberg · 2021-02-21T21:47:27Z

numpy/core/tests/test_nditer.py

Suggested change

# This test would trigger valgrind "invalid read" before the bugfix.

# This test would trigger valgrind "uninitialized read" before the bugfix.

woops

ahaldane · 2021-02-22T04:38:50Z

Fixed up nits.

I agree it's hard to see a real case where missing the PyErr_Occurred checks causes a bug, because we usually check for an error pretty soon after the nditer loops anyways. The main opportunity for a bug would be during iternext immediately in the loops' while clause. Iternext calls copy/cast code, so a bug would be most likely to happen there somehow. At least, this PR demonstrates an incorrect code path can be taken in that code, even if that was harmless in this case.

seberg · 2021-02-22T17:20:04Z

Thanks Allan! I think it is fine to backport with the additional checks, but if anyone wants to be particularly minimal, I am happy to open a backport PR with just the nditer one.

charris · 2021-02-22T17:54:17Z

@seberg I've done the backport.

seberg · 2021-03-02T00:30:57Z

numpy/core/src/umath/ufunc_object.c

                        strides_copy, innerloopdata);

+            if (needs_api && PyErr_Occurred()) {
+                break;


This break needs to be a goto finish or so. Pandas is running into this in their CI. Will look into it tomorrow at the latest, but the only tricky thing is that it might be nice to have a test.

xref: pandas-dev/pandas#40158

I'm going to check it out now and make a test+PR, unless you already have it done

Thanks, I thought a test would be more complicated for some reason. I think the iternext technically also should be checked, although I am willing to bet is not a regression.

Fixup to numpygh-18450

github-actions bot added the 00 - Bug label Feb 20, 2021

ahaldane added the component: numpy._core label Feb 20, 2021

seberg added the 06 - Regression label Feb 20, 2021

seberg added this to the 1.20.2 release milestone Feb 20, 2021

charris added the 09 - Backport-Candidate PRs tagged should be backported label Feb 20, 2021

ahaldane force-pushed the fix_nditer_od_segfault branch 2 times, most recently from 24c4144 to 15c1933 Compare February 21, 2021 00:50

seberg approved these changes Feb 21, 2021

View reviewed changes

seberg mentioned this pull request Feb 21, 2021

ENH: Adding Object dtype to einsum #18053

Merged

ahaldane force-pushed the fix_nditer_od_segfault branch from 15c1933 to ea77464 Compare February 22, 2021 03:55

BUG: Segfault in nditer buffer dealloc for Object arrays

5d864e8

ahaldane force-pushed the fix_nditer_od_segfault branch from ea77464 to 5d864e8 Compare February 22, 2021 04:10

seberg merged commit d23d968 into numpy:master Feb 22, 2021

charris mentioned this pull request Feb 22, 2021

BUG: Segfault in nditer buffer dealloc for Object arrays #18469

Merged

charris removed the 09 - Backport-Candidate PRs tagged should be backported label Feb 22, 2021

charris removed this from the 1.20.2 release milestone Feb 22, 2021

seberg mentioned this pull request Mar 1, 2021

CI: npdev SystemError pandas-dev/pandas#40158

Closed

seberg reviewed Mar 2, 2021

View reviewed changes

charris added 09 - Backport-Candidate PRs tagged should be backported and removed 09 - Backport-Candidate PRs tagged should be backported labels Mar 2, 2021

ahaldane mentioned this pull request Mar 2, 2021

BUG: incorrect error fallthrough in nditer #18530

Merged

ahaldane added a commit to ahaldane/numpy that referenced this pull request Mar 3, 2021

BUG: incorrect error fallthrough in nditer

940f7d8

Fixup to numpygh-18450

ahaldane added a commit to ahaldane/numpy that referenced this pull request Mar 4, 2021

BUG: incorrect error fallthrough in nditer

33398ca

Fixup to numpygh-18450

ahaldane mentioned this pull request Mar 4, 2021

BUG: incorrect error fallthrough in nditer #18546

Merged

seberg mentioned this pull request Mar 17, 2021

Division by None can lead to Segfault #18633

Closed

adamantike mentioned this pull request Apr 19, 2021

Segfault when summing array slice from Pandas dataframe #18810

Closed

	# This test would trigger valgrind "invalid read" before the bugfix.
	# This test would trigger valgrind "uninitialized read" before the bugfix.

Uh oh!

BUG: Segfault in nditer buffer dealloc for Object arrays #18450

BUG: Segfault in nditer buffer dealloc for Object arrays #18450

Uh oh!

Conversation

ahaldane commented Feb 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reproducing code example:

Debug Info + Discussion

NumPy/Python version information:

Uh oh!

seberg commented Feb 20, 2021

Uh oh!

ahaldane commented Feb 21, 2021

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

seberg Feb 21, 2021

Choose a reason for hiding this comment

Uh oh!

charris Feb 22, 2021

Choose a reason for hiding this comment

Uh oh!

seberg Feb 21, 2021

Choose a reason for hiding this comment

Uh oh!

seberg Feb 21, 2021

Choose a reason for hiding this comment

Uh oh!

seberg Feb 21, 2021

Choose a reason for hiding this comment

Uh oh!

ahaldane commented Feb 22, 2021

Uh oh!

seberg commented Feb 22, 2021

Uh oh!

charris commented Feb 22, 2021

Uh oh!

seberg Mar 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahaldane Mar 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seberg Mar 2, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ahaldane commented Feb 20, 2021 •

edited

Loading

seberg Mar 2, 2021 •

edited

Loading

ahaldane Mar 2, 2021 •

edited

Loading