Skip to content

Conversation

@ahaldane
Copy link
Member

@ahaldane ahaldane commented Feb 20, 2021

I get a segfault in 1.20.0 and in master involving nditer and object arrays. It doesn't seem to happen in 1.18. The cause seems to be some combination of incomplete error handling in nditer ( known problem which @seberg seems to have been working on fixing a few months ago), as well as mis-detection of buffer non-initialization.

This PR has two changes:

  1. It adds a check in npyiter_clear_buffers to avoid clearing unused buffers, which is the main fix. I am not 100% on correctness since I'm not yet sure of the lifetime of validity of the NPY_OP_ITFLAG_USINGBUFFER flag: Can it get set/unset between the call to npyiter_copy_to_buffers (where it is set) and npyiter_clear_buffers (which does the check)?
  2. Second, it adds a check to PyErr_Occurred in ufunc inner-loop code. While I think there is a problem with fall-through errors, I am not sure if this is the right fix: Are there performance implications? Also, if this is a correct change, I think I'd need to do it elsewhere in the file where nditer is used.

Reproducing code example:

>>> np.zeros((17000, 2), dtype='f4') * None

This code should return an error, TypeError: unsupported operand type(s) for *: 'float' and 'NoneType', however it often segfaults. The segfault depends on malloc'ing some memory with uninitualized non-null bytes, so you may need to vary the number 17000 two or three times to trigger a different malloc. The second array should be a 0d (and not 1d) object array. The segfaulting code was recently touched in #17029 (see npyiter_clear_buffers in backtrace below).

Debug Info + Discussion

What appears to happen in the example code:

  1. the ufunc sets up an nditer with buffers to carry out the multiplication op. The None is converted to a 0d object array. Buffers of 'O' dtype of size 8192 are created for both of the inputs and the one output. (sidenote: can we update nditer to avoid creating a 64k buffer for the 0d array (op 1) which we never even use?)
  2. The ufunc does:
        do {
            innerloop(dataptr, count_ptr, stride, innerloopdata);
        } while (iternext(iter));
  1. in this loop, the ufunc innerloop fails due to an invalid python operation (float*None), and sets PyErr
  2. "iternext" in the while condition (npyiter_buffered_iternext) copies the output buffer to the output array (no problem)
  3. "iternext" prepares to copy the next chunk from the input arrays into buffers. However, it determines that the 0d array does not need to use its buffer, so does not zero out that buffer, and unsets NPY_OP_ITFLAG_USINGBUFFER for that op.
  4. "iternext" copies op 0 input into its buffer using _aligned_contig_to_contig_cast. After successfully doing the copy, this function checks if PyErr_Occurred, which incorrectly triggers due to the previously set error from innerloop.
  5. "iternext" now thinks that something went wrong with the input->buffer copy, so tries to clear the buffers (npyiter_clear_buffers) and return failure.
  6. When clearing the buffers, npyiter_clear_buffers tries to zero-out op 1's buffer, which involves decref'ing each element since it is Object dtype. However, this buffer was never initialized, so depending on malloc has random addresses, so this often segfaults.

GDB traceback:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
PyArray_Item_XDECREF (data=<optimized out>, 
    descr=0x7ffff584b300 <OBJECT_Descr>)
    at numpy/core/src/multiarray/refcount.c:102
102	        Py_XDECREF(temp);
(gdb) bt
#0  PyArray_Item_XDECREF (data=<optimized out>, 
    descr=0x7ffff584b300 <OBJECT_Descr>)
    at numpy/core/src/multiarray/refcount.c:102
#1  0x00007ffff559b066 in npyiter_clear_buffers (
    iter=iter@entry=0x555556088680)
    at numpy/core/src/multiarray/nditer_api.c:2659
#2  0x00007ffff5592fe5 in npyiter_buffered_iternext (iter=0x555556088680)
    at numpy/core/src/multiarray/nditer_templ.c.src:331
#3  0x00007ffff575a8d0 in iterator_loop (ufunc=ufunc@entry=0x7ffff58b66d0, 
    op=op@entry=0x7fffffffbc30, dtype=dtype@entry=0x7fffffffa980, 
    order=order@entry=NPY_KEEPORDER, buffersize=buffersize@entry=8192, 
    arr_prep=<optimized out>, full_args=..., 
    innerloop=0x7ffff562c430 <PyUFunc_OO_O>, 
    innerloopdata=0x7ffff7d42df0 <PyNumber_Multiply>, op_flags=0x7fffffffa880)
    at numpy/core/src/umath/ufunc_object.c:1535
#4  0x00007ffff5760f82 in execute_legacy_ufunc_loop (ufunc=0x7ffff58b66d0, 
    trivial_loop_ok=trivial_loop_ok@entry=0, op=op@entry=0x7fffffffbc30, 
    dtypes=dtypes@entry=0x7fffffffa980, order=NPY_KEEPORDER, buffersize=8192, 
    arr_prep=0x7fffffffaa80, full_args=..., op_flags=0x7fffffffa880)
    at numpy/core/src/umath/ufunc_object.c:1702

Additionally, this is the output with NPY_IT_DBG_TRACING and NPY_UF_DBG_TRACING turned on:

Evaluating ufunc multiply
Getting arguments
Finding inner loop
input types:
dtype('O') dtype('O') 
output types:
dtype('O') 
Executing legacy inner loop
iterator loop
Iterator: Checking casting for operand 0
op: dtype('float32'), iter: dtype('O')
Iterator: Setting NPY_OP_ITFLAG_CAST because the types aren't equivalent
Iterator: Checking casting for operand 1
op: dtype('O'), iter: dtype('O')
Iterator: Checking casting for operand 2
op: <null>, iter: dtype('O')
Iterator: Setting allocated stride 1 for iterator dimension 0 to 8
Iterator: Setting allocated stride 0 for iterator dimension 1 to 16
Iterator: Copying inputs to buffers
Iterator: Buffer requires init, memsetting to 0
Iterator: Copying operand 0 to buffer (8192 items)
Any buffering needed: 1
Iterator: Finished copying inputs to buffers (buffered size is 8192)
iterator loop count 8192
Iterator: Copying buffers to outputs
Iterator: Freeing refs and zeroing buffer of operand 0
Iterator: Finished copying buffers to outputs
Iterator: Copying inputs to buffers
Iterator: Buffer requires init, memsetting to 0
Iterator: Copying operand 0 to buffer (8192 items)
zsh: segmentation fault (core dumped)  ./runtests.py -g --ipython

It doesn't show it, but I in gdb I can see the segfault happens when clearing op 1's buffer

NumPy/Python version information:

1.20.0 and master

@seberg
Copy link
Member

seberg commented Feb 20, 2021

Ouch :(! Thanks, great detective work/summary. I somewhat think the flag can get unset again, but can't think of when or why :). In any case checking it here is definitely correct.

Can you add the test? Your example triggers an "invalid read" in valgrind even on success, so that is sensitive enough in any case, apparently we did never check it.

I doubt the PyErr_Occurred() check is slow in the grand scheme of things, of course you could have cases where the inner-loop is called for 2 items at a time, and I am not sure if it is measurable in such cases.

The PyErr_Occurred() is better there, I did not bother adding it figuring that for the moment it just stays identically broken (and the near future is to return errors explicitly, as casting does now). However, you need to put it behind a check for whether the iter requires API (i.e. whether we may have release the API). I am not certain that PyErr_Occurred() is safe if the GIL is released.

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Feb 20, 2021
@ahaldane ahaldane force-pushed the fix_nditer_od_segfault branch 2 times, most recently from 24c4144 to 15c1933 Compare February 21, 2021 00:50
@ahaldane
Copy link
Member Author

Good tip about PyErr_Occurred and the API/GIL, that fixed the failures.

Updated now to add the check for a python error in the nditer loops I found elsewhere in numpy. The check was already done in some other places. I also added a simple test.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, would be fine to just put in as is. Planning to do so tomorrow, maybe with the nits changed.

I was considering adding tests for these paths (mainly to have them for the future). This also fails:

arr = np.arange(np.BUFSIZE * 10).reshape(10, -1).astype(str)
oarr = arr.astype(object); oarr[:, -1] = None
np.add(oarr[:, ::-1], arr[:, ::-1], where=np.ones(arr.shape, dtype=bool))

(The where is not necessary).

But, I did not yet manage to somehow trigger incorrect behaviour due to the missing PyErr_Occurred() checks, although they are definitely right. I think it should be possible to somehow get a SystemError due to calling invalid Python code with an exception set, but I can't find a way, so maybe it doesn't matter now...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically not necessary for backport, since NumPy (also master), does currently not even support object einsum or unsafe casts I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't hurt. Remaining more or less in sync with master is sometimes helpful for future backports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: long line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the definition is overshadowed by needs_api in the other branch, could remove the int there, although we have been using local definitions a lot more, with slightly differing opinions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# This test would trigger valgrind "invalid read" before the bugfix.
# This test would trigger valgrind "uninitialized read" before the bugfix.

woops

@ahaldane ahaldane force-pushed the fix_nditer_od_segfault branch from ea77464 to 5d864e8 Compare February 22, 2021 04:10
@ahaldane
Copy link
Member Author

Fixed up nits.

I agree it's hard to see a real case where missing the PyErr_Occurred checks causes a bug, because we usually check for an error pretty soon after the nditer loops anyways. The main opportunity for a bug would be during iternext immediately in the loops' while clause. Iternext calls copy/cast code, so a bug would be most likely to happen there somehow. At least, this PR demonstrates an incorrect code path can be taken in that code, even if that was harmless in this case.

@seberg seberg merged commit d23d968 into numpy:master Feb 22, 2021
@seberg
Copy link
Member

seberg commented Feb 22, 2021

Thanks Allan! I think it is fine to backport with the additional checks, but if anyone wants to be particularly minimal, I am happy to open a backport PR with just the nditer one.

@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Feb 22, 2021
@charris charris removed this from the 1.20.2 release milestone Feb 22, 2021
@charris
Copy link
Member

charris commented Feb 22, 2021

@seberg I've done the backport.

strides_copy, innerloopdata);

if (needs_api && PyErr_Occurred()) {
break;
Copy link
Member

@seberg seberg Mar 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This break needs to be a goto finish or so. Pandas is running into this in their CI. Will look into it tomorrow at the latest, but the only tricky thing is that it might be nice to have a test.

xref: pandas-dev/pandas#40158

Copy link
Member Author

@ahaldane ahaldane Mar 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to check it out now and make a test+PR, unless you already have it done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I thought a test would be more complicated for some reason. I think the iternext technically also should be checked, although I am willing to bet is not a regression.

@charris charris added 09 - Backport-Candidate PRs tagged should be backported and removed 09 - Backport-Candidate PRs tagged should be backported labels Mar 2, 2021
ahaldane added a commit to ahaldane/numpy that referenced this pull request Mar 3, 2021
ahaldane added a commit to ahaldane/numpy that referenced this pull request Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants