BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis#15385
BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis#15385seberg merged 3 commits intonumpy:masterfrom
Conversation
4e40618 to
719c892
Compare
719c892 to
41fc5e8
Compare
There was a problem hiding this comment.
Pretty sure this was a bug, but constructing a failing string is non-trivial
|
Is this the patch you mentioned for #15363? If so the example would be a nice regression test 😄 |
This comment has been minimized.
This comment has been minimized.
|
It's not the cause of the segfault, but it turns out that using |
|
Welcome to the swamp :) |
a6a7957 to
a5d8554
Compare
7f3fd86 to
c1e7eae
Compare
|
@Zac-HD: Regression test added |
c1e7eae to
634263a
Compare
|
LGTM |
|
I have two possible worries with this patch:
If reviewers can constructs possible corner cases for this second point, it would be helpful. |
Here's one such case, which used to error, but now assigns the UCS4 codepoints directly: >>> a = np.zeros(1, 'V8')
>>> a[:] = 'te'
>>> a
array([b'\x74\x00\x00\x00\x65\x00\x00\x00'], dtype='|V8')I don't think this is any weirder than the behavior of |
|
And another one: >>> a = np.str_("test")
>>> b"%s" % a
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'which again, isn't much weirder than |
seberg
left a comment
There was a problem hiding this comment.
The changes look good to me, although I did not look super closely. I think the Length discovery in the dtype discovery code was correct before. Was worried for a second about the delayed initialization, but it seems fine.
As for supporting the buffer interface... Yeah, it should mostly be strange with respect to void (and maybe including buffered) datatypes. But overall, it should be strange enough that it does not matter. I suppose we could mention it in the release notes.
These APIs work with either UCS2 or UCS4, depending on the value of `Py_UNICODE_WIDE`. After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this. Fixes numpygh-3258 Fixes numpygh-15363
634263a to
9a49dfb
Compare
This eliminates the need for special casing in `np.generic.__reduce__`
9a49dfb to
d0b7b66
Compare
|
Updated with a test of |
|
|
||
| if unicode: | ||
| if sys.maxunicode == 0xffff: | ||
| # On a narrow Python build, the buffer for Unicode | ||
| # strings is UCS2, which doesn't match the buffer for | ||
| # NumPy Unicode types, which is ALWAYS UCS4. | ||
| # Therefore, we need to convert the buffer. On Python | ||
| # 2.6 and later, we can use the utf_32 codec. Earlier | ||
| # versions don't have that codec, so we convert to a | ||
| # numerical array that matches the input buffer, and | ||
| # then use NumPy to convert it to UCS4. All of this | ||
| # should happen in native endianness. | ||
| obj = obj.encode('utf_32') | ||
| else: | ||
| obj = str(obj) | ||
| else: | ||
| # Let the default Unicode -> string encoding (if any) take | ||
| # precedence. | ||
| obj = bytes(obj) | ||
|
|
||
| return chararray(shape, itemsize=itemsize, unicode=unicode, | ||
| buffer=obj, order=order) |
There was a problem hiding this comment.
This code never made sense in the first place, as chararray.__new__ has an identity crisis over whether it's trying to be np.ndarray.__new__ or np.array, and accepts str objects in place of the buffer.
Edit: Perhaps it was a workaround for the original bug.
|
The test failure is the windows heisenbug. I will put this in soon if there are no objections. |
|
Thanks for the threat Matti, and thanks Eric. Had another glance over and it looks good to me, so putting it in. The test takes around half a second on my computer. A bit slow, but probably fine (and easily changed later). |
The bug was reported in numpy#15363 and fixed in numpy#15385, before Numpy decided to allow Hypothesis in it's own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.
These APIs work with either UCS2 or UCS4, depending on the value of
Py_UNICODE_WIDE.After python 3.3, there's a better way to handle this type of thing, which means we no longer have to care about this.
Fixes gh-3258
Fixes gh-15363