BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis by eric-wieser · Pull Request #15385 · numpy/numpy

eric-wieser · 2020-01-22T11:03:56Z

These APIs work with either UCS2 or UCS4, depending on the value of Py_UNICODE_WIDE.
Since Python 3.3 (PEP 393), there's a better way to handle this, which means we no longer have to care about the character width.

Fixes gh-3258
Fixes gh-15363
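
To illustrate the distinction in the description: since Python 3.3 every `str` is indexed by code point, whereas the old narrow (UCS2) builds stored characters outside the Basic Multilingual Plane as two 16-bit units. A minimal pure-Python sketch, not code from this PR (the name `astral` is only for illustration):

```python
import sys

# A character outside the Basic Multilingual Plane (U+1F600).
astral = "\U0001F600"

# On Python 3.3+ there are no narrow builds: the maximum code point is fixed.
print(hex(sys.maxunicode))  # 0x10ffff

# One code point, so one element when indexing the str.
print(len(astral))  # 1

# A UCS2-style (UTF-16) view needs a surrogate pair: two 16-bit units.
utf16_units = len(astral.encode("utf-16-le")) // 2
print(utf16_units)  # 2

# A UCS4 (UTF-32) view always uses exactly one 32-bit unit per code point.
ucs4_units = len(astral.encode("utf-32-le")) // 4
print(ucs4_units)  # 1
```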

eric-wieser · 2020-01-30T16:27:31Z

numpy/core/src/multiarray/arraytypes.c.src

Pretty sure this was a bug, but constructing a failing string is non-trivial

numpy/core/src/multiarray/buffer.c

numpy/core/src/multiarray/common.c

eric-wieser · 2020-01-30T16:29:56Z

numpy/core/src/multiarray/scalartypes.c.src

Zac-HD · 2020-02-02T10:29:32Z

Is this the patch you mentioned for #15363? If so the example would be a nice regression test 😄

eric-wieser · 2020-02-02T16:36:20Z

It's not the cause of the segfault, but it turns out that using PyUnicode_New(size, 0x10ffff) is not legal unless I actually write a character above 0xffff to the underlying buffer. Otherwise str.__eq__ behaves incorrectly.
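
The constraint described here follows from CPython's compact string storage (PEP 393): the `maxchar` argument to `PyUnicode_New` selects the per-character width, and a string is expected to use the narrowest width that fits its contents. A rough Python sketch of choosing that value from the actual code points; `smallest_maxchar` is a hypothetical helper written for illustration, not a NumPy or CPython function:

```python
def smallest_maxchar(codepoints):
    """Return the smallest maxchar bucket that covers every code point.

    Hypothetical helper: mirrors the buckets PyUnicode_New distinguishes
    (ASCII, Latin-1, BMP, full range); it is not part of any real API.
    """
    highest = max(codepoints, default=0)
    if highest < 0x80:
        return 0x7F       # 1 byte per character, ASCII
    if highest < 0x100:
        return 0xFF       # 1 byte per character, Latin-1
    if highest < 0x10000:
        return 0xFFFF     # 2 bytes per character
    return 0x10FFFF       # 4 bytes per character


# A UCS4 buffer holding only "te" does not need the 4-byte representation.
print(hex(smallest_maxchar([0x74, 0x65])))      # 0x7f
print(hex(smallest_maxchar([0x74, 0x1F600])))   # 0x10ffff
```

Scanning the buffer first and passing the resulting value is one way to stay within the rule; whether the PR does exactly this is not stated in the thread.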

charris · 2020-02-02T16:50:56Z

Welcome to the swamp :)

numpy/core/tests/test_multiarray.py

eric-wieser · 2020-02-04T17:02:08Z

@Zac-HD: Regression test added

mattip · 2020-02-05T12:36:32Z

LGTM

eric-wieser · 2020-02-05T12:42:51Z

I have two possible worries with this patch:

1. np.str_ scalars may get a little more memory intensive.
2. Making np.str_ support the buffer interface might allow silent conversion to bytes where previously none was allowed. I haven't found a case of this yet.

If reviewers can construct possible corner cases for this second point, it would be helpful.

eric-wieser · 2020-02-05T14:46:07Z

> I haven't found a case of this yet.

Here's one such case, which used to error, but now assigns the UCS4 codepoints directly:

>>> a = np.zeros(1, 'V8')
>>> a[:] = 'te'
>>> a
array([b'\x74\x00\x00\x00\x65\x00\x00\x00'], dtype='|V8')

I don't think this is any weirder than the behavior of a[:] = 1.0, which assigns the float bits to the void element.
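
The bytes in that output can be reproduced without NumPy. This sketch assumes a little-endian machine (as the output above suggests) and uses only the standard library to show that the eight void bytes are the UCS4 code points of 'te', in the same way that a float would contribute its raw IEEE 754 bits:

```python
import struct

# Two code points, four bytes each, little-endian: fills a V8 element exactly.
ucs4_bytes = "te".encode("utf-32-le")
print(ucs4_bytes)       # b't\x00\x00\x00e\x00\x00\x00'
print(len(ucs4_bytes))  # 8

# For comparison, the raw bits of the double 1.0 are also eight bytes.
float_bytes = struct.pack("<d", 1.0)
print(float_bytes)      # b'\x00\x00\x00\x00\x00\x00\xf0?'
```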

eric-wieser · 2020-02-05T14:48:26Z

And another one:

>>> a = np.str_("test")
>>> b"%s" % a
b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'

which again, isn't much weirder than b'%s' % np.float_(1.0)
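
The mechanism behind this is that `%s` in bytes formatting accepts any object that exposes the buffer protocol (or defines `__bytes__`) and copies its raw bytes. A standard-library sketch, using a `memoryview` as a stand-in for the `np.str_` scalar (an assumption made so the example runs without NumPy):

```python
# Stand-in for the scalar's buffer: the UCS4 (little-endian) bytes of "test".
ucs4_buffer = memoryview("test".encode("utf-32-le"))

# Bytes formatting copies the exported buffer verbatim.
formatted = b"%s" % ucs4_buffer
print(formatted)  # b't\x00\x00\x00e\x00\x00\x00s\x00\x00\x00t\x00\x00\x00'

# A plain str exports no buffer, so the same formatting is rejected.
try:
    b"%s" % "test"
    rejected = False
except TypeError:
    rejected = True
print(rejected)  # True
```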

seberg

The changes look good to me, although I did not look super closely. I think the length discovery in the dtype discovery code was correct before. Was worried for a second about the delayed initialization, but it seems fine.
As for supporting the buffer interface... Yeah, it should mostly be strange only with respect to void (and maybe buffered) datatypes. But overall, it should be an unusual enough case that it does not matter. I suppose we could mention it in the release notes.

numpy/core/src/multiarray/common.c

numpy/core/src/multiarray/arraytypes.c.src

eric-wieser · 2020-02-08T21:24:15Z

Updated with a test of memoryview(np.str_)

eric-wieser · 2020-02-08T21:51:27Z

numpy/core/defchararray.py


-        if unicode:
-            if sys.maxunicode == 0xffff:
-                # On a narrow Python build, the buffer for Unicode
-                # strings is UCS2, which doesn't match the buffer for
-                # NumPy Unicode types, which is ALWAYS UCS4.
-                # Therefore, we need to convert the buffer.  On Python
-                # 2.6 and later, we can use the utf_32 codec.  Earlier
-                # versions don't have that codec, so we convert to a
-                # numerical array that matches the input buffer, and
-                # then use NumPy to convert it to UCS4.  All of this
-                # should happen in native endianness.
-                obj = obj.encode('utf_32')
-            else:
-                obj = str(obj)
-        else:
-            # Let the default Unicode -> string encoding (if any) take
-            # precedence.
-            obj = bytes(obj)
-
        return chararray(shape, itemsize=itemsize, unicode=unicode,
                         buffer=obj, order=order)


This code never made sense in the first place, as chararray.__new__ has an identity crisis over whether it's trying to be np.ndarray.__new__ or np.array, and accepts str objects in place of the buffer.

Edit: Perhaps it was a workaround for the original bug.
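
One concrete oddity in the removed branch, offered as an observation rather than a diagnosis of the original bug: the generic `utf_32` codec prepends a byte-order mark, so its output is four bytes longer than the raw UCS4 data. A standard-library sketch:

```python
import codecs

text = "te"

# The generic codec emits a BOM in native byte order before the code points.
with_bom = text.encode("utf_32")
print(len(with_bom))  # 12
has_bom = with_bom[:4] in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)
print(has_bom)  # True

# An explicit-endianness codec emits only the code points.
without_bom = text.encode("utf-32-le")
print(len(without_bom))  # 8
```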

mattip · 2020-02-12T08:42:04Z

The test failure is the Windows heisenbug. I will put this in soon if there are no objections.

numpy/core/include/numpy/arrayscalars.h

seberg · 2020-02-14T00:17:02Z

Thanks for the threat Matti, and thanks Eric. Had another glance over and it looks good to me, so putting it in. The test takes around half a second on my computer. A bit slow, but probably fine (and easily changed later).

The bug was reported in numpy#15363 and fixed in numpy#15385, before NumPy decided to allow Hypothesis in its own test suite. Since it does now, I thought it would be nice to include the test that found the bug as well as the more specific regression test I wrote.

eric-wieser mentioned this pull request Jan 22, 2020

MAINT: clean up some macros in scalarapi.c #15386

Merged

eric-wieser added the 00 - Bug and 25 - WIP labels Jan 22, 2020

This was referenced Jan 23, 2020

MAINT/BUG: Fixups to scalar base classes #15393

Merged

MAINT: Inline gentype_getreadbuf #15422

Merged

MAINT: Use the PyArrayScalar_VAL macro where possible #15426

Merged

eric-wieser force-pushed the fix-unicode-ucs2 branch 3 times, most recently from 4e40618 to 719c892 Compare January 30, 2020 16:22

eric-wieser changed the title ~~WIP,BUG: Remove some internal UCS2 uses~~ BUG, MAINT: Stop using the error-prone deprecated Py_UNICODE apis Jan 30, 2020

eric-wieser force-pushed the fix-unicode-ucs2 branch from 719c892 to 41fc5e8 Compare January 30, 2020 16:24

eric-wieser commented Jan 30, 2020

numpy/core/src/multiarray/buffer.c (outdated, resolved)

eric-wieser commented Jan 30, 2020

numpy/core/src/multiarray/common.c (outdated, resolved)

eric-wieser commented Jan 30, 2020

numpy/core/src/multiarray/scalartypes.c.src (outdated)

Comment on lines 394 to 351

eric-wieser · Jan 30, 2020

See #15477

eric-wieser marked this pull request as ready for review January 30, 2020 16:30

eric-wieser force-pushed the fix-unicode-ucs2 branch from a6a7957 to a5d8554 Compare February 4, 2020 14:41

eric-wieser removed the 25 - WIP label Feb 4, 2020

eric-wieser force-pushed the fix-unicode-ucs2 branch from 7f3fd86 to c1e7eae Compare February 4, 2020 16:08

eric-wieser commented Feb 4, 2020

numpy/core/tests/test_multiarray.py (outdated, resolved)

eric-wieser force-pushed the fix-unicode-ucs2 branch from c1e7eae to 634263a Compare February 5, 2020 09:49

seberg reviewed Feb 8, 2020

numpy/core/src/multiarray/common.c (outdated, resolved)

numpy/core/src/multiarray/arraytypes.c.src (outdated, resolved)

eric-wieser mentioned this pull request Feb 8, 2020

MAINT: Extract repeated code to a helper function #15538

Merged

eric-wieser force-pushed the fix-unicode-ucs2 branch from 634263a to 9a49dfb Compare February 8, 2020 21:00

eric-wieser added 2 commits February 8, 2020 21:23

MAINT,TST: Tidy test_datetime_memoryview a little

48c0b14

ENH: Implement the buffer protocol on numpy str_ scalars

d0b7b66

This eliminates the need for special casing in `np.generic.__reduce__`

eric-wieser force-pushed the fix-unicode-ucs2 branch from 9a49dfb to d0b7b66 Compare February 8, 2020 21:23

eric-wieser commented Feb 8, 2020

eric-wieser mentioned this pull request Feb 9, 2020

NEP 40: Informational NEP about current DTypes #15505

Merged

mattip added the triage review label (Issue/PR to be discussed at the next triage meeting) Feb 12, 2020

hameerabbasi reviewed Feb 12, 2020

numpy/core/include/numpy/arrayscalars.h (resolved)

seberg merged commit 1f9ab28 into numpy:master Feb 14, 2020

eric-wieser mentioned this pull request Jun 17, 2020

.itemsize vs .dtype.itemsize on np.unicode_ objects #8901

Closed

Nimrod0901 mentioned this pull request Mar 14, 2022

DOC: Outdated info in arrays.interface #21197

Closed
