ENH: allow NEP 42 dtypes to work with np.char #22863

Merged
seberg merged 12 commits into numpy:main from ngoldbaum:char-with-nep42-dtypes
Jan 10, 2023

Conversation

@ngoldbaum (Member) commented Dec 21, 2022

This makes it possible for new-style NEP 42 string dtypes like ASCIIDType to work with the functions in np.char.

It will only work with dtypes whose scalar subclasses str or bytes. I also assume that you can create instances of the user dtype from Python like dtype_instance = CustomDType(size_in_bytes). This is a pretty big assumption about the API of the dtype; I'm not sure offhand how I can do this more portably or more safely.

I also added a new macro, NPY_DT_is_user_defined, which checks for dtype->type_num == -1, currently true for all custom dtypes using the experimental dtype API. This new macro is needed because NPY_DT_is_legacy will return false for np.void.

I don't know how I can add tests to numpy for this change without adding a custom string dtype. So far I've tested this locally using asciidtype.

I've included some more detail on how I'm testing this below:

Details

To test this out, you'll need to install asciidtype (using an unmerged PR for asciidtype) and run a script like the following:

import os

os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"

import numpy as np
from asciidtype import ASCIIDType

dtype = ASCIIDType(5)

arr = np.array(["this", "is", "an", "array"], dtype=dtype)

print(repr(np.char.add(arr, arr)))
print(repr(np.char.multiply(arr, 3)))
print(repr(np.char.capitalize(arr)))
print(repr(np.char.endswith(arr, "is")))

You should get the following output:

array(['thisthis', 'isis', 'anan', 'arrayarray'], dtype=ASCIIDType(10))
array(['thisthisthis', 'isisis', 'ananan', 'arrayarrayarray'],
      dtype=ASCIIDType(15))
array(['This', 'Is', 'An', 'Array'], dtype=ASCIIDType(5))
array([ True,  True, False, False])
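For comparison, the same np.char calls already work with NumPy's builtin fixed-width string dtypes. A runnable sketch of the script above, substituting the builtin "U5" dtype for the hypothetical ASCIIDType:

```python
import numpy as np

# Same operations as the asciidtype script, but with the builtin
# fixed-width unicode dtype; np.char grows the result dtype as needed.
arr = np.array(["this", "is", "an", "array"], dtype="U5")

added = np.char.add(arr, arr)        # elementwise concatenation -> U10
tripled = np.char.multiply(arr, 3)   # elementwise repetition -> U15
capped = np.char.capitalize(arr)     # elementwise str.capitalize
ends = np.char.endswith(arr, "is")   # elementwise str.endswith -> bool

print(added)
print(tripled)
print(capped)
print(ends)
```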

return ret;
}
}

Member Author

This section is needed because some of the functions in np.char pass the scalar type rather than a dtype instance to _vec_string. For example, np.char.multiply: https://github.com/numpy/numpy/blob/main/numpy/core/defchararray.py#L373-L374
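For context on the (type, size) pattern: NumPy accepts a (scalar type, size) tuple as a flexible-dtype specifier, which is effectively what these np.char functions hand to _vec_string. A small illustration (my own, not code from the PR):

```python
import numpy as np

# A (scalar type, size) tuple is a valid flexible-dtype specifier and
# is equivalent to constructing the sized dtype instance directly.
spec_u = np.dtype((np.str_, 5))    # same as np.dtype("U5")
spec_s = np.dtype((np.bytes_, 5))  # same as np.dtype("S5")

print(spec_u, spec_s)
```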

Member

Sorry for only looking at this closer now. I have to admit, I am not quite convinced that adding more logic here is the best path. I would be happy to allow e.g. type(np.dtype("S"))(8) to succeed (allow instantiating these legacy dtypes with a length from their type).

We pass (type, size) to _vec_string, but it seems to me that there are only a few cases where the code could easily be changed to type(dtype)(size) instead.

@seberg (Member) left a comment

The multiarray changes look clean to me. The other change I am unfortunately not sure about; what do you think about the suggestion of allowing and using type(np.dtype("S"))(8)?


PyTypeObject* scalar_type = DType->scalar_type;
PyObject* allowed_scalar_types = PyTuple_Pack(2, (PyObject*)&PyBytes_Type, (PyObject*)&PyUnicode_Type);
int is_sub = PyObject_IsSubclass(
(PyObject*)scalar_type, allowed_scalar_types);
Member

Just use: PyUnicode_Check() || PyBytes_Check(), building that tuple is both tedious and probably just slower anyway.

Member

Ack, sorry, you would need PyType_IsSubtype(scalar_type, &PyBytes_Type) || ... I would still just use that (PyType_IsSubtype only checks real subclasses, not virtual ones, but that is OK/good here).
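At the Python level, the check the C snippet implements amounts to asking whether the dtype's scalar type subclasses str or bytes. A hypothetical helper illustrating the idea (not the PR's actual code):

```python
import numpy as np

def scalar_is_stringlike(dtype):
    """True when the dtype's scalar type subclasses str or bytes."""
    return issubclass(dtype.type, (str, bytes))

print(scalar_is_stringlike(np.dtype("U5")))  # np.str_ subclasses str
print(scalar_is_stringlike(np.dtype("S5")))  # np.bytes_ subclasses bytes
print(scalar_is_stringlike(np.dtype("f8")))  # np.float64 does not
```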

Py_DECREF(allowed_scalar_types);
if (is_sub == 1) {
method = PyObject_GetAttr((PyObject*)scalar_type, method_name);
} else {
Member

Suggested change:
-    } else {
+    } else
+    {

Just to note, I don't care.

Member

Arrrg! sorry, I typed the above, it should have been:

}
else {

🤦

@ngoldbaum (Member Author)
OK, current plan is to do a followup making type(np.dtype("S"))(8) work correctly. I think what I'll need to do is modify dtypemeta_wrap_legacy_descriptor so that for string dtypes, instead of using legacy_dtype_default_new, it uses a new function that allows an integer argument and interprets it as the size. Currently type(np.dtype("S"))(8) raises an error, so it should be totally safe to allow this new behavior.

Once that's working I'll come back to this and simplify things. Thanks for the review!

@charris added the "01 - Enhancement" and "56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)" labels Jan 3, 2023
@charris (Member) commented Jan 3, 2023

Note that enhancements need a release note.

@seberg (Member) commented Jan 3, 2023

Thanks. I would also be totally fine with not digging so deep and just adding a special path for these two into np.descr.__new__ (rather than implementing a __new__ for the subclass).

@ngoldbaum (Member Author) commented

> Thanks. I would also be totally fine with not digging so deep and just adding a special path for these two into np.descr.__new__ (rather than implementing a __new__ for the subclass).

I'm confused about which code path this comment refers to. Was it a typo, and did you mean np.dtype.__new__? Although I can't find a base __new__ implementation, so I'm not sure that's right either.

@seberg (Member) commented Jan 3, 2023

No typo, just using the Python dunder name to refer to the C slot implementation: https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/descriptor.c#L2294

(I.e. where the current error comes from)

@seberg (Member) commented Jan 3, 2023

Ohh, sorry... I had forgotten/missed that I already override __new__ with legacy_dtype_default_new... So that is clearly the right thing to expand.

seberg added a commit that referenced this pull request Jan 5, 2023
Following up from #22863 (comment), this makes it possible to create string dtype instances from an abstract string DTypeMeta.

Co-authored-by: Sebastian Berg <sebastianb@nvidia.com>
@ngoldbaum force-pushed the char-with-nep42-dtypes branch 2 times, most recently from 297bf2e to 90f3f1f, on January 6, 2023
@ngoldbaum (Member Author) commented

This should be ready for review again, assuming the tests pass on the CI runners.

I accidentally typo'd the release note filename for #22923 to use this PR number, which is why it's included in here.

@seberg (Member) left a comment

Just comments in passing; it's too late for a careful review, but I thought I would comment. I don't think you can do that comparison thing; maybe it's not even needed? I bet practically nobody needs rstrip, and the normal comparisons otherwise work for string arrays.


return (PyArray_Descr *)res;
}
else if (PyDataType_ISUNSIZED(type)) {
Member

In principle we could now move all of this into Python, but I am happy to skip it here. (To simplify the C-code to always get a dtype instance.)

@ngoldbaum force-pushed the char-with-nep42-dtypes branch 2 times, most recently from eba92bb to 9b6f554, on January 9, 2023
@ngoldbaum force-pushed the char-with-nep42-dtypes branch from 9b6f554 to f4085a6 on January 9, 2023
@ngoldbaum (Member Author) commented

This should be ready to review again.

@seberg (Member) left a comment

C-Code looks great now (we don't usually have a newline for the opening bracket on else { right now, but I suspect we may have at some point so...) (EDIT: Haha, I see that was my mistake above anyway!)

Sorry, for the Python code: I like the general change, but I added some comments (mainly because this code was already a mess before you looked at it, of course!).

ret = numpy.asarray(result.tolist())
dtype = getattr(output_dtype_like, 'dtype', None)
if dtype is not None:
return ret.astype(type(dtype)(_get_num_chars(ret)))
Member

Hmmm, at least add a copy=False to the astype call here. I guess the objects should usually be all bytes or strings and thus the array already be correct (typically).

We can drop the .tolist() also, unless I am very mistaken. The reason for that smells like a bug I fixed 3 years ago.

For whatever you have, I guess the subclass info (and thus correct type) may get lost. It should possibly be spelled as asarray(result.tolist(), dtype=type(output_dtype_like.dtype)), but we can't easily yet.
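The copy=False suggestion relies on astype being allowed to return the input array unchanged when no conversion is needed. A quick illustration (my own, not from the PR):

```python
import numpy as np

arr = np.array(["abc", "de"], dtype="U3")

# With copy=False, astype can return the input array itself when the
# requested dtype already matches; the default copy=True always copies.
same = arr.astype("U3", copy=False)
copied = arr.astype("U3")

print(same is arr, copied is arr)  # True False
```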

Member Author

It looks like I need to keep the tolist(), at least for partition to work correctly:

Traceback (most recent call last):
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/nathan/Documents/numpy-experiments/setup-asciidtype.py", line 38, in <module>
    print(repr(np.char.partition(arr, "r")))
  File "<__array_function__ internals>", line 185, in partition
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 1244, in partition
    return _to_string_or_unicode_array(
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 90, in _to_string_or_unicode_array
    return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
TypeError: Can only store ASCII text in a ASCIIDType array.
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/nathan/Documents/numpy/numpy/core/defchararray.py(90)_to_string_or_unicode_array()
-> return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
(Pdb) p ret
array([('this', '', ''), ('is', '', ''), ('an', '', ''),
       ('a', 'r', 'ray')], dtype=object)
(Pdb) p numpy.asarray(result.tolist())
array([['this', '', ''],
       ['is', '', ''],
       ['an', '', ''],
       ['a', 'r', 'ray']], dtype='<U4')

I don't think this is a problem for np.str_ or np.unicode_ so indeed we could get rid of tolist() for those, but for asciidtype we end up passing asciidtype's setitem a tuple of strings, which I would expect numpy to have already processed. I'm also a tiny bit nervous about breaking code using numpy in unexpected ways, so I think I'd prefer to just leave the tolist() as-is for now.
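For reference, this is what the builtin-dtype path produces: np.char.partition splits each element at the first occurrence of the separator into (before, sep, after), giving a trailing axis of length 3 (my own example):

```python
import numpy as np

arr = np.array(["this", "is", "an", "array"])
parts = np.char.partition(arr, "r")

# Each row is str.partition applied elementwise: (before, sep, after).
print(parts.shape)  # (4, 3)
print(parts[3])     # ['a' 'r' 'ray']
```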

Member

Yeah, OK, let's keep it; it would be a followup issue anyway, really. I suspect partition is just badly written here, since it should store the 3 objects into an (..., 3) object array right away (and doesn't). But probably that goes all the way down into _vec_string, so...

Thanks for having a look!

if (isinstance(x, str) or
issubclass(numpy.asarray(x).dtype.type, unicode_)):
scalar_type = numpy.asarray(x).dtype.type
if (isinstance(x, str) or issubclass(scalar_type, unicode_)):
@seberg (Member) commented Jan 9, 2023

This smells a bit like it may have been a fast-path for a scalar unicode? OTOH, I am not sure that is really worthwhile...

EDIT: sorry to be more clear. If the isinstance check is a fast path, it has to come before converting with asarray.

Member Author

The logic ended up a little more complicated for _use_unicode but I restored the fast path.
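A sketch of the fast-path idea: check isinstance first, and only fall back to an asarray conversion when the input is not already a plain str (hypothetical helper, not the PR's exact code):

```python
import numpy as np

def use_unicode_sketch(x):
    """Hypothetical sketch of the fast path: a plain str input returns
    immediately, before paying for an asarray conversion; everything
    else falls back to inspecting the converted array's scalar type."""
    if isinstance(x, str):
        return np.str_
    if issubclass(np.asarray(x).dtype.type, np.str_):
        return np.str_
    return None

print(use_unicode_sketch("hello"))
print(use_unicode_sketch(np.array(["a", "b"])))
print(use_unicode_sketch(np.array([b"a"])))
```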

return unicode_
return string_
if (isinstance(arr, str) or
issubclass(numpy.asarray(arr).dtype.type, str)):
Member

I changed this to only check for str; np.unicode_ is a subclass of str anyway. I am comfortable with this: the tests should notice if something is incorrect here.

raise TypeError(
"np.char.add() requires both arrays of the same dtype kind, but "
f"got dtypes: '{arr.dtype}' and '{arr2.dtype}' (the few cases "
"where this used to work often lead to incorrect results).")
Member

After some back and forth, I decided that trying to get the logic right seems the wrong approach. So I nuked it. The things that didn't already fail were:

  • bytes + unicode, which converted the unicode to bytes (all 4 bytes for each character) and then stored it into a too-short unicode result as a byte string.
  • unicode + object (where the object array contains strings), which would always use np.dtype("O").itemsize for the number of characters, and thus truncate longer object strings anyway.

Neither seems reasonable to me.

Member

bytes+void made a little bit more sense, but even in that case you strip NUL padding of the void by casting it to bytes. The promotion of void to bytes doesn't really make sense...

It never really worked anyway, so nuke it and forget about it...

@seberg force-pushed the char-with-nep42-dtypes branch from 75464bf to b76f7f9 on January 10, 2023
@seberg (Member) left a comment

Well, this was much more tedious and time-consuming to figure out than I would have liked, but I am happy to put it in now. @ngoldbaum can you confirm whether the last choices seem fine to you?

_get_num_chars() is still broken for user dtypes (in general, at least). I am happy to consider this a follow-up; at this point it is more cleanup than additional code paths.

@ngoldbaum (Member Author) commented

This works for me! Thanks for your attention on this and help getting it across the line.

@seberg (Member) commented Jan 10, 2023

OK, thanks @ngoldbaum, let's put it in then.

@seberg seberg merged commit 737b064 into numpy:main Jan 10, 2023