ENH: allow NEP 42 dtypes to work with np.char #22863
Conversation
```c
        return ret;
    }
}
```
This section is needed because some of the functions in np.char pass the scalar type rather than a dtype instance to _vec_string. For example, np.char.multiply: https://github.com/numpy/numpy/blob/main/numpy/core/defchararray.py#L373-L374
Sorry for only looking at this closer now. I have to admit, I am not quite convinced that adding more logic here is the best path. I would be happy to allow e.g. `type(np.dtype("S"))(8)` to succeed (i.e. allow instantiating these legacy dtypes with a length from their type).
We pass `(type, size)` to `_vec_string`, but it seems to me that there are only a few cases where the code could easily be changed to `type(dtype)(size)` instead.
seberg left a comment
The multiarray changes look clean to me. The other change I am unfortunately not sure about; what do you think about the suggestion of allowing and using `type(np.dtype("S"))(8)`?
```c
PyTypeObject* scalar_type = DType->scalar_type;
PyObject* allowed_scalar_types = PyTuple_Pack(
        2, (PyObject*)&PyBytes_Type, (PyObject*)&PyUnicode_Type);
int is_sub = PyObject_IsSubclass(
        (PyObject*)scalar_type, allowed_scalar_types);
```
Just use `PyUnicode_Check() || PyBytes_Check()`; building that tuple is both tedious and probably just slower anyway.
Ack, sorry, you would need `PyType_IsSubtype(scalar_type, &PyBytes_Type) || ...`. I would still just use that (`PyType_IsSubtype` only checks for exact subclasses, but that is OK/good here).
```c
Py_DECREF(allowed_scalar_types);
if (is_sub == 1) {
    method = PyObject_GetAttr((PyObject*)scalar_type, method_name);
} else {
```
Suggested change:
```diff
-} else {
+} else
+{
```
Just to note, I don't care.
Arrrg! Sorry, I typed the above, it should have been:
```c
}
else {
```
🤦
OK, current plan is to do a followup making `type(np.dtype("S"))(8)` work correctly. I think what I'll need to do is modify [...]. Once that's working I'll come back to this and simplify things. Thanks for the review!
Note that enhancements need a release note.
Thanks. I would also be totally fine with not digging so deep and just adding a special path for these two into [...].
I'm confused which code path this comment refers to. Was it a typo and you meant [...]?
No typo, just using the Python dunder name to refer to the C slot implementation: https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/descriptor.c#L2294 (i.e. where the current error comes from).
Ohh, sorry... I had forgotten/missed that I already override `__new__` with [...].
Following up from #22863 (comment), this makes it possible to create string dtype instances from an abstract string DTypeMeta.
Co-authored-by: Sebastian Berg <sebastianb@nvidia.com>
Force-pushed: 297bf2e to 90f3f1f
This should be ready for review again, assuming the tests pass on the CI runners. I accidentally typo'd the release note filename for #22923 to use this PR number, which is why it's included in here.
seberg left a comment
Just comments in passing; too late for a careful review, but I thought I would comment. I don't think you can do that comparison thing; maybe it's even just not needed? I bet practically nobody needs `rstrip`, and the normal comparisons work for string arrays otherwise?
```c
    return (PyArray_Descr *)res;
}
else if (PyDataType_ISUNSIZED(type)) {
```
In principle we could now move all of this into Python, but I am happy to skip it here. (To simplify the C-code to always get a dtype instance.)
Force-pushed: eba92bb to 9b6f554
Force-pushed: 9b6f554 to f4085a6
This should be ready to review again.
C code looks great now (we don't usually have a newline before the opening bracket on `else {` right now, but I suspect we may have at some point, so...). (EDIT: Haha, I see that was my mistake above anyway!)
As for the Python code: I like the general change, but I added some comments (mainly because this code was already a mess before you looked at it, of course!).
numpy/core/defchararray.py
Outdated
```python
ret = numpy.asarray(result.tolist())
dtype = getattr(output_dtype_like, 'dtype', None)
if dtype is not None:
    return ret.astype(type(dtype)(_get_num_chars(ret)))
```
Hmmm, at least add a `copy=False` to the `astype` call here. I guess the objects should usually all be bytes or strings, and thus the array already correct (typically).
We can also drop the `.tolist()`, unless I am very mistaken. The reason for it smells like a bug I fixed 3 years ago.
For whatever you have, I guess the subclass info (and thus the correct type) may get lost. It should possibly be spelled as `asarray(result.tolist(), dtype=type(output_dtype_like.dtype))`, but we can't easily do that yet.
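For reference, `astype(..., copy=False)` skips the copy when no cast is needed, which is why adding it here is essentially free; a minimal illustration:

```python
import numpy as np

a = np.array(["this", "is", "an", "array"])

# copy=False returns the input unchanged when it already has the target dtype
b = a.astype(a.dtype, copy=False)
assert b is a          # same object, no copy made

# the default copy=True always allocates a new array
c = a.astype(a.dtype)
assert c is not a
```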
It looks like I need to keep the tolist(), at least for partition to work correctly:
```text
Traceback (most recent call last):
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/nathan/Documents/numpy-experiments/setup-asciidtype.py", line 38, in <module>
    print(repr(np.char.partition(arr, "r")))
  File "<__array_function__ internals>", line 185, in partition
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 1244, in partition
    return _to_string_or_unicode_array(
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 90, in _to_string_or_unicode_array
    return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
TypeError: Can only store ASCII text in a ASCIIDType array.
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/nathan/Documents/numpy/numpy/core/defchararray.py(90)_to_string_or_unicode_array()
-> return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
(Pdb) p ret
array([('this', '', ''), ('is', '', ''), ('an', '', ''),
       ('a', 'r', 'ray')], dtype=object)
(Pdb) p numpy.asarray(result.tolist())
array([['this', '', ''],
       ['is', '', ''],
       ['an', '', ''],
       ['a', 'r', 'ray']], dtype='<U4')
```
I don't think this is a problem for np.str_ or np.unicode_ so indeed we could get rid of tolist() for those, but for asciidtype we end up passing asciidtype's setitem a tuple of strings, which I would expect numpy to have already processed. I'm also a tiny bit nervous about breaking code using numpy in unexpected ways, so I think I'd prefer to just leave the tolist() as-is for now.
Yeah, OK, let's keep it; it would be a followup issue anyway, really. I suspect `partition` is just badly written here, since it should store the 3 objects into an `(..., 3)` object array right away (and doesn't). But probably that goes all the way into `_vec_string`, so...
Thanks for having a look!
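For context, this is the behavior under discussion with NumPy's builtin string dtype; the `.tolist()` round-trip is what turns the object array of `(head, sep, tail)` tuples into a proper `(n, 3)` string array:

```python
import numpy as np

arr = np.array(["this", "is", "an", "array"])

# np.char.partition splits each element at the first separator,
# returning an (n, 3) array of (head, sep, tail) triples.
out = np.char.partition(arr, "r")

# elements that don't contain the separator get empty sep/tail strings
print(out)
```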
numpy/core/defchararray.py
Outdated
```diff
-if (isinstance(x, str) or
-        issubclass(numpy.asarray(x).dtype.type, unicode_)):
+scalar_type = numpy.asarray(x).dtype.type
+if (isinstance(x, str) or issubclass(scalar_type, unicode_)):
```
This smells a bit like it may have been a fast path for a scalar unicode? OTOH, I am not sure that is really worthwhile...
EDIT: sorry, to be more clear: if the `isinstance` check is a fast path, it has to come before converting with `asarray`.
The logic ended up a little more complicated for `_use_unicode`, but I restored the fast path.
```python
    return unicode_
return string_
if (isinstance(arr, str) or
        issubclass(numpy.asarray(arr).dtype.type, str)):
```
I changed this to only check for `str`; `np.unicode_` is a subclass of `str` anyway. I am comfortable with this: the tests should notice if something is incorrect here.
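This can be checked directly (in NumPy 1.x, `np.unicode_` is an alias of `np.str_`; the alias was removed in 2.0):

```python
import numpy as np

# NumPy's unicode scalar type subclasses Python's str,
# so a single issubclass(..., str) check covers both.
assert issubclass(np.str_, str)

# the bytes scalar does not, so bytes arrays take the other branch
assert not issubclass(np.bytes_, str)
```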
```python
raise TypeError(
    "np.char.add() requires both arrays of the same dtype kind, but "
    f"got dtypes: '{arr.dtype}' and '{arr2.dtype}' (the few cases "
    "where this used to work often lead to incorrect results).")
```
After some back and forth, I decided that trying to get the logic right seems the wrong approach, so I nuked it. The things that didn't already fail were:
- bytes + unicode, which converted the unicode to bytes (all 4 bytes for each character) and then stored it into a too-short unicode result as a byte string.
- unicode + object (where the object array contains strings), which would always use `np.dtype("O").itemsize` for the number of characters, and thus shorten longer object strings anyway.

Neither seems reasonable to me.
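With the mixed-kind paths removed, `np.char.add` still works as expected when both inputs have the same dtype kind (in NumPy versions with this change, mixing bytes and unicode arrays raises the `TypeError` above instead):

```python
import numpy as np

a = np.array(["foo", "bar"])
b = np.array(["baz", "quux"])

# elementwise string concatenation; the result dtype is widened
# to hold the longest concatenated value
out = np.char.add(a, b)
```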
bytes + void made a little bit more sense, but even in that case you strip the NUL padding of the void by casting it to bytes. The promotion of void to bytes doesn't really make sense...
It never really worked anyway, so nuke it and forget about it...
Force-pushed: 75464bf to b76f7f9
seberg left a comment
Well, this was much more tedious and time-consuming to figure out than I would have liked, but I am happy to put it in now. @ngoldbaum can you confirm if you think the last choices seem fine?
`_get_num_chars()` is still broken for user dtypes (in general, at least). I am happy to consider this a followup. At this point this is more cleanup than additional code paths.
This works for me! Thanks for your attention on this and help getting it across the line.
OK, thanks @ngoldbaum, let's put it in then.
This makes it possible for new-style NEP 42 string dtypes like ASCIIDType to work with the functions in `np.char`. It will only work with dtypes whose scalar subclasses `str` or `bytes`. I also assume that you can create instances of the user dtype from Python like `dtype_instance = CustomDType(size_in_bytes)`. This is a pretty big assumption about the API of the dtype; I'm not sure offhand how I can do this more portably or more safely.

I also added a new macro, `NPY_DT_is_user_defined`, which checks `dtype->type_num == -1`; this is currently true for all custom dtypes using the experimental dtype API. The new macro is needed because `NPY_DT_is_legacy` will return false for `np.void`.

I don't know how I can add tests to numpy for this change without adding a custom string dtype. So far I've tested this locally using asciidtype. I've included some more detail on how I'm testing this below:
Details
To test this out, you'll need to install asciidtype (using an unmerged PR for asciidtype) and run a script like the following:
you should get the following output:
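The original script requires the external asciidtype package, but the same `np.char` code path can be exercised with the builtin fixed-width bytes dtype, whose scalar (`np.bytes_`) subclasses `bytes`, the same property this PR requires of user-defined string dtypes:

```python
import numpy as np

# the builtin "S" dtype's scalar subclasses bytes
arr = np.array([b"this", b"is", b"an", b"array"])

upper = np.char.upper(arr)          # elementwise bytes.upper via _vec_string
doubled = np.char.multiply(arr, 2)  # repeats each element, widening the dtype
```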