ENH: allow NEP 42 dtypes to work with np.char #22863

Merged
seberg merged 12 commits into numpy:main from ngoldbaum:char-with-nep42-dtypes
Jan 10, 2023

Conversation

@ngoldbaum (Member) commented Dec 21, 2022

This makes it possible for new-style NEP 42 string dtypes like ASCIIDType to work with the functions in np.char.

It will only work with dtypes whose scalar subclasses str or bytes. I also assume that you can create instances of the user dtype from Python like dtype_instance = CustomDType(size_in_bytes). This is a pretty big assumption about the API of the dtype; I'm not sure offhand how I can do this more portably or more safely.

I also added a new macro, NPY_DT_is_user_defined, which checks for dtype->type_num == -1, currently true for all custom dtypes using the experimental dtype API. This new macro is needed because NPY_DT_is_legacy will return false for np.void.

I don't know how I can add tests to numpy for this change without adding a custom string dtype. So far I've tested this locally using asciidtype.

I've included some more detail on how I'm testing this below:

Details

To test this out, you'll need to install asciidtype (using an unmerged PR for asciidtype) and run a script like the following:

import os

os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"

import numpy as np
from asciidtype import ASCIIDType

dtype = ASCIIDType(5)

arr = np.array(["this", "is", "an", "array"], dtype=dtype)

print(repr(np.char.add(arr, arr)))
print(repr(np.char.multiply(arr, 3)))
print(repr(np.char.capitalize(arr)))
print(repr(np.char.endswith(arr, "is")))

You should get the following output:

array(['thisthis', 'isis', 'anan', 'arrayarray'], dtype=ASCIIDType(10))
array(['thisthisthis', 'isisis', 'ananan', 'arrayarrayarray'],
      dtype=ASCIIDType(15))
array(['This', 'Is', 'An', 'Array'], dtype=ASCIIDType(5))
array([ True,  True, False, False])
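For comparison, the same np.char calls already work with NumPy's builtin fixed-width string dtypes. A runnable sketch of the script above, substituting the builtin "U5" dtype for the hypothetical ASCIIDType:

```python
import numpy as np

# Same operations as the asciidtype script, but with the builtin
# fixed-width unicode dtype; np.char grows the result dtype as needed.
arr = np.array(["this", "is", "an", "array"], dtype="U5")

added = np.char.add(arr, arr)        # elementwise concatenation -> U10
tripled = np.char.multiply(arr, 3)   # elementwise repetition -> U15
capped = np.char.capitalize(arr)     # elementwise str.capitalize
ends = np.char.endswith(arr, "is")   # elementwise str.endswith -> bool

print(added)
print(tripled)
print(capped)
print(ends)
```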

return ret;
}
}

Member Author

This section is needed because some of the functions in np.char pass the scalar type rather than a dtype instance to _vec_string. For example, np.char.multiply: https://github.com/numpy/numpy/blob/main/numpy/core/defchararray.py#L373-L374
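For context on the (type, size) pattern: NumPy accepts a (scalar type, size) tuple as a flexible-dtype specifier, which is effectively what these np.char functions hand to _vec_string. A small illustration (my own, not code from the PR):

```python
import numpy as np

# A (scalar type, size) tuple is a valid flexible-dtype specifier and
# is equivalent to constructing the sized dtype instance directly.
spec_u = np.dtype((np.str_, 5))    # same as np.dtype("U5")
spec_s = np.dtype((np.bytes_, 5))  # same as np.dtype("S5")

print(spec_u, spec_s)
```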

Member

Sorry for only looking at this closer now. I have to admit, I am not quite convinced that adding more logic here is the best path. I would be happy to allow e.g. type(np.dtype("S"))(8) to succeed (allow instantiating these legacy dtypes with a length from their type).

We pass (type, size) to _vec_string, but it seems to me that there are only a few cases where the code could easily be changed to type(dtype)(size) instead.

@seberg (Member) left a comment

The multiarray changes look clean to me. The other change I am unfortunately not sure about; what do you think about the suggestion of allowing and using type(np.dtype("S"))(8)?


PyTypeObject* scalar_type = DType->scalar_type;
PyObject* allowed_scalar_types = PyTuple_Pack(2, (PyObject*)&PyBytes_Type, (PyObject*)&PyUnicode_Type);
int is_sub = PyObject_IsSubclass(
(PyObject*)scalar_type, allowed_scalar_types);
Member

Just use: PyUnicode_Check() || PyBytes_Check(), building that tuple is both tedious and probably just slower anyway.

Member

Ack, sorry, you would need PyType_IsSubtype(scalar_type, &PyBytes_Type) || ... I would still just use that (PyType_IsSubtype only checks real subclasses, not virtual ones, but that is OK/good here).
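At the Python level, the check the C snippet implements amounts to asking whether the dtype's scalar type subclasses str or bytes. A hypothetical helper illustrating the idea (not the PR's actual code):

```python
import numpy as np

def scalar_is_stringlike(dtype):
    """True when the dtype's scalar type subclasses str or bytes."""
    return issubclass(dtype.type, (str, bytes))

print(scalar_is_stringlike(np.dtype("U5")))  # np.str_ subclasses str
print(scalar_is_stringlike(np.dtype("S5")))  # np.bytes_ subclasses bytes
print(scalar_is_stringlike(np.dtype("f8")))  # np.float64 does not
```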

Py_DECREF(allowed_scalar_types);
if (is_sub == 1) {
method = PyObject_GetAttr((PyObject*)scalar_type, method_name);
} else {
Member

Suggested change:
-    } else {
+    } else
+    {

Just to note, I don't care.

Member

Arrrg! sorry, I typed the above, it should have been:

}
else {

🤦

@ngoldbaum (Member Author)
OK, current plan is to do a followup making type(np.dtype("S"))(8) work correctly. I think what I'll need to do is modify dtypemeta_wrap_legacy_descriptor so that for string dtypes, instead of using legacy_dtype_default_new, it uses a new function that allows an integer argument and interprets it as the size. Currently type(np.dtype("S"))(8) raises an error, so it should be totally safe to allow this new behavior.

Once that's working I'll come back to this and simplify things. Thanks for the review!

@charris added the "01 - Enhancement" and "56 - Needs Release Note (needs an entry in doc/release/upcoming_changes)" labels Jan 3, 2023
@charris (Member) commented Jan 3, 2023

Note that enhancements need a release note.

@seberg (Member) commented Jan 3, 2023

Thanks. I would also be totally fine with not digging so deep and just adding a special path for these two into np.descr.__new__ (rather than implementing a __new__ for the subclass).

@ngoldbaum (Member Author) commented

> Thanks. I would also be totally fine with not digging so deep and just adding a special path for these two into np.descr.__new__ (rather than implementing a __new__ for the subclass).

I'm confused about which code path this comment refers to. Was it a typo, and did you mean np.dtype.__new__? Although I can't find a base __new__ implementation, so I'm not sure that's right either.

@seberg (Member) commented Jan 3, 2023

No typo, just using the Python dunder name to refer to the C slot implementation: https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/descriptor.c#L2294

(I.e. where the current error comes from)

@seberg (Member) commented Jan 3, 2023

Ohh, sorry... I had forgotten/missed that I already override __new__ with legacy_dtype_default_new... So that is clearly the right thing to expand.

seberg added a commit that referenced this pull request Jan 5, 2023
Following up from #22863 (comment), this makes it possible to create string dtype instances from an abstract string DTypeMeta.

Co-authored-by: Sebastian Berg <sebastianb@nvidia.com>
@ngoldbaum force-pushed the char-with-nep42-dtypes branch 2 times, most recently from 297bf2e to 90f3f1f, on January 6, 2023
@ngoldbaum (Member Author) commented

This should be ready for review again, assuming the tests pass on the CI runners.

I accidentally typo'd the release note filename for #22923 to use this PR number, which is why it's included in here.

@seberg (Member) left a comment

Just comments in passing; it's too late for a careful review, but I thought I would comment. I don't think you can do that comparison thing; maybe it's not even needed? I bet practically nobody needs rstrip, and the normal comparisons otherwise work for string arrays.


return (PyArray_Descr *)res;
}
else if (PyDataType_ISUNSIZED(type)) {
Member

In principle we could now move all of this into Python, but I am happy to skip it here. (To simplify the C-code to always get a dtype instance.)

@ngoldbaum force-pushed the char-with-nep42-dtypes branch 2 times, most recently from eba92bb to 9b6f554, on January 9, 2023
@ngoldbaum force-pushed the char-with-nep42-dtypes branch from 9b6f554 to f4085a6 on January 9, 2023
@ngoldbaum (Member Author) commented

This should be ready to review again.

@seberg (Member) left a comment

C-Code looks great now (we don't usually have a newline for the opening bracket on else { right now, but I suspect we may have at some point so...) (EDIT: Haha, I see that was my mistake above anyway!)

Sorry, for the Python code: I like the general change, but I added some comments (mainly because this code was already a mess before you looked at it, of course!).

ret = numpy.asarray(result.tolist())
dtype = getattr(output_dtype_like, 'dtype', None)
if dtype is not None:
return ret.astype(type(dtype)(_get_num_chars(ret)))
Member

Hmmm, at least add a copy=False to the astype call here. I guess the objects should usually be all bytes or strings and thus the array already be correct (typically).

We can drop the .tolist() also, unless I am very mistaken. The reason for that smells like a bug I fixed 3 years ago.

For whatever you have, I guess the subclass info (and thus correct type) may get lost. It should possibly be spelled as asarray(result.tolist(), dtype=type(output_dtype_like.dtype)), but we can't easily yet.
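The copy=False suggestion relies on astype being allowed to return the input array unchanged when no conversion is needed. A quick illustration (my own, not from the PR):

```python
import numpy as np

arr = np.array(["abc", "de"], dtype="U3")

# With copy=False, astype can return the input array itself when the
# requested dtype already matches; the default copy=True always copies.
same = arr.astype("U3", copy=False)
copied = arr.astype("U3")

print(same is arr, copied is arr)  # True False
```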

Member Author

It looks like I need to keep the tolist(), at least for partition to work correctly:

Traceback (most recent call last):
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1726, in main
    pdb._runscript(mainpyfile)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/pdb.py", line 1586, in _runscript
    self.run(statement)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/bdb.py", line 597, in run
    exec(cmd, globals, locals)
  File "<string>", line 1, in <module>
  File "/home/nathan/Documents/numpy-experiments/setup-asciidtype.py", line 38, in <module>
    print(repr(np.char.partition(arr, "r")))
  File "<__array_function__ internals>", line 185, in partition
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 1244, in partition
    return _to_string_or_unicode_array(
  File "/home/nathan/Documents/numpy/numpy/core/defchararray.py", line 90, in _to_string_or_unicode_array
    return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
TypeError: Can only store ASCII text in a ASCIIDType array.
Uncaught exception. Entering post mortem debugging
Running 'cont' or 'step' will restart the program
> /home/nathan/Documents/numpy/numpy/core/defchararray.py(90)_to_string_or_unicode_array()
-> return ret.astype(type(dtype)(_get_num_chars(ret)), copy=False)
(Pdb) p ret
array([('this', '', ''), ('is', '', ''), ('an', '', ''),
       ('a', 'r', 'ray')], dtype=object)
(Pdb) p numpy.asarray(result.tolist())
array([['this', '', ''],
       ['is', '', ''],
       ['an', '', ''],
       ['a', 'r', 'ray']], dtype='<U4')

I don't think this is a problem for np.str_ or np.unicode_ so indeed we could get rid of tolist() for those, but for asciidtype we end up passing asciidtype's setitem a tuple of strings, which I would expect numpy to have already processed. I'm also a tiny bit nervous about breaking code using numpy in unexpected ways, so I think I'd prefer to just leave the tolist() as-is for now.
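For reference, this is what the builtin-dtype path produces: np.char.partition splits each element at the first occurrence of the separator into (before, sep, after), giving a trailing axis of length 3 (my own example):

```python
import numpy as np

arr = np.array(["this", "is", "an", "array"])
parts = np.char.partition(arr, "r")

# Each row is str.partition applied elementwise: (before, sep, after).
print(parts.shape)  # (4, 3)
print(parts[3])     # ['a' 'r' 'ray']
```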

Member

Yeah, OK, let's keep it; it would be a followup issue anyway, really. I suspect partition is just badly written here, since it should store the 3 objects into an (..., 3) object array right away (and doesn't). But probably that goes all the way down into _vec_string, so...

Thanks for having a look!

if (isinstance(x, str) or
issubclass(numpy.asarray(x).dtype.type, unicode_)):
scalar_type = numpy.asarray(x).dtype.type
if (isinstance(x, str) or issubclass(scalar_type, unicode_)):
@seberg (Member) commented Jan 9, 2023

This smells a bit like it may have been a fast-path for a scalar unicode? OTOH, I am not sure that is really worthwhile...

EDIT: sorry to be more clear. If the isinstance check is a fast path, it has to come before converting with asarray.

Member Author

The logic ended up a little more complicated for _use_unicode but I restored the fast path.
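A sketch of the fast-path idea: check isinstance first, and only fall back to an asarray conversion when the input is not already a plain str (hypothetical helper, not the PR's exact code):

```python
import numpy as np

def use_unicode_sketch(x):
    """Hypothetical sketch of the fast path: a plain str input returns
    immediately, before paying for an asarray conversion; everything
    else falls back to inspecting the converted array's scalar type."""
    if isinstance(x, str):
        return np.str_
    if issubclass(np.asarray(x).dtype.type, np.str_):
        return np.str_
    return None

print(use_unicode_sketch("hello"))
print(use_unicode_sketch(np.array(["a", "b"])))
print(use_unicode_sketch(np.array([b"a"])))
```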

return unicode_
return string_
if (isinstance(arr, str) or
issubclass(numpy.asarray(arr).dtype.type, str)):
Member

I changed this to only check for str; np.unicode_ is a subclass of str anyway. I am comfortable with this: the tests should notice if something is incorrect here.

raise TypeError(
"np.char.add() requires both arrays of the same dtype kind, but "
f"got dtypes: '{arr.dtype}' and '{arr2.dtype}' (the few cases "
"where this used to work often lead to incorrect results).")
Member

After some back and forth, I decided that trying to get the logic right seems the wrong approach. So I nuked it. The things that didn't already fail were:

  • bytes + unicode, which converted the unicode to bytes (all 4 bytes for each character) and then stored it into a too-short unicode result as a byte string.
  • unicode + object (where the object array contains strings), which would always use np.dtype("O").itemsize for the number of characters, and thus truncate longer object strings anyway.

Neither seems reasonable to me.

Member

bytes+void made a little bit more sense, but even in that case you strip NUL padding of the void by casting it to bytes. The promotion of void to bytes doesn't really make sense...

It never really worked anyway, so nuke it and forget about it...

@seberg force-pushed the char-with-nep42-dtypes branch from 75464bf to b76f7f9 on January 10, 2023
@seberg (Member) left a comment

Well, this was much more tedious and time-consuming to figure out than I would have liked, but I am happy to put it in now. @ngoldbaum can you confirm whether the last choices seem fine to you?

_get_num_chars() is still broken for user dtypes (in general, at least). I am happy to consider this a follow-up; at this point it is more cleanup than additional code paths.

@ngoldbaum (Member Author) commented

This works for me! Thanks for your attention on this and help getting it across the line.

@seberg (Member) commented Jan 10, 2023

OK, thanks @ngoldbaum, let's put it in then.

@seberg seberg merged commit 737b064 into numpy:main Jan 10, 2023