
API: Introduce stringdtype [NEP 55]#25347

Merged
ngoldbaum merged 1 commit into numpy:main from ngoldbaum:stringdtype
Feb 7, 2024

Conversation

@ngoldbaum
Member

@ngoldbaum ngoldbaum commented Dec 8, 2023

This PR adds stringdtype to numpy. See NEP 55 for details.

@ngoldbaum ngoldbaum changed the title API: Introduce stringdtype [NEP 55] [skip ci] API: Introduce stringdtype [NEP 55] Dec 8, 2023
@xor2k
Contributor

xor2k commented Dec 8, 2023

Hi Nathan,

thank you for your great work on UTF-8 strings and their integration into NumPy. This is a very important dtype to support, especially with the widespread use of large language models (LLMs) nowadays.
However, I would like to comment on the serialization. I hope it's not too late at this point (the last time I looked, the serialization was less detailed), but the approach with sidecar_size has the disadvantage that NumPy arrays would no longer be efficiently appendable. Also, while it is convenient not to have to specify what the sidecar data looks like, it makes the format more proprietary, less open, and more difficult to debug. To be precise, I see this part of NEP 1 violated:

Be reverse engineered. Datasets often live longer than the programs that created them. A competent developer should be able to create a solution in his preferred programming language to read most NPY files that he has been given without much documentation.

On top of that, what if the data format for the sidecar data changes? Is it then still possible to read old files with a newer Numpy version?

To overcome these issues, first and foremost I suggest not introducing a .npy file format version 4.0 for this kind of fundamental change. Instead of adding the sidecar data to the same .npy file, I suggest adding it to an extra, standard .npy file. Here is an example: if the user wants to save to "mystrings.npy", the files "mystrings.npy" and "mystrings.npy.idx" could be created, where the latter is also just a regular .npy file and contains the indices/offsets into mystrings.npy that would otherwise end up in the sidecar data. When "mystrings.npy" is loaded, NumPy checks whether there is a mystrings.npy.idx and tries to load it as well. This is an approach I use regularly with video data: the single frames are all in one big (appendable) array, and the index array contains the begin indices/offsets and (redundantly, for fast lookups, sorting, etc.) the lengths and end offsets. This concept is very generic, can be used for all sorts of ragged arrays including text, satisfies the requirements of NEP 1, and is efficiently appendable (or at least moves the burden from the programmer to the filesystem).
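The layout described above could be sketched roughly as follows. This is a minimal illustration for discussion, not NumPy API: `save_ragged`, `load_ragged`, and the exact `.idx` contents (cumulative int64 end offsets) are assumptions.

```python
import numpy as np

def save_ragged(path, strings):
    # Payload file: all UTF-8 bytes back to back.
    # Index file: int64 end offsets into the payload, itself a plain .npy file.
    encoded = [s.encode("utf-8") for s in strings]
    payload = np.frombuffer(b"".join(encoded), dtype=np.uint8)
    ends = np.cumsum([len(b) for b in encoded], dtype=np.int64)
    np.save(path, payload)                # e.g. "mystrings.npy"
    with open(path + ".idx", "wb") as f:  # e.g. "mystrings.npy.idx"
        np.save(f, ends)

def load_ragged(path):
    payload = np.load(path).tobytes()
    ends = np.load(path + ".idx")
    starts = np.concatenate(([0], ends[:-1]))
    return [payload[s:e].decode("utf-8") for s, e in zip(starts, ends)]
```

Appending a string only requires appending bytes to the payload and one int64 to the index, which is what makes the format efficiently appendable.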

Best, Michael

@xor2k
Contributor

xor2k commented Dec 8, 2023

Also, a dedicated index array can come with a custom shape, which would allow for multidimensional ragged arrays (in the future). Wouldn't that be great?

@ngoldbaum
Member Author

ngoldbaum commented Dec 8, 2023

Thankfully, I haven't started work on the serialization yet. I wasn't aware that there was already a capability to add extra data in the npy format. I will look at this and make sure it's usable for my case and if so, update the NEP to drop the part about updating the npy format. I only added that because I thought there was no other option - I'm glad to hear I was wrong!

@xor2k
Contributor

xor2k commented Dec 8, 2023

Okay, good to hear I'm not too late 😅 It's just an alternative idea for where to house the sidecar data, and of course it's up for discussion. I'll send it to the NumPy user group, too. Also, how those .idx files would be handled would still need to be defined, just like a definition of what the sidecar data would look like for the String dtype and all aspects around that. Two options come to mind:

  1. check if some dtype is String and then look for an .npy.idx file. If that fails, raise some error
  2. check for .npy.idx files every time a .npy file is loaded (this could be a variant for ragged arrays)

Ragged arrays are not defined at all in NumPy yet, afaik. I just described a solution I have used frequently (which could be generalized to do the trick). Let's see what others think.

@xor2k
Contributor

xor2k commented Dec 9, 2023

Actually, I've spent quite some time thinking about this entire topic, and I think I've now come up with a solution that could fit very well with how things are done in NumPy.

Above, I mentioned that I've used index files alongside NumPy files to create a sort of ragged array. That is production-proven, it works, and users are happy. However, it would probably make more sense to do it the other way around: have the index file as the main .npy file and declare what used to be the main .npy file a kind of heap. I mentioned that idea in a NumPy community meeting once. The advantage would be a stronger hierarchy: one main file with indices/offsets/pointers that point into heaps in other files. Let me give an example of what that could mean for strings: say the main .npy file "main.npy" has a shape of (100, 100) and the new String dtype. Then the actual strings would all be lined up one after another in "main.npy.strings" (also a regular NumPy file, e.g. with dtype byte or char), and "main.npy" could e.g. contain int64 values that can be interpreted as byte offsets into "main.npy.strings". Or pairs of int64 values containing the byte offsets of the string begin and string end (or the string length). Just a very small example to gather some opinions.

What do you think?
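For concreteness, the inverted layout could be sketched like this. Again purely illustrative: the function names, the `.strings` suffix, and the (begin, end) int64 packing are assumptions for discussion, not anything NumPy has adopted.

```python
import numpy as np

def save_with_heap(path, arr):
    # arr: any-shape array/list of Python strings. The main .npy file keeps
    # the array shape and holds (begin, end) int64 byte offsets in a trailing
    # axis of length 2; the heap file holds all UTF-8 payloads back to back.
    arr = np.asarray(arr, dtype=object)
    encoded = [s.encode("utf-8") for s in arr.ravel()]
    lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    ends = np.cumsum(lengths)
    begins = ends - lengths
    offsets = np.stack([begins, ends], axis=-1).reshape(arr.shape + (2,))
    np.save(path, offsets)                    # e.g. "main.npy"
    with open(path + ".strings", "wb") as f:  # e.g. "main.npy.strings"
        np.save(f, np.frombuffer(b"".join(encoded), dtype=np.uint8))

def load_with_heap(path):
    offsets = np.load(path)
    heap = np.load(path + ".strings").tobytes()
    out = np.empty(offsets.shape[:-1], dtype=object)
    for idx in np.ndindex(out.shape):
        b, e = offsets[idx]
        out[idx] = heap[b:e].decode("utf-8")
    return out
```

Storing both begin and end offsets is redundant but allows constant-time length lookups and keeps deleted/overwritten heap regions representable.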

@seberg
Member

seberg commented Dec 9, 2023

Can we open a new issue for the storage discussion? It's interesting, but I don't think it is central to this PR.

@xor2k
Contributor

xor2k commented Dec 10, 2023

I'll open an issue in the coming week.

@seberg seberg left a comment (Member)

Awesome, thanks for this! I added some (mostly take-or-leave) comments, with a few that are a bit bigger or more important. One thing I am curious about: is there a reason to keep static_string.c in the multiarray folder rather than the new stringdtype one?
I have to look a bit closer at the arena, although I am happy with rolling with it for now (I just missed if/where we reclaim unused memory).

Overall, I am still leaning towards: nobody is going to test this deeply enough unless we ship it, unfortunately. So my preference is to just ship it; having such a dtype is what we want, I think. And this takes care to keep the C ABI flexible, and the worst I can think of on the Python API side is some bugs and that we will end up grabbing the GIL, because it is all trickier than it seems now.

Even if someone else doesn't want to dive into the C code, having a look would help. Maybe reviewing the current tests in the numpy-user-dtypes repo would actually be useful.

c[3] = (0x80 | (code & 0x3f));
return 4;
}
}
Member

Happy to keep these here, they do also look pretty good as their own file with utf8 utilities.

@ngoldbaum
Member Author

Maybe reviewing the current tests in the numpy-user-dtypes repo would actually be useful.

Those tests are coming over here too.

@ngoldbaum
Member Author

One thing I am curious: Is there a reason to keep static_string.c in the multiarray folder rather than the new stringdtype one?

No particular reason. I wrote static_string.c such that the dtype doesn't need to know anything about the implementation of the string library and thought it might be generally useful elsewhere. Happy to move it back in with the dtype since it does logically belong there.

(just missed if/where we reclaim unused memory).

The memory gets reclaimed by the OS in NpyString_free_allocator, which is called in stringdtype_dealloc.

@xor2k
Contributor

xor2k commented Dec 12, 2023

Can we open a new issue for the storage discussion? It's interesting, but I don't think it is central to this PR.

Okay, I've done that, see #25374. However, I think serialization is important for the String dtype, because otherwise one could just keep using object arrays, if I understand correctly.

@ngoldbaum ngoldbaum force-pushed the stringdtype branch 2 times, most recently from cb9ee6a to ee5cab2 on December 13, 2023
@ngoldbaum
Member Author

It looks like all the tests pass except on PyPy. Woohoo! I know at one point we had some concerns about how things would work on 32-bit Python or a big-endian system, so it's good to see that everything seems to work with only very minor adjustments (see the last few commits).

The PyPy failures look like they're happening inside the stringdtype tests and are crashing the PyPy interpreter (I think?), so I'm going to need to figure out how I'm breaking PyPy. Maybe, @mattip, we could pair on this sometime if you're interested in taking a look. Some RPython tracebacks are showing up in the build log:

RPython traceback:
  File "pypy_interpreter.c", line 47668, in BuiltinCode_funcrun_obj
  File "pypy_module_cpyext_5.c", line 64539, in generic_cpy_call__StdObjSpaceConst_funcPtr_SomeI_11
  File "pypy_module_cpyext_3.c", line 52360, in CpyTypedescr_str_realize
  File "pypy_module_cpyext_6.c", line 19973, in unicode_realize
  File "pypy_interpreter.c", line 32661, in _unibuf_to_utf8

Next up after that are docs and serialization.

The only remaining TODO comment in the codebase is because I never implemented casts to and from extended precision types like longdouble. Do I need to do that? It's not terribly hard but also I have other things to worry about.

Also I could add a cast to and from bytes, erroring out if the bytes aren't valid UTF-8 (say).

@seberg
Member

seberg commented Dec 14, 2023

Also I could add a cast to and from bytes, erroring out if the bytes aren't valid UTF-8 (say).

Bytes or void? For bytes, I think it is either latin1 or error out if not ascii? For void, you could allow it but check for valid utf-8. Bytes cast seems like it should exist to me.

Now coercing Python byte objects could and maybe should raise, I think.

I never implemented casts to and from extended precision types like longdouble.

I would consider it a bug to miss them, but I can live with considering it an unfixed bug. Would be nice to have casts for all builtin dtypes though.

@ngoldbaum
Member Author

@mattip take a look at 743084e. Happy to entertain alternative approaches.

I also tried just now to report a bug on the pypy bug tracker about the incorrect implementation of Py_Is on pypy 3.10, but heptapod thought my bug report was spam and refused to let me post it...

@ngoldbaum
Member Author

ngoldbaum commented Dec 17, 2023

I isolated the windows pypy crash and reported it upstream: https://foss.heptapod.net/pypy/pypy/-/issues/4046

@ngoldbaum
Member Author

It turns out we can rely on pointer comparison for singletons, just not for NaN. I refactored things to first do a pointer comparison, then do a NaN check, and finally check for == equality. That seems to work fine on CPython and pypy, so no pypy-specific hacks needed. I also added a new pypy-specific hack to work around the crash I saw last week, which should fix CI and avoid crashes on older unpatched pypy versions.
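The comparison order described here (identity first, then NaN, then `==` equality) has a straightforward Python analogue. This is purely illustrative; `matches_na` is a hypothetical helper sketching the logic, not NumPy API or the actual C implementation.

```python
import math

def matches_na(value, na_object):
    # 1. Identity: safe for true singletons on both CPython and PyPy.
    if value is na_object:
        return True
    # 2. NaN: NaN != NaN, so neither identity nor equality catches a
    #    freshly created NaN object; check it explicitly.
    try:
        if math.isnan(na_object) and math.isnan(value):
            return True
    except TypeError:
        pass  # one of the operands is not float-like; fall through
    # 3. Rich comparison as the final fallback.
    return bool(value == na_object)
```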

@ngoldbaum
Member Author

Bytes or void? For bytes, I think it is either latin1 or error out if not ascii? For void, you could allow it but check for valid utf-8. Bytes cast seems like it should exist to me.

Coming back to this - I'm fine with it, we just need to document it.

I could say something like "If you need to work with memory-mapped UTF-8 rows, use a void dtype to load it as a numpy array and then cast it to stringdtype if you need to treat it as strings. If you have Latin-1 strings that don't contain embedded nulls (FITS?), memory-map it as the bytes dtype and then cast it to stringdtype if you want to treat it as variable-width python strings". And then mention that both casts need a copy since stringdtype's storage model is fundamentally different from the other dtypes.

@ngoldbaum ngoldbaum changed the title API: Introduce stringdtype [NEP 55] API: Introduce stringdtype [NEP 55] [skip CI] Dec 18, 2023
@ngoldbaum ngoldbaum changed the title API: Introduce stringdtype [NEP 55] [skip CI] API: Introduce stringdtype [NEP 55] [skip ci] Dec 21, 2023
@ngoldbaum ngoldbaum changed the title API: Introduce stringdtype [NEP 55] [skip ci] API: Introduce stringdtype [NEP 55] Dec 21, 2023
@ngoldbaum
Member Author

I finished out all the casts, please feel free to look over the last few commits I just pushed. I ended up deciding to make the bytes casts only work with ASCII data, to match the existing unicode to bytes casts. I also decided to allow the string to void cast to truncate a UTF-8 character without warning or erroring.

@seberg seberg left a comment (Member)

Thanks, the new casts look fine; some small comments (also old ones that I never submitted), but nothing much.

@mhvk mhvk left a comment (Contributor)

Final comments! Partially about maybe reducing the tests a little - on my machine this adds 3000 tests that take 14 s, not huge but also rather non-negligible.

I would tend to remove the dtype parametrization on tests that have nothing to do with coercion and missing strings.

other * arr


def test_create_with_na(dtype):
Contributor

Maybe move way higher up in the file, near general creation tests.

np.testing.assert_array_equal(sa, a.astype("U"))


def test_null_roundtripping(dtype):
Contributor

Also move this one high up.


UFUNC_TEST_DATA = [
"hello" * 10,
"Ae¢☃€ 😊" * 100,
Contributor

Size does matter here, I'd suggest reducing it.

@ngoldbaum
Member Author

There should be a github reward for doing careful code review on a 10,000 line PR. Thank you so so so much for your time on this. I'm very impressed with your ability to spot C bugs. I'll finish up the comments on the tests tomorrow and hopefully we'll be able to merge this soon!

@mhvk
Contributor

mhvk commented Feb 1, 2024

Thanks! It has actually not been too bad to review, since your code is really quite clear (though doing the whole thing did mean sometimes getting a bit grumpy about boilerplate!), and I felt that after watching the development of the new dtype and array-method machinery, it made sense to try to understand an actual application.

Also a question: the hardest part about reviewing is to see what is missing and I realized that I don't think I saw many tests with views (either implicit as one gets by slicing, or explicit .view()). For an implicit view (say a[:5]) presumably the StringDType is shared (or at least the allocator); this matters a bit for in-place operations like a += a[:] - for that, the iterator will make internal copies, so sin != sout; is your dtype indeed also not shared? But they have the same allocator? What about a[a.startswith("a")] (copies array data but what happens to the arena? copied as well?).

Though really the above came up because I wondered what happens if I do a.view(StringDType(...)), perhaps to replace the NaN object. Does the storage get properly shared?

Anyway, can obviously look at the code again, but not at this hour...

@seberg
Member

seberg commented Feb 1, 2024

because I wondered what happens if I do a.view(StringDType(...))

We can follow up on this, I think. But would be good to have a test that this just fails due to the HASREFS flag being checked on the dtype instance (I am assuming it fails, I didn't test).

@mhvk
Contributor

mhvk commented Feb 1, 2024

It looks like we need more tests with views...

In [1]: a = np.array([['0', '±'*20], [';'*10, 'X']], 'T')

In [2]: a += a[0]
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[2], line 1
----> 1 a += a[0]

MemoryError: Failed to allocate string in string to string cast

In [3]: a
Out[3]: 
array([['00',
        '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'],
       [';;;;;;;;;;', 'X']], dtype=StringDType())

Things that work:

a += a[0].copy()
a += a
a += a[:]

Also no problem for short strings only:

 In [1]: a = np.array([['0', '±'*3], [';'*6, 'X']], 'T')

In [2]: a += a.T

In [3]: a
Out[3]: 
array([['00', '±±±;;;;;;'],
       [';;;;;;±±±', 'XX']], dtype=StringDType())

In [4]: a += a.T
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[4], line 1
----> 1 a += a.T

MemoryError: Failed to allocate string in string to string cast

@ngoldbaum
Member Author

ngoldbaum commented Feb 1, 2024

For an implicit view (say a[:5]) presumably the StringDType is shared (or at least the allocator); this matters a bit for in-place operations like a += a[:] - for that, the iterator will make internal copies, so sin != sout; is your dtype indeed also not shared? But they have the same allocator? What about a[a.startswith("a")] (copies array data but what happens to the arena? copied as well?).

Thanks for pointing this out, I wasn't aware of this wrinkle with the iterators.

I'd rather not completely ban views, since I found it convenient with pandas in a few places to change the na_object temporarily without a copy.

It turns out there are two issues I need to fix to get this to work. First, I need a way to detect whether a descriptor actually owns the array data, or if it is a view into another array. It looks like setting array_owned=2 and checking for that in a few places avoids double-frees when cleaning up descriptors and allocators.

The other issue is that there's no way to easily determine in the ufunc loop if two different packed strings happen to share a heap allocation. As you said, the iterator makes a copy, so the packed strings pointers are distinct and the input and output array pointers are distinct, but they still share an underlying heap allocation.

One way to avoid this issue would be to assume if there are any shared allocators between the input and output arrays that there may be an in-place operation happening, and in that case ufuncs should allocate a temporary output buffer, do the operation, and only then pack the string data into the output array.

Of course that means we'll be doing copies unnecessarily inside ufunc loops for any possibly in-place operation, even if it isn't actually in-place.

Another option would be to add a new C API function along the lines of:

int NpyString_share_memory(npy_packed_static_string *s1, npy_string_allocator *alloc1,
                           npy_packed_static_string *s2, npy_string_allocator *alloc2)

which would return a positive value if the two strings share a heap allocation, 0 if they don't, and a negative value to indicate memory corruption or a logic error (e.g. s1 or s2 have flags set indicating a heap allocation, but alloc1 or alloc2 have a null arena pointer).

Do either of you two (or anyone else still following along) have an opinion on this? I think the second option is probably preferable and doable, but I'll need to make sure I audit all the ufunc loops to make sure they have such a check and add various tests that do operations with an array and a slice into the same array and an array and a view into that array.

Should I fix this immediately before this PR is merged or should we wait until it's easier to do code review on the fix?

@seberg
Member

seberg commented Feb 1, 2024

Maybe disabling the broken views for now would make sense in this PR, and undoing that later? But if not, that's also fine. I have to think about this more to give good input, but making this into a larger iteration here also doesn't seem helpful.

@mhvk
Contributor

mhvk commented Feb 1, 2024

I similarly would need more time... But maybe you were right all along to compare the dtype to see whether things were in-place? It is true that views share the dtype, i.e., a.T.dtype is a.dtype. Though I cannot immediately reason it through, it may be worth reverting to what you had before and check whether my examples "just work".

@ngoldbaum
Member Author

But maybe you were right all along to compare the dtype to see whether things were in-place? It is true that views share the dtype, i.e., a.T.dtype is a.dtype. Though I cannot immediately reason it through, it may be worth reverting to what you had before and check whether my examples "just work".

Thanks for this hint! This does mostly work, although I also need to fix the issue alluded to above in the string to string cast, since in that case I do need to do the more thorough memory sharing check that looks at the unpacked representation. So for now I've added a private version of NpyString_share_memory and if we need to expose it publicly in the future we can come back to making it public. It turns out the identical descriptor check is sufficient in the ufunc loops, at least in the tests I have so far.

While doing this after applying some of the other test changes you suggested above, I uncovered two different bugs in the arena allocator implementation related to handling mutated strings. Glad I caught these!

Unfortunately that took all day and I didn't have enough time to add the tests you wanted with views and slices, but I did verify they work now and the errors you ran into before are fixed.

@mhvk
Contributor

mhvk commented Feb 2, 2024

Great to hear that things mostly work! I'm now reviewing an actual paper, but can have a last look at the ufunc loops/tests later. It will be very nice to have this in and be able to look at smaller bits & pieces...

@mhvk
Contributor

mhvk commented Feb 7, 2024

@ngoldbaum - I slightly lost track of where we are with this -- let me know when it is ready for final review on the ufuncs and associated tests...

@ngoldbaum
Member Author

It's ready for final review whenever you're ready.

@mhvk
Contributor

mhvk commented Feb 7, 2024

OK, great! Tomorrow hopefully!

@mhvk mhvk left a comment (Contributor)

OK, had a last look and only minimal things left, except one test case that still confuses me, see in-line comment (maybe I'm just missing something).

Other remaining question: as is, my sense would be to squash all commits together in the merge rather than have 196 commits with a path that may not be random but seems unlikely to be helpful (not sure all tests passed at all stages either). Is that OK?

return NPY_NO_CASTING;
}

template <typename t>
Contributor

capital T for type?

&newsize, cursize, factor);
if (overflowed) {
npy_gil_error(PyExc_MemoryError,
"Failed to allocate string in string mutiply");
Contributor

mutiply -> multiply

return (NPY_CASTING)-1;
}

if (eq_res != 1) {
Contributor

Feel free to ignore, but I'd write if (!eq_res)

}

static int
minimum_strided_loop(PyArrayMethod_Context *context, char *const data[],
Contributor

The minimum and maximum strided loops are also a bit of auxdata away from being able to share the same code... I named them in the task list.


INIT_MULTIPLY(Int64, int64);
INIT_MULTIPLY(UInt64, uint64);
// TODO: for reasons I don't understand at the moment we can't add these using
Contributor

@seberg - any advice on why the promotion doesn't "just work" without explicitly adding promotors for every integer type?

Member

I am not sure what @ngoldbaum tried. I assume it refers to the code below, which is indeed quite annoying!

I would think what works is adding a promoter for &PyArray_StringDType, &PyArray_DescrType (or None, I think, might work), which will then always be called if one of the two dtypes is the string dtype (and it must finish the promotion).

Contributor

Can one put in the more general abstract integer and unsigned integer types?

@ngoldbaum ngoldbaum Feb 7, 2024 (Member Author)

The abstract integer dtype is only for python integers. We don't have a way of representing the dtype type hierarchy in the C DType API at the moment.

The promoter sebastian is suggesting will only fire for multiply, so the fact that it will fire for all dtypes is fine because multiplying by all other types is an error. Of course that limitation would be more annoying for other dtypes.

Member

Yes, I always meant to create an abstract base DType you can easily register with that matches any integer/float, etc.
For some reason it was a bit trickier than I thought it would be (I don't recall why now). The machinery should support that just fine (I am not 100% sure; we might have ripped the logic out because it was unused at the time).

At some point I thought the Python abstract integer DType could serve that purpose also, but I don't think it can anymore (we need to match exactly for some of the more complex stuff).

Contributor

OK, thanks! And, yes, the promoter with None sounds like a good idea as well!

out[0] = "hello"
res = sarr.take(np.arange(len(string_list)), out=out)
assert res is out

Contributor

needs a repeat of assert_array_equal(sarr, res) just to be sure the "hello" is overwritten...

@ngoldbaum
Member Author

ngoldbaum commented Feb 7, 2024

The last push addresses today's comments. Thanks @seberg for the suggestion to refactor the multiply promoters, it's much simpler now with a single promoter.

Once Marten approves I'm happy to squash this into one enormous commit. There might be some history in the PR that would be useful in the future though so I'm also happy to leave it as-is. It should at least build on most systems on every commit.

@seberg
Member

seberg commented Feb 7, 2024

Sounds good. However you like it, I would normally squash, but some big histories are just neat to see how much effort went in ;).
I thought git bisect is smart enough to prefer splitting on merge commits anyway, so even if there is a bad commit it doesn't destroy bisecting a lot? (But I admit, I don't bisect often in practice).

@ngoldbaum
Member Author

Yeah, you can just git bisect skip if you hit a commit that doesn't build (there are definitely commits in the NumPy history, and in most open source projects on GitHub, that don't build...). It can also handle arbitrary history graphs with merging.

tmp = signature[i];
}
else if (op_dtypes[i] != &PyArray_StringDType) {
else if (is_signed_integer_dtype(op_dtypes[i])) {
Contributor

I think I'm getting a little less confused, but wonder if this can still be easier: If we just register one promoter for None that just sets int64 here and another one for uint64 that sets uint64, would we then not be OK without the is_(un)signed_integer_dtype functions? That would provide safe casting routes for all integers, no?

Member Author

That can't work, because I'm installing the promoter with a None entry in the dtypes I'm installing it for. That makes the promoter fire for all dtypes. There's no way to say "I only want to install a promoter for unsigned integers" without explicitly adding promoters for all unsigned integers like I had before.

Contributor

I meant that None would be for all integers except uint64 -- which is the only one that cannot be cast to int64. I think that if the uint64 is registered first, that should work, no? But I must admit that even though I reviewed @seberg's PRs, I fear I don't completely understand the paths yet...

@seberg seberg Feb 7, 2024 (Member)

Registration order is meaningless, subclass relationships define priority. None is just the same as np.dtype (always matches any dtype with lowest priority). EDIT: probably None has a lower priority than np.dtype, but not sure.

But I think Marten is right (with caveat): If you install the None promoter it is guaranteed to never trigger for uint64 because there is a more specific uint64 loop already registered.

The (big) caveat is that, unfortunately, ulong and ulonglong are both uint64 and I never made them subclasses of each other :(. And until they are, you need to register both (that could be another promoter to convert ulonglong to uint64, or another loop).

Member

(I am wondering now if we can try to hack that subclass relationship without merging the two uint64 into a single DType, but not for the here and now!)

Member Author

Another issue is that if, say, I pass a float array in, I get a casting error about not being able to cast to int instead of an error saying there's no ufunc loop registered, unless I have a check for is_integer_dtype in the promoter.

In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())

In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]

UFuncTypeError: Cannot cast ufunc 'multiply' input 1 from dtype('float64') to dtype('int64') with casting rule 'same_kind'

I personally think the is_integer_dtype check is fine, it's just a bunch of pointer comparisons and I doubt it would have performance implications. A single promoter that checks for all ints and unconditionally casts to int64 would also allow me to avoid defining an extra promoter just for ulong and ulonglong.

Of course if we're worried about performance at that level, we would implement the loops directly as I originally had it...

See latest push to see what I'm talking about. With this push I get an error that's more immediately understandable to me:

In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())

In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (<class 'numpy.dtypes.StringDType'>, <class 'numpy.dtypes.Float64DType'>) -> None

Member

Promoters don't have any performance implications at all since the result is always cached (a bit too much if anything). The only question is what is nicer.

Contributor

Yes, the only reason I brought it up was that I thought it would be nicer without the is_integer function, since it just seemed promoters should be able to handle this - so a bit less code (both source and compiled). But the beauty is slightly destroyed by needing different ones for ulong and ulonglong...

To me, the errors do not make too much of a difference - in a way, it is nice that you can do casting='unsafe' for float.

But really this is not a big deal - I really just hoped we could avoid registering a whole slew of loops and that is done now! So, @ngoldbaum, happy to go with what you feel is nicest here!

@ngoldbaum
Member Author

OK, in that case... can I hit the merge button? 😁

@mhvk
Contributor

mhvk commented Feb 7, 2024

Yes go ahead! Let's have this wonderful addition!!

@ngoldbaum
Member Author

OK, squashed everything into one commit. I'll merge it tonight once all the tests are green, just in case.
