API: Introduce stringdtype [NEP 55] #25347
Conversation
Hi Nathan, thank you for your great work on UTF8 strings and their integration in Numpy. This is a very important dtype to support, especially with the widespread use of large language models (LLMs) nowadays.
On top of that, what if the data format for the sidecar data changes? Is it then still possible to read old files with a newer Numpy version? To overcome these issues, first and foremost I suggest not to introduce a .npy file format version 4.0 with this kind of fundamental change. Instead of adding the sidecar data to the same .npy file, I suggest adding it to an extra, standard .npy file. Here is an example: if the user wants to save to "mystrings.npy", the files "mystrings.npy" and "mystrings.npy.idx" could be created, where the latter is also just a regular .npy file and contains indices/offsets into mystrings.npy that would otherwise end up in the sidecar data. When the file "mystrings.npy" is loaded, Numpy checks whether there is a mystrings.npy.idx and tries to load it as well. This is an approach I use pretty regularly with video data: the single frames are all in one big (appendable) array and the index array contains the begin indices/offsets and (redundantly, for fast lookups, sorting etc.) the lengths and end offsets. This concept is very generic, can be used for all sorts of ragged arrays including text, satisfies the requirements of NEP 1 and is efficiently appendable (or at least moves the burden from the programmer to the filesystem). Best, Michael
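The sidecar scheme described above can be sketched in a few lines of plain numpy. This is only an illustration of the idea, not anything NumPy implements: the function names and the (offset, length) index layout are made up here, and the data file is just a flat uint8 heap of concatenated UTF-8 payloads.

```python
import numpy as np

def save_ragged(path, strings):
    # One flat heap of all UTF-8 payloads, plus a sidecar index file.
    payloads = [s.encode("utf-8") for s in strings]
    lengths = np.array([len(p) for p in payloads], dtype=np.int64)
    offsets = np.concatenate([np.zeros(1, dtype=np.int64),
                              np.cumsum(lengths)[:-1]])
    with open(path, "wb") as f:
        np.save(f, np.frombuffer(b"".join(payloads), dtype=np.uint8))
    with open(path + ".idx", "wb") as f:
        # Storing lengths redundantly allows fast lookups without
        # touching a neighboring index entry.
        np.save(f, np.stack([offsets, lengths], axis=1))

def load_ragged(path):
    with open(path, "rb") as f:
        data = np.load(f)
    with open(path + ".idx", "rb") as f:
        idx = np.load(f)
    return [bytes(data[o:o + n]).decode("utf-8") for o, n in idx]
```

Both files are ordinary .npy files, so existing readers can still open the data file as a byte array even if they know nothing about the index.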
Also, a dedicated index array can come with a custom shape, which would allow for multidimensional ragged arrays (in the future). Wouldn't that be great?
Thankfully, I haven't started work on the serialization yet. I wasn't aware that there was already a capability to add extra data in the npy format. I will look at this and make sure it's usable for my case and if so, update the NEP to drop the part about updating the npy format. I only added that because I thought there was no other option - I'm glad to hear I was wrong!
Okay, good to hear I'm not too late 😅 It's just an alternative idea where to house the sidecar data and of course for discussion. I'll send it to the Numpy user group, too. Also, how those .idx files would be handled would still need to be defined, as would a definition of how the sidecar data would look for the String dtype and all aspects around that. Two options come to my mind:
Ragged arrays are not defined at all in Numpy yet, afaik. I just described a solution I used frequently (which could be generalized to do the trick). Let's see what others think.
Actually, I've spent quite some time thinking about this entire topic and I think I've now come up with a solution that could fit very well with how things are done in Numpy. Above, I mentioned that I've used index files for Numpy files to create some sort of ragged array. That's production proven and works, and users are happy. However, it would probably make more sense to do it the other way round: to have the index file as the main .npy file and declare what used to be the main .npy file as some sort of heap. I've mentioned that idea in a Numpy community meeting once. The advantage would be a stronger hierarchy: one main file with indices/offsets/pointers that point to heaps in other files. Let me give an example of what that could mean for strings: let's say the main .npy file "main.npy" has a shape of (100, 100) and that new String dtype. Then, the actual strings would all be lined up one after another in "main.npy.strings" (also a regular Numpy file, e.g. dtype byte or char) and "main.npy" could e.g. contain int64 values that can be interpreted as byte offsets into "main.npy.strings". Or dual int64 values that contain the byte offsets of the string begin and string end (or string length). Just a very small example to get some opinions. What do you think?
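The reversed layout — a main file of dual int64 offsets plus a companion heap file — can be modeled the same way. Again purely illustrative: the file names, the "(begin, end) pair per string" encoding, and the function names are just this comment's proposal sketched out, not an actual NumPy feature.

```python
import numpy as np

def save_with_heap(path, strings):
    payloads = [s.encode("utf-8") for s in strings]
    ends = np.cumsum([len(p) for p in payloads], dtype=np.int64)
    begins = np.concatenate([np.zeros(1, dtype=np.int64), ends[:-1]])
    with open(path, "wb") as f:
        # Main file: one (begin, end) pair of byte offsets per string.
        np.save(f, np.stack([begins, ends], axis=1))
    with open(path + ".strings", "wb") as f:
        # Heap file: the UTF-8 payloads lined up one after another.
        np.save(f, np.frombuffer(b"".join(payloads), dtype=np.uint8))

def load_with_heap(path):
    with open(path, "rb") as f:
        pairs = np.load(f)
    with open(path + ".strings", "rb") as f:
        heap = np.load(f)
    return [bytes(heap[b:e]).decode("utf-8") for b, e in pairs]
```

The appeal of this direction is the hierarchy the comment mentions: the main file carries the array shape and pointers, while the heap file is a dumb, appendable byte store.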
Can we open a new issue for the storage discussion? It's interesting, but I don't think it is central to this PR.
I'll open an issue in the coming week.
seberg
left a comment
Awesome, thanks for this! I added some (mostly take-or-leave) comments, with some a bit bigger or more important. One thing I am curious about: is there a reason to keep static_string.c in the multiarray folder rather than the new stringdtype one?
Have to look a bit closer at the arena, although I am happy with rolling with it for now (just missed if/where we reclaim unused memory).
Overall, I am still leaning towards: nobody is going to test this deep enough unless we ship it, unfortunately. So my preference is to just ship it, having such a dtype is what we want, I think. And this takes care to keep the C-ABI flexible and the worst I can think of on the Python API side is some bugs and that we will end up grabbing the GIL, because it is all trickier than it seems now.
It would be good if someone else had a look too, even without diving into the C code. Maybe reviewing the current tests in the numpy-user-dtype repo would actually be useful.
c[3] = (0x80 | (code & 0x3f));
return 4;
}
}
Happy to keep these here, they do also look pretty good as their own file with utf8 utilities.
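The quoted hunk above is the four-byte tail of a UTF-8 encoder. For reference, the full branch structure such utilities implement can be sketched in Python — this is a generic illustration of UTF-8 encoding, not NumPy's actual C code:

```python
def utf8_encode(code: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if code < 0x80:                       # 1 byte: plain ASCII
        return bytes([code])
    if code < 0x800:                      # 2 bytes
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:                    # 3 bytes
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    return bytes([0xF0 | (code >> 18),    # 4 bytes
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])  # the c[3] line quoted above
```

Each continuation byte carries six payload bits under a 0x80 marker, which is exactly what the `0x80 | (code & 0x3f)` expression in the hunk computes.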
Those tests are coming over here too.
No particular reason. I wrote
The memory gets reclaimed by the OS in |
Okay, I've done that, see #25374. However, I think serialization is important for the String dtype, because otherwise one could just keep using object arrays if I understand correctly.
It looks like all the tests pass except on pypy. Woohoo! I know at one point we had some concerns about how things would work on a 32-bit Python or a big-endian system, so it's good to see that everything seems to be working with only very minor adjustments (see the last few commits). The pypy failures look like they're happening inside the stringdtype tests and are crashing the pypy interpreter (I think?), so I'm going to need to figure out how I'm breaking pypy. Maybe @mattip we could pair on this sometime if you're interested in taking a look. Some RPython tracebacks showing up in the build log:
Next up after that are docs and serialization. The only remaining TODO comment in the codebase is because I never implemented casts to and from extended precision types like longdouble. Do I need to do that? It's not terribly hard but also I have other things to worry about. Also I could add a cast to and from bytes, erroring out if the bytes aren't valid UTF-8 (say).
Bytes or void? For bytes, I think it is either latin1 or error out if not ascii? For void, you could allow it but check for valid utf-8. Bytes cast seems like it should exist to me. Now coercing Python byte objects could and maybe should raise, I think.
I would consider it a bug to miss them, but I can live with considering it an unfixed bug. Would be nice to have casts for all builtin dtypes though.
I isolated the windows pypy crash and reported it upstream: https://foss.heptapod.net/pypy/pypy/-/issues/4046
It turns out we can rely on pointer comparison for singletons, just not for NaN. I refactored things to first do a pointer comparison, then do a NaN check, and finally check for |
Coming back to this - I'm fine with it, we just need to document it. I could say something like "If you need to work with memory-mapped UTF-8 rows, use a void dtype to load it as a numpy array and then cast it to stringdtype if you need to treat it as strings. If you have Latin-1 strings that don't contain embedded nulls (FITS?), memory-map it as the bytes dtype and then cast it to stringdtype if you want to treat it as variable-width python strings". And then mention that both casts need a copy since stringdtype's storage model is fundamentally different from the other dtypes.
I finished out all the casts, please feel free to look over the last few commits I just pushed. I ended up deciding to make the bytes casts only work with ASCII data, to match the existing unicode to bytes casts. I also decided to allow the string to void cast to truncate a UTF-8 character without warning or erroring.
seberg
left a comment
Thanks, the new casts look fine, some small comments (also old ones that I never submitted), but nothing much.
mhvk
left a comment
Final comments! Partially about maybe reducing the tests a little - on my machine this adds 3000 tests that take 14 s, not huge but also rather non-negligible.
I would tend to remove the dtype parametrization on tests that have nothing to do with coercion and missing strings.
other * arr

def test_create_with_na(dtype):
Maybe move way higher up in the file, near general creation tests.
np.testing.assert_array_equal(sa, a.astype("U"))

def test_null_roundtripping(dtype):
UFUNC_TEST_DATA = [
    "hello" * 10,
    "Ae¢☃€ 😊" * 100,
Size does matter here, I'd suggest reducing it.
There should be a github reward for doing careful code review on a 10,000 line PR. Thank you so so so much for your time on this. I'm very impressed with your ability to spot C bugs. I'll finish up the comments on the tests tomorrow and hopefully we'll be able to merge this soon!
Thanks! It has actually been not too bad reviewing, since your code is really quite clear (but indeed doing the lot meant getting sometimes a bit grumpy about boilerplate!), and I felt that after looking over the development of the new dtype and array method machinery, it made sense to try to understand an actual application. Also a question: the hardest part about reviewing is to see what is missing and I realized that I don't think I saw many tests with views (either implicit as one gets by slicing, or explicit
Though really the above came up because I wondered what happens if I do
Anyway, can obviously look at the code again, but not at this hour...
We can follow up on this, I think. But would be good to have a test that this just fails due to the |
It looks like we need more tests with views... Things that work:
Also no problem for short strings only:
Thanks for pointing this out, I wasn't aware of this wrinkle with the iterators. I'd rather not completely ban views, since I found it convenient with pandas in a few places to change the
It turns out there are two issues I need to fix to get this to work. First, I need a way to detect whether a descriptor actually owns the array data, or if it is a view into another array. It looks like setting
The other issue is that there's no way to easily determine in the ufunc loop if two different packed strings happen to share a heap allocation. As you said, the iterator makes a copy, so the packed string pointers are distinct and the input and output array pointers are distinct, but they still share an underlying heap allocation. One way to avoid this issue would be to assume that if there are any shared allocators between the input and output arrays, there may be an in-place operation happening, and in that case ufuncs should allocate a temporary output buffer, do the operation, and only then pack the string data into the output array. Of course that means we'll be doing copies unnecessarily inside ufunc loops for any possibly in-place operation, even if it isn't actually in-place. Another option would be to add a new C API function along the lines of:
which would return a positive value if the two strings share a heap allocation, 0 if they don't, and a negative value to indicate memory corruption or a logic error (e.g.
Do either of you two (or anyone else still following along) have an opinion on this? I think the second option is probably preferable and doable, but I'll need to make sure I audit all the ufunc loops to make sure they have such a check and add various tests that do operations with an array and a slice into the same array and an array and a view into that array. Should I fix this immediately before this PR is merged or should we wait until it's easier to do code review on the fix?
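The sharing check described in that second option boils down to asking whether two packed strings live in the same heap allocation (with a sanity check that each string fits inside its buffer). Here is a hypothetical Python model of that logic — the names `Allocation`, `PackedString`, and `strings_share_memory` are invented for illustration and are not the proposed C API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allocation:
    """Stand-in for a heap buffer backing one or more packed strings."""
    start: int   # base address of the buffer
    size: int    # buffer size in bytes

@dataclass(frozen=True)
class PackedString:
    alloc: Allocation
    offset: int  # byte offset of this string inside its allocation
    length: int  # string length in bytes

def strings_share_memory(a: PackedString, b: PackedString) -> int:
    """Return 1 if a and b share a heap allocation, 0 if they don't,
    and -1 for corruption or a logic error, mirroring the tri-state
    return value described in the comment above."""
    for s in (a, b):
        if s.offset < 0 or s.offset + s.length > s.alloc.size:
            return -1  # string extends past its allocation: corruption
    return 1 if a.alloc is b.alloc else 0

heap = Allocation(start=0x1000, size=64)
s1 = PackedString(heap, offset=0, length=5)
s2 = PackedString(heap, offset=5, length=5)
s3 = PackedString(Allocation(start=0x2000, size=16), offset=0, length=3)
assert strings_share_memory(s1, s2) == 1  # same arena, distinct strings
assert strings_share_memory(s1, s3) == 0  # different allocations
```

The point the model makes is that distinct packed-string pointers (as the iterator's copy produces) are not enough evidence of independence; the ufunc loop has to compare the backing allocations themselves.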
Maybe disabling the broken views for now could make sense in this PR, to undo it later? But if not, that's also fine. I have to think about this more to give good input, but making a larger iteration here also doesn't seem helpful.
I similarly would need more time... But maybe you were right all along to compare the |
Thanks for this hint! This does mostly work, although I also need to fix the issue alluded to above in the string to string cast, since in that case I do need to do the more thorough memory sharing check that looks at the unpacked representation. So for now I've added a private version of
While doing this after applying some of the other test changes you suggested above, I uncovered two different bugs in the arena allocator implementation related to handling mutated strings. Glad I caught these! Unfortunately that took all day and I didn't have enough time to add the tests you wanted with views and slices, but I did verify they work now and the errors you ran into before are fixed.
Great to hear that things mostly work! I'm now reviewing an actual paper, but can have a last look at the ufunc loops/tests later. It will be very nice to have this in and be able to look at smaller bits & pieces...

@ngoldbaum - I slightly lost track of where we are with this -- let me know when it is ready for final review on the ufuncs and associated tests...

It's ready for final review whenever you're ready.

OK, great! Tomorrow hopefully!
mhvk
left a comment
OK, had a last look and only minimal things left, except one test case that still confuses me, see in-line comment (maybe I'm just missing something).
Other remaining question: as is, my sense would be to squash all commits together in the merge rather than have 196 commits with a path that may not be random but seems unlikely to be helpful (not sure all tests passed at all stages either). Is that OK?
return NPY_NO_CASTING;
}

template <typename t>
&newsize, cursize, factor);
if (overflowed) {
    npy_gil_error(PyExc_MemoryError,
                  "Failed to allocate string in string multiply");
    return (NPY_CASTING)-1;
}

if (eq_res != 1) {
Feel free to ignore, but I'd write if (!eq_res)
}

static int
minimum_strided_loop(PyArrayMethod_Context *context, char *const data[],
The minimum and maximum strided loops are also a bit of auxdata away from being able to share the same code... Named them in the task list.
INIT_MULTIPLY(Int64, int64);
INIT_MULTIPLY(UInt64, uint64);
// TODO: for reasons I don't understand at the moment we can't add these using
@seberg - any advice on why the promotion doesn't "just work" without explicitly adding promotors for every integer type?
I am not sure what @ngoldbaum tried. I assume it refers to the code below, which is indeed quite annoying!
I would think what works is adding a promoter for &PyArray_StringDType, &PyArray_DescrType (or None I think might work). Which will then be called always if one of the two dtypes is string dtype (and it must finish the promotion).
Can one put in the more general abstract integer and unsigned integer types?
The abstract integer dtype is only for python integers. We don't have a way of representing the dtype type hierarchy in the C DType API at the moment.
The promoter sebastian is suggesting will only fire for multiply, so the fact that it will fire for all dtypes is fine because multiplying by all other types is an error. Of course that limitation would be more annoying for other dtypes.
Yes, I always meant to create an abstract base DType you can easily register with that matches any integer/float, etc.
For some reason it was a bit trickier than I thought it would be (I don't recall now). The machinery should support that just fine (I am not 100% sure we might have ripped the logic out because it was unused at the time).
At some point I thought the Python abstract integer DType could serve that purpose also, but I don't think it can anymore (we need to match exactly for some of the more complex stuff).
OK, thanks! And, yes, the promoter with None sounds like a good idea as well!
out[0] = "hello"
res = sarr.take(np.arange(len(string_list)), out=out)
assert res is out
needs a repeat of assert_array_equal(sarr, res) just to be sure the "hello" is overwritten...
The last push addresses today's comments. Thanks @seberg for the suggestion to refactor the multiply promoters, it's much simpler now with a single promoter. Once Marten approves I'm happy to squash this into one enormous commit. There might be some history in the PR that would be useful in the future though so I'm also happy to leave it as-is. It should at least build on most systems on every commit.
Sounds good. However you like it, I would normally squash, but some big histories are just neat to see how much effort went in ;).
Yeah, you just |
tmp = signature[i];
}
else if (op_dtypes[i] != &PyArray_StringDType) {
else if (is_signed_integer_dtype(op_dtypes[i])) {
I think I'm getting a little less confused, but wonder if this can still be easier: If we just register one promoter for None that just sets int64 here and another one for uint64 that sets uint64, would we then not be OK without the is_(un)signed_integer_dtype functions? That would provide safe casting routes for all integers, no?
That can't work because I'm installing the promoter with a None entry in the dtypes I'm installing it for. That makes the promoter fire for all dtypes. There's no way to say "i only want to install a promoter for unsigned integers" without explicitly adding promoters for all unsigned integers like I had before.
I meant that None would be for all integers except uint64 -- which is the only one that cannot be cast to int64. I think that if the uint64 is registered first, that should work, no? But I must admit that even though I reviewed @seberg's PRs, I fear I don't completely understand the paths yet...
Registration order is meaningless, subclass relationships define priority. None is just the same as np.dtype (always matches any dtype with lowest priority). EDIT: probably None has a lower priority than np.dtype, but not sure.
But I think Marten is right (with caveat): If you install the None promoter it is guaranteed to never trigger for uint64 because there is a more specific uint64 loop already registered.
The (big) caveat is that, unfortunately, ulong and ulonglong are both uint64 and I never made them subclasses of each other :(. And until they are, you need to register both (that could be another promoter to convert the ulonglong to uint64, or another loop).
(I am wondering now if we can try to hack that subclass relationship without merging the two uint64 into a single DType, but not for the here and now!)
Another issue is that if, say, I pass a float array in, I get a casting error about not being able to cast to int instead of an error saying there's no ufunc loop registered, unless I have a check for is_integer_dtype in the promoter.
In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())
In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]
UFuncTypeError: Cannot cast ufunc 'multiply' input 1 from dtype('float64') to dtype('int64') with casting rule 'same_kind'
I personally think the is_integer_dtype check is fine, it's just a bunch of pointer comparisons and I doubt it would have performance implications. A single promoter that checks for all ints and unconditionally casts to int64 would also allow me to avoid defining an extra promoter just for ulong and ulonglong.
Of course if we're worried about performance at that level, we would implement the loops directly as I originally had it...
See latest push to see what I'm talking about. With this push I get an error that's more immediately understandable to me:
In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())
In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (<class 'numpy.dtypes.StringDType'>, <class 'numpy.dtypes.Float64DType'>) -> None
Promoters don't have any performance implications at all since the result is always cached (a bit too much if anything). The only question is what is nicer.
Yes, the only reason I brought it up was that I thought it would be nicer without the is_integer function, since it just seemed promoters should be able to handle this - so a bit less code (both source and compiled). But the beauty is slightly destroyed by needing different ones for ulong and ulonglong...
To me, the errors do not make too much of a difference - in a way, it is nice that you can do casting='unsafe' for float.
But really this is not a big deal - I really just hoped we could avoid registering a whole slew of loops and that is done now! So, @ngoldbaum, happy to go with what you feel is nicest here!
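The promoter being discussed can be modeled in a few lines of Python. This is purely illustrative — the real thing is a C promoter registered through the experimental DType API, and `promote_string_multiply` is a made-up name — but it shows the decision the is_integer_dtype variant makes: every integer operand (including uint64, ulong, and ulonglong, without separate registrations) maps to the int64 loop, and everything else surfaces a "no matching loop" error.

```python
import numpy as np

def promote_string_multiply(other_dtype):
    """Model of the string-multiply promoter: integer dtypes promote
    to int64; anything else has no loop (hypothetical sketch)."""
    if np.issubdtype(other_dtype, np.integer):
        # The is_integer_dtype check: one branch covers every signed
        # and unsigned integer, so no per-dtype promoters are needed.
        return np.dtype(np.int64)
    raise TypeError(
        f"ufunc 'multiply' has no loop for (StringDType, {other_dtype})")

assert promote_string_multiply(np.dtype(np.int8)) == np.int64
assert promote_string_multiply(np.dtype(np.uint64)) == np.int64
try:
    promote_string_multiply(np.dtype(np.float64))
except TypeError:
    pass
else:
    raise AssertionError("expected float64 to be rejected")
```

This mirrors the trade-off in the thread: the explicit integer check costs a few pointer comparisons per promotion (and the result is cached anyway), in exchange for a clearer "no loop" error instead of a confusing int64 casting error for float inputs.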
OK, in that case... can I hit the merge button? 😁
Yes go ahead! Let's have this wonderful addition!!
OK, squashed everything into one commit. I'll merge it tonight once all the tests are green, just in case.
This PR adds stringdtype to numpy. See NEP 55 for details.