API: Introduce stringdtype [NEP 55] #25347
Conversation
Hi Nathan, thank you for your great work on UTF8 strings and their integration in Numpy. This is a very important dtype to support, especially with the widespread use of large language models (LLMs) nowadays.
On top of that, what if the data format for the sidecar data changes? Is it then still possible to read old files with a newer Numpy version? To overcome these issues, first and foremost I suggest not to introduce a .npy file format version 4.0 with this kind of fundamental change. Instead of adding the sidecar data to the same .npy file, I suggest adding it to an extra, standard .npy file. Here is an example: if the user wants to save to "mystrings.npy", the files "mystrings.npy" and "mystrings.npy.idx" could be created, where the latter is also just a regular .npy file and contains indices/offsets into mystrings.npy that would otherwise end up in the sidecar data. When the file "mystrings.npy" is loaded, Numpy checks whether there is a mystrings.npy.idx and tries to load it as well. This is an approach I use pretty regularly with video data: the single frames are all in one big (appendable) array and the index array contains the begin indices/offsets and (redundantly, for fast lookups, sorting etc.) the lengths and end offsets. This concept is very generic, can be used for all sorts of ragged arrays including text, satisfies the requirements of NEP 1 and is efficiently appendable (or at least moves the burden from the programmer to the filesystem). Best, Michael
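The sidecar scheme described above can be sketched in a few lines of plain numpy. This is only an illustration of the idea, not anything NumPy implements: the function names and the (offset, length) index layout are made up here, and the data file is just a flat uint8 heap of concatenated UTF-8 payloads.

```python
import numpy as np

def save_ragged(path, strings):
    # One flat heap of all UTF-8 payloads, plus a sidecar index file.
    payloads = [s.encode("utf-8") for s in strings]
    lengths = np.array([len(p) for p in payloads], dtype=np.int64)
    offsets = np.concatenate([np.zeros(1, dtype=np.int64),
                              np.cumsum(lengths)[:-1]])
    with open(path, "wb") as f:
        np.save(f, np.frombuffer(b"".join(payloads), dtype=np.uint8))
    with open(path + ".idx", "wb") as f:
        # Storing lengths redundantly allows fast lookups without
        # touching a neighboring index entry.
        np.save(f, np.stack([offsets, lengths], axis=1))

def load_ragged(path):
    with open(path, "rb") as f:
        data = np.load(f)
    with open(path + ".idx", "rb") as f:
        idx = np.load(f)
    return [bytes(data[o:o + n]).decode("utf-8") for o, n in idx]
```

Both files are ordinary .npy files, so existing readers can still open the data file as a byte array even if they know nothing about the index.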
Also, a dedicated index array can come with a custom shape, which would allow for multidimensional ragged arrays (in the future). Wouldn't that be great?
Thankfully, I haven't started work on the serialization yet. I wasn't aware that there was already a capability to add extra data in the npy format. I will look at this and make sure it's usable for my case and if so, update the NEP to drop the part about updating the npy format. I only added that because I thought there was no other option - I'm glad to hear I was wrong!
Okay, good to hear I'm not too late 😅 It's just an alternative idea where to house the sidecar data and of course for discussion. I'll send it to the Numpy user group, too. Also, how those .idx files would be handled would still need to be defined, as would a definition of how the sidecar data would look for the String dtype and all aspects around that. Two options come to my mind:
Ragged arrays are not defined at all in Numpy yet, afaik. I just described a solution I used frequently (which could be generalized to do the trick). Let's see what others think.
Actually, I've spent quite some time thinking about this entire topic and I think I've now come up with a solution that could fit very well with how things are done in Numpy. Above, I mentioned that I've used index files for Numpy files to create some sort of ragged array. That's production proven and works, and users are happy. However, it would probably make more sense to do it the other way round: to have the index file as the main .npy file and declare what used to be the main .npy file as some sort of heap. I've mentioned that idea in a Numpy community meeting once. The advantage would be a stronger hierarchy: one main file with indices/offsets/pointers that point to heaps in other files. Let me give an example of what that could mean for strings: let's say the main .npy file "main.npy" has a shape of (100, 100) and that new String dtype. Then, the actual strings would all be lined up one after another in "main.npy.strings" (also a regular Numpy file, e.g. dtype byte or char) and "main.npy" could e.g. contain int64 values that can be interpreted as byte offsets into "main.npy.strings". Or dual int64 values that contain the byte offsets of the string begin and string end (or string length). Just a very small example to get some opinions. What do you think?
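The reversed layout — a main file of dual int64 offsets plus a companion heap file — can be modeled the same way. Again purely illustrative: the file names, the "(begin, end) pair per string" encoding, and the function names are just this comment's proposal sketched out, not an actual NumPy feature.

```python
import numpy as np

def save_with_heap(path, strings):
    payloads = [s.encode("utf-8") for s in strings]
    ends = np.cumsum([len(p) for p in payloads], dtype=np.int64)
    begins = np.concatenate([np.zeros(1, dtype=np.int64), ends[:-1]])
    with open(path, "wb") as f:
        # Main file: one (begin, end) pair of byte offsets per string.
        np.save(f, np.stack([begins, ends], axis=1))
    with open(path + ".strings", "wb") as f:
        # Heap file: the UTF-8 payloads lined up one after another.
        np.save(f, np.frombuffer(b"".join(payloads), dtype=np.uint8))

def load_with_heap(path):
    with open(path, "rb") as f:
        pairs = np.load(f)
    with open(path + ".strings", "rb") as f:
        heap = np.load(f)
    return [bytes(heap[b:e]).decode("utf-8") for b, e in pairs]
```

The appeal of this direction is the hierarchy the comment mentions: the main file carries the array shape and pointers, while the heap file is a dumb, appendable byte store.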
Can we open a new issue for the storage discussion? It's interesting, but I don't think it is central to this PR.
I'll open an issue in the coming week.
seberg
left a comment
Awesome, thanks for this! I added some (mostly take-or-leave) comments, with some a bit bigger or more important. One thing I am curious about: is there a reason to keep static_string.c in the multiarray folder rather than the new stringdtype one?
Have to look a bit closer at the arena, although I am happy with rolling with it for now (just missed if/where we reclaim unused memory).
Overall, I am still leaning towards: nobody is going to test this deep enough unless we ship it, unfortunately. So my preference is to just ship it, having such a dtype is what we want, I think. And this takes care to keep the C-ABI flexible and the worst I can think of on the Python API side is some bugs and that we will end up grabbing the GIL, because it is all trickier than it seems now.
It would be good if someone else had a look too, even without diving into the C code. Maybe reviewing the current tests in the numpy-user-dtype repo would actually be useful.
c[3] = (0x80 | (code & 0x3f));
return 4;
}
}
Happy to keep these here, they do also look pretty good as their own file with utf8 utilities.
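The quoted hunk above is the four-byte tail of a UTF-8 encoder. For reference, the full branch structure such utilities implement can be sketched in Python — this is a generic illustration of UTF-8 encoding, not NumPy's actual C code:

```python
def utf8_encode(code: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if code < 0x80:                       # 1 byte: plain ASCII
        return bytes([code])
    if code < 0x800:                      # 2 bytes
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:                    # 3 bytes
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    return bytes([0xF0 | (code >> 18),    # 4 bytes
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])  # the c[3] line quoted above
```

Each continuation byte carries six payload bits under a 0x80 marker, which is exactly what the `0x80 | (code & 0x3f)` expression in the hunk computes.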
Those tests are coming over here too.
No particular reason. I wrote
The memory gets reclaimed by the OS in |
Okay, I've done that, see #25374. However, I think serialization is important for the String dtype, because otherwise one could just keep using object arrays if I understand correctly.
It looks like all the tests pass except on pypy. Woohoo! I know at one point we had some concerns about how things would work on a 32-bit Python or a big-endian system, so it's good to see that everything seems to be working with only very minor adjustments (see the last few commits). The pypy failures look like they're happening inside the stringdtype tests and are crashing the pypy interpreter (I think?), so I'm going to need to figure out how I'm breaking pypy. Maybe @mattip we could pair on this sometime if you're interested in taking a look. Some RPython tracebacks showing up in the build log:
Next up after that are docs and serialization. The only remaining TODO comment in the codebase is because I never implemented casts to and from extended precision types like longdouble. Do I need to do that? It's not terribly hard but also I have other things to worry about. Also I could add a cast to and from bytes, erroring out if the bytes aren't valid UTF-8 (say).
Bytes or void? For bytes, I think it is either latin1 or error out if not ascii? For void, you could allow it but check for valid utf-8. Bytes cast seems like it should exist to me. Now coercing Python byte objects could and maybe should raise, I think.
I would consider it a bug to miss them, but I can live with considering it an unfixed bug. Would be nice to have casts for all builtin dtypes though.
I isolated the windows pypy crash and reported it upstream: https://foss.heptapod.net/pypy/pypy/-/issues/4046
It turns out we can rely on pointer comparison for singletons, just not for NaN. I refactored things to first do a pointer comparison, then do a NaN check, and finally check for |
Coming back to this - I'm fine with it, we just need to document it. I could say something like "If you need to work with memory-mapped UTF-8 rows, use a void dtype to load it as a numpy array and then cast it to stringdtype if you need to treat it as strings. If you have Latin-1 strings that don't contain embedded nulls (FITS?), memory-map it as the bytes dtype and then cast it to stringdtype if you want to treat it as variable-width python strings". And then mention that both casts need a copy since stringdtype's storage model is fundamentally different from the other dtypes.
I finished out all the casts, please feel free to look over the last few commits I just pushed. I ended up deciding to make the bytes casts only work with ASCII data, to match the existing unicode to bytes casts. I also decided to allow the string to void cast to truncate a UTF-8 character without warning or erroring.
seberg
left a comment
Thanks, the new casts look fine, some small comments (also old ones that I never submitted), but nothing much.
mhvk
left a comment
Final comments! Partially about maybe reducing the tests a little - on my machine this adds 3000 tests that take 14 s, not huge but also rather non-negligible.
I would tend to remove the dtype parametrization on tests that have nothing to do with coercion and missing strings.
other * arr

def test_create_with_na(dtype):
Maybe move way higher up in the file, near general creation tests.
np.testing.assert_array_equal(sa, a.astype("U"))

def test_null_roundtripping(dtype):
UFUNC_TEST_DATA = [
    "hello" * 10,
    "Ae¢☃€ 😊" * 100,
Size does matter here, I'd suggest reducing it.
There should be a github reward for doing careful code review on a 10,000 line PR. Thank you so so so much for your time on this. I'm very impressed with your ability to spot C bugs. I'll finish up the comments on the tests tomorrow and hopefully we'll be able to merge this soon!
Thanks! It has actually been not too bad reviewing, since your code is really quite clear (but indeed doing the lot meant getting sometimes a bit grumpy about boilerplate!), and I felt that after looking over the development of the new dtype and array method machinery, it made sense to try to understand an actual application. Also a question: the hardest part about reviewing is to see what is missing and I realized that I don't think I saw many tests with views (either implicit as one gets by slicing, or explicit
Though really the above came up because I wondered what happens if I do
Anyway, can obviously look at the code again, but not at this hour...
We can follow up on this, I think. But would be good to have a test that this just fails due to the |
It looks like we need more tests with views... Things that work:
Also no problem for short strings only:
Thanks for pointing this out, I wasn't aware of this wrinkle with the iterators. I'd rather not completely ban views, since I found it convenient with pandas in a few places to change the
It turns out there are two issues I need to fix to get this to work. First, I need a way to detect whether a descriptor actually owns the array data, or if it is a view into another array. It looks like setting
The other issue is that there's no way to easily determine in the ufunc loop if two different packed strings happen to share a heap allocation. As you said, the iterator makes a copy, so the packed string pointers are distinct and the input and output array pointers are distinct, but they still share an underlying heap allocation. One way to avoid this issue would be to assume that if there are any shared allocators between the input and output arrays, there may be an in-place operation happening, and in that case ufuncs should allocate a temporary output buffer, do the operation, and only then pack the string data into the output array. Of course that means we'll be doing copies unnecessarily inside ufunc loops for any possibly in-place operation, even if it isn't actually in-place. Another option would be to add a new C API function along the lines of:
which would return a positive value if the two strings share a heap allocation, 0 if they don't, and a negative value to indicate memory corruption or a logic error (e.g.
Do either of you two (or anyone else still following along) have an opinion on this? I think the second option is probably preferable and doable, but I'll need to make sure I audit all the ufunc loops to make sure they have such a check and add various tests that do operations with an array and a slice into the same array and an array and a view into that array. Should I fix this immediately before this PR is merged or should we wait until it's easier to do code review on the fix?
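The sharing check described in that second option boils down to asking whether two packed strings live in the same heap allocation (with a sanity check that each string fits inside its buffer). Here is a hypothetical Python model of that logic — the names `Allocation`, `PackedString`, and `strings_share_memory` are invented for illustration and are not the proposed C API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Allocation:
    """Stand-in for a heap buffer backing one or more packed strings."""
    start: int   # base address of the buffer
    size: int    # buffer size in bytes

@dataclass(frozen=True)
class PackedString:
    alloc: Allocation
    offset: int  # byte offset of this string inside its allocation
    length: int  # string length in bytes

def strings_share_memory(a: PackedString, b: PackedString) -> int:
    """Return 1 if a and b share a heap allocation, 0 if they don't,
    and -1 for corruption or a logic error, mirroring the tri-state
    return value described in the comment above."""
    for s in (a, b):
        if s.offset < 0 or s.offset + s.length > s.alloc.size:
            return -1  # string extends past its allocation: corruption
    return 1 if a.alloc is b.alloc else 0

heap = Allocation(start=0x1000, size=64)
s1 = PackedString(heap, offset=0, length=5)
s2 = PackedString(heap, offset=5, length=5)
s3 = PackedString(Allocation(start=0x2000, size=16), offset=0, length=3)
assert strings_share_memory(s1, s2) == 1  # same arena, distinct strings
assert strings_share_memory(s1, s3) == 0  # different allocations
```

The point the model makes is that distinct packed-string pointers (as the iterator's copy produces) are not enough evidence of independence; the ufunc loop has to compare the backing allocations themselves.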
Maybe disabling the broken views for now could make sense in this PR, to undo it later? But if not, that's also fine. I have to think about this more to give good input, but making a larger iteration here also doesn't seem helpful.
I similarly would need more time... But maybe you were right all along to compare the |
Thanks for this hint! This does mostly work, although I also need to fix the issue alluded to above in the string to string cast, since in that case I do need to do the more thorough memory sharing check that looks at the unpacked representation. So for now I've added a private version of
While doing this after applying some of the other test changes you suggested above, I uncovered two different bugs in the arena allocator implementation related to handling mutated strings. Glad I caught these! Unfortunately that took all day and I didn't have enough time to add the tests you wanted with views and slices, but I did verify they work now and the errors you ran into before are fixed.
Great to hear that things mostly work! I'm now reviewing an actual paper, but can have a last look at the ufunc loops/tests later. It will be very nice to have this in and be able to look at smaller bits & pieces...

@ngoldbaum - I slightly lost track of where we are with this -- let me know when it is ready for final review on the ufuncs and associated tests...

It's ready for final review whenever you're ready.

OK, great! Tomorrow hopefully!
mhvk
left a comment
OK, had a last look and only minimal things left, except one test case that still confuses me, see in-line comment (maybe I'm just missing something).
Other remaining question: as is, my sense would be to squash all commits together in the merge rather than have 196 commits with a path that may not be random but seems unlikely to be helpful (not sure all tests passed at all stages either). Is that OK?
return NPY_NO_CASTING;
}

template <typename t>
&newsize, cursize, factor);
if (overflowed) {
    npy_gil_error(PyExc_MemoryError,
                  "Failed to allocate string in string multiply");
    return (NPY_CASTING)-1;
}

if (eq_res != 1) {
Feel free to ignore, but I'd write if (!eq_res)
}

static int
minimum_strided_loop(PyArrayMethod_Context *context, char *const data[],
The minimum and maximum strided loops are also a bit of auxdata away from being able to share the same code... Named them in the task list.
INIT_MULTIPLY(Int64, int64);
INIT_MULTIPLY(UInt64, uint64);
// TODO: for reasons I don't understand at the moment we can't add these using
@seberg - any advice on why the promotion doesn't "just work" without explicitly adding promotors for every integer type?
I am not sure what @ngoldbaum tried. I assume it refers to the code below, which is indeed quite annoying!
I would think what works is adding a promoter for &PyArray_StringDType, &PyArray_DescrType (or None I think might work). Which will then be called always if one of the two dtypes is string dtype (and it must finish the promotion).
Can one put in the more general abstract integer and unsigned integer types?
The abstract integer dtype is only for python integers. We don't have a way of representing the dtype type hierarchy in the C DType API at the moment.
The promoter sebastian is suggesting will only fire for multiply, so the fact that it will fire for all dtypes is fine because multiplying by all other types is an error. Of course that limitation would be more annoying for other dtypes.
Yes, I always meant to create an abstract base DType you can easily register with that matches any integer/float, etc.
For some reason it was a bit trickier than I thought it would be (I don't recall now). The machinery should support that just fine (I am not 100% sure we might have ripped the logic out because it was unused at the time).
At some point I thought the Python abstract integer DType could serve that purpose also, but I don't think it can anymore (we need to match exactly for some of the more complex stuff).
OK, thanks! And, yes, the promoter with None sounds like a good idea as well!
out[0] = "hello"
res = sarr.take(np.arange(len(string_list)), out=out)
assert res is out
needs a repeat of assert_array_equal(sarr, res) just to be sure the "hello" is overwritten...
The last push addresses today's comments. Thanks @seberg for the suggestion to refactor the multiply promoters, it's much simpler now with a single promoter. Once Marten approves I'm happy to squash this into one enormous commit. There might be some history in the PR that would be useful in the future though so I'm also happy to leave it as-is. It should at least build on most systems on every commit.
Sounds good. However you like it, I would normally squash, but some big histories are just neat to see how much effort went in ;).
Yeah, you just |
tmp = signature[i];
}
else if (op_dtypes[i] != &PyArray_StringDType) {
else if (is_signed_integer_dtype(op_dtypes[i])) {
I think I'm getting a little less confused, but wonder if this can still be easier: If we just register one promoter for None that just sets int64 here and another one for uint64 that sets uint64, would we then not be OK without the is_(un)signed_integer_dtype functions? That would provide safe casting routes for all integers, no?
That can't work because I'm installing the promoter with a None entry in the dtypes I'm installing it for. That makes the promoter fire for all dtypes. There's no way to say "i only want to install a promoter for unsigned integers" without explicitly adding promoters for all unsigned integers like I had before.
I meant that None would be for all integers except uint64 -- which is the only one that cannot be cast to int64. I think that if the uint64 is registered first, that should work, no? But I must admit that even though I reviewed @seberg's PRs, I fear I don't completely understand the paths yet...
Registration order is meaningless, subclass relationships define priority. None is just the same as np.dtype (always matches any dtype with lowest priority). EDIT: probably None has a lower priority than np.dtype, but not sure.
But I think Marten is right (with caveat): If you install the None promoter it is guaranteed to never trigger for uint64 because there is a more specific uint64 loop already registered.
The (big) caveat is that, unfortunately, ulong and ulonglong are both uint64 and I never made them subclasses of each other :(. And until they are, you need to register both (that could be another promoter to convert the ulonglong to uint64, or another loop).
(I am wondering now if we can try to hack that subclass relationship without merging the two uint64 into a single DType, but not for the here and now!)
Another issue is that if, say, I pass a float array in, I get a casting error about not being able to cast to int instead of an error saying there's no ufunc loop registered, unless I have a check for is_integer_dtype in the promoter.
In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())
In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]
UFuncTypeError: Cannot cast ufunc 'multiply' input 1 from dtype('float64') to dtype('int64') with casting rule 'same_kind'
I personally think the is_integer_dtype check is fine, it's just a bunch of pointer comparisons and I doubt it would have performance implications. A single promoter that checks for all ints and unconditionally casts to int64 would also allow me to avoid defining an extra promoter just for ulong and ulonglong.
Of course if we're worried about performance at that level, we would implement the loops directly as I originally had it...
See latest push to see what I'm talking about. With this push I get an error that's more immediately understandable to me:
In [1]: arr = np.array(["hello", "world"], dtype=np.dtypes.StringDType())
In [2]: arr * [1.2, 4.5]
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
Cell In[2], line 1
----> 1 arr * [1.2, 4.5]
UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (<class 'numpy.dtypes.StringDType'>, <class 'numpy.dtypes.Float64DType'>) -> None
Promoters don't have any performance implications at all since the result is always cached (a bit too much if anything). The only question is what is nicer.
Yes, the only reason I brought it up was that I thought it would be nicer without the is_integer function, since it just seemed promoters should be able to handle this - so a bit less code (both source and compiled). But the beauty is slightly destroyed by needing different ones for ulong and ulonglong...
To me, the errors do not make too much of a difference - in a way, it is nice that you can do casting='unsafe' for float.
But really this is not a big deal - I really just hoped we could avoid registering a whole slew of loops and that is done now! So, @ngoldbaum, happy to go with what you feel is nicest here!
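The promoter being discussed can be modeled in a few lines of Python. This is purely illustrative — the real thing is a C promoter registered through the experimental DType API, and `promote_string_multiply` is a made-up name — but it shows the decision the is_integer_dtype variant makes: every integer operand (including uint64, ulong, and ulonglong, without separate registrations) maps to the int64 loop, and everything else surfaces a "no matching loop" error.

```python
import numpy as np

def promote_string_multiply(other_dtype):
    """Model of the string-multiply promoter: integer dtypes promote
    to int64; anything else has no loop (hypothetical sketch)."""
    if np.issubdtype(other_dtype, np.integer):
        # The is_integer_dtype check: one branch covers every signed
        # and unsigned integer, so no per-dtype promoters are needed.
        return np.dtype(np.int64)
    raise TypeError(
        f"ufunc 'multiply' has no loop for (StringDType, {other_dtype})")

assert promote_string_multiply(np.dtype(np.int8)) == np.int64
assert promote_string_multiply(np.dtype(np.uint64)) == np.int64
try:
    promote_string_multiply(np.dtype(np.float64))
except TypeError:
    pass
else:
    raise AssertionError("expected float64 to be rejected")
```

This mirrors the trade-off in the thread: the explicit integer check costs a few pointer comparisons per promotion (and the result is cached anyway), in exchange for a clearer "no loop" error instead of a confusing int64 casting error for float inputs.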
OK, in that case... can I hit the merge button? 😁
Yes go ahead! Let's have this wonderful addition!!
OK, squashed everything into one commit. I'll merge it tonight once all the tests are green, just in case.
This PR adds stringdtype to numpy. See NEP 55 for details.