-
-
Notifications
You must be signed in to change notification settings - Fork 12k
ENH, API: New sorting mechanism for DType API #28516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Running this script on both branches: import numpy
import random
import timeit
print(numpy.__version__) # '2.0.0rc2'
options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))
print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))produces on master: and on this branch: |
|
@ngoldbaum just to let you know, I'll let you decide on whether you want this. I am starting to think it is time to implement a |
Agreed. @MaanasArora as you said this was a draft, would you be up for a bigger refactor? IMO this functionality deserves support in the new DType system without relying on the ArrFuncs baggage.
Also agreed if you don't want to take this further. |
|
Yes, agreed, and willing to do a larger refactor! I actually began by considering special casing array sorting for strings overall, but wondered what the preferred approach would be. I think the sorting routines are not very flexible and could use an overhaul. PS just to clarify, adding a slot to the dtype will mean we will restructure the sorting to be more generic and allow replacing or extending compare etc.? I'll look further into this but would appreciate pointers. |
Take a look at We already have entries for getitem and setitem as well as the legacy arrfuncs slots. You'd be migrating from using |
Yes, if you look around, you will find for example I see nathan has some other pointers, I don't expect mine to be quite enough, so please ask! |
|
Thank you both, this was helpful! Starting to plan this now and will surely clarify if needed. |
|
I've added the slots and done some patchy work around using it, and the stringdtype integration. Looking into how to better relate to the array funcs. WIP, but hopefully this is in the right direction! |
|
Yeah, I think you see we still have a lot of other functionality that we should have slots for. Definitely nonzero, at least. |
|
Yes, nonzero and some other arrayfuncs could definitely use a slot! Thank you for the guidance--I've completed most of the missing pieces I think. I assume we would deprecate some of the earlier uses more gradually, so I've fallen back on array funcs as defaults in some places. I think this should be ready for a first pass! |
| } | ||
|
|
||
| return 0; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a lot of somewhat redundant code, but I added it here as a 'test' use which provides a boilerplate to replace the routines with a more efficient special-case indirect sort in a future PR. As a "bonus" it allows us to at least temporarily do the allocation this PR was intended for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I'll let @seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.
That said - there definitely are specialized string sorting implementations we could be using here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good, the thing that I would like to change is the PyArray_(Arg)SortFunc itself, so that it gets a context and auxdata instead of the "array". And this function would get NpyAuxData *out_auxdata in also.
(Although, maybe for sorting this is less interesting as it is not as common to sort many arrays in one.)
The unfortunate thing is that you need to wrap the existing functions if you do this or have a second path for the old function.
I am also considering if we should have a return -2 or so to indicate that the sort-kind is not supported (no error set), to allow NumPy to fall back to a different one.
(But I am not sure we need it, it is useful only for somtehing like mergesort/stablesort, explicitly.)
@charris may have a thought on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking into passing context now! It looks like a good idea, will try to implement. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change would be really nice; unfortunately both the PyArray_CompareFunc and the sort functions use the PyArrayObject right now, which we probably (!) shouldn't get from the context.
If I'm thinking right, we can define a new type such as PyArray_SortCompareFunc that uses the descr instead of the array and make new sort functions that do not use the array somehow (as we can no longer interchange the SortCompareFunc and the CompareFunc!), but we would still probably need the old functions to use with the older compare functions; I think the duplication will be quite complicated.
At the same time, this would be a missed opportunity to have the new CompareFunc type if we do deprecate later and want to go down this route...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two possible approaches here:
- We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
- You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
That is, we wrap it into a dummy object for which basically the only valid field isarr->descr(and maybearr->flags, don't recall). (Seeget_dummy_stack_array, yes this is terrible and even reviewers stumble over it, but...)
If you do that, you can write a short function that wraps the old call into a function taking the new one.
I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.
Allowing to set a compare function, seems like a nice idea (also to have a simple default).
It would be nice to move that into a default slot function. I.e. rather than setting it for StringDType here, auto-fill the slot with the function that tries to use the SortCompareFunc (If that slot is undefined, we can keep the slot filled with NULL).
That also removes the second check later.
About this SortCompareFunc, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect the dtype to do the comparisons -- for example structured has to do this).
But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e. <=> in C++ can return a partial order, which means that:
- For us probably an "error" is a valid return (right now we can't propagate errors!).
- "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve with
compare(b, b)andcompare(a, a).
But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left,unordered_both,unordered_right.
Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try the first approach I think! I think it will help isolate the new API in a way that makes deprecation easier later too. I will also do this default logic for both compare and sort compare funcs.
As for allowing partial order: yes that could break the symmetry, I suppose. But it could be useful to make sorting more precise in the long run too, and so "unordered" does seem a better way to allow for those kinds of extensions. And then we can use this machinery as the go-to for anywhere comparison decisions need to be made as you said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an update: after some thought, I think it might be quite nice to go with unordered_left, unordered_both, unordered_right, mainly because it saves us any issues with reverse sorting down the line, might as well get everything in, especially as you mentioned dataframes and that's clearly a very important use case. Working on this! I'll try to draft an API that can easily fill in 'defaults' somehow, so that the user-facing side can be used at different levels of complexity / customization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pushed a draft for this! Includes reversed (currently not functional, but working on that) and nan_position added into the context.
I've also restructured the context / legacy wrappers and updated the SortCompareFunc to use the sorting context instead of the descr; it should allow richer features down the line.
| npy_intp, int, PyArray_SortFunc **); | ||
| typedef int *(PyArrayDTypeMeta_GetArgSortFunction)(PyArray_Descr *, | ||
| npy_intp, int, PyArray_ArgSortFunc **); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New stuff in the public API needs new API docs as well as a release note describing the new features.
Maybe also as a proof-of-concept, it looks like both quaddtype and mpfdtype in numpy-user-dtypes implement sorting - would you be willing to update them to use the new API in a PR to numpy-user-dtypes that depends on this PR to numpy? That should give you a feeling for whether this API is helpful for someone writing a new user dtype. It'll also be a form of documentation - we don't have great docs for writing user dtypes besides the examples in numpy-user-dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what should we do about the flags that got added before we made the dtype API public, e.g. NPY_DT_PyArray_ArrFuncs_compare? I guess we can deprecate them although I don't know how hard it would be to generate deprecation warnings if those are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easy to generate a deprecation warning during registration (a bit tedious maybe, as you need explicit check).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll add API docs and a release note, and willing to make a PR to numpy-user-dtypes! Will look into that.
Just to be clear, NPY_DT_PyArray_ArrFuncs_compare is still needed right? We can move it to a new slot rather than an arrayfunc but it's going to be different from the sort comparison for now if I'm thinking right (as it is user-facing rather than used in the sorting). Do we need to do this another way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't change slot numbers (unless they are guarded as private)! So the numbers are fixed (until they have not been used for a bit at least).
So yeah, I think we should keep it the old slot for now, maybe easier to make the deprecation a follow up.[^depr]
So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.
(It's not a big issue, the only thing is the convenience if slot numbers == slot offset so you don't need to translate it.)
[^depr] I think this is as simple as asking users to compile with the new NumPy, and then adding PyArray_RUNTIME_VERSION, but this PR is complicated enough due to API decisions for the new loops, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we just have to live with the numbering we got, I half thought I asked for an offset for the NPY_DT_PyArray_ArrFuncs slots, but maybe I didn't bother.
There is an offset, _NPY_DT_ARRFUNCS_OFFSET:
numpy/numpy/_core/src/multiarray/dtypemeta.h
Lines 94 to 95 in 9389862
| #define NPY_DT_MAX_ARRFUNCS_SLOT \ | |
| NPY_NUM_DTYPE_PYARRAY_ARRFUNCS_SLOTS + _NPY_DT_ARRFUNCS_OFFSET |
| } | ||
|
|
||
| return 0; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I'll let @seberg evaluate whether this API needs adjustments to fit in with the broader DType API but at a first glance it looks reasonable to me, especially if the common case with no specializations can be done with less boilerplate.
That said - there definitely are specialized string sorting implementations we could be using here.
| } | ||
|
|
||
| static inline PyArray_CompareFunc * | ||
| PyArray_SortCompare(PyArray_Descr *descr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd call this PyArray_GetSortCompareFunction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, made this change. Thanks
| * additional minor optimization skips the recursion depth checking on the | ||
| * smaller partition as it is always less than half of the remaining data and | ||
| * will thus terminate fast enough | ||
| * will thus terminate fast enough` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was added by mistake?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry!
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating so quickly :)
I think the new error path needs a little more thought, sorry for the back-and-forth.
| sort-kind and order. | ||
|
|
||
| Additionally, the new `NPY_DT_sort_compare` slot can be used to provide a comparison function for | ||
| sorting, which will replace the default comparison function for the dtype in sorting functions. No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a note that the old arrfuncs slots may be deprecated in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks!
| #define NPY_DT_sort_compare 11 | ||
| #define NPY_DT_get_clear_loop 12 | ||
| #define NPY_DT_get_fill_zero_loop 13 | ||
| #define NPY_DT_finalize_descr 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you re-order these so the slots that were already in the struct keep their old values? I don't know offhand if changing this order is problematic but it seems more consistent to not change the old values even if it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, done!
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, long comments, and I realize this is becoming a lot more complex than it may have looked initially. But, I would really like to see the context passed in/a new signature.
That also probably includes filling in/returning ARRAY_METHODFLAGS, even if the only useful flag is "requires GIL".
I also still tend to think it may make sense to have a magic return for "unsupported sort method", although should maybe ask Chuck once in a meeting about that.
(in principle I agree we usually just need stable and not-stable, but if we want users to be able to choose more precisely, I think it may make sense to allow us to fallback here. We could still use something like "no error indicated, but func == NULL for it even, but maybe a special return is easier.)
If needed, maybe we have to talk briefly about it synchronously? Or maybe just write the docs/signatures first that we want for the public API.
| @@ -0,0 +1 @@ | |||
| * `PyArray_GetSortFunction`, `PyArray_GetArgSortFunction`, and `PyArray_GetSortCompareFunction` have been added to the C-API. These functions return the sorting, argsorting, and sort comparison functions if provided for a given dtype in new slots. No newline at end of file | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did not actually add them to the public C-API. Which is totally fine, though.
(I might start with adding a SortBuffer() function or so.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still valid, I think we might as well remove this for now.
| } | ||
|
|
||
| return 0; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two possible approaches here:
- We deal with it in the sort function, and just have to different calls depending on whether it is an old or new sort function.
- You ignore the fact that an array is currently passed (effectively). We do that in some other places as well, due to how terrible it is.
That is, we wrap it into a dummy object for which basically the only valid field isarr->descr(and maybearr->flags, don't recall). (Seeget_dummy_stack_array, yes this is terrible and even reviewers stumble over it, but...)
If you do that, you can write a short function that wraps the old call into a function taking the new one.
I did the second for ufuncs (not sure if that was the easier!), so I suspect the first is likely the simplest here.
Allowing to set a compare function, seems like a nice idea (also to have a simple default).
It would be nice to move that into a default slot function. I.e. rather than setting it for StringDType here, auto-fill the slot with the function that tries to use the SortCompareFunc (If that slot is undefined, we can keep the slot filled with NULL).
That also removes the second check later.
About this SortCompareFunc, it may make sense to keep it "light-weight" (i.e. a single function even if that may not be ideal if you have to inspect the dtype to do the comparisons -- for example structured has to do this).
But I would like to think about what we need to sort things like NaNs if possible. Unfortunately, I am not immediately sure, i.e. <=> in C++ can return a partial order, which means that:
- For us probably an "error" is a valid return (right now we can't propagate errors!).
- "unordered" is a valid return, although I am not sure how to deal with it. If we have "compare(a, b) == unordered" (i.e. one or both are NaN), we don't yet know how to swap them. That may be possible to resolve with
compare(b, b)andcompare(a, a).
But the only way I am quite sure how to resolve possible reversed sorting, etc. might be to haveunordered_left,unordered_both,unordered_right.
Or we should keep it roughly as is, and accept that this function doesn't exist for all dtypes... A neat thing about having a clear order with "unordered" is that we could also use it from the comparison (u)funcs.
| NPY_SORTKIND which, int descending, PyArray_SortFunc **out_sort) | ||
| { | ||
| if (NPY_DT_SLOTS(NPY_DTYPE(descr))->get_sort_function == NULL) { | ||
| return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to set an error (TypeError or DTypeTypeError, which is defined somewhere I think.)
(An error here will make sense after you move the fallback logic into a default slot filling. Or you could have the default slot raise an error, that is also completely fine.)
|
No worries; thank you for the detailed feedback actually! It's nice to be able to iron out the direction for the API. I'll address the docs and public API changes and keep working away at the SortFunc changes. Happy to have a synchronous chat if it seems useful. |
| typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *); | ||
|
|
||
| typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *, | ||
| PyArray_Descr *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming is a bit weird here, but I didn't want to disturb the original type as it's used a lot. I think the SortCompareFunc should still be a unique type so will do that (even if only a clone of this type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function from the np.less_than or np.great_than ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements, a compare ufunc would work better...)
But, on the other hand, it seems pragmatic even if it won't work well e.g. for structured dtypes (performance issues), it will always work and provides an easy entry-point (we can also use this to define default comparison ufuncs).
So overall, I think I end up at just doing this, although I could imaging punting if we don't need it for StringDType (I suspect we do, though).
Would like to hear if @ngoldbaum has an opinion.
(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is fine, if only because it exists right now 😄
|
Sorry for the bit of delay, I was thinking through this and essentially ended up with separating more the legacy sorting machinery from this API. This way, the new signatures can freely use context-related features and we do not have to create some sort of empty array or refactor a very large number of sorting-related files. Aside from how nice the feature is, I think this separation is actually a plus and even rolling back the stringdtype integration was worth it (in any case, user dtypes are not to define sorting with the internal, now legacy functions, so may be best to add a specialized routine). We also have the new |
|
Sorry for not getting to this yet. I'm going to try to make sure to give this a once-over next week. I think you can fix the test failures by rebasing? |
|
No worries, I have some things to address as well. Just rebased--sorry not sure if things went perfectly smoothly. |
|
Just brought this implementation with the new signature to parity with the previous one, including the usage in StringDType and ensuring use cases for the new and old sort functions are handled properly in the handlers. There is repetitive code but I guess we will phase out the legacy slots. Now we can make the context nicer if needed, and incorporate the auxdata! |
d7cd9ed to
0e8b6a5
Compare
|
Hello! Getting back to this, is there anything I need to address? Thinking of adding the functions to the public C-API if things look fine. Would we need to create a new C-API version (regenerate the hashes and such), and I guess it would come under 2.4, given how close 2.3 is? |
seberg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, yeah, this won't make 2.3, sorry. I think it might be good to discuss a bit in depth with @ngoldbaum some time (not next week, sorry).
Another thing that I would like addressed/discussed is the problem of reverse sorting.
I do think we need at least a reverse=True, I think it might make sense to also provision for a nan_position (if nan goes first or last).
(NULL/NA ordering is very important in dataframe world, and I am tempted to include this, even if we say that the value for now is always "last").
doc/source/reference/c-api/array.rst
Outdated
| embedded references. | ||
| .. c:type:: int (PyArray_SortFunc)( \ | ||
| void *start, npy_intp num, PyArrayMethod_Context *context, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move the context to the first spot just for similarity. I think I added a context for traversal functions, I am not sure that was smart, but since we have it, it may be a slightly better fit.
I might call start, data (not that it matters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done!
doc/source/reference/c-api/array.rst
Outdated
| NpyAuxData *auxdata, NpyAuxData **out_auxdata) | ||
| A function to sort a buffer of data. The *start* is a pointer to the | ||
| beginning of the buffer containing *num* elements. A function of this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| beginning of the buffer containing *num* elements. A function of this | |
| beginning of the contiguous containing *num* elements. A function of this |
It also should be aligned, but I have to think whether we should allow indicating support for unaligned data here. (Which would require flags, for ufuncs "supports unaligned" is flagged before get_loop(), although since here we always do contiguous, flagging it inside get_loop() is OK also -- that is, becuase get_loop() is not passed any strides).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense! Committed (modulo typo).
| NpyAuxData **out_auxdata) | ||
| A function to arg-sort a buffer of data. The *start* is a pointer to the | ||
| beginning of the buffer containing *num* elements. The *tosort* is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charris to confirm, even for argsorting it probably makes sense to always use a contiguous buffer for sorting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe some, but we don't sort the data being argsorted, so it will eventually be accessed out of order in any case if the data is random. All this is to say, it depends :)
doc/source/reference/c-api/array.rst
Outdated
| PyArrayMethod_Context *context, NpyAuxData *auxdata, \ | ||
| NpyAuxData **out_auxdata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| PyArrayMethod_Context *context, NpyAuxData *auxdata, \ | |
| NpyAuxData **out_auxdata) | |
| PyArrayMethod_Context *context, NpyAuxData *auxdata) |
The out_auxdata belongs on the get_loop function!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done, thank you :)
doc/source/reference/c-api/array.rst
Outdated
| .. c:macro:: NPY_DT_get_sort_function | ||
| .. c:type:: int *(PyArrayDTypeMeta_GetSortFunction)(PyArray_Descr *, \ | ||
| npy_intp sort_kind, int descending, PyArray_SortFunc **out_sort); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs *out_flags, since we need the ability to indicate whether the GIL is required (I think we can ignore FPEs), but who knows if we'll have a reason for other flags eventually.
It also needs **out_auxdata, since auxdata needs to come from somewhere :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
| } | ||
|
|
||
| sort = PyDataType_GetArrFuncs(PyArray_DESCR(op))->sort[which]; | ||
| PyArray_GetSortFunction(PyArray_DESCR(op), which, 0, &sort); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, let's just use < 0 to decide if it's an error. In which case sort != NULL is assumed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, this is done!
| typedef PyObject *(PyArrayDTypeMeta_GetItem)(PyArray_Descr *, char *); | ||
|
|
||
| typedef int (PyArray_CompareFuncWithDescr)(const void *, const void *, | ||
| PyArray_Descr *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have slightly mixed feelings. On the one hand, I think this is the pragmatic thing to have.
On the other hand, we could also look this function from the np.less_than or np.great_than ufunc to implement sorting, I think.
(The problem there is still how to deal with unordered elements, a compare ufunc would work better...)
But, on the other hand, it seems pragmatic even if it won't work well e.g. for structured dtypes (performance issues), it will always work and provides an easy entry-point (we can also use this to define default comparison ufuncs).
So overall, I think I end up at just doing this, although I could imaging punting if we don't need it for StringDType (I suspect we do, though).
Would like to hear if @ngoldbaum has an opinion.
(A neater future path would also be if this was more of a header-only code binding generator job with us making the sorting patterns available maybe. I.e. if this was defined in a C++ class and our sort code available, the DType could compile the full loop and avoid calling such a helper everywhere.)
|
Thanks for the comments! And no worries, I was just making sure I wasn't missing something to do :) I think I need to think a bit more about the best way to adjust this for the extra features you mentioned, yes. Unordered elements is definitely something to consider at this stage, so I might try to draft something for that soon enough; that should hopefully create a clearer story around these features! |
39f5ef2 to
237b7f0
Compare
|
I want to call your attention to this suggestion: #28516 (comment). Did you ever take a look at numpy-user-dtypes? A worked example would help. |
|
Yes, sorry, I took a look actually--but was having some trouble with installing the dtype packages over the editable install of numpy. I'll push my draft anyway and try to address that. Thanks for the reminder. |
| allows defining custom argsorting implementations for each of the | ||
| sorting algorithms numpy implements. | ||
| sorting algorithms numpy implements. If `NPY_DT_get_argsort_function` | ||
| is defined, it will be used instead. This slot may be deprecated in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does it need to be defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are dtype slots, so they will be defined with the user dtype. Here is the example for StringDType.
| Indicates that NaN values should be sorted to the end. | ||
| .. c:enum:: NPY_COMPARE_RESULT | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, C enums may be unsigned depending on whether they are explicitly defined. Are these actually enums, or just here for markup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are actually enums, defined in dtype_api.h!
| NPY_UNORDERED_BOTH = 4, | ||
| NPY_COMPARE_ERROR = 5, | ||
| } NPY_COMPARE_RESULT; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the comparison function need to handle all those values? What if it doesn't care? The current sorts only require NPY_LESS, and only use the (-1, 0, 1) sequence where that is conventional for code that comes from outside of Numpy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, perhaps it doesn't, though the main issue comes up in nan sorting (as in dataframes) as was mentioned by @seberg. If we don't include this, NaNs would be treated consistently in both ascending and descending sorts, causing their position to be flipped, if that is a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see making sure that -1, 0, 1 retain their meaning for simplicity. You could also just not bother here and say if you want it you need to use the more complicated API...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this mean certain user dtypes (that choose not to bother) would not support NaN order sensitive sorting? Should we allow to indicate that somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so dunno if it's good. Maybe the best thing is to pass in the information, then the choice to ignore it is rather obvious :).
(I.e. return -1, 0, 1, and pass in where to put NaN... unfortunately, it would be "value of NaN" style not "start/end", but "smallest/largest value")
Another thing is, @charris do you feel we should prepare for nan_order="undefined" thing?!
EDIT: I suppose maybe passing it in but returning -1, 0, 1 is the best?
|
|
||
| typedef int (PyArray_SortFuncWithContext)(PyArrayMethod_SortContext *, | ||
| void *, npy_intp, | ||
| NpyAuxData *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is in NpyAuxData?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any additional information that the sorting may need to access. I suppose it's for future compatibility, if say, we need to add dtypes that use auxiliary data in sorting or even sort multiple related arrays in some way, but @seberg would probably have a better idea of the use cases he thought of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auxdata is what all casts (and ufuncs) are passed. It allows holding on to temporary data. E.g. casts need it for structured dtypes (and this could make sense here in the same way).
|
It would be helpful to document how this apparatus is intended to operate. Maybe with some diagrams with arrows tracing a call. |
|
I have a somewhat different concept as to how this should be handled. I would just add two slots that take function dispatchers that get passed the context and return the relevant sorting implementation, if there is one, null otherwise. The context would also contain a I don't think this is very different from what is here at the lower level, but the upper level is somewhat different. It just adds a level of indirection. For reuse we probably need access to all the numpy builtin dtypes, so that their sorting functions can be called from user dtypes if needed. Maybe a toplevel function dispatcher? We currently have builtin introsort, heapsort, timsort, and radix sort, do we want to expose those to users? We should add to those decending sorts and the nan options. The main idea here is the explicit use of function dispatchers, and maybe an explicit distinction between builtin and user. The goal should be simplicity for the average user, with detailed selection for experts who want it. I am reminded of our random module in that regard. I think discussion should be about the top level of this. The original idea for making the builtin sorting functions a library was to make it easy to add them to the existing dtype slots. |
@charris can you sketch that out slightly more? I think, you are basically saying to drop the element-wise compare function and just do the Then, rather than exposing an elementwise function, we have a public For FWIW, I am not too worried about versioning. We can add things to the context if backwards compatible. If not, it is possible to add a |
|
Thanks @charris, that makes sense! I think we can definitely simplify the implementation. If I’m understanding correctly, you mean we could drop the sort-compare and expose the internal That sounds really nice, but the current sort functions read the comparison from the DType (a legacy array function rather than a slot, via Essentially I am wondering if the sort-compare user is actually more common than the sort user, mainly because most 'compound' types like StringDType or multi-precision float would need a new sort_compare or a complete new sort, and sort-compare is easier. Of course a dtype could go all the way and rewrite the functions completely, but then exposing the builtin functions may not be that useful. Of course this doesn’t mean we need a full new slot! It actually seems to be an implementation detail, yes – when you say ‘indirection’ at the sort-compare level, would you like something like this (except we pass the sort-compare some other way if you don’t want it in the context)? stringdtype_get_sort_function(SortContext *ctx)
{
npy_quicksort_with_context(ctx, &stringdtype_sort_compare, ...);
// or something like this, though probably not this?
NpySortContext_SetCompare(ctx, &stringdtype_sort_compare);
npy_quicksort_with_context(ctx, ...);
}I think this would actually be quite nice! Users can choose to do this, or reimplement the full function. Sorry, this got long, but just wanted to clarify what you were thinking. Thanks again for your feedback; it was very helpful! |
They don't, except for generic types. The comparison functions are defined as inline functions in What I am thinking is that for public use there are four attributes to select sorting shim functions. The following information needs to be passed, either in context or by arguments:
Note that with included dtype, only one dispatcher is needed for all dtypes. Or we could omit the dtype and attach different dispatchers to each dtype. The returned sorting shim function allocates/releases working memory, makes contiguous copies if needed, and calls the lowlevel sorting implementation. Arguments something like
This is close to what the current implementation has evolved to. In the beginning we called by algorithm, but here call by functionality. The question if we want to expose the lowest level functions, or maybe the shims, in addition to the dispatchers. I don't intend to this to be the final proposal, but I think this design should start by answering the question of what we want the new interface to accomplish. Feedback and alternate proposals would be good to hear. |
Ah, right thanks, those are for most builtin types indeed. Only dtypes for which explicit dispatchers aren't defined actually take the That aside, the API seems good! What I understand is that we pass
I'm not sure I fully understand the dispatcher idea but it seems useful to add a layer on top of the low-level sort. The lower-level generic implementation, as far as I can see, does not contain axis and output array arguments. So, to be clear if I understand this correctly, if we want to implement these changes, I think we first need to decide:
And then maybe:
Sorry if this is off base. I'll think about this further, thanks! Edit: If all of this won't be done in this PR, perhaps just the user API with some dispatch architecture (such that it fixes the external API) would be a good place to start. Once it's fixed, we can change the internals. |
|
I would be happy to say that we only need stable/unstable sort kind. The main comment I have is that
We may be thinking of much the same, but I would like NumPy sort to deal with contiguous copies, result allocation and outer iteration, so that the (user) DType only needs to provide the lowlevel sorting implementation. (I.e. the same pattern that ufuncs/casts use) I would like to achieve that anything that a user DType may provide doesn't take arrays. NumPy DTypes need no knowledge of arrays (except for some old API, one of which we are replacing here). I suppose I am just repeating that Although, I would be OK to just provide a But beyond that, e.g. whether we should have a
We should always include the dtype (i.e. PyArray_Descr), for example In a sense I would love if we could implement sorting as (simple) gufunc even if that is overcomplicated for sorting... Unfortunately, we don't have a defined way to pass the parameters like descending. |
|
@seberg Your proposal sounds like the current implementation, but with more slots for the low level functions. As I see it, we would like eight slots rather than the current three, that is, three booleans:
We currently only really use two: stable/possibly not stable. Note that we already have shims for functions that require workspace, the shims allocate/release working memory. The question then becomes how to allow extensions without making disruptive changes. I suggest that the easiest way is to have a pointer to a structure dedicated to holding function pointers rather than a function that returns the wanted pointer. The only complication is that the function signature is different between the generic/builtin cases, but I suppose that can be handled by a NULL in the comparison function arguments. If we are not worried about future extensions, we could just add eight slots, sixteen total with sort/argsort. What is it you would like with dtype inheritance? I am interested in what users might like, @mhvk and pandas might have some ideas. |
I would like to just add one slot actually (please forget about the point that we don't have to use that). I.e. the function (a second one for argsort): Now, internally, the dtype may or may not have 8 different versions. Because the new Why the
Now, passing
We are not (overly) worried about the future, because these are not the same kind of slots as the old ones. These are slots on the DType class the |
|
To stress this: Even if we add So we should be comfortable with the choice for the mid-term, but we don't have to worry too much in case someone comes around and wants sorts that e.g. work with non-contiguous data (if we don't really expect that to happen). Adding nan last/first and descending now, is because there is clear interest for it in the near or mid term (e.g. because of |
|
There are going to effectively be 8 slots even with the function + context, it is a matter of how they are accessed. I'd rather they be in a table instead of in a switch. That said, I'll note that the improvement in speed between the original quicksort implementation and the type specific one was about 5x. That was due to several things, first, builtin comparisons, and second, a cleaner and simpler version of quicksort. |
True, although some of them may be duplicate entries (if perfect specialization isn't worth it or unnecessary e.g. for integers which can ignore NaNs).
I don't mind that too much the slot could be a struct with 8 entries. Although, I prefer this because of the reasons I said above:
But yeah, it may be that that we repeat the pattern to dispatch to the specialized version a few more times with my preference, because it could be repeated for multiple DTypes rather than just once. |
|
That said, I guess we could always add my Slots may have the advantage of being slightly easier to reason about for user DType authors as well. (i.e. for ufuncs, the user doesn't need to use the |
|
For the public API it seems easier and clearer (in intent) to me to use a function rather than a struct. I'm not too familiar, but I would think it is possible for compilers to optimize, say, NaN order and descending, if the user writes obvious fast paths; in such cases it may be simpler to allow the user to merge functions than write 8 wrappers or versions (and as @seberg said this is similar to ufuncs too). And not all dtypes would care (equally) about all options (especially nan). Also, this seems more consistent with the slot design we have so far. Also I think it would be nice (if @charris meant that as well) to expose some of the internal sort machinery we have and in that case user dtypes can reuse it more easily with a single function, presumably, if we wrap things (efficiently) (though this ties into the discussion whether we need to generalize compare, which I'm not sure about anymore...) |
|
To be clear I'm not against going with either approach if there are strong opinions, just thinking out loud! Happy to go with whatever everyone thinks makes sense :) |
|
To be clear, I wasn't recommending actually adding 8 (16) slots to the dtype, that was just to illustrate the effective functionality. What I would suggest is adding a pointer to a structure instead of a function. Yes, some of the entries could be duplicates or NULL. For instance the only reason to maybe prefer quicksort over radix sort for short integers is that radix sort requires extra memory. The only difference from a function is using an index rather than a context. I suspect a function might want to use a table for implementation, converting the context to an index. So I guess it is just a matter of the easiest way to pass the selection information. I would suggest either new |
|
For descending sorts, we could just reverse the output of existing sorts as a quick, if dirty, solution. That would also work for floats if we added the option of sorting nans to the beginning. |
Yeah, that's a good point. If we provide the family of functions (yes, via a single slot as you said), then we can auto-fill half the slots via reversing others. So in the end, I think where we are heading is basically adding a sorting-slots struct. That was maybe always the right thing to do (even for the fewer ones in the PR, I guess I just didn't need it as much before; the one indirection should be meaningless). Since I do like the idea of potentially adding a I don't have an opinion on how many sort functions we include. 4+4 we need, but maybe all 8+8 (as the NaN versions can be implemented via reversing if descending sorts to end). |
|
Been thinking about this some more, The current logic is:
That is pretty much what this PR does AFAICT. What I think we need is to have more sorting options, i.e., decending, maybe nan position. To make those accessible, we need to either extend NPY_SORTKIND or add a new C-API function with appropriate arguments. I think that is the first decision to be made. Once that is decided, we should implement that for all current sort types and use it internally. That will allow the current tests to be used to check the infrastructure. That is steo one. The second step (new PR) would add descending builtin sorts, with extended tests. That could be broken down into several steps, maybe doing only generic/stable/etc at each point. Thinking about descending sorts, Come to think of it, reversal before sorting followed by reversal after should work in the stable case. Table pointer or function is an implementation detail, I could go with either. I think we should be able to get descending sorts added for 2.4. |
|
Yes, that is the current logic! In order of priority:
Just to be clear, for 1 and 2, the sort methods are restructured and What is not done yet is the sort implementations for built-in dtypes, like numpy integers or floats - those are also implemented in SIMD AFAIK. We only have public API so far; none of this is in use internally, so yes, that should be next so we can test it here. This basically amounts to replacing options 3 and 4 with 1 and 2 for all dtypes. This should be possible by initializing the internal dtypes with the |
|
How do you access those from the public C-API, PyArray_Sort doesn't have the arguments ... |
|
Should have been precise, sorry, functional in terms of the actual sorting routines, the levels down from We do need to add the actual options as you said. |
|
As I understand things, the first commit 87f410b is all that is needed to actually solve the slow string sorting. Perhaps it would be worthwhile to have that as a separate PR to solve the immediate problem, together with some benchmarks, until the dust settles around the wider refactor? |
Resolves #26510.
Allocates the lock for the StringDType array before sort and releases after.
I noticed the sorting algorithms independently get the compare function from the descriptor, so I have created new helper functions in
stringdtype/dtype.cbut not sure if that's the right place. Changes have been made only inquicksort.cpp(but will add others later), so this is a draft but would appreciate feedback.Posting simple benchmarks in a comment below. Thank you for reviewing!