-
-
Notifications
You must be signed in to change notification settings - Fork 12k
ENH: Add extended sorting APIs #29642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Whoa, that is one of the ugliest diffs I've ever seen. The actual code just adds new functions and deletes the content of old ones. I suggest looking at the files direct. |
|
Note that I use some static arrays inside the functions which are only written to when they are initialized. The initialization may not be thread safe, although they should be in C++11. They can be purely local without much cost. I used tables so that minimal changes would be needed if we add function slots one way or another. They contain variables that are not available until link time, but then, so do the standard dtypes. I'm not clear if function names can be considered constant at compile time, If so, I could simply pull them out of the functions and initialize them in the usual way. I think it can be done, but I not sure. |
|
OK, |
79ad3b1 to
2b45310
Compare
|
Cleaned things up a bit. @seberg You wanted to remove the legacy |
|
I think I'd still suggest that the new C-API does not have sort kind at all, and that the old functions just translate. Alternatively, for now, make the new functions private and have the public |
Internally, I've been using the fourth flag bit to mark using "kind" as a table index. We could translate the kind values into flag bits: kind == 0 -> 0 (default) or maybe kind == 0 -> 8 (default) i.e., if the fourth bit is set, use the lower bits to hold the flag value. I chose the fourth bit to reserve eight values of the keywords, but we could pretty reserve more space. I expect we will have a minimum of 32 bits to work with, flags is an int type. So we can get rid of Maybe we should just make the flags variable uint32. |
| * Sort an array in-place with extended parameters | ||
| */ | ||
| NPY_NO_EXPORT int | ||
| PyArray_SortEx(PyArrayObject *op, int axis, NPY_SORTKIND which, int flags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now looking a bit more closely, I see how you use which to infer flags. My sense would still be not to have which in the new C API call sequence at all, but, say, store it in the top 4 bits of uint32 flags, so one would have
which = ((int32)flags >> 28);
if (which == NPY_USEFLAGS) {
switch (flags & 0x0fffffff) {
...
/* no case 8 */
} else {
sort = sort_table[which];
}
And then call appropriately elsewhere.
In the header file, we should then reserve the appropriate bits.
Alternatively, it is perhaps easier to just rename the functions you have here Py_Array_Sort_int(...) and then define
NPY_NO_EXPORT int
PyArray_SortEx(op, axis, flags) {
return PyArray_SortEX_int(op, axis, NPY_USEFLAGS, flags);
}
Yet another alternative would be to remove the which part of the logic here altogether, and move it to the old functions. But that would seem to lead to quite a lot of code duplication.
|
The way I see things at the moment. Decided:
To be Decided:
I would also argue for reserving the lower 8 bits of of the flags for sort selection, and the remaining bits for hints. |
|
Agreed on the "decided". For those to be decided, I'd say "remap, make public, and deprecate" - though the deprecation could be done in a follow-up PR. |
2b45310 to
e953c3e
Compare
|
I went ahead and made the modifications. It isn't possible to set the enum type until C23, so to be safe I just left it an enum, which will change size if needed, but the size will be platform specific. I think that is safe. |
|
We can postpone adding documentation and keywords at the Python Level until we actually add the new functionality. |
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments, probably too early, since really they are mostly about making sure that the new C API gets fully exercised. (Maybe one way is to start with just the stable flag in this commit, and introduce descending and nanfirst in following ones, together with their support. But probably all premature...)
| */ | ||
| typedef enum { | ||
| _NPY_SORT_UNDEFINED=-1, | ||
| NPY_SORT_USEFLAGS=-1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe want to mention in the comment that this enum is intended to be deprecated eventually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we leave a comment here, it should be searchable. Hmm, maybe
\\ TODO: remove after sort kind deprecation expires.
| // remap which to flags, note that NPY_HEAPSORT is effectively | ||
| // remapped to NPY_QUICKSORT. | ||
| switch (which) { | ||
| case NPY_STABLESORT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also have NPY_MERGESORT here, right? For the C API, we cannot be sure that it won't be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mergesort and stablesort are the same thing in the enum. And mergesort isn't actually mergesort, it is either timsort or radix sort depending on the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, should have seen that. Ignore this one then!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, maybe add to the comment you have already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used NPY_MERGESORT instead, which might make more sense.
| // remap which to flags, note that NPY_HEAPSORT is effectively | ||
| // remapped to NPY_QUICKSORT. | ||
| switch (which) { | ||
| case NPY_STABLESORT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add NPY_MERGESORT also here.
numpy/_core/src/multiarray/methods.c
Outdated
| } | ||
| // remap sortkind to flags, NPY_HEAPSORT is effectively mapped to NPY_QUICKSORT. | ||
| switch (sortkind) { | ||
| case NPY_STABLESORT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here NPY_MERGESORT indeed cannot happen. Maybe add a comment just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just used NPY_MERGESORT instead, which might make it clearer.
numpy/_core/src/multiarray/methods.c
Outdated
| if (order == Py_None) { | ||
| order = NULL; | ||
|
|
||
| if (sortkind == NPY_SORT_USEFLAGS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth splitting out this sortkind checking into a separate function, since it is duplicated between sort and argsort. Something like
int get_sort_flags(NPY_SORTKIND sortkind, int stable, int descending, int nanfirst)
where the return argument is flags (or -1 if an error occurred. (Maybe better with flags passed in?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of such duplications, I thought to leave it for a later cleanup if desired. See also the legacy C-API functions. And wait until after the new slot is added, which might take over that whole bit and delete this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough!
indeed, though whatever we expose at the C level should be correct (even if it means erroring on |
e953c3e to
80babce
Compare
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good now, spotted just one missing break. Mostly now seems a question of how to do follow-up: as is, we introduce C API and (undocumented) python API for options that error. Do we want that? E.g., @seberg suggested to not have nanfirst yet.
On the other hand, things are now in place to add that!
numpy/_core/src/multiarray/methods.c
Outdated
| default: | ||
| case NPY_HEAPSORT: | ||
| case NPY_QUICKSORT: | ||
| flags = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a break after, no?
I'd also swap the order, so default is part of the end (also above, in item_selection.c, and below), but no big deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used all the different possibilities, so the default is not longer needed. The range is already checked by the sortkind converter.
e2140e6 to
9200fe0
Compare
|
All the keywords are there if you want to test them: They just won't be documented. I suppose we could print out the flag values to make it clearer. |
Yes, there is no reason in Python to add a kwarg early, it is just confusing to have it. On C the story may be a bit different (although, not much because I am curious now if we can't avoid the new function and just change the definition of |
Yes, we could. But you didn't want to extend SORTKIND :) And to be fair, that would have been a kludge. I rather like the new functions and the flags interface, I think things worked out well. The only real difference is heapsort not being public, and I've thought it unneeded since introsort went in. From the beginning it was the slowest sort, taking about 2x longer than quicksort, and quicksort has gotten faster. The flag interface is nicely extensible in the future, we have been abusing SORTKIND for many years as new sorting functions have come along: introsort, radix sort, timsort, SIMD quicksort, etc. We don't want to make users choose among those options, let the experts do it. All we have do to get rid of The parser then raises an error: |
|
Note that all user dtypes need is a compare function, they just put NULLs in the current sorting slots. We only need the current sorting slots for type specific builtin sorts. To add descending sorts, the quick and dirty solution is to add generic sorting routines and add them to the new tables which are introduced here along with the new interfaces. We only need an extended dtype if we want them to be builtin. We can start with the generic sorts and use them for timing comparison, which will tell us if it is worth going to type specific sorts. I need to see how runtime dispatching works, I haven't figured that out yet. |
Yeah, but I happen to care a lot about this, that if we add I don't see that we can expect generic sorts to have decent performance for
Maybe, I may have changed my opinion. But there were two different things here: One is keeping the actual algorithm choice (we are doing the opposite now, always removing that choice) and the other is absorbing the new sort option flags into a single larger enum. Now that we decided to narrow the algorithm choice to stable/unstable. I think avoiding
Isn't that what the other PR tried to solve -- and while it should morph with recent discussion, mostly did solve? |
The current values are orthogonal, so three enum entries gives 8 choices. I don't expect the enum to grow much, what will grow are the switch statements. EDIT: I expect the other PR, if implemented with either a table or function, to get passed a flag, which is just a number. If implemented with a table the number is an index into the table. In short, I expect the switch statements to go away, essentially moved down to the new interface. |
|
For instance, |
fe16eac to
fcc77bd
Compare
This PR adds new sorting APIs that can be used for extending our current sort types. The new C-APIs do not have the *which* argument of the legacy C-APIs, instead they have *flags* request flag API. In this transition, heapsort can no longer be called directly, introsort (quicksort) is called instead. The sort/argsort methods get new keyword arguments used to set the flags when *kind* is not given, so *flags* is invisible at the Python level. The new keywords are: - stable -- whether the sort should stable, default is False, - descending -- whether to sort in descending order, default is False, - nanfirst -- whether NaNs sort to the last or first, default is False. The main difference here is that the keywords select by functionality, whereas *kind* selects by algorithm. Note that descending and nanfirst are not yet implemented, so an error is raised if either is specified. The added C-API functions are: - (API)PyArray_SortEx, - (API)PyArray_ArgSortEx, Those are the two functions that need changes depending on how we want to implement adding new sort algorithms. They should be pretty easy to adapt to whatever we choose to do. The legacy C-API functions, PyArray_Sort and PyArray_ArgSort, have been modified to call through the new C-API functions. In order to do so the *kind* function has been encoded in the *flags* like so: - "quicksort" - default (0) - "heapsort" - default (0) - "mergesort" - stable (1) I note that the partitioning/selection functions have not been modified, we may want to do that a some point.
fcc77bd to
9f42dca
Compare
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last set of (very) nitpicky comments -- I think it is over to @seberg with this!
| * Updated with new names denoting requirements rather than the algorithm. All | ||
| * the previous values are reused in a way that should be downstream | ||
| * compatible, but the actual algorithms used may be different than before. The | ||
| * new approach should be more flexible and easier to update. The idea is that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No big deal but would delete the sentence "The idea is ..." - the implementation should not matter here (I may well end up thinking that this is the right approach, and now see how indexing is sort-of baked in to the present API, but not having seeing any implementation, I still feel the logical solution would be a single slot for a function with a flags argument...).
doc/source/reference/c-api/array.rst
Outdated
| will use the second field and so on. To alter the sort order of a | ||
| structured array, create a new data-type with a different order of names | ||
| and construct a view of the array with that new data-type. | ||
| Equivalent to :meth:`ndarray.sort<numpy.ndarray.sort>` (*self*, *axis*, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was already not quite true any more, and will soon be even less so; I'd tend to start with what the function actually does (so, "Return an array ... " and then at the very end say something like
This function is the implementation for :meth:`ndarray.sort<numpy.ndarray.sort>`,
though with slightly different meaning of ``kind`` -- see ``NPY_SORTKIND`` below.
(and similar for argsort)
| Sorting ``kind='heapsort'`` now maps to ``kind='quicksort'`` | ||
| ------------------------------------------------------------ | ||
| It is unlikely that this change will be noticed, but if you do see a change in | ||
| execution time or unstable argsort order, this is likely the cause. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add "Please let us know if this causes a performance regression."?
6b29d5c to
22be8c5
Compare
numpy/_core/src/multiarray/methods.c
Outdated
| PyArray_Descr *newd, *saved=NULL; | ||
| int stable = -1; | ||
| int descending = -1; | ||
| int nanfirst = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need the optional bool converter, since the default is well defined in both cases. The optional one is just for when None/not passed has a special meaning.
(Which is the case for stable, because it needs to distinguish not passed for the error.)
I would also be happy to just delete the NaN thing here, but either way. I have been drifting away from wanting it much. And I think if we have it then nan_placement="first" might be clearer than a boolean thing in Python.
(Anyway, it doesn't really matter of course as it is all commented out anyway.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can go further, with heapsort gone, the kind argument has the same functionality as the stable keyword, we can deprecate it! Internally, we use it to set the stable default when it is given. Then all the keywords can be treated the same way. No reason not to allow both arguments any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but I would say to hold off with deprecating it, unfortunately.
I think stable= was added in NumPy 2.0, and it would be a pain if projects like sklearn can't transition to stable= for all NumPy versions they support (i.e. avoid the need for a numpy version specific code).
(Unless stable= was actually added earlier.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow both 'kindand keywords,kind` would precedence in the case (kind="stable", stable=False), the only difference with current behavior would be no error message, and we could still do that but it would be different message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to allow mixing both, at least if we raise an error when they mismatch. But I am not following what you are suggesting.
Users can't pass stable=True yet, on all versions, so I don't want to ask them to replace kind=... yet (even if we allow kind="stable" it forces double-change: kind="mergesort" -> kind="stable" -> stable=True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kind="stable" is an alias for kind="mergesort", no double-change needed, except maybe in documentation. Something like: "Use stable=True in place of kind='mergesort' or kind='stable' ".
We can leave that for a discussion in a different PR, or perhaps an issue. I think the only thing to be decided is the removal of NPY_NANFIRST from the enum. If we are unlikely to use it, it is easier to add it when needed than to remove it once it is in. But we should leave space for more requirements if we add new 'hint' keywords in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather a follow on PR if we want to enable both kind and keywords. I think the important thing for this PR is to have the C interface defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. FWIW, the nan parameter matters more for the DType API layer, where new hints can be added nicely (if the user expects it), but requirements cannot.
We could add the bit-mask that Marten suggested. I am starting to think: let's forget about it. I think in practice we likely only need to define that a descending sort also should return the NaNs last.
And if we have to do slightly uglier things in the case that we need more I am OK with that.
For hints, we could prepare for it, but we can also do something like:
#define NPY_SORT_LOWMEM (PyArray_RuntimeVersion > 2_4 ? FLAG : 0)
a bit ugly, but the user won't notice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think stable= was added in NumPy 2.0
Yep, #25437.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For hints, we could add something like "NPY_SORT_FIRST_HINT = 16," that would put the requirements in the first four bits of NPY_SORTKIND, with the first bit unused, e.g,, three possible requirements. You get an index by
index = (sortkind & 0xF) >> 1.
Since it often leads to future troubles to start with too few bits, "NPY_SORT_FIRST_HINT = 1 << 8," might be a better choice. Then the first eight bits would be reserved.
You get a TypeError: "no current sort function meets the requirements" |
|
What do you want done before this goes in? |
|
I'm OK with it as is, but think given the discussion it might be best not to advertise But again, I don't feel strongly, so maybe @seberg can pipe in too. p.s. Overall, a bit of a tortuous path, but I think a very nice result! |
If there is no nanfirst keyword, we should say that all alternatives sort NaNs to the end. Not sure where that should go, maybe in the extending dtypes documentation. |
|
I have removed all references to nanfirst. |
mhvk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks all good to me now!
Maybe left is a comment by @Serberg (#29642 (comment)):
This doesn't need the optional bool converter, since the default is well defined in both cases. The optional one is just for when None/not passed has a special meaning.
(I think OptionalBoolConverter is for the case that one can pass in None as well.)
Why do you need which is not simpler than calling the |
|
I think just use |
Ah, it's already written, is it. Which raises the opposite question, why isn't it used for |
975c12f to
c2eeeba
Compare
|
Well, that answers that question, the original was in fact correct, @mhvk it was suggested by @seberg that I not use |
I am not following. You don't really need it for stable but it is has some sense because of the But for |
I tested that and it didn't work. It compiles, and runs without errors, but not correctly. Which is to say, sortkind is not correctly constructed. The function signature is defined in the parser call, and both I think the vectorcall logic goes this way. All the possible variables must be processed. If it is determined that there is no corresponding value in the args list, the question is "what to do"? This is determined on our end, and I think we have decided to return Note that the two functions also have different return types, the straight converter returns to @ngoldbaum I think you implemented the fastcall processing, can you confirm my understanding? |
I implemented fastcall, but it has nothing to do with things here, it behaves exactly the same as the builtin parser.
When the argument is optional, you always have to set the default. I would still avoid |
|
Summary: As long as we allow The problem with not using the non-optional converter is that we assume that it doesn't affect the current value of the variable (which is what online documentation suggests), which is incorrect. It sets the variable to either 0 ot 1. Setting the variable -1 before the call serves no purpose. So the non-optional version can only be used when 0 is the default value. The code can be fixed to work with the non-optional parser, but will not work as long as we want to enforce use I don't think we need to keep the one or the other choice now that heapsort is gone, but as long as we do, we need to use the optional parser. If we allow both and give the keyword priority over |
I suppose I think there is no reason to reject EDIt: anyway, I don't care much, just do the version you prefer. |
|
Summary: I'll stick with the current implementation. Using a split is an easy rule to remember, as opposed to having different behaviors for different keywords. I'll go for that. |
|
Thanks all. And especial thanks to @mhvk for his close reading and suggestions. |
|
Looking forward to the next steps! |
This PR adds new sorting APIs that can be used for extending our current
sort types. The new C-APIs do not have the which argument of the
legacy C-APIs, instead they have flags request flag API. In this
transition, heapsort can no longer be called directly, introsort
(quicksort) is called instead. The sort/argsort methods get new keyword
arguments used to set the flags when kind is not given, so flags is
invisible at the Python level. The new keywords are:
The main difference here is that the keywords select by functionality,
whereas kind selects by algorithm. Note that descending and nanfirst
are not yet implemented, so an error is raised if either is specified.
The added C-API functions are:
Those are the two functions that need changes depending on how we want
to implement adding new sort algorithms. They should be pretty easy to
adapt to whatever we choose to do.
The legacy C-API functions, PyArray_Sort and PyArray_ArgSort, have been
modified to call through the new C-API functions. In order to do so
the kind function has been encoded in the flags like so:
I note that the partitioning/selection functions have not been modified,
we may want to do that a some point.