Skip to content

Conversation

@MaanasArora
Copy link
Contributor

@MaanasArora MaanasArora commented Oct 16, 2025

This follows #29900 and #29737 to add a sort_compare slot to the DType, which can be used to enable sorting for user dtypes without completely redefining the sort. Also uses it in StringDType to allocate the lock just once and thus fixes #26510.

Also, there are now wrappers for the existing sorts, so it is probably easier to move to the array method API later. However doing so would require templating the SIMD sorts and doing other infrastructure gymnastics, so not done here. This does not implement descending, which can actually also be done as a follow-up if we keep the API the same and just flip the comparison twice.

This still needs release notes and docs but wanted to push this before to clarify direction. Unfortunately, it is quite verbose, especially for StringDType. ping @seberg @mhvk @ngoldbaum if you're interested - thanks!

// "must provide a `resolve_descriptors` function if any "
// "output DType is parametric. (method: %s)",
// spec->name);
// return -1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be merged!!! I noticed we cannot actually not pass resolve_descriptors in most cases, at least if the dtype is parametric (since it's passed as output)! Should we special case somehow?

}
NPY_DT_SLOTS(stringdtype)->argsort_meth = argsort_method->method;
Py_INCREF(argsort_method->method);
Py_DECREF(argsort_method);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slot is already assigned - we probably need to decref the existing method, but I wonder if we should also check in PyUFunc_AddLoopsFromSpecs? Though one doesn't usually define both the compare and the sort...

@MaanasArora MaanasArora force-pushed the enh/sort-compare-slot branch from bdae010 to 5c5ad5a Compare October 16, 2025 04:07
@seberg seberg self-requested a review November 7, 2025 10:38
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaanasArora - I just had a quick glance, to see how this would look. Overall, it makes sense and looks good. Let me know when it is ready for a more detailed review.

@MaanasArora
Copy link
Contributor Author

Thanks for reviewing @mhvk! I've added docs and done some cleanup - I think this should be ready now for more detailed review.


I suppose a main part of the API that needs discussion is how to implement descending sorts (with NaNs to the end):

  • We could do the 'double flip' idea that @charris suggested, i.e. reverse array once before sorting and once after. In practice, if I'm understanding correctly, this can simply mean writing new descending sort loops that are identical except they do cmp(b, a, arr) > 0 instead of cmp(a, b, arr) < 0 (swap a, b, and sign). In this can we can simply pass the descriptor and don't need to change the API.
  • We could pass the PyArrayMethod_Context into the sort_compare rather than the descriptor, and let the function be a 'should swap' function. This is probably more flexible if we need to add more features to the slot or if we want custom descending behavior (which we probably don't?).

I've implemented this PR such that we could do the first in a follow-up (i.e. I use the descr), but happy to change direction if we prefer the second.

@charris
Copy link
Member

charris commented Nov 12, 2025

reverse array once before sorting and once after.

That was to make stable descending sorts, but it doesn't fix the NaN problem, i.e., NaNs will still sort to the beginning. Which I suppose might be correct, but ... I think we need to settle what descending stable sorts look like and where the NaNs go. Sounds like an array-api discussion. The double reversal keeps the index order for elements with the same value.

@seberg
Copy link
Member

seberg commented Nov 12, 2025

I think we need to settle what descending stable sorts look like and where the NaNs go.

We think we have already settled that one pretty much in the earlier API discussions. Happy to bring it up there, but I don't think it is going to add anything, except a shrug and agree with my opinion. The point is, array libs have no clue about this anyway, you have to look at dataframes/tables for prior art and there both versions exists (but my feeling is our version is more common in newer API).

"Our version" was to specify where NaNs end up irrelevant of the sort order. This is the better user API, IMO, it also means that top_k/bottom_k == sort(x, asencing=True/False)[:k].
Unfortunately, from a sort-compare implementation point of view the opposite logic is actually a bit more obvious (i.e. say that NaN is larger than inf or smaller than -inf).

Anyway, I at least consider the discussion of how to handle NaNs by default as settled.

@MaanasArora
Copy link
Contributor Author

That was to make stable descending sorts, but it doesn't fix the NaN problem, i.e., NaNs will still sort to the beginning.

Ah yes, sorry, NaNs will still get flipped, which means passing the context is necessary to enable sort compare to respond to descending. I'll do that now.

@MaanasArora
Copy link
Contributor Author

Sorry, just one more question: is there perhaps a dtype that can be used to test this PR? With the most recent refactor, StringDType actually does not need to use the indirection to sort compare, so if preferred I can remove it before more review (and do it in a follow up).

@ngoldbaum
Copy link
Member

Sorry, just one more question: is there perhaps a dtype that can be used to test this PR?

We only really have the scaled float dtype for testing inside numpy. If one of the dtypes in the user-dtypes repo won't work, then this whole discussion is a little moot until someone writes a dtype that needs the functionality.

I haven't been following as closely as I'd like, so I'm not sure offhand what sort of dtype you'd need.

@MaanasArora
Copy link
Contributor Author

If one of the dtypes in the user-dtypes repo won't work, this whole discussion is a little moot until someone writes a dtype that needs the functionality.

This functionality is already part of the arrfunc machinery (as NPY_DT_PyArray_ArrFuncs_compare) and the user-dtypes actually use it presently - so they do work for testing! I was just wondering if we'd like to add tests within NumPy (for CI for example), but it is completely fine if not.

This PR is basically a migration of the older compare slot to the DType and to use the new array methods for sorting, as far as I understand.

@seberg
Copy link
Member

seberg commented Nov 12, 2025

Sorry, just one more question: is there perhaps a dtype that can be used to test this PR?

We could rewire object and/or void (of course for void it would be nice to have setup code as well), but yeah, ScaledFloat is the only other one as a playground for this.

@ngoldbaum
Copy link
Member

ngoldbaum commented Nov 13, 2025

I was just wondering if we'd like to add tests within NumPy (for CI for example), but it is completely fine if not.

I think it would be nice to have a testing job that checked out the user-dtype repo and tried to build it and run the unit tests inside it with the latest numpy version.

Totally not necessary to do that to merge this PR though.

@MaanasArora
Copy link
Contributor Author

Closing this after #30266 was merged - going to break this into two! Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: np.unique and sorting is slow for StringDType

5 participants