Skip to content

Conversation

@MaanasArora
Copy link
Contributor

@MaanasArora MaanasArora commented Nov 20, 2025

Sorry, many PRs - I'm closing the other one (#30112). This should be much simpler.

Implements sorts and argsorts for StringDType using the feature introduced in #29737. The sorts allocate the mutex lock only once, thus fixes #26510. Some refactors of the npysort code were needed which should help with further work (e.g., descending sorts and moving the builtin dtype machinery).

Posting benchmarks below. ping @seberg @ngoldbaum @mhvk - thanks for reviewing!

@MaanasArora
Copy link
Contributor Author

Running this script:

import numpy
import random
import timeit

print(numpy.__version__)

options = ["left", "right", "leftovers", "righty", "up", "down", numpy.nan]
lst = random.choices(options, k=1000)
arr_s = numpy.array(lst, dtype="T")

time = timeit.timeit(
    "numpy.sort(arr_s, kind='stable')",
    globals=globals(),
    number=1000,
    setup="import numpy",
)
print(f"Time taken for stable sort: {time:.6f} seconds")

produces on main:

2.4.0.dev0+git20251119.a684abe
Time taken for stable sort: 0.396732 seconds

and on this PR:

2.4.0.dev0+git20251119.cce3cd0
Time taken for stable sort: 0.130454 seconds

@ngoldbaum
Copy link
Member

Which Python version did you use to do the comparison? If it's 3.13 or newer I bet you'll see even better results using 3.12.

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, looks good to me.

This does add a teeny bit of per-array extra overhead in the sorting fast paths to call fill_sort_data_with_array. Maybe worth running the sorting benchmarks. It would only matter for tiny arrays though and I bet it's lost in the noise.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaanasArora - this looks great! The edits to the code you added before all make sense. For the PR as is, I just have a suggestion about using get_loop to pass on the sort implementation function, which allows one to remove some of the wrappers.

A larger question -- quite possibly just for follow-up -- is whether the hooking in to sort is at the right level -- you probably won't be surprised to hear I thin it might be nicer (and more useful later for custom dtypes) if inside StringDType, we avoid calling internal sort functions, but rather wrap a new, standard ArrayMethod loop that covers the use case of sorting with a comparison function.

I guess this would still need your new npy_quicksort_impl, etc., but would avoid exposing them. Instead, we would expose a new general ArrayMethod loop (one each for sort and argort) that expects to find the comparison function somewhere (probably best in context->method->static_data; we have used that so-far private slot a lot in stringdtype_ufuncs.cpp). This new loop would then call the existing npy_quicksort, etc., based on the sort flags, passing on the comparison function.

Does this make sense?

Note that now that I realized this would still need the changes to npysort that you have here, it may make more sense to do this in follow-up -- since we do not expose any new API here, that is certainly possible (just want to confirm, though, that we are not planning to expose npy_quicksort_impl, etc.!).

(PyArray_StringDTypeObject *)context->descriptors[0];

npy_string_allocator *allocator = NpyString_acquire_allocator(sdescr);
int ret = sort_loop(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it would feel more logical if sort_loop had the ArrayMethod signature rather than that of the (hopefully eventually removed) sort loop.

But obviously that would need to have a standard array method loop that uses a comparison function -- see main comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be better, and also it seems to be a very useful feature if exposed - dtypes do not need to restrict themselves to sort_compare or similar functionality if they need to do setup work etc. Seems like a natural extension if it's okay in a follow-up; I just tried to keep this PR minimal!


NPY_NO_EXPORT int
npy_mergesort(void *start, npy_intp num, void *varr)
npy_mergesort(void *start, npy_intp num, void *varr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: keep starting brace on its own line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done, thanks!

*/

static inline void
fill_sort_data_with_array(void *varr, npy_intp *elsize, PyArray_CompareFunc **cmp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name confused me, since it suggested filling an array. Maybe just get_sort_data_from_array?

Though since it is only two lines, I also think it is also fine to repeat those 4 times in the code - it helps understanding why it is needed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the main idea was just to have it reusable along with a fill_sort_data_with_context for new arraymethod-style loops - in case we need to change some logic, we just change the helper. But I can see why this is a bit too minimal not to duplicate, and the sorts are legacy anyway - @ngoldbaum since you raised performance concerns with this, would you prefer just duplicating?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you get any speed benefit from explicitly in-lining, since the function is marked as inline, presumably the compiler will do that already. My main complaint really was about the name -- but if there is precedent in the same file, feel free to ignore that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK! There is no precedent - I've renamed the function. Thank you!

PyArrayMethod_SortParameters *parameters = (PyArrayMethod_SortParameters *)context->parameters;
*flags |= NPY_METH_NO_FLOATINGPOINT_ERRORS;

if (parameters->flags == NPY_SORT_STABLE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the two loops that we return are identical except for passing on different NpyAuxData, I think we can simplify this to have just one loop, but already set the auxdata here (a nice aspect of what get_loop can do!).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure I fully understand! We don't pass different transferdata? Do you mean we can use NpyAuxData to pass the legacy sort loop? Where would that go?

I can see using context->parameters inside the sort loop, instead of get_loop, though! That sounds like a nice idea if that's what you meant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I wasn't very. Basically, I think you can do,

stringdtype_sort_loop(  // Just one sort loop!
        PyArrayMethod_Context *context,
        char *const *data, const npy_intp *dimensions, const npy_intp *strides,
        NpyAuxData *transferdata)
{
    return stringdtype_wrap_sort_loop(
            context, data, dimensions, strides, transferdata,
            transferdata);  // transferdata contains npy_aquicksort_impl or npy_amergesort_impl
}

int
stringdtype_get_sort_loop(
        PyArrayMethod_Context *context,
        int aligned, int move_references,
        const npy_intp *strides,
        PyArrayMethod_StridedLoop **out_loop,
        NpyAuxData **out_transferdata,
        NPY_ARRAYMETHOD_FLAGS *flags)
{
    PyArrayMethod_SortParameters *parameters = (PyArrayMethod_SortParameters *)context->parameters;
    *flags |= NPY_METH_NO_FLOATINGPOINT_ERRORS;
    if (parameters->flags == NPY_SORT_STABLE) {
        *out_transferdata = &npy_amergesort_impl;
    else if (parameters->flags == NPY_SORT_DEFAULT) {
        *out_transferdata = &npy_aquicksort_impl;
    else {
        PyErr_SetString(PyExc_RuntimeError, "unsupported sort kind");
        return -1;
    }
    *out_loop = (PyArrayMethod_StridedLoop *)stringdtype_sort_loop;
    return 0;
}

Copy link
Contributor Author

@MaanasArora MaanasArora Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I am missing something obvious, but would this not cause a pointer issue (casting &npy_amergesort_impl to NpyAuxData *, for example, cannot be done)? Should we define a new struct with the auxdata base to cast to and back is what you mean? Happy to do that of course!

Edit: tried what you sent, and indeed there was a cast issue, at least the function cannot be directly called with the signature then...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems using NpyAuxData requires a struct definition with an auxdata base, so something like:

typedef struct {
    NpyAuxData base;
    PyArray_SortImpl *sort_impl;
} StringDTypeSortingAuxData;

with free and clone might work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need a struct...Note that you are using auxdata to pass the function in the code you have above as well, just a layer deeper.

Grepping, I see that in dtype_transfer.c, the same kind of thing is done, but it does have a cast:

*out_transferdata = (NpyAuxData *)data;

That said, maybe this does behave differently than the auxdata in legacy-style loops: it seems everywhere I can find it, memory is allocated for it. Let's ping @seberg, since that may just be because everywhere else there is actually an allocation needed. Unfortunately, the documentation does not quite help -- https://numpy.org/doc/stable/reference/c-api/array.html#c.PyArrayMethod_GetLoop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, memory probably needs to be allocated. Even in that example though, there is a struct, which is the type of data:

typedef struct {
NpyAuxData base;
PyArray_Descr *descr;
int move_references;
} _object_to_any_auxdata;

And after that descr and move_references are members of the struct.

Note that you are using auxdata to pass the function in the code you have above as well, just a layer deeper.

Maybe you are looking at the (arg)sort_loop parameter? There is an additional argument after transferdata:

int
stringdtype_wrap_sort_loop(
        PyArrayMethod_Context *context,
        char *const *data, const npy_intp *dimensions, const npy_intp *strides,
        NpyAuxData *transferdata, PyArray_SortImpl *sort_loop)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaanasArora - Now had time to look a bit more carefully and with out_transferdata requiring allocation, it is just too much work. It is also not really necessary: if we move selection of the implementation in the wrapping loops, things can stay super simple. I just made a PR to your PR to implement that - have a look!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks - I've merged your PR!

return ret;
}

int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need only one loop (say, stringdtype_sort_loop and stringdtype_argsort_loop if we use transferdata -- see comment on get_loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above!

@MaanasArora
Copy link
Contributor Author

Thanks for reviewing both!

@mhvk I'm a bit confused about the get_loop suggestion, replied inline. The arraymethod idea is really nice and I anticipated a follow up for that; glad to know we have the static_data place, thanks! Since it would be nice to get the performance improvement in 2.4, perhaps it should be a follow-up, and it will play very nicely with the sort compare work. I totally did not plan to expose the _impl functions :)

@ngoldbaum I actually used Python 3.12.7. The improvement in 3.13 is a bit more modest (a bit less than 2x). I tried to run the sort benchmarks, but they seem quite flaky; perhaps we can just do away with the helper based on @mhvk's suggestion inline.

@ngoldbaum
Copy link
Member

I tried to run the sort benchmarks, but they seem quite flaky

Argh, too bad. Sorry!

I'll defer to Marten on the API decisions, he has much better taste for that stuff than I do...

@mhvk
Copy link
Contributor

mhvk commented Nov 20, 2025

@MaanasArora - OK, definitely happy to go with what you have here and do follow-up. Having something working that uses the new mechanism will help a lot for that anyway.

@charris charris added this to the 2.4.0 release milestone Nov 21, 2025
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small in-line comment.

Also, I wonder if we should use the ufunc registration rather than the private sort method pointers? Since there are probably people like me who copy & paste parts of numpy code, it is helpful to use public api as much as possible. But no big deal if you don't feel like it!

return -1;
}

// sorting cannot be a view
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small one, but I think by setting *view_offset = 0 you actually indicate it can be a view (no view is NPY_MIN_INTP. I think it is fine not to touch view_offset at all, but maybe best to ask @seberg.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, perhaps not good to touch it, thanks. I'll delete these lines, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted the lines!

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Nov 22, 2025

@mhvk I did actually try using PyUFunc_AddLoopsFromSpecs, but it didn't work because array_sort is not initialized when this code is called! (I even tried moving things to stringdtype_ufuncs, but no luck.) It looked much better, so a bit unfortunate.

@mhvk
Copy link
Contributor

mhvk commented Nov 22, 2025

@MaanasArora - OK, that's indeed unfortunate, but for another time! With your latest change, I think this is all ready!

Still, maybe it would be good if @seberg had a look too... I do think this is all OK, but I wondered a little whether this case is supported by the PyUFunc_AddWrappingLoop mechanism. From a quick look, I think not, since that only seems to deal with descriptors, not an ability to (like here) do something both before and after the loop is called.

Side note for future: the hackery here to get the standard sort/argsort implementation with a _compare would ideally not be necessary. I've been thinking that one approach might be for the API to have not just the ability to register a loop from a specification, but also to retrieve one.

@charris
Copy link
Member

charris commented Nov 23, 2025

The test failure is unrelated.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAven't looked very close, but besides a small nit, looks good to me.

Yeah, the wrapping loop is for ufuncs, and frankly, this isn't really a ufunc. Even if it works, I suspect it would just be confusing.

I've been thinking that one approach might be for the API to have not just the ability to register a loop from a specification, but also to retrieve one.

Yes, that is something we could and should expose (there is some python stuff for ufuncs but just to allow experimenting).
But I don't think that would actually be useful here?

return 0;
}

int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT:

Suggested change
int
static int

(and elsewhere). We are not hiding symbols by default yet, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is done!

@mhvk
Copy link
Contributor

mhvk commented Nov 24, 2025

I've been thinking that one approach might be for the API to have not just the ability to register a loop from a specification, but also to retrieve one.

Yes, that is something we could and should expose (there is some python stuff for ufuncs but just to allow experimenting). But I don't think that would actually be useful here?

Indeed, not useful here yet. It would become useful if one could use it to retrieve the default "use compare slot" loop, and then wrap the loop in the allocate acquire/release part (i.e., we wouldn't have to get those loops through using the sort header file).

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Nov 24, 2025

Thanks @seberg @mhvk!

Yes, I think a wrapping loop doesn't twist well to this idea and might be hard to maintain / evolve too.

Side note for future: the hackery here to get the standard sort/argsort implementation with a _compare would ideally not be necessary. I've been thinking that one approach might be for the API to have not just the ability to register a loop from a specification, but also to retrieve one.

Perhaps we could expose default sorting loops too? I think those would be pretty much necessary and was done (though very lightly) in the sort compare PR, so we could expose them directly, or through convenience functions (like this idea). C++ templating came up too - but it probably won't be absolutely necessary to keep parity since we were passing a comparison function even in the legacy code. It's the same idea anyway!

@ngoldbaum
Copy link
Member

Thanks for your persistence on this @MaanasArora! Merging...

@ngoldbaum ngoldbaum merged commit 2b6c35d into numpy:main Nov 26, 2025
79 of 80 checks passed
cakedev0 pushed a commit to cakedev0/numpy that referenced this pull request Dec 5, 2025
* ENH: New-style sorting for StringDType

* BUG: fix descriptor handling in stringdtype_sort_resolve_descriptors

* REF: Cleanup braces and function order

* REF: rename fill_sort_data_with_array to get_sort_data_from_array for clarity

* MAINT: Move sort type selection one level down, reducing duplication.

* MAINT: Combine two levels of indirection

* REF: remove unnecessary view_offset assignment in stringdtype_sort_resolve_descriptors

* REF: change function visibility to static for stringdtype sort functions

---------

Co-authored-by: Marten Henric van Kerkwijk <mhvk@astro.utoronto.ca>
IndifferentArea pushed a commit to IndifferentArea/numpy that referenced this pull request Dec 7, 2025
* ENH: New-style sorting for StringDType

* BUG: fix descriptor handling in stringdtype_sort_resolve_descriptors

* REF: Cleanup braces and function order

* REF: rename fill_sort_data_with_array to get_sort_data_from_array for clarity

* MAINT: Move sort type selection one level down, reducing duplication.

* MAINT: Combine two levels of indirection

* REF: remove unnecessary view_offset assignment in stringdtype_sort_resolve_descriptors

* REF: change function visibility to static for stringdtype sort functions

---------

Co-authored-by: Marten Henric van Kerkwijk <mhvk@astro.utoronto.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: np.unique and sorting is slow for StringDType

5 participants