Skip to content

Conversation

@MaanasArora
Copy link
Contributor

@MaanasArora MaanasArora commented Oct 31, 2025

Implements merge sorts for StringDType using the feature introduced in #29737. The sorts allocate the mutex lock only once, thus fixes #26510.

Instead of utilizing the generic sorts as done in #29987, this PR completely rewrites the sorts. While that is more verbose, just duplicating the code can help move to more specialized string sorting algorithms and make optimizations independently from the npysort code. Also, we don't need to rely on the 'hack' in the other PR i.e. indirect use of sort compare function.

Posting a benchmark in a comment below; there is ~3x gain in speed on stable sort. ping @seberg - thanks!

Edit: also supports descending sorts now!

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Oct 31, 2025

Running this script:

import numpy
import random
import timeit

print(numpy.__version__)

options = ["left", "right", "leftovers", "righty", "up", "down", numpy.nan]
lst = random.choices(options, k=1000)
arr_s = numpy.array(lst, dtype="T")

time = timeit.timeit(
    "numpy.sort(arr_s, kind='stable')",
    globals=globals(),
    number=1000,
    setup="import numpy",
)
print(f"Time taken for stable sort: {time:.6f} seconds")

produces on main:

2.4.0.dev0+git20251030.d8eb225
Time taken for stable sort: 0.397690 seconds

and with this PR:

2.4.0.dev0+git20251030.dab50e6
Time taken for stable sort: 0.120393 seconds

Edit: just noting that this doesn't implement quick (non stable) sort so far - I didn't find a significant performance difference when working with this script and a few other tests, but I didn't really try out a lot of variety. It would also be more verbose if we added it. I'll try running experiments sometime.

// "output DType is parametric. (method: %s)",
// spec->name);
// return -1;
// }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be merged!!! I noticed that we actually cannot skip resolve_descriptors for any parametric dtype, at least for sorts, because the output is parametric. Should we special-case somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just realized this is initialization, so we don't actually have the descriptors. Not sure what the condition could be...


NPY_ARRAYMETHOD_FLAGS flags = 0;
method_flags = &flags;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed a bug here and below!

Py_INCREF(argsort_method->method);
Py_DECREF(argsort_method);

return 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we cannot use PyUFunc_AddLoopsWithSpecs if we group this with the ufuncs because NumPy is not fully initialized when multiarray is initialized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.

@ngoldbaum ngoldbaum added this to the 2.4.0 release milestone Nov 3, 2025
@ngoldbaum
Copy link
Member

Added a 2.4.0 milestone - it'd be really nice to get the performance boost shipped this month. I'll try to give this a once-over this week sometime.

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems nice but incomplete. Left some comments inline.

Did you think about using C++ instead of C?

IMO this needs explicit string sorting tests, especially given the custom algorithm code in this PR. I'm not sure offhand if there's a good test corpus - maybe look in the CPython test suite?

Even if you change to use a C++ standard library sorting algorithm, I'd still like to see some more tests for sorting in test_stringdtype.py before this is merged.

extern "C" {
#endif

#define SMALL_STRING_MERGESORT 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what motivates this choice? maybe add a comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the algorithm, this is directly taken from npysort :

#define SMALL_MERGESORT 20

// "output DType is parametric. (method: %s)",
// spec->name);
// return -1;
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.

memcpy(pj, vp, elsize);
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty tricky C code - I would be very careful to make sure you're implementing it correctly (e.g. by adding a lot of tests). Or maybe use the C++ standard library instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:

static void
npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize,
PyArray_CompareFunc *cmp, PyArrayObject *arr)
{
char *pi, *pj, *pk, *pm;
if (pr - pl > SMALL_MERGESORT * elsize) {
/* merge sort */
pm = pl + (((pr - pl) / elsize) >> 1) * elsize;
npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr);
npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr);
GENERIC_COPY(pw, pl, pm - pl);
pi = pw + (pm - pl);
pj = pw;
pk = pl;
while (pj < pi && pm < pr) {
if (cmp(pm, pj, arr) < 0) {
GENERIC_COPY(pk, pm, elsize);
pm += elsize;
pk += elsize;
}
else {
GENERIC_COPY(pk, pj, elsize);
pj += elsize;
pk += elsize;
}
}
GENERIC_COPY(pk, pj, pi - pj);
}
else {
/* insertion sort */
for (pi = pl + elsize; pi < pr; pi += elsize) {
GENERIC_COPY(vp, pi, elsize);
pj = pi;
pk = pi - elsize;
while (pj > pl && cmp(vp, pk, arr) < 0) {
GENERIC_COPY(pj, pk, elsize);
pj -= elsize;
pk -= elsize;
}
GENERIC_COPY(pj, vp, elsize);
}
}
}
NPY_NO_EXPORT int
npy_mergesort(void *start, npy_intp num, void *varr)
{
PyArrayObject *arr = (PyArrayObject *)varr;
npy_intp elsize = PyArray_ITEMSIZE(arr);
PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare;
char *pl = (char *)start;
char *pr = pl + num * elsize;
char *pw;
char *vp;
int err = -NPY_ENOMEM;
/* Items that have zero size don't make sense to sort */
if (elsize == 0) {
return 0;
}
pw = (char *)malloc((num >> 1) * elsize);
vp = (char *)malloc(elsize);
if (pw != NULL && vp != NULL) {
npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr);
err = 0;
}
free(vp);
free(pw);
return err;
}

I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.

*pj = vi;
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this too - can we maybe rely on the C++ standard library instead of having an algorithm implementation here in the stringdtype internals?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!


if (init_stringdtype_sorts() < 0) {
return -1;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can avoid the numpy initialization problem you're having if you move this into init_string_dtype over in the dtype.c file in the multiarray/stringdtype folder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needidxbuffer = rstride != sizeof(npy_intp);

if (method_flags != NULL) {
if (strided_loop != NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the two changes above bugfixes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.

Py_INCREF(argsort_method->method);
Py_DECREF(argsort_method);

return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@MaanasArora MaanasArora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing! I didn't look much at C++ - it makes sense to do that I think, though as I've said in a comment I'm unsure if using their standard sorts is the way to go:

  • The sorting code is directly copied from the npy_mergesort code in npysort, so this implementation provides some consistency.
  • In the future, if someone wants to do this more efficiently, add features, or even use SIMD (!), having custom code makes that easier.

That being said I totally see it can get very complicated to have lots of tricky algorithm code lying around. I actually see it as a very good reason to actually make our sorting infrastructure reusable, a public library, and perhaps use C++ templates to expose the generic sorts. Maybe this would be preferable to using C++ sorts simply because we can make extensions, even if we actually use C++ sorts internally! (See #29987 for a very initial step).

For the short term though, if it is very tricky to review this even given the reuse, I am happy to move to the C++ standard library - just making a general point. I'll add expanded test coverage soon. Thanks again!

needidxbuffer = rstride != sizeof(npy_intp);

if (method_flags != NULL) {
if (strided_loop != NULL) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.

memcpy(pj, vp, elsize);
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:

static void
npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize,
PyArray_CompareFunc *cmp, PyArrayObject *arr)
{
char *pi, *pj, *pk, *pm;
if (pr - pl > SMALL_MERGESORT * elsize) {
/* merge sort */
pm = pl + (((pr - pl) / elsize) >> 1) * elsize;
npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr);
npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr);
GENERIC_COPY(pw, pl, pm - pl);
pi = pw + (pm - pl);
pj = pw;
pk = pl;
while (pj < pi && pm < pr) {
if (cmp(pm, pj, arr) < 0) {
GENERIC_COPY(pk, pm, elsize);
pm += elsize;
pk += elsize;
}
else {
GENERIC_COPY(pk, pj, elsize);
pj += elsize;
pk += elsize;
}
}
GENERIC_COPY(pk, pj, pi - pj);
}
else {
/* insertion sort */
for (pi = pl + elsize; pi < pr; pi += elsize) {
GENERIC_COPY(vp, pi, elsize);
pj = pi;
pk = pi - elsize;
while (pj > pl && cmp(vp, pk, arr) < 0) {
GENERIC_COPY(pj, pk, elsize);
pj -= elsize;
pk -= elsize;
}
GENERIC_COPY(pj, vp, elsize);
}
}
}
NPY_NO_EXPORT int
npy_mergesort(void *start, npy_intp num, void *varr)
{
PyArrayObject *arr = (PyArrayObject *)varr;
npy_intp elsize = PyArray_ITEMSIZE(arr);
PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare;
char *pl = (char *)start;
char *pr = pl + num * elsize;
char *pw;
char *vp;
int err = -NPY_ENOMEM;
/* Items that have zero size don't make sense to sort */
if (elsize == 0) {
return 0;
}
pw = (char *)malloc((num >> 1) * elsize);
vp = (char *)malloc(elsize);
if (pw != NULL && vp != NULL) {
npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr);
err = 0;
}
free(vp);
free(pw);
return err;
}

I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.

*pj = vi;
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!

Py_INCREF(argsort_method->method);
Py_DECREF(argsort_method);

return 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.

extern "C" {
#endif

#define SMALL_STRING_MERGESORT 20
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the algorithm, this is directly taken from npysort :

#define SMALL_MERGESORT 20


if (init_stringdtype_sorts() < 0) {
return -1;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// "output DType is parametric. (method: %s)",
// spec->name);
// return -1;
// }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!

@ngoldbaum
Copy link
Member

Ah, that makes more sense, I didn't realize the sorting algorithm code was copied from elsewhere. Maybe we can refactor things so this calls into npy_mergesort? That would avoid duplication. Or maybe that's what you were saying about making the sorting code reusable.

I'll try to poke around in a debugger to see if I can come up with a suggestion for the resolve_descriptors stuff.

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Nov 7, 2025

Thanks! Yeah unfortunately the main issue with calling into npy_mergesort is that the sort_compare is not implemented yet, nor descending. (Actually, #29987 has a workaround for exactly this.)

@ngoldbaum
Copy link
Member

I guess that argues for merging that PR before this one. I'll try to give that one a look soon. Sorry for letting it drop...

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Nov 7, 2025

No worries at all! It actually does StringDType too, but it's a larger change and there are some confusing points about the API there. Completely fine with doing this just for now!

@MaanasArora
Copy link
Contributor Author

MaanasArora commented Nov 7, 2025

I've added tests using an ad-hoc string corpus; I couldn't find anything standard, sorry. It is two lorem ipsum passsages with some random unicode strings in between.

I'm getting some very bizzare failures with running the new test, only for argsort, and only when na_object is unset and the corpus has duplicates... I suspect it has to do something with the allocator. Debugging now.

@MaanasArora
Copy link
Contributor Author

Closing in favor of #30266.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ENH: np.unique and sorting is slow for StringDType

2 participants