-
-
Notifications
You must be signed in to change notification settings - Fork 12k
ENH: New-style sorting for StringDType #30112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dab50e6 to
220338d
Compare
|
Running this script: import numpy
import random
import timeit
print(numpy.__version__)
options = ["left", "right", "leftovers", "righty", "up", "down", numpy.nan]
lst = random.choices(options, k=1000)
arr_s = numpy.array(lst, dtype="T")
time = timeit.timeit(
"numpy.sort(arr_s, kind='stable')",
globals=globals(),
number=1000,
setup="import numpy",
)
print(f"Time taken for stable sort: {time:.6f} seconds")produces on main: and with this PR: Edit: just noting that this doesn't implement quick (non stable) sort so far - I didn't find a significant performance difference when working with this script and a few other tests, but I didn't really try out a lot of variety. It would also be more verbose if we added it. I'll try running experiments sometime. |
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be merged!!! I noticed that we actually cannot skip resolve_descriptors for any parametric dtype, at least for sorts, because the output is parametric. Should we special-case somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just realized this is initialization, so we don't actually have the descriptors. Not sure what the condition could be...
|
|
||
| NPY_ARRAYMETHOD_FLAGS flags = 0; | ||
| method_flags = &flags; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed a bug here and below!
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we cannot use PyUFunc_AddLoopsWithSpecs if we group this with the ufuncs because NumPy is not fully initialized when multiarray is initialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.
|
Added a 2.4.0 milestone - it'd be really nice to get the performance boost shipped this month. I'll try to give this a once-over this week sometime. |
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems nice but incomplete. Left some comments inline.
Did you think about using C++ instead of C?
IMO this needs explicit string sorting tests, especially given the custom algorithm code in this PR. I'm not sure offhand if there's a good test corpus - maybe look in the CPython test suite?
Even if you change to use a C++ standard library sorting algorithm, I'd still like to see some more tests for sorting in test_stringdtype.py before this is merged.
| extern "C" { | ||
| #endif | ||
|
|
||
| #define SMALL_STRING_MERGESORT 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what motivates this choice? maybe add a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the algorithm, this is directly taken from npysort :
| #define SMALL_MERGESORT 20 |
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.
| memcpy(pj, vp, elsize); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty tricky C code - I would be very careful to make sure you're implementing it correctly (e.g. by adding a lot of tests). Or maybe use the C++ standard library instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:
numpy/numpy/_core/src/npysort/mergesort.cpp
Lines 338 to 412 in 2b216b1
| static void | |
| npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize, | |
| PyArray_CompareFunc *cmp, PyArrayObject *arr) | |
| { | |
| char *pi, *pj, *pk, *pm; | |
| if (pr - pl > SMALL_MERGESORT * elsize) { | |
| /* merge sort */ | |
| pm = pl + (((pr - pl) / elsize) >> 1) * elsize; | |
| npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr); | |
| npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr); | |
| GENERIC_COPY(pw, pl, pm - pl); | |
| pi = pw + (pm - pl); | |
| pj = pw; | |
| pk = pl; | |
| while (pj < pi && pm < pr) { | |
| if (cmp(pm, pj, arr) < 0) { | |
| GENERIC_COPY(pk, pm, elsize); | |
| pm += elsize; | |
| pk += elsize; | |
| } | |
| else { | |
| GENERIC_COPY(pk, pj, elsize); | |
| pj += elsize; | |
| pk += elsize; | |
| } | |
| } | |
| GENERIC_COPY(pk, pj, pi - pj); | |
| } | |
| else { | |
| /* insertion sort */ | |
| for (pi = pl + elsize; pi < pr; pi += elsize) { | |
| GENERIC_COPY(vp, pi, elsize); | |
| pj = pi; | |
| pk = pi - elsize; | |
| while (pj > pl && cmp(vp, pk, arr) < 0) { | |
| GENERIC_COPY(pj, pk, elsize); | |
| pj -= elsize; | |
| pk -= elsize; | |
| } | |
| GENERIC_COPY(pj, vp, elsize); | |
| } | |
| } | |
| } | |
| NPY_NO_EXPORT int | |
| npy_mergesort(void *start, npy_intp num, void *varr) | |
| { | |
| PyArrayObject *arr = (PyArrayObject *)varr; | |
| npy_intp elsize = PyArray_ITEMSIZE(arr); | |
| PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare; | |
| char *pl = (char *)start; | |
| char *pr = pl + num * elsize; | |
| char *pw; | |
| char *vp; | |
| int err = -NPY_ENOMEM; | |
| /* Items that have zero size don't make sense to sort */ | |
| if (elsize == 0) { | |
| return 0; | |
| } | |
| pw = (char *)malloc((num >> 1) * elsize); | |
| vp = (char *)malloc(elsize); | |
| if (pw != NULL && vp != NULL) { | |
| npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr); | |
| err = 0; | |
| } | |
| free(vp); | |
| free(pw); | |
| return err; | |
| } |
I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.
| *pj = vi; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this too - can we maybe rely on the C++ standard library instead of having an algorithm implementation here in the stringdtype internals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!
|
|
||
| if (init_stringdtype_sorts() < 0) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can avoid the numpy initialization problem you're having if you move this into init_string_dtype over in the dtype.c file in the multiarray/stringdtype folder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| needidxbuffer = rstride != sizeof(npy_intp); | ||
|
|
||
| if (method_flags != NULL) { | ||
| if (strided_loop != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are the two changes above bugfixes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MaanasArora
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing! I didn't look much at C++ - it makes sense to do that I think, though as I've said in a comment I'm unsure if using their standard sorts is the way to go:
- The sorting code is directly copied from the
npy_mergesortcode innpysort, so this implementation provides some consistency. - In the future, if someone wants to do this more efficiently, add features, or even use SIMD (!), having custom code makes that easier.
That being said I totally see it can get very complicated to have lots of tricky algorithm code lying around. I actually see it as a very good reason to actually make our sorting infrastructure reusable, a public library, and perhaps use C++ templates to expose the generic sorts. Maybe this would be preferable to using C++ sorts simply because we can make extensions, even if we actually use C++ sorts internally! (See #29987 for a very initial step).
For the short term though, if it is very tricky to review this even given the reuse, I am happy to move to the C++ standard library - just making a general point. I'll add expanded test coverage soon. Thanks again!
| needidxbuffer = rstride != sizeof(npy_intp); | ||
|
|
||
| if (method_flags != NULL) { | ||
| if (strided_loop != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.
| memcpy(pj, vp, elsize); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:
numpy/numpy/_core/src/npysort/mergesort.cpp
Lines 338 to 412 in 2b216b1
| static void | |
| npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize, | |
| PyArray_CompareFunc *cmp, PyArrayObject *arr) | |
| { | |
| char *pi, *pj, *pk, *pm; | |
| if (pr - pl > SMALL_MERGESORT * elsize) { | |
| /* merge sort */ | |
| pm = pl + (((pr - pl) / elsize) >> 1) * elsize; | |
| npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr); | |
| npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr); | |
| GENERIC_COPY(pw, pl, pm - pl); | |
| pi = pw + (pm - pl); | |
| pj = pw; | |
| pk = pl; | |
| while (pj < pi && pm < pr) { | |
| if (cmp(pm, pj, arr) < 0) { | |
| GENERIC_COPY(pk, pm, elsize); | |
| pm += elsize; | |
| pk += elsize; | |
| } | |
| else { | |
| GENERIC_COPY(pk, pj, elsize); | |
| pj += elsize; | |
| pk += elsize; | |
| } | |
| } | |
| GENERIC_COPY(pk, pj, pi - pj); | |
| } | |
| else { | |
| /* insertion sort */ | |
| for (pi = pl + elsize; pi < pr; pi += elsize) { | |
| GENERIC_COPY(vp, pi, elsize); | |
| pj = pi; | |
| pk = pi - elsize; | |
| while (pj > pl && cmp(vp, pk, arr) < 0) { | |
| GENERIC_COPY(pj, pk, elsize); | |
| pj -= elsize; | |
| pk -= elsize; | |
| } | |
| GENERIC_COPY(pj, vp, elsize); | |
| } | |
| } | |
| } | |
| NPY_NO_EXPORT int | |
| npy_mergesort(void *start, npy_intp num, void *varr) | |
| { | |
| PyArrayObject *arr = (PyArrayObject *)varr; | |
| npy_intp elsize = PyArray_ITEMSIZE(arr); | |
| PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare; | |
| char *pl = (char *)start; | |
| char *pr = pl + num * elsize; | |
| char *pw; | |
| char *vp; | |
| int err = -NPY_ENOMEM; | |
| /* Items that have zero size don't make sense to sort */ | |
| if (elsize == 0) { | |
| return 0; | |
| } | |
| pw = (char *)malloc((num >> 1) * elsize); | |
| vp = (char *)malloc(elsize); | |
| if (pw != NULL && vp != NULL) { | |
| npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr); | |
| err = 0; | |
| } | |
| free(vp); | |
| free(pw); | |
| return err; | |
| } |
I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.
| *pj = vi; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.
| extern "C" { | ||
| #endif | ||
|
|
||
| #define SMALL_STRING_MERGESORT 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the algorithm, this is directly taken from npysort :
| #define SMALL_MERGESORT 20 |
|
|
||
| if (init_stringdtype_sorts() < 0) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!
|
Ah, that makes more sense, I didn't realize the sorting algorithm code was copied from elsewhere. Maybe we can refactor things so this calls into I'll try to poke around in a debugger to see if I can come up with a suggestion for the |
|
Thanks! Yeah unfortunately the main issue with calling into |
|
I guess that argues for merging that PR before this one. I'll try to give that one a look soon. Sorry for letting it drop... |
|
No worries at all! It actually does |
|
I've added tests using an ad-hoc string corpus; I couldn't find anything standard, sorry. It is two lorem ipsum passsages with some random unicode strings in between. I'm getting some very bizzare failures with running the new test, only for argsort, and only when |
|
Closing in favor of #30266. |
Implements merge sorts for StringDType using the feature introduced in #29737. The sorts allocate the mutex lock only once, thus fixes #26510.
Instead of utilizing the generic sorts as done in #29987, this PR completely rewrites the sorts. While that is more verbose, just duplicating the code can help move to more specialized string sorting algorithms and make optimizations independently from the
npysortcode. Also, we don't need to rely on the 'hack' in the other PR i.e. indirect use of sort compare function.Posting a benchmark in a comment below; there is ~3x gain in speed on stable sort. ping @seberg - thanks!
Edit: also supports descending sorts now!