ENH: np.unique and sorting is slow for StringDType

I also did some basic performance tests in `v2.0.0rc2` for the experimental `StringDType` but it seems to be much slower than `"O"` or `"U"` dtypes in certain cases, are you aware of these issues:

```python
import numpy
import random
import timeit

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))  #  0.078
```

Profiling (will include in a separate comment below) indicates that the overhead is from acquiring and releasing the allocator lock (it does it for every entry in the array because sorting is implemented using getitem).

We could improve performance for sorting specifically by adding a fast getitem path that assumes the allocator lock is already acquired and using it in the sorting implementation after acquiring the lock once.

Optimizing getitem and setitem would also help (maybe by adding a stringdtype scalar that interns the packed string).

_Originally posted by @MarkusSintonen in https://github.com/numpy/numpy/issues/25693#issuecomment-2126856099_
            

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

ENH: np.unique and sorting is slow for StringDType #26510

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

ENH: np.unique and sorting is slow for StringDType #26510

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions