Skip to content

ENH: np.unique and sorting is slow for StringDType #26510

@ngoldbaum

Description

@ngoldbaum

I also did some basic performance tests in v2.0.0rc2 for the experimental StringDType but it seems to be much slower than "O" or "U" dtypes in certain cases, are you aware of these issues:

import numpy
import random
import timeit

print(numpy.__version__)  # '2.0.0rc2'

options = ["a", "bb", "ccc", "dddd"]
lst = random.choices(options, k=1000)
arr_s = numpy.fromiter(lst, dtype="T", count=len(lst))
arr_o = numpy.fromiter(lst, dtype="O", count=len(lst))
arr_u = numpy.fromiter(lst, dtype="U5", count=len(lst))

print(timeit.timeit(lambda: numpy.unique(arr_s), number=10000))  # 4.270 <- why so much slower?
print(timeit.timeit(lambda: numpy.unique(arr_o), number=10000))  # 2.357
print(timeit.timeit(lambda: numpy.unique(arr_u), number=10000))  # 0.502
print(timeit.timeit(lambda: sorted(set(lst)), number=10000))  #  0.078

Profiling (will include in a separate comment below) indicates that the overhead is from acquiring and releasing the allocator lock (it does it for every entry in the array because sorting is implemented using getitem).

We could improve performance for sorting specifically by adding a fast getitem path that assumes the allocator lock is already acquired and using it in the sorting implementation after acquiring the lock once.

Optimizing getitem and setitem would also help (maybe by adding a stringdtype scalar that interns the packed string).

Originally posted by @MarkusSintonen in #25693 (comment)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions