ENH: Add find/rfind ufuncs for unicode and byte dtypes by lysnikolaou · Pull Request #24868 · numpy/numpy

lysnikolaou · 2023-10-05T15:45:07Z

No description provided.

charris · 2023-10-09T17:47:49Z

Needs a release note.

lysnikolaou · 2023-10-16T11:38:31Z

This is also ready to go from my side. A review would be very helpful!

lysnikolaou · 2023-10-18T08:39:22Z

@ngoldbaum @seberg @charris A review here would be very helpful, whenever you get some free cycles.

lysnikolaou · 2023-10-18T09:37:28Z

Benchmark results:

| Change   | Before [19bfa3ff] <main>   | After [b35de314] <string-ufuncs-find-rfind>   |   Ratio | Benchmark (Parameter)                                |
|----------|----------------------------|-----------------------------------------------|---------|------------------------------------------------------|
| -        | 17.1±0.08μs                | 9.24±0.02μs                                   |    0.54 | bench_core.NumPyChar.time_find_small_list_big_string |
| -        | 5.67±0.03ms                | 15.5±0.05μs                                   |    0    | bench_core.NumPyChar.time_find_big_list_small_string |

seberg

A few comments only. Do we really need to vendor that file with large changes? Any chance of vendoring it without changes? OTOH, as the comment says, it seems like it would need more changes.

numpy/_core/src/umath/ufunc_type_resolution.c

numpy/_core/src/umath/string_fastsearch.h

numpy/_core/src/umath/ufunc_type_resolution.c

lysnikolaou · 2023-10-18T12:06:45Z

Do we really need to vendor that file with large changes? Any chance of vendoring it without changes?

Unfortunately, for this to play along with string_ufuncs.cpp and the templating mechanism it uses, the changes are needed. This file has been part of the CPython source tree for a long time with very minimal changes, in case that makes you less concerned.

lysnikolaou · 2023-10-19T11:39:51Z

How should I go about fixing the test failure?

numpy._core._exceptions._UFuncNoLoopError:
    ufunc 'find' did not contain a loop with signature matching types
    (<class 'numpy.dtypes.StrDType'>,
     <class 'numpy.dtypes.StrDType'>,
     <class 'numpy.dtypes.Int64DType'>,
     <class 'numpy.dtypes.Int64DType'>) -> <class 'numpy.dtypes.Int32DType'>

Add another loop? Downcast the result? While I think I've understood a lot about how to handle input argument dtypes and type resolution, output dtypes still confuse me a bit for some reason.

ngoldbaum · 2023-10-24T18:24:24Z

I left a comment on the last commit, should probably have been here: fcd600a#r130812759

lysnikolaou · 2023-10-27T11:17:38Z

@seberg @ngoldbaum I've reworked this PR to use the buffer approach suggested by @ngoldbaum. Unfortunately, the tests all fail with the same kind of error:

[gw1] linux -- Python 3.11.6 /opt/hostedtoolcache/Python/3.11.6/x64/bin/python

    def test_find_access_past_buffer():
        # This checks that no read past the string buffer occurs in
        # string_fastsearch.h. The READC macro is there to check this.
        # To see it in action, you can redefine READC to just read the
        # i'th character of the buffer and this test will produce an
        # 'Invalid read' if run under valgrind.
        arr = np.array([b'abcd', b'ebcd'])
>       result = np._core.umath.find(arr, b'cde', 0, np.iinfo(np.int64).max)
E       numpy._core._exceptions._UFuncNoLoopError: ufunc 'find' did not contain a loop with signature matching types (<class 'numpy.dtypes.BytesDType'>, <class 'numpy.dtypes.BytesDType'>, <class 'numpy._IntegerAbstractDType'>, <class 'numpy._IntegerAbstractDType'>) -> None

This seems completely unrelated to the changes I made in the last commits, and I can't reproduce it locally on either an Ubuntu machine or an M1 MacBook. Any ideas what might be causing it?
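The scenario in this test can also be exercised through the public API. The following is a minimal sketch, assuming `np.char.find` is routed to the same search code as the internal `np._core.umath.find` ufunc (an assumption about the wiring, not something the traceback confirms). The first string ends with `cd`, a prefix of the needle, which is the situation where a careless search could read past the end of the buffer.

```python
import numpy as np

# Two fixed-width byte strings. b'abcd' ends with b'cd', which is a
# prefix of the needle b'cde', so a partial match runs up to the very
# end of the buffer.
arr = np.array([b'abcd', b'ebcd'])

# The needle does not occur in either element, so find reports -1.
missing = np.char.find(arr, b'cde')

# A needle that does occur is reported by its starting index.
present = np.char.find(arr, b'cd')

print(missing, present)
```

Running the same calls under valgrind is the way to detect an out-of-bounds read; the return values alone do not reveal one.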

lysnikolaou · 2023-10-30T13:09:00Z

@seberg @ngoldbaum This appears to be working properly now and all the fixes are in, as well as the migration to using a buffer class and a promoter rather than the old-style type resolution functions. Can you have a final look, so that we can merge this and I can start reworking the rest of the PRs to use the new buffer solution?

ngoldbaum · 2023-10-30T15:24:51Z

Are the benchmark results still similar to #24868 (comment)?

lysnikolaou · 2023-10-30T15:36:13Z

Yeah, sorry, I forgot to post the updated benchmark results.

| Change   | Before [cdfbdf42] <main>   | After [03d737dc] <string-ufuncs-find-rfind>   |   Ratio | Benchmark (Parameter)                                |
|----------|----------------------------|-----------------------------------------------|---------|------------------------------------------------------|
| -        | 17.1±0.2μs                 | 9.45±0.3μs                                    |    0.55 | bench_core.NumPyChar.time_find_small_list_big_string |
| -        | 5.73±0.2ms                 | 17.2±0.4μs                                    |    0    | bench_core.NumPyChar.time_find_big_list_small_string |

ngoldbaum

Overall looks really good, just one minor nit. I looked over the ufunc setup and the buffer class but didn't look over string_fastsearch.h in detail. I think the string buffer class is a nice abstraction that will make it much easier to plug in UTF-8 in a month or two.

I have two minor comments but think this is ready from my end. I want Sebastian to give this a once-over to get his opinion on the string buffer class before merging.

numpy/_core/code_generators/ufunc_docstrings.py

numpy/_core/src/umath/string_ufuncs.cpp

lysnikolaou · 2023-10-30T16:28:39Z

Test failures seem to be unrelated.

seberg

I don't have a strong opinion about the Buffer class/struct, as it is relatively localized.
But it does seem very alpha when it comes to actually being useful, and not just buggy, for utf-8.

Maybe we can just remove the utf-8 stuff? Or put a big comment somewhere that the utf-8 stuff was intended to be used in the near future, but is not usable yet.

numpy/_core/src/umath/string_buffer.h

numpy/_core/src/umath/string_ufuncs.cpp

ngoldbaum · 2023-10-30T17:42:59Z

Maybe we can just remove the utf-8 stuff? Or put a big comment somewhere that the utf-8 stuff was intended to be used in the near future, but is not usable yet.

Removing it is fine with me.

I'm happy to benchmark whether implicit conversion to UCS-4 arrays is faster than just plugging in the Buffer class when I do that.

lysnikolaou · 2023-10-30T17:57:16Z

Removed the UTF-8 stuff and addressed all review comments.

numpy/_core/src/umath/string_ufuncs.cpp

seberg · 2023-10-31T09:44:44Z

I'm happy to benchmark whether implicit conversion to UCS-4 arrays is faster than just plugging in the Buffer class when I do that.

Yeah, I am not too worried about it. My gut feeling is that the simplest solution for many algorithms will be to use the bytes directly (for whitespace/isalpha, etc. of course not, but those are clean single-pass algorithms anyway).
For find/rfind you will have to translate the slicing into byte offsets at the start, and translate the resulting byte offset back to a character offset when done.
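The offset translation described here can be sketched in pure Python. This is an illustration only, not code from the PR: the helper names `char_to_byte_offset`, `byte_to_char_offset`, and `utf8_find` are hypothetical, and the sketch assumes valid UTF-8 input and non-negative, in-range `start`/`end` values.

```python
def char_to_byte_offset(buf, char_index):
    """Byte offset at which code point number `char_index` starts.

    Indices past the last code point clamp to len(buf).
    """
    seen = 0
    for pos, byte in enumerate(buf):
        # Bytes of the form 0b10xxxxxx continue a code point; every
        # other byte starts a new one.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return pos
            seen += 1
    return len(buf)


def byte_to_char_offset(buf, byte_index):
    """Number of code points that start before `byte_index`."""
    return sum(1 for b in buf[:byte_index] if (b & 0xC0) != 0x80)


def utf8_find(haystack, needle, start=0, end=None):
    """find() on UTF-8 bytes with character-based start/end/result."""
    # Step 1: character slice bounds -> byte slice bounds.
    byte_start = char_to_byte_offset(haystack, start)
    byte_end = (len(haystack) if end is None
                else char_to_byte_offset(haystack, end))
    # Step 2: plain byte search (valid because UTF-8 is self-synchronizing).
    hit = haystack.find(needle, byte_start, byte_end)
    if hit < 0:
        return -1
    # Step 3: byte offset of the match -> character offset.
    return byte_to_char_offset(haystack, hit)


text = "naïve café"
haystack = text.encode("utf-8")
print(utf8_find(haystack, "café".encode("utf-8")), text.find("café"))
```

Both conversions are linear scans, so the cost of this approach grows with the length of the string even when the match is found early.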

Btw. I don't think it is useful enough at this point, but the code could in principle be optimized a bit for the likely case that needle is a constant: we can re-use the bloom filter from the previous iteration.
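To illustrate the idea of reusing the bloom filter when the needle is constant, here is a simplified sketch. It is not the vendored fastsearch code: the 64-bit mask width, the function names, and the use of `functools.lru_cache` as the reuse mechanism are all assumptions made for the example, and the search loop is a basic skip-ahead search rather than the real algorithm.

```python
from functools import lru_cache

BLOOM_WIDTH = 64  # assumed mask width: one bit per (byte value mod 64)


@lru_cache(maxsize=None)
def bloom_mask(needle):
    """Bit mask with one bit set for every byte occurring in `needle`.

    The cache means the mask is built once per distinct needle.
    """
    mask = 0
    for byte in needle:
        mask |= 1 << (byte & (BLOOM_WIDTH - 1))
    return mask


def find_with_bloom(haystack, needle):
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    mask = bloom_mask(needle)  # reused across calls with the same needle
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        # If the byte just past the window cannot occur in the needle,
        # no window overlapping that byte can match, so skip past it.
        # False positives in the mask only make the skip smaller.
        if i + m < n and not mask & (1 << (haystack[i + m] & (BLOOM_WIDTH - 1))):
            i += m + 1
        else:
            i += 1
    return -1


haystacks = [b"hello world", b"worldwide", b"no match here"]
results = [find_with_bloom(h, b"world") for h in haystacks]
print(results, bloom_mask.cache_info())
```

In a ufunc inner loop the equivalent would be to keep the mask from the previous iteration and rebuild it only when the needle pointer or contents change.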

lysnikolaou · 2023-10-31T13:08:16Z

Are there any more action items for me to take care of before we can merge this?

seberg · 2023-11-02T16:22:18Z

@lysnikolaou sorry, I don't want to fret about a few style nits, and I can't see a better path than vendoring the Python stuff (I really don't love how much code it is for something we don't do often, but...).
So, given that Nathan is also around to help with maintaining things, I think we can just put it in.

There is one problem now, unfortunately. I suspect the tests will fail after merging the change to what the default integer is. My suggestion would be that you squash all current commits into one (or a few) and then add a single commit to change all NPY_LONG to NPY_DEFAULT_INT (or intp if easier). Hopefully that will make all tests pass?

That way, the PR isn't directly tied to any possible revert, etc. due to the intp changes.
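The reason the two constants can differ is that C `long` is 32 bits wide on 64-bit Windows, while NumPy's default integer may be wider depending on the NumPy version. A small sketch for checking this on a given installation follows; it assumes that the `"l"` type character corresponds to `NPY_LONG`, and it makes no claim about which widths a particular build reports.

```python
import numpy as np

c_long = np.dtype("l")           # C long, assumed to correspond to NPY_LONG
default_int = np.dtype(int)      # NumPy's default integer dtype
pointer_int = np.dtype(np.intp)  # pointer-sized integer

for name, dt in [("C long", c_long),
                 ("default int", default_int),
                 ("intp", pointer_int)]:
    print(f"{name}: {dt.name} ({dt.itemsize * 8} bits)")

# The index array returned by find is an integer array whose exact
# width depends on the platform and the NumPy version.
idx = np.char.find(np.array(["abc", "xbc"]), "bc")
print(idx, idx.dtype)
```

If the default integer and C `long` report different widths, a loop registered for one of them will not match a request for the other, which is the kind of mismatch the `_UFuncNoLoopError` earlier in this thread shows.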

lysnikolaou · 2023-11-02T17:55:59Z

@seberg Done.

seberg · 2023-11-02T21:33:50Z

Thanks, let's see how it goes :).

github-actions bot added the 25 - WIP label Oct 5, 2023

charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Oct 9, 2023

lysnikolaou marked this pull request as ready for review October 11, 2023 08:09

lysnikolaou changed the title ~~WIP: Add find/rfind ufuncs for unicode and byte dtypes~~ ENH: Add find/rfind ufuncs for unicode and byte dtypes Oct 11, 2023

seberg reviewed Oct 18, 2023

lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 9870853 to 03327ab Compare October 18, 2023 11:57

lysnikolaou force-pushed the string-ufuncs-find-rfind branch 2 times, most recently from 82ee826 to 065e808 Compare October 23, 2023 17:40

lysnikolaou mentioned this pull request Oct 24, 2023

ENH: Add startswith & endswith ufuncs for unicode and bytes dtypes #24947

Merged

lysnikolaou force-pushed the string-ufuncs-find-rfind branch 2 times, most recently from 4f22f04 to f59bb9f Compare October 27, 2023 10:21

ngoldbaum reviewed Oct 30, 2023

seberg reviewed Oct 30, 2023

lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 36e4210 to 8899e95 Compare November 2, 2023 17:54

lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 8899e95 to d4d97eb Compare November 2, 2023 18:17

lysnikolaou added 2 commits November 2, 2023 19:18

ENH: Add find/rfind ufuncs for unicode and byte dtypes

83c780d

Use NPY_DEFAULT_INT rather than NPY_LONG

671fa53

lysnikolaou force-pushed the string-ufuncs-find-rfind branch from d4d97eb to 671fa53 Compare November 2, 2023 18:19

seberg merged commit 22ab9aa into numpy:main Nov 2, 2023

4 participants