Skip to content

ENH: Add find/rfind ufuncs for unicode and byte dtypes#24868

Merged
seberg merged 2 commits intonumpy:mainfrom
lysnikolaou:string-ufuncs-find-rfind
Nov 2, 2023
Merged

ENH: Add find/rfind ufuncs for unicode and byte dtypes#24868
seberg merged 2 commits intonumpy:mainfrom
lysnikolaou:string-ufuncs-find-rfind

Conversation

@lysnikolaou
Copy link
Member

No description provided.

@charris
Copy link
Member

charris commented Oct 9, 2023

Needs a release note.

@charris charris added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Oct 9, 2023
@lysnikolaou lysnikolaou marked this pull request as ready for review October 11, 2023 08:09
@lysnikolaou lysnikolaou changed the title WIP: Add find/rfind ufuncs for unicode and byte dtypes ENH: Add find/rfind ufuncs for unicode and byte dtypes Oct 11, 2023
@lysnikolaou
Copy link
Member Author

This is also ready to go from my side. A review would be very helpful!

@lysnikolaou
Copy link
Member Author

@ngoldbaum @seberg @charris A review here would be very helpful, whenever you get some free cycles.

@lysnikolaou
Copy link
Member Author

Benchmark results:

| Change   | Before [19bfa3ff] <main>   | After [b35de314] <string-ufuncs-find-rfind>   |   Ratio | Benchmark (Parameter)                                |
|----------|----------------------------|-----------------------------------------------|---------|------------------------------------------------------|
| -        | 17.1±0.08μs                | 9.24±0.02μs                                   |    0.54 | bench_core.NumPyChar.time_find_small_list_big_string |
| -        | 5.67±0.03ms                | 15.5±0.05μs                                   |    0    | bench_core.NumPyChar.time_find_big_list_small_string |

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments only. Do we you really need to vendor that file with large changes? Any chance of vendoring it without changes? OTOH, as the comment says, it seems like it would need more changes.

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 9870853 to 03327ab Compare October 18, 2023 11:57
@lysnikolaou
Copy link
Member Author

Do we you really need to vendor that file with large changes? Any chance of vendoring it without changes?

Unfortunately, for this to play along with string_ufuncs.cpp and the templating mechanism it uses, the changes are needed. This file has been there part of the CPython source tree for a long time with very minimal changes, in case that makes you less concerned.

@lysnikolaou
Copy link
Member Author

How should I go about fixing the test failure?

numpy._core._exceptions._UFuncNoLoopError:
    ufunc 'find' did not contain a loop with signature matching types
    (<class 'numpy.dtypes.StrDType'>,
     <class 'numpy.dtypes.StrDType'>,
     <class 'numpy.dtypes.Int64DType'>,
     <class 'numpy.dtypes.Int64DType'>) -> <class 'numpy.dtypes.Int32DType'>

Add another loop? Downcast the result? While I think I've understood a lot about how to handle input arguments dtypes and type resolving, output dtypes still confuse me a bit for some reason.

@ngoldbaum
Copy link
Member

I left a comment on the last commit, should probably have been here: fcd600a#r130812759

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-find-rfind branch 2 times, most recently from 4f22f04 to f59bb9f Compare October 27, 2023 10:21
@lysnikolaou
Copy link
Member Author

@seberg @ngoldbaum I've reworked this PR to use the buffer approach suggested by @ngoldbaum. Unfortunately, the tests all fail with the same kind of error:

[gw1] linux -- Python 3.11.6 /opt/hostedtoolcache/Python/3.11.6/x64/bin/python

    def test_find_access_past_buffer():
        # This checks that no read past the string buffer occurs in
        # string_fastsearch.h. The READC macro is there to check this.
        # To see it in action, you can redefine READC to just read the
        # i'th character of the buffer and this test will produce an
        # 'Invalid read' if run under valgrind.
        arr = np.array([b'abcd', b'ebcd'])
>       result = np._core.umath.find(arr, b'cde', 0, np.iinfo(np.int64).max)
E       numpy._core._exceptions._UFuncNoLoopError: ufunc 'find' did not contain a loop with signature matching types (<class 'numpy.dtypes.BytesDType'>, <class 'numpy.dtypes.BytesDType'>, <class 'numpy._IntegerAbstractDType'>, <class 'numpy._IntegerAbstractDType'>) -> None

This seems completely unrelated to the change I did in the last commits and I can't reproduce it locally either on a Ubuntu machine or an M1 macbook. Any ideas what might be causing it?

@lysnikolaou
Copy link
Member Author

@seberg @ngoldbaum This appears to be working properly now and all the fixes are in, as well as the migration to using a buffer class and a promoter rather than the old-style type resolution functions. Can you have a final look, so that we can merge this and I can start working on reworking the rest of the PRs to the new buffer solution?

@ngoldbaum
Copy link
Member

Are the benchmark results still similar to #24868 (comment)?

@lysnikolaou
Copy link
Member Author

Yeah, sorry, I forgot to post the updated benchmark results.

| Change   | Before [cdfbdf42] <main>   | After [03d737dc] <string-ufuncs-find-rfind>   |   Ratio | Benchmark (Parameter)                                |
|----------|----------------------------|-----------------------------------------------|---------|------------------------------------------------------|
| -        | 17.1±0.2μs                 | 9.45±0.3μs                                    |    0.55 | bench_core.NumPyChar.time_find_small_list_big_string |
| -        | 5.73±0.2ms                 | 17.2±0.4μs                                    |    0    | bench_core.NumPyChar.time_find_big_list_small_string |

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks really good, just one minor nit. I looked over the ufunc setup and the buffer class but didn't look over string_fastsearch.h in detail. I think the string buffer class is a nice abstraction that will make it much easier to plug in UTF-8 in a month or two.

I have two minor comments but think this is ready from my end. I want Sebastian to give this a once-over to get his opinion on the string buffer class before merging.

@lysnikolaou
Copy link
Member Author

Test failures seem to be unrelated.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion about the Buffer class/struct, as it is relatively localized.
But it does seem very alpha when it comes to actually be useful and not just buggy for utf-8.

Maybe we can just remove the utf-8 stuff? Or put a big comment somewhere that the utf-8 stuff was intended to be used in the near future, but is not usable yet.

@ngoldbaum
Copy link
Member

Maybe we can just remove the utf-8 stuff? Or put a big comment somewhere that the utf-8 stuff was intended to be used in the near future, but is not usable yet.

Removing it is fine with me.

I'm happy to benchmark whether implicit conversions to UCS-4 arrays is faster than just plugging in the Buffer class when I do that.

@lysnikolaou
Copy link
Member Author

Removed the UTF-8 stuff and addressed all review comments.

@seberg
Copy link
Member

seberg commented Oct 31, 2023

I'm happy to benchmark whether implicit conversions to UCS-4 arrays is faster than just plugging in the Buffer class when I do that.

Yeah, I am not too worried about it. My gut feeling is that the simplest solution for many algorithms will be to use the bytes directly (for whitespace/isalpha, etc. of course not, but those are clean single pass algorithms anyway).
For find/rfind you will have to do translate the slicing into byte-offets at the start and the result byte-offset back to character offsets when done.

Btw. I don't think it is useful enough at this point, but the code could in principle be optimized a bit for the likely case that needle is a constant: we can re-use the bloom filter from the previous iteration.

@lysnikolaou
Copy link
Member Author

Are there any more actions points for me to take care of before we can merge this?

@seberg
Copy link
Member

seberg commented Nov 2, 2023

@lysnikolaou sorry, I don't want to fret about a few style nits, and I can't see a better path than vendoring the Python stuff (I really don't love how much code it is for something we don't do often, but...).
So I think, given that Nathan is also around to help with maintaining thing, I think we can just put it in.

There is one problem now, unfortunately. I suspect the tests will fail after merging the change to what the default integer is. My suggestion would be that you squash all current commits into one (or few) and then add a single commit to fix all NPY_LONG to NPY_DEFAULT_INT (or intp if easier), hopefully that will make any test pass?

That way, the PR isn't directly tied to any possible revert, etc. due to the intp changes.

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 36e4210 to 8899e95 Compare November 2, 2023 17:54
@lysnikolaou
Copy link
Member Author

@seberg Done.

@lysnikolaou lysnikolaou force-pushed the string-ufuncs-find-rfind branch from 8899e95 to d4d97eb Compare November 2, 2023 18:17
@lysnikolaou lysnikolaou force-pushed the string-ufuncs-find-rfind branch from d4d97eb to 671fa53 Compare November 2, 2023 18:19
@seberg seberg merged commit 22ab9aa into numpy:main Nov 2, 2023
@seberg
Copy link
Member

seberg commented Nov 2, 2023

Thanks, lets see how it goes :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

25 - WIP 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants