ENH: Add find/rfind ufuncs for unicode and byte dtypes#24868
ENH: Add find/rfind ufuncs for unicode and byte dtypes#24868seberg merged 2 commits intonumpy:mainfrom
Conversation
|
Needs a release note. |
|
This is also ready to go from my side. A review would be very helpful! |
|
@ngoldbaum @seberg @charris A review here would be very helpful, whenever you get some free cycles. |
|
Benchmark results: |
seberg
left a comment
There was a problem hiding this comment.
A few comments only. Do we you really need to vendor that file with large changes? Any chance of vendoring it without changes? OTOH, as the comment says, it seems like it would need more changes.
9870853 to
03327ab
Compare
Unfortunately, for this to play along with |
|
How should I go about fixing the test failure? Add another loop? Downcast the result? While I think I've understood a lot about how to handle input arguments dtypes and type resolving, output dtypes still confuse me a bit for some reason. |
82ee826 to
065e808
Compare
|
I left a comment on the last commit, should probably have been here: fcd600a#r130812759 |
4f22f04 to
f59bb9f
Compare
|
@seberg @ngoldbaum I've reworked this PR to use the buffer approach suggested by @ngoldbaum. Unfortunately, the tests all fail with the same kind of error: This seems completely unrelated to the change I did in the last commits and I can't reproduce it locally either on a Ubuntu machine or an M1 macbook. Any ideas what might be causing it? |
|
@seberg @ngoldbaum This appears to be working properly now and all the fixes are in, as well as the migration to using a buffer class and a promoter rather than the old-style type resolution functions. Can you have a final look, so that we can merge this and I can start working on reworking the rest of the PRs to the new buffer solution? |
|
Are the benchmark results still similar to #24868 (comment)? |
|
Yeah, sorry, I forgot to post the updated benchmark results. |
ngoldbaum
left a comment
There was a problem hiding this comment.
Overall looks really good, just one minor nit. I looked over the ufunc setup and the buffer class but didn't look over string_fastsearch.h in detail. I think the string buffer class is a nice abstraction that will make it much easier to plug in UTF-8 in a month or two.
I have two minor comments but think this is ready from my end. I want Sebastian to give this a once-over to get his opinion on the string buffer class before merging.
|
Test failures seem to be unrelated. |
seberg
left a comment
There was a problem hiding this comment.
I don't have a strong opinion about the Buffer class/struct, as it is relatively localized.
But it does seem very alpha when it comes to actually be useful and not just buggy for utf-8.
Maybe we can just remove the utf-8 stuff? Or put a big comment somewhere that the utf-8 stuff was intended to be used in the near future, but is not usable yet.
Removing it is fine with me. I'm happy to benchmark whether implicit conversions to UCS-4 arrays is faster than just plugging in the Buffer class when I do that. |
|
Removed the UTF-8 stuff and addressed all review comments. |
Yeah, I am not too worried about it. My gut feeling is that the simplest solution for many algorithms will be to use the bytes directly (for whitespace/isalpha, etc. of course not, but those are clean single pass algorithms anyway). Btw. I don't think it is useful enough at this point, but the code could in principle be optimized a bit for the likely case that needle is a constant: we can re-use the bloom filter from the previous iteration. |
|
Are there any more actions points for me to take care of before we can merge this? |
|
@lysnikolaou sorry, I don't want to fret about a few style nits, and I can't see a better path than vendoring the Python stuff (I really don't love how much code it is for something we don't do often, but...). There is one problem now, unfortunately. I suspect the tests will fail after merging the change to what the default integer is. My suggestion would be that you squash all current commits into one (or few) and then add a single commit to fix all That way, the PR isn't directly tied to any possible revert, etc. due to the intp changes. |
36e4210 to
8899e95
Compare
|
@seberg Done. |
8899e95 to
d4d97eb
Compare
d4d97eb to
671fa53
Compare
|
Thanks, lets see how it goes :). |
No description provided.