Skip to content

Conversation

@lidavidm
Copy link
Member

This uses RE2 to implement a case-insensitive substring search.

Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)

@ianmcook
Copy link
Member

Bikeshedding here, but maybe we should rename the C++ and Python argument to ignore_case instead of case_insensitive which is a little easier to understand without getting mixed up in double negatives

@github-actions
Copy link

@lidavidm
Copy link
Member Author

ignore_case is also what Python's regex engine calls it at least (though not RE2), so I've renamed it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give an example where re2 doesn't follow proper Unicode semantics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why two MatchSubstring and MatchSubstringImpl classes. It seems one should be sufficient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one of each. I moved them around in this PR, but it's the same as before.

Copy link
Member Author

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of RE2's Unicode handling is here: google/re2#262

That said I don't think it's too big a deal for us.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one of each. I moved them around in this PR, but it's the same as before.

@pitrou
Copy link
Member

pitrou commented May 26, 2021

That said I don't think it's too big a deal for us.

It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

@lidavidm
Copy link
Member Author

That said I don't think it's too big a deal for us.

It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

It is a bit of a bummer but I think it's also not 'unexpected' in that other systems (languages, etc) probably make the same tradeoff, hence why I thought it was worth a note in the docs, but wasn't worth implementing a full solution using utf8proc.

@pitrou
Copy link
Member

pitrou commented Jun 1, 2021

Rebased to check for CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants