-
-
Notifications
You must be signed in to change notification settings - Fork 230
#1884 take surrogate code points into account in RegexSearcher #1886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
because if a Pattern (in java) uses BnMS then chars with two code points (high and low surrogate) match even if the start index points to the lower surrogate.
404cc83 to
82b1e45
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1886 +/- ##
=========================================
Coverage 91.80% 91.80%
Complexity 119 119
=========================================
Files 455 455
Lines 5014 5017 +3
Branches 239 240 +1
=========================================
+ Hits 4603 4606 +3
Misses 364 364
Partials 47 47
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
| override fun search(searchIn: CharSequence, searchFor: Regex): Int { | ||
| var counter = 0 | ||
| var matchResult = searchFor.find(searchIn) | ||
| val searchInString = searchIn.toString() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering: does searchIn.findAll(searchFor): Sequence<MatchResult> work?
MatchResult.next(): MatchResult? work?
It looks a bit weird to hand-craft it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
findAll would be a disjoint search, i.e. search aa in aaaa would find 2 occurrences. we use a non-disjoint search i.e. it finds it 3 times. Don't ask me why I went with a non-disjoint search tough. Looking at it now it seems a bit weird but back in 2017 I had a use case where it was relevant and it seemed natural to implement it that way. I guess it was an edge case as I cannot remember it. Most times it probably doesn't matter, so... I would keep it that way (only break it in a major version).
I wasn't aware of the MatchResult.next() method. I would hope it works. Will check it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MatchResult.next() and friends go for disjoint matches which is what I would have expected if I used toContain o exactly 3 regex.
If you implement overlapping occurrences, then the hand-rolled impl is ok. I was just curios, there were not comments, so I asked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disjoint matches which is what I would have expected if I used toContain o exactly 3 regex.
I get that as I had the same assumption some time ago. It's documented in the KDoc (also for toContain o exactly 3 regex) but it's not intuitive. I guess I will change the behaviour with 2.0.0
because if a Pattern (in java) uses BnMS then chars with two code points (high and low surrogate) match even if the start index points to the lower surrogate.
fixes #1884
I confirm that I have read the Contributor Agreements v1.0, agree to be bound on them and confirm that my contribution is compliant.