Fix char class normalisation for overlapping class content #1066

Merged
lsf37 merged 4 commits into master from sorted-interval on Feb 26, 2023


@lsf37 lsf37 commented Feb 26, 2023

In a negated character class with overlapping content, such as [^\n\s], the normalisation code violates a precondition of IntCharSet.sub() and leaves the class content in an inconsistent state. This either triggers an exception at generation time, if another set operation interacts with the inconsistent part, or may lead to matching the wrong input at runtime, if nothing else interacts with the set.

This PR fixes the problem by first computing the union of the class content \n\s, which becomes a single set (joining the overlapping parts), and then computing the complement of that set.
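The difference between the two orders of operation can be sketched with a toy model. This is not JFlex code: it uses `java.util.BitSet` over an ASCII-only universe, `subStrict` is a hypothetical stand-in for a subtraction that requires its argument to be contained in the receiver, the whitespace set is an assumed definition of `\s`, and subtracting the parts one by one is only one plausible way the precondition could have been violated.

```java
import java.util.BitSet;

/** Toy model of negated char class normalisation; not JFlex code. */
final class NegatedClassSketch {
    /** ASCII-only universe, an assumption that keeps the sketch small. */
    static final int UNIVERSE = 128;

    static BitSet of(int... chars) {
        BitSet s = new BitSet(UNIVERSE);
        for (int c : chars) s.set(c);
        return s;
    }

    static BitSet all() {
        BitSet a = new BitSet(UNIVERSE);
        a.set(0, UNIVERSE); // sets bits 0 .. UNIVERSE-1
        return a;
    }

    /** Returns from minus s; requires s to be a subset of from (hypothetical stand-in for sub()). */
    static BitSet subStrict(BitSet from, BitSet s) {
        BitSet missing = (BitSet) s.clone();
        missing.andNot(from); // elements of s that are not in from
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("not contained: " + missing);
        }
        BitSet result = (BitSet) from.clone();
        result.andNot(s);
        return result;
    }

    /** Problematic order: subtract each part in turn; overlapping parts break the precondition. */
    static BitSet negateByRepeatedSub(BitSet... parts) {
        BitSet result = all();
        for (BitSet p : parts) result = subStrict(result, p);
        return result;
    }

    /** Order described in the PR: union the parts first, then take the complement. */
    static BitSet negateByUnion(BitSet... parts) {
        BitSet union = new BitSet(UNIVERSE);
        for (BitSet p : parts) union.or(p);
        return subStrict(all(), union); // union is always contained in all()
    }

    public static void main(String[] args) {
        BitSet newline = of('\n');
        // Assumed definition of \s for this sketch only.
        BitSet whitespace = of('\t', '\n', 0x0B, '\f', '\r', ' ');

        BitSet fixed = negateByUnion(newline, whitespace);
        System.out.println("matches newline: " + fixed.get('\n'));
        System.out.println("matches 'a': " + fixed.get('a'));

        try {
            negateByRepeatedSub(newline, whitespace);
        } catch (IllegalArgumentException e) {
            System.out.println("repeated sub rejected: " + e.getMessage());
        }
    }
}
```

In this model the strict subtraction rejects the second step outright. A subtraction that does not check its precondition would instead continue with a corrupted set, which matches the two failure modes described above.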

  • enforce invariant in Interval class
  • avoid violating sub precondition
  • add regression test case for negating overlapping char class content
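
The PR text does not say which invariant the Interval class now enforces. A minimal sketch, assuming the invariant is that an interval's start never exceeds its end, might look like the following; `IntervalSketch` and its methods are hypothetical names, not the JFlex class.

```java
/** Hypothetical closed interval [start, end] of code points; not the JFlex Interval class. */
final class IntervalSketch {
    final int start;
    final int end;

    IntervalSketch(int start, int end) {
        // Assumed invariant: start <= end. Reject anything else at construction time.
        if (start > end) {
            throw new IllegalArgumentException("start " + start + " > end " + end);
        }
        this.start = start;
        this.end = end;
    }

    /** True if c lies within the closed interval. */
    boolean contains(int c) {
        return start <= c && c <= end;
    }

    /** True if the two closed intervals share at least one point. */
    boolean overlaps(IntervalSketch other) {
        return start <= other.end && other.start <= end;
    }
}
```

Checking at construction time means a malformed interval fails where it is created rather than later, when an unrelated set operation happens to touch it.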

Fixes #1065

lsf37 self-assigned this on Feb 26, 2023
lsf37 added the bug (Not working as intended) label on Feb 26, 2023
lsf37 force-pushed the sorted-interval branch 2 times, most recently from 0fa0be9 to a3772dc, on February 26, 2023 02:04

Regression test for #1065: test that a negated char class with
overlapping content is generated and matched correctly.

`IntCharSet.sub(s)` expects `s` to be fully contained in `this`. If the
contents of the inner char class expression overlap, this assumption is
violated and leads to an inconsistent IntCharSet state.

Fix this by computing the union of the inner expression first, and then
returning the complement. Since the only difference to the CCLASS case
is the complement at the end, the two cases can be merged.

Fixes #1065

lsf37 added this to the 1.9.1 milestone on Feb 26, 2023
lsf37 merged commit 29b6852 into master on Feb 26, 2023
lsf37 deleted the sorted-interval branch on February 26, 2023 05:29

Development

Successfully merging this pull request may close these issues.

Whitespaces negation in group not working as expected
