Fix JDK regex support#888
Fix JDK regex support#888stevehu merged 1 commit intonetworknt:masterfrom PicnicSupermarket:bugfix/fix-jdk-regex-support
Conversation
Stephan202
left a comment
There was a problem hiding this comment.
Added some context. Still thinking about the "trailing newline" cases, but didn't find any solution yet.
This PR resolves #872.
Suggested commit message:
Fix JDK regex support
Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
fields, such that the test framework does not skip them with a "Not a
valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
regexes to match their intended semantics and (b) include a few
negative test cases.
- Partially revert the change introduced by #783: the use of
`Matcher#find()` is correct, but the `hasStartAnchor` and
`hasEndAnchor` logic introduces more bugs than the issue it aims to
solve.
- Extend the test coverage introduced by #783, by introducing regexes
that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not
handled correctly by either regex implementation.
Resolves #872.
| public boolean matches(String value) { | ||
| Matcher matcher = this.pattern.matcher(value); | ||
| return matcher.find() && (!this.hasStartAnchor || 0 == matcher.start()) && (!this.hasEndAnchor || matcher.end() == value.length()); | ||
| return this.pattern.matcher(value).find(); |
|
|
||
| byte[] bytes = s.getBytes(StandardCharsets.UTF_8); | ||
| this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript); | ||
| this.pattern = new Regex(bytes, 0, bytes.length, Option.SINGLELINE, UTF8Encoding.INSTANCE, Syntax.ECMAScript); |
There was a problem hiding this comment.
I did not find the Joni documentation, but this change makes more test cases pass; please carefully review.
| @Test | ||
| void jdkTypePattern() { | ||
| JDKRegularExpression ex = new JDKRegularExpression("^list|date|time|string|enum|int|double|long|boolean|number$"); | ||
| JDKRegularExpression ex = new JDKRegularExpression("^(list|date|time|string|enum|int|double|long|boolean|number)$"); |
| @Test | ||
| void jdkOptionsPattern() { | ||
| JDKRegularExpression ex = new JDKRegularExpression("^\\d*|[a-zA-Z_]+$"); | ||
| JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$"); |
There was a problem hiding this comment.
Note that I changed * to +: because ^\d* is vacuously true: it matches the empty string at the start of any input. (Without this the internal9 test case below would also yield true.)
| [ | ||
| { | ||
| "description": "issue495 using ECMA-262", | ||
| "regex": "ecma-262", |
There was a problem hiding this comment.
I moved this fields, because regex is defined in the TestSpec class, not inside TestCase. Without this the tests were skipped...
| "pattern": "^[a-z]{1,10}$", | ||
| "patternProperties": { | ||
| "^[a-z]{1,10}$": true, | ||
| "(^1$)": true | ||
| }, |
There was a problem hiding this comment.
The pattern keyword didn't seem to work, so I switched to patternProperties; let me know if something else was intended. Also added an extra test case to show why inspecting the regex will not work in general.
| { | ||
| "description": "trailing newline", | ||
| "regex": "jdk", | ||
| "data": { "aaa\n": 3 }, | ||
| "valid": false | ||
| "valid": false, | ||
| "disabled": true, | ||
| "comment": "Test fails" | ||
| }, |
There was a problem hiding this comment.
This is the only test case on which this PR regresses. Both regex implementation suffer from the same issue:
jshell> Pattern.compile("^a$").matcher("a\n").find()
$1 ==> true
jshell> Pattern.compile("(^a$)").matcher("a\n").find()
$2 ==> true
There was a problem hiding this comment.
Okay, I found why this happens. The documentation states:
The regular expression
$matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.
So in theory we could avoid this by transforming all $ anchors by prepending a negative lookahead, like so:
jshell> Pattern.compile("^a(?!\n|\r\n|\r|\\u0085|\\u2028|\\u2029)$").matcher("a\n").find()
$1 ==> false
But that will require careful parsing of the input regex, because the $ could be escaped, or inside a \Q...\E section, or... Quite frankly, that way lies madness; would like to keep it out of this PR.
| JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
| assertTrue(ex.matches("0external")); | ||
| assertTrue(ex.matches("internal")); | ||
| assertTrue(ex.matches("external")); | ||
| assertTrue(ex.matches("external_gte")); | ||
| assertTrue(ex.matches("force")); | ||
| assertFalse(ex.matches("internal9")); |
There was a problem hiding this comment.
@Stephan202 Actually the ^\\d+|[a-zA-Z_]+$ is the same as ^\\d|[a-zA-Z_]$
Explanation:
you should read the ^\\d+|[a-zA-Z_]+$ like: matches to ^\\d+ or to [a-zA-Z_]+$.
In 'human' words: starts with digit or ends with alpha-or-underscore.
Actually it is identical to: matches to ^\\d or to [a-zA-Z_]$ or ^\\d|[a-zA-Z_]$
| assertTrue(ex.matches("0external")); | ||
| assertTrue(ex.matches("internal")); | ||
| assertTrue(ex.matches("external")); | ||
| assertTrue(ex.matches("external_gte")); |
There was a problem hiding this comment.
I think that similar examples can be removed, and the next can be added:
(matches left part: starts with digit:)
assertTrue - "5"
assertTrue - "55"
assertTrue - "5%"
(matches right part: ends with alpha-or-undescore:)
assertTrue - "a"
assertTrue - "aa"
assertTrue - "%a"
assertTrue - "%_"
(matches both parts)
assertTrue - "55aa"
assertTrue - "5%%a"
assertFalse: ""
assertFalse: "%"
assertFalse: "a5"
There was a problem hiding this comment.
Sure, lemme replace them.
Summary of changes: - Fix the test resources introduced by #783 by moving the `regex` fields, such that the test framework does not skip them with a "Not a valid test case" message. - Revert the changes introduced by #815, as those are simply incorrect. - Extend the test coverage introduced by #815 by (a) updating the test regexes to match their intended semantics and (b) include a few negative test cases. - Partially revert the change introduced by #783: the use of `Matcher#find()` is correct, but the `hasStartAnchor` and `hasEndAnchor` logic introduces more bugs than the issue it aims to solve. - Extend the test coverage introduced by #783, by introducing regexes that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic. - Update the Joni regular expression integration such that it passes more of the test cases. - Disable the "trailing newline" test cases, as these are currently not handled correctly by either regex implementation.
Stephan202
left a comment
There was a problem hiding this comment.
Thanks for taking a look @kool79!
I squashed the following changes into the commit:
diff --git a/src/test/java/com/networknt/schema/regex/Issue814Test.java b/src/test/java/com/networknt/schema/regex/Issue814Test.java
index 3e7d42d..15ab31c 100644
--- a/src/test/java/com/networknt/schema/regex/Issue814Test.java
+++ b/src/test/java/com/networknt/schema/regex/Issue814Test.java
@@ -21,13 +21,19 @@ class Issue814Test {
@Test
void jdkOptionsPattern() {
- JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
- assertTrue(ex.matches("0external"));
- assertTrue(ex.matches("external"));
- assertTrue(ex.matches("external_gte"));
- assertTrue(ex.matches("force"));
- assertTrue(ex.matches("internal"));
- assertFalse(ex.matches("internal9"));
+ JDKRegularExpression ex = new JDKRegularExpression("^\\d|[a-zA-Z_]$");
+ assertTrue(ex.matches("5"));
+ assertTrue(ex.matches("55"));
+ assertTrue(ex.matches("5%"));
+ assertTrue(ex.matches("a"));
+ assertTrue(ex.matches("aa"));
+ assertTrue(ex.matches("%a"));
+ assertTrue(ex.matches("%_"));
+ assertTrue(ex.matches("55aa"));
+ assertTrue(ex.matches("5%%a"));
+ assertFalse(ex.matches(""));
+ assertFalse(ex.matches("%"));
+ assertFalse(ex.matches("a5"));
}
@Test
@@ -45,13 +51,19 @@ class Issue814Test {
@Test
void joniOptionsPattern() {
- JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
- assertTrue(ex.matches("0external"));
- assertTrue(ex.matches("internal"));
- assertTrue(ex.matches("external"));
- assertTrue(ex.matches("external_gte"));
- assertTrue(ex.matches("force"));
- assertFalse(ex.matches("internal9"));
+ JoniRegularExpression ex = new JoniRegularExpression("^\\d|[a-zA-Z_]$");
+ assertTrue(ex.matches("5"));
+ assertTrue(ex.matches("55"));
+ assertTrue(ex.matches("5%"));
+ assertTrue(ex.matches("a"));
+ assertTrue(ex.matches("aa"));
+ assertTrue(ex.matches("%a"));
+ assertTrue(ex.matches("%_"));
+ assertTrue(ex.matches("55aa"));
+ assertTrue(ex.matches("5%%a"));
+ assertFalse(ex.matches(""));
+ assertFalse(ex.matches("%"));
+ assertFalse(ex.matches("a5"));
}
}(I squashed, because looking at the master branch, it appears that the maintainer of this repository uses GitHub's squash-and-rebase feature without cleaning up the commit message auto-generated by GitHub, which creates a rather messy result if there's more than one commit.)
| JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$"); | ||
| assertTrue(ex.matches("0external")); | ||
| assertTrue(ex.matches("internal")); | ||
| assertTrue(ex.matches("external")); | ||
| assertTrue(ex.matches("external_gte")); | ||
| assertTrue(ex.matches("force")); | ||
| assertFalse(ex.matches("internal9")); |
| assertTrue(ex.matches("0external")); | ||
| assertTrue(ex.matches("internal")); | ||
| assertTrue(ex.matches("external")); | ||
| assertTrue(ex.matches("external_gte")); |
There was a problem hiding this comment.
Sure, lemme replace them.
|
@Stephan202 @kool79 Thanks a lot for the hard work. Do you think we are ready to merge? @fdutton Do you have any concerns? |
|
@stevehu I think that not supporting the trailing newline case is an acceptable compromise given the larger context; I'm not planning other changes at this time 👍 (We should trigger the build workflow, though, to make sure that the tests also pass on GHA.) |
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #888 +/- ##
============================================
+ Coverage 78.48% 78.52% +0.03%
+ Complexity 1213 1207 -6
============================================
Files 120 120
Lines 3891 3883 -8
Branches 746 740 -6
============================================
- Hits 3054 3049 -5
Misses 546 546
+ Partials 291 288 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Do you feel comfortable to merge this PR? Anyone has some concerns? Thanks. |
|
@stevehu yes, I feel comfortable doing so 👍 (But don't have permission to ;).) |
Summary of changes:
regexfields, such that the test framework does not skip them with a "Not a valid test case" message.Matcher#find()is correct, but thehasStartAnchorandhasEndAnchorlogic introduces more bugs than the issue it aims to solve.hasStartAnchor/hasEndAnchorlogic.