Skip to content

Fix JDK regex support#888

Merged
stevehu merged 1 commit intonetworknt:masterfrom
PicnicSupermarket:bugfix/fix-jdk-regex-support
Dec 2, 2023
Merged

Fix JDK regex support#888
stevehu merged 1 commit intonetworknt:masterfrom
PicnicSupermarket:bugfix/fix-jdk-regex-support

Conversation

@Stephan202
Copy link
Copy Markdown
Contributor

Summary of changes:

Copy link
Copy Markdown
Contributor Author

@Stephan202 Stephan202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some context. Still thinking about the "trailing newline" cases, but didn't find any solution yet.

This PR resolves #872.

Suggested commit message:

Fix JDK regex support

Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
  fields, such that the test framework does not skip them with a "Not a
  valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
  regexes to match their intended semantics and (b) include a few 
  negative test cases.
- Partially revert the change introduced by #783: the use of
  `Matcher#find()` is correct, but the `hasStartAnchor` and 
  `hasEndAnchor` logic introduces more bugs than the issue it aims to
  solve.
- Extend the test coverage introduced by #783, by introducing regexes
  that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
  more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not 
  handled correctly by either regex implementation.

Resolves #872.

public boolean matches(String value) {
Matcher matcher = this.pattern.matcher(value);
return matcher.find() && (!this.hasStartAnchor || 0 == matcher.start()) && (!this.hasEndAnchor || matcher.end() == value.length());
return this.pattern.matcher(value).find();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the original suggestion by @tvahrst in #782, and seems to be closest to what the spec intents (modulo the newline case).


byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
this.pattern = new Regex(bytes, 0, bytes.length, Option.NONE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
this.pattern = new Regex(bytes, 0, bytes.length, Option.SINGLELINE, UTF8Encoding.INSTANCE, Syntax.ECMAScript);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find the Joni documentation, but this change makes more test cases pass; please carefully review.

@Test
void jdkTypePattern() {
JDKRegularExpression ex = new JDKRegularExpression("^list|date|time|string|enum|int|double|long|boolean|number$");
JDKRegularExpression ex = new JDKRegularExpression("^(list|date|time|string|enum|int|double|long|boolean|number)$");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned by @kool79 in #872, if a user wishes only to match these words, then the parentheses are required.

@Test
void jdkOptionsPattern() {
JDKRegularExpression ex = new JDKRegularExpression("^\\d*|[a-zA-Z_]+$");
JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I changed * to +: because ^\d* is vacuously true: it matches the empty string at the start of any input. (Without this the internal9 test case below would also yield true.)

[
{
"description": "issue495 using ECMA-262",
"regex": "ecma-262",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this fields, because regex is defined in the TestSpec class, not inside TestCase. Without this the tests were skipped...

Comment on lines -7 to +9
"pattern": "^[a-z]{1,10}$",
"patternProperties": {
"^[a-z]{1,10}$": true,
"(^1$)": true
},
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern keyword didn't seem to work, so I switched to patternProperties; let me know if something else was intended. Also added an extra test case to show why inspecting the regex will not work in general.

Comment on lines 78 to +85
{
"description": "trailing newline",
"regex": "jdk",
"data": { "aaa\n": 3 },
"valid": false
"valid": false,
"disabled": true,
"comment": "Test fails"
},
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only test case on which this PR regresses. Both regex implementation suffer from the same issue:

jshell> Pattern.compile("^a$").matcher("a\n").find()
$1 ==> true
jshell> Pattern.compile("(^a$)").matcher("a\n").find()
$2 ==> true

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I found why this happens. The documentation states:

The regular expression $ matches at the end of the entire input sequence, but also matches just before the last line terminator if this is not followed by any other input character. Other line terminators are ignored, including the last one if it is followed by other input characters.

So in theory we could avoid this by transforming all $ anchors by prepending a negative lookahead, like so:

jshell> Pattern.compile("^a(?!\n|\r\n|\r|\\u0085|\\u2028|\\u2029)$").matcher("a\n").find()
$1 ==> false

But that will require careful parsing of the input regex, because the $ could be escaped, or inside a \Q...\E section, or... Quite frankly, that way lies madness; would like to keep it out of this PR.

Comment on lines +48 to +54
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
assertTrue(ex.matches("force"));
assertFalse(ex.matches("internal9"));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stephan202 Actually the ^\\d+|[a-zA-Z_]+$ is the same as ^\\d|[a-zA-Z_]$
Explanation:
you should read the ^\\d+|[a-zA-Z_]+$ like: matches to ^\\d+ or to [a-zA-Z_]+$.
In 'human' words: starts with digit or ends with alpha-or-underscore.
Actually it is identical to: matches to ^\\d or to [a-zA-Z_]$ or ^\\d|[a-zA-Z_]$

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right you are!

assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that similar examples can be removed, and the next can be added:
(matches left part: starts with digit:)
assertTrue - "5"
assertTrue - "55"
assertTrue - "5%"
(matches right part: ends with alpha-or-undescore:)
assertTrue - "a"
assertTrue - "aa"
assertTrue - "%a"
assertTrue - "%_"
(matches both parts)
assertTrue - "55aa"
assertTrue - "5%%a"

assertFalse: ""
assertFalse: "%"
assertFalse: "a5"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, lemme replace them.

Summary of changes:
- Fix the test resources introduced by #783 by moving the `regex`
  fields, such that the test framework does not skip them with a "Not a
  valid test case" message.
- Revert the changes introduced by #815, as those are simply incorrect.
- Extend the test coverage introduced by #815 by (a) updating the test
  regexes to match their intended semantics and (b) include a few
  negative test cases.
- Partially revert the change introduced by #783: the use of
  `Matcher#find()` is correct, but the `hasStartAnchor` and
  `hasEndAnchor` logic introduces more bugs than the issue it aims to
  solve.
- Extend the test coverage introduced by #783, by introducing regexes
  that are not covered by the `hasStartAnchor`/`hasEndAnchor` logic.
- Update the Joni regular expression integration such that it passes
  more of the test cases.
- Disable the "trailing newline" test cases, as these are currently not
  handled correctly by either regex implementation.
Copy link
Copy Markdown
Contributor Author

@Stephan202 Stephan202 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look @kool79!

I squashed the following changes into the commit:

diff --git a/src/test/java/com/networknt/schema/regex/Issue814Test.java b/src/test/java/com/networknt/schema/regex/Issue814Test.java
index 3e7d42d..15ab31c 100644
--- a/src/test/java/com/networknt/schema/regex/Issue814Test.java
+++ b/src/test/java/com/networknt/schema/regex/Issue814Test.java
@@ -21,13 +21,19 @@ class Issue814Test {
 
     @Test
     void jdkOptionsPattern() {
-        JDKRegularExpression ex = new JDKRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertTrue(ex.matches("internal"));
-        assertFalse(ex.matches("internal9"));
+        JDKRegularExpression ex = new JDKRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
     @Test
@@ -45,13 +51,19 @@ class Issue814Test {
 
     @Test
     void joniOptionsPattern() {
-        JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
-        assertTrue(ex.matches("0external"));
-        assertTrue(ex.matches("internal"));
-        assertTrue(ex.matches("external"));
-        assertTrue(ex.matches("external_gte"));
-        assertTrue(ex.matches("force"));
-        assertFalse(ex.matches("internal9"));
+        JoniRegularExpression ex = new JoniRegularExpression("^\\d|[a-zA-Z_]$");
+        assertTrue(ex.matches("5"));
+        assertTrue(ex.matches("55"));
+        assertTrue(ex.matches("5%"));
+        assertTrue(ex.matches("a"));
+        assertTrue(ex.matches("aa"));
+        assertTrue(ex.matches("%a"));
+        assertTrue(ex.matches("%_"));
+        assertTrue(ex.matches("55aa"));
+        assertTrue(ex.matches("5%%a"));
+        assertFalse(ex.matches(""));
+        assertFalse(ex.matches("%"));
+        assertFalse(ex.matches("a5"));
     }
 
 }

(I squashed, because looking at the master branch, it appears that the maintainer of this repository uses GitHub's squash-and-rebase feature without cleaning up the commit message auto-generated by GitHub, which creates a rather messy result if there's more than one commit.)

Comment on lines +48 to +54
JoniRegularExpression ex = new JoniRegularExpression("^\\d+|[a-zA-Z_]+$");
assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
assertTrue(ex.matches("force"));
assertFalse(ex.matches("internal9"));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right you are!

assertTrue(ex.matches("0external"));
assertTrue(ex.matches("internal"));
assertTrue(ex.matches("external"));
assertTrue(ex.matches("external_gte"));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, lemme replace them.

@stevehu stevehu requested a review from fdutton November 17, 2023 02:26
@stevehu
Copy link
Copy Markdown
Contributor

stevehu commented Nov 17, 2023

@Stephan202 @kool79 Thanks a lot for the hard work. Do you think we are ready to merge? @fdutton Do you have any concerns?

@Stephan202
Copy link
Copy Markdown
Contributor Author

@stevehu I think that not supporting the trailing newline case is an acceptable compromise given the larger context; I'm not planning other changes at this time 👍

(We should trigger the build workflow, though, to make sure that the tests also pass on GHA.)

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 17, 2023

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.52%. Comparing base (e86c817) to head (636e696).
⚠️ Report is 184 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##             master     #888      +/-   ##
============================================
+ Coverage     78.48%   78.52%   +0.03%     
+ Complexity     1213     1207       -6     
============================================
  Files           120      120              
  Lines          3891     3883       -8     
  Branches        746      740       -6     
============================================
- Hits           3054     3049       -5     
  Misses          546      546              
+ Partials        291      288       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@stevehu
Copy link
Copy Markdown
Contributor

stevehu commented Dec 1, 2023

Do you feel comfortable to merge this PR? Anyone has some concerns? Thanks.

@Stephan202
Copy link
Copy Markdown
Contributor Author

@stevehu yes, I feel comfortable doing so 👍 (But don't have permission to ;).)

@stevehu stevehu merged commit 0924dfe into networknt:master Dec 2, 2023
@Stephan202 Stephan202 deleted the bugfix/fix-jdk-regex-support branch December 2, 2023 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants