Hi,
I have found that neither the Java regex based (PatternValidatorJava) nor the Joni based (PatternValidatorEcma262) pattern validation works correctly with newlines.
Neither implementation interprets the ^ and $ anchors correctly. I would expect that, when I use them at the start and end of a pattern (e.g. ^[a-z]{1,10}$), they would not let a trailing newline character through (e.g. abc\n should not match). The two implementations have separate problems, so I will describe them individually.
- Joni
The problem is the default configuration for the ECMAScript syntax in the Joni library, which has multiline matching enabled by default. From the json-schema-validator code:
private boolean matches(String value) {
    if (compiledRegex == null) {
        return true;
    }
    byte[] bytes = value.getBytes();
    return compiledRegex.matcher(bytes).search(0, bytes.length, Option.NONE) >= 0;
}
As a quick fix, the last line can be changed to:
return compiledRegex.matcher(bytes).search(0, bytes.length, -Option.MULTILINE) >= 0;
but I have also raised an issue in the Joni library, as I believe this default is incorrect (ECMAScript has multiline matching disabled by default).
What's more interesting, because multiline matching is enabled, the input \r\nab\nab\n currently matches the pattern ^[a-z]{1,10}$ and passes validation. The intent is to allow a single word consisting only of letters, yet the whole multi-line input passes.
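To illustrate the effect of multiline anchors without pulling in Joni, here is a minimal sketch that uses java.util.regex as a stand-in. It is not the Joni API: Pattern.MULTILINE is only assumed to approximate the behaviour described above, and the class and method names are made up for this example.

```java
import java.util.regex.Pattern;

class MultilineAnchorDemo {
    // Returns true if any subsequence of the input matches the regex under the given flags.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "^[a-z]{1,10}$";
        String input = "\r\nab\nab\n";

        // Multiline on: ^ and $ match at line boundaries, so the inner line "ab" is found.
        System.out.println(found(regex, Pattern.MULTILINE, input)); // true

        // Multiline off: ^ matches only at the very start of the input, where '\r' is not a letter.
        System.out.println(found(regex, 0, input)); // false
    }
}
```

With multiline matching on, any single line that satisfies the pattern is enough for the whole input to pass, which is the behaviour reported here for the Joni based validator.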
- Java built-in regex find vs match
The current implementation of matching with Java regex looks like this:
private boolean matches(String value) {
    return compiledPattern == null || compiledPattern.matcher(value).find();
}
The problem is how the find method works. From the documentation: "Attempts to find the next subsequence of the input sequence that matches the pattern." So this method matches a subsequence of the input and, on the two JDKs I tried, returns true for the input abab\n and the pattern ^[a-z]{1,10}$, even though I used the ^ and $ anchors. So a trailing newline character will always be allowed for such patterns.
A possible solution is to use the matches method, which attempts to match the entire region against the pattern, but this would implicitly add ^ and $ anchors to every pattern.
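The difference can be reproduced with plain java.util.regex. The sketch below is only an illustration (the class and helper names are made up for it): by default, $ in Java also matches just before a final line terminator, which is why find() accepts abab\n. The \z line is shown purely to demonstrate Java's end-of-input anchor. It is Java-specific syntax and not part of ECMA 262, so it is not proposed as a fix.

```java
import java.util.regex.Pattern;

class FindVsMatchesDemo {
    // True if some subsequence of the input matches the regex.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // True only if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String anchored = "^[a-z]{1,10}$";

        // $ matches before the final '\n', so find() accepts the trailing newline.
        System.out.println(find(anchored, "abab\n"));    // true
        // matches() must consume the whole input, including '\n'.
        System.out.println(matches(anchored, "abab\n")); // false
        // \z matches only at the very end of the input.
        System.out.println(find("^[a-z]{1,10}\\z", "abab\n")); // false

        // The caveat: matches() behaves as if every pattern were anchored.
        System.out.println(find("[a-z]+", "123abc456"));    // true
        System.out.println(matches("[a-z]+", "123abc456")); // false
    }
}
```

The last two lines show the trade-off mentioned above: switching to matches() rejects the trailing newline, but it also changes the result for unanchored patterns that currently pass with find().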