I'm working on upgrading Lucene's JFlex dependency from 1.6.0 to 1.7.0 (https://issues.apache.org/jira/browse/LUCENE-8527), and ran into a problem with spoon-feeding Readers, i.e. Readers that return fewer chars than requested from read(char[]) even though further calls will return more chars. Such Readers used to be handled properly, but since the changes to the generated zzRefill() for issue #131 (released in JFlex 1.6.1), there are conditions involving above-BMP code points where the scanner buffer is not populated as fully as it could be, which results in improper matching. Lucene includes a test that sends random data through JFlex-generated scanners using a spoon-feeding Reader; this test succeeded with the scanners generated by JFlex 1.6.0, but failed with those generated by 1.7.0.
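For readers unfamiliar with the term, here is a minimal sketch of what a spoon-feeding Reader can look like. The class names `SpoonFeedingReader` and `SpoonFeedingDemo` and the `maxChars` parameter are made up for illustration; this is not the wrapper Lucene's test uses, only an assumption about the general shape of such a Reader.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helper: hands out at most maxChars chars per read(char[], int, int) call. */
final class SpoonFeedingReader extends FilterReader {
  private final int maxChars;

  SpoonFeedingReader(Reader in, int maxChars) {
    super(in);
    if (maxChars < 1) {
      throw new IllegalArgumentException("maxChars must be >= 1");
    }
    this.maxChars = maxChars;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    // Deliberately under-deliver: a short count here does not mean end of input.
    return super.read(cbuf, off, Math.min(len, maxChars));
  }
}

final class SpoonFeedingDemo {
  public static void main(String[] args) throws IOException {
    // 'a', then U+1F600 (encoded as a surrogate pair), then 'b'
    Reader reader = new SpoonFeedingReader(new StringReader("a\uD83D\uDE00b"), 2);
    char[] buf = new char[8];
    int n = reader.read(buf); // 8 chars requested, only 2 delivered
    System.out.println(n + " chars, last is a high surrogate: "
        + Character.isHighSurrogate(buf[n - 1]));
  }
}
```

With this input the first read() delivers 'a' plus the high surrogate of U+1F600, so the surrogate pair is split across two reads even though the caller asked for more chars than the whole input contains.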
Below are the changes I made to the zzRefill() method in a custom skeleton for Lucene, which fixed the problem.
Here's the original section:
    if (numRead > 0) {
      zzEndRead += numRead;
      /* If numRead == requested, we might have requested too few chars to
         encode a full Unicode character. We assume that a Reader would
         otherwise never return half characters. */
      if (numRead == requested) {
        if (Character.isHighSurrogate(zzBuffer[zzEndRead - 1])) {
          --zzEndRead;
          zzFinalHighSurrogate = 1;
        }
      }
And the custom version that fixes the problem:
    if (numRead > 0) {
      zzEndRead += numRead;
      if (Character.isHighSurrogate(zzBuffer[zzEndRead - 1])) {
        if (numRead == requested) { // We might have requested too few chars to encode a full Unicode character.
          --zzEndRead;
          zzFinalHighSurrogate = 1;
          if (numRead == 1) {
            return true;
          }
        } else { // There is room in the buffer for at least one more char
          int c = zzReader.read(); // Expecting to read a low surrogate char
          if (c == -1) {
            return true;
          } else {
            zzBuffer[zzEndRead++] = (char)c;
            return false;
          }
        }
      }
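
To make the difference between the two versions concrete, here is a heavily simplified, self-contained sketch of a single refill step. It is not JFlex's generated code: `RefillSketch`, `refillOnce`, `atMost` and the `readAhead` flag are hypothetical names, the return value is the new end-of-data index instead of the boolean that zzRefill() returns, and buffer compaction, buffer growth and the zzFinalHighSurrogate bookkeeping are left out. It also assumes that the scanner decodes code points only up to the end-of-data index, which is modeled here with Character.codePointAt(char[], int, int).

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class RefillSketch {

  /** Hypothetical spoon-feeding wrapper: at most max chars per bulk read. */
  static Reader atMost(Reader in, int max) {
    return new FilterReader(in) {
      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
        return super.read(cbuf, off, Math.min(len, max));
      }
    };
  }

  /**
   * One simplified refill step. Returns the new end-of-data index
   * (the role zzEndRead plays in the skeleton).
   * readAhead == false mimics the original section, true mimics the custom one.
   */
  static int refillOnce(Reader reader, char[] buffer, int endRead, boolean readAhead)
      throws IOException {
    int requested = buffer.length - endRead;
    int numRead = reader.read(buffer, endRead, requested);
    if (numRead <= 0) {
      return endRead; // nothing was added
    }
    endRead += numRead;
    if (Character.isHighSurrogate(buffer[endRead - 1])) {
      if (numRead == requested) {
        // Buffer is full: hold the high surrogate back for the next refill.
        --endRead;
      } else if (readAhead) {
        // Short read, so there is room for one more char: fetch the low surrogate.
        int c = reader.read();
        if (c != -1) {
          buffer[endRead++] = (char) c;
        }
      }
    }
    return endRead;
  }

  public static void main(String[] args) throws IOException {
    String input = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'

    char[] original = new char[16];
    int end1 = refillOnce(atMost(new StringReader(input), 2), original, 0, false);
    // The pair is cut off at end1, so only the lone high surrogate is visible.
    System.out.println(Integer.toHexString(Character.codePointAt(original, 1, end1))); // d83d

    char[] custom = new char[16];
    int end2 = refillOnce(atMost(new StringReader(input), 2), custom, 0, true);
    // The extra single-char read completed the pair.
    System.out.println(Integer.toHexString(Character.codePointAt(custom, 1, end2))); // 1f600
  }
}
```

In this sketch the original logic leaves the buffer ending in an unpaired high surrogate after a short read, while the custom logic pulls in the low surrogate before handing the buffer back, so the full above-BMP code point is available for matching.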