A spoon-feeding Reader can result in the scanning buffer not being fully populated #538

@sarowe

I'm working on upgrading Lucene's JFlex dependency from 1.6.0 to 1.7.0 (https://issues.apache.org/jira/browse/LUCENE-8527), and ran across a problem with spoon-feeding Readers, i.e. Readers that return fewer chars than requested from read(char[]) even though further requests will return more chars. Previously such Readers were handled properly. Since the changes to the generated zzRefill() for issue #131 (released in JFlex 1.6.1), there are conditions involving above-BMP code points where the scanner buffer is not populated as fully as it could be, which results in improper matching. Lucene includes tests that send random data through JFlex-generated scanners using a spoon-feeding Reader; these tests succeeded with the JFlex 1.6.0-generated scanners but failed with the 1.7.0-generated ones.
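For illustration, a spoon-feeding Reader can be sketched as a thin wrapper that caps how many chars each read call hands out. The class name `SpoonFeedingReader` and the `maxChars` parameter are hypothetical; this is not Lucene's test reader, only a minimal stand-in with the same kind of behavior.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical wrapper: returns at most maxChars per read call, even if more input is available. */
class SpoonFeedingReader extends Reader {
  private final Reader in;
  private final int maxChars;

  SpoonFeedingReader(Reader in, int maxChars) {
    if (maxChars < 1) {
      throw new IllegalArgumentException("maxChars must be >= 1");
    }
    this.in = in;
    this.maxChars = maxChars;
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    // Returning fewer chars than requested is allowed by the java.io.Reader contract.
    return in.read(cbuf, off, Math.min(len, maxChars));
  }

  @Override
  public void close() throws IOException {
    in.close();
  }
}
```

With `new SpoonFeedingReader(new StringReader("a\uD83D\uDE00b"), 2)` and a 16-char buffer, the first read returns 2 chars and ends on a high surrogate while fewer chars than requested were delivered, which is the kind of short read involving an above-BMP code point described above.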

Below are the changes I made to the zzRefill() method in a custom skeleton for Lucene; they fixed the problem.

Here's the original section:

    if (numRead > 0) {
      zzEndRead += numRead;
      /* If numRead == requested, we might have requested too few chars to
         encode a full Unicode character. We assume that a Reader would
         otherwise never return half characters. */
      if (numRead == requested) {
        if (Character.isHighSurrogate(zzBuffer[zzEndRead - 1])) {
          --zzEndRead;
          zzFinalHighSurrogate = 1;
        }
      }

And here's the custom version that fixes the problem:

    if (numRead > 0) {
      zzEndRead += numRead;
      if (Character.isHighSurrogate(zzBuffer[zzEndRead - 1])) {
        if (numRead == requested) { // We might have requested too few chars to encode a full Unicode character.
          --zzEndRead;
          zzFinalHighSurrogate = 1;
          if (numRead == 1) {
            return true;
          }
        } else {                    // There is room in the buffer for at least one more char
          int c = zzReader.read();  // Expecting to read a low surrogate char
          if (c == -1) {
            return true;
          } else {
            zzBuffer[zzEndRead++] = (char)c;
            return false;
          }
        }
      }
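The short-read branch of the custom version can be tried in isolation with a small harness. This is only a sketch under stated assumptions: `RefillSketch`, `fillOnce`, and `shortReads` are hypothetical names, the harness is not JFlex code, and it deliberately leaves out the `numRead == requested` branch, `zzFinalHighSurrogate`, and the return values of zzRefill().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical harness, not JFlex code: mirrors only the short-read branch of the patch. */
class RefillSketch {

  /**
   * Reads once into buffer starting at endRead and returns the new end index.
   * If a short read ends on a high surrogate, one more char is pulled in
   * so that the surrogate pair is not split across two refills.
   */
  static int fillOnce(Reader reader, char[] buffer, int endRead) throws IOException {
    int requested = buffer.length - endRead;
    int numRead = reader.read(buffer, endRead, requested);
    if (numRead <= 0) {
      return endRead; // nothing new: end of stream, or no room left in the buffer
    }
    endRead += numRead;
    if (numRead < requested && Character.isHighSurrogate(buffer[endRead - 1])) {
      // There is room for at least one more char; expect the low surrogate.
      int c = reader.read();
      if (c != -1) {
        buffer[endRead++] = (char) c;
      }
    }
    return endRead;
  }

  /** Test double: a Reader that returns at most two chars per read call. */
  static Reader shortReads(String text) {
    StringReader in = new StringReader(text);
    return new Reader() {
      @Override
      public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, Math.min(len, 2));
      }

      @Override
      public void close() {
        in.close();
      }
    };
  }
}
```

Calling `RefillSketch.fillOnce(RefillSketch.shortReads("a\uD83D\uDE00b"), new char[16], 0)` leaves both halves of the surrogate pair in the buffer after a single call, instead of ending the populated region on a lone high surrogate.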
