3b76df1 has introduced a smart but more costly parsing: this could be improved as follow.
Here a suggestion to move to a single table lookup..
The per-byte Match[] scan has two costs: iterating N matchers per byte, and a double dependent load (match.get(value) → STATES_BY_ORDINAL[next]) where each load can't start until the previous resolves — worst case 3–5 serialised dependent loads on states with multiple matchers.
Both can be eliminated by flattening all transitions into a single byte[N_STATES * 256] and tracking state as a plain int. Hot path: one shift, one OR, one array load — no loop, no second dependent load. Footprint: 7 × 256 = 1,792 bytes in a single contiguous array vs ~1,320 bytes scattered across 30 heap objects (15 Match + 15 long[] backing arrays) with pointer chasing.
There is a correctness fix bundled here: the current matchers in ChunkExtValStart and ChunkExtValToken overlap on specific bytes (" at 0x22 and ; at 0x3B respectively). The original code is safe because it is first-match-wins. A naïve table compilation would be last-write-wins and silently produce wrong transitions for those bytes. The fix is to make the matchers explicitly disjoint by adding '"' and ';' to the relevant exclusion lists, and to enforce disjointness unconditionally at build time:
private static final byte[] TRANSITIONS = compile();
static byte[] compile() {
Match[][] stateMatchers = new Match[N_STATES][];
stateMatchers[SIZE] = new Match[]{
new Match(SIZE).chars("0123456789abcdefABCDEF \t"),
new Match(CHUNK_EXT_NAME).chars(";")
};
stateMatchers[CHUNK_EXT_NAME] = new Match[]{
new Match(CHUNK_EXT_NAME)
.range(0x21, 0x7E)
.chars(" \t")
.chars("(),/:<=>?@[\\]{}", false),
new Match(CHUNK_EXT_VAL_START).chars("=")
};
stateMatchers[CHUNK_EXT_VAL_START] = new Match[]{
new Match(CHUNK_EXT_VAL_START).chars(" \t"),
new Match(CHUNK_EXT_VAL_QUOTED).chars("\""),
// explicitly exclude '"' (0x22) — disjoint with the matcher above
new Match(CHUNK_EXT_VAL_TOKEN)
.range(0x21, 0x7E)
.chars("(),/:<=>?@[\\]{}\"", false)
};
stateMatchers[CHUNK_EXT_VAL_QUOTED] = new Match[]{
new Match(CHUNK_EXT_VAL_QUOTED_ESCAPE).chars("\\"),
new Match(CHUNK_EXT_VAL_QUOTED_END).chars("\""),
// 0x23-0x5B excludes '"' (0x22), 0x5D-0x7E excludes '\' (0x5C) — disjoint by range
new Match(CHUNK_EXT_VAL_QUOTED)
.chars("\t !")
.range(0x23, 0x5B)
.range(0x5D, 0x7E)
.range(0x80, 0xFF)
};
stateMatchers[CHUNK_EXT_VAL_QUOTED_ESCAPE] = new Match[]{
new Match(CHUNK_EXT_VAL_QUOTED)
.chars("\t ")
.range(0x21, 0x7E)
.range(0x80, 0xFF)
};
stateMatchers[CHUNK_EXT_VAL_QUOTED_END] = new Match[]{
new Match(CHUNK_EXT_VAL_QUOTED_END).chars("\t "),
new Match(CHUNK_EXT_NAME).chars(";")
};
stateMatchers[CHUNK_EXT_VAL_TOKEN] = new Match[]{
// explicitly exclude ';' (0x3B) — disjoint with the matcher below
new Match(CHUNK_EXT_VAL_TOKEN)
.range(0x21, 0x7E)
.chars("(),/:<=>?@[\\]{};", false),
new Match(CHUNK_EXT_NAME).chars(";")
};
byte[] table = new byte[N_STATES * 256];
Arrays.fill(table, (byte) -1);
for (int state = 0; state < N_STATES; state++) {
int base = state << 8;
for (Match m : stateMatchers[state]) {
for (int i = m.nextSetBit(0); i >= 0; i = m.nextSetBit(i + 1)) {
if (table[base | i] != -1) {
throw new IllegalStateException(
"overlapping matchers at byte 0x" + Integer.toHexString(i) +
" in state " + state);
}
table[base | i] = (byte) m.then;
}
}
}
return table;
}
private int state = SIZE;
@Override
public boolean process(byte value) {
int next = TRANSITIONS[state << 8 | (value & 0xFF)];
if (next < 0) {
if (state == SIZE) {
throw new NumberFormatException("Invalid chunk size");
}
throw new InvalidChunkExtensionException("Invalid chunk extension");
}
state = next;
return true;
}
& 0xFF on value is required since byte is signed and used as an array index.
Match objects and stateMatchers are GC'd after static init — zero runtime footprint.
- The
IllegalStateException in compile() is unconditional (not an assert) so overlap bugs are caught at class load regardless of JVM flags.
- to not cause regression on frameworks which relies on fast startup time I suggest to use the current matchers only via assert, but actually just populate the table with the values without allocating nor using the matchers themselves: this should be validated via a one shot JMH benchmark which cause the table allocation itself on the class
This should be able to squeeze the very last bit of performance of a stricter validation mechanism with a similar, but less scattered, overall memory footprint.
The original commit contains additional benchmarks which need to stress a bit more the branch predictor (which means need longer, configurable, randomic - with same seed - and more vary sequence of inputs, validating via -prof perfnorm if before/after the change the number of mispredict improve as the IPC, having less data dependent cache references)
3b76df1 has introduced a smart but more costly parsing: this could be improved as follow.
Here a suggestion to move to a single table lookup..
The per-byte
Match[]scan has two costs: iterating N matchers per byte, and a double dependent load (match.get(value)→STATES_BY_ORDINAL[next]) where each load can't start until the previous resolves — worst case 3–5 serialised dependent loads on states with multiple matchers.Both can be eliminated by flattening all transitions into a single
byte[N_STATES * 256]and tracking state as a plainint. Hot path: one shift, one OR, one array load — no loop, no second dependent load. Footprint: 7 × 256 = 1,792 bytes in a single contiguous array vs ~1,320 bytes scattered across 30 heap objects (15Match+ 15long[]backing arrays) with pointer chasing.There is a correctness fix bundled here: the current matchers in
ChunkExtValStartandChunkExtValTokenoverlap on specific bytes ("at 0x22 and;at 0x3B respectively). The original code is safe because it is first-match-wins. A naïve table compilation would be last-write-wins and silently produce wrong transitions for those bytes. The fix is to make the matchers explicitly disjoint by adding'"'and';'to the relevant exclusion lists, and to enforce disjointness unconditionally at build time:& 0xFFonvalueis required sincebyteis signed and used as an array index.Matchobjects andstateMatchersare GC'd after static init — zero runtime footprint.IllegalStateExceptionincompile()is unconditional (not an assert) so overlap bugs are caught at class load regardless of JVM flags.This should be able to squeeze the very last bit of performance of a stricter validation mechanism with a similar, but less scattered, overall memory footprint.
The original commit contains additional benchmarks which need to stress a bit more the branch predictor (which means need longer, configurable, randomic - with same seed - and more vary sequence of inputs, validating via -prof perfnorm if before/after the change the number of mispredict improve as the IPC, having less data dependent cache references)