memory performance: two-level cmap lookup by lsf37 · Pull Request #697 · jflex-de/jflex

lsf37 · 2020-01-01T08:52:16Z

This addresses the memory issue #199. With more recent Unicode versions, the
character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never
accessed, but it still increases overall memory consumption esp for multiple
scanners.

This PR:

changes the current one-level array to a two-level table structure
adds quickcheck tests for the table construction
updates runtime engine to use that structure
updates skeleton files to remove reference to old ZZ_CMAP
removes trailing white space in generated template code

Typical memory consumption for the char map decreases from ~4MB to < 100KB,
for simple scanners (little unicode use > 0xFF) to ~20KB.

Even though this increases the number of operations in the innermost loop,
with a bit of luck performance might actually benefit because of better
cache locality. Benchmarking still to be done.

This addresses the memory issue #199. With more recent Unicode versions, the character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never accessed, but it still increases overall memory consumption esp for multiple scanners. This commit: * changes the current one-level array to a two-level table structure * adds quickcheck tests for the table construction * updates runtime engine to use that structure * updates skeleton files to remove reference to old ZZ_CMAP * removes trailing white space in generated template code Typical memory consumption for the char map decreases from ~4MB to < 100KB, for simple scanners (little unicode use > 0xFF) to ~20KB. Even though this increases the number of operations in the innermost loop, with a bit of luck performance might actually benefit because of better cache locality. Benchmarking still to be done.

Calling getClassCode for each code point cost too much generator performance (10%-20% slower test suite). Going through the intervals incrementally speeds this up again and has no observable generator performance difference to the old single-level table setup.

One of the shrunk instances in a failed test looked like it had overlapping classes, which shouldn't be possible. Couldn't reproduce, but if it does happen, this should catch it.

lsf37 · 2020-01-01T10:23:21Z

This solution is similar to, but (hopefully) simpler than the one implemented for the IntelliJ plugins in #199, with similar memory improvements.

It does not really address the ArrayIndexOutOfBounds problem, but this has already disappeared with @sarowe's work on more recent Unicode versions. For %unicode 2.0, you will still get an exception for code points >= 0xFFFF, but that is as it should be, because that Unicode version doesn't have these characters. The later versions (including the default just %unicode) that do have higher code points don't throw the exception. This is similar to %7bit and %8bit scanners throwing the same exception for higher characters.

There is the question whether this is the best behaviour, i.e. one could throw a different exception or report a more user-friendly error, but that is a separate question.

lsf37 · 2020-01-01T10:25:42Z

The trailing whitespace removal is a bit out of place here -- it slipped in because my editor is set to stripping whitespace, which I forgot about. I think it's nicer to have them removed, though, so I left it.

lsf37 · 2020-01-01T10:26:56Z

ps: huge kudos to @sarowe for the excellent unicode test suite coverage. This and the new quickcheck setup helped a lot in getting that feature implemented much faster than usual.

lsf37 · 2020-01-03T05:38:48Z

Will merge this one now, because I have too many things stacking up, but am still interested in feedback if there is time.

Gerwin Klein and others added 3 commits January 1, 2020 19:17

assert invariants in quickcheck shrinking

bab86ff

One of the shrunk instances in a failed test looked like it had overlapping classes, which shouldn't be possible. Couldn't reproduce, but if it does happen, this should catch it.

lsf37 requested a review from sarowe as a code owner January 1, 2020 08:52

lsf37 added this to the 1.8.0 milestone Jan 1, 2020

lsf37 added enhancement Feature requests performance labels Jan 1, 2020

lsf37 self-assigned this Jan 1, 2020

lsf37 requested a review from regisd January 1, 2020 10:15

This was referenced Jan 1, 2020

Performance benchmarking suite #698

Open

%unicode 2.0 lexers throw IOOBE on input with surrogate chars #199

Closed

lsf37 merged commit fe59529 into master Jan 3, 2020

lsf37 deleted the cmap-block branch January 3, 2020 05:38

lsf37 mentioned this pull request Jan 9, 2020

Benchmark #711

Merged

asfimport mentioned this pull request Mar 22, 2022

upgrade jflex (1.7.0 -> 1.8.2) [LUCENE-10239] apache/lucene#11275

Closed

lsf37 mentioned this pull request Jan 22, 2023

Reduce RAM usage of the char->char class map #196

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

memory performance: two-level cmap lookup#697

memory performance: two-level cmap lookup#697
lsf37 merged 3 commits intomasterfrom
cmap-block

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 3, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 1, 2020

Uh oh!

lsf37 commented Jan 3, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants