This addresses the memory issue #199. With more recent Unicode versions, the character map took 0x110000 * 4 bytes, i.e. ~4MB. Most of this is never accessed, but it still increases overall memory consumption, especially for multiple scanners.

This commit:

* changes the current one-level array to a two-level table structure
* adds quickcheck tests for the table construction
* updates runtime engine to use that structure
* updates skeleton files to remove reference to old ZZ_CMAP
* removes trailing white space in generated template code

Typical memory consumption for the char map decreases from ~4MB to < 100KB, for simple scanners (little unicode use > 0xFF) to ~20KB.

Even though this increases the number of operations in the innermost loop, with a bit of luck performance might actually benefit because of better cache locality. Benchmarking still to be done.
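For readers unfamiliar with the technique, here is a minimal sketch of a generic two-level character map. It is not the code in this PR: the class name `TwoLevelCharMap`, the 256-entry block size, and the use of `int` for both levels are assumptions chosen for illustration. The idea is to split the flat table into fixed-size blocks, store each distinct block once, and keep a small top-level array of block offsets.

```java
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Illustrative two-level table; names and block size are assumptions, not JFlex's. */
final class TwoLevelCharMap {
  static final int BLOCK_BITS = 8;               // assumed: 256 code points per block
  static final int BLOCK_SIZE = 1 << BLOCK_BITS;
  static final int TABLE_SIZE = 0x110000;        // all Unicode code points

  final int[] top;     // per block: offset of that block's data in `blocks`
  final int[] blocks;  // distinct blocks, stored back to back

  private TwoLevelCharMap(int[] top, int[] blocks) {
    this.top = top;
    this.blocks = blocks;
  }

  /** Builds the two-level form from a flat code point -> class table. */
  static TwoLevelCharMap build(int[] flat) {
    if (flat.length != TABLE_SIZE) {
      throw new IllegalArgumentException("expected " + TABLE_SIZE + " entries");
    }
    int numBlocks = TABLE_SIZE >> BLOCK_BITS;
    int[] top = new int[numBlocks];
    int[] blocks = new int[TABLE_SIZE];           // worst case, trimmed below
    int used = 0;
    // IntBuffer compares and hashes its remaining elements, so a wrapped
    // slice of `flat` can serve as a map key for block contents.
    Map<IntBuffer, Integer> seen = new HashMap<>();
    for (int b = 0; b < numBlocks; b++) {
      int start = b << BLOCK_BITS;
      IntBuffer key = IntBuffer.wrap(flat, start, BLOCK_SIZE);
      Integer offset = seen.get(key);
      if (offset == null) {                       // first time we see this block
        offset = used;
        System.arraycopy(flat, start, blocks, used, BLOCK_SIZE);
        used += BLOCK_SIZE;
        seen.put(key, offset);
      }
      top[b] = offset;
    }
    return new TwoLevelCharMap(top, Arrays.copyOf(blocks, used));
  }

  /** Two array reads instead of one: the extra work in the inner loop. */
  int classOf(int codePoint) {
    return blocks[top[codePoint >> BLOCK_BITS] + (codePoint & (BLOCK_SIZE - 1))];
  }
}
```

Because most blocks of a typical table are identical (for example, all mapped to one default class), they collapse into a single shared block, which is where the memory saving comes from.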
Calling getClassCode for each code point cost too much generator performance (10%-20% slower test suite). Going through the intervals incrementally speeds this up again and has no observable generator performance difference to the old single-level table setup.
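As a rough illustration of the difference, the sketch below contrasts a per-code-point lookup with a single pass over the intervals. It assumes sorted, disjoint intervals that cover the whole range; the `Interval` class and the binary-search `classCodeOf` are hypothetical stand-ins, not the actual `getClassCode` implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only; Interval and classCodeOf are hypothetical stand-ins. */
final class IntervalFill {
  /** Code points start..end (inclusive) belong to charClass. */
  static final class Interval {
    final int start;
    final int end;
    final int charClass;

    Interval(int start, int end, int charClass) {
      this.start = start;
      this.end = end;
      this.charClass = charClass;
    }
  }

  /** Per-code-point lookup: a binary search over the sorted intervals. */
  static int classCodeOf(List<Interval> sorted, int codePoint) {
    int lo = 0;
    int hi = sorted.size() - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      Interval iv = sorted.get(mid);
      if (codePoint < iv.start) {
        hi = mid - 1;
      } else if (codePoint > iv.end) {
        lo = mid + 1;
      } else {
        return iv.charClass;
      }
    }
    throw new IllegalArgumentException("code point not covered: " + codePoint);
  }

  /** Slow variant: one lookup for every code point. */
  static int[] fillPerCodePoint(List<Interval> sorted, int size) {
    int[] table = new int[size];
    for (int cp = 0; cp < size; cp++) {
      table[cp] = classCodeOf(sorted, cp);
    }
    return table;
  }

  /** Incremental variant: visit each interval once and fill its whole range. */
  static int[] fillIncrementally(List<Interval> sorted, int size) {
    int[] table = new int[size];
    for (Interval iv : sorted) {
      Arrays.fill(table, iv.start, iv.end + 1, iv.charClass);
    }
    return table;
  }
}
```

Both variants produce the same table; the incremental one does work proportional to the table size plus the number of intervals, with no search per code point.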
One of the shrunk instances in a failed test looked like it had overlapping classes, which shouldn't be possible. Couldn't reproduce, but if it does happen, this should catch it.
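A defensive check of this kind could look roughly like the following. This is a sketch under assumptions, not the check added in the commit: the class name `OverlapCheck`, the `int[]{start, end}` representation with inclusive bounds, and the exception type are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative overlap check; names and representation are assumptions. */
final class OverlapCheck {
  /** Each interval is int[]{start, end} with inclusive bounds. */
  static void requireDisjoint(List<int[]> intervals) {
    List<int[]> sorted = new ArrayList<>(intervals);
    sorted.sort(Comparator.comparingInt((int[] iv) -> iv[0]));
    // Once sorted by start, any overlap shows up between neighbours.
    for (int i = 1; i < sorted.size(); i++) {
      int[] prev = sorted.get(i - 1);
      int[] cur = sorted.get(i);
      if (cur[0] <= prev[1]) {
        throw new IllegalStateException(
            "overlapping intervals: [" + prev[0] + "," + prev[1] + "] and ["
                + cur[0] + "," + cur[1] + "]");
      }
    }
  }
}
```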

This solution is similar to, but (hopefully) simpler than the one implemented for the IntelliJ plugins in #199, with similar memory improvements. It does not really address the There is the question whether this is the best behaviour, i.e. one could throw a different exception or report a more user-friendly error, but that is a separate question.

The trailing whitespace removal is a bit out of place here -- it slipped in because my editor is set to strip whitespace, which I forgot about. I think it's nicer to have it removed, though, so I left it.

PS: huge kudos to @sarowe for the excellent unicode test suite coverage. This and the new quickcheck setup helped a lot in getting that feature implemented much faster than usual.

Will merge this one now, because I have too many things stacking up, but am still interested in feedback if there is time.