Also added pregenerated Tatsu parser and made Tatsu optional.
…ng on ContentLine name
…include always correct tatsu for testing
71883a1 to cd97f4b
I ran some performance tests with
The results can be summarized as follows:
See b85e0a4 for the code used for testing. Note that I didn't test much with invalid data, but I think the results for the good case are clear enough (and tatsu is even more prone to spending too much time on broken inputs). Supporting three different parser implementations and letting users choose among them adds a lot of complexity on both sides. As the handwritten parser is faster than the other two, I decided to ship only that one to users, drastically simplifying the code in cd97f4b. The …
In addition to optimizing the Tatsu grammar and precompiling it, I included two further alternative parsers in this branch: one that is based on handwritten character searching (using the fast `str.index` and `str.find` functions without unnecessarily slicing substrings) and one that is based on the alternative `regex` module and its full history of group captures in `regex.Match.captures` (while `re.Match.group` only stores the latest capture of each group). Some first tests indicate that both outperform tatsu by a similar margin, with the optimized tatsu taking roughly a minute with the reference file I generated for #244 and both alternative parsers taking roughly a second. The profiler also indicated that this is close to the possible optimum, as Python object initialization and access consumed the biggest part of that one second. I'll run some more (performance) tests, also test with malformed input, and then report back to decide how to use these parsers in v0.8.

Note: the `regex` parser currently doesn't work on pypy3 due to this Unicode bug somewhere in those two components.
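To illustrate the two ideas mentioned above, here is a minimal sketch; it is not the parser from this branch. `split_content_line` and `last_capture_only` are hypothetical helpers, and the simplified `NAME;PARAM=VALUE:VALUE` line format (no quoting, no escaping) is an assumption made for brevity. The first function scans with `str.index` and `str.find` using start and end offsets, so substrings are only sliced for the final results. The second shows why the standard `re` module alone is not enough for repeated groups.

```python
import re


def split_content_line(line):
    """Split 'NAME;PARAM=VAL;...:VALUE' using index arithmetic only.

    Hypothetical helper for illustration; quoted parameter values and
    escaping are deliberately ignored.
    """
    colon = line.index(":")  # raises ValueError if there is no value part
    name_end = line.find(";", 0, colon)
    if name_end == -1:  # no parameters present
        name_end = colon
    name = line[:name_end]

    params = []
    pos = name_end
    while pos < colon:
        # pos points at the ';' that starts the next parameter
        nxt = line.find(";", pos + 1, colon)
        if nxt == -1:
            nxt = colon
        eq = line.find("=", pos + 1, nxt)
        if eq == -1:
            raise ValueError(f"parameter without '=' at offset {pos}")
        params.append((line[pos + 1:eq], line[eq + 1:nxt]))
        pos = nxt
    return name, params, line[colon + 1:]


def last_capture_only(text):
    """Show that re keeps only the last capture of a repeated group."""
    m = re.fullmatch(r"(?:;(\w+))*", text)
    # With the third-party regex module, m.captures(1) would instead
    # return every capture of group 1, not just the last one.
    return m.group(1)
```

Passing the search bounds to `str.find` instead of slicing first avoids creating a temporary string for every lookup, which is presumably where the handwritten approach gains part of its speed.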