
Contentline parser #247

Merged
N-Coder merged 10 commits into master from contentline-parser on Dec 13, 2020

Conversation


@N-Coder (Member) commented Jun 1, 2020

In addition to optimizing the Tatsu grammar and precompiling it, I included two further alternative parsers in this branch: one based on handwritten character searching (using the fast str.index and str.find functions without unnecessarily slicing substrings), and one based on the alternative regex module and its full history of group captures in regex.Match.captures (while re.Match.group only stores the latest capturing group). Some first tests indicate that both similarly outperform Tatsu: with the reference file I generated for #244, the optimized Tatsu parser takes roughly a minute, while both alternative parsers take roughly a second. The profiler also indicated that this is close to the possible optimum, as Python object initialization and access consumed the biggest part of that one second. I'll run some more (performance) tests, also test with malformed input, and then report back to decide how to use these parsers in v0.8.

Note: the regex parser currently doesn't work on PyPy3 due to a unicode bug somewhere between those two components.
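To illustrate the character-searching approach, here is a minimal sketch (not the actual implementation from this branch) of splitting an iCalendar content line with str.index and str.find, slicing each field only once:

```python
def parse_contentline(line):
    """Split a content line like NAME;KEY=VAL;...:VALUE into (name, params, value).

    A simplified sketch: quoting, escaping, and folding are ignored for brevity.
    Delimiters are located with str.index / str.find instead of regex machinery.
    """
    colon = line.index(":")            # the value starts after the first ':'
    semi = line.find(";", 0, colon)    # parameters start after the first ';', if any
    if semi == -1:
        name, params = line[:colon], {}
    else:
        name, params = line[:semi], {}
        pos = semi + 1
        while pos < colon:
            nxt = line.find(";", pos, colon)   # next parameter, or -1 for the last one
            end = nxt if nxt != -1 else colon
            eq = line.index("=", pos, end)     # split KEY=VAL at the '='
            params[line[pos:eq]] = line[eq + 1:end]
            pos = end + 1
    return name, params, line[colon + 1:]

print(parse_contentline("DTSTART;TZID=Europe/Berlin:20200601T120000"))
```

Each field boundary is found by index arithmetic on the original string, so the only slices taken are the final name, parameter, and value substrings.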

@N-Coder N-Coder added this to the Version 0.8 milestone Jun 1, 2020
@N-Coder force-pushed the contentline-parser branch from 71883a1 to cd97f4b on October 17, 2020 16:44

@N-Coder (Member, Author) commented Oct 18, 2020

I ran some performance tests with

  • three different parsers: handwritten, tatsu, and regex-based
  • reading the input string from memory or harddisk
  • iterating over the input line-wise or reading it as one big string
  • using the built-in re module (only for handwritten and tatsu) or using the extended regex module
  • using CPython or PyPy
  • using a precompiled tatsu grammar or loading it from the ebnf file during runtime.
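For illustration, the line-wise vs. whole-string dimension can be sketched with a toy harness like the following (the parsers and sample data here are placeholders, not the actual benchmark code):

```python
import io
import timeit

# Placeholder input: a repeated minimal VEVENT, standing in for the
# reference file from #244.
SAMPLE = "BEGIN:VEVENT\r\nSUMMARY:Test\r\nEND:VEVENT\r\n" * 1000

def parse_whole(text):
    # "one big string": split once, then process every line
    return [line for line in text.splitlines() if line]

def parse_linewise(stream):
    # "line-wise": iterate the stream without materializing the whole text
    return [line.rstrip("\r\n") for line in stream if line.strip()]

whole = timeit.timeit(lambda: parse_whole(SAMPLE), number=100)
linewise = timeit.timeit(lambda: parse_linewise(io.StringIO(SAMPLE)), number=100)
print(f"whole-string: {whole:.3f}s  line-wise: {linewise:.3f}s")
```

Swapping io.StringIO for an open file handle covers the memory-vs-harddisk dimension with the same parser functions.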

The results can be summarized as follows:

  • Tatsu is by far the slowest. The fastest Tatsu run was line-wise reading with a precompiled grammar on PyPy, which took about 7 seconds. Switching to CPython or reading the file as a whole each adds an order of magnitude to that (yep, the fastest Tatsu run on CPython was 49 s), while not precompiling only adds a minor overhead. The other parsers all took less than 5 s no matter which interpreter was used.
  • Using the built-in re module is a little faster than the extended regex module, and it doesn't fail on weird unicode symbols (see the regex bug mentioned above). Similarly, the handwritten parser is a little faster than the one relying on the extended regex functionality.
  • Reading line-wise instead of the whole file at once is marginally faster and, unsurprisingly, needs a little less memory, though the numbers vary a bit depending on whether we read from memory or from disk.
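The capture-group limitation mentioned in the opening comment, which made the third-party regex module (and its Match.captures method) necessary for the regex-based parser, can be demonstrated with the built-in re module alone:

```python
import re

# When a capturing group sits inside a repetition, the built-in re module
# keeps only the *last* thing the group matched -- earlier repetitions
# (here: TZID and VALUE) are lost. The regex module's Match.captures
# returns all of them, which is what a contentline parser needs to
# collect every parameter.
m = re.match(r"(?:;(\w+))*", ";TZID;VALUE;LANGUAGE")
print(m.group(1))  # -> 'LANGUAGE'
```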

See b85e0a4 for the code used for testing. Note that I didn't test much with invalid data, but I guess the results for the good case are clear enough (and Tatsu is even more prone to spending too much time on broken inputs).

Supporting three different parser implementations and letting the user choose between them would add a lot of complexity on both sides. As the handwritten parser is the fastest of the three, I decided to ship only that one, drastically simplifying the code in cd97f4b. The regex parser is no longer needed, and the Tatsu parser has been moved to the testing code so that we can use it as a reference when parsing files from the corpus.

@N-Coder marked this pull request as ready for review October 18, 2020 12:20
@N-Coder merged commit 8491556 into master Dec 13, 2020
@N-Coder deleted the contentline-parser branch December 13, 2020 10:27