
Contentline parser #247

Merged
N-Coder merged 10 commits into master from contentline-parser on Dec 13, 2020

Conversation


@N-Coder (Member) commented Jun 1, 2020

In addition to optimizing the Tatsu grammar and precompiling it, I included two further alternative parsers in this branch: one based on handwritten character searching (using the fast str.index and str.find functions without unnecessarily slicing substrings), and one based on the alternative regex module and its full history of group captures in regex.Match.captures (while re.Match.group only stores the latest capturing group). Some first tests indicate that both similarly outperform Tatsu: with the reference file I generated for #244, the optimized Tatsu parser takes roughly a minute, while both alternative parsers take roughly a second. The profiler also indicated that this is close to the possible optimum, as Python object initialization and access consumed the biggest part of that one second. I'll run some more (performance) tests, also test with malformed input, and then report back to decide how to use these parsers in v0.8.

Note: the regex parser currently doesn't work on PyPy3 due to a unicode bug somewhere between those two components.
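To illustrate the character-searching approach, here is a minimal sketch (not the actual implementation from this branch) of splitting an iCalendar content line with str.index and str.find, slicing each field only once:

```python
def parse_contentline(line):
    """Split a content line like NAME;KEY=VAL;...:VALUE into (name, params, value).

    A simplified sketch: quoting, escaping, and folding are ignored for brevity.
    Delimiters are located with str.index / str.find instead of regex machinery.
    """
    colon = line.index(":")            # the value starts after the first ':'
    semi = line.find(";", 0, colon)    # parameters start after the first ';', if any
    if semi == -1:
        name, params = line[:colon], {}
    else:
        name, params = line[:semi], {}
        pos = semi + 1
        while pos < colon:
            nxt = line.find(";", pos, colon)   # next parameter, or -1 for the last one
            end = nxt if nxt != -1 else colon
            eq = line.index("=", pos, end)     # split KEY=VAL at the '='
            params[line[pos:eq]] = line[eq + 1:end]
            pos = end + 1
    return name, params, line[colon + 1:]

print(parse_contentline("DTSTART;TZID=Europe/Berlin:20200601T120000"))
```

Each field boundary is found by index arithmetic on the original string, so the only slices taken are the final name, parameter, and value substrings.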

@N-Coder N-Coder added this to the Version 0.8 milestone Jun 1, 2020
@N-Coder force-pushed the contentline-parser branch from 71883a1 to cd97f4b on October 17, 2020 16:44

@N-Coder (Member, Author) commented Oct 18, 2020

I ran some performance tests with

  • three different parsers: handwritten, tatsu, and regex-based
  • reading the input string from memory or harddisk
  • iterating over the input line-wise or reading it as one big string
  • using the built-in re module (only for handwritten and tatsu) or using the extended regex module
  • using CPython or PyPy
  • using a precompiled tatsu grammar or loading it from the ebnf file during runtime.
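For illustration, the line-wise vs. whole-string dimension can be sketched with a toy harness like the following (the parsers and sample data here are placeholders, not the actual benchmark code):

```python
import io
import timeit

# Placeholder input: a repeated minimal VEVENT, standing in for the
# reference file from #244.
SAMPLE = "BEGIN:VEVENT\r\nSUMMARY:Test\r\nEND:VEVENT\r\n" * 1000

def parse_whole(text):
    # "one big string": split once, then process every line
    return [line for line in text.splitlines() if line]

def parse_linewise(stream):
    # "line-wise": iterate the stream without materializing the whole text
    return [line.rstrip("\r\n") for line in stream if line.strip()]

whole = timeit.timeit(lambda: parse_whole(SAMPLE), number=100)
linewise = timeit.timeit(lambda: parse_linewise(io.StringIO(SAMPLE)), number=100)
print(f"whole-string: {whole:.3f}s  line-wise: {linewise:.3f}s")
```

Swapping io.StringIO for an open file handle covers the memory-vs-harddisk dimension with the same parser functions.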

The results can be summarized as follows:

  • Tatsu is by far the slowest. The fastest Tatsu run was line-wise reading with a precompiled grammar on PyPy, which took about 7 seconds. Switching to CPython or reading the file as a whole each adds an order of magnitude to that (yep, the fastest Tatsu run on CPython was 49 s), while not precompiling only adds a minor overhead. The other parsers all took less than 5 s no matter which interpreter was used.
  • Using the built-in re module is a little faster than the extended regex module, and it doesn't fail on weird unicode symbols (see the regex bug mentioned above). Similarly, the handwritten parser is a little faster than the one relying on the extended regex functionality.
  • Reading line-wise instead of the whole file at once is marginally faster and, unsurprisingly, needs a little less memory, though the numbers vary a bit depending on whether we read from memory or from disk.
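The capture-group limitation mentioned in the opening comment, which made the third-party regex module (and its Match.captures method) necessary for the regex-based parser, can be demonstrated with the built-in re module alone:

```python
import re

# When a capturing group sits inside a repetition, the built-in re module
# keeps only the *last* thing the group matched -- earlier repetitions
# (here: TZID and VALUE) are lost. The regex module's Match.captures
# returns all of them, which is what a contentline parser needs to
# collect every parameter.
m = re.match(r"(?:;(\w+))*", ";TZID;VALUE;LANGUAGE")
print(m.group(1))  # -> 'LANGUAGE'
```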

See b85e0a4 for the code used for testing. Note that I didn't test much with invalid data, but I guess the results for the good case are clear enough (and Tatsu is even more prone to spending too much time on broken inputs).

Supporting three different parser implementations and letting the user choose between them would add a lot of complexity on both sides. As the handwritten parser is the fastest of the three, I decided to ship only that one, drastically simplifying the code in cd97f4b. The regex parser is no longer needed, and the Tatsu parser has been moved to the testing code so that we can use it as a reference when parsing files from the corpus.

@N-Coder marked this pull request as ready for review October 18, 2020 12:20
@N-Coder merged commit 8491556 into master Dec 13, 2020
@N-Coder deleted the contentline-parser branch December 13, 2020 10:27