Pyparsing | Things Python

Repetition in pyparsing – why is it so dumb?

Posted by ptmcg in Pyparsing on June 1, 2024

This issue was raised recently on the pyparsing Github issues, and it is a common occurrence, especially for those using pyparsing for the first time.

The problem is, why doesn’t pyparsing recognize a terminating word of a sequence of words? I use this example in the wiki, parsing a string of words that start with the word “start”, followed by zero or more words, and end with the word “end”.

The obvious implementation should look like:

import pyparsing as pp
word = pp.Word(pp.alphas)
phrase = "start" + pp.ZeroOrMore(word) + "end"

Using the railroad diagramming feature added in pyparsing 3.0, this looks like:

But this doesn’t work (and perhaps, looking at the diagram, you can already see why). When we try to parse “start to be or not to be end”, we get an error:

phrase.parse_string("start to be or not to be end")

Traceback (most recent call last):
    ...
pyparsing.exceptions.ParseException: Expected 'end', found end of text (at char 28), (line:1, col:29)

??? It’s right there! Why didn’t pyparsing see it?

The issue is, pyparsing did see it, but treated it as part of the ZeroOrMore(word) repetition, since “end” is a perfectly valid word. Going through this step by step, pyparsing:
1. parsed the word “start”
2. parsed every word of alphas, “to”, “be”, “or”, “not”, “to”, “be”, and “end”
3. tried to parse the terminating “end”, but that had already been read as part of the ZeroOrMore, and so raised a ParseException, effectively saying “I looked for ‘end’, but there was no more text in the string.”

pyparsing parses purely left-to-right in the input text, and (unlike regular expressions) does not backtrack over already matched words in case a following term might have been accidentally used in the repetition.

The solution begins by more precisely defining what we meant in our original problem statement when we used the phrase “one or more words”. What we really meant was “one or more words that are not the word ‘end'” or “one or more words, but first check if a word is ‘end’ and if it is, stop”. This is a wordy way of describing a negative lookahead – something the parser should look for ahead of the current parse position, and only continue if it does not find that something.

pyparsing has a class that performs a negative lookahead: NotAny. To clarify our ZeroOrMore expression, we use them to verify that we are not about to parse the word “end” when parsing the contents of our phrase:

phrase = "start" + pp.ZeroOrMore(NotAny(Literal("end")) + word) + "end"

Oof, that is a lot! Since the ‘~’ operator is a unary operator that gives a pyparsing NotAny term, we can write this a little more succinctly as:

phrase = "start" + pp.ZeroOrMore(~Literal("end") + word) + "end"

This is such a common occurrence that I added the stop_on parameter to the ZeroOrMore and OneOrMore classes, so that the negative lookahead can be given without having to bring in other classes:

phrase = "start" + pp.ZeroOrMore(word, stop_on="end") + "end"

And our diagram now looks like:

Wait! That is still the old diagram. In the next release of pyparsing, we’ll see the corrected diagram, showing how stop_on emits the negative lookahead terms:

Lastly, you can write this very compactly using the new slice notation for repetition:

phrase = "start" + word[...:"end"] + "end"

Here the ... (called ellipsis) represents the concept of “zero or more”, and the :"end" part represents the similar concept of a terminating element when slicing a string or list – that is for a string
s = "abcdef", where characters start with “a” at index 0, “b” at index 1 and so on, s[0:3] returns "abc", characters 0 through 2 of string s, but not including character 3 (“d”). Just this same way, word[...:"end"] indicates “zero or more word expressions, up to but not including “end”.

Here is all the code used in this article (including the added lines to generate the railroad diagram):

import pyparsing as pp
word = pp.Word(pp.alphas)
# won't work
phrase = "start" + pp.ZeroOrMore(word) + "end"

# add negative lookahead - these all do the same thing
phrase = "start" + pp.ZeroOrMore(NotAny(Literal("end")) + word) + "end"
phrase = "start" + pp.ZeroOrMore(~Literal("end") + word) + "end"
phrase = "start" + pp.ZeroOrMore(word, stop_on="end) + "end"
phrase = "start" + word[...: "end"] + "end"

# have pyparsing find all defined ParserElements in this scope, and call
# set_name() on them, so that they get labeled nicely in the diagram
pp.autoname_elements()
phrase.create_diagram("issue_558.html")

phrase.parse_string("start to be or not to be end")

Pyparsing, python, railroad-diagrams

Things Python

Archive for category Pyparsing

Repetition in pyparsing – why is it so dumb?

Recent Posts

Archives

Categories

Meta