Posts Tagged Pyparsing

Repetition in pyparsing – why is it so dumb?

Posted by ptmcg in Pyparsing on June 1, 2024

This issue was raised recently on the pyparsing Github issues, and it is a common occurrence, especially for those using pyparsing for the first time.

The problem is, why doesn’t pyparsing recognize a terminating word of a sequence of words? I use this example in the wiki, parsing a string of words that start with the word “start”, followed by zero or more words, and end with the word “end”.

The obvious implementation should look like:

import pyparsing as pp
word = pp.Word(pp.alphas)
phrase = "start" + pp.ZeroOrMore(word) + "end"

Using the railroad diagramming feature added in pyparsing 3.0, this looks like:

But this doesn’t work (and perhaps, looking at the diagram, you can already see why). When we try to parse “start to be or not to be end”, we get an error:

phrase.parse_string("start to be or not to be end")

Traceback (most recent call last):
    ...
pyparsing.exceptions.ParseException: Expected 'end', found end of text (at char 28), (line:1, col:29)

??? It’s right there! Why didn’t pyparsing see it?

The issue is, pyparsing did see it, but treated it as part of the ZeroOrMore(word) repetition, since “end” is a perfectly valid word. Going through this step by step, pyparsing:
1. parsed the word “start”
2. parsed every word of alphas, “to”, “be”, “or”, “not”, “to”, “be”, and “end”
3. tried to parse the terminating “end”, but that had already been read as part of the ZeroOrMore, and so raised a ParseException, effectively saying “I looked for ‘end’, but there was no more text in the string.”

pyparsing parses purely left-to-right in the input text, and (unlike regular expressions) does not backtrack over already matched words in case a following term might have been accidentally used in the repetition.

The solution begins by more precisely defining what we meant in our original problem statement when we used the phrase “one or more words”. What we really meant was “one or more words that are not the word ‘end'” or “one or more words, but first check if a word is ‘end’ and if it is, stop”. This is a wordy way of describing a negative lookahead – something the parser should look for ahead of the current parse position, and only continue if it does not find that something.

pyparsing has a class that performs a negative lookahead: NotAny. To clarify our ZeroOrMore expression, we use them to verify that we are not about to parse the word “end” when parsing the contents of our phrase:

phrase = "start" + pp.ZeroOrMore(NotAny(Literal("end")) + word) + "end"

Oof, that is a lot! Since the ‘~’ operator is a unary operator that gives a pyparsing NotAny term, we can write this a little more succinctly as:

phrase = "start" + pp.ZeroOrMore(~Literal("end") + word) + "end"

This is such a common occurrence that I added the stop_on parameter to the ZeroOrMore and OneOrMore classes, so that the negative lookahead can be given without having to bring in other classes:

phrase = "start" + pp.ZeroOrMore(word, stop_on="end") + "end"

And our diagram now looks like:

Wait! That is still the old diagram. In the next release of pyparsing, we’ll see the corrected diagram, showing how stop_on emits the negative lookahead terms:

Lastly, you can write this very compactly using the new slice notation for repetition:

phrase = "start" + word[...:"end"] + "end"

Here the ... (called ellipsis) represents the concept of “zero or more”, and the :"end" part represents the similar concept of a terminating element when slicing a string or list – that is for a string
s = "abcdef", where characters start with “a” at index 0, “b” at index 1 and so on, s[0:3] returns "abc", characters 0 through 2 of string s, but not including character 3 (“d”). Just this same way, word[...:"end"] indicates “zero or more word expressions, up to but not including “end”.

Here is all the code used in this article (including the added lines to generate the railroad diagram):

import pyparsing as pp
word = pp.Word(pp.alphas)
# won't work
phrase = "start" + pp.ZeroOrMore(word) + "end"

# add negative lookahead - these all do the same thing
phrase = "start" + pp.ZeroOrMore(NotAny(Literal("end")) + word) + "end"
phrase = "start" + pp.ZeroOrMore(~Literal("end") + word) + "end"
phrase = "start" + pp.ZeroOrMore(word, stop_on="end) + "end"
phrase = "start" + word[...: "end"] + "end"

# have pyparsing find all defined ParserElements in this scope, and call
# set_name() on them, so that they get labeled nicely in the diagram
pp.autoname_elements()
phrase.create_diagram("issue_558.html")

phrase.parse_string("start to be or not to be end")

Pyparsing, python, railroad-diagrams

Pyparsing Walkthrough – Letter Ranges

Posted by ptmcg in Uncategorized on January 1, 2021

This post is hopefully the first of several to walk through the development of some simple pyparsing parsers, to demonstrate how to make use of pyparsing to compose complex parsers from simple building-blocks.

This walkthrough will address a hypothetical problem: given a string of alphabetic characters, expand any ranges to return a list of all the individual characters.

For instance given:

A-C,X,M-P,Z

return the list of individual characters:

A,B,C,X,M,N,O,P,Z

For simplicity, we will limit ourselves to the ASCII alphabetic characters A-Z. At the end, I’ll suggest some enhancements that you’ll be able to tackle on your own.

Parsing Options in Python

Python offers several built-in mechanisms for slicing and dicing text strings:

str methods – the Python str class itself includes a number of methods such as split, partition, startswith, and endswith that are suitable for many basic string parsing tasks
re and regex modules – the builtin re module and the externally available enhancement regex bring all the power, speed, complexity, and arcana of regular expressions to crack many difficult parsing problems
shlex module – the builtin shlex module takes its name from the Unix sh shell, and is good for parsing strings as if they were command arguments, including recognizing and properly handling string elements enclosed in quotes

These features are good for many quick text-splitting and scanning tasks, but can suffer when it comes to ease of maintenance or extensibility.

Approaching a problem with Pyparsing

Whether using pyparsing or another parsing library, it is always good to start out with a plan for the parser. In parsing terms, this is called a BNF, for Backus-Naur Form. This plan helps structure your thinking and the resulting implementation of your parser. Here is a simplified BNF for the letter range problem:

letter ::= 'A'..'Z'
letter_range ::= letter '-' letter
letter_range_item ::= letter_range | letter
letter_range_list ::= letter_range_item [',' letter_range_item]...

This translates almost directly to pyparsing Python code:

import pyparsing as pp

letter = pp.Char("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
letter_range = letter + '-' + letter
letter_range_item = letter_range | letter
letter_range_list = (letter_range_item + (',' + letter_range_item)[...])

Using parseString to run this parser:

print(letter_range_list.parseString(test))

gives

['A', '-', 'C', ',', 'X', ',', 'M', '-', 'P', ',', 'Z']

While this shows that we did in fact parse everything in the input, we still have to do some cleaning up, and also add the expansion of the A-C type ranges.

Suppress delimiters

First thing is to strip out the commas. They are important during the parsing process, but not really necessary for the parsed results. Pyparsing provides a class Suppress for this type of expression, that needs to be parsed, but should be suppressed from the results:

letter_range_list = letter_range_item + (pp.Suppress(',') + letter_range_item)[...]

As it happens, this expr + (Suppress(delimiter) + expr)[...] pattern occurs so frequently, that pyparsing includes a helper method delimitedList(expr, delim) (with a default delimiter of ',') that supports this pattern:

letter_range_list = pp.delimitedList(letter_range_item)

With this change, our output is now cleaner:

['A', '-', 'C', 'X', 'M', '-', 'P', 'Z']

Expand the ranges

The last step is to expand the sub-ranges 'A-C' and 'M-P'. We could certainly walk this list of parsed items, looking for the '-' symbols:

def expand_range(from_, to_):
    # given two letters, return a list of all the letters between them, inclusive
    # ('C', 'F') will return ['C', 'D', 'E', 'F']
    alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    from_loc = alphas.index(from_)
    to_loc = alphas.index(to_)
    return list(alphas[from_loc:to_loc+1])

out = []
letter_iter = iter(parsed_output)
letter = next(letter_iter, '')
last = ''
while letter:
    if letter != '-':
        out.append(letter)
        last = letter
    else:
        to_ = next(letter_iter)
        out.extend(expand_range(last, to_)[1:])
    letter = next(letter_iter, '')

But doesn’t it feel like we are retracing steps that the parser must have already taken? When pyparsing ran our parser to scan the original input string, it had to follow a very similar process to recognize the letter_range expression. Why not have pyparsing run this expansion process at the same time that it has parsed out a letter_range?

As it happens, we can attach a parse-time callback to an expression; in pyparsing applications, this callback is called a parse action. Parse actions can serve various purposes, but the feature we will use in this example is to replace the parsed tokens with a new set of tokens. We’ll still utilize the expand_range function defined above, but now calling it will be more straightforward:

def expand_parsed_range(t):
    # expand parsed tokans ['C', '-', 'F'] to ['C', 'D', 'E', 'F']
    return expand_range(t[0], t[2])
letter_range.addParseAction(expand_parsed_range)

And now our returned list reads:

'A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

One more thing…

While this accomplishes our goal, there is one pyparsing “best practice” that I’d like to incorporate into this example. Whenever we access parsed data using numeric indices like in:

return expand_range(t[0], t[2])

we make it difficult to make changes in our grammar at some time in the future. For instance, if the grammar were to change and add an optional field in this expression, our index of 2 might have to be conditionalized depending on whether that field were present or not. It would be much better if we could refer to these fields by name.

We can attach names to parsed fields by modifying the definition of the letter_range expression from:

letter_range = letter + '-' + letter

to:

letter_range = letter('start') + '-' + letter('end')

With this change, we can now write our parse action using these results names:

def expand_parsed_range(t):
    return expand_range(t.start, t.end)

And now if the definition of letter_range changes by adding other fields, our starting and ending fields will still be processed correctly, without having to change any existing code. Parsers that use results names are much more robust and maintainable over time.

The Complete Parser

Here is the full working parser example (with tests using run_tests):

import pyparsing as pp

def expand_range(from_, to_):
    """
    Given two letters, return a list of all the letters between them, inclusive.
    >>> expand_range('C', 'F')
    ['C', 'D', 'E', 'F']
    """
    alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    from_loc = alphas.index(from_)
    to_loc = alphas.index(to_)
    return list(alphas[from_loc:to_loc+1])

def expand_parsed_range(t):
    """
    Parse action to expand parsed tokans ['C', '-', 'F'] to ['C', 'D', 'E', 'F']
    """
    return expand_range(t.start, t.end)

# define parser elements - use setName to define names for these elements
# for better error messages
letter = pp.Char("ABCDEFGHIJKLMNOPQRSTUVWXYZ").setName("letter")
letter_range = (letter('start') + '-' + letter('end')).setName("letter_range")
letter_range.addParseAction(expand_parsed_range)
letter_range_list = pp.delimitedList(letter_range | letter)

# test the original input string
test = "A-C,X,M-P,Z"
print(letter_range_list.parseString(test))

# use runTests to easily test against multiple input strings, with comments
letter_range_list.runTests("""\
  # a simple list with no ranges
  A,B

  # a range
  A-E

  # a range and a single letter  
  A-E,X

  # the original complete test string
  A-C,X,M-P,Z
  
  # an erroneous input
  -C,X,M-P,Z
""")

Gives this output:

['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

# a simple list with no ranges
A,B
['A', 'B']

# a range
A-E
['A', 'B', 'C', 'D', 'E']

# a range and a single letter
A-E,X
['A', 'B', 'C', 'D', 'E', 'X']

# the original complete test string
A-C,X,M-P,Z
['A', 'B', 'C', 'X', 'M', 'N', 'O', 'P', 'Z']

# an erroneous input
-C,X,M-P,Z
^
FAIL: Expected {letter_range | letter}, found '-'  (at char 0), (line:1, col:1)

Potential extensions

Here are some extensions to this parser that you can try on your own:

Change the syntax of a range from “A-C” to “A:C”; how much of your program did you have to change?
Add support for lowercase letters and numeric digits
Add a parse action to letter_item_list to return a sorted list, with no duplicates
Add validation that letter_range is proper (do not allow degenerate ranges like “J-J”, out of order ranges “X-M”, or mixed ranges like “T-4”)
Write another parser to handle ranges of integers instead of letters

For More Information

You can find more information on pyparsing by visiting the pyparsing Github repo, at https://github.com/pyparsing/pyparsing.

Pyparsing

4 Comments

Adventures in Unicode: Parsing superscript exponents

Posted by ptmcg in Uncategorized on December 22, 2018

I’ve always wanted to add support for parsing exponentiation using superscript characters, instead of clunky infix operators like ‘^’ or ‘**’. If I want 2 raised to the tenth power, 2**10 just doesn’t look as good as 2¹⁰. Parsing and evaluating these exponents is simple enough with pyparsing, but first the digits need to be convertible to regular int values.

The regular digits 0 through 9 work nicely enough. Since they are contiguous and ordered in the ASCII sequence, and Python has support for them, we can write int(‘9’) and get the value 9. Here we look at all 10 numeric characters:

digits = "0123456789"
for i in sorted(digits):
    print(repr(i), ord(i), end=' ')
    try:
        print(int(i))
    except ValueError:
        print('-error-')

And we get a friendly predictable listing of characters, ordinal values, and converted int values:

'0' 48 0
'1' 49 1
'2' 50 2
'3' 51 3
'4' 52 4
'5' 53 5
'6' 54 6
'7' 55 7
'8' 56 8
'9' 57 9

But the superscript digits are more difficult. Using the same for loop, but with this changed line:

digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"

Now we get these results:

'²' 178 -error-
'³' 179 -error-
'¹' 185 -error-
'⁰' 8304 -error-
'⁴' 8308 -error-
'⁵' 8309 -error-
'⁶' 8310 -error-
'⁷' 8311 -error-
'⁸' 8312 -error-
'⁹' 8313 -error-

What the heck?! They aren’t even in sorted order! Only the 1, 2, and 3 superscripts are even in the 8-bit 0-255 range, and even they aren’t in order.

So looking ahead to when we eventually parse and gather these exponent strings, we need to roll our own int() method to convert to usable Python ints.

I suggest we build a dict for mapping each superscript to its int value:

superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int_map = {digit: value for value, digit in enumerate(superscript_digits)}

Enumerating over the superscript digits in order gives us (int, digit) tuples, which make it easy to convert to a dict using a dict comprehension.

Now we just need a function that will convert a string made up of one or more superscript characters, and give back an int value, for example converting ‘⁷⁸⁹‘ to 789. This code is a typical exercise in beginning programming classes, so should not be much of a surprise.

def super_to_int(superscript_str):
    ret = 0

    # iterate over each digit in the input, get its value (from 0-9)
    # and add it to the current running total, after multiplying
    # the running total by 10
    for chr in superscript_str:
        ret = ret * 10 + super_int_map[chr]

    return ret

With that in our back pocket, we can start writing a parser, knowing that when it comes time to convert superscript exponents to ints, our conversion function is all ready to go.

Pyparsing comes ready-made with a number of helpful classes and objects for building up parsers a piece at a time. So we can begin by writing the expression that will parse an integer subscript using a pyparsing Word class, which indicates that we want to parse a word group composed of one or more characters, in this case the characters are the superscript digits.

exponent = pp.Word(superscript_digits).setName("integer exponent")

The reason for naming this expression becomes clear when we run some simple tests on it, using the runTests() method that is provided on all pyparsing parse expressions:

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

Giving

¹²³
['¹²³']

¹⁰
['¹⁰']

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Without the name, we would have gotten a more cryptic-looking message like “Expected W:(⁰¹²³…) (at char 0), (line:1, col:1)”.

We actually will want to parse and evaluate strings like “2¹⁰“, so we also need an expression for a regular integer. Integer expressions are so common, pyparsing provides a standard expression in its pyparsing_common namespace class, so we can just use that.

integer = pp.pyparsing_common.integer
integer.runTests("""\
   2
   42
   4294967296
""")

Giving

2
[2]

42
[42]

4294967296
[4294967296]

There is a subtle difference from the previous tests – these values have already been converted to ints! There are no quotation marks around the parsed values, as you see above when we tested out our exponent expression. So when we want to perform the final exponentiation operation, the base number will already be in int form.

pyparsing does this using a parse-time callback, called a parse action. Each expression in the parser can have one or more parse actions attached to it, for the purposes of conversion, validation, data structuring, or whatever. We will use this same feature to convert our superscript exponents to ints.

Pyparsing is pretty flexible in letting you define the arguments passed to parse actions. The most common is to just pass the list of parsed tokens. For this particular expression the list is always just 1 element long, since the expression we are acting on parses just a single word of superscript digits.

def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])

exponent.addParseAction(convert_parsed_exponent)

Rerunning our previous tests now gives:

¹²³
[123]

¹⁰
[10]

10
^
FAIL: Expected integer exponent (at char 0), (line:1, col:1)

Now the first two parse correctly, but we are still stumbling over plain old “10”.

To handle our expressions like “2¹⁰“, we define another expression, this time combining the expressions we already have. We’ll allow for our parser to handle integers with or without exponents:

raised_number = integer + pp.Optional(exponent, default=1)

We simply use the ‘+’ operator to show that these two expressions should occur one after the next. We wrap our exponent expression using the pyparsing Optional class, so that pyparsing won’t complain when parsing values that have no exponent. In the event that there is no exponent, we would still like a default value to be given. In this case, a logical default exponent if none is explicitly given is 1, since any number raised to 1 is just that same number.

Testing out our combined expression, shows that we are getting pretty close:

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives

2¹⁰
[2, 10]

10³
[10, 3]

2³²
[2, 32]

10⁰
[10, 0]

10
[10, 1]

The last step will be to do the computation of the actual exponentiation operation. But now that we are used to using parse actions, a second one added to the raised_number expression does the job.

def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Gives the desired results:

2¹⁰
[1024]

10³
[1000]

2³²
[4294967296]

10⁰
[1]

10
[10]

Note that raise_to_power() has no code in it to convert the tokens to ints. This is because the two expressions that make up a raised_number have each already converted their strings to ints, so the resulting parsed tokens are no longer just strings, but ints. Similarly, the code that performs conversion from superscript string to int has no exception handling in case of bad input. Why? Because the input has already been screened in the parser so that only valid superscript strings are sent to the parse action.

Here are some ideas for other expressions that this parser does not currently handle:

    -1³
    -1²
    (4-5)³
    (1/2)³
    1/2¹⁰
    2⁻¹⁰
    6.02×10²³

Here is the full listing of code from this article:

# -*- coding: utf-8 -*-

import pyparsing as pp

def show_numeric_characters(digits):
    for i in sorted(digits):
        print(repr(i), ord(i), end=' ')
        try:
            print(int(i))
        except ValueError:
            print('-error-')

digits = "0123456789"
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
show_numeric_characters(digits)
show_numeric_characters(superscript_digits)

# a pyparsing expression to parse an exponent of superscript digits
exponent = pp.Word(superscript_digits).setName("integer exponent")

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# pyparsing-provided expression to parse plain old integers
integer = pp.pyparsing_common.integer
# alternate form, which could handle a leading '-' sign
# integer = pp.Regex(r'-?\d+').addParseAction(lambda t: int(t[0]))
integer.runTests("""\
    2
    42
    4294967296
""")

# function to convert superscript string to int
superscript_digits = "⁰¹²³⁴⁵⁶⁷⁸⁹"
super_int = dict((digit, value) for value, digit in enumerate(superscript_digits))
def super_to_decimal(superscript_str):
    ret = 0
    for chr in superscript_str:
        ret = ret * 10 + super_int[chr]
    return ret

# parse action to convert a string of superscript digits to an int
def convert_parsed_exponent(tokens):
    return super_to_decimal(tokens[0])
exponent.addParseAction(convert_parsed_exponent)

exponent.runTests("""\
    ¹²³
    ¹⁰
    10
""")

# define an expression to parse an integer optionally raised
# to an exponent power
raised_number = integer + pp.Optional(exponent, default=1)

# add parse action to perform exponentiation
def raise_to_power(t):
    return t[0]**t[1]
raised_number.addParseAction(raise_to_power)

# take it for a spin!
raised_number.runTests("""\
    2¹⁰
    10³
    2³²
    10⁰
    10
    """)

Pyparsing is available through PyPI using pip install. The home for pyparsing has recently moved to GitHub: https://github.com/pyparsing/pyparsing

UPDATE: The unicodedata module in Python’s standard library includes the digit() method, which will do the same job as our super_int_map, converting superscripts (and subscripts too!) to their corresponding int values. We still need to do the conversion in sup_to_int that handles multiple digit superscripts, though.

Pyparsing, python

1 Comment

Things Python