Posts Tagged itertools

Fuzzy Groupby – Unusual Restaurant Part II

In the last blog post (https://thingspython.wordpress.com/2020/10/30/python-groupby-and-an-unusual-restaurant/ ), we looked at Python’s stdlib’s itertools.groupby method, with examples illustrated using a restaurant with an unusual seating plan:

  • they have just one table
  • that table can seat an unlimited number of customers
  • BUT, they must all be in the same party

The restaurant is very popular, so there is always a line of customers outside waiting to get in. The groupby greeter is in charge of identifying each group:

  • on Family Night, groups are determined by those all having the same last name
  • on High School Night, groups are determined by those who are in high school or not

These represent the most common uses of itertools.groupby in Python – from a sequence of items, identify groups of items based on those having some common value (such as a last name), or whose properties all evaluate to a common value (such as a grade level indicating whether or not one is in high school).

In this post, I’ll describe what I call fuzzy groupby – items in a group are defined not by their own attributes, but by their relationship with those around them.

For the first example, let’s assume our restaurant has Kids Eat Free Night, where an adult can bring as many children as they like and the children eat for free. Adults may still eat alone as well. For this example, we will say that a child is one under the age of 12.

Note that this is different from High School Night. On that night, the high schoolers and non-high schoolers each constituted their own separate groups. Now we want to define our groups as an adult + 0 or more children.

For a particular Kids Eat Free Night, our groupby greeter might announce parties as:

Party of one adult and 2 children
Party of one adult and 3 children
Party of one adult alone
Party of one adult and 1 child
Party of one adult alone
...

One way to write this in Python is to write the code just like we did for High School Night: have our grouping function return a bool of whether each person is a child or not. But since we need to combine the adult from an all-adults group with the children in the next all-children group, we’ll need to do that regrouping manually:

  • if adults, each but the last is its own group
  • if children, all are in the same party as the most recent adult

To write this in Python, we first create a line of waiting customers:

from collections import namedtuple

# create a line of waiting customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
    Customer("Alice", 22),    # an adult with group of 2 children
    Customer("Ken", 7),
    Customer("Olivia", 9),
    Customer("Jim", 37),      # an adult with group of 3 children
    Customer("Bob", 5),
    Customer("Chip", 8),
    Customer("Nancy", 2),
    Customer("Dan", 26),      # an adult dining alone
    Customer("Candace", 24),  # an adult with group of 1 child
    Customer("Courtney", 11),
    Customer("Arthur", 26),   # an adult dining alone
    ]

For groupby, we’ll need a function to return whether a customer is an adult or not:

def is_adult(customer):
    return customer.age >= 12

Using groupby, we’ll now get groups of adults and groups of children. If the group is a group of adults, then every adult but the last is a single adult dining alone.

The last adult in the group may be followed by a group of children, so we need to save that adult in a “holding” variable, to be included with the children in the next group given by groupby.

from itertools import groupby

last_adult = None
for group_are_adults, customers in groupby(customers_in_line,
                                 key=is_adult):
    # convert generator to list so we can easily detect 
    # the last customer
    customers = list(customers)
    if group_are_adults:
        # every customer but the last is an adult dining alone
        for customer in customers[:-1]:
            # announce the group, and its members
            print("Party of 1 adult alone")
            print(" -", customer.name, "(adult)")
        last_adult = customers[-1]

    else:
        # group are all children, so associate them with
        # the last adult
        if last_adult is not None:
            # announce the group, and its members
            print("Party of one adult and {} {}".format(
                    len(customers),
                    "child" if len(customers) == 1 else "children"))

            print(" -", last_adult.name, "(adult)")
            for customer in customers:
                print(" -", customer.name, "(child)")
            last_adult = None

        else:
            # this should never happen, but...
            print("what are these children doing here without an adult?!")

Lastly, we must deal with a subtle bug in this code: what happens if the list ends with one or more adults followed by no children? If that happens, then the last adult would get stored in last_adult, but never actually processed. So after the code for our groupby loop, we need additional code to handle any last single-dining adult:

# must see if any adult was at the end of the line
if last_adult is not None:
    print("Party of 1 adult alone")
    print(" -", last_adult.name, "(adult)")

This is some pretty messy code, with lots of duplication. It is only a little better than doing our own iteration over the list instead of using groupby.

It would be better if we could define a key grouping function so that groupby gives us each group as our rules define them.

Such a grouping function will need to keep some state of the current group, and will need to detect when a new group has started. We can do that by moving the “last adult” holding variable into the grouping function, and have the grouping function return that variable’s id(). Now the grouping function doesn’t necessarily return a value that comes from each entry in the list, but it does return a value that represents the group.

We can attach attributes to Python functions just like any other object. Doing so avoids creating a separate global variable, and better indicates the association of that value with the function.

Here is a grouping function that will keep track of the adult for each group of one adult plus zero or more children:

def in_group(customer):
    # if the current customer is an adult, then
    # they are the start of a new group
    if is_adult(customer):
        in_group.first_in_group = customer

    return id(in_group.first_in_group)

# define attribute for tracking the adult for each group
in_group.first_in_group = None

Remember, groupby calls this grouping function for every element in the list, or in this example, every customer waiting in line. If that customer is an adult, then we know they are the start of a new group. In that case, that customer becomes the new first customer of the next group. If the customer is a child, then we simply return the value from the last-seen adult, so that the adult and all the children return the same grouping value.

This value returned by the grouping function is no longer all that interesting. The important parts are that it describes the current group, that all the members of this group return the same value, and that the value will change on the start of the next group. If we could be sure that all people in the line had unique names, we could return in_group.first_in_group.name. But since we can’t be sure of that, it is simple to just return the id() of the first_in_group customer object.

Since we now only care about the contents of the groups, and not the grouping value itself, we can just use the throwaway variable _ for that value:

for _, customers in groupby(customers_in_line, key=in_group):

    # extract customers in the group into the leading adult
    # and the following list of children, if any
    adult, *children = customers

    # announce the group
    if children:
        print("Party of one adult and {} {}".format(
                len(children),
                "child" if len(children) == 1 else "children"))
    else:
        print("Party of 1 adult alone")

    # print out members of the group
    print(" -", adult.name, "(adult)")
    for child in children:
        print(" -", child.name, "(child)")

At the expense of a slightly more complicated grouping function, our groupby loop has cleaned up considerably. We get the same output as before, but now the code that processes each group is straightforward, with no code duplication. And no need for special handling after the loop for straggling adults.

A slight variation on the last example could be made if we envision the restaurant having an Eat With People Close To Your Own Age Night. On this night, the waiting customers are grouped if their ages are all within 5 years of each other. In these groups, there is no distinction of adult or child, so this code is extremely simple. The grouping function is very similar, with a minor change to handle the initial case where the first_in_group instance is None.

def in_group(customer):
    # detect if we are the first group (first_in_group is None) 
    # or if we are starting a new group - in either case, update
    # the first_in_group with the current customer
    if (in_group.first_in_group is None 
            or abs(customer.age - in_group.first_in_group.age) > 5):
        in_group.first_in_group = customer
    return id(in_group.first_in_group)

in_group.first_in_group = None

[EDIT: it turns out that this does not group customers within 5 years of age of each other, but customers within 5 years of age of the first customer in the group, either older or younger. So this would permit groups of customers up to 10 years of age apart. Correct code would also add in_group.min_age and in_group.max_age, update them as new in-group customers are added, and determine if a new group is found if customer.age < in_group.max_age - 5 or customer.age > in_group.min_age + 5. After adding 2 more attributes to in_group and this more complex group detection logic, one might reasonably argue that this should really be moved to a separate class. The code has been updated in the summarized code at the end of this article.]

This is the simplest groupby loop of all, since there is no need to distinguish types of customers – we just list them all and their ages.

for _, customers in groupby(customers_in_line, key=in_group):

    customers = list(customers)

    # announce the party
    print("Party of {} customer{}".format(
                len(customers),
                "" if len(customers) == 1 else "s"))

    # list the customers in the party
    for customer in customers:
        print(" -", customer.name, customer.age)

I have used this same kind of fuzzy groupby in several cases. In my last job, I often had to process log files where some log entries were multiple lines long, or had an associated traceback inserted in the log. In those cases, I would detect a log line that started with a valid timestamp, and that would be the first_in_group line. Log entries on a single line would be a group of just one line; entries on multiple lines would result in a group with the first entry being the timestamped log line, and the remaining entries being the continuation lines. This was very similar to the Kids Eat Free Night example.

In another case, I needed to group log entries by proximity in time. These were actually network packet captures, and I needed to group packets that occurred within 5 milliseconds, so that bursts of packets could be identified. For that case, I used a method similar to the Eat With People Close To Your Own Age Night grouping function.

Here is the full code described in this article:

from collections import namedtuple
from itertools import groupby

# create a line of customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
    Customer("Alice", 22),    # an adult with group of 2 children
    Customer("Ken", 7),
    Customer("Olivia", 9),
    Customer("Jim", 37),      # an adult with group of 3 children
    Customer("Bob", 5),
    Customer("Chip", 8),
    Customer("Nancy", 2),
    Customer("Dan", 26),      # an adult dining alone
    Customer("Candace", 24),  # an adult with group of 1 child
    Customer("Courtney", 11),
    Customer("Arthur", 26),   # an adult dining alone
    ]

#
# Kids Eat Free Night example
#
print("\nKids Eat Free Night example")
def is_adult(customer):
    return customer.age >= 12

last_adult = None
for group_are_adults, customers in groupby(customers_in_line, key=is_adult):
    # convert generator to list so we can easily detect the last
    customers = list(customers)

    if group_are_adults:
        # every customer but the last is an adult dining alone
        for customer in customers[:-1]:
            print("Party of 1 adult alone")
            print(" -", customer.name, "(adult)")

        # save off the last adult to be grouped with the next group
        # (of children)
        last_adult = customers[-1]

    else:
        # group are all children, so associate them with the last adult
        if last_adult is not None:
            print("Party of one adult and {} {}".format(
                    len(customers),
                    "child" if len(customers) == 1 else "children"))
            print(" -", last_adult.name, "(adult)")
            for customer in customers:
                print(" -", customer.name, "(child)")
            last_adult = None
        else:
            # this should never happen, but...
            print("what are these children doing here without an adult?!")

# must see if any adult was at the end of the line
if last_adult is not None:
    print("Party of 1 adult alone")
    print(" -", last_adult.name, "(adult)")



print("\nKids Eat Free Night example #2")

def in_group(customer):
    # if the current customer is an adult, then
    # they are the start of a new group
    if is_adult(customer):
        in_group.first_in_group = customer
    return id(in_group.first_in_group)

# define attribute for tracking the adult for each group
in_group.first_in_group = None

for _, customers in groupby(customers_in_line, key=in_group):

    # extract customers in the group into the leading adult
    # and the following list of children, if any
    adult, *children = customers

    # announce the group
    if children:
        print("Party of one adult and {} {}".format(
                len(children),
                "child" if len(children) == 1 else "children"))
    else:
        print("Party of 1 adult alone")

    # print out members of the group
    print(" -", adult.name, "(adult)")
    for child in children:
        print(" -", child.name, "(child)")


print("\nEat With People Close to Your Own Age Night example")

def in_group(customer):
    # detect if we are the first group (first_in_group is None) 
    # or if we are starting a new group - if so, update the
    # first_in_group with the current customer
    if (in_group.first_in_group is None 
            or customer.age < in_group.max_age - 5
            or customer.age > in_group.min_age + 5
        ):
        in_group.first_in_group = customer
        in_group.min_age = in_group.max_age = customer.age
    else:
        in_group.min_age = min(in_group.min_age, customer.age)
        in_group.max_age = max(in_group.max_age, customer.age)

    return id(in_group.first_in_group)

in_group.first_in_group = None
in_group.min_age = 0
in_group.max_age = 0

for _, customers in groupby(customers_in_line, key=in_group):
    customers = list(customers)

    # announce the party
    print("Party of {} customer{}".format(
                len(customers),
                "" if len(customers) == 1 else "s"))

    # list the customers in the party
    for customer in customers:
        print(" -", customer.name, customer.age)

, , ,

Leave a comment

Python groupby and an Unusual Restaurant

The itertools module in the Python stdlib has a number of interesting and useful functions. One that I find especially powerful when reporting or analyzing data is groupby. Given a sequence of objects and a grouping function, groupby will turn them into a sequence of grouping keys and associated objects. It is great for summarizing or tallying data that is stored as a sequence of objects, which could be dicts, tuples,strings, or really most any type of data item in the Python world.

To picture how groupby works, imagine a restaurant. This restaurant is extremely popular, and people wait outside to get in. The restaurant is unusual, in that it has only a single table, but that table will seat an unlimited number of people. The only constraint is that all the people have to be in the same party.

groupby is the greeter outside the restaurant, who goes down the line to identify who will be the next group to go in. The greeter asks each person what party they are in. When the greeter finds someone not in the current party, this indicates the end of the current group, and the start of the next.

For instance, on Family Night, imagine that everyone who is in a group will all have the same last name. So the groupby greeter asks each person their name, and if the lead person says “Bob Jones”, then the greeter continues down the line until finding someone whose last name is not Jones, say “Rogers”. Those Joneses all form the “Jones” group. If one of the Jones family is late, say “Bill Jones”, and is mixed in with the Rogerses, that is too bad, because the strict rules of the restaurant are that they will only be seated with their group if they are in line together with their group.

Also sad for the Rogerses who are in line after Bill Jones, because since they are no longer in line together with their other Rogers relatives, they too will not be seated with them. Bill Jones will be seated alone as a group of one, followed by another group of the remaining Rogerses.

So the greeter might report on the line like:
“Jones party, the next 4 customers.”
After they are admitted, the greeter continues down the line.
“Rogers party, the next 3 customers.”
Followed by,
“Jones party, 1 person.” (that tardy Bill Jones!)
And then,
“Rogers party, the next 2 customers”.

Note that the greeter does not announce the next party until the previous one has been seated.

Another night might be High School Night, when customers who are in high school are all seated together. Again, the greeter goes down the line asking each customer what grade they are in, and if they are in high school, they are seated in a group together. If not in high school, either younger or older, they are seated in a separate group, which goes until another high schooler is found. The actual grade they are in is not important, just whether it is one that goes to high school (typically grades 9 through 12 in the US).

And groupby in your Python program works much the same way as the groupby greeter. To use groupby, you need some sequence of items (typically a list, or perhaps a database cursor; may also be a Python generator), and a function that will distinguish the key value that determines the current group.

Here is how Family Night might look in Python. First we need a line of customers (we are using namedtuple from the stdlib collections module to define a simple Customer class type):

Customer = namedtuple("Customer", "last_name first_name grade_level")

line_of_customers = [
    Customer("Jones", "Alice", None),
    Customer("Jones", "Bob", None),
    Customer("Jones", "Charles", 10),
    Customer("Jones", "Dave", 9),
    Customer("Rogers", "Sandy", 11),
    Customer("Rogers", "George", 12),
    Customer("Rogers", "Amelia", None),
    Customer("Jones", "Bill", None),
    Customer("Rogers", "Sam", 3),
    Customer("Rogers", "Sylvia", 5),
    Customer("Burton", "Fred", 11),
    Customer("Burton", "Claire", 9),
    Customer("Adams", "Floyd", 10),
    Customer("Adams", "Elaine", None),
    Customer("Adams", "Susan", None),
]

Now to use groupby, we create a grouping function, and then call groupby using the list of customers and that function.

def customers_by_last_name(customer) -> str:
    return customer.last_name

for family_name, family_customers in groupby(line_of_customers,
                                             key=customers_by_last_name):
    family_customers = list(family_customers)
    print(f"{family_name} party, the next {len(family_customers)} people.")

For each group, groupby gives us the key value that all the elements in the group have in common, followed by a generator that yields the individual list elements in turn.

(Note that we had to instantiate the grouped family_customers as a list in order to use the len builtin, since family_customers is given by groupby as a Python generator. Python generators do not support len, so for small sequences you can just instantiate them into a list, and then call len() on that list. In place of len(list(family_customers)) we could have written sum(1 for _ in family_customers), which walks the generator to count up the sum of people without creating the intermediate list.)

(A second note: that function customers_by_last_name is rather wordy. The Python stdlib includes another handy module, the operator module, which includes functions that wrap many common simple operations like getting attributes, getting dict items, performing mathematical functions, etc. The one that would help here is operator.attrgetter, which takes one or more string arguments and returns a function that gets those attributes from an object passed to it, returning those attribute values in a tuple. In this case, we could have replaced the customers_by_last_name with just operator.attrgetter('last_name').)

For High School Night, the code looks very similar, differing only in the grouping method, and code to change the boolean high school status is_in_high_school into a printable string:

def in_high_school(customer) -> bool:
    return customer.grade is not None and 9 <= customer.grade <= 12

for is_in_high_school, group_customers in groupby(line_of_customers,
                                                  key=in_high_school):
    group_customers = list(group_customers)
    party_name = "High School" if is_in_high_school else "Not High School" 
    print(f"{party_name} party, the next {len(group_customers)} people.")

When you are summarizing data and want to avoid the problem of fragmented groups (such as those caused by the tardy Bill Jones breaking up both the Jones and the Rogers parties), it is customary if possible, to sort the incoming sequence of records. If the sequence is a generator, like the records that might be returned by a database query cursor, the sorting would most likely be done as part of an SQL "ORDER BY" clause that would be used to create the database cursor. But for records in a Python list, one would simply sort the list.

If sorting the list, then you would most likely use the same function you would pass to groupby. For Family Night, this would be:

line_of_customers.sort(key=customers_by_last_name)

for family_name, family_customers in groupby(line_of_customers,
                                             key=customers_by_last_name):
    ... etc. ...

You can imagine our restaurant having another greeter going through the line ahead of the groupby greeter, calling out, “Jones party, anyone else in the Jones party please move forward.” This would then pull Bill Jones up with his relatives, and the Joneses and the Rogerses would each all dine together. (In fact, this sorter-greeter would find the Adams and Burton families further down the line, and end up seating them ahead of the Jones, which the Joneses probably wouldn’t like – but at least they would get to eat together!)

Here is all the code from this post:

from collections import namedtuple
import itertools
import operator


Customer = namedtuple("Customer", "last_name first_name grade_level")

line_of_customers = [
    Customer("Jones", "Alice", None),
    Customer("Jones", "Bob", None),
    Customer("Jones", "Charles", 10),
    Customer("Jones", "Dave", 9),
    Customer("Rogers", "Sandy", 11),
    Customer("Rogers", "George", 12),
    Customer("Rogers", "Amelia", None),
    Customer("Jones", "Bill", None),
    Customer("Rogers", "Sam", 3),
    Customer("Rogers", "Sylvia", 5),
    Customer("Burton", "Fred", 11),
    Customer("Burton", "Claire", 9),
    Customer("Adams", "Floyd", 10),
    Customer("Adams", "Elaine", None),
    Customer("Adams", "Susan", None),
]


print("\nFamily Night seating")

def customer_by_last_name(customer) -> str:
    return customer.last_name

# or just
# customer_by_last_name = operator.attrgetter("last_name")

# uncomment out this line to see the grouping with all the groups 
# collected together
#line_of_customers.sort(key=customer_by_last_name)

for group_name, group_customers in itertools.groupby(line_of_customers, 
                                                      key=customer_by_last_name):
    group_customers = list(group_customers)
    group_size = len(group_customers)
    if group_size > 1:
        print(f"{group_name} party, the next {group_size} customers.")
    else:
        print(f"{group_name} party, 1 person.")
    for customer in group_customers:
        print(f" - {customer.first_name} {customer.last_name}")


print("\nHigh School Night seating")

def high_school_status(customer) -> bool:
    return customer.grade_level is not None and 9 <= customer.grade_level <= 12

for is_in_high_school, group_customers in itertools.groupby(line_of_customers,
                                                            key=high_school_status):
    group_name = "High School" if is_in_high_school else "Not High School"
    group_customers = list(group_customers)
    group_size = len(group_customers)
    print(f"{group_name} party, the next {group_size} customers.")
    for customer in group_customers:
        print(f" - {customer.first_name} {customer.last_name} {customer.grade_level}")

, ,

2 Comments