Posts Tagged data analysis
Fuzzy Groupby – Unusual Restaurant Part II
Posted by ptmcg in Uncategorized on November 11, 2020

In the last blog post (https://thingspython.wordpress.com/2020/10/30/python-groupby-and-an-unusual-restaurant/ ), we looked at Python’s stdlib’s itertools.groupby method, with examples illustrated using a restaurant with an unusual seating plan:
- they have just one table
- that table can seat an unlimited number of customers
- BUT, they must all be in the same party
The restaurant is very popular, so there is always a line of customers outside waiting to get in. The groupby greeter is in charge of identifying each group:
- on Family Night, groups are determined by those all having the same last name
- on High School Night, groups are determined by those who are in high school or not
These represent the most common uses of itertools.groupby in Python – from a sequence of items, identify groups of items based on those having some common value (such as a last name), or whose properties all evaluate to a common value (such as a grade level indicating whether or not one is in high school).
In this post, I’ll describe what I call fuzzy groupby – items in a group are defined not by their own attributes, but by their relationship with those around them.
For the first example, let’s assume our restaurant has Kids Eat Free Night, where an adult can bring as many children as they like and the children eat for free. Adults may still eat alone as well. For this example, we will say that a child is one under the age of 12.
Note that this is different from High School Night. On that night, the high schoolers and non-high schoolers each constituted their own separate groups. Now we want to define our groups as an adult + 0 or more children.
For a particular Kids Eat Free Night, our groupby greeter might announce parties as:
Party of one adult and 2 children
Party of one adult and 3 children
Party of one adult alone
Party of one adult and 1 child
Party of one adult alone
...
One way to write this in Python is to write the code just like we did for High School Night: have our grouping function return a bool of whether each person is a child or not. But since we need to combine the adult from an all-adults group with the children in the next all-children group, we’ll need to do that regrouping manually:
- if adults, each but the last is its own group
- if children, all are in the same party as the most recent adult
To write this in Python, we first create a line of waiting customers:
from collections import namedtuple
# create a line of waiting customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
Customer("Alice", 22), # an adult with group of 2 children
Customer("Ken", 7),
Customer("Olivia", 9),
Customer("Jim", 37), # an adult with group of 3 children
Customer("Bob", 5),
Customer("Chip", 8),
Customer("Nancy", 2),
Customer("Dan", 26), # an adult dining alone
Customer("Candace", 24), # an adult with group of 1 child
Customer("Courtney", 11),
Customer("Arthur", 26), # an adult dining alone
]
For groupby, we’ll need a function to return whether a customer is an adult or not:
def is_adult(customer):
return customer.age >= 12
Using groupby, we’ll now get groups of adults and groups of children. If the group is a group of adults, then every adult but the last is a single adult dining alone.
The last adult in the group may be followed by a group of children, so we need to save that adult in a “holding” variable, to be included with the children in the next group given by groupby.
from itertools import groupby
last_adult = None
for group_are_adults, customers in groupby(customers_in_line,
key=is_adult):
# convert generator to list so we can easily detect
# the last customer
customers = list(customers)
if group_are_adults:
# every customer but the last is an adult dining alone
for customer in customers[:-1]:
# announce the group, and its members
print("Party of 1 adult alone")
print(" -", customer.name, "(adult)")
last_adult = customers[-1]
else:
# group are all children, so associate them with
# the last adult
if last_adult is not None:
# announce the group, and its members
print("Party of one adult and {} {}".format(
len(customers),
"child" if len(customers) == 1 else "children"))
print(" -", last_adult.name, "(adult)")
for customer in customers:
print(" -", customer.name, "(child)")
last_adult = None
else:
# this should never happen, but...
print("what are these children doing here without an adult?!")
Lastly, we must deal with a subtle bug in this code: what happens if the list ends with one or more adults followed by no children? If that happens, then the last adult would get stored in last_adult, but never actually processed. So after the code for our groupby loop, we need additional code to handle any last single-dining adult:
# must see if any adult was at the end of the line
if last_adult is not None:
print("Party of 1 adult alone")
print(" -", last_adult.name, "(adult)")
This is some pretty messy code, with lots of duplication. It is only a little better than doing our own iteration over the list instead of using groupby.
It would be better if we could define a key grouping function so that groupby gives us each group as our rules define them.
Such a grouping function will need to keep some state of the current group, and will need to detect when a new group has started. We can do that by moving the “last adult” holding variable into the grouping function, and have the grouping function return that variable’s id(). Now the grouping function doesn’t necessarily return a value that comes from each entry in the list, but it does return a value that represents the group.
We can attach attributes to Python functions just like any other object. Doing so avoids creating a separate global variable, and better indicates the association of that value with the function.
Here is a grouping function that will keep track of the adult for each group of one adult plus zero or more children:
def in_group(customer):
# if the current customer is an adult, then
# they are the start of a new group
if is_adult(customer):
in_group.first_in_group = customer
return id(in_group.first_in_group)
# define attribute for tracking the adult for each group
in_group.first_in_group = None
Remember, groupby calls this grouping function for every element in the list, or in this example, every customer waiting in line. If that customer is an adult, then we know they are the start of a new group. In that case, that customer becomes the new first customer of the next group. If the customer is a child, then we simply return the value from the last-seen adult, so that the adult and all the children return the same grouping value.
This value returned by the grouping function is no longer all that interesting. The important parts are that it describes the current group, that all the members of this group return the same value, and that the value will change on the start of the next group. If we could be sure that all people in the line had unique names, we could return in_group.first_in_group.name. But since we can’t be sure of that, it is simple to just return the id() of the first_in_group customer object.
Since we now only care about the contents of the groups, and not the grouping value itself, we can just use the throwaway variable _ for that value:
for _, customers in groupby(customers_in_line, key=in_group):
# extract customers in the group into the leading adult
# and the following list of children, if any
adult, *children = customers
# announce the group
if children:
print("Party of one adult and {} {}".format(
len(children),
"child" if len(children) == 1 else "children"))
else:
print("Party of 1 adult alone")
# print out members of the group
print(" -", adult.name, "(adult)")
for child in children:
print(" -", child.name, "(child)")
At the expense of a slightly more complicated grouping function, our groupby loop has cleaned up considerably. We get the same output as before, but now the code that processes each group is straightforward, with no code duplication. And no need for special handling after the loop for straggling adults.
A slight variation on the last example could be made if we envision the restaurant having an Eat With People Close To Your Own Age Night. On this night, the waiting customers are grouped if their ages are all within 5 years of each other. In these groups, there is no distinction of adult or child, so this code is extremely simple. The grouping function is very similar, with a minor change to handle the initial case where the first_in_group instance is None.
def in_group(customer):
# detect if we are the first group (first_in_group is None)
# or if we are starting a new group - in either case, update
# the first_in_group with the current customer
if (in_group.first_in_group is None
or abs(customer.age - in_group.first_in_group.age) > 5):
in_group.first_in_group = customer
return id(in_group.first_in_group)
in_group.first_in_group = None
[EDIT: it turns out that this does not group customers within 5 years of age of each other, but customers within 5 years of age of the first customer in the group, either older or younger. So this would permit groups of customers up to 10 years of age apart. Correct code would also add in_group.min_age and in_group.max_age, update them as new in-group customers are added, and determine if a new group is found if customer.age < in_group.max_age - 5 or customer.age > in_group.min_age + 5. After adding 2 more attributes to in_group and this more complex group detection logic, one might reasonably argue that this should really be moved to a separate class. The code has been updated in the summarized code at the end of this article.]
This is the simplest groupby loop of all, since there is no need to distinguish types of customers – we just list them all and their ages.
for _, customers in groupby(customers_in_line, key=in_group):
customers = list(customers)
# announce the party
print("Party of {} customer{}".format(
len(customers),
"" if len(customers) == 1 else "s"))
# list the customers in the party
for customer in customers:
print(" -", customer.name, customer.age)
I have used this same kind of fuzzy groupby in several cases. In my last job, I often had to process log files where some log entries were multiple lines long, or had an associated traceback inserted in the log. In those cases, I would detect a log line that started with a valid timestamp, and that would be the first_in_group line. Log entries on a single line would be a group of just one line; entries on multiple lines would result in a group with the first entry being the timestamped log line, and the remaining entries being the continuation lines. This was very similar to the Kids Eat Free Night example.
In another case, I needed to group log entries by proximity in time. These were actually network packet captures, and I needed to group packets that occurred within 5 milliseconds, so that bursts of packets could be identified. For that case, I used a method similar to the Eat With People Close To Your Own Age Night grouping function.
Here is the full code described in this article:
from collections import namedtuple
from itertools import groupby
# create a line of customers
Customer = namedtuple("Customer", "name age")
customers_in_line = [
Customer("Alice", 22), # an adult with group of 2 children
Customer("Ken", 7),
Customer("Olivia", 9),
Customer("Jim", 37), # an adult with group of 3 children
Customer("Bob", 5),
Customer("Chip", 8),
Customer("Nancy", 2),
Customer("Dan", 26), # an adult dining alone
Customer("Candace", 24), # an adult with group of 1 child
Customer("Courtney", 11),
Customer("Arthur", 26), # an adult dining alone
]
#
# Kids Eat Free Night example
#
print("\nKids Eat Free Night example")
def is_adult(customer):
return customer.age >= 12
last_adult = None
for group_are_adults, customers in groupby(customers_in_line, key=is_adult):
# convert generator to list so we can easily detect the last
customers = list(customers)
if group_are_adults:
# every customer but the last is an adult dining alone
for customer in customers[:-1]:
print("Party of 1 adult alone")
print(" -", customer.name, "(adult)")
# save off the last adult to be grouped with the next group
# (of children)
last_adult = customers[-1]
else:
# group are all children, so associate them with the last adult
if last_adult is not None:
print("Party of one adult and {} {}".format(
len(customers),
"child" if len(customers) == 1 else "children"))
print(" -", last_adult.name, "(adult)")
for customer in customers:
print(" -", customer.name, "(child)")
last_adult = None
else:
# this should never happen, but...
print("what are these children doing here without an adult?!")
# must see if any adult was at the end of the line
if last_adult is not None:
print("Party of 1 adult alone")
print(" -", last_adult.name, "(adult)")
print("\nKids Eat Free Night example #2")
def in_group(customer):
# if the current customer is an adult, then
# they are the start of a new group
if is_adult(customer):
in_group.first_in_group = customer
return id(in_group.first_in_group)
# define attribute for tracking the adult for each group
in_group.first_in_group = None
for _, customers in groupby(customers_in_line, key=in_group):
# extract customers in the group into the leading adult
# and the following list of children, if any
adult, *children = customers
# announce the group
if children:
print("Party of one adult and {} {}".format(
len(children),
"child" if len(children) == 1 else "children"))
else:
print("Party of 1 adult alone")
# print out members of the group
print(" -", adult.name, "(adult)")
for child in children:
print(" -", child.name, "(child)")
print("\nEat With People Close to Your Own Age Night example")
def in_group(customer):
# detect if we are the first group (first_in_group is None)
# or if we are starting a new group - if so, update the
# first_in_group with the current customer
if (in_group.first_in_group is None
or customer.age < in_group.max_age - 5
or customer.age > in_group.min_age + 5
):
in_group.first_in_group = customer
in_group.min_age = in_group.max_age = customer.age
else:
in_group.min_age = min(in_group.min_age, customer.age)
in_group.max_age = max(in_group.max_age, customer.age)
return id(in_group.first_in_group)
in_group.first_in_group = None
in_group.min_age = 0
in_group.max_age = 0
for _, customers in groupby(customers_in_line, key=in_group):
customers = list(customers)
# announce the party
print("Party of {} customer{}".format(
len(customers),
"" if len(customers) == 1 else "s"))
# list the customers in the party
for customer in customers:
print(" -", customer.name, customer.age)
You must be logged in to post a comment.