Lets say I have the following data set, turned into a dataframe:
data = [
['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
['Job 2', datetime.date(2019,6,10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])
This yields a dataframe that looks like:
Job Date Employee Manager
0 Job 1 2019-06-09 Jim Tom
1 Job 1 2019-06-09 Bill Tom
2 Job 1 2019-06-09 Tom Tom
3 Job 1 2019-06-10 Bill None
4 Job 2 2019-06-10 Tom Tom
What I am trying to generate is a pivot on each unique Job/Date combo, with a column for Manager, and a column for a string with comma separated, non-manager employees. A couple of things to assume:
- All employee names are unique (I’ll actually be using unique employee ids rather than names), and Managers are also “employees”, so there will never be a case with an employee and a manager sharing the same name/id, but being different individuals.
- A work crew can have a manager, or not (see row with id 3, for an example without)
- A manager will always also be listed as an employee (see row with id 2 or 4)
- A job could have a manager, with no additional employees (see row id 4)
I’d like the resulting dataframe to look like:
Job Date Manager Employees
0 Job 1 2019-06-09 Tom Jim, Bill
1 Job 1 2019-06-10 None Bill
2 Job 2 2019-06-10 Tom None
Which leads to my questions:
- Is there a way to do a ‘,’.join like aggregation in a pandas pivot?
- Is there a way to make this aggregation conditional (exclude the name/id in the manager column)
I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.
Solution:
Group to aggregate, then fix the Employees by removing the Manager and setting to None where appropriate. Since the employees are unique, sets will work nicely here to remove the Manager.
s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
s['Employee'] = [', '.join(x.difference({y})) for x,y in zip(s.Employee, s.Manager)]
s['Employee'] = s.Employee.replace({'': None})
Manager Employee
Job Date
Job 1 2019-06-09 Tom Jim, Bill
2019-06-10 None Bill
Job 2 2019-06-10 Tom None