Pandas- pivoting column into (conditional) aggregated string

Let's say I have the following data set, turned into a dataframe:

import datetime
import pandas as pd

data = [
    ['Job 1', datetime.date(2019, 6, 9), 'Jim', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Bill', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 9), 'Tom', 'Tom'],
    ['Job 1', datetime.date(2019, 6, 10), 'Bill', None],
    ['Job 2', datetime.date(2019,6,10), 'Tom', 'Tom']
]
df = pd.DataFrame(data, columns=['Job', 'Date', 'Employee', 'Manager'])

This yields a dataframe that looks like:

     Job        Date Employee Manager
0  Job 1  2019-06-09      Jim     Tom
1  Job 1  2019-06-09     Bill     Tom
2  Job 1  2019-06-09      Tom     Tom
3  Job 1  2019-06-10     Bill    None
4  Job 2  2019-06-10      Tom     Tom

What I am trying to generate is a pivot on each unique Job/Date combo, with a column for the Manager and a column holding a comma-separated string of non-manager employees. A few things to assume:

  1. All employee names are unique (I’ll actually be using unique employee ids rather than names), and Managers are also “employees”, so there will never be a case with an employee and a manager sharing the same name/id, but being different individuals.
  2. A work crew can have a manager, or not (see row with id 3, for an example without)
  3. A manager will always also be listed as an employee (see row with id 2 or 4)
  4. A job could have a manager, with no additional employees (see row id 4)

I’d like the resulting dataframe to look like:

     Job        Date  Manager     Employees
0  Job 1  2019-06-09      Tom     Jim, Bill
1  Job 1  2019-06-10     None          Bill
2  Job 2  2019-06-10      Tom          None

Which leads to my questions:

  1. Is there a way to do a ','.join-style aggregation in a pandas pivot?
  2. Is there a way to make this aggregation conditional (excluding the name/id in the Manager column)?

I suspect 1) is possible, and 2) might be more difficult. If 2) is a no, I can get around it in other ways later in my code.

Solution:

Group to aggregate, then fix the Employees by removing the Manager and setting to None where appropriate. Since the employees are unique, sets will work nicely here to remove the Manager.

# aggregate: first Manager per group, Employees collected into a set
s = df.groupby(['Job', 'Date']).agg({'Manager': 'first', 'Employee': lambda x: set(x)})
# remove the Manager from each set and join what's left
s['Employee'] = [', '.join(x.difference({y})) for x, y in zip(s.Employee, s.Manager)]
# an empty string means the Manager was the only employee
s['Employee'] = s.Employee.replace({'': None})

                 Manager   Employee
Job   Date                         
Job 1 2019-06-09     Tom  Jim, Bill
      2019-06-10    None       Bill
Job 2 2019-06-10     Tom       None
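
To get the flat layout from the question, with Job and Date back as ordinary columns and the Employees label, one small follow-up step works (the rename is just cosmetic):

out = s.reset_index().rename(columns={'Employee': 'Employees'})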

Write multiple Excel files for each value of a certain column in Python Pandas

Consider the following dataframe:

data = {'Col_A': [3, 2, 1, 0], 'Col_B': ['a', 'b', 'a', 'b']}
df = pd.DataFrame.from_dict(data)

I can create a dataframe a and write it to Excel as follows:

a = df[df.Col_B == 'a']
a.to_excel(excel_writer=r'F:\Desktop\output.xlsx', index=False)

I am looking for a way to write an Excel file for each value of Col_B. In reality, there are hundreds of values for Col_B. Is there a way to loop through this?

Solution:

If you need a separate Excel file for each group, loop over the groupby object:

for i, a in df.groupby('Col_B'):
    a.to_excel(rf'F:\Desktop\output_{i}.xlsx', index=False)

If you need a separate sheet for each group in one Excel file, use ExcelWriter:

with pd.ExcelWriter('output.xlsx') as writer:
    for i, a in df.groupby('Col_B'):
        a.to_excel(writer, sheet_name=i, index=False)
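
One caveat worth noting: sheet names must be strings (and Excel caps them at 31 characters), so if Col_B ever holds non-string values, coerce first. A hedged variant:

with pd.ExcelWriter('output.xlsx') as writer:
    for i, a in df.groupby('Col_B'):
        # str() guards against numeric/datetime group keys; [:31] respects Excel's limit
        a.to_excel(writer, sheet_name=str(i)[:31], index=False)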

How to permute values of pandas column?

I have the following DataFrame df:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        5.0
2019-02-01 12:00   10       1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   0        5.0

The sum of Supply is 40. I need to permute the Supply values so that the 10s are assigned to the higher values of Price, while the 0s occur at the lower values of Price.

This is the expected result:

Datetime           Supply   Price
2019-02-01 12:00   10       2.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   10       5.0
2019-02-01 12:00   0        1.0
2019-02-01 12:00   0        2.0
2019-02-01 12:00   10       4.0
2019-02-01 12:00   10       5.0

Any clues how to do it?

Solution:

argsort

  • Multiply by negative one as a convenient way to switch the sort
  • Use argsort to track the positions of where to drop my values
  • Create b to house my permuted values
  • Populate b with a sorted version of Supply
  • Assign back to df

import numpy as np

# positions that would sort Price descending (multiply by -1 to flip the order)
a = df.Price.mul(-1).to_numpy().argsort()
# an empty array to house the permuted values
b = np.empty_like(df.Supply)

# drop the descending-sorted Supply values into those positions
b[a] = df.Supply.sort_values(ascending=False)

# assign back to df
df.loc[:, 'Supply'] = b

df

           Datetime  Supply  Price
0  2019-02-01 12:00      10    2.0
1  2019-02-01 12:00       0    1.0
2  2019-02-01 12:00      10    5.0
3  2019-02-01 12:00       0    1.0
4  2019-02-01 12:00       0    2.0
5  2019-02-01 12:00      10    4.0
6  2019-02-01 12:00      10    5.0

There is room to optimize this code, but the general idea is there; one compact variant is sketched below.
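
As one possible tightening, a sketch of the same idea using the double-argsort rank trick, skipping the intermediate empty array:

import numpy as np

# rank of each row when Price is sorted descending (0 = highest Price)
order = (-df['Price'].to_numpy()).argsort(kind='stable').argsort()
# hand out the Supply values, largest first, according to that rank
df['Supply'] = np.sort(df['Supply'].to_numpy())[::-1][order]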

Python, Pandas: A better way to get the first None position in a list which gives the maximum consecutive None count

I have lists that contain None like the following lists.

l1 = [None, 1, None, None, 2, None, None]
l2 = [None, 1, 1, None, None, None, 2, None, None]

I want to get the position of the first None in the run that gives the maximum consecutive None count.

get_start_None_pos(l1) # should return 2
get_start_None_pos(l2) # should return 3

My current approach uses Pandas; it works fine, but it is too slow when I have many lists to deal with.

def get_start_None_pos(l: list) -> int:
    s = pd.Series(l)
    s = s.isna()
    s = s.cumsum() - s.cumsum().where(~s).ffill().fillna(0)
    return int(s.idxmax() - s.max() + 1)

I would like to know, is there any better way to solve something like this?

Solution:

Here’s one with NumPy –

import numpy as np

def maxconsecNone_start(l):
    # None becomes NaN after the float conversion, so isnan flags the Nones
    a = np.isnan(np.asarray(l, dtype=np.float64))
    # pad with False so every island of Nones has a start and an end edge
    a1 = np.r_[False, a, False]
    # flip points: even slots are island starts, odd slots are island ends
    idx = np.flatnonzero(a1[:-1] != a1[1:])
    return idx[2 * (idx[1::2] - idx[::2]).argmax()]

Sample runs –

In [49]: l1
Out[49]: [None, 1, None, None, 2, None, None]

In [50]: l2
Out[50]: [None, 1, 1, None, None, None, 2, None, None]

In [51]: maxconsecNone_start(l1)
Out[51]: 2

In [52]: maxconsecNone_start(l2)
Out[52]: 3
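
Note the NumPy version relies on the float conversion, so it will fail if a list contains non-numeric values. A pure-Python sketch with itertools.groupby that avoids that assumption:

from itertools import groupby

def max_none_start(l):
    pos, best_len, best_start = 0, 0, -1
    for is_none, grp in groupby(l, key=lambda x: x is None):
        n = sum(1 for _ in grp)
        # strict > keeps the first run on ties, matching the pandas version
        if is_none and n > best_len:
            best_len, best_start = n, pos
        pos += n
    return best_start

max_none_start(l1)  # 2
max_none_start(l2)  # 3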

Pandas: replace numpy.nan cell with maximum of non-nan adjacent cells

test case:

import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))

Each NaN entry should be replaced with the maximum of its non-NaN adjacent cells, where A[i + 1, j], A[i - 1, j], A[i, j + 1], A[i, j - 1] are the set of entries adjacent to A[i, j].

In so many words, this:

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5
3  NaN  3.0 NaN  4

should become this:

     A    B   C  D
0  3.0  2.0 2.0  0.0
1  3.0  4.0 4.0  1.0
2  3.0  4.0 5.0  5.0
3  3.0  3.0 4.0  4.0

Solution:

You can use the rolling method along both axes (down the columns, and across the rows via a transpose) and take the elementwise max of the two results. Then you can use that to fill in the missing values of the original.

# rolling max down each column: the cell itself plus the neighbours above and below
df1 = df.rolling(3, center=True, min_periods=1).max().fillna(-np.inf)
# the same across each row, via a transpose: neighbours to the left and right
df2 = df.T.rolling(3, center=True, min_periods=1).max().T.fillna(-np.inf)
# elementwise max of the two directions, then fill only the NaN cells
fill = df1.where(df1 > df2).fillna(df2)
df.fillna(fill)

Output

     A    B    C  D
0  3.0  2.0  2.0  0
1  3.0  4.0  4.0  1
2  3.0  4.0  5.0  5
3  3.0  3.0  4.0  4
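
An equivalent sketch that spells out the four neighbours explicitly with shift; unlike the rolling window, this never looks at the cell itself, though that makes no difference here since only NaN cells get filled:

import pandas as pd

# up, down, left and right neighbour of every cell
neighbours = [df.shift(1), df.shift(-1), df.shift(1, axis=1), df.shift(-1, axis=1)]
# stack the four shifted frames and take the max per original row label (NaNs skipped)
fill = pd.concat(neighbours).groupby(level=0).max()
df.fillna(fill)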

Replace certain text with value if text in list

I’m just getting up to speed on Pandas and cannot resolve one issue. I have a list of Counties in NY State. If the County is one of the 5 boroughs, I want to change the county name to New York, otherwise I leave it alone. The following gives the idea, but is not correct.

EDIT – so if the counties in the County column of the first few rows were Albany, Allegheny, Bronx before the change, they would be Albany, Allegheny, New York after the change

# clean up county names
# 5 boroughs must be combined to New York City
# eliminate the word county
nyCounties = ["Kings", "Queens", "Bronx", "Richmond", "New York"]

nypopdf['County'] = ['New York' for nypopdf['County'] in nyCounties else   
nypopdf['County']]

Solution:

A small mockup:

In [44]: c = ['c', 'g']

In [45]: df = pd.DataFrame({'county': list('abccdefggh')})

In [46]: df['county'] = df['county'].where(~df['county'].isin(c), 'N')

In [47]: df
Out[47]:
  county
0      a
1      b
2      N
3      N
4      d
5      e
6      f
7      N
8      N
9      h

So this is using pd.Series.where: ~df['county'].isin(c) selects the rows that are not in the list c (the ~ at the start is the 'not' operator), and the second argument is the value to replace with (used where the condition is False).

To fit your example:

nypopdf['County'] = nypopdf['County'].where(~nypopdf['County'].isin(nyCounties), 'New York')

or

nypopdf['County'].where(~nypopdf['County'].isin(nyCounties), 'New York', inplace=True)

Complete example:

nypopdf = pd.DataFrame({'County': ['Albany', 'Allegheny', 'Bronx']})
nyCounties = ["Kings", "Queens", "Bronx", "Richmond", "New York"]
print(nypopdf)
      County
0     Albany
1  Allegheny
2      Bronx
nypopdf['County'].where(~nypopdf['County'].isin(nyCounties), 'New York', inplace=True)
print(nypopdf)
      County
0     Albany
1  Allegheny
2   New York
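
Series.mask is the mirror image of where and saves the ~, if that reads better to you:

nypopdf['County'] = nypopdf['County'].mask(nypopdf['County'].isin(nyCounties), 'New York')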

Python, why is this lambda function not correct?

flight_data is a DataFrame in pandas:

for c in flight_data.columns:
    if 'Delay' in c:
        flight_data[c].fillna(0, inplace=True)

How do I do this in 1 line using lambda function?

map(lambda c: flight_data[c].fillna(0, inplace = True), list(filter(lambda c : 'Delay' in c, flight_data.columns)))

Why aren’t these two equivalent?

When printing out the data, NaN is not replaced by 0.

Solution:

Don’t use lambda

lambda only obfuscates the logic here. Just select the relevant columns and use fillna directly:

cols = df.filter(like='Delay').columns
df[cols] = df[cols].fillna(0)

How do I do this in 1 line using lambda function?

But to answer your question, you can do this in one line without relying on side effects of map or a list comprehension:

df = df.assign(**df.pipe(lambda x: {c: x[c].fillna(0) for c in x.filter(like='Delay')}))
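
As for why your two versions aren't equivalent: in Python 3, map is lazy and returns an iterator, so the lambda never executes until something consumes the iterator, which your bare map(...) expression never does. A minimal demonstration (the column names here are made up):

# nothing happens when the map object is created: map() is lazy in Python 3
m = map(lambda c: print(f'would fill {c}'), ['ArrDelay', 'DepDelay'])
# only consuming the iterator triggers the calls
list(m)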

Find the dictionary in a list which has the key-value pair 'isGeo': True

How do I find the dictionary in the list below that has the key-value pair 'isGeo': True?

dimensions = [{'key': 2600330, 'id': 'location', 'name': 'Location', 'isGeo': True, 'geoType': 'region'}, {'key': 2600340, 'id': 'subject', 'name': 'Subject', 'isGeo': False, 'geoType': None}, {'key': 2600350, 'id': 'measure', 'name': 'Measure', 'isGeo': False, 'geoType': None}]

I want the result below:

{'key': 2600330, 'id': 'location', 'name': 'Location', 'isGeo': True, 'geoType': 'region'}

Solution:

Use next with a generator expression:

res = next((d for d in dimensions if d['isGeo']), None)

{'key': 2600330, 'id': 'location', 'name': 'Location', 'isGeo': True, 'geoType': 'region'}

Since you tagged pandas, you can also use Pandas:

import pandas as pd

df = pd.DataFrame(dimensions)
res = df.loc[df['isGeo']].iloc[0].to_dict()

The above solutions assume you want only the first dictionary satisfying your condition. If you want a list of dictionaries use:

res = [d for d in dimensions if d['isGeo']]
res = df.loc[df['isGeo']].to_dict('records')
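
If some dictionaries might lack the 'isGeo' key entirely, dict.get avoids a KeyError:

res = next((d for d in dimensions if d.get('isGeo')), None)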

Pandas replace all numeric values not equal to a specific value

My DataFrame:

            HLLM  HXBX  JHWO  RPNZ  ZHNL
2008-08-31     0     0     0     0     0
2008-09-30     0     0     0     0     0
2008-10-31     3     1     0     0     5
2008-11-30     0    -1     0     0     0

I am trying to replace all values that are NOT equal to 0 with the value 1:

df = df.replace(df != 0, 1)

How can I rewrite this so that it works?

Solution:

You can simply use

df[df != 0] = 1        

            HLLM  HXBX  JHWO  RPNZ  ZHNL
2008-08-31     0     0     0     0     0
2008-09-30     0     0     0     0     0
2008-10-31     1     1     0     0     1
2008-11-30     0     1     0     0     0
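
If you'd rather build a new frame than mutate in place, an equivalent one-liner (assuming all columns are numeric) is:

df = df.ne(0).astype(int)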

Group rows pandas

Background: I have the following dataframe:

import pandas as pd
d = {'day': ["t", "m", "m", "w", "t", "m","w"], 
     'month': ["01", "01", "01", "01", "02","02","02"], 
     'count': [1, 1, 1, 1,1,1,1]}
df = pd.DataFrame(data=d)

I group by day and month:

df.groupby(by=['day','month']).count()

Output:

day  month count    
m    01     2
     02     1
t    01     1
     02     1
w    01     1
     02     1

From here, I would like to organize the data to obtain the following output:

Desired Output:

day  month count    
m    01     2
t    01     1
w    01     1
m    02     1
t    02     1
w    02     1           

I tried df.sort_values('month') and df.sort_values('day'), but neither quite gives me what I am looking for.

Question: What line(s) of code do I need to add to get my desired output?

Solution:

Here you go. It happens to get the day ordering correct only because these abbreviations sort alphabetically; you may want to map the days to 0-6 if you have more days later, as sketched after the output below.

df.groupby(by=['day','month'], as_index=False).count().sort_values(by=['month', 'day'])

  day month  count
0   m    01      2
2   t    01      1
4   w    01      1
1   m    02      1
3   t    02      1
5   w    02      1
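
To make the day ordering explicit rather than a lucky accident of the alphabet, a sketch with an ordered Categorical (day_order here covers only the sample's labels; extend it to your real ones):

day_order = ['m', 't', 'w']  # hypothetical: list all day labels in week order
out = df.groupby(['day', 'month'], as_index=False).count()
out['day'] = pd.Categorical(out['day'], categories=day_order, ordered=True)
out = out.sort_values(['month', 'day'])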