Counting combinations between two Dataframe columns

I’d like to re-format a dataframe such that it shows the counts of combinations of two columns. Here’s an example dataframe:

my_df = pd.DataFrame({'a': ['first', 'second', 'first', 'first', 'third', 'first'],
               'b': ['foo', 'foo', 'bar', 'bar', 'baz', 'baz'],
               'c': ['do', 're', 'mi', 'do', 're', 'mi'],
               'e': ['this', 'this', 'that', 'this', 'those', 'this']})

which looks like this:

        a    b   c      e
0   first  foo  do   this
1  second  foo  re   this
2   first  bar  mi   that
3   first  bar  do   this
4   third  baz  re  those
5   first  baz  mi   this

I want to make a new dataframe that counts the combinations of columns a and c, which would look like this:

c        do   mi   re
a                    
first   2.0  2.0  NaN
second  NaN  NaN  1.0
third   NaN  NaN  1.0

I can do this using pivot_table if I set the values argument equal to some other column:

my_pivot_count1 = my_df.pivot_table(values='b', index='a', columns='c', aggfunc='count')

The problem with this is that column ‘b’ could have NaN values in it, in which case that combination wouldn’t be counted. For example, if my_df looks like this:

        a    b   c      e
0   first  foo  do   this
1  second  foo  re   this
2   first  bar  mi   that
3   first  bar  do   this
4   third  baz  re  those
5   first  NaN  mi   this

then my call to my_df.pivot_table only counts rows where b is not NaN:

c        do   mi   re
a                    
first   2.0  1.0  NaN
second  NaN  NaN  1.0
third   NaN  NaN  1.0

I’ve gotten around using b as the values argument for now by pointing values at a new column that is guaranteed to have no NaNs, created with either my_df['count'] = 1 or my_df.reset_index(). But is there a way to get what I want without adding a column, using only columns a and c?

Solution:

pandas.crosstab has a dropna argument, which defaults to True; in your case you can pass False. Because crosstab only looks at the two columns you give it, NaNs in b no longer matter:

pd.crosstab(my_df['a'], my_df['c'], dropna=False)
# c       do  mi  re
# a                 
# first    2   2   0
# second   0   0   1
# third    0   0   1
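To confirm this sidesteps the NaN problem, here is the same call run against the second dataframe from the question (the one with a NaN in b); the (first, mi) combination is still counted twice:

```python
import numpy as np
import pandas as pd

# The dataframe with a NaN in column b, as in the question
my_df = pd.DataFrame({'a': ['first', 'second', 'first', 'first', 'third', 'first'],
                      'b': ['foo', 'foo', 'bar', 'bar', 'baz', np.nan],
                      'c': ['do', 're', 'mi', 'do', 're', 'mi']})

# crosstab only uses columns a and c, so the NaN in b has no effect
counts = pd.crosstab(my_df['a'], my_df['c'], dropna=False)
print(counts)
```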

Revert a frequency table

Let’s assume you have a pandas DataFrame that holds frequency information like this:

data = [[1,1,2,3],
        [1,2,3,5],
        [2,1,6,1],
        [2,2,2,4]]
df = pd.DataFrame(data, columns=['id', 'time', 'CountX1', 'CountX2'])

# id    time    CountX1     CountX2
# 0     1   1   2   3
# 1     1   2   3   5
# 2     2   1   6   1
# 3     2   2   2   4

I am looking for a simple command (e.g. using pd.pivot or pd.melt()) to revert these frequencies to tidy data that should look like this:

id time variable
0   1   X1
0   1   X1
0   1   X2
0   1   X2
0   1   X2
1   1   X1
1   1   X1
1   1   X1
1   1   X2 ...  # 5x repeated
2   1   X1 ...  # 6x repeated
2   1   X2 ...  # 1x repeated
2   2   X1 ...  # 2x repeated
2   2   X2 ...  # 4x repeated

Solution:

You can use melt + repeat.

v = df.melt(['id', 'time'])
r = v.pop('value')

df = pd.DataFrame(
        v.values.repeat(r, axis=0),  columns=v.columns
)\
       .sort_values(['id', 'time'])\
       .reset_index(drop=True)

   id time variable
0   1    1  CountX1
1   1    1  CountX1
2   1    1  CountX2
3   1    1  CountX2
4   1    1  CountX2
5   1    2  CountX1
6   1    2  CountX1
7   1    2  CountX1
8   1    2  CountX2
9   1    2  CountX2
10  1    2  CountX2
11  1    2  CountX2
12  1    2  CountX2
13  2    1  CountX1
14  2    1  CountX1
15  2    1  CountX1
16  2    1  CountX1
17  2    1  CountX1
18  2    1  CountX1
19  2    1  CountX2
20  2    2  CountX1
21  2    2  CountX1
22  2    2  CountX2
23  2    2  CountX2
24  2    2  CountX2
25  2    2  CountX2

This produces the ordering as depicted in your question.
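Note that going through v.values.repeat builds a NumPy object array, which upcasts the id and time columns to object dtype. If you want to keep the original dtypes, an equivalent sketch repeats the row index instead:

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 2, 3], [1, 2, 3, 5], [2, 1, 6, 1], [2, 2, 2, 4]],
                  columns=['id', 'time', 'CountX1', 'CountX2'])

v = df.melt(['id', 'time'])
# Repeat each melted row by its count, then drop the count column
out = (v.loc[v.index.repeat(v['value'])]
        .drop(columns='value')
        .sort_values(['id', 'time'])
        .reset_index(drop=True))
print(out)
```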


Performance

df = pd.concat([df] * 100, ignore_index=True)

# jezrael's stack solution

%%timeit
a = df.set_index(['id','time']).stack()
a.loc[a.index.repeat(a)].reset_index().rename(columns={'level_2':'a'}).drop(0, axis=1)

1 loop, best of 3: 173 ms per loop

# jezrael's melt solution
%%timeit
a = df.melt(['id','time'])
a.loc[a.index.repeat(a['value'])].drop('value', axis=1).sort_values(['id', 'time']).reset_index(drop=True)

100 loops, best of 3: 6.84 ms per loop

# in this answer

%%timeit
v = df.melt(['id', 'time'])
r = v.pop('value')

pd.DataFrame(
        v.values.repeat(r, axis=0),  columns=v.columns
)\
       .sort_values(['id', 'time'])\
       .reset_index(drop=True)

100 loops, best of 3: 4.65 ms per loop

How to replace all instance of a String with another indexed string Python

This question is a little difficult to articulate with my inadequate English but I will do my best.

I have a directory of xml files, each file contains xml such as:

<root>
    <fields>
        <field>
            <description/>
            <region id="Number.T2S366_R_487" page="1"/>
        </field>
        <field>
            <description/>
            <region id="Number.T2S366_R_488.`0" page="1"/>
            <region id="String.T2S366_R_488.`1" page="1"/>
        </field>
    </fields>
</root>

I’d like to do a string replacement on the lines which contain the dot, tick, number notation such as .`0, replacing it with an index notation like [0], [1], [2], and so forth.

So the transformed xml payload should look like something below:

<root>
    <fields>
        <field>
            <description/>
            <region id="Number.T2S366_R_487" page="1"/>
        </field>
        <field>
            <description/>
            <region id="Number.T2S366_R_488[0]" page="1"/>
            <region id="String.T2S366_R_488[1]" page="1"/>
        </field>
    </fields>
</root>

How can I accomplish this using Python? This seems fairly straightforward to do with a regex, but that would be tedious to apply by hand across a directory containing multiple files. I’d like to see an implementation using Python 3.x, as I am learning it.

Solution:

In Python you can loop over all files in your directory with os.listdir and make substitutions in-place with fileinput:

import os
import re
import fileinput

path = '/home/arabian_albert/'
for f in os.listdir(path):
    # os.listdir returns bare filenames, so join them with the directory path
    with fileinput.FileInput(os.path.join(path, f), inplace=True, backup='.bak') as file:
        for line in file:
            print(re.sub(r'\.`(\d+)', r'[\1]', line), end='')

However, you should consider doing this from the command line with sed:

find . -type f -exec sed -i.bak -E 's/\.`([0-9]+)/[\1]/g' {} \;

The above will make the substitution for all files in the current directory, backing up the old files with a .bak extension.
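For comparison, here is a pathlib-based sketch that reads each whole file and writes it back; the *.xml glob and the directory layout are assumptions, so adjust them to your tree:

```python
import re
from pathlib import Path

PATTERN = re.compile(r'\.`(\d+)')

def rewrite_ids(directory):
    """Replace each .`N suffix with [N] in every .xml file under directory."""
    for xml_file in Path(directory).glob('*.xml'):
        text = xml_file.read_text()
        xml_file.write_text(PATTERN.sub(r'[\1]', text))
```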

Creating a function that can convert a list into a dictionary in python

I’m trying to create a function that will convert a given list into a given dictionary (where I can specify/assign values if I want).

So for instance, if I have a list

['a', 'b', 'c', ..., 'z']

and I want to convert to a dictionary like this

{1: 'a', 2: 'b', 3: 'c', ..., 26: 'z'}

I know how to do this using a dictionary comprehension

{num: chr(96 + num) for num in range(1, 27)}

but I can’t figure out how to make this into a more generalized function that would be able to turn any list into a dictionary. What’s the best approach here?

Solution:

Pass the enumerated list to the dict constructor:

>>> items = ['a', 'b', 'c']
>>> dict(enumerate(items, 1))
{1: 'a', 2: 'b', 3: 'c'}

Here enumerate(items, 1) yields (index, element) tuples, so the index becomes the key. Indices start from 1 (note the second argument to enumerate). Using this expression you can define a function inline like:

>>> func = lambda x: dict(enumerate(x, 1))

Invoke it like:

>>> func(items)
{1: 'a', 2: 'b', 3: 'c'}

Or as a regular function:

>>> def create_dict(items):
...     return dict(enumerate(items, 1))
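If you want to generalize beyond positional indices, e.g. pairing an arbitrary key sequence with the list, a small sketch (the name make_dict is mine):

```python
def make_dict(keys, values):
    """Pair each key with the value at the same position."""
    return dict(zip(keys, values))

print(make_dict(range(1, 4), ['a', 'b', 'c']))  # {1: 'a', 2: 'b', 3: 'c'}
```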

Numpy set absolute value in place

You have an array A, and you want to replace every value in it with its absolute value. The problem is that

numpy.abs(A)

creates a new array, leaving the values in A untouched. I have found two ways to put the absolute values back into A:

A *= numpy.sign(A)

or

A[:] = numpy.abs(A)

Based on a timeit test, their performance is almost the same.

Question:

Are there more efficient ways to perform this task?

Solution:

There’s an out parameter, which updates the array in-place:

numpy.abs(A, out=A)

It also happens to be a lot faster, because no memory has to be allocated for a new array.

A = np.random.randn(1000, 1000)

%timeit np.abs(A)
100 loops, best of 3: 2.9 ms per loop

%timeit np.abs(A, out=A)
1000 loops, best of 3: 647 µs per loop
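A minimal check that out=A really writes into the same buffer rather than allocating a new array:

```python
import numpy as np

A = np.array([-1.5, 2.0, -3.25])
result = np.abs(A, out=A)

print(A)            # [1.5  2.   3.25]
# ufuncs return the out array itself, so result is A, not a copy
print(result is A)  # True
```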

How to check if a TypeError is for NoneType

I am testing some code I have written, and when functioning normally, it could raise a TypeError of the variety where ‘NoneType’ object is not iterable. Since this is expected, I would like to deal with it as it happens, without inadvertently hiding other TypeErrors.

How can I test if my TypeError is for ‘NoneType’, and not for other reasons? I have looked at the attributes of the TypeError but can’t seem to understand which would tell me the cause for the error.

Solution:

You’re essentially trying to check for None types since the error you’re receiving indicates directly that you’re trying to iterate over some variable that is None. I’d use the classic python check for None to determine when that is the case:

if my_iterable_collection is None:
    ...  # no good: handle the expected None case here
else:
    ...  # we're good: safe to iterate

This is how I’d find and surface None-related issues, rather than using try/except: you’ll know the exact error is related to None with a high degree of certainty.

Pythonic handling of IF block with two possible values

If I have a function or method which accepts a parameter which can only be one of two values, is it more pythonic to explicitly state both known conditions or abstract one away in the else clause? For example:

Option 1:

def main(group_name):
    if group_name == 'request':
        do_something()
    else:
        do_something_else()

Or option 2:

def main(group_name):
    if group_name == 'request':
        do_something()
    elif group_name == 'response':
        do_something_else()
    else:
        raise Exception

Solution:

Explicit is better than implicit. https://www.python.org/dev/peps/pep-0020/

More importantly, the second option is probably safer in many scenarios. If only two values X and Y are valid, you shouldn’t assume the value is Y merely because it isn’t X, which is what the bare else clause does.
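A sketch of option 2 with a more specific exception; ValueError and the handler bodies here are illustrative stand-ins for the question's do_something functions:

```python
def main(group_name):
    if group_name == 'request':
        return 'handled request'    # stand-in for do_something()
    elif group_name == 'response':
        return 'handled response'   # stand-in for do_something_else()
    else:
        # Fail loudly on anything unexpected instead of silently
        # treating it as the second case
        raise ValueError(f'unexpected group_name: {group_name!r}')
```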

Python struct calsize different from actual

I am trying to read one short and one long from a binary file using Python’s struct module. But

print(struct.calcsize("hl"))  # o/p 16

looks wrong: it should be 2 bytes for the short and 8 bytes for the long, i.e. 10 in total. I am not sure whether I am using the struct module the wrong way.

When I print the size of each format code individually, it is:

print(struct.calcsize("h")) # o/p 2
print(struct.calcsize("l")) # o/p 8

Is there a way to force Python to maintain the exact sizes of the datatypes?

Solution:

Under the default struct alignment rules, 16 is the correct answer. Each field is aligned to match its size, so you end up with two bytes for the short, then six bytes of padding (to reach the next address that is a multiple of eight bytes), then eight bytes for the long.

You can use a byte order prefix (any of them disables padding), but the prefixes also disable machine-native sizes (so struct.calcsize("=l") will be a fixed 4 bytes on all systems, and struct.calcsize("=hl") will be 6 bytes on all systems, not 10, even on systems with 8-byte longs).
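For example, with the = prefix the sizes are the standard fixed widths on every platform, while the unprefixed format uses native sizes plus padding:

```python
import struct

# Standard sizes with "=": no padding, fixed widths everywhere
print(struct.calcsize("=h"))   # 2
print(struct.calcsize("=l"))   # 4
print(struct.calcsize("=hl"))  # 6, even where the native long is 8 bytes

# Native mode: machine sizes plus alignment padding (platform dependent;
# 16 on a typical 64-bit Linux build)
print(struct.calcsize("hl"))
```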

If you want to compute struct sizes for arbitrary structures using machine native types with non-default padding rules, you’ll need to go to the ctypes module, define your ctypes.Structure subclass with the desired _pack_ setting, then use ctypes.sizeof to check the size, e.g.:

from ctypes import Structure, c_long, c_short, sizeof

class HL(Structure):
    _pack_ = 1  # Disables padding for field alignment
    # Defines (unnamed) fields, a short followed by long
    _fields_ = [("", c_short),
                ("", c_long)]

print(sizeof(HL))

which outputs 10 as desired.

This could be factored out into a utility function if needed (this simplified example doesn’t handle all struct format codes, but you can expand it as needed):

from ctypes import *

FMT_TO_TYPE = dict(zip("cb?hHiIlLqQnNfd",
                       (c_char, c_byte, c_bool, c_short, c_ushort, c_int, c_uint,
                        c_long, c_ulong, c_longlong, c_ulonglong, 
                        c_ssize_t, c_size_t, c_float, c_double)))

def calcsize(fmt, pack=None):
    '''Compute size of a format string with arbitrary padding (defaults to native)'''
    class _(Structure):
        if pack is not None:
            _pack_ = pack
        _fields_ = [("", FMT_TO_TYPE[c]) for c in fmt]
    return sizeof(_)

which, once defined, lets you compute sizes padded or unpadded like so:

>>> calcsize("hl")     # Defaults to native "natural" alignment padding
16
>>> calcsize("hl", 1)  # pack=1 means no alignment padding between members
10

Compute percentile rank relative to a given population

I have “reference population” (say, v=np.random.rand(100)) and I want to compute percentile ranks for a given set (say, np.array([0.3, 0.5, 0.7])).

It is easy to compute one by one:

def percentile_rank(x):
    return (v<x).sum() / len(v)
percentile_rank(0.4)
=> 0.4

(actually, there is an out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors).

np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33  0.48  0.71]

This produces the expected results, but I have a feeling that there should be a built-in for this.

I can also cheat:

pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]

0    0.330097
1    0.485437
2    0.718447

This is bad on two counts:

  1. I don’t want the test data [0.3, 0.5, 0.7] to be a part of the ranking.
  2. I don’t want to waste time computing ranks for the reference population.

So, what is the idiomatic way to accomplish this?

Solution:

Setup:

In [62]: v=np.random.rand(100)

In [63]: x=np.array([0.3, 0.4, 0.7])

Using Numpy broadcasting:

In [64]: (v<x[:,None]).mean(axis=1)
Out[64]: array([ 0.18,  0.28,  0.6 ])

Check:

In [67]: percentile_rank(0.3)
Out[67]: 0.17999999999999999

In [68]: percentile_rank(0.4)
Out[68]: 0.28000000000000003

In [69]: percentile_rank(0.7)
Out[69]: 0.59999999999999998
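An alternative that avoids the O(len(v) * len(x)) broadcast for large inputs is to sort the reference population once and binary-search into it; side='left' counts the elements strictly below each score, matching (v < x).sum():

```python
import numpy as np

v = np.random.rand(100)
x = np.array([0.3, 0.5, 0.7])

# Number of reference values strictly below each score, via one sort
ranks = np.searchsorted(np.sort(v), x, side='left') / len(v)
print(ranks)
```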

Efficient numpy argsort with condition while maintaining original indices

I’m wondering what the most efficient way is to argsort an array subject to a condition, while preserving the original indices.

x = np.array([0.63, 0.5, 0.7, 0.65])

np.argsort(x)
Out[99]: array([0, 1, 3, 2])

I want to argsort this array with condition that x>0.6. Since 0.5 < 0.6, index 1 should not be included.

x = np.array([0.63, 0.5, 0.7, 0.65])
index = x.argsort()
list(filter(lambda i: x[i] > 0.6, index))

[0,3,2]

This is inefficient since it’s not vectorized.

EDIT:
The filter will eliminate most of the elements, so ideally it would filter first and then sort, while preserving the original indices.

Solution:

Method 1 (@jp_data_analysis answer)

You should use this one unless you have reason not to.

def meth1(x, thresh):
    return np.argsort(x)[(x <= thresh).sum():]

Method 2

If the filter will greatly reduce the number of elements in the array and the array is large, then the following may help:

def meth2(x, thresh):
    m = x > thresh
    idxs = np.argsort(x[m])
    offsets = (~m).cumsum()
    return idxs + offsets[m][idxs]

Speed comparison

x = np.random.rand(10000000)

%timeit meth1(x, 0.99)
# 2.81 s ± 244 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit meth2(x, 0.99)
# 104 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
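A quick self-contained sanity check of both methods on the array from the question (threshold 0.6, so index 1 is dropped):

```python
import numpy as np

def meth1(x, thresh):
    # argsort ascending, then slice off the indices of elements <= thresh
    return np.argsort(x)[(x <= thresh).sum():]

def meth2(x, thresh):
    m = x > thresh
    idxs = np.argsort(x[m])            # sorted order within the filtered array
    offsets = (~m).cumsum()            # dropped elements before each position
    return idxs + offsets[m][idxs]     # map back to original indices

x = np.array([0.63, 0.5, 0.7, 0.65])
print(meth1(x, 0.6))  # [0 3 2]
print(meth2(x, 0.6))  # [0 3 2]
```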