NumPy broadcasting to improve dot-product performance

This is a rather simple operation, but it is repeated millions of times in my actual code and, if possible, I’d like to improve its performance.

import numpy as np

# Initial data array
xx = np.random.uniform(0., 1., (3, 14, 1))
# Coefficients used to modify 'xx'
a, b, c = np.random.uniform(0., 1., 3)

# Operation on 'xx' to obtain the final array 'yy'
yy = xx[0] * a * b + xx[1] * b + xx[2] * c

The last line is the one I’d like to improve. Basically, each term in xx is multiplied by a factor (given by the a, b, c coefficients) and then all terms are added to give a final yy array with the shape (14, 1) vs the shape of the initial xx array (3, 14, 1).

Is it possible to do this via numpy broadcasting?

Solution:

For the first alternative, we could use broadcasted multiplication and then sum along the first axis.

For the second, we could bring in matrix multiplication with np.dot. That gives us two more approaches. Here are the timings for the sample provided in the question:

# Original one
In [81]: %timeit xx[0] * a * b + xx[1] * b + xx[2] * c
100000 loops, best of 3: 5.04 µs per loop

# Proposed alternative #1
In [82]: %timeit (xx *np.array([a*b,b,c])[:,None,None]).sum(0)
100000 loops, best of 3: 4.44 µs per loop

# Proposed alternative #2
In [83]: %timeit np.array([a*b,b,c]).dot(xx[...,0])[:,None]
1000000 loops, best of 3: 1.51 µs per loop
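
Before trusting the timings, it is worth checking that the alternatives really produce the same array as the original expression. The sketch below does that for the sample shapes from the question; the seeded generator and the extra np.tensordot variant are additions for illustration and were not part of the original answer.

```python
import numpy as np

# Seeded generator so the check is reproducible (an assumption, not in the question)
rng = np.random.default_rng(0)
xx = rng.uniform(0.0, 1.0, (3, 14, 1))
a, b, c = rng.uniform(0.0, 1.0, 3)

# Original expression from the question
original = xx[0] * a * b + xx[1] * b + xx[2] * c

# Pack the three effective coefficients into one vector
coeffs = np.array([a * b, b, c])

# Alternative #1: broadcast (3, 1, 1) against (3, 14, 1), then sum over axis 0
via_broadcast = (xx * coeffs[:, None, None]).sum(axis=0)

# Alternative #2: (3,) dot (3, 14) -> (14,), then restore the trailing axis
via_dot = coeffs.dot(xx[..., 0])[:, None]

# Extra variant: contract the first axis of xx directly, result is (14, 1)
via_tensordot = np.tensordot(coeffs, xx, axes=1)

for candidate in (via_broadcast, via_dot, via_tensordot):
    assert candidate.shape == original.shape
    assert np.allclose(candidate, original)
```

Timings depend heavily on array size, so the ranking above may change for much larger inputs.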

How can I determine the reason for a Python Type Error

I’m currently using a try/except block to treat a particular variable as an iterable when I can, but handle it in a different, though correct, manner when it isn’t iterable.

My problem is that a TypeError may be raised for reasons other than trying to iterate over a non-iterable. My check was to use the message attached to the TypeError to ensure that this was the reason and not something like an unsupported operand.

But the message attribute of exceptions has been deprecated. So, how can I check the reason for my TypeError?

For the sake of completeness, the code I’m using is fairly similar to this:

try:
    deref = [orig[x].value.flatten() for x in y]
except TypeError as ex:
    if "object is not iterable" in ex.message:
        x = y
        deref = [orig[x].value.flatten()]
    else:
        raise

Solution:

Separate the part that throws the exception you’re interested in from the parts that throw unrelated exceptions:

try:
    iterator = iter(y)
except TypeError:
    handle_that()
else:
    do_whatever_with([orig[x].value.flatten() for x in iterator])
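
As a rough, self-contained sketch of the same idea: only the call to iter() sits inside the try block, so a TypeError raised while processing the items is not swallowed. The function name collect_values and the plain-dict lookup are hypothetical stand-ins for the orig[x].value.flatten() call in the question.

```python
def collect_values(orig, y):
    """Look up every key in y, treating a non-iterable y as a single key."""
    try:
        keys = iter(y)       # the only operation guarded by the try block
    except TypeError:
        keys = iter([y])     # y is not iterable: wrap it as a one-item sequence
    # Any TypeError raised here (e.g. orig is not subscriptable) propagates
    return [orig[key] for key in keys]


lookup = {1: "a", 2: "b"}
print(collect_values(lookup, [1, 2]))  # iterable of keys
print(collect_values(lookup, 2))       # single, non-iterable key
```

Note that strings are iterable, so a string y would be walked character by character; if that matters, it needs an explicit isinstance check.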

Python – How to pass a method as an argument to call a method from another library

I want to pass a method as an argument that will call such method from another python file as follows:

file2.py

def abc():
    return 'success.'

main.py

import file2
def call_method(method_name):
    #Here the method_name passed will be a method to be called from file2.py
    return file2.method_name()

print(call_method(abc))

What I expect is to return success.

If calling a method within the same file (main.py), I notice it is workable. However, for case like above where it involves passing an argument to be called from another file, how can I do that?

Solution:

You can use getattr to get the function from the module using a string like:

import file2
def call_method(method_name):
    return getattr(file2, method_name)()

print(call_method('abc'))
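
Since file2.py is not available here, the sketch below shows the same getattr lookup against the standard-library math module instead. The helper name call_function and the callable check are illustrative additions, not part of the original answer.

```python
import math


def call_function(module, name, *args):
    """Look up `name` on `module` by string and call it with `args`."""
    func = getattr(module, name)  # raises AttributeError if the name is missing
    if not callable(func):
        raise TypeError("{0}.{1} is not callable".format(module.__name__, name))
    return func(*args)


print(call_function(math, "sqrt", 16.0))
```

If the caller already has access to the module, passing the function object itself (for example call_method(file2.abc) and then calling method_name() inside) avoids the string lookup entirely.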

Create module function alias by import or assignment

Say I have imported a module using

import m

and now I want an alias to its function, I can use

from m import f as n

or

n = m.f

I think there is no difference. Is one preferred over the other?

Solution:

There is no difference, as far as using the object n is concerned.

There is a slight logical difference: import m followed by n = m.f leaves the name m bound in scope, whereas from m import f as n on its own does not. Either way, the m module still gets loaded into sys.modules.

Using the import statement for this is more commonly seen.
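
The points above can be checked with standard-library modules standing in for the hypothetical module m. In this sketch, math plays the role of m for the identity check, and json is imported inside a fresh namespace (via exec, purely for demonstration) to show which names end up bound.

```python
import sys
import math
from math import sqrt as n_import   # alias created by the import statement

n_assign = math.sqrt                # alias created by plain assignment

# Both names refer to the very same function object
print(n_import is n_assign)

# Run the from-import in an empty namespace to see which names it binds
namespace = {}
exec("from json import dumps as n", namespace)
print("n" in namespace)        # the alias is bound
print("json" in namespace)     # the module name itself is not bound
print("json" in sys.modules)   # but the module has still been loaded
```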

Index a NumPy array row-wise

Say I have a NumPy array:

>>> X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
>>> X
array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

and an array of indexes that I want to select for each row:

>>> ixs = np.array([[1, 3], [0, 1], [1, 2]])
>>> ixs
array([[1, 3],
       [0, 1],
       [1, 2]])

How do I index the array X so that for every row in X I select the two indices specified in ixs?

So for this case, I want to select element 1 and 3 for the first row, element 0 and 1 for the second row, and so on. The output should be:

array([[2, 4],
       [5, 6],
       [10, 11]])

A slow solution would be something like this:

output = np.array([row[ix] for row, ix in zip(X, ixs)])

however this can get kinda slow for extremely long arrays. Is there a faster way to do this without a loop using NumPy?

EDIT: Some very approximate speed tests on a 2.5K * 1M array (10GB):

np.array([row[ix] for row, ix in zip(X, ixs)]) 0.16s

X[np.arange(len(ixs)), ixs.T].T 0.175s

X.take(ixs+np.arange(0, X.shape[0]*X.shape[1], X.shape[1])[:,None]) 33s

np.fromiter((X[i, j] for i, row in enumerate(ixs) for j in row), dtype=X.dtype).reshape(ixs.shape) 2.4s

Solution:

You can use this:

X[np.arange(len(ixs)), ixs.T].T

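This relies on NumPy’s advanced (integer array) indexing: the row indices np.arange(len(ixs)) are broadcast against ixs.T, and the final .T restores one row of picks per row of X.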
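
For comparison, here is a small sketch using the arrays from the question that checks the transpose-based expression against two equivalent spellings: a column of row indices broadcast against ixs, and np.take_along_axis (available in NumPy 1.15 and later). Neither alternative appears in the original answer, and no claim is made here about their relative speed.

```python
import numpy as np

X = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
ixs = np.array([[1, 3], [0, 1], [1, 2]])

# Expression from the answer: index with transposed columns, transpose back
by_transpose = X[np.arange(len(ixs)), ixs.T].T

# Same selection with row indices shaped (3, 1) so they broadcast against (3, 2)
rows = np.arange(len(ixs))[:, None]
by_broadcast = X[rows, ixs]

# Same selection via the dedicated helper
by_take_along = np.take_along_axis(X, ixs, axis=1)

print(by_transpose)
```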

Python iterate through array while finding the mean of the top k elements

Suppose I have a Python array a=[3, 5, 2, 7, 5, 3, 6, 8, 4]. My goal is to iterate through this array 3 elements at a time returning the mean of the top 2 of the three elements.

Using the array above, during my iteration step, the first three elements are [3, 5, 2] and the mean of the top 2 elements is 4. The next three elements are [5, 2, 7] and the mean of the top 2 elements is 6. The next three elements are [2, 7, 5] and the mean of the top 2 elements is again 6. …

Hence, the result for the above array would be [4, 6, 6, 6, 5.5, 7, 7].

What is the nicest way to write such a function?

Solution:

Solution

You can use some fancy slicing of your list to manipulate subsets of elements. Simply grab each three element sublist, sort to find the top two elements, and then find the simple average (aka. mean) and add it to a result list.

Code

def get_means(input_list):
    means = []
    for i in range(len(input_list)-2):
        three_elements = input_list[i:i+3]
        top_two = sorted(three_elements, reverse=True)[:2]
        means.append(sum(top_two)/2.0)
    return means

Example

print(get_means([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
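
If the window size or the number of top elements might change, the same loop can be parameterised. This is only a sketch: the function name top_k_window_means and its window and k parameters are hypothetical, and heapq.nlargest is swapped in for the sort-and-slice step.

```python
from heapq import nlargest


def top_k_window_means(values, window=3, k=2):
    """Mean of the k largest items in each sliding window of `values`."""
    if not 1 <= k <= window:
        raise ValueError("k must be between 1 and window")
    means = []
    # One window per valid start position; empty result if values is too short
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        means.append(sum(nlargest(k, chunk)) / k)
    return means


print(top_k_window_means([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
```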

numpy broadcasting to all dimensions

I have a 3d numpy array built like this:

a = np.ones((3,3,3))

And I would like to broadcast values on all dimensions starting from a certain point with given coordinates, but the number of dimensions may vary.

For example, if I’m given the coordinates (1,1,1) I can do these 3 assignments:

a[1,1,:] = 0
a[1,:,1] = 0
a[:,1,1] = 0

And the result will be my desired output which is:

array([[[1., 1., 1.],
        [1., 0., 1.],
        [1., 1., 1.]],

       [[1., 0., 1.],
        [0., 0., 0.],
        [1., 0., 1.]],

       [[1., 1., 1.],
        [1., 0., 1.],
        [1., 1., 1.]]])

Or if I’m given the coordinates (0,1,0) the corresponding broadcast will be:

a[0,1,:] = 0
a[0,:,0] = 0
a[:,1,0] = 0

Is there any way to do this in a single action instead of 3? I’m asking because the actual arrays I’m working with have even more dimensions, which makes the code seem long and redundant. Also, if the number of dimensions changes I would have to rewrite the code.

EDIT: It doesn’t have to be a single action, I just need to do it in all dimensions programmatically so that if the number of dimensions changes the code stays the same.

EDIT 2: About the logic of this, I’m not sure if it’s relevant, but I’m given the value of a point (by coordinates) on a map, and based on that I know the values of the entire row, column and height on the same map (that’s why I’m updating all 3 with 0 as an example). In other cases the map is 2-dimensional and I still know the same thing about the row and column, but I can’t figure out a function that works for a varying number of dimensions.

Solution:

Here’s a way to generate, as strings, exactly the 3 lines of code you’re currently using, and then execute them:

import numpy as np

a = np.ones([3,3,3])
coord = [1, 1, 1]

for i in range(len(coord)):
   temp = coord[:]
   temp[i] = ':'
   slice_str = ','.join(map(str, temp))
   exec("a[%s] = 0"%slice_str)

print(a)

This may not be the best approach, but at least it’s amusing. Now that we know that it works, we can go out and find the appropriate syntax to do it without actually generating the string and execing it. For example, you could use slice:

import numpy as np

a = np.ones([3,3,3])
coord = [1, 1, 1]

for i, length in enumerate(a.shape):
   temp = coord[:]
   temp[i] = slice(length)
   # Index with a tuple: a list that contains slices is not a valid index in current NumPy
   a[tuple(temp)] = 0
print(a)
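
The slice-based loop can be wrapped in a small function that works for any number of dimensions. This is a sketch under the assumption that coord has exactly one entry per axis; the name fill_lines_through and the value parameter are hypothetical.

```python
import numpy as np


def fill_lines_through(arr, coord, value=0):
    """Set every axis-aligned line passing through `coord` to `value`, in place."""
    if len(coord) != arr.ndim:
        raise ValueError("coord needs one entry per array dimension")
    for axis in range(arr.ndim):
        index = list(coord)
        index[axis] = slice(None)   # take the whole extent along this axis
        arr[tuple(index)] = value   # tuple index: basic slicing, no copies
    return arr


a3 = fill_lines_through(np.ones((3, 3, 3)), (1, 1, 1))
a2 = fill_lines_through(np.ones((4, 5)), (2, 3))
print(a3)
```

The same call works unchanged for the 2-dimensional map mentioned in the question.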

Pandas: Filtering multiple conditions

I’m trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:

temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()

However, if I do this (which I think should be equivalent), I get no rows back:

temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()

Any idea what accounts for the difference?

Solution:

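Use parentheses because of operator precedence: & binds more tightly than comparisons such as > and ==, so ~temp2["Def"] & temp2["days since"] > 7 is evaluated as (~temp2["Def"] & temp2["days since"]) > 7. Parenthesize each comparison instead: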

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

Alternatively, create the conditions on separate lines:

cond1 = df["bin"] == 3    
cond2 = df["days since"] > 7
cond3 = ~df["Def"]

temp2 = df[cond1 & cond2 & cond3]

Sample:

df = pd.DataFrame({'Def':[True] *2 + [False]*4,
                   'days since':[7,8,9,14,2,13],
                   'bin':[1,3,5,3,3,3]})

print (df)
     Def  bin  days since
0   True    1           7
1   True    3           8
2  False    5           9
3  False    3          14
4  False    3           2
5  False    3          13


temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
     Def  bin  days since
3  False    3          14
5  False    3          13
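
The precedence rule is a property of Python itself rather than of pandas, so it can be illustrated with plain scalars. In this sketch the two variables are hypothetical stand-ins for a single element of ~df["Def"] and of df["days since"].

```python
not_def = True     # stands in for one element of ~df["Def"]
days_since = 8     # stands in for one element of df["days since"]

# Without parentheses, & is applied first, then the comparison
unparenthesized = not_def & days_since > 7

# This is how Python actually groups the expression above
grouped_by_python = (not_def & days_since) > 7

# What was intended: compare first, then combine the booleans
intended = not_def & (days_since > 7)

print(unparenthesized, grouped_by_python, intended)
# False False True
```

Here True & 8 evaluates to 0, and 0 > 7 is False, even though both conditions hold individually.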

How do I split a string into several columns in a dataframe with pandas Python?

I am aware of the following questions:

1.) How to split a column based on several string indices using pandas?
2.) How do I split text in a column into multiple rows?

I want to split these into several new columns though. Suppose I have a dataframe that looks like this:

id    | string
-----------------------------
1     | astring, isa, string
2     | another, string, la
3     | 123, 232, another

I know that using:

df['string'].str.split(',')

I can split a string. But as a next step, I want to efficiently put the split string into new columns like so:

id    | string_1 | string_2 | string_3
-----------------|---------------------
1     | astring  | isa      | string
2     | another  | string   | la
3     | 123      | 232      | another
---------------------------------------

I could for example do this:

for index, row in df.iterrows():
    for i, item in enumerate(row['string'].split(','), start=1):
        df.at[index, 'string_{0}'.format(i)] = item

But how could one achieve the same result more elegantly?

Solution:

The str.split method has an expand argument:

>>> df['string'].str.split(',', expand=True)
         0        1         2
0  astring      isa    string
1  another   string        la
2      123      232   another
>>>

With column names:

>>> df['string'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
   string1  string2   string3
0  astring      isa    string
1  another   string        la
2      123      232   another

Much neater with Python >= 3.6 f-strings:

>>> (df['string'].str.split(',', expand=True)
...              .rename(columns=lambda x: f"string_{x+1}"))
  string_1 string_2  string_3
0  astring      isa    string
1  another   string        la
2      123      232   another
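
To get all the way to the table asked for in the question, the expanded columns still need to be attached to the id column, and the space left after each comma has to be stripped. A possible sketch, with the DataFrame rebuilt from the question's sample rows:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "string": ["astring, isa, string", "another, string, la", "123, 232, another"],
})

# Split on commas into separate columns (labelled 0, 1, 2)
parts = df["string"].str.split(",", expand=True)

# Remove the whitespace that followed each comma
parts = parts.apply(lambda col: col.str.strip())

# Rename 0, 1, 2 to string_1, string_2, string_3
parts.columns = [f"string_{i + 1}" for i in range(parts.shape[1])]

# Attach the new columns to the id column; both share the same index
result = df[["id"]].join(parts)
print(result)
```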

Python concatenating elements of one list that are between elements of another list

I have two lists: a and b. I want to concatenate all of the elements of the b that are between elements of a. All of the elements of a are in b, but b also has some extra elements that are extraneous. I would like to take the first instance of every element of a in b and concatenate it with the extraneous elements that follow it in b until we find another element of a in b. The following example should make it more clear.

a = [[11.0, 1.0], [11.0, 2.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [12.0, 11.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]  

b = [[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0], [11.0, 2.0], [11.0, 3.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 5.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0], [3.0, 1.0], [13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [11.0, 3.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 5.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]

c = [[[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0]], [[11.0, 2.0]], [[11.0, 3.0], [11.0, 3.0]], [[11.0, 4.0]], [[11.0, 5.0]], [[12.0, 1.0]], [[12.0, 2.0]], [[12.0, 3.0]], [[12.0, 4.0]], [[12.0, 5.0]], [[12.0, 6.0]], [[12.0, 7.0], [12.0, 5.0]], [[12.0, 8.0]], [[12.0, 9.0]], [[12.0, 10.0], [13.0, 5.0]], [[12.0, 11.0], [12.0, 8.0], [3.0, 1.0]], [[13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0]], [[13.0, 2.0]], [[11.0, 3.0], [13.0, 3.0]], [[13.0, 4.0]], [[13.0, 5.0], [13.0, 5.0], [13.0, 5.0]], [[13.0, 6.0]], [[13.0, 7.0], [13.0, 7.0]], [[13.0, 8.0]], [[13.0, 9.0]], [[13.0, 10.0]]]

What I have thought of is something like this:

slice_list = []
for i, elem in enumerate(a):
    if i < len(a)-1:
        b_first_index = b.index(a[i])
        b_second_index = b.index(a[i+1])
        slice_list.append([b_first_index, b_second_index])

c = [b[start:end] for start, end in slice_list]

This however will not catch the last item in the list (which I am not quite sure how to fit into my list comprehension anyways) and it seems quite ugly. My question is, is there a neater way of doing this (perhaps in itertools)?

Solution:

I think your example output is incorrect. (The code below uses the names key_list, wrong_list and wrong_list_fixed for the question’s a, b and c.)

        [[12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0],
# There should be a new list here -^

Here’s a solution that walks the lists. It can be optimized further:

from contextlib import suppress

fixed = []
current = []
key_list_iter = iter(key_list)
next_key = next(key_list_iter)
for wrong in wrong_list:
    if wrong == next_key:
        if current:
            fixed.append(current)
            current = []
        next_key = None
        with suppress(StopIteration):
            next_key = next(key_list_iter)
    current.append(wrong)

if current:
    fixed.append(current)

Here are the correct lists (modified to be easier to visually parse):

key_list = ['_a0', '_b0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_m0', '_n0', '_o0', '_p0', '_q0', '_r0', '_s0', '_t0', '_u0', '_v0', '_w0', '_x0', '_y0', '_z0', '_A0', '_B0', '_C0'] 
wrong_list = ['_a0', '_a0', 'D0', '_b0', '_c0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_j0', '_m0', '_n0', '_o0', '_x0', '_p0', '_m0', 'E0', '_t0', 'F0', '_q0', '_r0', '_s0', '_t0', '_u0', '_c0', '_v0', '_w0', '_x0', '_x0', '_x0', '_y0', '_z0', '_z0', '_A0', '_B0', '_C0'] 
wrong_list_fixed = [['_a0', '_a0', 'D0'], ['_b0'], ['_c0', '_c0'], ['_d0'], ['_e0'], ['_f0'], ['_g0'], ['_h0'], ['_i0'], ['_j0'], ['_k0'], ['_l0', '_j0'], ['_m0'], ['_n0'], ['_o0', '_x0'], ['_p0', '_m0', 'E0', '_t0', 'F0'], ['_q0'], ['_r0'], ['_s0'], ['_t0'], ['_u0', '_c0'], ['_v0'], ['_w0'], ['_x0', '_x0', '_x0'], ['_y0'], ['_z0', '_z0'], ['_A0'], ['_B0'], ['_C0']]
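
The same walk can be packaged as a function and tried on a much smaller input. This is a sketch only: the name group_by_keys and the sentinel object are hypothetical, and next() with a default value is used in place of contextlib.suppress.

```python
_EXHAUSTED = object()  # sentinel: no keys left to match


def group_by_keys(keys, items):
    """Start a new group each time the next expected key shows up in items."""
    groups = []
    current = []
    key_iter = iter(keys)
    next_key = next(key_iter, _EXHAUSTED)
    for item in items:
        if next_key is not _EXHAUSTED and item == next_key:
            if current:
                groups.append(current)   # close the group in progress
                current = []
            next_key = next(key_iter, _EXHAUSTED)
        current.append(item)
    if current:
        groups.append(current)           # keep the trailing group
    return groups


print(group_by_keys(["a", "b", "c"], ["a", "a", "x", "b", "c", "c", "b"]))
# [['a', 'a', 'x'], ['b'], ['c', 'c', 'b']]
```

The final "b" stays in the last group because "b" has already been consumed as a key by then, which mirrors how repeated elements are handled in the longer example.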