NumPy vectorized way to change multiple rows of an array (rows can be repeated)

I ran into this problem while implementing the vectorized SVM gradient for cs231n assignment 1.
Here is an example:

import numpy as np

ary = np.array([[1,-9,0],
                [1,2,3],
                [0,0,0]])
ary[[0,1]] += np.ones((2,3),dtype='int')

and it outputs:

array([[ 2, -8,  1],
      [ 2,  3,  4],
      [ 0,  0,  0]])

Everything is fine until the rows are not unique (starting again from the original ary):

ary[[0,1,1]] += np.ones((3,3),dtype='int')

It did not throw an error, but the output was not what I expected:

array([[ 2, -8,  1],
       [ 2,  3,  4],
       [ 0,  0,  0]])

I expected the second row to be [3,4,5] rather than [2,3,4].
The naive way I solved this problem is with a for loop like this:

ary = np.array([[ 2, -8,  1],
                [ 2,  3,  4],
                [ 0,  0,  0]], dtype=float)  # float, so float changes can be added in place
# the rows I want to change
rows = [0,1,2,1,0,1]
# the change matrix, one row of changes per entry in rows
change = np.random.randn(6,3)
for i,row in enumerate(rows):
    ary[row] += change[i]

I don't know how to vectorize this for loop. Is there a better way to do this in NumPy?
And why does something like this give the wrong result?

ary[rows] += change

In case anyone is curious why I want to do this, here is my implementation of the svm_loss_vectorized function. I need to accumulate the gradients of the weights according to the labels y:

def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero

    # transpose X and W
    # D means input dimensions, N means number of train example
    # C means number of classes
    # X.shape will be (D,N)
    # W.shape will be (C,D)
    X = X.T
    W = W.T
    dW = dW.T
    num_train = X.shape[1]
    # transpose W_y shape to (D,N) 
    W_y = W[y].T
    S_y = np.sum(W_y*X ,axis=0)
    margins =  np.dot(W,X) + 1 - S_y
    mask = np.array(margins>0)

    # get the impact of num_train examples made on W's gradient
    # that is,only when the mask is positive 
    # the train example has impact on W's gradient
    dW_j = np.dot(mask, X.T)
    dW +=  dW_j
    mul_mask = np.sum(mask, axis=0, keepdims=True).T

    # dW[y] -= mul_mask * X.T
    dW_y =  mul_mask * X.T
    for i,label in enumerate(y):
        dW[label] -= dW_y[i]

    loss = np.sum(margins*mask) - num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += reg * np.sum(W*W)
    dW += reg * 2 * W
    dW = dW.T

    return loss, dW

Solution:

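Using the built-in np.add.at

NumPy's built-in for such tasks is np.add.at, which performs the addition unbuffered and in place, so contributions to rows that are indexed more than once are accumulated, i.e.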

np.add.at(ary, rows, change)
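To see why the plain ary[rows] += change loses repeated rows, it can help to run both forms side by side on the small example from the question. This is only a minimal sketch; the names buffered and unbuffered are illustrative and not from the original post.

```python
import numpy as np

rows = [0, 1, 1]
change = np.ones((3, 3), dtype=int)

# Fancy-index += behaves like ary[rows] = ary[rows] + change:
# ary[rows] is read into a temporary, the sum is computed, and the result
# is written back. For a repeated index only the last write survives.
buffered = np.array([[1, -9, 0], [1, 2, 3], [0, 0, 0]])
buffered[rows] += change

# np.add.at applies the addition once per occurrence of each index.
unbuffered = np.array([[1, -9, 0], [1, 2, 3], [0, 0, 0]])
np.add.at(unbuffered, rows, change)

print(buffered[1])    # [2 3 4]  (row 1 incremented once)
print(unbuffered[1])  # [3 4 5]  (row 1 incremented twice)
```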

But since we are working with a 2D array, that might not be the most performant option.
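Applied to the question's SVM gradient, the loop over y could be replaced with the subtracting counterpart, np.subtract.at. The sketch below uses made-up sizes and a random dW_y as a stand-in for mul_mask * X.T, so it only checks that the two forms agree, not the gradient itself.

```python
import numpy as np

rng = np.random.default_rng(1)
C, D, N = 4, 5, 8                   # classes, dimensions, examples (made-up sizes)
y = rng.integers(0, C, size=N)      # labels, repeats are expected
dW_y = rng.random((N, D))           # stand-in for mul_mask * X.T

# Loop version from the question.
dW_loop = np.zeros((C, D))
for i, label in enumerate(y):
    dW_loop[label] -= dW_y[i]

# Unbuffered equivalent: row i of dW_y is subtracted from dW[y[i]],
# and repeated labels accumulate.
dW_vec = np.zeros((C, D))
np.subtract.at(dW_vec, y, dW_y)

print(np.allclose(dW_loop, dW_vec))
```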

Leveraging fast matrix-multiplication

As it turns out, we can leverage the very efficient matrix multiplication for this case as well. Given enough repeated rows to sum over, it can be much faster. Here's how to use it:

mask = rows == np.arange(len(ary))[:,None]
ary += mask.dot(change)
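A small check of the mask idea against the explicit loop from the question, as a sketch with randomly generated data (the seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
ary = rng.random((3, 3))
rows = np.array([0, 1, 2, 1, 0, 1])
change = rng.random((6, 3))

# Reference result from the explicit loop in the question.
expected = ary.copy()
for i, row in enumerate(rows):
    expected[row] += change[i]

# mask[r, i] is True when change[i] belongs to row r, so
# mask.dot(change) sums all contributions for each row in one product.
mask = rows == np.arange(len(ary))[:, None]   # shape (3, 6)
result = ary + mask.dot(change)

print(np.allclose(result, expected))
```

Note that mask has len(ary) x len(rows) entries, so this approach trades extra memory for speed.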

Benchmarking

Let's time the np.add.at method against the matrix-multiplication-based one on bigger arrays:

In [681]: ary = np.random.rand(1000,1000)

In [682]: rows = np.random.randint(0,len(ary),(10000))

In [683]: change = np.random.rand(10000,1000)

In [684]: %timeit np.add.at(ary, rows, change)
1 loop, best of 3: 604 ms per loop

In [687]: def matmul_addat(ary, rows, change):
     ...:     mask = rows == np.arange(len(ary))[:,None]
     ...:     ary += mask.dot(change)

In [688]: %timeit matmul_addat(ary, rows, change)
10 loops, best of 3: 158 ms per loop

Vectorized sum-reduction with outer product – NumPy

I'm relatively new to NumPy and often read that you should avoid writing loops. In many cases I understand how to do that, but at the moment I have the following code:

p = np.arange(15).reshape(5,3)
w = np.random.rand(5)
A = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))  # built-in sum; np.sum on a generator is deprecated

Does anybody know if there is a way to avoid the for loop?

Thanks in advance!

Solution:

Approach #1 : With np.einsum

np.einsum('ij,ik,i->jk',p,p,w)

Approach #2 : With broadcasting + np.tensordot

np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))

Approach #3 : With np.einsum + np.dot

np.einsum('ij,i->ji',p,w).dot(p)
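As a quick sanity check, the three approaches can be compared against the original loop. This sketch reuses p from the question and draws w from a seeded generator (the seed is arbitrary):

```python
import numpy as np

p = np.arange(15).reshape(5, 3)
w = np.random.default_rng(0).random(5)

# Reference: the explicit weighted sum of outer products.
A_loop = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))

A1 = np.einsum('ij,ik,i->jk', p, p, w)
A2 = np.tensordot(p[..., None] * p[:, None], w, axes=((0), (0)))
A3 = np.einsum('ij,i->ji', p, w).dot(p)

print(all(np.allclose(A_loop, A) for A in (A1, A2, A3)))
```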

Runtime test

Set #1 :

In [653]: p = np.random.rand(50,30)

In [654]: w = np.random.rand(50)

In [655]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10000 loops, best of 3: 101 µs per loop

In [656]: %timeit np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
10000 loops, best of 3: 124 µs per loop

In [657]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
100000 loops, best of 3: 9.07 µs per loop

Set #2 :

In [658]: p = np.random.rand(500,300)

In [659]: w = np.random.rand(500)

In [660]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10 loops, best of 3: 139 ms per loop

In [661]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
1000 loops, best of 3: 1.01 ms per loop

The third approach just blew everything else away!

Why is Approach #3 10x-130x faster than Approach #1?

np.einsum is implemented in C. In the first approach, the subscript string uses three index labels (i, j, k), so we get three nested loops (in C, of course) over every (i, j, k) combination. That's a lot of loop overhead.

With the third approach, the einsum call uses only two index labels (i, j), hence two nested loops (in C again), and the remaining contraction over i goes through np.dot, which leverages BLAS-based matrix multiplication. These two factors are responsible for the large speedup.
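The same split can be written without einsum at all, which makes the two steps explicit: an elementwise scaling followed by one matrix product. This is a sketch under the assumption that p and w are float arrays like those in the runtime test; scaled is an illustrative name.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((50, 30))
w = rng.random(50)

# Step 1: scale each row p[i] by w[i] (elementwise, two indices only).
scaled = p * w[:, None]        # shape (50, 30)

# Step 2: contract over i with a single matrix product, which NumPy
# typically dispatches to BLAS for float arrays.
A = scaled.T @ p               # shape (30, 30)

print(np.allclose(A, np.einsum('ij,ik,i->jk', p, p, w)))
```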