In Pandas, does the .iloc method give a copy or a view?

I find the result is a little bit random. Sometimes it’s a copy, sometimes it’s a view. For example:

df = pd.DataFrame([{'name':'Marry', 'age':21},{'name':'John','age':24}],index=['student1','student2'])

df
              age   name
   student1   21  Marry
   student2   24   John

Now, let me try to modify it a little bit.

df2= df.loc['student1']
df2[0] = 23
df
              age   name
   student1   21  Marry
   student2   24   John

As you can see, nothing changed. df2 is a copy. However, if I add another student into the dataframe…

df.loc['student3'] = ['old','Tom']
df
               age   name
    student1   21  Marry
    student2   24   John
    student3  old    Tom

Try to change the age again..

df3=df.loc['student1']
df3[0]=33
df
               age   name
    student1   33  Marry
    student2   24   John
    student3  old    Tom

Now df3 suddenly became a view. What is going on? I guess the value ‘old’ is the key?

Solution:

In general, you can get a view if the data-frame has a single dtype, which is not the case with your original data-frame:

In [4]: df
Out[4]:
          age   name
student1   21  Marry
student2   24   John

In [5]: df.dtypes
Out[5]:
age      int64
name    object
dtype: object

However, when you do:

In [6]: df.loc['student3'] = ['old','Tom']
   ...:

The first column gets coerced to object, since a column cannot have mixed dtypes:

In [7]: df.dtypes
Out[7]:
age     object
name    object
dtype: object

In this case, the underlying .values will always return an array with the same underlying buffer, and changes to that array will be reflected in the data-frame:

In [11]: vals = df.values

In [12]: vals
Out[12]:
array([[21, 'Marry'],
       [24, 'John'],
       ['old', 'Tom']], dtype=object)

In [13]: vals[0,0] = 'foo'

In [14]: vals
Out[14]:
array([['foo', 'Marry'],
       [24, 'John'],
       ['old', 'Tom']], dtype=object)

In [15]: df
Out[15]:
          age   name
student1  foo  Marry
student2   24   John
student3  old    Tom

On the other hand, with mixed types like with your original data-frame:

In [26]: df = pd.DataFrame([{'name':'Marry', 'age':21},{'name':'John','age':24}]
    ...: ,index=['student1','student2'])
    ...:

In [27]: vals = df.values

In [28]: vals
Out[28]:
array([[21, 'Marry'],
       [24, 'John']], dtype=object)

In [29]: vals[0,0] = 'foo'

In [30]: vals
Out[30]:
array([['foo', 'Marry'],
       [24, 'John']], dtype=object)

In [31]: df
Out[31]:
          age   name
student1   21  Marry
student2   24   John

Note, however, that a view will only be returned when a view is possible, i.e. when the selection is a proper slice; otherwise a copy is made regardless of the dtypes. Here df2 was presumably created with label-list (fancy) indexing, e.g. df2 = df.loc[['student3', 'student2'], ['name']], which always copies:

In [39]: df.loc['student3'] = ['old','Tom']

In [40]: df2
Out[40]:
          name
student3   Tom
student2  John

In [41]: df2.loc[:] = 'foo'

In [42]: df2
Out[42]:
         name
student3  foo
student2  foo

In [43]: df
Out[43]:
          age   name
student1   21  Marry
student2   24   John
student3  old    Tom
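One way to check rather than guess is np.shares_memory, which reports whether two arrays overlap. A minimal sketch with the mixed-dtype frame (the copy case) from above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([{'name': 'Marry', 'age': 21},
                   {'name': 'John', 'age': 24}],
                  index=['student1', 'student2'])

vals = df.values      # mixed dtypes force a fresh object array
vals[0, 0] = 'foo'    # mutate the array...

# ...but the DataFrame is untouched, confirming vals was a copy
unchanged = df.loc['student1', 'age']                  # still 21
shared = np.shares_memory(vals, df['age'].to_numpy())  # False here
```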

Python Logging levels don't seem to be working

I would like to log info-level information to a file and debug-level info to the console. I am using StreamHandlers, but the logging.info and logging.debug messages both go to the console and the file. I would like the console to show just test1 and the file to show just test.

import logging
import os

rootLogger_file = logging.getLogger()
rootLogger_file.setLevel(logging.INFO)

rootLogger_console = logging.getLogger()
rootLogger_console.setLevel(logging.DEBUG)

fileHandler = logging.FileHandler('info', "w")

rootLogger_file.addHandler(fileHandler)

consoleHandler = logging.StreamHandler()
rootLogger_console.addHandler(consoleHandler)

rootLogger_file.info('test')
rootLogger_console.debug('test1')

Solution:

You are only creating a single logger with level DEBUG and you are adding both handlers to it. From the docs:

Multiple calls to getLogger() with the same name will always return a reference to the same Logger object.

f = logging.getLogger()
f.setLevel(logging.INFO)

c = logging.getLogger()  # returns the same object as before!
c.setLevel(logging.DEBUG)

f is c
# True  # f and c are the same object!

f.level
# 10  # DEBUG
c.level
# 10  # DEBUG

Since the one logger you have has level DEBUG (which means it also logs INFO and all other levels) and is picked up by both handlers, both messages are shown on the console and in the file. You have to give them different names upon creation:

f = logging.getLogger('f')
f.setLevel(logging.INFO)
c = logging.getLogger('c')
c.setLevel(logging.DEBUG)
# ...
f.info('test')  # logs to file
c.debug('test1') # logs to console
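For illustration, here is a complete sketch of that setup. The logger names, the temp-file path, and the StringIO stream standing in for the console are my own assumptions, not from the question:

```python
import io
import logging
import os
import tempfile

# Two distinctly named loggers -> two distinct Logger objects
file_logger = logging.getLogger('file_logger')
file_logger.setLevel(logging.INFO)
console_logger = logging.getLogger('console_logger')
console_logger.setLevel(logging.DEBUG)

# File handler attached only to the file logger
log_path = os.path.join(tempfile.gettempdir(), 'info.log')
file_handler = logging.FileHandler(log_path, 'w')
file_logger.addHandler(file_handler)

# Stream handler attached only to the console logger (StringIO
# stands in for the console so the output is easy to inspect)
stream = io.StringIO()
console_logger.addHandler(logging.StreamHandler(stream))

file_logger.info('test')        # goes only to the file
console_logger.debug('test1')   # goes only to the "console"

file_handler.flush()
```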

Pandas – Only Pivot Select Rows

I have a table where two different types of columns have been stacked into the field column – attributes and questions.

+-------+------------+-------+
|  id   |   field    | value |
+-------+------------+-------+
| 52394 | gender     | M     |
| 52394 | age        | 24    |
| 52394 | question_1 | 2     |
| 52394 | question_2 | 1     |
+-------+------------+-------+

I want to reshape it so that gender and age become columns while question_1 and question_2 remain stacked.

+-------+--------+-----+------------+-------+
|  id   | gender | age |   field    | value |
+-------+--------+-----+------------+-------+
| 52394 | M      |  24 | question_1 |     2 |
| 52394 | M      |  24 | question_2 |     1 |
+-------+--------+-----+------------+-------+

Any ideas on how to do this?

Solution:

This would be my strategy:

Pivot the rows of df where field is gender or age and save the result as df1. Select the rows where field is not gender or age and save them as df2. Then merge the two (df1 and df2) on id. Here is my full code:

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
  id     field     value 
 52394  gender      M     
 52394  age         24    
 52394  question_1  2     
 52394  question_2  1     
""")
df = pd.read_csv(rawText, sep=r"\s+")  # raw string avoids an invalid-escape warning
df1 = df[df['field'].isin(['gender','age'])]
df1 = df1.pivot(index = 'id', columns = 'field', values = 'value').reset_index()
df2 = df[~df['field'].isin(['gender','age'])]
df1.merge(df2)

The result is:

      id age gender       field value
0  52394  24      M  question_1     2
1  52394  24      M  question_2     1

'utf8' codec can't decode byte 0xc3 while decode('utf-8') in python

Today I was hit with strange error in my script:

'utf8' codec can't decode byte 0xc3 in position 21: invalid continuation byte

I’m reading data from a socket with sock.recv, and the result is buff.decode('utf-8'), where buff is the returned data.

But today I found a real “unicorn”: one of the characters came back as “▒”, and this is what throws the utf-8 decode into an exception. Is there some pre-processing that would either remove or replace such a strange character?

Solution:

There is a second parameter for .decode() named errors. You can set it to 'ignore' to drop all bytes that are not valid UTF-8, or to 'replace' to replace them with the Unicode replacement character (�).

buff.decode('utf-8', 'ignore')
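For example, with a byte string containing a lone 0xc3 lead byte (an assumed stand-in for the socket data):

```python
# 0xc3 starts a two-byte UTF-8 sequence; with no valid continuation
# byte after it, strict decoding raises UnicodeDecodeError.
buff = b'caf\xc3'

ignored = buff.decode('utf-8', 'ignore')    # bad byte dropped -> 'caf'
replaced = buff.decode('utf-8', 'replace')  # substituted -> 'caf\ufffd'
```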

Cycling Slicing in Python

I came up with this question while trying to apply a Caesar cipher to a matrix with different shift values for each row, i.e. given a matrix X

array([[1, 0, 8],
       [5, 1, 4],
       [2, 1, 1]])

with shift values of S = array([0, 1, 1]), the output needs to be

array([[1, 0, 8],
       [1, 4, 5],
       [1, 1, 2]])

This is easy to implement by the following code:

Y = []
for i in range(X.shape[0]):
    if (S[i] > 0):
        Y.append( X[i,S[i]::].tolist() + X[i,:S[i]:].tolist() )
    else:
        Y.append(X[i,:].tolist())
Y = np.array(Y)

This is a cyclic left shift. I wonder how to do this in a more efficient way using NumPy arrays?

Update: This example applies the shift to the columns of a matrix. Suppose that we have a 3D array

array([[[8, 1, 8],
        [8, 6, 2],
        [5, 3, 7]],

       [[4, 1, 0],
        [5, 9, 5],
        [5, 1, 7]],

       [[9, 8, 6],
        [5, 1, 0],
        [5, 5, 4]]])

Then, the cyclic right shift of S = array([0, 0, 1]) over the columns leads to

array([[[8, 1, 7],
        [8, 6, 8],
        [5, 3, 2]],

       [[4, 1, 7],
        [5, 9, 0],
        [5, 1, 5]],

       [[9, 8, 4],
        [5, 1, 6],
        [5, 5, 0]]])

Solution:

Approach #1 : Use modulus to implement the cyclic pattern and get the new column indices and then simply use advanced-indexing to extract the elements, giving us a vectorized solution, like so –

def cyclic_slice(X, S):
    m,n = X.shape
    idx = np.mod(np.arange(n) + S[:,None],n)
    return X[np.arange(m)[:,None], idx]

Approach #2 : We can also leverage the power of strides for further speedup. The idea would be to concatenate the sliced off portion from the start and append it at the end, then create sliding windows of lengths same as the number of cols and finally index into the appropriate window numbers to get the same rolled over effect. The implementation would be like so –

def cyclic_slice_strided(X, S):
    X2 = np.column_stack((X,X[:,:-1]))
    s0,s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided 

    m,n1 = X.shape
    n2 = X2.shape[1]
    X2_3D = strided(X2, shape=(m,n2-n1+1,n1), strides=(s0,s1,s1))
    return X2_3D[np.arange(len(S)),S]
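As a quick sanity check, the strided version reproduces the question’s expected left shift (the function is repeated here so the snippet runs standalone):

```python
import numpy as np

def cyclic_slice_strided(X, S):
    # Append the first n-1 columns so every cyclic window is contiguous
    X2 = np.column_stack((X, X[:, :-1]))
    s0, s1 = X2.strides
    strided = np.lib.stride_tricks.as_strided

    m, n1 = X.shape
    n2 = X2.shape[1]
    # Sliding windows of length n1 along each row
    X2_3D = strided(X2, shape=(m, n2 - n1 + 1, n1), strides=(s0, s1, s1))
    # For row i, pick the window starting at offset S[i]
    return X2_3D[np.arange(len(S)), S]

X = np.array([[1, 0, 8],
              [5, 1, 4],
              [2, 1, 1]])
S = np.array([0, 1, 1])
out = cyclic_slice_strided(X, S)
```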

Sample run –

In [34]: X
Out[34]: 
array([[1, 0, 8],
       [5, 1, 4],
       [2, 1, 1]])

In [35]: S
Out[35]: array([0, 1, 1])

In [36]: cyclic_slice(X, S)
Out[36]: 
array([[1, 0, 8],
       [1, 4, 5],
       [1, 1, 2]])

Runtime test –

In [75]: X = np.random.rand(10000,100)
    ...: S = np.random.randint(0,100,(10000))

# @Moses Koledoye's soln
In [76]: %%timeit
    ...: Y = []
    ...: for i, x in zip(S, X):
    ...:     Y.append(np.roll(x, -i))
10 loops, best of 3: 108 ms per loop

In [77]: %timeit cyclic_slice(X, S)
100 loops, best of 3: 14.1 ms per loop

In [78]: %timeit cyclic_slice_strided(X, S)
100 loops, best of 3: 4.3 ms per loop

Adaptation for the 3D case

Adapting approach #1 for the 3D case, we would have –

shift = 'left'
axis = 1 # axis along which S is to be used (axis=1 for rows)
n = X.shape[axis]
if shift == 'left':
    Sa = S
else:
    Sa = -S    

# For rows
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[:,idx, np.arange(len(S))]

# For columns
idx = np.mod(Sa[:,None] + np.arange(n),n)
out = X[:,np.arange(len(S))[:,None], idx]

# For axis=0
idx = np.mod(np.arange(n)[:,None] + Sa,n)
out = X[idx, np.arange(len(S))]

There might be a way to write a generic solution for an arbitrary axis, but I will leave it at this point.
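Plugging the question’s 3D example into the row-axis snippet above (with Sa = -S for the right shift) reproduces the expected output:

```python
import numpy as np

# The question's 3D array and per-column shift values
X = np.array([[[8, 1, 8], [8, 6, 2], [5, 3, 7]],
              [[4, 1, 0], [5, 9, 5], [5, 1, 7]],
              [[9, 8, 6], [5, 1, 0], [5, 5, 4]]])
S = np.array([0, 0, 1])

n = X.shape[1]
Sa = -S                                      # right shift
idx = np.mod(np.arange(n)[:, None] + Sa, n)  # new row index per column
out = X[:, idx, np.arange(len(S))]
```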

Dealing with large numbers in R [Inf] and Python

I am learning Python these days, and this is probably my first post on Python. I am relatively new to R as well, and have been using R for about a year. I am comparing both the languages while learning Python. I apologize if this question is too basic.

I am unsure why R outputs Inf for something Python doesn’t. Let’s take 2^1500 as an example.

In R:

nchar(2^1500)
[1] 3
2^1500
[1] Inf

In Python:

len(str(2**1500))
Out[7]: 452
2**1500
Out[8]: 3507466211043403874...

I have two questions:

a) Why is it that R produces Inf when Python doesn’t?

b) I researched the “How to work with large numbers in R?” thread. It seems that Brobdingnag could help with large numbers. However, even then I am unable to compute nchar. How do I compute the above expression, i.e. 2^1500, in R?

2^Brobdingnag::as.brob(500)
[1] +exp(346.57)
> nchar(2^Brobdingnag::as.brob(500))
Error in nchar(2^Brobdingnag::as.brob(500)) : 
  no method for coercing this S4 class to a vector

Solution:

In answer to your questions:

a) They use different representations for numbers. Most numbers in R are represented as double-precision floating-point values. These are all 64 bits long and give about 15 digits of precision throughout the range, which goes from -.Machine$double.xmax to .Machine$double.xmax, then switches to signed infinite values. R also sometimes uses 32-bit integer values, which cover a range of roughly ±2 billion. R chooses these types because it is geared towards statistical and numerical methods, and those rarely need more precision than double precision gives. (They often need a bigger range, but usually taking logs solves that problem.)

Python is more of a general-purpose platform, and it has the types discussed in MichaelChirico’s comment.

b) Besides Brobdingnag, the gmp package can handle arbitrarily large integers. For example,

> as.bigz(2)^1500
Big Integer ('bigz') :
[1] 35074662110434038747627587960280857993524015880330828824075798024790963850563322203657080886584969261653150406795437517399294548941469959754171038918004700847889956485329097264486802711583462946536682184340138629451355458264946342525383619389314960644665052551751442335509249173361130355796109709885580674313954210217657847432626760733004753275317192133674703563372783297041993227052663333668509952000175053355529058880434182538386715523683713208549376
> nchar(as.character(as.bigz(2)^1500))
[1] 452

I imagine the as.character() call would also be needed with Brobdingnag.

numpy vectorized way to change multiple rows of array(rows can be repeated)

I ran into this problem when implementing the vectorized SVM gradient for cs231n assignment1.
Here is an example:

ary = np.array([[1,-9,0],
                [1,2,3],
                [0,0,0]])
ary[[0,1]] += np.ones((2,3),dtype='int')  # shape must match the two selected rows

and it outputs:

array([[ 2, -8,  1],
      [ 2,  3,  4],
      [ 0,  0,  0]])

Everything is fine until the rows are not unique:

ary[[0,1,1]] += np.ones((3,3),dtype='int') 

Although it didn’t throw an error, the output was really strange:

array([[ 2, -8,  1],
       [ 2,  3,  4],
       [ 0,  0,  0]])

I expected the second row to be [3,4,5] rather than [2,3,4].
The naive way I used to solve this problem is a for loop like this:

ary = np.array([[ 2, -8,  1],
                [ 2,  3,  4],
                [ 0,  0,  0]], dtype=float)  # float so += change works
# the rows I want to change (indices may repeat)
rows = [0, 1, 2, 1, 0, 1]
# the change matrix
change = np.random.randn(6, 3)  # randn takes dimensions, not a tuple
for i, row in enumerate(rows):
    ary[row] += change[i]

I really don’t know how to vectorize this for loop. Is there a better way to do this in NumPy? And why is it wrong to do something like this?

ary[rows] += change

In case anyone is curious why I want to do this, here is my implementation of the svm_loss_vectorized function; I need to compute the gradient of the weights based on the labels y:

def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.

    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero

    # transpose X and W
    # D means input dimensions, N means number of train example
    # C means number of classes
    # X.shape will be (D,N)
    # W.shape will be (C,D)
    X = X.T
    W = W.T
    dW = dW.T
    num_train = X.shape[1]
    # transpose W_y shape to (D,N) 
    W_y = W[y].T
    S_y = np.sum(W_y*X ,axis=0)
    margins =  np.dot(W,X) + 1 - S_y
    mask = np.array(margins>0)

    # get the impact of num_train examples made on W's gradient
    # that is,only when the mask is positive 
    # the train example has impact on W's gradient
    dW_j = np.dot(mask, X.T)
    dW +=  dW_j
    mul_mask = np.sum(mask, axis=0, keepdims=True).T

    # dW[y] -= mul_mask * X.T
    dW_y =  mul_mask * X.T
    for i,label in enumerate(y):
      dW[label] -= dW_y[i]

    loss = np.sum(margins*mask) - num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += reg * np.sum(W*W)
    dW += reg * 2 * W
    dW = dW.T

    return loss, dW

Solution:

Using built-in np.add.at

The built-in for such tasks is np.add.at, i.e.

np.add.at(ary, rows, change)
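Applied to the repeated-rows example from the question, np.add.at accumulates every occurrence of a row index:

```python
import numpy as np

ary = np.array([[1, -9, 0],
                [1,  2, 3],
                [0,  0, 0]])

# Row 1 is listed twice, so it is incremented twice (unbuffered),
# unlike ary[[0, 1, 1]] += ..., which applies each row only once.
np.add.at(ary, [0, 1, 1], np.ones((3, 3), dtype='int'))
```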

But, since we are working with a 2D array, that might not be the most performant one.

Leveraging fast matrix-multiplication

As it turns out, we can leverage the very efficient matrix-multiplication for such a case as well, and given enough repeated rows to sum, it can be really good. Here’s how we can use it –

mask = rows == np.arange(len(ary))[:,None]
ary += mask.dot(change)

Benchmarking

Let’s time np.add.at method against matrix-multiplication based one for bigger arrays –

In [681]: ary = np.random.rand(1000,1000)

In [682]: rows = np.random.randint(0,len(ary),(10000))

In [683]: change = np.random.rand(10000,1000)

In [684]: %timeit np.add.at(ary, rows, change)
1 loop, best of 3: 604 ms per loop

In [687]: def matmul_addat(ary, rows, change):
     ...:     mask = rows == np.arange(len(ary))[:,None]
     ...:     ary += mask.dot(change)

In [688]: %timeit matmul_addat(ary, rows, change)
10 loops, best of 3: 158 ms per loop

How to concatenate pandas DataFrame with built-in logic?

I have two pandas data frame and I would like to produce the output shown in the expected data frame.

import pandas as pd

df1 = pd.DataFrame({'a':['aaa', 'bbb', 'ccc', 'ddd'],
                    'b':['eee', 'fff', 'ggg', 'hhh']})
df2 = pd.DataFrame({'a':['aaa', 'bbb', 'ccc', 'ddd'],
                    'b':['eee', 'fff', 'ggg', 'hhh'],
                    'update': ['', 'X', '', 'Y']})
expected = pd.DataFrame({'a': ['aaa', 'bbb', 'ccc', 'ddd'],
                         'b': ['eee', 'X', 'ggg', 'Y']})

I tried to apply some concatenation logic but this is not producing the expected output.

df1.set_index('b')
df2.set_index('update')
out = pd.concat([df1[~df1.index.isin(df2.index)], df2])

print(out)
         a    b   update
0  aaa  eee
1  bbb  fff  X
2  ccc  ggg
3  ddd  hhh  Y

From this output I can produce the expected output but I was wondering if this logic can be built directly inside the concat call?

def fx(row):
    if row['update'] != '':
        row['b'] = row['update']
    return row

result = out.apply(fx, axis=1)
result.drop('update', axis=1, inplace=True)
print(result)
     a        b
0  aaa      eee
1  bbb      X
2  ccc      ggg
3  ddd      Y

Solution:

Use the built-in Series.update after replacing '' with np.nan (this needs import numpy as np), i.e.

df1['b'].update(df2['update'].replace('', np.nan))

    a    b
0  aaa  eee
1  bbb    X
2  ccc  ggg
3  ddd    Y

You can also use np.where i.e

out = df1.assign(b=np.where(df2['update'].eq(''), df2['b'], df2['update']))
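Both one-liners assume import numpy as np. A self-contained sketch of the update route, done on a copy here so the original frame stays intact:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': ['aaa', 'bbb', 'ccc', 'ddd'],
                    'b': ['eee', 'fff', 'ggg', 'hhh']})
df2 = pd.DataFrame({'a': ['aaa', 'bbb', 'ccc', 'ddd'],
                    'b': ['eee', 'fff', 'ggg', 'hhh'],
                    'update': ['', 'X', '', 'Y']})

# Series.update aligns on the index and skips NaN values,
# so the empty strings are turned into NaN first.
updated = df1['b'].copy()
updated.update(df2['update'].replace('', np.nan))
```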

Numpy RandomState method

I’m new to Python, so don’t blast me.
I’m studying some Python code and I found this

rgen = np.random.RandomState(self.random_state)

where self.random_state is an int.
I looked at the documentation at
https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.RandomState.html#numpy.random.RandomState

and I discovered that RandomState(int) doesn’t exist as a method; it is just a “methods container”.
So how is it possible to call RandomState(self.random_state)?

Solution:

RandomState is a class and RandomState(whatever_arguments) just creates a new instance of the class RandomState.

Instance creation normally goes through __init__ (and/or __new__), which is a special method and not always separately documented. Normally, as in this case, it’s documented in the class’s docstring; you already linked to the relevant documentation page, which lists the parameter for instance creation:

class numpy.random.RandomState

Container for the Mersenne Twister pseudo-random number generator.

RandomState exposes a number of methods for generating random numbers drawn from a variety of probability distributions. In addition to the distribution-specific arguments, each method takes a keyword argument size that defaults to None. If size is None, then a single value is generated and returned. If size is an integer, then a 1-D array filled with generated values is returned. If size is a tuple, then an array with that shape is filled and returned.

Compatibility Guarantee A fixed seed and a fixed series of calls to ‘RandomState’ methods using the same parameters will always produce the same results up to roundoff error except when the values were incorrect. Incorrect values will be fixed and the NumPy version in which the fix was made will be noted in the relevant docstring. Extension of existing parameter ranges and the addition of new parameters is allowed as long the previous behavior remains unchanged.

Parameters:

seed : {None, int, array_like}, optional

Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 – 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.
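A short demonstration that calling the class constructs an independent generator instance; the seed 42 is an arbitrary choice:

```python
import numpy as np

# Calling the class invokes RandomState.__init__ with the seed
rgen = np.random.RandomState(42)
a = rgen.rand(3)

# A second instance with the same seed yields the same sequence
rgen2 = np.random.RandomState(42)
b = rgen2.rand(3)
```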

Vectorized sum-reduction with outer product – NumPy

I’m relatively new to NumPy and often read that you should avoid writing loops. In many cases I understand how to deal with that, but at the moment I have the following code:

p = np.arange(15).reshape(5,3)
w = np.random.rand(5)
A = np.sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))

Does anybody know if there is a way to avoid the inner for loop?

Thanks in advance!

Solution:

Approach #1 : With np.einsum

np.einsum('ij,ik,i->jk',p,p,w)

Approach #2 : With broadcasting + np.tensordot

np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))

Approach #3 : With np.einsum + np.dot

np.einsum('ij,i->ji',p,w).dot(p)
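Before timing, a quick check that all three approaches match the original sum of weighted outer products:

```python
import numpy as np

p = np.arange(15).reshape(5, 3).astype(float)
w = np.random.rand(5)

# Reference: the explicit loop the question wants to avoid
ref = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))

a1 = np.einsum('ij,ik,i->jk', p, p, w)
a2 = np.tensordot(p[..., None] * p[:, None], w, axes=((0), (0)))
a3 = np.einsum('ij,i->ji', p, w).dot(p)
```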

Runtime test

Set #1 :

In [653]: p = np.random.rand(50,30)

In [654]: w = np.random.rand(50)

In [655]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10000 loops, best of 3: 101 µs per loop

In [656]: %timeit np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
10000 loops, best of 3: 124 µs per loop

In [657]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
100000 loops, best of 3: 9.07 µs per loop

Set #2 :

In [658]: p = np.random.rand(500,300)

In [659]: w = np.random.rand(500)

In [660]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10 loops, best of 3: 139 ms per loop

In [661]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
1000 loops, best of 3: 1.01 ms per loop

The third approach just blew everything else!

Why is Approach #3 10x-130x faster than Approach #1?

np.einsum is implemented in C. In the first approach, with the three indices i, j, k in its subscript notation, we would have three nested loops (in C, of course). That’s a lot of overhead.

With the third approach, we only use the two indices i and j, hence two nested loops (again in C), and we also leverage BLAS-based matrix multiplication with that np.dot. These two factors are responsible for the amazing speedup with this one.