Is there a str.replace equivalent for sequence in general?

Is there a method similar to str.replace which can do the following:

>>> replace(sequence=[0,1,3], old=[0,1], new=[1,2])
[1, 2, 3]

It should really act like str.replace: replacing a “piece” of the sequence with another sequence, not mapping elements of “old” to elements of “new”.
Thanks 🙂

Solution:

No, I’m afraid there is no built-in function that does this; however, you can create your own!

The steps are really easy: we just slide a window over the list, where the width of the window is len(old). At each position, we check whether the window equals old, and if it does, we slice before the window, insert new, and concatenate the rest of the list after it. This can be done simply by assigning directly to the old slice, as pointed out by @OmarEinea.

def replace(seq, old, new):
    seq = seq[:]  # work on a copy so the caller's list is untouched
    w = len(old)  # window width
    i = 0
    while i < len(seq) - w + 1:
        if seq[i:i+w] == old:
            seq[i:i+w] = new  # slice assignment splices `new` in place
            i += len(new)     # jump past the inserted section
        else:
            i += 1
    return seq

and some tests show it works:

>>> replace([0, 1, 3], [0, 1], [1, 2])
[1, 2, 3]
>>> replace([0, 1, 3, 0], [0, 1], [1, 2])
[1, 2, 3, 0]
>>> replace([0, 1, 3, 0, 1], [0, 1], [7, 8])
[7, 8, 3, 7, 8]
>>> replace([1, 2, 3, 4, 5], [1, 2, 3], [1, 1, 2, 3])
[1, 1, 2, 3, 4, 5]
>>> replace([1, 2, 1, 2], [1, 2], [3])
[3, 3]

As pointed out by @user2357112, using a for loop leads to re-evaluating replaced sections of the list, so I updated the answer to use a while loop instead.
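The same idea can be extended to immutable sequences like strings and tuples by building a new list and converting back at the end. This is a hypothetical generalization, not a built-in (here named replace_seq), and it assumes old is non-empty:

```python
def replace_seq(seq, old, new):
    """Sequence-generic replace: works for lists, tuples and strings.

    Hypothetical helper, not a built-in. Assumes `old` is non-empty.
    """
    old, new = list(old), list(new)
    w = len(old)
    out = []
    i = 0
    while i <= len(seq) - w:
        if list(seq[i:i+w]) == old:
            out.extend(new)  # splice in the replacement
            i += w           # skip past the matched section
        else:
            out.append(seq[i])
            i += 1
    out.extend(seq[i:])      # keep any unmatched tail
    if isinstance(seq, str):
        return ''.join(out)
    return type(seq)(out)    # rebuild the original type

print(replace_seq("the cat sat", "cat", "dog"))  # the dog sat
```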

numpy matrix algebra best practice

My question is regarding the last line below: mu@sigma@mu. Why does it work? Is a one-dimensional ndarray treated as a row vector or a column vector? Either way, shouldn’t it be mu.T@sigma@mu or mu@sigma@mu.T? I know mu.T still returns mu since mu only has one dimension, but still, the interpreter seems to be too smart.

>>> import numpy as np
>>> mu = np.array([1, 1])
>>> print(mu)

[1 1]

>>> sigma = np.eye(2) * 3
>>> print(sigma)

[[ 3.  0.]
 [ 0.  3.]]

>>> mu@sigma@mu

6.0

More generally, which is the better practice for matrix algebra in Python: using ndarray and @ to do matrix multiplication as above (cleaner code), or using np.matrix and the overloaded * as below (mathematically less confusing)?

>>> import numpy as np
>>> mu = np.matrix(np.array([1, 1]))
>>> print(mu)

[[1 1]]

>>> sigma = np.matrix(np.eye(2) * 3)
>>> print(sigma)

[[ 3.  0.]
 [ 0.  3.]]

>>> a = mu * sigma * mu.T
>>> a.item((0, 0))

6.0

Solution:

Python performs chained operations left to right:

In [32]: mu=np.array([1,1])
In [33]: sigma= np.array([[3,0],[0,3]])
In [34]: mu@sigma@mu
Out[34]: 6

is the same as doing two expressions:

In [35]: temp=mu@sigma
In [36]: temp.shape
Out[36]: (2,)
In [37]: temp@mu
Out[37]: 6

In my comments (deleted) I claimed @ was just doing np.dot. That’s not quite right. The documentation describes the handling of 1d arrays differently. But the resulting shapes are the same:

In [38]: mu.dot(sigma).dot(mu)
Out[38]: 6
In [39]: mu.dot(sigma).shape
Out[39]: (2,)

For 1d and 2d arrays, np.dot and @ should produce the same result. They differ in handling higher dimensional arrays.
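A quick illustration of that higher-dimensional difference:

```python
import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((2, 4, 5))

# @ (np.matmul) treats the leading dimension as a "stack" of matrices
print((a @ b).shape)       # (2, 3, 5)

# np.dot contracts a's last axis with b's second-to-last axis,
# producing every pairwise combination of the leading dimensions
print(np.dot(a, b).shape)  # (2, 3, 2, 5)
```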

Historically numpy has used arrays, which can be 0d, 1d, and on up. np.dot was the original matrix multiplication method/function.

np.matrix was added, largely as a convenience for wayward MATLAB programmers. It only allows 2d arrays (just like the old, 1990s-era MATLAB). And it overloads __mul__ (*) with:

def __mul__(self, other):
    if isinstance(other, (N.ndarray, list, tuple)) :
        # This promotes 1-D vectors to row vectors
        return N.dot(self, asmatrix(other))
    if isscalar(other) or not hasattr(other, '__rmul__') :
        return N.dot(self, other)
    return NotImplemented

Here Mu is the np.matrix version of mu (a (1,2) matrix, as in the question). Mu*sigma and Mu@sigma behave the same, though the calling tree is different:

In [48]: Mu@sigma@Mu
...
ValueError: shapes (1,2) and (1,2) not aligned: 2 (dim 1) != 1 (dim 0)

Mu*sigma produces a (1,2) matrix, which cannot matrix multiply a (1,2), hence the need for a transpose:

In [49]: Mu@sigma@Mu.T
Out[49]: matrix([[6]])

Note that this is a (1,1) matrix. You have to use item if you want a scalar. (In MATLAB there isn’t such a thing as a scalar. Everything has a shape/size.)

@ is a relatively recent addition to Python and numpy. It was added to Python as an operator with no default implementation; numpy (and possibly other packages) has implemented it.

It makes chained expressions possible, though I don’t have any problems with the chained dot in [38]. It is more useful when handling higher dimensional cases.

This addition means there is one less reason to use the old np.matrix class. (Matrix like behavior is more deeply ingrained in the scipy.sparse matrix classes.)

If you want ‘mathematical purity’ I’d suggest taking the mathematical physics approach, and use Einstein notation – as implemented in np.einsum.
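With einsum, the contraction in the question can be written with explicit indices, so there is no guessing about row versus column vectors:

```python
import numpy as np

mu = np.array([1, 1])
sigma = np.eye(2) * 3

# sum over i and j of mu_i * sigma_ij * mu_j; the index string
# says exactly which axes are contracted
result = np.einsum('i,ij,j->', mu, sigma, mu)
print(result)  # 6.0
```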


With arrays this small, the timings reflect the calling structure more than the actual number of calculations:

In [57]: timeit mu.dot(sigma).dot(mu)
2.79 µs ± 7.75 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [58]: timeit mu@sigma@mu
6.29 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [59]: timeit Mu@sigma@Mu.T
17.1 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [60]: timeit Mu*sigma*Mu.T
17.7 µs ± 517 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Note that the ‘old-fashioned’ dot is fastest, while both matrix versions are slower.

Update model fields based on POST data before save with Django Rest Framework

I’m using django-rest-framework and want to augment the posted data before saving it to my model. Normally this is achieved using the model’s clean method, as in this example from the django docs:

class Article(models.Model):
    ...

    def clean(self):
        # Don't allow draft entries to have a pub_date.
        if self.status == 'draft' and self.pub_date is not None:
            raise ValidationError(_('Draft entries may not have a publication date.'))
        # Set the pub_date for published items if it hasn't been set already.
        if self.status == 'published' and self.pub_date is None:
            self.pub_date = datetime.date.today()

Unfortunately, a django-rest-framework Serializer does not call a model’s clean method the way a standard django Form would, so how would I achieve this?

Solution:

From official docs:

The one difference that you do need to note is that the .clean() method will not be called as part of serializer validation, as it would be if using a ModelForm. Use the serializer .validate() method to perform a final validation step on incoming data where required.

There may be some cases where you really do need to keep validation logic in the model .clean() method, and cannot instead separate it into the serializer .validate(). You can do so by explicitly instantiating a model instance in the .validate() method.

def validate(self, attrs):
    instance = ExampleModel(**attrs)
    instance.clean()
    return attrs
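The same hook also lets you augment the data, mirroring the clean() logic from the question. Below is a minimal standalone sketch of that logic, using a plain dict and ValueError as stand-ins for a real DRF serializer and rest_framework.serializers.ValidationError:

```python
import datetime

def validate(attrs):
    """Sketch of a serializer-level validate() hook (standalone
    stand-in: plain dict instead of a real DRF serializer)."""
    # Don't allow draft entries to have a pub_date.
    if attrs.get('status') == 'draft' and attrs.get('pub_date') is not None:
        raise ValueError('Draft entries may not have a publication date.')
    # Augment the incoming data before it is saved.
    if attrs.get('status') == 'published' and attrs.get('pub_date') is None:
        attrs['pub_date'] = datetime.date.today()
    return attrs

print(validate({'status': 'published', 'pub_date': None}))
```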

How arguments in Python decorated functions work

I have trouble understanding how the argument is passed to a wrapper function inside a decorator.
Take a simple example:

def my_decorator(func):
    def wrapper(func_arg):
        print('Before')
        func(func_arg)
        print('After')
    return wrapper

@my_decorator
def my_function(arg):
    print(arg + 1)

my_function(1)

I have a function that takes one argument, and it is decorated. I have trouble understanding how func_arg works. When my_function(1) is called, how is the value 1 passed to the wrapper? From my limited understanding, my_function is ‘replaced’ by
a new function, as in: my_function = my_decorator(my_function).

print(my_function)
<function my_decorator.<locals>.wrapper at 0x7f72fea9c620>

Solution:

Your understanding is entirely correct. Decorator syntax is just syntactic sugar; the lines:

@my_decorator
def my_function(arg):
    print(arg + 1)

are executed as

def my_function(arg):
    print(arg + 1)

my_function = my_decorator(my_function)

without my_function actually having been set before the decorator is called*.

So my_function is now bound to the wrapper() function created in your my_decorator() function. The original function object was passed into my_decorator() as the func argument, so is still available to the wrapper() function, as a closure. So calling func() calls the original function object.

So when you call the decorated my_function(1) object, you really call wrapper(1). This function receives the 1 via the name func_arg, and wrapper() then itself calls func(func_arg), which is the original function object. So in the end, the original function is passed 1 too.

You can see this result in the interpreter:

>>> def my_decorator(func):
...     def wrapper(func_arg):
...         print('Before')
...         func(func_arg)
...         print('After')
...     return wrapper
...
>>> @my_decorator
... def my_function(arg):
...     print(arg + 1)
...
>>> my_function
<function my_decorator.<locals>.wrapper at 0x10f278ea0>
>>> my_function.__closure__
(<cell at 0x10ecdf498: function object at 0x10ece9730>,)
>>> my_function.__closure__[0].cell_contents
<function my_function at 0x10ece9730>
>>> my_function.__closure__[0].cell_contents(1)
2

Closures are accessible via the __closure__ attribute, and you can access the current value for a closure via the cell_contents attribute. Here, that’s the original decorated function object.

It is important to note that each time you call my_decorator(), a new function object is created. They are all named wrapper(), but they are separate objects, each with their own __closure__.
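Not part of the question, but worth knowing once this clicks: in practice the wrapper is usually written with *args/**kwargs so it works for any signature, and decorated with functools.wraps so the decorated function keeps its original metadata instead of showing up as wrapper:

```python
import functools

def my_decorator(func):
    @functools.wraps(func)         # copy func's name/docstring onto wrapper
    def wrapper(*args, **kwargs):  # accept any call signature
        print('Before')
        result = func(*args, **kwargs)
        print('After')
        return result              # pass the return value through
    return wrapper

@my_decorator
def my_function(arg):
    return arg + 1

print(my_function(1))        # prints Before, After, then 2 is returned
print(my_function.__name__)  # 'my_function', not 'wrapper'
```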


* Python produces bytecode that creates the function object without assigning it to a name; it lives on the stack instead. The next bytecode instruction then calls the decorator object:

>>> import dis
>>> dis.dis(compile('@my_decorator\ndef my_function(arg):\n    print(arg + 1)\n', '', 'exec'))
  1           0 LOAD_NAME                0 (my_decorator)
              2 LOAD_CONST               0 (<code object my_function at 0x10f25bb70, file "", line 1>)
              4 LOAD_CONST               1 ('my_function')
              6 MAKE_FUNCTION            0
              8 CALL_FUNCTION            1
             10 STORE_NAME               1 (my_function)
             12 LOAD_CONST               2 (None)
             14 RETURN_VALUE

So first LOAD_NAME looks up the my_decorator name. Next, the bytecode generated for the function object is loaded, as well as the name for the function. MAKE_FUNCTION creates the function object from those two pieces of information (removing them from the stack) and puts the resulting function object back on. CALL_FUNCTION then takes the one argument on the stack (its operand 1 tells it how many positional arguments to take), and calls the next object on the stack (the decorator object that was loaded). The result of that call is then stored under the name my_function.

I have a for loop, can I create the same list of lists using a list comprehension?

Let’s have a list of values and an arbitrary integer number.

values = ['5', '3', '.', '.', '7', '.', '.', '.', '.', '6', '.', '.', '1', '9', '5', '.', '.', '.', '.', '9', '8', '.', '.', '.', '.', '6', '.', '8', '.', '.', '.', '6', '.', '.', '.', '3', '4', '.', '.', '8', '.', '3', '.', '.', '1', '7', '.', '.', '.', '2', '.', '.', '.', '6', '.', '6', '.', '.', '.', '.', '2', '8', '.', '.', '.', '.', '4', '1', '9', '.', '.', '5', '.', '.', '.', '.', '8', '.', '.', '7', '9']

n = 9

I’d like to group the values with n numbers in a row.

Let us suppose n=9, that is, each row will contain 9 numbers.

The result should be like this:

grouped_values = [
     ['5', '3', '.', '.', '7', '.', '.', '.', '.'],
     ['6', '.', '.', '1', '9', '5', '.', '.', '.'],
     ['.', '9', '8', '.', '.', '.', '.', '6', '.'],
     ['8', '.', '.', '.', '6', '.', '.', '.', '3'],
     ['4', '.', '.', '8', '.', '3', '.', '.', '1'],
     ['7', '.', '.', '.', '2', '.', '.', '.', '6'],
     ['.', '6', '.', '.', '.', '.', '2', '8', '.'],
     ['.', '.', '.', '4', '1', '9', '.', '.', '5'],
     ['.', '.', '.', '.', '8', '.', '.', '7', '9']]

I can do it like this:

def group(values, n):
    rows_number = int(len(values)/n) # Simplified. Exceptions will be caught.
    grouped_values = []

    for i in range(0, rows_number):
        grouped_values.append(values[i:i+9])

    return grouped_values

But I suspect a list comprehension can be used here.
Could you help me understand how it can be done?

Solution:

Just move the expression in the list.append() call to the front, and add the for loop:

grouped_values = [values[i:i + 9] for i in range(rows_number)]

Note that this does not slice up your input list into chunks of consecutive elements. It produces a sliding window: you slice values[0:9], then values[1:10], etc., producing windows onto the input data, each of length 9, with 8 elements overlapping the previous window. To produce consecutive chunks of length 9, use range(0, len(values), n) as the range; there is no need to calculate rows_number:

grouped_values = [values[i:i + n] for i in range(0, len(values), n)]

Whenever you see a pattern like this:

<list_name> = []

for <targets> in <iterable>:
    <list_name>.append(<expression>)

where <expression> does not reference <list_name>, you can trivially turn that into

<list_name> = [<expression> for <targets> in <iterable>]

The only difference here is that list_name is not set until after the whole list comprehension has been executed. You can’t reference the list being built from inside the list comprehension.

Sometimes you need to move additional code in the loop body that produces that final <expression> value into a single expression before you arrive at the above pattern.

Note that it doesn’t matter here that <expression> itself produces list objects; they can be entirely new list comprehensions or any other valid Python expression.

When the loop body contains more nesting (additional for loops or if statements), list those added for loops and if statements left to right in the resulting list comprehension, in the same order they appear in the nested statement; for example, the pattern

<list_name> = []

for <targets1> in <iterable1>:
    if <test_expression>:
        for <targets2> in <iterable2>:        
            <list_name>.append(<expression>)

becomes

<list_name> = [
    <expression>
    for <targets1> in <iterable1>
    if <test_expression>
    for <targets2> in <iterable2>
]
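A concrete (made-up) instance of that nested pattern, translating a loop with a filter and an inner loop into one comprehension:

```python
matrix = [[1, 2], [3, 5], [4, 6]]

# loop version: keep rows with an even sum, scale their elements
result = []
for row in matrix:
    if sum(row) % 2 == 0:
        for x in row:
            result.append(x * 10)

# same thing, with the clauses listed left to right
result_lc = [x * 10 for row in matrix if sum(row) % 2 == 0 for x in row]

print(result)     # [30, 50, 40, 60]
print(result_lc)  # [30, 50, 40, 60]
```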

groupby() and index values in pandas

I have a pandas.DataFrame with a Multiindex, thus:

        val
a dog     1
  cat     2
b fox     3
  rat     4

And I want a series whose entries are the lists of the index values at level 1,

so:

a    [dog, cat]
b    [fox, rat]

the following does work, but is quite slow and inelegant:

fff = df.groupby(level=0)['val'].agg(lambda x:[i[1] for i in list(x.index.values)])

So I am hoping there is a better way.

Solution:

Use reset_index and groupby:

df.reset_index(level=1).groupby(level=0)['level_1'].apply(list)


Out[21]: 
a    [dog, cat]
b    [fox, rat]
Name: level_1, dtype: object
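A self-contained reproduction (the index values are taken from the question; since the inner level is unnamed, reset_index names the new column level_1):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 'dog'), ('a', 'cat'), ('b', 'fox'), ('b', 'rat')])
df = pd.DataFrame({'val': [1, 2, 3, 4]}, index=idx)

# move the inner index level into a column, then collect it per outer key
out = df.reset_index(level=1).groupby(level=0)['level_1'].apply(list)
print(out)
```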

Pandas: Difficulty Filling in Null Values

I’m using the Kaggle Titanic dataset and trying to fill in null values. Running this:

combined_df.isnull().sum()

Get me this:

Age            263
Embarked         2
Fare             1
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
fam_size         0
Title            0
dtype: int64

So I do the following to fill in null values:

combined_df.Age.fillna(combined_df.Age.mean(), inplace=True)
combined_df.Embarked.fillna(combined_df.Embarked.mode(), inplace=True)
combined_df.Fare.fillna(combined_df.Fare.mean(), inplace=True)

So when I run this now:

combined_df.isnull().sum()

I get:

Age              0
Embarked         2
Fare             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
fam_size         0
Title            0
dtype: int64

So it handles the Age and Fare columns correctly but Embarked still has two null values as before.

Interestingly, when I run:

combined_df.Embarked.value_counts()

I get back:

S    914
C    270
Q    123
Name: Embarked, dtype: int64

So that makes it seem like there aren’t any null values in Embarked?

Very confused; any suggestions?

Thanks!

Solution:

You cannot use the value returned by mode() as the fill value, because it is a Series object. (Well, you can, but fillna aligns a Series argument on the index, so it only fills positions whose index labels match.) Instead, use the first entry, mode()[0]; mode() returns a Series because there can be a tie between several modes.

df = pd.DataFrame({'Emb': ['S', 'Q', 'C',  np.nan, 'Q', None]})
df
    Emb
0     S
1     Q
2     C
3   NaN
4     Q
5  None
df.fillna(df.Emb.mode())
    Emb
0     S
1     Q
2     C
3   NaN
4     Q
5  None
df.fillna(df.Emb.mode()[0])
  Emb
0   S
1   Q
2   C
3   Q
4   Q
5   Q

For more clarification:

mode = df.Emb.mode()
mode
0    Q
dtype: object
df.Emb
0      S
1      Q
2      C
3    NaN
4      Q
5    NaN
Name: Emb, dtype: object
mode.index = [5]
mode
5    Q
dtype: object
df.Emb.fillna(mode)
0      S
1      Q
2      C
3    NaN
4      Q
5      Q
Name: Emb, dtype: object
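Applied to the question’s Embarked column (toy data assumed here), the fix is a one-liner:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', np.nan, 'Q']})

# mode() returns a Series (there can be ties), so take the first entry
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
print(df['Embarked'].isnull().sum())  # 0
```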

How to fit a polynomial with some of the coefficients constrained?

Using NumPy’s polyfit (or something similar) is there an easy way to get a solution where one or more of the coefficients are constrained to a specific value?

For example, we could find the ordinary polynomial fitting using:

x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])
z = np.polyfit(x, y, 3)

yielding

array([ 0.08703704, -0.81349206,  1.69312169, -0.03968254])

But what if I wanted the best fit polynomial where the third coefficient (in the above case z[2]) was required to be 1? Or will I need to write the fitting from scratch?

Solution:

In this case, I would use curve_fit or lmfit; I’ll quickly show it for the first one.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c, d):
    return a + b * x + c * x ** 2 + d * x ** 3

x = np.array([0.0, 1.0, 2.0, 3.0,  4.0,  5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

print(np.polyfit(x, y, 3))

popt, _ = curve_fit(func, x, y)
print(popt)

popt_cons, _ = curve_fit(func, x, y, bounds=([-np.inf, 2, -np.inf, -np.inf], [np.inf, 2.001, np.inf, np.inf]))
print(popt_cons)

xnew = np.linspace(x[0], x[-1], 1000)

plt.plot(x, y, 'bo')
plt.plot(xnew, func(xnew, *popt), 'k-')
plt.plot(xnew, func(xnew, *popt_cons), 'r-')
plt.show()

This will print:

[ 0.08703704 -0.81349206  1.69312169 -0.03968254]
[-0.03968254  1.69312169 -0.81349206  0.08703704]
[-0.14331349  2.         -0.95913556  0.10494372]

So in the unconstrained case, polyfit and curve_fit give identical results (just in reversed order); in the constrained case, the fixed parameter is 2, as desired (curve_fit does not allow identical lower and upper bounds, hence the tiny 2.001 gap).

The plot then looks as follows:

[plot of the data points with the unconstrained fit (black) and the constrained fit (red)]

In lmfit you can also choose whether a parameter should be fitted or not, so you can simply fix it at a desired value.
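Because fixing a coefficient is a linear equality constraint, the problem also remains an ordinary linear least-squares fit: move the fixed term to the left-hand side and fit the remaining basis with np.linalg.lstsq. A sketch, fixing the linear coefficient (polyfit's z[2]) at 1, as in the question:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8, -1.0])

# fix the linear coefficient to 1: subtract its contribution from y,
# then fit only the remaining basis {x**3, x**2, 1}
A = np.column_stack([x ** 3, x ** 2, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y - 1.0 * x, rcond=None)

# reassemble in np.polyfit order: [x**3, x**2, x, const]
z = np.array([coef[0], coef[1], 1.0, coef[2]])
print(z)
```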

find common elements from sublist in list

I have two lists, and I have to extract the items from the first list whose first element is present in the second. The code I have pasted below works perfectly, but as I am operating on several million records, it is painfully slow. Does anyone have any idea how it can be optimized?

a = [[1,0],[2,0],[3,0],[4,0]]
b = [2,4,7,8]

same_nums = list(set([x[0] for x in a]).intersection(set(b)))

result = []

for i in a:
    if i[0] in same_nums:
        result.append(i)

print(result)

Solution:

You are overcomplicating things. Just turn b into a set to speed up the contains check. Then one iteration of a in the comprehension will suffice:

set_b = set(b)  # makes   vvvvvvvvvvvvv  O(1)
result = [x for x in a if x[0] in set_b]

In particular, turning same_nums back into a list is a real performance killer, as it makes the whole thing O(m*n) again. With a single set built from b it is O(m+n). But same_nums is entirely unnecessary to begin with: you already know every i[0] is in a, since you are iterating over a.

NumPy ufuncs are 2x faster in one axis over the other

I was doing some computation, and measured the performance of ufuncs like np.cumsum over different axes, to make the code more performant.

In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

cumsum over axis 1 is almost 2x faster than cumsum over axis 0. Why is this, and what is going on behind the scenes? It’d be nice to have a clear understanding of the reason behind it. Thanks!

Solution:

You have a square array. It looks like this:

1 2 3
4 5 6
7 8 9

But computer memory is linearly addressed, so to the computer it looks like this:

1 2 3 4 5 6 7 8 9

Or, if you think about it, it might look like this:

1 4 7 2 5 8 3 6 9

If you are trying to sum [1 2 3] or [4 5 6] (one row), the first layout is faster. If you are trying to sum [1 4 7] or [2 5 8], the second layout is faster.

This happens because data is loaded from memory one “cache line” at a time, which is typically 64 bytes (8 values at 8 bytes each, e.g. the int64 in your example or NumPy’s default float64).

You can control which layout NumPy uses when you construct an array, using the order parameter.
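A quick way to see this from Python: a Fortran-ordered copy flips which axis is contiguous, so the fast axis for cumsum flips too (the values are identical either way; only the memory traversal changes):

```python
import numpy as np

c = np.arange(int(1e6)).reshape(1000, 1000)  # C order: rows contiguous
f = np.asfortranarray(c)                     # F order: columns contiguous

print(c.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])  # True True

# identical results regardless of layout
print(np.array_equal(c.cumsum(axis=0), f.cumsum(axis=0)))  # True
```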

For more on this, see: https://en.wikipedia.org/wiki/Row-_and_column-major_order