Accessing static fields from the decorated class

Full code example:

def decorator(class_):
    class Wrapper:
        def __init__(self, *args, **kwargs):
            self.instance = class_(*args, **kwargs)

        @classmethod
        def __getattr__(cls, attr):
            return getattr(class_, attr)
    return Wrapper


@decorator
class ClassTest:

    static_var = "some value"


class TestSomething:

    def test_decorator(self):
        print(ClassTest.static_var)
        assert True

When I try to execute the test, I get this error:

test/test_Framework.py F
test/test_Framework.py:37 (TestSomething.test_decorator)
self = <test_Framework.TestSomething object at 0x10ce3ceb8>

    def test_decorator(self):
>       print(ClassTest.static_var)
E       AttributeError: type object 'Wrapper' has no attribute 'static_var'

Is it possible to access static fields from the decorated class?

Solution:

You can get it to work by making the decorator create a class derived from the one being decorated.

Here’s what I mean:

def decorator(class_):
    class Wrapper(class_):
        def __init__(self, *args, **kwargs):
            # Wrapper subclasses class_, so self *is* the instance;
            # just delegate initialization to the parent class.
            super().__init__(*args, **kwargs)

    return Wrapper

@decorator
class ClassTest:
    static_var = "some value"

print(ClassTest.static_var)  # -> some value
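An alternative, if you want to keep the wrapping (rather than subclassing) approach: the classmethod `__getattr__` in the question never fires because attribute lookup on a class consults its *metaclass*, so a sketch along these lines should also work:

```python
def decorator(class_):
    # Attribute access on the Wrapper *class* itself is resolved via
    # its type, i.e. the metaclass, so __getattr__ must live there.
    class Meta(type):
        def __getattr__(cls, attr):
            return getattr(class_, attr)

    class Wrapper(metaclass=Meta):
        def __init__(self, *args, **kwargs):
            self.instance = class_(*args, **kwargs)

    return Wrapper

@decorator
class ClassTest:
    static_var = "some value"

print(ClassTest.static_var)  # -> some value
```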

LogisticRegression scikit learn covariate (column) order matters on training

For some reason the order of the covariates seems to matter with a LogisticRegression classifier in scikit-learn, which seems odd to me. I have 9 covariates and a binary output; when I change the order of the columns, call fit(), and then call predict_proba(), the output is different. Toy example below.

logit_model = LogisticRegression(C=1e9, tol=1e-15)

The following

logit_model.fit(df[['column_2','column_1']], df['target'])
logit_model.predict_proba(df[['column_2','column_1']])

array([[ 0.27387109,  0.72612891] ..])

Gives a different result to:

logit_model.fit(df[['column_1','column_2']], df['target'])
logit_model.predict_proba(df[['column_1','column_2']])

array([[ 0.26117794,  0.73882206], ..])

This seems surprising to me, but maybe that's just my lack of knowledge about the internals of the algorithm and the fit method.

What am I missing?

EDIT: Here is the full code and data

data: https://s3-us-west-2.amazonaws.com/gjt-personal/test_model.csv

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('test_model.csv',index_col=False)

columns1 =['col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9']
columns2 =['col_2','col_1','col_3','col_4','col_5','col_6','col_7','col_8','col_9']

logit_model = LogisticRegression(C=1e9, tol=1e-15)

logit_model.fit(df[columns1],df['target'])
logit_model.predict_proba(df[columns1])

logit_model.fit(df[columns2],df['target'])
logit_model.predict_proba(df[columns2])

Turns out it's something to do with tol=1e-15, because this gives a different result:

LogisticRegression(C=1e9, tol=1e-15)

But this gives the same result.

LogisticRegression(C=1e9)

Solution:

Thanks for adding sample data.

Taking a deeper look at your data, it is clearly not standardized. If you apply a StandardScaler to the dataset and fit again, you will find that the prediction discrepancy disappears.

While this result is at least consistent, it is still troubling that it raises a LineSearchWarning and a ConvergenceWarning. To that I would say you have an exceedingly low tolerance here at 1e-15. Given the very high regularization penalty ratio (1e9) you have applied, raising tol back to the default 1e-4 has no real impact on the solution. This allows the model to properly converge and still produces the same outcome (in a much faster run time).

My full process looks like this:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('test_model.csv', index_col=False)

ss = StandardScaler()
cols1 = np.arange(9)
cols2 = np.array([1, 0, 2, 3, 4, 5, 6, 7, 8])
X = ss.fit_transform(df.drop('target', axis=1))

lr = LogisticRegression(solver='newton-cg', tol=1e-4, C=1e9)
lr.fit(X[:, cols1], df['target'])
preds_1 = lr.predict_proba(X[:, cols1])

lr.fit(X[:, cols2], df['target'])
preds_2 = lr.predict_proba(X[:, cols2])

preds_1 
array([[  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       ...,
       [  1.00000000e+00,   9.09277801e-31],
       [  1.00000000e+00,   3.52079327e-35],
       [  1.00000000e+00,   5.99607407e-30]])

preds_2
array([[  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   1.00000000e+00],
       ...,
       [  1.00000000e+00,   9.09277801e-31],
       [  1.00000000e+00,   3.52079327e-35],
       [  1.00000000e+00,   5.99607407e-30]])

The assertion preds_1 == preds_2 will fail, but the difference is on the order of 1e-40 or smaller for each value, which I would say is well beyond any plausible level of significance.
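As an illustration of why column order shouldn't matter once the optimizer truly converges, here is a numpy-only sketch (deliberately using a plain least-squares solve rather than the sklearn setup above): permuting the columns permutes the coefficients but leaves the predictions unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

perm = [1, 0, 2]  # swap the first two columns
w1, *_ = np.linalg.lstsq(X, y, rcond=None)
w2, *_ = np.linalg.lstsq(X[:, perm], y, rcond=None)

# Coefficients are permuted, predictions are identical (up to float noise)
assert np.allclose(w1[perm], w2)
assert np.allclose(X @ w1, X[:, perm] @ w2)
```

Any remaining discrepancy in the logistic case is therefore a convergence artifact, not a property of the model.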

How to read an input file of integers separated by a space using readlines in Python 3?

I need to read an input file (input.txt) which contains one line of integers (13 34 14 53 56 76) and then compute the sum of the squares of each number.

This is my code:

# define main program function
def main():
    print("\nThis is the last function: sum_of_squares")
    print("Please include the path if the input file is not in the root directory")
    fname = input("Please enter a filename : ")
    sum_of_squares(fname)

def sum_of_squares(fname):
    infile = open(fname, 'r')
    sum2 = 0
    for items in infile.readlines():
        items = int(items)
        sum2 += items**2
    print("The sum of the squares is:", sum2)
    infile.close()

# execute main program function
main()

If each number is on its own line, it works fine.

But I can't figure out how to do it when all the numbers are on one line separated by spaces. In that case, I receive the error: ValueError: invalid literal for int() with base 10: '13 34 14 53 56 76'

Solution:

You can use file.read() to get a string and then use str.split to split by whitespace.

You'll need to convert each number from a string to an int first and then use the built-in sum function to calculate the sum.

As an aside, you should use the with statement to open and close your file for you:

def sum_of_squares(fname):

    with open(fname, 'r') as myFile: # This closes the file for you when you are done
        contents = myFile.read()

    sumOfSquares = sum(int(i)**2 for i in contents.split())
    print("The sum of the squares is:", sumOfSquares)

Output:

The sum of the squares is: 13242
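Worth noting: because str.split() with no argument splits on any run of whitespace, including newlines, the same approach handles both file layouts. A small sketch with a hypothetical helper sum_of_squares_str that takes the file contents directly:

```python
def sum_of_squares_str(contents):
    # str.split() with no argument splits on any whitespace run,
    # including newlines, so both layouts are handled identically.
    return sum(int(i) ** 2 for i in contents.split())

one_line = "13 34 14 53 56 76"
multi_line = "13\n34\n14\n53\n56\n76\n"
assert sum_of_squares_str(one_line) == sum_of_squares_str(multi_line) == 13242
```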

Does the base for logarithmic calculations in Python influence the speed?

I have to use a lot of logarithmic calculations in one program. In terms of the logarithmic base, the procedure is not specific. I was wondering, if any base n (2? 10? e?) is faster in the Python 3.5 math module than others, because maybe under the hood all other bases a are transformed into log_a(x) = log_n(x)/log_n(a). Or does the choice of the base not influence the speed of the calculation, because all bases are implemented in the same way using a C library?

Solution:

In CPython, math.log is base independent, but platform dependent. The C source for the math module shows the implementation of math.log (around lines 1940-1961 of Modules/mathmodule.c):

math_log_impl(PyObject *module, PyObject *x, int group_right_1,
          PyObject *base)
/*[clinic end generated code: output=7b5a39e526b73fc9 input=0f62d5726cbfebbd]*/

{
    PyObject *num, *den;
    PyObject *ans;

    num = loghelper(x, m_log, "log"); // uses stdlib log
    if (num == NULL || base == NULL)
        return num;

    den = loghelper(base, m_log, "log"); // uses stdlib log
    if (den == NULL) {
        Py_DECREF(num);
        return NULL;
    }

    ans = PyNumber_TrueDivide(num, den);
    Py_DECREF(num);
    Py_DECREF(den);
    return ans;
}

Whenever a base is passed, this calculates the natural log of both the number and the base, so no explicit base is cheaper than another: each costs two calls to the C log function plus a division. (When no base is passed, base is NULL and only a single log call is made.)

This source also explains why, as the other answer observed, math.log2 and math.log10 are faster than math.log with an explicit base: they are implemented using the standard library's log2 and log10 functions respectively, which avoid the division. These functions, however, are defined differently depending on the platform.

Note: I am not very familiar with C so I may be wrong here.
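If you want to check on your own platform, here is a quick timeit sketch. Absolute timings will vary with the platform's libm, so no expected numbers are shown; the asserts just confirm the change-of-base identity the answer describes.

```python
import math
import timeit

x = 123.456
for stmt in ("math.log(x)", "math.log2(x)", "math.log10(x)", "math.log(x, 10)"):
    t = timeit.timeit(stmt, globals={"math": math, "x": x}, number=200_000)
    print(f"{stmt:18s} {t:.3f}s")

# The values themselves agree with the change-of-base identity:
assert math.isclose(math.log(x, 10), math.log(x) / math.log(10))
assert math.isclose(math.log2(x), math.log(x) / math.log(2))
```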

Trouble getting few items from a webpage

I've written a script in Python in combination with selenium to parse some items from a webpage. I can't get it working in any way. The items I'm after are (perhaps) within an iframe. I tried to switch to it, but that doesn't have any effect; I'm still getting nothing except a TimeoutException when it hits the line where I try to switch to the iframe. How can I get it working? Thanks in advance:

Here goes the webpage link: URL

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "replace_with_above_url"

driver = webdriver.Chrome()
driver.get(url)
wait = WebDriverWait(driver, 10)

wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "tradingview_fe623")))

for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".quick .apply-common-tooltip"))):
    print(item.text)

driver.quit()

The elements containing the items I'm after:

5 1h 1D 1M 1D

This is the output I'm expecting (it works locally when I try to get them using css selectors):

5
1h
1D
1M
1D


Solution:

The required nodes are located inside two nested iframes, so you need to switch to them one by one. Note that the id/name of the second one is generated dynamically. Just try to replace

wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "tradingview_fe623")))

with

wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, ".abs")))
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[id^='tradingview_']")))

How to optimize a nested for loop in Python

So I am trying to write a python function to return a metric called the Mielke-Berry R value. The metric is calculated as follows (formula image lost; reconstructed from the code below):

R = 1 - (n^2 * MAE) / (sum_i sum_j |y_j - x_i|)

The current code I have written works, but because of the sum of sums in the equation, the only thing I could think of to solve it was to use a nested for loop in Python, which is very slow…

Below is my code:

# Note: `mae` is a mean absolute error helper defined elsewhere in my code.
def mb_r(forecasted_array, observed_array):
    """Returns the Mielke-Berry R value."""
    assert len(observed_array) == len(forecasted_array)
    y = forecasted_array.tolist()
    x = observed_array.tolist()
    total = 0
    for i in range(len(y)):
        for j in range(len(y)):
            total = total + abs(y[j] - x[i])
    return 1 - (mae(forecasted_array, observed_array) * forecasted_array.size ** 2 / total)

The reason I converted the input arrays to lists is because I have heard (haven’t yet tested) that indexing a numpy array using a python for loop is very slow.

I feel like there may be some sort of numpy function to solve this much faster, anyone know of anything?

Solution:

Here’s one vectorized way to leverage broadcasting to get total

np.abs(forecasted_array[:,None] - observed_array).sum()

To accept both lists and arrays alike, we can use NumPy's builtin np.subtract.outer for the outer subtraction, like so –

np.abs(np.subtract.outer(forecasted_array, observed_array)).sum()

We can also make use of the numexpr module for faster absolute-value computation, performing the summation-reduction in one single numexpr evaluate call; as such it would be much more memory efficient, like so –

import numexpr as ne

forecasted_array2D = forecasted_array[:,None]
total = ne.evaluate('sum(abs(forecasted_array2D - observed_array))')
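For completeness, here's a sketch of a full drop-in replacement. Since the original mae helper isn't shown, the mean absolute error is computed inline, and a nested-loop version is kept as a reference to check against:

```python
import numpy as np

def mb_r_vectorized(forecasted_array, observed_array):
    """Mielke-Berry R via broadcasting (O(n^2) memory)."""
    y = np.asarray(forecasted_array, dtype=float)
    x = np.asarray(observed_array, dtype=float)
    total = np.abs(y[:, None] - x).sum()   # sum_i sum_j |y_j - x_i|
    mae = np.abs(y - x).mean()             # inline mean absolute error
    return 1 - mae * y.size ** 2 / total

def mb_r_loops(forecasted_array, observed_array):
    """Reference nested-loop version from the question."""
    y, x = list(forecasted_array), list(observed_array)
    total = sum(abs(y[j] - x[i]) for i in range(len(y)) for j in range(len(y)))
    mae = sum(abs(a - b) for a, b in zip(y, x)) / len(y)
    return 1 - mae * len(y) ** 2 / total

rng = np.random.default_rng(42)
f, o = rng.normal(size=50), rng.normal(size=50)
assert np.isclose(mb_r_vectorized(f, o), mb_r_loops(f, o))
```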

How to get the multiple max key values in a dictionary?

Let’s say I have a dictionary:

data = {'a':1, 'b':2, 'c': 3, 'd': 3}

I want to get the maximum value(s) in the dictionary. So far, I have been just doing:

max(zip(data.values(), data.keys()))[1]

but I’m aware that I could be missing another max value. What would be the most efficient way to approach this?

Solution:

Based on your example, it seems like you’re looking for the key(s) which map to the maximum value. You could use a list comprehension:

[k for k, v in data.items() if v == max(data.values())]
# ['c', 'd']

If you have a large dictionary, break this into two lines so that max is computed once rather than once per item:

mx = max(data.values())
[k for k, v in data.items() if v == mx]

In Python 2.x you will need .iteritems().
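If you'd rather make only a single pass over the items (the helper name max_keys below is just for illustration), you can track the running maximum and its keys together:

```python
def max_keys(d):
    """Single pass: track the best value and its keys together."""
    best, keys = None, []
    for k, v in d.items():
        if best is None or v > best:
            best, keys = v, [k]
        elif v == best:
            keys.append(k)
    return keys

data = {'a': 1, 'b': 2, 'c': 3, 'd': 3}
assert max_keys(data) == ['c', 'd']
```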

How to use list comprehension with matrices in python?

How would I write the following using list comprehension?

def mv(A,X,n):
    Y = [0]*n
    for i in range(n):
        for j in range(n):
            Y[i] += A[i][j] * X[j]
    return Y

I believe that A is a matrix and that X is a vector. This is what I have tried so far, but it does not output the same thing:

def mv2(A,X,n):
    res = [sum((A[i][j] * X[i]) for i in range(n) for j in range(n))]
    return res

Solution:

You are very close to the right answer; you just need to apply sum to the right target:

return [sum([A[i][j] * X[j] for j in range(n)]) for i in range(n)]

Note: if you want to do the math with a library, numpy is a good option:

import numpy as np
def mv2(A, X):
    A = np.array(A)
    X = np.array(X)
    return np.dot(A, X)
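A quick sanity check (on a small hypothetical matrix) that the corrected comprehension matches the original loop version:

```python
def mv(A, X, n):
    # original nested-loop version
    Y = [0] * n
    for i in range(n):
        for j in range(n):
            Y[i] += A[i][j] * X[j]
    return Y

def mv2(A, X, n):
    # list-comprehension version: inner sum over j, outer list over i
    return [sum(A[i][j] * X[j] for j in range(n)) for i in range(n)]

A = [[1, 2], [3, 4]]
X = [5, 6]
assert mv(A, X, 2) == mv2(A, X, 2) == [17, 39]
```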

Python: ax.text not displaying in saved PDF

I am creating a figure with some text (example here: a sin curve with some text on the side) in an ipython notebook. The plot and text show up inline in my notebook, but when I save the figure I only see the plot and not the text. I’ve reproduced the problem with this example code:

import numpy as np
import matplotlib.pyplot as plt

fig,ax = plt.subplots(1)
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
ax.plot(x, y)
ax.text(8,0.9,'Some Text Here',multialignment='left', linespacing=2.)
plt.savefig('sin.pdf')

How can I see the text in the saved pdf?

Solution:

Figures shown inline in a jupyter notebook are saved as png images, with the option bbox_inches="tight".

In order to produce a pdf which looks exactly the same as the png in the notebook, you also need to use this option.

plt.savefig('sin.pdf', bbox_inches="tight")

The reason is that the coordinates (8,0.9) are outside the figure. So the text won’t appear in the saved version of it (It wouldn’t appear in an interactive figure either). The option bbox_inches="tight" expands or shrinks the saved range to include all elements of the canvas. Using this option is indeed useful for easily including elements which are outside the plot without having to care about figure size, margins and coordinates at all.

A final note: You are specifying the text's position in data coordinates. This is usually undesired, because it makes the text's position dependent on what data is shown in the axes. Instead it would make sense to specify it in axes coordinates,

ax.text(1.1, .9, 'Some Text Here', va="top", transform=ax.transAxes)

such that it always sits at position (1.1,.9) with respect to the axes.
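Putting both fixes together, a minimal end-to-end sketch (using the non-interactive Agg backend so it runs outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1)
x = np.linspace(0, 2 * np.pi, 100)
ax.plot(x, np.sin(x))
# axes coordinates + bbox_inches="tight": the text survives saving
ax.text(1.02, 0.9, 'Some Text Here', va="top", transform=ax.transAxes)
fig.savefig('sin.pdf', bbox_inches="tight")
```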

Time-intensive collection processing in Python

The code has been vastly simplified, but should serve to illustrate my question.

S = ('A1RT', 'BDF7', 'CP09')
for s in S:
    if is_valid(s): # very slow!
        process(s)

I have a collection of strings obtained from a site-scrape. (Strings will be retrieved from site-scrapes periodically.) Each of these strings need to be validated, over the network, against a third party. The validation process can be slow at times, which is problematic. Due to the iterative nature of the above code, it may take some time before the last string is validated and processed.

Is there a proper way to parallelize the above logic in Python? To be frank, I’m not very familiar with concurrency / parallel-processing concepts, but it would seem as though they may be useful in this circumstance. Thoughts?

Solution:

The concurrent.futures module is a great way to start work on “embarrassingly parallel” problems, and can very easily be switched between using either multiple processes or multiple threads within a single process.

In your case, it sounds like the “hard work” is being done on other machines over the network, and your main program will spend most of its time waiting for them to deliver results. If so, threads should work fine. Here’s a complete, executable toy example:

import concurrent.futures as cf

def is_valid(s):
    import random
    import time
    time.sleep(random.random() * 10)
    return random.choice([False, True])

NUM_WORKERS = 10  # number of threads you want to run

strings = list("abcdefghijklmnopqrstuvwxyz")

with cf.ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    # map a future object to the string passed to is_valid
    futures = {executor.submit(is_valid, s): s for s in strings}
    # `as_complete()` returns results in the order threads
    # complete work, _not_ necessarily in the order the work
    # was passed out
    for future in cf.as_completed(futures):
        result = future.result()
        print(futures[future], result)

And here’s sample output from one run:

g False
i True
j True
b True
f True
e True
k False
h True
c True
l False
m False
a False
s False
v True
q True
p True
d True
n False
t False
z True
o True
y False
r False
w False
u True
x False

concurrent.futures handles all the headaches of starting threads, parceling out work for them to do, and noticing when threads deliver results.

As written, up to 10 (NUM_WORKERS) is_valid() invocations can be active simultaneously. as_completed() returns a future object as soon as its result is ready to retrieve, and the executor automatically hands the thread that computed the result another string for is_valid() to chew on.
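One more option: if you need the results back in input order instead, executor.map is simpler than as_completed(). A sketch with a dummy is_valid standing in for the slow network check:

```python
import concurrent.futures as cf

def is_valid(s):
    # stand-in for the slow network validation
    return s in ('A1RT', 'CP09')

strings = ('A1RT', 'BDF7', 'CP09')

# executor.map returns results in *input* order, unlike as_completed(),
# which makes it easy to pair inputs with outputs positionally.
with cf.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(is_valid, strings))

valid = [s for s, ok in zip(strings, results) if ok]
print(valid)  # -> ['A1RT', 'CP09']
```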