Python iterate through array while finding the mean of the top k elements

Suppose I have a Python array a=[3, 5, 2, 7, 5, 3, 6, 8, 4]. My goal is to iterate through this array 3 elements at a time returning the mean of the top 2 of the three elements.

Using the array above, during my iteration step, the first three elements are [3, 5, 2] and the mean of the top 2 elements is 4. The next three elements are [5, 2, 7] and the mean of the top 2 elements is 6. The next three elements are [2, 7, 5] and the mean of the top 2 elements is again 6. …

Hence, the result for the above array would be [4, 6, 6, 6, 5.5, 7, 7].

What is the nicest way to write such a function?

Solution:

Solution

You can use some fancy slicing of your list to manipulate subsets of elements. Simply grab each three element sublist, sort to find the top two elements, and then find the simple average (aka. mean) and add it to a result list.

Code

def get_means(input_list):
    means = []
    for i in range(len(input_list)-2):
        three_elements = input_list[i:i+3]
        top_two = sorted(three_elements, reverse=True)[:2]
        means.append(sum(top_two)/2.0)
    return means

Example

print(get_means([3, 5, 2, 7, 5, 3, 6, 8, 4]))
# [4.0, 6.0, 6.0, 6.0, 5.5, 7.0, 7.0]
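
If the window size or the number of top elements may change, the same idea can be generalized. The following is only a sketch: the function name top_k_window_means and its window and k parameters are made up for illustration, and heapq.nlargest stands in for the sort-and-slice step used above.

```python
import heapq

def top_k_window_means(values, window=3, k=2):
    """Mean of the k largest items in every sliding window of the given size."""
    means = []
    # There are len(values) - window + 1 full windows (none if the list is too short).
    for i in range(len(values) - window + 1):
        top = heapq.nlargest(k, values[i:i + window])
        means.append(sum(top) / float(k))
    return means

result = top_k_window_means([3, 5, 2, 7, 5, 3, 6, 8, 4])
```

With the defaults this should give the same list as get_means above.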

Can't stream files from Amazon s3 using requests

I’m trying to stream crawl data from Common Crawl, but Amazon S3 returns an error when I pass stream=True to requests.get. Here is an example:

resp = requests.get(url, stream=True)
print(resp.raw.read())

When I run this on a Common Crawl s3 http url, I get the response:

b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message><Key>crawl-data/CC-
MAIN-2018-05/segments/1516084886237.6/warc/CC-
MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>
<RequestId>3652F4DCFAE0F641</RequestId><HostId>Do0NlzMr6
/wWKclt2G6qrGCmD5gZzdj5/GNTSGpHrAAu5+SIQeY15WC3VC6p/7/1g2q+t+7vllw=
</HostId></Error>'

I am using warcio and need a streaming file object as input to the archive iterator, and I can’t download the file all at once because of limited memory. What should I do?

PS. The url I request in the example is https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz

Solution:

There is an error in your url. Compare the key in the response you are getting:

<Key>crawl-data/CC-
MAIN-2018-05/segments/1516084886237.6/warc/CC-
MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>

to the one in the intended url:

https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz

For some reason you are adding unnecessary whitespace, probably picked up during file reading (readline() will give you trailing ‘\n’ characters on every line). Maybe try calling .strip() to remove the trailing newline.
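
As a rough sketch of that suggestion, assuming the URLs come from a text file with one URL per line (the helper name read_urls and the example.com URLs are made up for illustration):

```python
def read_urls(lines):
    """Return the non-empty lines with surrounding whitespace (including '\n') removed."""
    return [line.strip() for line in lines if line.strip()]

# Simulated result of reading a file line by line; every entry ends in a newline.
raw_lines = [
    "https://example.com/a.warc.gz\n",
    "\n",
    "https://example.com/b.warc.gz\n",
]
urls = read_urls(raw_lines)
```

Each cleaned URL can then be passed to requests.get(url, stream=True) as in the question.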

Python: input to os.system getting split

I wish to pass a list of lists as an argument to my Python program. When I do this from a normal shell it works absolutely fine; however, when I do the same from within os.system, my list of lists gets split apart.

import sys
import json
import os
#path=~/Desktop/smc/fuzzy/
os.system("test -d results || mkdir results")
C1=[-5,-2.5,0,2.5,5];spr1=2.5;LR1=[10,20,30,30,30]
C2=[-4,-3,-2,-1,0,1,2,3,4];spr2=1;LR2=[30,40,50,50,50]
C3=[-4,-2,0,2,4];spr3=2;LR3=[40,50,60,60,60]
arg=[[spr1,LR1,C1],[spr2,LR2,C2],[spr3,LR3,C3]]
for i in range(len(arg)):
    print ('this is from the main automate file:',arg[i])
    print('this is stringized version of the input:',str(arg[i]))
    inp=str(arg[i])
    os.system("python "+"~/Desktop/smc/fuzzy/"+"name_of_my_python_file.py "+ inp)   
    os.system("mv "+"*_"+str(arg[i])+" results")

This is the output it produces:

('this is from the main automate file:', [2.5, [10, 20, 30, 30, 30], [-5, -2.5, 0, 2.5, 5]])
('this is stringized version of the input:', '[2.5, [10, 20, 30, 30, 30], [-5, -2.5, 0, 2.5, 5]]')
('from the main executable file:', ['/home/amardeep/Desktop/smc/fuzzy/name_of_my_python_file.py', '[2.5,', '[10,', '20,', '30,', '30,', '30],', '[-5,', '-2.5,', '0,', '2.5,', '5]]'])

In the third line it is splitting the list at every comma and hence mangling it. Is there a way I can bypass this?
Instead of passing a neat list of list like:

[2.5, [10, 20, 30, 30, 30], [-5, -2.5, 0, 2.5, 5]]

it is passing something like

[2.5,', '[10,', '20,', '30,', '30,', '30],', '[-5,', '-2.5,', '0,', '2.5,', '5]]']

I need to be able to pass a list of list as argument to my python program.

Solution:

  1. don’t use os.system: the Python docs recommend the subprocess module instead, and os.system cannot compose a proper command line with quoted args, etc. (since inp contains spaces, the shell splits it into several arguments unless you quote it, and quoting by hand becomes a mess quickly)
  2. don’t use mv when you have shutil.move

My proposal: use subprocess.check_call (it also works on Python < 3.5, where subprocess.run is not available). Using os.path.expanduser expands ~ without needing shell=True:

import os
import subprocess
subprocess.check_call(["python",
           os.path.expanduser("~/Desktop/smc/fuzzy/name_of_my_python_file.py"),inp])

Passing the arguments as a list lets check_call handle the quoting when needed.

Now, to move the files use a loop on globbed files and shutil:

import glob,shutil
for file in glob.glob("*_"+str(arg[i])):
   shutil.move(file,"results")

However, in the long run, since you’re calling a python program from a python program and you’re passing python lists, you’d be better off with simple module imports and function calls, passing the lists directly rather than converting them to strings that you have to parse back in the subprocess.

This present answer is better suited for non-python subprocesses.
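
If you do keep a separate process, one way to avoid fragile string parsing is to serialize the list as JSON and pass it as a single argument. This is only a sketch: the helper name run_with_list is made up, and the child process here is a hypothetical one-line script that parses its first argument and echoes it back.

```python
import json
import subprocess
import sys

def run_with_list(arg):
    """Send a nested list to a child Python process as one JSON argument and read it back."""
    # Hypothetical child script: parse sys.argv[1] as JSON and print it again.
    child_code = "import json, sys; print(json.dumps(json.loads(sys.argv[1])))"
    # The argument list keeps the JSON text as a single argv entry, spaces included.
    out = subprocess.check_output([sys.executable, "-c", child_code, json.dumps(arg)])
    return json.loads(out.decode())

echoed = run_with_list([2.5, [10, 20, 30, 30, 30], [-5, -2.5, 0, 2.5, 5]])
```

In the real child script, json.loads(sys.argv[1]) would give back the nested list in one piece.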

As an aside, don’t use system calls to create directories:

os.system("test -d results || mkdir results")

can be replaced by full-python code, OS independent:

if not os.path.isdir("results"):
   os.mkdir("results")

numpy broadcasting to all dimensions

I have a 3d numpy array built like this:

a = np.ones((3,3,3))

And I would like to broadcast values on all dimensions starting from a certain point with given coordinates, but the number of dimensions may vary.

For example, if I’m given the coordinates (1,1,1) I can run these 3 assignments:

a[1,1,:] = 0
a[1,:,1] = 0
a[:,1,1] = 0

And the result will be my desired output which is:

array([[[1., 1., 1.],
        [1., 0., 1.],
        [1., 1., 1.]],

       [[1., 0., 1.],
        [0., 0., 0.],
        [1., 0., 1.]],

       [[1., 1., 1.],
        [1., 0., 1.],
        [1., 1., 1.]]])

Or if I’m given the coordinates (0,1,0) the corresponding broadcast will be:

a[0,1,:] = 0
a[0,:,0] = 0
a[:,1,0] = 0

Is there any way to do this in a single action instead of 3? I’m asking because the actual arrays I’m working with have even more dimensions, which makes the code long and redundant. Also, if the number of dimensions changes I would have to rewrite the code.

EDIT: It doesn’t have to be a single action; I just need to do it in all dimensions programmatically, so that the code stays the same if the number of dimensions changes.

EDIT 2: About the logic of this, I’m not sure if it’s relevant, but I’m given the value of a point (by coordinates) on a map, and based on that I know the values of the entire row, column and height on the same map (that’s why I’m updating all 3 with 0 as an example). In other cases the map is 2-dimensional and I still know the same thing about the row and column, but I can’t figure out a function that works for a varying number of dimensions.

Solution:

Here’s a way to generate exactly the 3 lines of code you’re currently using as strings, and then execute them:

import numpy as np

a = np.ones([3,3,3])
coord = [1, 1, 1]

for i in range(len(coord)):
   temp = coord[:]
   temp[i] = ':'
   slice_str = ','.join(map(str, temp))
   exec("a[%s] = 0"%slice_str)

print(a)

This may not be the best approach, but at least it’s amusing. Now that we know that it works, we can go out and find the appropriate syntax to do it without actually generating the string and execing it. For example, you could use slice:

import numpy as np

a = np.ones([3,3,3])
coord = [1, 1, 1]

for i, length in enumerate(a.shape):
   temp = coord[:]
   temp[i] = slice(length)
   a[tuple(temp)] = 0  # index with a tuple; recent NumPy versions reject a list that contains slices
print(a)
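
The index-building step can also be pulled out into a small helper that works for any number of dimensions. This is a sketch under assumptions: the name axis_line_indices is made up, and slice(None) is used as the programmatic equivalent of a bare ':'.

```python
def axis_line_indices(coord):
    """For each axis, build an index tuple equal to coord but with ':' on that axis."""
    indices = []
    for axis in range(len(coord)):
        idx = list(coord)
        idx[axis] = slice(None)     # same meaning as ':' in a[1, :, 1]
        indices.append(tuple(idx))  # a tuple is what array indexing expects
    return indices

lines_3d = axis_line_indices((1, 1, 1))
lines_2d = axis_line_indices((0, 2))
```

With a NumPy array a, the update would then be: for idx in axis_line_indices(coord): a[idx] = 0.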

Pandas: Filtering multiple conditions

I’m trying to do boolean indexing with a couple conditions using Pandas. My original DataFrame is called df. If I perform the below, I get the expected result:

temp = df[df["bin"] == 3]
temp = temp[(~temp["Def"])]
temp = temp[temp["days since"] > 7]
temp.head()

However, if I do this (which I think should be equivalent), I get no rows back:

temp2 = df[df["bin"] == 3]
temp2 = temp2[~temp2["Def"] & temp2["days since"] > 7]
temp2.head()

Any idea what accounts for the difference?

Solution:

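Use () because of operator precedence: & binds more tightly than >, so ~temp2["Def"] & temp2["days since"] > 7 is evaluated as (~temp2["Def"] & temp2["days since"]) > 7: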

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

Alternatively, create conditions on separate rows:

cond1 = df["bin"] == 3    
cond2 = df["days since"] > 7
cond3 = ~df["Def"]

temp2 = df[cond1 & cond2 & cond3]
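
To see how Python groups the expression without parentheses, you can inspect its syntax tree with the standard ast module. This is just an illustration: the helper top_level_node and the variable names d and s are placeholders, not part of pandas.

```python
import ast

def top_level_node(expr):
    """Return the class name of the outermost AST node of a Python expression."""
    return type(ast.parse(expr, mode="eval").body).__name__

# Without parentheses the comparison is applied last, to the result of '&'.
unparenthesized = top_level_node("~d & s > 7")
# With parentheses the '&' is applied last, combining two boolean conditions.
parenthesized = top_level_node("~d & (s > 7)")
```

Here unparenthesized is a comparison node and parenthesized is a binary-operation node, which is why the two versions filter differently.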

Sample:

import pandas as pd

df = pd.DataFrame({'Def':[True] *2 + [False]*4,
                   'days since':[7,8,9,14,2,13],
                   'bin':[1,3,5,3,3,3]})

print (df)
     Def  bin  days since
0   True    1           7
1   True    3           8
2  False    5           9
3  False    3          14
4  False    3           2
5  False    3          13


temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
     Def  bin  days since
3  False    3          14
5  False    3          13

Edit base.css file from S3 Bucket

I’m using AWS S3 to serve my static files – however I’ve just found out you can’t edit them directly from S3, which kind of makes it pointless as I will be continuously changing things on my website. So – is the conventional way to make the changes then re-upload the file? Or do most developers store their base.css file in their repository so it’s easier to change?

I’m using Django for my project, so there is only supposed to be one static path (for me that’s my S3 bucket). Or is there another content delivery network where I can edit the contents of a file directly on the go, which would be better?

Solution:

Yes, imo it would be unusual to edit the files of your production website directly from where they are served.

Edit them locally, check them into your repo and then deploy them to s3 from your repo, perhaps using a tool like Jenkins. If you make a mistake, you have something to roll back to.

I can’t think of any circumstances where editing your files directly in production is a good idea.

How do I split a string into several columns in a dataframe with pandas Python?

I am aware of the following questions:

1.) How to split a column based on several string indices using pandas?
2.) How do I split text in a column into multiple rows?

I want to split these into several new columns though. Suppose I have a dataframe that looks like this:

id    | string
-----------------------------
1     | astring, isa, string
2     | another, string, la
3     | 123, 232, another

I know that using:

df['string'].str.split(',')

I can split a string. But as a next step, I want to efficiently put the split string into new columns like so:

id    | string_1 | string_2 | string_3
-----------------|---------------------
1     | astring  | isa      | string
2     | another  | string   | la
3     | 123      | 232      | another
---------------------------------------

I could for example do this:

for index, row in df.iterrows():
    i = 1
    for item in row['string'].split(','):
        # .loc creates the new column on first assignment
        df.loc[index, 'string_{0}'.format(i)] = item.strip()
        i = i + 1

But how could one achieve the same result more elegantly?

Solution:

The str.split method has an expand argument (split on ', ' rather than ',' if you don’t want the leading space left on each value):

>>> df['string'].str.split(',', expand=True)
         0        1         2
0  astring      isa    string
1  another   string        la
2      123      232   another
>>>

With column names:

>>> df['string'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
   string1  string2   string3
0  astring      isa    string
1  another   string        la
2      123      232   another

Much neater with Python >= 3.6 f-strings:

>>> (df['string'].str.split(',', expand=True)
...              .rename(columns=lambda x: f"string_{x+1}"))
  string_1 string_2  string_3
0  astring      isa    string
1  another   string        la
2      123      232   another

Python concatenating elements of one list that are between elements of another list

I have two lists: a and b. I want to concatenate all of the elements of the b that are between elements of a. All of the elements of a are in b, but b also has some extra elements that are extraneous. I would like to take the first instance of every element of a in b and concatenate it with the extraneous elements that follow it in b until we find another element of a in b. The following example should make it more clear.

a = [[11.0, 1.0], [11.0, 2.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [12.0, 11.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]  

b = [[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0], [11.0, 2.0], [11.0, 3.0], [11.0, 3.0], [11.0, 4.0], [11.0, 5.0], [12.0, 1.0], [12.0, 2.0], [12.0, 3.0], [12.0, 4.0], [12.0, 5.0], [12.0, 6.0], [12.0, 7.0], [12.0, 5.0], [12.0, 8.0], [12.0, 9.0], [12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0], [3.0, 1.0], [13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0], [13.0, 2.0], [11.0, 3.0], [13.0, 3.0], [13.0, 4.0], [13.0, 5.0], [13.0, 5.0], [13.0, 5.0], [13.0, 6.0], [13.0, 7.0], [13.0, 7.0], [13.0, 8.0], [13.0, 9.0], [13.0, 10.0]]

c = [[[11.0, 1.0], [11.0, 1.0], [1281.0, 8.0]], [[11.0, 2.0]], [[11.0, 3.0], [11.0, 3.0]], [[11.0, 4.0]], [[11.0, 5.0]], [[12.0, 1.0]], [[12.0, 2.0]], [[12.0, 3.0]], [[12.0, 4.0]], [[12.0, 5.0]], [[12.0, 6.0]], [[12.0, 7.0], [12.0, 5.0]], [[12.0, 8.0]], [[12.0, 9.0]], [[12.0, 10.0], [13.0, 5.0]], [[12.0, 11.0], [12.0, 8.0], [3.0, 1.0]], [[13.0, 1.0], [9.0, 7.0], [12.0, 12.0], [12.0, 13.0], [12.0, 14.0], [13.0, 1.0]], [[13.0, 2.0]], [[11.0, 3.0], [13.0, 3.0]], [[13.0, 4.0]], [[13.0, 5.0], [13.0, 5.0], [13.0, 5.0]], [[13.0, 6.0]], [[13.0, 7.0], [13.0, 7.0]], [[13.0, 8.0]], [[13.0, 9.0]], [[13.0, 10.0]]]

What I have thought of is something like this:

slice_list = []
for i, elem in enumerate(a):
    if i < len(a)-1:
        b_first_index = b.index(a[i])
        b_second_index = b.index(a[i+1]) 
        slice_list.append([b_first_index, b_second_index])

c = [b[start:stop] for start, stop in slice_list]

This however will not catch the last item in the list (which I am not quite sure how to fit into my list comprehension anyways) and it seems quite ugly. My question is, is there a neater way of doing this (perhaps in itertools)?

Solution:

I think your example output (c in the question, called wrong_list_fixed below) is incorrect.

        [[12.0, 10.0], [13.0, 5.0], [12.0, 11.0], [12.0, 8.0],
# There should be a new list here -^

Here’s a solution that walks the lists, where key_list is your a and wrong_list is your b. It can be optimized further:

from contextlib import suppress

fixed = []
current = []
key_list_iter = iter(key_list)
next_key = next(key_list_iter)
for wrong in wrong_list:
    if wrong == next_key:
        if current:
            fixed.append(current)
            current = []
        next_key = None
        with suppress(StopIteration):
            next_key = next(key_list_iter)
    current.append(wrong)

if current:
    fixed.append(current)

Here are the correct lists (modified to be easier to visually parse):

key_list = ['_a0', '_b0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_m0', '_n0', '_o0', '_p0', '_q0', '_r0', '_s0', '_t0', '_u0', '_v0', '_w0', '_x0', '_y0', '_z0', '_A0', '_B0', '_C0'] 
wrong_list = ['_a0', '_a0', 'D0', '_b0', '_c0', '_c0', '_d0', '_e0', '_f0', '_g0', '_h0', '_i0', '_j0', '_k0', '_l0', '_j0', '_m0', '_n0', '_o0', '_x0', '_p0', '_m0', 'E0', '_t0', 'F0', '_q0', '_r0', '_s0', '_t0', '_u0', '_c0', '_v0', '_w0', '_x0', '_x0', '_x0', '_y0', '_z0', '_z0', '_A0', '_B0', '_C0'] 
wrong_list_fixed = [['_a0', '_a0', 'D0'], ['_b0'], ['_c0', '_c0'], ['_d0'], ['_e0'], ['_f0'], ['_g0'], ['_h0'], ['_i0'], ['_j0'], ['_k0'], ['_l0', '_j0'], ['_m0'], ['_n0'], ['_o0', '_x0'], ['_p0', '_m0', 'E0', '_t0', 'F0'], ['_q0'], ['_r0'], ['_s0'], ['_t0'], ['_u0', '_c0'], ['_v0'], ['_w0'], ['_x0', '_x0', '_x0'], ['_y0'], ['_z0', '_z0'], ['_A0'], ['_B0'], ['_C0']] 
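
The same walk can be wrapped in a reusable function. This is a sketch, not a drop-in replacement: the name group_by_keys is made up, and next() with a default value replaces the suppress(StopIteration) block used above.

```python
def group_by_keys(keys, items):
    """Start a new group each time the next expected key shows up in items."""
    done = object()  # sentinel meaning "no keys left"; never equal to a real item
    groups = []
    current = []
    key_iter = iter(keys)
    next_key = next(key_iter, done)
    for item in items:
        if item == next_key:
            if current:
                groups.append(current)
            current = []
            next_key = next(key_iter, done)
        current.append(item)
    if current:
        groups.append(current)
    return groups

grouped = group_by_keys(['a', 'b', 'c'], ['a', 'a', 'x', 'b', 'c', 'c'])
```

Because only the next expected key opens a new group, repeated or out-of-order items are attached to the group that is currently open.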

How to maximize window in Jenkins using Python, Selenium Webdriver and Chromedriver?

How to maximize a window in Jenkins?
I tried to maximize window using:

chromeOptions = webdriver.ChromeOptions()
chromeOptions.add_argument("--start-maximized")
self.driver = webdriver.Chrome(chrome_options=chromeOptions)


driver.maximize_window()

driver.set_window_size(1920,1080)

driver.fullscreen_window()

All works when I run the test in PyCharm, but in Jenkins, the window does not change size.

Solution:

The problem is that Jenkins is running as a Windows service, which has no interactive desktop session for the browser window to resize in.

In a Windows command prompt, go to the folder where the jenkins-cli.jar file is located and stop the service:

java -jar jenkins-cli.jar -s http://localhost:8080 safe-shutdown --username "YourUsername" --password "YourPassword"

Then have Jenkins run from the command prompt using a script, so that it is no longer running as a service.

parse string for key, value pairs with a known key delimiter

How can I convert a string to a dict, if key strings are known substrings with definite delimiters? Example:

s = 'k1:text k2: more text k3:andk4: more yet'
key_list = ['k1','k2','k3']
(missing code)
# s_dict = {'k1':'text', 'k2':'more text', 'k3':'andk4: more yet'}  

In this case, keys must be preceded by a space, newline, or be the first character of the string and must be followed (immediately) by a colon, else they are not parsed as keys. Thus in the example, k1, k2, and k3 are read as keys, while k4 is part of k3’s value. I’ve also stripped trailing white space, but consider this optional.

Solution:

You can use re.findall to do this:

>>> import re
>>> dict(re.findall(r'(?:(?<=\s)|(?<=^))(\S+?):(.*?)(?=\s[^\s:]+:|$)', s))
{'k1': 'text', 'k2': ' more text', 'k3': 'andk4: more yet'}

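The regular expression requires a little trial-and-error. Stare at it long enough, and you’ll understand what it’s doing. Note that the value for k2 keeps its leading space; call .strip() on the values if you want it removed.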

Details

(?:          
   (?<=\s)   # lookbehind for a space 
   |         # regex OR
   (?<=^)    # lookbehind for start-of-line
)     
(\S+?)       # non-greedy match for anything that isn't a space
:            # literal colon
(.*?)        # non-greedy match
(?=          # lookahead (this handles the third key's case)
   \s        # space  
   [^\s:]+   # anything that is not a space or colon
   :         # colon
   |
   $         # end-of-line
)
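
The pattern above treats any non-space token followed by a colon as a key. If only the entries of key_list should count, a variant can build the pattern from the list itself. This is a sketch: the function name parse_pairs is made up, and stripping the values is the optional step mentioned in the question.

```python
import re

def parse_pairs(s, keys):
    """Split s into {key: value} using only the given keys as delimiters."""
    # Longest keys first so that a key which is a prefix of another cannot shadow it.
    alternatives = "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))
    # A key counts only at the start of the string or right after whitespace,
    # and it must be followed immediately by a colon.
    pattern = re.compile(r"(?:^|(?<=\s))(%s):" % alternatives)
    matches = list(pattern.finditer(s))
    result = {}
    for match, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(s)
        result[match.group(1)] = s[match.end():end].strip()
    return result

s = 'k1:text k2: more text k3:andk4: more yet'
s_dict = parse_pairs(s, ['k1', 'k2', 'k3'])
```

Since k4 is not in the key list and is not preceded by whitespace, it stays part of the value for k3.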