Replace the top 10 values in numpy

Is there an easy way to replace the top 10 values with 1 and the rest with zeros? I have found that NumPy’s argpartition can give me a new array of indices, but I haven’t been able to apply it easily to the original array.
Can anyone help?
Thanks in Advance

Solution:

You could do it using np.sort to find the 10th largest value, and then use np.where to flag the array.

import numpy as np

a = np.random.rand(30)

a_10 = np.sort(a)[-10]

a_new = np.where(a >= a_10, 1, 0)

print(a)     # Print the original
print(a_new) # Print the 0/1 flag array

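EDIT: A single-line version that overwrites a with the flags (np.where returns a new array, so this rebinds the name rather than modifying the array in place) is thus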

a = np.where(a >= np.sort(a)[-10], 1, 0)

EDIT2: The answer can be extended to 2D. I made a 6×6 matrix, where I flag per row the 3 largest values with a 1.

# 2D example, flag the top 3 per row
a = np.random.rand(6, 6)

a_3 = np.sort(a, axis=1)[:,-3]
a_new = np.where(a >= a_3[:,None], 1, 0)

print(a)
print(a_new)
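
Since the question mentions argpartition, here is a minimal sketch of that route. It assumes a 1-D array and uses hypothetical names (k, idx, flags) that are not in the original answer. Unlike the >= comparison, it flags exactly k entries even when several values are tied at the cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
a = rng.random(30)
k = 10

# argpartition places the indices of the k largest values in the last k
# slots (in no particular order) without fully sorting the array.
idx = np.argpartition(a, -k)[-k:]

flags = np.zeros(a.shape, dtype=int)
flags[idx] = 1

# With ties at the cutoff, the threshold approach can flag more than k entries,
# while the index-based approach still flags exactly k.
b = np.array([1.0, 1.0, 1.0, 1.0])
by_threshold = np.where(b >= np.sort(b)[-2], 1, 0)
by_index = np.zeros(b.shape, dtype=int)
by_index[np.argpartition(b, -2)[-2:]] = 1
```

Which of the tied entries gets flagged by the index-based version is arbitrary, so the threshold version may still be preferable if ties should all be treated alike.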

How to correctly handle duplicates in computing the complementary list?

While this question may seem to be related to previous ones (like this one: Python, compute list difference), it is not exactly the same, and even the best rated answer containing two suggestions will not exactly answer the following one.

I have a main (unordered) list L containing values with duplicates; take for instance a list of integers:

L = [3, 1, 4, 1, 5, 9, 2, 6, 5]

I have a smaller list containing a choice of values from L, for instance:

x = [4, 1, 3]

The order of the elements in x is not related in any way to the order of the elements in L.

Now, I would like to compute the difference L-x in such a way that concatenating x and this difference would give the same list as L (except for the order); to be more precise:

list(sorted(x + D(L,x))) == list(sorted(L))

The first bad idea is obviously to use sets, since duplicates will not be handled correctly.

The second bad idea is to use some list comprehension with a filter like:

[ e for e in L if e not in x ]

since the value 1 in my example will be discarded though one instance of this value should occur in the expected difference.

As far as I can see, the most efficient way of doing it would be to sort both lists, then iterate over both lists (an iterator could be helpful) and carefully take duplicates into account; this would be an O(n log n) solution.

I am not looking for speed; I rather wonder if some concise pythonic syntax could do it; even O(n²) or worse could be acceptable if it could do the expected task in one or two lines.

Solution:

You want the multiset operations provided by collections.Counter:

>>> from collections import Counter
>>> L = [3, 1, 4, 1, 5, 9, 2, 6, 5]
>>> x = [4, 1, 3]
>>> list((Counter(L) - Counter(x)).elements())
[1, 5, 5, 9, 2, 6]

This is O(n). You can also preserve order and maintain O(n) using an OrderedCounter if necessary.

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict): 
    pass
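
As a quick check, the sketch below wraps the Counter expression in a function named D (the name used in the question) and verifies the property the question asks for. The variable names difference and property_holds are only illustrative.

```python
from collections import Counter

def D(L, x):
    """Multiset difference L - x, returned as a list."""
    return list((Counter(L) - Counter(x)).elements())

L = [3, 1, 4, 1, 5, 9, 2, 6, 5]
x = [4, 1, 3]

difference = D(L, x)

# The property from the question: x plus the difference is L, up to order.
property_holds = sorted(x + difference) == sorted(L)
```

Note that Counter subtraction drops counts that fall to zero or below, so values in x that do not occur in L are silently ignored rather than reported.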

Is an instance of object (but not a subclass) guaranteed to compare unequal to any other object?

Apologies if this is a dupe. (I couldn’t find it but I’m not very good with google.)

I just stumbled over some code where they use

x = object()

in a place where they probably want x to compare not equal to anything that’s already there. Is that guaranteed by the language?

Solution:

Nothing is guaranteed. You can make anything equal to anything else by implementing __eq__.

Unless you know what x is, nothing is guaranteed and nothing can be assumed.

For example:

class A:
    def __eq__(self, other):
        return True

print(A() == object())
# True

And the contrary:

class A:
    def __eq__(self, other):
        return False

print(A() == object())
# False
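
The same holds when the plain object() is on the left-hand side of the comparison, which is the situation in the question. The sketch below uses a hypothetical class AlwaysEqual to show this, and contrasts it with an identity check (is), which a class cannot override.

```python
class AlwaysEqual:
    """Hypothetical class whose instances claim to equal everything."""
    def __eq__(self, other):
        return True

sentinel = object()
other = AlwaysEqual()

# Even with the plain object on the left, Python ends up consulting
# AlwaysEqual.__eq__, so the comparison is True.
eq_result = (sentinel == other)

# Identity comparison cannot be customized: only the very same object matches.
is_result = (sentinel is other)
same_result = (sentinel is sentinel)
```

So code that needs a value distinct from everything else would typically test it with is rather than ==.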

Make border of Label, bbox or axes.text flush with spines of Graph in python matplotlib

For a certain manuscript I need to position the label of my graph exactly in the top right or top left corner. The label needs a border with the same thickness as the spines of the graph. Currently I do it like this:

import matplotlib.pyplot as plt
import numpy as np
my_dpi=96
xposr_box=0.975 
ypos_box=0.94
nrows=3
Mytext="label"
GLOBAL_LINEWIDTH=2
fig, axes = plt.subplots(nrows=nrows, sharex=True, sharey=True, figsize=
               (380/my_dpi, 400/my_dpi), dpi=my_dpi)
fig.subplots_adjust(hspace=0.0001)
colors = ('k', 'r', 'b')
for ax, color in zip(axes, colors):
    data = np.random.random(1) * np.random.random(10)
    ax.plot(data, marker='o', linestyle='none', color=color)

for ax in ['top','bottom','left','right']:
    for idata in range(0,nrows):
        axes[idata].spines[ax].set_linewidth(GLOBAL_LINEWIDTH)


axes[0].text(xposr_box, ypos_box , Mytext, color='black',fontsize=8,
             horizontalalignment='right',verticalalignment='top', transform=axes[0].transAxes,
             bbox=dict(facecolor='white', edgecolor='black',linewidth=GLOBAL_LINEWIDTH)) 

plt.savefig("Label_test.png",format='png', dpi=600,transparent=True)

Image1

So I control the position of the box with the parameters:

xposr_box=0.975 
ypos_box=0.94

If I change the width of my plot, the position of my box also changes, but its top and right (or left) border should always lie directly on top of the graph’s spines:

import matplotlib.pyplot as plt
import numpy as np
my_dpi=96
xposr_box=0.975 
ypos_box=0.94
nrows=3
Mytext="label"
GLOBAL_LINEWIDTH=2
fig, axes = plt.subplots(nrows=nrows, sharex=True, sharey=True, figsize=
               (500/my_dpi, 400/my_dpi), dpi=my_dpi)
fig.subplots_adjust(hspace=0.0001)
colors = ('k', 'r', 'b')
for ax, color in zip(axes, colors):
    data = np.random.random(1) * np.random.random(10)
    ax.plot(data, marker='o', linestyle='none', color=color)

for ax in ['top','bottom','left','right']:
    for idata in range(0,nrows):
        axes[idata].spines[ax].set_linewidth(GLOBAL_LINEWIDTH)


axes[0].text(xposr_box, ypos_box , Mytext, color='black',fontsize=8,
             horizontalalignment='right',verticalalignment='top',transform=axes[0].transAxes,
             bbox=dict(facecolor='white', edgecolor='black',linewidth=GLOBAL_LINEWIDTH)) 

plt.savefig("Label_test.png",format='png', dpi=600,transparent=True)

Image2

This should also be the case if the image is narrower, not wider as in this example. I would like to avoid doing this manually. Is there a way to always position it where it should be, independent of the width and height of the plot and the number of stacked graphs?

Solution:

The problem is that the position of a text element is relative to the text’s extent, not to its surrounding box. While it would in principle be possible to calculate the border padding and position the text such that it sits at coordinates (1,1)-borderpadding, this is rather cumbersome since (1,1) is in axes coordinates and borderpadding in figure points.

There is, however, a nice alternative: matplotlib.offsetbox.AnchoredText. This creates a text box which can be placed easily relative to the axes, using location parameters like those of a legend, e.g. loc="upper right". Using zero padding around that text box places it directly on top of the axes spines.

from matplotlib.offsetbox import AnchoredText
txt = AnchoredText("text", loc="upper right", pad=0.4, borderpad=0, )
ax.add_artist(txt)

A complete example:

import matplotlib.pyplot as plt
from matplotlib.offsetbox import AnchoredText
import numpy as np

my_dpi=96
nrows=3
Mytext="label"
plt.rcParams["axes.linewidth"] = 2
plt.rcParams["patch.linewidth"] = 2

fig, axes = plt.subplots(nrows=nrows, sharex=True, sharey=True, figsize=
               (500./my_dpi, 400./my_dpi), dpi=my_dpi)
fig.subplots_adjust(hspace=0.0001)
colors = ('k', 'r', 'b')
for ax, color in zip(axes, colors):
    data = np.random.random(1) * np.random.random(10)
    ax.plot(data, marker='o', linestyle='none', color=color)

txt = AnchoredText(Mytext, loc="upper right", 
                   pad=0.4, borderpad=0, prop={"fontsize":8})
axes[0].add_artist(txt)

plt.show()


Finding the odd number out in an array

I am trying to solve a problem where I’m given an array, such as [0, 0, 1, 1, 2, 2, 6, 6, 9, 10, 10], in which every number appears exactly twice except for one, and I need to return the number that is not duplicated.

I am trying to do it like this:

def findNumber(self, nums):

    if (len(nums) == 1):
        return nums[0]

    nums_copy = nums[:]

    for i in nums:
        nums_copy.remove(i)

        if i not in nums:
            return i
        else:
            nums_copy.remove(i)

However when it reaches the else statement, there is the following error:

ValueError: list.remove(x): x not in list

This is occurring when i is in nums_copy, so I do not understand why this error occurs in this situation?

Solution:

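Each pass of the loop already calls nums_copy.remove(i) once, and the else branch calls it a second time, so both copies of a value are gone after its first occurrence. When the loop reaches the second occurrence of that value, there is nothing left to remove. Note also that i not in nums can never be true, because i is taken from nums.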

You could do:

a = [0, 0, 1, 1, 2, 2, 6, 6, 9, 10, 10]

def get_single_instance(array):
  # count how often each value occurs
  d = {}

  for item in array:
    if item not in d:
      d[item] = 1
    else:
      d[item] += 1

  # return the first value that occurs exactly once
  for k, v in d.items():
    if v == 1:
      return k

print(get_single_instance(a))

Result: 9
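
Two shorter variants of the same idea, as a sketch with hypothetical function names: one counts with collections.Counter, the other XORs all values together. The XOR version assumes the values are integers and that every other value occurs an even number of times, so the pairs cancel out.

```python
from collections import Counter
from functools import reduce
from operator import xor

def single_by_count(nums):
    """Return the first value that occurs exactly once."""
    counts = Counter(nums)
    return next(value for value, count in counts.items() if count == 1)

def single_by_xor(nums):
    """x ^ x == 0, so paired integers cancel and the single one remains."""
    return reduce(xor, nums)

a = [0, 0, 1, 1, 2, 2, 6, 6, 9, 10, 10]
by_count = single_by_count(a)
by_xor = single_by_xor(a)
```

The counting version raises StopIteration if no value occurs exactly once, so callers may want to handle that case.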

Python matplotlib clockwise pie charts

I am playing a bit with Python and its matplotlib library. How can I change the following chart so that the first slice starts from the top and goes to the right (clockwise) instead of going to the left (counterclockwise)?

enter image description here

Code:

import matplotlib.pyplot as plt
import re
import math

# The slices will be ordered and plotted counter-clockwise if startangle=90.
sizes = [175, 50, 25, 50]
total = sum(sizes)
print('TOTAL:')
print(total)
print('')
percentages = list(map(lambda x: str((x/(total * 1.00)) * 100) + '%', sizes))

print('PERCENTAGES:')
print(percentages)
backToFloat = list(map(lambda x: float(re.sub("%$", "", x)), percentages))
print('')

print('PERCENTAGES BACK TO FLOAT:')
print(backToFloat)
print('')

print('SUM OF PERCENTAGES')
print(str(sum(backToFloat)))
print('')
labels = percentages
colors = ['blue', 'red', 'green', 'orange']
patches, texts = plt.pie(sizes, colors=colors, startangle=-270)


plt.legend(patches, labels, loc="best")
# Set aspect ratio to be equal so that pie is drawn as a circle.
plt.axis('equal')
plt.tight_layout()
plt.show()

Solution:

To set the direction in which the slices of the pie chart are drawn, use the counterclock parameter, which can be True or False (True by default). For your case, replace:

patches, texts = plt.pie(sizes, colors=colors, startangle=-270)

with:

patches, texts = plt.pie(sizes, counterclock=False, colors=colors, startangle=-270)

Replacing value in text file column with string

I’m having a pretty simple issue. I have a dataset (small sample shown below)

22 85 203 174 9 0 362 40 0
21 87 186 165 5 0 379 32 0
30 107 405 306 25 0 756 99 0
6 5 19 6 2 0 160 9 0
21 47 168 148 7 0 352 29 0
28 38 161 114 10 3 375 40 0
27 218 1522 1328 114 0 1026 310 0
21 78 156 135 5 0 300 27 0

The first issue I needed to cover was replacing each space with a comma. I did that with the following code:

import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        line = line.split(None,8)
        f.write(','.join(line))

The result was the following

22,85,203,174,9,0,362,40,0
21,87,186,165,5,0,379,32,0
30,107,405,306,25,0,756,99,0
6,5,19,6,2,0,160,9,0
21,47,168,148,7,0,352,29,0
28,38,161,114,10,3,375,40,0
27,218,1522,1328,114,0,1026,310,0
21,78,156,135,5,0,300,27,0

My next step is to grab the values from the last column, check if they are less than 2 and replace it with the string ‘nfp’.

I’m able to separate the last column with the following

for line in open("Data_Sorted.txt"):
    columns = line.split(',')

    print columns[8]

My issue is implementing the conditional to replace the value with the string and then I’m not sure how to put the modified column back into the original dataset.

Solution:

There’s no need to do this in two loops through the file. Also, you can use -1 to index the last element in the line.

import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        # strip newline character and split on whitespace
        line = line.strip().split()

        # check condition for last element (assuming you're using ints)
        if int(line[-1]) < 2:
            line[-1] = 'nfp'

        # write out the line, but you have to add the newline back in
        f.write(','.join(line) + "\n")
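
The per-line logic can also be pulled into a small function, which makes it easy to try on sample lines without touching any files. This is only a sketch: the name convert_line and its threshold and marker parameters are hypothetical, and it additionally skips the replacement for blank lines, which the loop above does not guard against.

```python
def convert_line(line, threshold=2, marker="nfp"):
    """Turn one whitespace-separated line into a comma-separated one,
    replacing the last field with `marker` if it is below `threshold`."""
    fields = line.split()
    # Guard against blank lines, where there is no last field to inspect.
    if fields and int(fields[-1]) < threshold:
        fields[-1] = marker
    return ",".join(fields)

first = convert_line("22 85 203 174 9 0 362 40 0\n")
unchanged = convert_line("28 38 161 114 10 3 375 40 5\n")
blank = convert_line("\n")
```

Inside the file loop, the write would then become f.write(convert_line(line) + "\n").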

Further Reading:

Excel is not opening csv file when index=False option is selected in to_csv command

Hi, I can export the CSV file and open it on Windows if I do:

y.to_csv('sample.csv')

where y is a pandas dataframe.

However, this output file has an index column. I am able to export the output file to csv by doing:

y.to_csv('sample.csv',index=False)

But when I try to open the file, Excel shows an error message:

“The file format and extension of ‘sample.csv’ don’t match. The file could be corrupted or unsafe. Unless you trust it’s source, don’t open it. Do you want to open it anyway?”

Sample of y:

enter image description here

Solution:

Change the name of the ID column. ID is a special marker for Excel: if a text file begins with the uppercase letters ID, Excel tries to interpret it as a SYLK (symbolic link) file rather than as CSV. When you keep the index, the ID column is the second column, so the file does not start with ID and opens fine. When you exclude the index, ID becomes the first cell of the first column, and Excel gets confused. You can either rename the column, keep the index column, or reorder the columns in the data frame so that the ID column doesn’t come first.
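
A minimal sketch of the rename and reorder options follows. The data frame here is invented for illustration (the real one is only shown as an image in the question), and the replacement column name record_id is an arbitrary choice. Calling to_csv without a path returns the CSV text as a string, which makes it easy to see what the file would start with.

```python
import pandas as pd

# Hypothetical stand-in for the data frame y from the question.
y = pd.DataFrame({"ID": [1, 2, 3], "value": [0.5, 0.7, 0.9]})

# Without the index, the CSV text begins with "ID", which is what trips Excel.
problematic = y.to_csv(index=False)

# Option 1: rename the column so the file no longer starts with "ID".
renamed = y.rename(columns={"ID": "record_id"}).to_csv(index=False)

# Option 2: reorder the columns so that ID is not first.
reordered = y[["value", "ID"]].to_csv(index=False)
```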

Shorten a list of integers by sums of contiguous positive or negative numbers

I would like to write a function to process a list of integers. The best way to explain is with an example:

input [0,1,2,3, -1,-2,-3, 0,1,2,3, -1,-2,-3] will return [6,-6,6,-6]

I have a draft here that will actually work:

def group_pos_neg_list(nums):
    p_nums = []

    # to determine if the first element >=0 or <0
    # create pos_combined and neg_combined as a list to check the length in the future
    if nums[0] >= 0:
        pos_combined, neg_combined = [nums[0]], []
    elif nums[0] < 0:
        pos_combined, neg_combined = [], [nums[0]]

    # loop over each element from position 1 to the end
    # accumulate pos num and neg nums and set back to 0 if next element is different
    index = 1
    while index < len(nums):
        if nums[index] >= 0 and nums[index-1] >= 0: # both positive
            pos_combined.append(nums[index])
            index += 1
        elif nums[index] < 0 and nums[index-1] < 0: # both negative
            neg_combined.append(nums[index])
            index += 1
        else:
            if len(pos_combined) > 0:
                p_nums.append(sum(pos_combined))
                pos_combined, neg_combined = [], [nums[index]]
            elif len(neg_combined) > 0:
                p_nums.append(sum(neg_combined))
                pos_combined, neg_combined = [nums[index]], []
            index += 1

    # finish the last combined group
    if len(pos_combined) > 0:
        p_nums.append(sum(pos_combined))
    elif len(neg_combined) > 0:
        p_nums.append(sum(neg_combined))

    return p_nums

But I am not quite happy with it, because it looks a bit complicated.
In particular, there is a repeated piece of code:

if len(pos_combined) > 0:
    p_nums.append(sum(pos_combined))
    pos_combined, neg_combined = [], [nums[index]]
elif len(neg_combined) > 0:
    p_nums.append(sum(neg_combined))
    pos_combined, neg_combined = [nums[index]], []

I have to write this twice as the final group of integers will not be counted in the loop, so an extra step is needed.

Is there any way to simplify this?

Solution:

Using groupby

No need to make it that complex: we can first group the elements by whether they are non-negative (x >= 0, so zeros are grouped with the positive numbers), and then calculate the sum of each group:

from itertools import groupby

[sum(g) for _, g in groupby(data, lambda x: x >= 0)]

This then produces:

>>> from itertools import groupby
>>> data = [0,1,2,3, -1,-2,-3, 0,1,2,3, -1,-2,-3]
>>> [sum(g) for _, g in groupby(data, lambda x: x >= 0)]
[6, -6, 6, -6]

So groupby produces tuples with the “key” (the part we calculate with the lambda), and an iterable of the “burst” (a contiguous subsequence of elements with the same key). We are only interested in the latter, g, and then calculate sum(g) and add that to the list.
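
To make the key/burst pairs visible, the sketch below materializes each group as a list instead of summing it. The name runs is only illustrative.

```python
from itertools import groupby

data = [0, 1, 2, 3, -1, -2, -3, 0, 1, 2, 3, -1, -2, -3]

# Each item is (key, burst): the key is the result of the lambda, and the
# burst is the contiguous run of elements sharing that key.
runs = [(key, list(group)) for key, group in groupby(data, lambda x: x >= 0)]

sums = [sum(burst) for _, burst in runs]
```

Here runs starts with (True, [0, 1, 2, 3]) followed by (False, [-1, -2, -3]), and the pattern then repeats.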

Custom made algorithm

We can also write our own version, by using:

swap_idx = [0]
swap_idx += [i+1 for i, (v1, v2) in enumerate(zip(data, data[1:]))
             if (v1 >= 0) != (v2 >= 0)]
swap_idx.append(None)

our_sums = [sum(data[i:j]) for i, j in zip(swap_idx, swap_idx[1:])]

Here we first construct a list swap_idx that stores the indices of the elements at which the sign changes. So for your sample data that is:

>>> swap_idx
[0, 4, 7, 11, None]

The 0 and None are added by the code explicitly. So now that we identified the points where the sign has changed, we can sum these subsequences together, with sum(data[i:j]). We thus use zip(swap_idx, swap_idx[1:]) to obtain two consecutive indices, and thus we can then sum that slice together.
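
Put together as a function, the index-based approach might look like the sketch below. The function name sum_sign_runs and the early return for empty input are additions for illustration; without that guard, an empty list would produce [0] rather than [].

```python
def sum_sign_runs(data):
    """Sum each contiguous run of non-negative or negative numbers."""
    if not data:
        return []
    # Indices at which a new run starts, bracketed by 0 and None.
    swap_idx = [0]
    swap_idx += [i + 1 for i, (v1, v2) in enumerate(zip(data, data[1:]))
                 if (v1 >= 0) != (v2 >= 0)]
    swap_idx.append(None)
    # Consecutive pairs of indices delimit one run each.
    return [sum(data[i:j]) for i, j in zip(swap_idx, swap_idx[1:])]

data = [0, 1, 2, 3, -1, -2, -3, 0, 1, 2, 3, -1, -2, -3]
result = sum_sign_runs(data)
```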

More verbose version

The above is not very readable: yes it works, but it requires some reasoning. We can also produce a more verbose version, and make it even more generic, for example:

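def groupby_aggregate(iterable, key=lambda x: x, aggregate=list):
    itr = iter(iterable)
    try:
        nx = next(itr)
    except StopIteration:
        # empty input: yield nothing (a bare next() here would surface
        # as a RuntimeError inside a generator on Python 3.7+)
        return
    kxcur = key(nx)
    current = [nx]
    for nx in itr:
        kx = key(nx)
        if kx != kxcur:
            # key changed: emit the finished group and start a new one
            yield aggregate(current)
            current = [nx]
            kxcur = kx
        else:
            current.append(nx)
    # emit the last group
    yield aggregate(current)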

We can then use it like:

list(groupby_aggregate(data, lambda x: x >= 0, sum))

How to iterate through two lists with iter() and yield?

A simple example:

popen = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True, shell=True)
for stdout_line in iter(popen.stdout.readline, ""): # how to add popen.stderr.readline check?
    yield stdout_line

We read from popen.stdout, yet we also want to read from stderr at the same time! We do not know when the process will end.

So how can we iterate through two lists with iter() and yield?

Solution:

These aren’t lists, and the right way to work with them isn’t how you would work with lists. If you want to stuff a process’s stdout and stderr into one combined stream, do that with output redirection:

import subprocess
from subprocess import PIPE, STDOUT

process = subprocess.Popen(cmd, stdout=PIPE, stderr=STDOUT, ...)
#                                                   ^^^^^^
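
A runnable sketch of the redirection approach, wrapped in a generator as in the question. The function name stream_output and the small child command are made up for the demonstration, and the sketch passes the command as a list instead of using shell=True.

```python
import subprocess
import sys

def stream_output(cmd):
    """Yield lines from the child's stdout and stderr as one merged stream."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          universal_newlines=True) as process:
        # readline returns "" only at end of stream, which ends the loop.
        for line in iter(process.stdout.readline, ""):
            yield line

# Demo child process: writes one line to stdout and one to stderr.
cmd = [sys.executable, "-c",
       "import sys; print('out'); sys.stderr.write('err\\n')"]
lines = [line.strip() for line in stream_output(cmd)]
```

Because the two streams are buffered differently by the child, the relative order of the out and err lines in the merged stream is not guaranteed.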