Summing array values by repeating index for an array

I want to sum the values in vals into elements of a smaller array a specified in an index list idx.

import numpy as np

a = np.zeros((1,3))
vals = np.array([1,2,3,4])
idx = np.array([0,1,2,2])

a[0,idx] += vals

This produces the result [[ 1. 2. 4.]], but I want the result [[ 1. 2. 7.]]: both the 3 and the 4 from vals should be added into the element of a at index 2.

I can achieve what I want with:

import numpy as np

a = np.zeros((1,3))
vals = np.array([1,2,3,4])
idx = np.array([0,1,2,2])

for i in np.unique(idx):
    fidx = (idx==i).astype(int)
    psum = (vals * fidx).sum()
    a[0,i] = psum 

print(a)

Is there a way to do this with numpy without using a for loop?

Solution:

This is possible with np.add.at, which performs unbuffered in-place addition, so repeated indices accumulate. The shapes need to align, i.e., a will need to be 1-D here.

a = a.squeeze()
np.add.at(a, idx, vals)

a
array([1., 2., 7.])
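An alternative that avoids mutating a in place is np.bincount, which sums weights grouped by index (assuming, as here, that the indices cover the range 0..len(a)-1):

```python
import numpy as np

vals = np.array([1, 2, 3, 4])
idx = np.array([0, 1, 2, 2])

# Sum each weight into the bin given by its index;
# minlength guarantees the output has (at least) 3 slots.
result = np.bincount(idx, weights=vals, minlength=3)
print(result)  # [1. 2. 7.]
```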

Find maximum value of time in list containing tuples of time in format ('hour', 'min', 'AM/PM')

I have a list of tuples that represent different times

timeList = [('4', '12', 'PM'), ('8', '23', 'PM'), ('4', '03', 'AM'), ('1', '34', 'AM'), 
('12', '48', 'PM'), ('4', '13', 'AM'), ('11', '09', 'AM'), ('3', '12', 'PM'), 
('4', '10', 'PM')]

I want to return the max from the list. After some searching, I realized I could use the key argument of max to compare by the AM/PM field first.

import operator
print(max(timeList, key=operator.itemgetter(2)))

When I run this, however, I'm getting the wrong max: ('4', '12', 'PM')

I thought about it, and not only does that not make sense, given that 8:23 PM should be the max, but I also realized that 12:48 would probably be returned as the max, since it's a PM and 12 is also numerically greater than 8 in my search.

That being said, how might I get max to find the latest possible time, given that the formatting of the list cannot be changed?

Solution:

Just define an appropriate key function. You want int(hour) and int(minute), and 'PM' already sorts lexicographically higher than 'AM', but it should be considered first, so it goes first in the key tuple. You also need to take the hours modulo 12, so that 12 sorts below the other hours within each AM/PM half:

In [39]: timeList = [('4', '12', 'PM'), ('8', '23', 'PM'), ('4', '03', 'AM'), ('1', '34', 'AM'),
    ...: ('12', '48', 'PM'), ('4', '13', 'AM'), ('11', '09', 'AM'), ('3', '12', 'PM'),
    ...: ('4', '10', 'PM')]

In [40]: def key(t):
    ...:     h, m, z = t
    ...:     return z, int(h)%12, int(m)
    ...:

In [41]: max(timeList,key=key)
Out[41]: ('8', '23', 'PM')

But what would make the most sense is to actually use datetime.time objects, instead of pretending a tuple of strings is a good way to store time.

So something like:

In [48]: import datetime

In [49]: def to_time(t):
    ...:     h, m, z = t
    ...:     h, m = int(h)%12, int(m)
    ...:     if z  == "PM":
    ...:         h += 12
    ...:     return datetime.time(h, m)
    ...:

In [50]: real_time_list = list(map(to_time, timeList))

In [51]: real_time_list
Out[51]:
[datetime.time(16, 12),
 datetime.time(20, 23),
 datetime.time(4, 3),
 datetime.time(1, 34),
 datetime.time(12, 48),
 datetime.time(4, 13),
 datetime.time(11, 9),
 datetime.time(15, 12),
 datetime.time(16, 10)]

In [52]: list(map(str, real_time_list))
Out[52]:
['16:12:00',
 '20:23:00',
 '04:03:00',
 '01:34:00',
 '12:48:00',
 '04:13:00',
 '11:09:00',
 '15:12:00',
 '16:10:00']

Note, now max “just works”:

In [54]: t = max(real_time_list)

In [55]: print(t)
20:23:00

And if you need a pretty string to print, just do the formatting at that point:

In [56]: print(t.strftime("%I:%M %p"))
08:23 PM

Converting a list of tuples to an array or other structure that allows easy slicing

Using list comprehension I have created a list of tuples which looks like

temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

I could also create a list of lists if that works easier.

Either way, I would now like to get an array, or a 2D list, from the data: something where I can easily access the first element of each tuple using slicing, like

first_elements = temp[:,0]

Solution:

Use numpy for that type of indexing:

import numpy as np
temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

a = np.array(temp)
a[:, 0]

returns

array([1, 1, 1, 1])

Note: all of your inner tuples must be the same length for this to work. Otherwise the array constructor falls back to a one-dimensional object array of Python sequences (and recent NumPy versions raise an error instead).
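If you only need column access and would rather avoid the NumPy dependency, plain Python can do the same thing:

```python
temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

# column 0 via a comprehension
first_elements = [t[0] for t in temp]
print(first_elements)  # [1, 1, 1, 1]

# zip(*temp) transposes the rows into columns
columns = list(zip(*temp))
print(columns[0])  # (1, 1, 1, 1)
```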

Copying sublist structure to another list of equal length in Python

Say I have one list,

list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']

And another,

list2 = [[True, True], [False], [True], [False]]

(You can imagine that the second list was created with an itertools.groupby on an animal being a house pet.)

Now say I want to give the first list the same sublist structure as the second.

list3 = [['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]

I might do something like this:

list3 = []
lengths = [len(x) for x in list2]
count = 0
for leng in lengths:
    templist = []
    for i in range(leng):   
        templist.append(list1[count])
        count += 1
    list3.append(templist)

Which I haven’t totally debugged but I think should work. My question is if there is a more pythonic or simple way to do this? This seems kind of convoluted and I have to imagine there is a more graceful way to do it (maybe in itertools which never ceases to impress me).

Solution:

If you’re okay with modifying the original list (if not, you can always copy first), you can use pop() in a list comprehension:

list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']
list2 = [[True, True], [False], [True], [False]]

list3 = [[list1.pop(0) for j in range(len(x))] for x in list2]
print(list3)
#[['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]
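If you'd rather leave list1 untouched, an iterator achieves the same thing without pop(0)'s left-shift of the whole list on every call:

```python
list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']
list2 = [[True, True], [False], [True], [False]]

it = iter(list1)  # consume list1 left to right without mutating it
list3 = [[next(it) for _ in sub] for sub in list2]
print(list3)  # [['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]
```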

Paradoxical behaviour of math.nan when combined with the 'in' operator

I have the following lines of code:

import math as mt

...
...
...

        if mt.isnan (coord0):
            print (111111, coord0, type (coord0), coord0 in (None, mt.nan))
            print (222222, mt.nan, type (mt.nan), mt.nan in (None, mt.nan))

It prints:

111111 nan <class 'float'> False
222222 nan <class 'float'> True

I am baffled…
Any explanation?

Python 3.6.0, Windows 10

I have a rock solid confidence in the quality of the Python interpreter…
And I know, whenever it seems the computer makes a mistake, it’s actually me being mistaken…
So what am I missing?

[EDIT]

(In reaction to @COLDSPEED)

Indeed the ID’s are different:

print (111111, coord0, type (coord0), id (coord0), coord0 in (None, mt.nan))
print (222222, mt.nan, type (mt.nan), id (mt.nan), mt.nan in (None, mt.nan))

Prints:

111111 nan <class 'float'> 2149940586968 False
222222 nan <class 'float'> 2151724423496 True

Maybe there’s a good reason why nan isn’t a true singleton, but I don’t get it yet. This behavior is rather error-prone in my view.

[EDIT2]

(In reaction to @Sven Marnach)

Carefully reading the answer of @Sven Marnach makes it understandable to me. It is indeed a compromise of the kind one encounters when designing things.

Still the ice is thin:

Having a in (b,) return True if id (a) == id (b) seems to be at odds with the IEEE-754 standard that nan should be unequal to nan.

The conclusion would have to be that while a is in an aggregate, at the same time it isn’t, because the thing in the aggregate, namely b has to be considered unequal to a by IEEE standards.

Think I’ll use isnan from now on…

Solution:

The behaviour you see is an artefact of an optimization for the in operator in Python and the fact that nan compares unequal to itself, as required by the IEEE-754 standard.

The in operator in Python returns whether any element of the iterable is equal to the element you are looking for. The expression x in it essentially evaluates to any(x == y for y in it), except that an additional optimization is applied by CPython: to avoid having to call __eq__ on each element, the interpreter first checks whether x and y are the same object, in which case it immediately returns True.

This optimization is fine for almost all objects. After all, it’s one of the basic properties of equality that every object compares equal to itself. However, the IEEE-754 standard for floating point numbers requires that nan != nan, so NaN breaks this assumption. This results in the odd behaviour you see: if one nan happens to be the same object as a nan in the iterable, the above-mentioned optimization results in the in operator returning True. However, if the nan in the iterable isn’t the same object, Python falls back to __eq__(), and you get False.
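A minimal reproduction of both behaviours:

```python
import math

nan = float("nan")
print(nan == nan)                    # False: IEEE-754 requires NaN != NaN
print(nan in (None, nan))            # True: the identity shortcut fires
print(float("nan") in (None, nan))   # False: a different NaN object falls back to ==
print(math.isnan(nan))               # True: the reliable test
```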

python "binom" with fewer dependencies?

My Python 3.6 script works great with

from scipy.special import binom

Running the code in AWS lambda, however, does not work. Attempting to load the zipped deployment package from S3 gives the error:

Unzipped size must be smaller than 262144000 bytes

Surely somewhere there is a Python package which can do what “binom” does without needing all of “scipy” which seems to require “numpy” ?

Solution:

The binomial coefficient is a well-known and fairly trivial calculation; binom(n, k) is just n! / (k! * (n - k)!). Python’s built-ins can do this in a perfectly straightforward way (theoretically sub-optimal, since the full factorials produce excessively large intermediate values that more tuned approaches avoid, but that hardly matters most of the time):

from math import factorial

def binom(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

If you need it somewhat faster, gmpy2 offers a bincoef function, and gmpy2 is a bit more standalone than scipy/numpy (it needs GMP/MPFR/MPC, but it’s on the order of a few MB of binaries all told, not a few hundred MB). It returns a gmpy2.mpz type, which is largely interoperable with int, or you can just force conversion back to int by wrapping:

from gmpy2 import bincoef

def binom(n, k):
    return int(bincoef(n, k))
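For what it’s worth, if a newer Python runtime is an option: Python 3.8 added math.comb to the standard library, which computes the binomial coefficient directly with no third-party dependency at all:

```python
from math import comb

print(comb(5, 2))   # 10
print(comb(10, 3))  # 120
```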

">" not working to direct output of python command to file

I have decided to try snakefood to help with a refactoring, to check the imports. It keeps dumping output on the screen, and “>” does not send it to a file; it just creates an empty file.

I unfortunately had to create a virtualenv with Python 2.7, as it probably does not work properly under Python 3. Still, even though it is written in Python 2, it can presumably still check a Python 2 project. I am using a Mac, but its command line behaves much like Linux.

I did

pip install six
pip install graphviz
pip install snakefood

once the Python 2 environment was activated.

Then if I type

$ sfood-checker path/to/folder

…it dumps a huge amount of text on the screen, but

$ sfood-checker path/to/folder > check.txt

…only creates an empty file. Any idea what is wrong and how to fix it? I would like to go through the file carefully in Sublime.

Solution:

You are redirecting stdout, but your program is writing to stderr. The fix is to redirect stderr:

$ sfood-checker path/to/folder 2> check.txt

Or redirect both stdout and stderr (note that &> is a bash extension; the portable form is > check.txt 2>&1):

$ sfood-checker path/to/folder &> check.txt

Background: when processes are created, they generally have three file descriptors already opened for them:

  • 0, stdin, “Standard Input”, a read-only stream
  • 1, stdout, “Standard Output”, a write-only stream
  • 2, stderr, “Standard Error”, a write-only stream

There is precisely zero mechanical difference between stdout and stderr, other than convention and the file descriptor number. By convention, status messages and other “informational” content are written to stderr (some variant of fprintf(stderr, ...)), and the data required for the normal operation of the program is written to stdout.
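The convention is easy to demonstrate from Python itself; run this as a script with > out.txt and only the first line lands in the file, while the second still appears on screen:

```python
import sys

# stdout carries the program's real output; stderr carries diagnostics.
# Redirecting with ">" captures only stdout (fd 1), not stderr (fd 2).
print("result data", file=sys.stdout)
print("progress: 50% done", file=sys.stderr)
```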

Convert indexes in str to indexes in bytearray

I have some text, process it and find offset for some words in text. These offsets will be used by another application and that application operates with text as with sequence of bytes, so str indexes will be wrong for it.

Example:

>>> text = "“Hello there!” He said"
>>> text[7:12]
'there'
>>> text.encode('utf-8')[7:12]
b'o the'

So how can I convert indexes in string to indexes in encoded bytearray?

Solution:

Encode the substrings and get their lengths in bytes:

text = "“Hello there!” He said"
start = len(text[:7].encode('utf-8'))
count = len(text[7:12].encode('utf-8'))
text.encode('utf-8')[start:start+count]

This gives b'there'.
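This generalizes into a small helper (char_to_byte_index is a hypothetical name) that converts any character index into its byte offset for a given encoding:

```python
def char_to_byte_index(text, i, encoding="utf-8"):
    """Byte offset in text.encode(encoding) of the character at index i."""
    return len(text[:i].encode(encoding))

text = "\u201cHello there!\u201d He said"   # the curly quotes are 3 bytes each in UTF-8
start = char_to_byte_index(text, 7)
end = char_to_byte_index(text, 12)
print(text.encode("utf-8")[start:end])  # b'there'
```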

Concatenate strings and integers in a list based on conditions

I’m working with a list that contains both strings and integers, and I want a function that transforms each element based on its type: if the element is an integer, add 100 to it; if it is a string, append “ is the name”. I tried a list comprehension but couldn’t figure out how to handle strings and integers both being present in the list (so I’m not sure whether that’s possible here). Here’s a basic example of what I’m working with:

sample_list = ['buford', 1, 'henley', 2, 'emi', 3]

the output would look like this:

sample_list = ['buford is the name', 101, 'henley is the name', 102, 'emi is the name', 103]

I tried using something like this:

def concat_func():
    sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
    [element + 100 for element in sample_list if type(element) == int]

I also tried using basic for loops and wasn’t sure if this was the right way to go about it instead:

def concat_func():
    sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
    for element in sample_list:
        if type(element) == str:
            element + " is the name"
        elif type(element) == int:
            element + 100
    return sample_list

Solution:

You were close, but the real problem is that your loop computes element + 100 and element + " is the name" and never stores the results: strings and integers are immutable, so those expressions are simply discarded and you return the unchanged sample_list. Build a new list instead. For the type check, prefer type(s) is int over equality with ==, or use isinstance(), as pointed out in the comments, which also covers subclasses of str/int.

sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
newlist = []

for s in sample_list:
    if type(s) is int:
        newlist.append(s + 100)
    elif type(s) is str:
        newlist.append(s + ' is the name')
    else:
        newlist.append(s)

newlist2 = []

for s in sample_list:
    if isinstance(s, int):
        newlist2.append(s + 100)
    elif isinstance(s, str):
        newlist2.append(s + ' is the name')
    else:
        newlist2.append(s)

print(newlist)
print(newlist2)
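The same logic also fits in a single list comprehension using a conditional expression (this assumes the list only ever holds strs and ints; any other type would raise a TypeError):

```python
sample_list = ['buford', 1, 'henley', 2, 'emi', 3]

result = [x + 100 if isinstance(x, int) else x + ' is the name'
          for x in sample_list]
print(result)
# ['buford is the name', 101, 'henley is the name', 102, 'emi is the name', 103]
```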

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I’d like to transform:

   id  values  days  time  value_per_day
0   1      15    15     1         1
1   1      20     5     2         4
2   1      12    12     3         1

I’d like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, the excess should spill into the next bucket, making the value/day of the 2nd bucket an average of the 1st and 2nd rows.

Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5*1 + 5*4 = 25 (the leftover 5 days of the 1st row plus the 5 days of the 2nd):

   id  values  days  value_per_day
0   1      10    10         1.0
1   1      25    10         2.5
2   1      10    10         1.0
3   1       2     2         1.0

I’ve tried pd.Grouper:

df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})

Out[146]:
            values
days    id        
5 days  1       16
15 days 1       10

But I’m clearly using it incorrectly.

csv for convenience:

id,values,days,time  
1,15,15,1  
1,20,5,2  
1,12,12,3  

Solution:

Note: this is a time- and memory-costly solution, since it expands every row into one row per day.

newdf = df.reindex(df.index.repeat(df.days))
v = np.arange(sum(df.days)) // 10
dd = pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                   'days': np.bincount(v)})
dd
Out[102]: 
   days  value_per_day
0    10            1.0
1    10            2.5
2    10            1.0
3     2            1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]: 
   days  value_per_day  value
0    10            1.0   10.0
1    10            2.5   25.0
2    10            1.0   10.0
3     2            1.0    2.0

I did not include the groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the above steps within the loop.
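That per-id loop can be sketched like this (a hypothetical arrangement; it assumes df already carries a value_per_day column, as in the question's table):

```python
import numpy as np
import pandas as pd

# Rebuild the question's data, including the derived value_per_day column.
df = pd.DataFrame({'id': [1, 1, 1],
                   'values': [15, 20, 12],
                   'days': [15, 5, 12],
                   'time': [1, 2, 3]})
df['value_per_day'] = df['values'] / df['days']

pieces = []
for gid, g in df.groupby('id'):
    # Expand every row into one row per day, then bucket by runs of 10.
    expanded = g.reindex(g.index.repeat(g.days))
    buckets = np.arange(len(expanded)) // 10
    agg = pd.DataFrame({'id': gid,
                        'value_per_day': expanded.groupby(buckets).value_per_day.mean(),
                        'days': np.bincount(buckets)})
    pieces.append(agg)

result = pd.concat(pieces, ignore_index=True)
result['values'] = result['days'] * result['value_per_day']
print(result)
```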