Summing array values into a smaller array by repeated index

I want to sum the values in vals into the elements of a smaller array a, as specified by an index list idx.

import numpy as np

a = np.zeros((1,3))
vals = np.array([1,2,3,4])
idx = np.array([0,1,2,2])

a[0,idx] += vals

This produces the result [[ 1. 2. 4.]], but I want [[ 1. 2. 7.]]: both the 3 and the 4 from vals should be added into the element of a at index 2.

I can achieve what I want with:

import numpy as np

a = np.zeros((1,3))
vals = np.array([1,2,3,4])
idx = np.array([0,1,2,2])

for i in np.unique(idx):
    fidx = (idx==i).astype(int)
    psum = (vals * fidx).sum()
    a[0,i] = psum 

print(a)

Is there a way to do this with numpy without using a for loop?

Solution:

This is possible with np.add.at, which performs unbuffered in-place addition, so repeated indices accumulate. The shapes need to align, i.e., a will need to be 1-D here.

a = a.squeeze()
np.add.at(a, idx, vals)

a
array([1., 2., 7.])
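If you only need the per-index sums (rather than accumulating into an existing array), np.bincount with a weights argument computes the same result in one call; a small sketch:

```python
import numpy as np

vals = np.array([1, 2, 3, 4])
idx = np.array([0, 1, 2, 2])

# bincount sums the weight of every occurrence of each index value;
# minlength pads the result out to the full output length (3 here)
a = np.bincount(idx, weights=vals, minlength=3)
print(a)  # [1. 2. 7.]
```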

Find maximum value of time in list containing tuples of time in format ('hour', 'min', 'AM/PM')

I have a list of tuples that represent different times:

timeList = [('4', '12', 'PM'), ('8', '23', 'PM'), ('4', '03', 'AM'), ('1', '34', 'AM'), 
('12', '48', 'PM'), ('4', '13', 'AM'), ('11', '09', 'AM'), ('3', '12', 'PM'), 
('4', '10', 'PM')]

I want to return the max from the list. After some searching, I realized I could use the key argument of max to compare by the AM/PM field first:

import operator
print(max(timeList, key=operator.itemgetter(2)))

When I run this however, I’m getting the wrong max ('4', '12', 'PM')

I thought about it, and not only does it not make sense, given that 8:23 should be max, but I also realized that 12:48 would probably return max since it’s a PM and also technically greater than 8 in my search.

That being said, how might I get max to find the latest possible time, given that the formatting of the list cannot be changed?

Solution:

Just define an appropriate key function. The hour and minute should be compared numerically, as int(hour) and int(minute); 'PM' already sorts lexicographically higher than 'AM', but it has to be considered first, so it goes at the front of the key tuple. You also need to take the hour modulo 12, so that 12 sorts below the other hours within each AM/PM half:

In [39]: timeList = [('4', '12', 'PM'), ('8', '23', 'PM'), ('4', '03', 'AM'), ('1', '34', 'AM'),
    ...: ('12', '48', 'PM'), ('4', '13', 'AM'), ('11', '09', 'AM'), ('3', '12', 'PM'),
    ...: ('4', '10', 'PM')]

In [40]: def key(t):
    ...:     h, m, z = t
    ...:     return z, int(h) % 12, int(m)
    ...:

In [41]: max(timeList,key=key)
Out[41]: ('8', '23', 'PM')

But what would make the most sense is to actually use datetime.time objects, instead of pretending a tuple of strings is a good way to store time.

So something like:

In [48]: import datetime

In [49]: def to_time(t):
    ...:     h, m, z = t
    ...:     h, m = int(h)%12, int(m)
    ...:     if z  == "PM":
    ...:         h += 12
    ...:     return datetime.time(h, m)
    ...:

In [50]: real_time_list = list(map(to_time, timeList))

In [51]: real_time_list
Out[51]:
[datetime.time(16, 12),
 datetime.time(20, 23),
 datetime.time(4, 3),
 datetime.time(1, 34),
 datetime.time(12, 48),
 datetime.time(4, 13),
 datetime.time(11, 9),
 datetime.time(15, 12),
 datetime.time(16, 10)]

In [52]: list(map(str, real_time_list))
Out[52]:
['16:12:00',
 '20:23:00',
 '04:03:00',
 '01:34:00',
 '12:48:00',
 '04:13:00',
 '11:09:00',
 '15:12:00',
 '16:10:00']

Note, now max “just works”:

In [54]: t = max(real_time_list)

In [55]: print(t)
20:23:00

And if you need a pretty string to print, just do the formatting at that point:

In [56]: print(t.strftime("%I:%M %p"))
08:23 PM

Converting a list of tuples to an array or other structure that allows easy slicing

Using list comprehension I have created a list of tuples which looks like

temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

I could also create a list of lists if that works easier.

Either way, I would now like to get an array, or a 2-D list, from the data: something where I can easily access the first element of each tuple using slicing, like

first_elements = temp[:,0]

Solution:

Use numpy for that type of indexing:

import numpy as np
temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

a = np.array(temp)
a[:, 0]

returns

array([1, 1, 1, 1])

Note: all of your inner tuples must be the same length for this to work. Otherwise the array constructor falls back to an object array of Python tuples (or raises an error in newer NumPy versions), and the 2-D indexing above won't work.
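If pulling in NumPy just for this feels heavy, plain Python covers the common cases: a list comprehension extracts one column, and zip(*temp) transposes all of the rows into columns at once:

```python
temp = [(1, 0, 1, 0, 2), (1, 0, 1, 0, 5), (1, 0, 2, 0, 2), (1, 0, 2, 0, 5)]

# one column via a comprehension
first_elements = [t[0] for t in temp]
print(first_elements)  # [1, 1, 1, 1]

# all columns at once: zip(*temp) transposes rows into columns
columns = list(zip(*temp))
print(columns[0])  # (1, 1, 1, 1)
```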

Copying sublist structure to another list of equal length in Python

Say I have one list,

list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']

And another,

list2 = [[True, True], [False], [True], [False]]

(You can imagine that the second list was created with an itertools.groupby on an animal being a house pet.)

Now say I want to give the first list the same sublist structure as the second.

list3 = [['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]

I might do something like this:

list3 = []
lengths = [len(x) for x in list2]
count = 0
for leng in lengths:
    templist = []
    for i in range(leng):   
        templist.append(list1[count])
        count += 1
    list3.append(templist)

Which I haven’t totally debugged but I think should work. My question is if there is a more pythonic or simple way to do this? This seems kind of convoluted and I have to imagine there is a more graceful way to do it (maybe in itertools which never ceases to impress me).

Solution:

If you’re okay with modifying the original list (if not, you can always copy first), you can use pop() in a list comprehension:

list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']
list2 = [[True, True], [False], [True], [False]]

list3 = [[list1.pop(0) for j in range(len(x))] for x in list2]
print(list3)
#[['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]
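Since the question mentions itertools: feeding a single iterator through itertools.islice gives the same result without mutating list1 (and without the repeated O(n) cost of pop(0)). A sketch:

```python
from itertools import islice

list1 = ['Dog', 'Cat', 'Monkey', 'Parakeet', 'Zebra']
list2 = [[True, True], [False], [True], [False]]

it = iter(list1)  # one iterator, consumed left to right across sublists
list3 = [list(islice(it, len(x))) for x in list2]
print(list3)  # [['Dog', 'Cat'], ['Monkey'], ['Parakeet'], ['Zebra']]
```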

Paradoxical behaviour of math.nan when combined with the 'in' operator

I have the following lines of code:

import math as mt

...
...
...

        if mt.isnan (coord0):
            print (111111, coord0, type (coord0), coord0 in (None, mt.nan))
            print (222222, mt.nan, type (mt.nan), mt.nan in (None, mt.nan))

It prints:

111111 nan <class 'float'> False
222222 nan <class 'float'> True

I am baffled…
Any explanation?

Python 3.6.0, Windows 10

I have a rock solid confidence in the quality of the Python interpreter…
And I know, whenever it seems the computer makes a mistake, it’s actually me being mistaken…
So what am I missing?

[EDIT]

(In reaction to @COLDSPEED)

Indeed the ID’s are different:

print (111111, coord0, type (coord0), id (coord0), coord0 in (None, mt.nan))
print (222222, mt.nan, type (mt.nan), id (mt.nan), mt.nan in (None, mt.nan))

Prints:

111111 nan <class 'float'> 2149940586968 False
222222 nan <class 'float'> 2151724423496 True

Maybe there's a good reason why nan isn't a true singleton. But I do not yet get it. This behavior is rather error-prone in my view.

[EDIT2]

(In reaction to @Sven Marnach)

Carefully reading the answer of @Sven Marnach makes it understandable to me. It is indeed a compromise of the kind one encounters when designing things.

Still the ice is thin:

Having a in (b,) return True when id(a) == id(b) seems to be at odds with the IEEE-754 standard, which says nan should compare unequal to nan.

The conclusion would have to be that while a is in an aggregate, at the same time it isn’t, because the thing in the aggregate, namely b has to be considered unequal to a by IEEE standards.

Think I’ll use isnan from now on…

Solution:

The behaviour you see is an artefact of an optimization for the in operator in Python and the fact that nan compares unequal to itself, as required by the IEEE-754 standard.

The in operator in Python returns whether any element of the container is equal to the element you are looking for. The expression x in it essentially evaluates to any(x == y for y in it), except that CPython applies an additional optimization: to avoid having to call __eq__ on each element, the interpreter first checks whether x and y are the same object, in which case it immediately returns True.

This optimization is fine for almost all objects. After all, it's a basic property of equality that every object compares equal to itself. However, the IEEE-754 standard for floating-point numbers requires that nan != nan, so NaN breaks this assumption. This explains the odd behaviour you see: if one nan happens to be the same object as a nan in the container, the above-mentioned optimization makes the in operator return True. However, if the nan in the container isn't the same object, Python falls back to __eq__(), and you get False.
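The effect is easy to reproduce; the sketch below contrasts the identity shortcut with a fresh NaN object, and shows math.isnan as the reliable test:

```python
import math

x = float('nan')

# identity shortcut: x is the very object inside the tuple, so `in` is True
print(x in (None, x))             # True

# a different NaN object: identity fails, and x != x per IEEE-754
print(float('nan') in (None, x))  # False

# the robust check
print(math.isnan(x))              # True
```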

Convert indexes in str to indexes in bytearray

I have some text, process it and find offset for some words in text. These offsets will be used by another application and that application operates with text as with sequence of bytes, so str indexes will be wrong for it.

Example:

>>> text = "“Hello there!” He said"
>>> text[7:12]
'there'
>>> text.encode('utf-8')[7:12]
b'o the'

So how can I convert indexes in string to indexes in encoded bytearray?

Solution:

Encode the substrings and get their lengths in bytes:

text = "“Hello there!” He said"
start = len(text[:7].encode('utf-8'))
count = len(text[7:12].encode('utf-8'))
text.encode('utf-8')[start:start+count]

This gives b'there'.
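Wrapped up as a helper (the function name here is my own), this maps any str slice to the matching byte offsets:

```python
def byte_span(text, start, end, encoding='utf-8'):
    """Map the str slice text[start:end] to offsets in text.encode(encoding)."""
    b_start = len(text[:start].encode(encoding))
    b_end = b_start + len(text[start:end].encode(encoding))
    return b_start, b_end

text = "“Hello there!” He said"
s, e = byte_span(text, 7, 12)
print(text.encode('utf-8')[s:e])  # b'there'
```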

Concatenate strings and integers in a list based on conditions

I’m working with a list that contains both strings and integers, and I want to create a function that concatenates new elements onto these strings and integers based on different conditions. For instance, if an element is an integer I want to add 100 to it; if an element is a string I want to append “ is the name”. I tried a list comprehension but couldn’t figure out how to handle strings and integers both being present in the list (so I’m not sure one is possible here). Here’s a basic example of what I’m working with:

sample_list = ['buford', 1, 'henley', 2, 'emi', 3]

the output would look like this:

sample_list = ['buford is the name', 101, 'henley is the name', 102, 'emi is the name', 103]

I tried using something like this:

def concat_func():
    sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
    [element + 100 for element in sample_list if type(element) == int]

I also tried using basic for loops and wasn’t sure if this was the right way to go about it instead:

def concat_func():
    sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
    for element in sample_list:
        if type(element) == str:
            element + " is the name"
        elif type(element) == int:
            element + 100
    return sample_list

Solution:

You were close. Instead of checking the result of type() for equality with ==, compare it with is. You can also use isinstance(), as pointed out in the comments, which additionally covers subclasses of str/int.

sample_list = ['buford', 1, 'henley', 2, 'emi', 3]
newlist = []

for s in sample_list:
    if type(s) is int:
        newlist.append(s + 100)
    elif type(s) is str:
        newlist.append(s + ' is the name')
    else:
        newlist.append(s)

newlist2 = []

for s in sample_list:
    if isinstance(s, int):
        newlist2.append(s + 100)
    elif isinstance(s, str):
        newlist2.append(s + ' is the name')
    else:
        newlist2.append(s)

print(newlist)
print(newlist2)
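If you do want the list comprehension from your first attempt, a conditional expression handles both cases in a single pass (a sketch):

```python
sample_list = ['buford', 1, 'henley', 2, 'emi', 3]

# each element goes through a chained conditional expression
newlist = [s + 100 if isinstance(s, int)
           else s + ' is the name' if isinstance(s, str)
           else s
           for s in sample_list]
print(newlist)
# ['buford is the name', 101, 'henley is the name', 102, 'emi is the name', 103]
```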

How do I aggregate rows with an upper bound on column value?

I have a pd.DataFrame I’d like to transform:

   id  values  days  time  value_per_day
0   1      15    15     1         1
1   1      20     5     2         4
2   1      12    12     3         1

I’d like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, this should spill into the next row, having the value/day of the 2nd row an average of the 1st and the 2nd.

Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5*1 + 20 = 25 (the 5 spilled days at 1 per day, plus the 20 from the second row):

   id  values  days  value_per_day
0   1      10    10         1.0
1   1      25    10         2.5
2   1      10    10         1.0
3   1       2     2         1.0

I’ve tried pd.Grouper:

df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})

Out[146]:
            values
days    id        
5 days  1       16
15 days 1       10

But I’m clearly using it incorrectly.

csv for convenience:

id,values,days,time  
1,15,15,1  
1,20,5,2  
1,12,12,3  

Solution:

Notice: this solution can be costly in time and memory, since it repeats every row once per day before regrouping.

newdf=df.reindex(df.index.repeat(df.days))
v=np.arange(sum(df.days))//10
dd=pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),'days':np.bincount(v)})
dd
Out[102]: 
   days  value_per_day
0    10            1.0
1    10            2.5
2    10            1.0
3     2            1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]: 
   days  value_per_day  value
0    10            1.0   10.0
1    10            2.5   25.0
2    10            1.0   10.0
3     2            1.0    2.0

I did not include a groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the steps above within the loop.
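For completeness, a hedged sketch of what that per-id loop might look like, wrapping the steps above in a function (rebucket is my own name) and concatenating the per-group results:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1], 'values': [15, 20, 12],
                   'days': [15, 5, 12], 'time': [1, 2, 3]})
df['value_per_day'] = df['values'] / df['days']

def rebucket(g, size=10):
    # repeat each row once per day, then cut the day stream into buckets of `size`
    newdf = g.reindex(g.index.repeat(g.days))
    v = np.arange(len(newdf)) // size
    return pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),
                         'days': np.bincount(v)})

result = pd.concat([rebucket(g) for _, g in df.groupby('id')], ignore_index=True)
print(result)
```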

On what parameter does python differentiate between a formatted string and a normal string?

n = 3
x = f"There are {n} types of people"

print(type(x)==type("HELLO")) #returns True

If the formatted string and a normal string are of same type. How does a function differentiate when to format it or when not to?

My guess is that whenever I specify f before a string, the interpreter picks up the values of the variables and formats the string then and there, and the function receives an already-formatted string.

Is it a shorthand notation just like lambdas in Java 8?

Solution:

In your example:

x = f"There are {n} types of people"

x is never an f-string, it is simply a regular string, already having had the {n} replaced by the value of the variable n.

An f-string is a syntactic construct: it is evaluated at runtime, at the point where the literal appears, and the resulting object is an ordinary str.
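A quick way to see this: rebind n after the literal runs. x does not change, because the substitution happened when the f-string expression was evaluated:

```python
n = 3
x = f"There are {n} types of people"
n = 99  # rebinding n afterwards has no effect on x

print(x)               # There are 3 types of people
print(type(x) is str)  # True
```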

Numpy: Fastest way to insert value into array such that array's in order

Suppose I have an array my_array and a singular value my_val. (Note that my_array is always sorted).

my_array = np.array([1, 2, 3, 4, 5])
my_val = 1.5

Because my_val is 1.5, I want to put it in between 1 and 2, giving me the array [1, 1.5, 2, 3, 4, 5].

My question is: What’s the fastest way (i.e. in microseconds) of producing the ordered output array as my_array grows arbitrarily large?

The original way I thought of was concatenating the value to the original array and then sorting:

arr_out = np.sort(np.concatenate((my_array, np.array([my_val]))))
# [ 1.   1.5  2.   3.   4.   5. ]

I know that np.concatenate is fast but I’m unsure how np.sort would scale as my_array grows, even given that my_array will always be sorted.

Edit:

I’ve compiled the times for the various methods listed at the time an answer was accepted:

Input:

import timeit

timeit_setup = 'import numpy as np\n' \
               'my_array = np.array([i for i in range(1000)], dtype=np.float64)\n' \
               'my_val = 1.5'
num_trials = 1000

my_time = timeit.timeit(
    'np.sort(np.concatenate((my_array, np.array([my_val]))))',
    setup=timeit_setup, number=num_trials
)

pauls_time = timeit.timeit(
    'idx = my_array.searchsorted(my_val)\n'
    'np.concatenate((my_array[:idx], [my_val], my_array[idx:]))',
    setup=timeit_setup, number=num_trials
)

sanchit_time = timeit.timeit(
    'np.insert(my_array, my_array.searchsorted(my_val), my_val)',
    setup=timeit_setup, number=num_trials
)

print('Times for 1000 repetitions for array of length 1000:')
print("My method took {}s".format(my_time))
print("Paul Panzer's method took {}s".format(pauls_time))
print("Sanchit Anand's method took {}s".format(sanchit_time))

Output:

Times for 1000 repetitions for array of length 1000:
My method took 0.017865657746239747s
Paul Panzer's method took 0.005813951002013821s
Sanchit Anand's method took 0.014003945532323987s

And the same for 100 repetitions for an array of length 1,000,000:

Times for 100 repetitions for array of length 1000000:
My method took 3.1770704101754195s
Paul Panzer's method took 0.3931240139911161s
Sanchit Anand's method took 0.40981490723551417s

Solution:

Use np.searchsorted to find the insertion point in logarithmic time:

>>> idx = my_array.searchsorted(my_val)
>>> np.concatenate((my_array[:idx], [my_val], my_array[idx:]))
array([1. , 1.5, 2. , 3. , 4. , 5. ])

Note 1: I recommend looking at @Willem Van Onselm’s and @hpaulj’s insightful comments.

Note 2: Using np.insert as suggested by @Sanchit Anand may be slightly more convenient if all datatypes are matching from the beginning. It is, however, worth mentioning that this convenience comes at the cost of significant overhead:

>>> def f_pp(my_array, my_val):
...      idx = my_array.searchsorted(my_val)
...      return np.concatenate((my_array[:idx], [my_val], my_array[idx:]))
... 
>>> def f_sa(my_array, my_val):
...      return np.insert(my_array, my_array.searchsorted(my_val), my_val)
...
>>> my_farray = my_array.astype(float)
>>> from timeit import repeat
>>> kwds = dict(globals=globals(), number=100000)
>>> repeat('f_sa(my_farray, my_val)', **kwds)
[1.2453778409981169, 1.2268288589984877, 1.2298014000116382]
>>> repeat('f_pp(my_array, my_val)', **kwds)
[0.2728819379990455, 0.2697303680033656, 0.2688361559994519]
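One more hedged note: if you are inserting many values one at a time, a plain Python list kept sorted with bisect.insort can be competitive, since every np.concatenate call copies the whole array, while insort only shifts elements within one list:

```python
import bisect

my_list = [1, 2, 3, 4, 5]
# O(log n) search plus an O(n) element shift, but no full reallocation
bisect.insort(my_list, 1.5)
print(my_list)  # [1, 1.5, 2, 3, 4, 5]
```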