Support arithmetic with row-series #2085
Sometimes a Series is just one row and is intended to be broadcast
across a DataFrame rather than applied elementwise.
This creates a convention that a Series with divisions (0, 1) signals
that it is just a single row and is thus appropriate for broadcasting.
This enables computations like the following (which used to raise an error):
df = df - df.mean()
However, it also introduces silent failures if an elementwise operation
against this series is intended, despite its divisions of (0, 1).
Fixes dask#1759
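For reference, here is a minimal sketch of the behavior this PR aims to support, shown in plain pandas rather than dask. The frame and its values are made up for illustration. It shows that the result of `df.mean()` is a Series indexed by the column labels, and that subtracting it is broadcast across every row.

```python
import pandas as pd

# Illustrative data only; not taken from the PR.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# The reduction yields a Series whose index is the frame's column labels.
row = df.mean()

# pandas aligns the Series index with the columns and subtracts
# the same "row" from every row of the frame.
centered = df - row

print(list(row.index))
print(centered)
```
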
It's not really "1 row", rather it's a series with the index equal to the columns in the original dataframe. In this case, I'd rather check if the divisions are equal to the (first, last) elements in columns of the frame it's broadcasting against, and set them accordingly in the same place you have here. This also keeps the meaning of divisions consistent.
Good point. I think that I have resolved this in a recent commit.
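The check suggested above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: `RowSeries` and `is_broadcastable` are hypothetical stand-ins, not dask classes or functions, and the real check operates on dask Series and DataFrame objects.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass
class RowSeries:
    """Hypothetical stand-in for a single-partition dask Series."""
    divisions: Tuple
    npartitions: int = 1

    @property
    def known_divisions(self) -> bool:
        # Treat divisions as known when the first boundary is not None.
        return len(self.divisions) > 0 and self.divisions[0] is not None


def is_broadcastable(series: RowSeries, frames_columns: Iterable[Sequence]) -> bool:
    """Return True if the series' divisions match the (min, max) column
    labels of at least one of the frames it is combined with."""
    return (
        series.npartitions == 1
        and series.known_divisions
        and any(series.divisions == (min(cols), max(cols))
                for cols in frames_columns)
    )


columns = ["a", "b", "c"]
print(is_broadcastable(RowSeries(("a", "c")), [columns]))       # divisions match the columns
print(is_broadcastable(RowSeries((0, 1)), [columns]))           # ordinary divisions, no match
print(is_broadcastable(RowSeries((None, None)), [columns]))     # unknown divisions
```

Tying the divisions to the column labels, instead of reserving (0, 1), means an ordinary Series that happens to have divisions (0, 1) is not mistaken for a row unless the frame's columns also span 0 to 1.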
1342f7b to 2987070
Any further comments, @jcrist? Merging this afternoon if not.
d899e68 to a131265
return (isinstance(s, Series) and
        s.npartitions == 1 and
        s.known_divisions and
        any(s.divisions == (min(df.columns), max(df.columns))
This seems fine to me. Just for posterity, I was initially worried about three things:
- What happens when columns is an unordered CategoricalIndex
- What happens when columns is a MultiIndex
- Performance of using min(columns) instead of columns.min()
The first one actually works fine, since the iterator goes over the values in the index, not the categorical codes (makes sense). For a MultiIndex, the iterator just yields tuples, so that also works. The third is interesting: the builtin min is way faster than the min method (not that it matters here, since columns is likely to be small). This is true for both Index and MultiIndex.
In [62]: ind = pd.Index(map(str, range(100000)))
In [63]: %timeit ind.min()
100 loops, best of 3: 10.8 ms per loop
In [64]: %timeit min(ind)
100 loops, best of 3: 3.16 ms per loop
In [65]: ind = pd.Index(['a', 'b', 'c', 'd', 'e']) # Something smaller
In [66]: %timeit ind.min()
The slowest run took 4.64 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 30.2 µs per loop
In [67]: %timeit min(ind)
The slowest run took 43.95 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.62 µs per loop
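The first two points can be checked with a short snippet like the one below. It is a sketch using made-up column labels, not code from the PR. It relies only on the builtin `min` and `max` iterating over the index values.

```python
import pandas as pd

# Illustrative column indexes; the labels are arbitrary.
cat_cols = pd.CategoricalIndex(["b", "a", "c"], ordered=False)
multi_cols = pd.MultiIndex.from_tuples([("b", 1), ("a", 2), ("a", 1)])

# Iterating a CategoricalIndex yields the label values (not the codes),
# so the builtin min/max compare the labels themselves.
cat_bounds = (min(cat_cols), max(cat_cols))

# Iterating a MultiIndex yields tuples, which compare lexicographically.
multi_bounds = (min(multi_cols), max(multi_cols))

print(cat_bounds)
print(multi_bounds)
```
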
cc @jcrist for review