
BUG: Get common dtype in apply_along_axis #8363

Closed

gfyoung wants to merge 1 commit into numpy:master from gfyoung:string-cut-axis-apply

Conversation

@gfyoung
Contributor

@gfyoung gfyoung commented Dec 9, 2016

Currently, np.apply_along_axis formats the output array according to the first element or array that it processes. However, this can cause truncation of subsequent elements whose dtype may be larger than that of the first element.

This commit patches this behaviour by collecting all of the dtypes first and then casting at the end according to the dtype most amenable to all of them.

Closes #8352.
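The truncation described above can be reproduced outside of np.apply_along_axis with a plain fixed-width string array; a minimal sketch (the values are illustrative):

```python
import numpy as np

# The output buffer takes its dtype from the first result, so a
# fixed-width string dtype silently clips any longer later value.
out = np.zeros(2, dtype="<U13")   # "<U13" as inferred from a first result
out[0] = "111 111 0 0 0"          # 13 characters, fits exactly
out[1] = "111 111 111 111 111"    # 19 characters, silently truncated
print(out[1])  # → '111 111 111 1'
```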

@mhvk
Contributor

mhvk commented Dec 13, 2016

I worry that this patch, by first creating an object array and then converting to scalars, will make the common case, where the output does have the correct type, quite a bit slower. Also, do you have an example where this is an issue? It looks as if the function itself would have to give different types of outputs depending on its inputs, which seems strange. If this really is an issue, I would suggest that for these rare cases, one should advise wrapping the function in another one that ensures the output is of the right type.

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

@mhvk : Did you read my PR description? All of your questions are answered there. Not to mention, your suggestion would not be feasible because the damage would already have been done in terms of element coercion when calling apply_along_axis. Wrapping this function would be moot.

Nevertheless, your point about performance is well taken. There is a noticeable performance hit:

In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr) 

master : 5.93 ms / loop
PR : 82.8 ms / loop

In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)

master : 1.88 ms / loop
PR : 2.28 ms / loop

@mhvk
Contributor

mhvk commented Dec 13, 2016

@gfyoung - yes, I did read your description, but as I said, I do not understand how one could easily trigger the situation you are trying to solve, i.e., how the returned dtype can depend on which part of the array the function is applied to. Could you give an example of a function for which the issue arises?

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

@mhvk : In that case, your question about how the issue arises should already be answered: just read the issue I linked in the description. Not to mention, my first sentence in the description clearly explains the dtype casting behavior in np.apply_along_axis IMO.

@mhvk
Contributor

mhvk commented Dec 13, 2016

Duh... somehow I missed that link completely. But it does confirm what I felt already: that one is trying to solve corner cases that would be better dealt with in an update to the documentation rather than in a change that causes a significant performance hit to what I would think is the main use case: numerical computations.

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

This is not a corner case though, nor do I understand why we should preclude non-numerical computations for this function. And if you want a numerical example, here is one:

>>> import numpy as np
>>> arr = [1, np.iinfo(np.int64).max]
>>> np.apply_along_axis(lambda x : x + 1, 0, arr)
array([2, -9223372036854775808])  # Forced casting to int64, which is wrong

@mhvk
Contributor

mhvk commented Dec 13, 2016

To me, that is just as much a corner case. Anyway, I guess we'll have to agree to disagree, and it is up to the numpy maintainers to decide whether the performance hit is worth it. (Unsubscribing).

@eric-wieser
Member

eric-wieser commented Dec 14, 2016

Is there any situation this would reasonably arise other than for strings/unicode? Perhaps only those should be special-cased?

@gfyoung
Contributor Author

gfyoung commented Dec 14, 2016

That is a plausible idea. I did some testing with different numerical types, and everything seems to be okay. It just seems to be the case strictly with strings/unicode.

@shoyer
Member

shoyer commented Dec 15, 2016

I agree with @mhvk that such performance degradation for the typical case is not acceptable. A sane way to handle this would be to explicitly cast to the desired dtype in func1d.

@gfyoung
Contributor Author

gfyoung commented Dec 15, 2016

@shoyer : That's not the issue though. The problem is that we create the output array based on the dtype of the result of the first element. So in the case of strings, numpy creates an array of dtype "<U19", for example. Then, if a length-21 string is added, numpy simply truncates this element before adding it.

This is why I am looking into creating a special case for strings/unicode as @eric-wieser suggested so that the common cases won't take a significant hit.

@shoyer
Member

shoyer commented Dec 15, 2016

The problem is that we create the output array based on the dtype of the result of the first element.

If the first element of the result has the correct dtype, this isn't a problem, e.g.,

>>> arr = np.array([1, np.iinfo(np.int64).max], dtype=object)
>>> np.apply_along_axis(lambda x: x + 1, 0, arr)
array([2, 9223372036854775808L], dtype=object)

@eric-wieser
Member

eric-wieser commented Dec 15, 2016

If the first element of the result has the correct dtype, this isn't a problem

The issue is that if each element of the result is a string, the length of the string is part of the dtype, and it's pretty unlikely that each string element will be the same length.

Another option would be to add a dtype argument:

b = np.array([[111,111,0,0,0], [111,111,111,111,111]])
np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, b, dtype=object).astype(str)

I'm surprised this doesn't work either:

b = np.array([[111,111,0,0,0], [111,111,111,111,111]])
def make_object_scalar(x):
    val = np.empty((), dtype=object)
    val[()] = x
    return val
np.apply_along_axis(lambda x: make_object_scalar(" ".join(map(str, x))), 1, b).astype(str)
# TypeError: len() of unsized object

@gfyoung
Contributor Author

gfyoung commented Dec 15, 2016

@shoyer : Agreed. What do you think about @eric-wieser proposal?

@shoyer
Member

shoyer commented Dec 15, 2016

I would prefer to fix handling of functions that return 0d strings arrays, rather than adding a separate dtype argument.

Member

@eric-wieser eric-wieser Dec 15, 2016

There should be another test here for b" ".join(map(lambda xi: str(xi).encode('utf8'), x))

Contributor Author

Makes sense. Done.

@eric-wieser
Member

That TypeError has already been fixed here, so the make_object_scalar version should work on master

@gfyoung
Contributor Author

gfyoung commented Dec 17, 2016

Rewrote patch to handle special case for str type as first element, which dramatically improved performance as seen below:

In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr) 
master : 5.93 ms / loop
PR : 5.97 ms / loop

In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)
master : 1.88 ms / loop
PR : 1.90 ms / loop

@gfyoung
Contributor Author

gfyoung commented Dec 30, 2016

Could someone take a look at this? The changes seem to be both robust and performant AFAICT.

Member

@shoyer shoyer left a comment

I'm not super happy with this approach, which adds a lot of logic particularly for string dtypes. If there's a way we can accomplish this without special casing strings (e.g., by handling scalar dtype=object arrays), that would make me a lot happier.

Member

What if instead of checking isscalar (which looks for numpy scalars), we cast res to a numpy array using np.asarray, and then check for res.ndim == 0 on this line?

That would let us avoid making all these special cases for str dtypes -- instead, you could simply return an scalar array with dtype=object if you want the appropriate handling for strings.
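To illustrate the suggestion: a 0-d object array would pass an np.asarray(res).ndim == 0 check while still carrying an arbitrary-length string. A minimal sketch (the helper name is made up):

```python
import numpy as np

def as_object_scalar(x):
    # Wrap an arbitrary Python object in a 0-d array of dtype=object.
    val = np.empty((), dtype=object)
    val[()] = x
    return val

res = as_object_scalar("111 111 111 111 111")
# ndim == 0 and dtype == object, yet the payload keeps its full length.
print(res.ndim, res.dtype, res[()])
```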

Contributor Author

@shoyer : You're going to have to explain a little more. I can't follow the logic in your statements here. How exactly does this change the special-casing so-to-speak?

Contributor Author

@gfyoung gfyoung Dec 30, 2016

I should also add that I special-cased because initially, I did try to do something more generic (i.e. handled a dtype=object array and then cast at the end with astype), but it pummeled performance terribly, as you might have seen earlier.

Member

@shoyer: I assumed it already behaved that way! Definitely think that change should be made, but constructing 0d object arrays is painful

Contributor Author

@gfyoung gfyoung Dec 31, 2016

@eric-wieser : Indeed, handling object arrays is not fun (can't put a view on it for example), nor very performance friendly, at least based off my first stab at this.

Contributor Author

@eric-wieser : No, I perfectly understand. I'm just asking what the rationale is for it (as well as what the dtype of the output should be).

Member

@eric-wieser eric-wieser Jan 3, 2017

Ok, here's my proposed fix (on top of #8441, which makes things a lot tidier):

diff --git a/numpy/lib/shape_base.py b/numpy/lib/shape_base.py
index 342f47d..804ece0 100644
--- a/numpy/lib/shape_base.py
+++ b/numpy/lib/shape_base.py
@@ -10,7 +10,7 @@ from numpy.core.fromnumeric import product, reshape, transpose
 from numpy.core import vstack, atleast_3d
 from numpy.lib.index_tricks import ndindex
 from numpy.matrixlib.defmatrix import matrix  # this raises all the right alarm bells
-
+from numpy.compat.py3k import bytes, unicode
 
 __all__ = [
     'column_stack', 'row_stack', 'dstack', 'array_split', 'split',
@@ -96,11 +96,14 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
     # invoke the function on the first item
     ind0 = next(inds)
     res = func1d(inarr_view[ind0], *args, **kwargs)
+    res_type = type(res)
     res = asanyarray(res)
 
+    is_py_str = issubclass(res_type, (bytes, unicode))
+
     # insert as many axes as necessary to create the output
     outshape = arr.shape[:axis] + res.shape + arr.shape[axis+1:]
-    outarr = zeros(outshape, res.dtype)
+    outarr = zeros(outshape, res.dtype if not is_py_str else object)
     # matrices call reshape in __array_prepare__, and this is apparently ok!
     if not isinstance(res, matrix):
         outarr = res.__array_prepare__(outarr)
@@ -119,6 +122,9 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
 
     outarr = res.__array_wrap__(outarr)
 
+    if is_py_str:
+        outarr = outarr.astype(res_type)
+
     return outarr

We special-case str and bytes, rather than their dtypes, because if the user really wants fixed-width strings, then they probably returned an array scalar anyway.
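The same idea can be sketched outside numpy internals as a standalone wrapper (the name apply_str_safe is made up, and it handles only the scalar-result case): if func1d returns a Python str or bytes, collect everything into an object array first and cast once at the end, so the string width is never fixed too early.

```python
import numpy as np

def apply_str_safe(func1d, axis, arr):
    # Sketch of the special case above, scalar-per-slice results only.
    arr = np.asarray(arr)
    slices = np.moveaxis(arr, axis, -1).reshape(-1, arr.shape[axis])
    results = [func1d(s) for s in slices]
    if isinstance(results[0], (str, bytes)):
        # Collect as object first; a single astype picks the widest width.
        out = np.array(results, dtype=object).astype(type(results[0]))
    else:
        out = np.array(results)
    return out.reshape(arr.shape[:axis] + arr.shape[axis + 1:])

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])
out = apply_str_safe(lambda x: " ".join(map(str, x)), 1, b)
print(out)  # the second result is not truncated
```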

Contributor Author

@gfyoung gfyoung Jan 3, 2017

@eric-wieser : Does your patch hold against the test cases I added in this PR? IIUC, your diff is only for the scalar case?

Member

@gfyoung: you're correct, it would fail in the array case. I haven't run your tests against it, because I don't really want to mix separate feature requests into one PR.

Contributor Author

@gfyoung gfyoung Jan 3, 2017

@eric-wieser : That's perfectly understandable. I'll keep this PR open for the time being, but it seems like you would have a better patch for this. Do you mind opening another PR to pick where this one left off once #8441 is merged?

@homu
Contributor

homu commented Dec 31, 2016

☔ The latest upstream changes (presumably b1b67d6) made this pull request unmergeable. Please resolve the merge conflicts.

@eric-wieser
Member

eric-wieser commented Jan 2, 2017

So, right now, this fails:

>>> np.apply_along_axis(lambda r: np.array(np.sum(r)), 0, np.eye(5))
TypeError: len() of unsized object

Where I would expect an output of np.ones(5).

But this also fails:

np.apply_along_axis(np.diag, 0, np.eye(5))

Both of these are the case when the result of f is an array with ndim != 1. I propose we should do at least one of:

  • extend this to work for all ndims
  • add a keepdims argument for the 0D case to preserve a unit dimension, so that np.apply_along_axis(np.sum, 0, arr, keepdims=True) is the same as np.sum(arr, axis=0, keepdims=True)

(fixed in #8441, applying the first bullet point)
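The keepdims idea in the second bullet can be emulated by wrapping func1d so that it returns a length-1 array; a sketch (the wrapper name is made up, and it assumes a numpy where apply_along_axis accepts array-valued results, i.e. post-#8441):

```python
import numpy as np

def apply_keepdims(func1d, axis, arr):
    # Emulate keepdims=True: returning a length-1 array from func1d
    # makes the reduced axis survive as a unit dimension.
    return np.apply_along_axis(lambda v: np.atleast_1d(func1d(v)), axis, arr)

a = np.eye(5)
out = apply_keepdims(np.sum, 0, a)
print(out.shape)  # (1, 5), matching np.sum(a, axis=0, keepdims=True)
```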

@gfyoung gfyoung closed this Jan 15, 2017
@gfyoung gfyoung deleted the string-cut-axis-apply branch January 15, 2017 23:05
@gfyoung gfyoung restored the string-cut-axis-apply branch January 15, 2017 23:05
@gfyoung gfyoung reopened this Jan 15, 2017
@eric-wieser
Member

eric-wieser commented Jan 20, 2017

Fun fact: np.ma.apply_along_axis() already does what this PR is proposing:

>>> a = [[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]]
>>> np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a)
array(['111 111 0 0 0', '111 111 111 1'], 
      dtype='<U13')
>>> np.ma.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a).data
array(['111 111 0 0 0', '111 111 111 111 111'], 
      dtype='<U19')

Less fun fact: the implementation of np.ma.apply_along_axis is totally independent of apply_along_axis, and probably misses most bugfixes.

@gfyoung
Contributor Author

gfyoung commented Jan 21, 2017

@eric-wieser : I'm going to close this. You seem to be moving in a much better direction with this patch than I am given all of these other issues you have uncovered with your patches.

Thanks for all of your feedback!

@gfyoung gfyoung closed this Jan 21, 2017
@gfyoung gfyoung deleted the string-cut-axis-apply branch January 21, 2017 08:40
@eric-wieser
Member

eric-wieser commented Feb 13, 2017

@gfyoung: That PR re-implementing apply_along_axis (#8441) is now merged, so you might want to take another look at this

