
BUG: Get common dtype in apply_along_axis #8363

Closed

gfyoung wants to merge 1 commit into numpy:master from gfyoung:string-cut-axis-apply

Conversation

@gfyoung
Contributor

@gfyoung gfyoung commented Dec 9, 2016

Currently, np.apply_along_axis formats the output array according to the first element or array that it processes. However, this can cause truncation of subsequent elements whose dtype may be larger than that of the first element.

This commit patches this behaviour by collecting all of the dtypes first and then casting at the end according to the dtype most amenable to all of them.

Closes #8352.
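The truncation described above can be reproduced outside of np.apply_along_axis with a plain fixed-width string array; a minimal sketch (the values are illustrative):

```python
import numpy as np

# The output buffer takes its dtype from the first result, so a
# fixed-width string dtype silently clips any longer later value.
out = np.zeros(2, dtype="<U13")   # "<U13" as inferred from a first result
out[0] = "111 111 0 0 0"          # 13 characters, fits exactly
out[1] = "111 111 111 111 111"    # 19 characters, silently truncated
print(out[1])  # → '111 111 111 1'
```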

@mhvk
Contributor

mhvk commented Dec 13, 2016

I worry that this patch, by first creating an object array and then converting to scalars, will make the common case, where the output does have the correct type, quite a bit slower. Also, do you have an example where this is an issue? It looks as if the function itself would have to give different types of outputs depending on its inputs, which seems strange. If this really is an issue, I would suggest that for these rare cases, one should advise wrapping the function in another one that ensures the output is of the right type.

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

@mhvk : Did you read my PR description? All of your questions are answered there. Not to mention, your suggestion would not be feasible because the damage would already have been done in terms of element coercion when calling apply_along_axis. Wrapping this function would be moot.

Nevertheless, your point about performance is well taken. There is a noticeable performance hit:

In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr) 

master : 5.93 ms / loop
PR : 82.8 ms / loop

In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)

master : 1.88 ms / loop
PR : 2.28 ms / loop

@mhvk
Contributor

mhvk commented Dec 13, 2016

@gfyoung - yes, I did read your description, but as I said, I do not understand how one could easily trigger the situation you are trying to solve, i.e., how the returned dtype can depend on which part of the array the function is applied to. Could you give an example of a function for which the issue arises?

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

@mhvk : In that case, your question about how the issue arises should already be answered: just read the issue I linked in the description. Not to mention, my first sentence in the description clearly explains the dtype casting behavior in np.apply_along_axis IMO.

@mhvk
Contributor

mhvk commented Dec 13, 2016

Duh... somehow I missed that link completely. But it does confirm what I felt already: that one is trying to solve corner cases that would be better dealt with in an update to the documentation rather than in a change that causes a significant performance hit to what I would think is the main use case: numerical computations.

@gfyoung
Contributor Author

gfyoung commented Dec 13, 2016

This is not a corner case though, nor do I understand why we should preclude non-numerical computations for this function. And if you want a numerical example, here is one:

>>> import numpy as np
>>> arr = [1, np.iinfo(np.int64).max]
>>> np.apply_along_axis(lambda x : x + 1, 0, arr)
array([2, -9223372036854775808])  # Forced casting to int64, which is wrong

@mhvk
Contributor

mhvk commented Dec 13, 2016

To me, that is just as much a corner case. Anyway, I guess we'll have to agree to disagree, and it is up to the numpy maintainers to decide whether the performance hit is worth it. (Unsubscribing).

@eric-wieser
Member

eric-wieser commented Dec 14, 2016

Is there any situation this would reasonably arise other than for strings/unicode? Perhaps only those should be special-cased?

@gfyoung
Contributor Author

gfyoung commented Dec 14, 2016

That is a plausible idea. I did some testing with different numerical types, and everything seems to be okay. It just seems to be the case strictly with strings/unicode.

@shoyer
Member

shoyer commented Dec 15, 2016

I agree with @mhvk that such performance degradation for the typical case is not acceptable. A sane way to handle this would be to explicitly cast to the desired dtype in func1d.

@gfyoung
Contributor Author

gfyoung commented Dec 15, 2016

@shoyer : That's not the issue though. The problem is that we create the output array based on the dtype of the result of the first element. So in the case of strings, numpy creates an array of dtype "<U19", for example. Then, if a length-21 string is added, numpy simply truncates this element before adding it.

This is why I am looking into creating a special case for strings/unicode as @eric-wieser suggested so that the common cases won't take a significant hit.

@shoyer
Member

shoyer commented Dec 15, 2016

The problem is that we create the output array based on the dtype of the result of the first element.

If the first element of the result has the correct dtype, this isn't a problem, e.g.,

>>> arr = np.array([1, np.iinfo(np.int64).max], dtype=object)
>>> np.apply_along_axis(lambda x: x + 1, 0, arr)
array([2, 9223372036854775808L], dtype=object)

@eric-wieser
Member

eric-wieser commented Dec 15, 2016

If the first element of the result has the correct dtype, this isn't a problem

The issue is that if each element of the result is a string, the length of the string is part of the dtype, and it's pretty unlikely that each string element will be the same length.

Another option would be to add a dtype argument:

b = np.array([[111,111,0,0,0], [111,111,111,111,111]])
np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, b, dtype=object).astype(str)

I'm surprised this doesn't work either:

b = np.array([[111,111,0,0,0], [111,111,111,111,111]])
def make_object_scalar(x):
    val = np.empty((), dtype=object)
    val[()] = x
    return val
np.apply_along_axis(lambda x: make_object_scalar(" ".join(map(str, x))), 1, b).astype(str)
# TypeError: len() of unsized object

@gfyoung
Contributor Author

gfyoung commented Dec 15, 2016

@shoyer : Agreed. What do you think about @eric-wieser proposal?

@shoyer
Member

shoyer commented Dec 15, 2016

I would prefer to fix handling of functions that return 0d strings arrays, rather than adding a separate dtype argument.

Member

@eric-wieser eric-wieser Dec 15, 2016

There should be another test here for b" ".join(map(lambda xi: str(xi).encode('utf8'), x))

Contributor Author

Makes sense. Done.

@eric-wieser
Member

That TypeError has already been fixed here, so the make_object_scalar version should work on master

@gfyoung
Contributor Author

gfyoung commented Dec 17, 2016

Rewrote patch to handle special case for str type as first element, which dramatically improved performance as seen below:

In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr) 
master : 5.93 ms / loop
PR : 5.97 ms / loop

In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)
master : 1.88 ms / loop
PR : 1.90 ms / loop

@gfyoung
Contributor Author

gfyoung commented Dec 30, 2016

Could someone take a look at this? The changes seem to be both robust and performant AFAICT.

Member

@shoyer shoyer left a comment

I'm not super happy with this approach, which adds a lot of logic particularly for string dtypes. If there's a way we can accomplish this without special casing strings (e.g., by handling scalar dtype=object arrays), that would make me a lot happier.

Member

What if instead of checking isscalar (which looks for numpy scalars), we cast res to a numpy array using np.asarray, and then check for res.ndim == 0 on this line?

That would let us avoid making all these special cases for str dtypes -- instead, you could simply return an scalar array with dtype=object if you want the appropriate handling for strings.
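To illustrate the suggestion: a 0-d object array would pass an np.asarray(res).ndim == 0 check while still carrying an arbitrary-length string. A minimal sketch (the helper name is made up):

```python
import numpy as np

def as_object_scalar(x):
    # Wrap an arbitrary Python object in a 0-d array of dtype=object.
    val = np.empty((), dtype=object)
    val[()] = x
    return val

res = as_object_scalar("111 111 111 111 111")
# ndim == 0 and dtype == object, yet the payload keeps its full length.
print(res.ndim, res.dtype, res[()])
```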

Contributor Author

@shoyer : You're going to have to explain a little more. I can't follow the logic in your statements here. How exactly does this change the special-casing so-to-speak?

Contributor Author

@gfyoung gfyoung Dec 30, 2016

I should also add that I special-cased because initially, I did try to do something more generic (i.e. handled a dtype=object array and then cast at the end with astype), but it pummeled performance terribly, as you might have seen earlier.

Member

@shoyer: I assumed it already behaved that way! Definitely think that change should be made, but constructing 0d object arrays is painful

Contributor Author

@gfyoung gfyoung Dec 31, 2016

@eric-wieser : Indeed, handling object arrays is not fun (can't put a view on it for example), nor very performance friendly, at least based off my first stab at this.

Contributor Author

@eric-wieser : No, I perfectly understand. I'm just asking what the rationale is for it (as well as what the dtype of the output should be).

Member

@eric-wieser eric-wieser Jan 3, 2017

Ok, here's my proposed fix (on top of #8441, which makes things a lot tidier):

diff --git a/numpy/lib/shape_base.py b/numpy/lib/shape_base.py
index 342f47d..804ece0 100644
--- a/numpy/lib/shape_base.py
+++ b/numpy/lib/shape_base.py
@@ -10,7 +10,7 @@ from numpy.core.fromnumeric import product, reshape, transpose
 from numpy.core import vstack, atleast_3d
 from numpy.lib.index_tricks import ndindex
 from numpy.matrixlib.defmatrix import matrix  # this raises all the right alarm bells
-
+from numpy.compat.py3k import bytes, unicode
 
 __all__ = [
     'column_stack', 'row_stack', 'dstack', 'array_split', 'split',
@@ -96,11 +96,14 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
     # invoke the function on the first item
     ind0 = next(inds)
     res = func1d(inarr_view[ind0], *args, **kwargs)
+    res_type = type(res)
     res = asanyarray(res)
 
+    is_py_str = issubclass(res_type, (bytes, unicode))
+
     # insert as many axes as necessary to create the output
     outshape = arr.shape[:axis] + res.shape + arr.shape[axis+1:]
-    outarr = zeros(outshape, res.dtype)
+    outarr = zeros(outshape, res.dtype if not is_py_str else object)
     # matrices call reshape in __array_prepare__, and this is apparently ok!
     if not isinstance(res, matrix):
         outarr = res.__array_prepare__(outarr)
@@ -119,6 +122,9 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
 
     outarr = res.__array_wrap__(outarr)
 
+    if is_py_str:
+        outarr = outarr.astype(res_type)
+
     return outarr

We special-case str and bytes, rather than their dtypes, because if the user really wants fixed-width strings, then they probably returned an array scalar anyway.
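The same idea can be sketched outside numpy internals as a standalone wrapper (the name apply_str_safe is made up, and it handles only the scalar-result case): if func1d returns a Python str or bytes, collect everything into an object array first and cast once at the end, so the string width is never fixed too early.

```python
import numpy as np

def apply_str_safe(func1d, axis, arr):
    # Sketch of the special case above, scalar-per-slice results only.
    arr = np.asarray(arr)
    slices = np.moveaxis(arr, axis, -1).reshape(-1, arr.shape[axis])
    results = [func1d(s) for s in slices]
    if isinstance(results[0], (str, bytes)):
        # Collect as object first; a single astype picks the widest width.
        out = np.array(results, dtype=object).astype(type(results[0]))
    else:
        out = np.array(results)
    return out.reshape(arr.shape[:axis] + arr.shape[axis + 1:])

b = np.array([[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]])
out = apply_str_safe(lambda x: " ".join(map(str, x)), 1, b)
print(out)  # the second result is not truncated
```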

Contributor Author

@gfyoung gfyoung Jan 3, 2017

@eric-wieser : Does your patch hold against the test cases I added in this PR? IIUC, your diff is only for the scalar case?

Member

@gfyoung: you're correct, it would fail in the array case. I haven't run your tests against it, because I don't really want to mix separate feature requests into one PR.

Contributor Author

@gfyoung gfyoung Jan 3, 2017

@eric-wieser : That's perfectly understandable. I'll keep this PR open for the time being, but it seems like you would have a better patch for this. Do you mind opening another PR to pick where this one left off once #8441 is merged?

@homu
Contributor

homu commented Dec 31, 2016

☔ The latest upstream changes (presumably b1b67d6) made this pull request unmergeable. Please resolve the merge conflicts.

@eric-wieser
Member

eric-wieser commented Jan 2, 2017

So, right now, this fails:

>>> np.apply_along_axis(lambda r: np.array(np.sum(r)), 0, np.eye(5))
TypeError: len() of unsized object

Where I would expect an output of np.ones(5).

But this also fails:

np.apply_along_axis(np.diag, 0, np.eye(5))

Both of these are the case when the result of f is an array with ndim != 1. I propose we should do at least one of:

  • extend this to work for all ndims
  • add a keepdims argument for the 0D case to preserve a unit dimension, so that np.apply_along_axis(np.sum, 0, arr, keepdims=True) is the same as np.sum(arr, axis=0, keepdims=True)

(fixed in #8441, applying the first bullet point)
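The keepdims idea in the second bullet can be emulated by wrapping func1d so that it returns a length-1 array; a sketch (the wrapper name is made up, and it assumes a numpy where apply_along_axis accepts array-valued results, i.e. post-#8441):

```python
import numpy as np

def apply_keepdims(func1d, axis, arr):
    # Emulate keepdims=True: returning a length-1 array from func1d
    # makes the reduced axis survive as a unit dimension.
    return np.apply_along_axis(lambda v: np.atleast_1d(func1d(v)), axis, arr)

a = np.eye(5)
out = apply_keepdims(np.sum, 0, a)
print(out.shape)  # (1, 5), matching np.sum(a, axis=0, keepdims=True)
```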

@gfyoung gfyoung closed this Jan 15, 2017
@gfyoung gfyoung deleted the string-cut-axis-apply branch January 15, 2017 23:05
@gfyoung gfyoung restored the string-cut-axis-apply branch January 15, 2017 23:05
@gfyoung gfyoung reopened this Jan 15, 2017
@eric-wieser
Member

eric-wieser commented Jan 20, 2017

Fun fact: np.ma.apply_along_axis() already does what this PR is proposing:

>>> a = [[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]]
>>> np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a)
array(['111 111 0 0 0', '111 111 111 1'], 
      dtype='<U13')
>>> np.ma.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a).data
array(['111 111 0 0 0', '111 111 111 111 111'], 
      dtype='<U19')

Less fun fact: the implementation of np.ma.apply_along_axis is totally independent of apply_along_axis, and probably misses most bugfixes.

@gfyoung
Contributor Author

gfyoung commented Jan 21, 2017

@eric-wieser : I'm going to close this. You seem to be moving in a much better direction with this patch than I am given all of these other issues you have uncovered with your patches.

Thanks for all of your feedback!

@gfyoung gfyoung closed this Jan 21, 2017
@gfyoung gfyoung deleted the string-cut-axis-apply branch January 21, 2017 08:40
@eric-wieser
Member

eric-wieser commented Feb 13, 2017

@gfyoung: That PR re-implementing apply_along_axis (#8441) is now merged, so you might want to take another look at this

