BUG: Get common dtype in apply_along_axis #8363
gfyoung wants to merge 1 commit into numpy:master from gfyoung:string-cut-axis-apply
Conversation
|
I worry this patch, by first creating an object array, and then converting to scalars, will make the common case where the output does have the correct type quite a bit slower. Also, do you have an example where this is an issue? It looks as if the function itself has to give different types of outputs depending on its inputs, which seems strange. If this is really an issue, I would suggest that for these rare cases, one should advise to wrap the function into another one that ensures the output is of the right type. |
|
@mhvk : Did you read my PR description? All of your questions are answered there. Not to mention, your suggestion would not be feasible, because the damage would already have been done in terms of element coercion when calling the function. Nevertheless, your point about performance is well taken. There is a noticeable performance hit:
In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr)
In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)
|
|
@gfyoung - yes, I did read your description, but as I said, I do not understand how one could easily trigger the situation you are trying to solve, i.e., how one can have a situation where the returned |
|
@mhvk : But then your question about the issue arising should have been answered. Just read the issue I pointed to in the description. Not to mention, my first sentence in the description clearly explains the |
|
Duh... somehow I missed that link completely. But it does confirm what I felt already: that one is trying to solve corner cases that would be better dealt with in an update to the documentation rather than in a change that causes a significant performance hit to what I would think is the main use case: numerical computations. |
|
This is not a corner case though, nor do I understand why we should preclude non-numerical computations for this function. And if you want a numerical example, here is one:
>>> import numpy as np
>>> arr = [1, np.iinfo(np.int64).max]
>>> np.apply_along_axis(lambda x : x + 1, 0, arr)
array([2, -9223372036854775808]) # Forced casting to int64, which is wrong |
|
To me, that is just as much a corner case. Anyway, I guess we'll have to agree to disagree, and it is up to the numpy maintainers to decide whether the performance hit is worth it. (Unsubscribing). |
|
Is there any situation this would reasonably arise other than for strings/unicode? Perhaps only those should be special-cased? |
|
That is a plausible idea. I did some testing with different numerical types, and everything seems to be okay. It just seems to be the case with strictly strings/unicode.
|
I agree with @mhvk that such performance degradation for the typical case is not acceptable. A sane way to handle this would be to explicitly cast to the desired dtype in |
|
@shoyer : That's not the issue though. The problem is that we create the output array based on the dtype of the first result. This is why I am looking into creating a special case for strings/unicode, as @eric-wieser suggested, so that the common cases won't take a significant hit.
If the first element of the result has the correct dtype, this isn't a problem, e.g., |
The issue is that if each element of the result is a string, the length of the string is part of the dtype, and it's pretty unlikely that each string element will be the same length. Another option would be to add a dtype argument. I'm surprised this doesn't work either: |
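To illustrate the point above, that a string's length really is part of its NumPy dtype, here is a minimal snippet (not part of the original thread, plain NumPy only):

```python
import numpy as np

# The length of a string is part of its dtype: "U3" means
# "Unicode string of up to 3 characters".
assert np.asarray("abc").dtype == np.dtype("U3")

# Assigning a longer string into a fixed-width array silently truncates.
out = np.zeros(2, dtype="U3")
out[0] = "abcdef"
assert out[0] == "abc"
```

This silent truncation on assignment is exactly what bites apply_along_axis when the first result happens to be the shortest string.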
|
@shoyer : Agreed. What do you think about @eric-wieser's proposal?
|
I would prefer to fix handling of functions that return 0d strings arrays, rather than adding a separate dtype argument. |
Review comment on numpy/lib/tests/test_shape_base.py (outdated):
There should be another test here for b" ".join(map(lambda xi: str(xi).encode('utf8'), x))
|
That |
|
Rewrote patch to handle the special case for strings/unicode:
In [1]: import numpy as np
In [2]: arr = np.arange(1000000)
In [3]: %timeit np.apply_along_axis(lambda x: x + 1, 0, arr)
master : 5.93 ms / loop
PR : 5.97 ms / loop
In [4]: %timeit np.apply_along_axis(lambda x: [x + 1, 5], 0, arr)
master : 1.88 ms / loop
PR : 1.90 ms / loop |
|
Could someone take a look at this? The changes seem to be both robust and performant AFAICT. |
shoyer left a comment:
I'm not super happy with this approach, which adds a lot of logic particularly for string dtypes. If there's a way we can accomplish this without special casing strings (e.g., by handling scalar dtype=object arrays), that would make me a lot happier.
Review comment on numpy/lib/shape_base.py (outdated):
What if instead of checking isscalar (which looks for numpy scalars), we cast res to a numpy array using np.asarray, and then check for res.ndim == 0 on this line?
That would let us avoid making all these special cases for str dtypes -- instead, you could simply return a scalar array with dtype=object if you want the appropriate handling for strings.
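The distinction being drawn here, between np.isscalar and a 0d-array check after np.asarray, can be seen in isolation (a small illustration, not from the thread):

```python
import numpy as np

# np.isscalar does not recognize 0d arrays ...
assert not np.isscalar(np.array(1.0))

# ... but after np.asarray, plain scalars and 0d arrays both report
# ndim == 0, so a single check covers both cases.
assert np.asarray(5).ndim == 0
assert np.asarray(np.array(1.0)).ndim == 0
assert np.asarray([1, 2, 3]).ndim == 1
```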
@shoyer : You're going to have to explain a little more. I can't follow the logic in your statements here. How exactly does this change the special-casing, so to speak?
I should also add that I special-cased because initially, I did try to do something more generic (i.e. handled a dtype=object array and then cast at the end with astype), but it pummeled performance terribly, as you might have seen earlier.
@shoyer: I assumed it already behaved that way! Definitely think that change should be made, but constructing 0d object arrays is painful
@eric-wieser : Indeed, handling object arrays is not fun (you can't take a view of one, for example), nor very performance-friendly, at least based on my first stab at this.
@eric-wieser : No, I perfectly understand. I'm just asking what the rationale is for it (as well as what the dtype of the output should be).
Ok, here's my proposed fix (on top of #8441, which makes things a lot tidier):
diff --git a/numpy/lib/shape_base.py b/numpy/lib/shape_base.py
index 342f47d..804ece0 100644
--- a/numpy/lib/shape_base.py
+++ b/numpy/lib/shape_base.py
@@ -10,7 +10,7 @@ from numpy.core.fromnumeric import product, reshape, transpose
from numpy.core import vstack, atleast_3d
from numpy.lib.index_tricks import ndindex
from numpy.matrixlib.defmatrix import matrix # this raises all the right alarm bells
-
+from numpy.compat.py3k import bytes, unicode
__all__ = [
'column_stack', 'row_stack', 'dstack', 'array_split', 'split',
@@ -96,11 +96,14 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
# invoke the function on the first item
ind0 = next(inds)
res = func1d(inarr_view[ind0], *args, **kwargs)
+ res_type = type(res)
res = asanyarray(res)
+ is_py_str = issubclass(res_type, (bytes, unicode))
+
# insert as many axes as necessary to create the output
outshape = arr.shape[:axis] + res.shape + arr.shape[axis+1:]
- outarr = zeros(outshape, res.dtype)
+ outarr = zeros(outshape, res.dtype if not is_py_str else object)
# matrices call reshape in __array_prepare__, and this is apparently ok!
if not isinstance(res, matrix):
outarr = res.__array_prepare__(outarr)
@@ -119,6 +122,9 @@ def apply_along_axis(func1d, axis, arr, *args, **kwargs):
outarr = res.__array_wrap__(outarr)
+ if is_py_str:
+ outarr = outarr.astype(res_type)
+
     return outarr

Special-case str and bytes, and not dtypes, because if the user really wants fixed-width strings, then they probably returned an array scalar anyway.
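The astype step in that patch relies on NumPy widening flexible string dtypes on cast: build with dtype=object, then cast once at the end. Sketched in isolation (illustration only, assuming plain NumPy):

```python
import numpy as np

# Collect results with dtype=object, then cast once at the end:
# astype(str) scans all elements and picks a width that fits the longest.
results = np.array(["111 111 0 0 0", "111 111 111 111 111"], dtype=object)
widened = results.astype(str)
assert widened.dtype == np.dtype("U19")
assert widened[1] == "111 111 111 111 111"
```

Deferring the cast means the width is chosen from all elements rather than only the first, which is exactly the failure mode this PR set out to fix.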
@eric-wieser : Does your patch hold against the test cases I added in this PR? IIUC, your diff is only for the scalar case?
@gfyoung: you're correct, it would fail in the array case. I haven't run your tests against it, because I don't really want to mix separate feature requests into one PR.
@eric-wieser : That's perfectly understandable. I'll keep this PR open for the time being, but it seems like you would have a better patch for this. Do you mind opening another PR to pick where this one left off once #8441 is merged?
|
☔ The latest upstream changes (presumably b1b67d6) made this pull request unmergeable. Please resolve the merge conflicts. |
|
So, right now, this fails:
Where I would expect an output of
But this also fails:
Both of these are the case when the result of
(Fixed in #8441, applying the first bullet point.)
Currently, np.apply_along_axis formats the output array according to the first element or array that it processes. However, this can cause truncation of subsequent elements whose dtype may be larger than that of the first element. This commit patches this behaviour by collecting all of the dtypes first and then casting at the end according to the dtype most amenable to all of them. Closes gh-8352.
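The "dtype most amenable to all of them" step from the commit message can be sketched with NumPy's own promotion rules (illustration only, not the PR's actual code):

```python
import numpy as np

# Each string result carries its own width in its dtype ...
dtypes = [np.asarray("111 111 0 0 0").dtype,        # U13
          np.asarray("111 111 111 111 111").dtype]  # U19

# ... and NumPy's promotion rules pick one wide enough for all of them.
common = np.result_type(*dtypes)
assert common == np.dtype("U19")
```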
|
Fun fact:
>>> a = [[111, 111, 0, 0, 0], [111, 111, 111, 111, 111]]
>>> np.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a)
array(['111 111 0 0 0', '111 111 111 1'],
dtype='<U13')
>>> np.ma.apply_along_axis(lambda x: " ".join(map(str, x)), 1, a).data
array(['111 111 0 0 0', '111 111 111 111 111'],
       dtype='<U19')
Less fun fact - the implementation of |
|
@eric-wieser : I'm going to close this. You seem to be moving in a much better direction with this patch than I am given all of these other issues you have uncovered with your patches. Thanks for all of your feedback! |