BUG MaskedArray __eq__ wrong for masked scalar, multi-d recarray#8590
BUG MaskedArray __eq__ wrong for masked scalar, multi-d recarray#8590eric-wieser merged 2 commits intonumpy:masterfrom
Conversation
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
I'm still wondering whether it would be better to use .view(type(self)) as long as that is not mvoid while the shape is that of an array.
There was a problem hiding this comment.
Oh, you mean in the following line?
There was a problem hiding this comment.
Yes, indeed. In the old code, the result is unconditionally turned into type(self), which is a problem if one compares a MaskedArray with an mvoid (where the latter is a single element, which will always have precedence since it is a subclass of MaskedArray.
Possibly, it is better solved in mvoid...
Anyway, as written the code works, but let me know if there are other suggestions.
There was a problem hiding this comment.
Yes, probably it can also be solved by defining _compare in mvoid, but I agree this is fine.
(also, not sure if the test above for check.dtype.names should be something like check.dtype == np.void to account for dtype('V4')... but maybe that is overthinking it since I suspect truly void types don't work well with masks anyway)
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
Could use a bit more explanation of the role of the mask.
There was a problem hiding this comment.
Could use a bit more explanation of the role of the mask.
Wasn't there before, but happy to do so (but will wait for possible other comments first).
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
I know this is the old behavior, but just want to note that any makes more sense to me than all.
With all, I think it causes the result of the comparison to depend on the data at a masked value, eg:
>>> m = np.ma.mvoid((1,2), mask=(0,1), dtype='i4,f4')
>>> n = np.ma.mvoid((1,0), dtype='i4,f4')
>>> o = np.ma.mvoid((1,1), dtype='i4,f4')
>>> (n == m), (m == n), (o == m)
(True, masked, False)There was a problem hiding this comment.
@ahaldane To be clear, is that the behaviour you want, or the behaviour that all implies?
There was a problem hiding this comment.
That's the behavior I get with this PR as it is right now (I fetchd + tested it).
I actually didn't work out why n == m is different from m ==n yet.
There was a problem hiding this comment.
Should m == n return masked or True? Either seem somewhat reasonable (but obviously both as they are right now is bad!)
There was a problem hiding this comment.
Now that you ask, I have been going back and forth on it...
After some thought, the only consistent result I see is to return masked for all three cases, except that we need some extra code so that m == m, ie
>>> p = np.ma.mvoid((1,7), mask=(0,1), dtype='i4,f4')
>>> (m == p), (m == m)
True, Truewhich I think requires replacement of mask_or by mask_xor, and then use any instead of all. But possibly we should have m == m return masked, I'm still a little undecided.
Returning True seems worse to me because intuitively I think n and m don't represent equal values, and secondly because the code would have to be much more complicated: We would have to reimplements numpy's internal structure comparison (somewhere in the C code) inside of MaskedArray to ignore fields with masked values.
There was a problem hiding this comment.
@ahaldane: I'm suggesting all semantics on the flattened view of the masked data (when such a thing can exist), not on the mask alone. I'm suggesting this because that's what the semantics of normal structured arrays are.
Flattening the data in your example, we get consistent behaviour
>>> o = np.ma.array([[1]], mask=[[True]])
>>> p = np.ma.array([[0]], mask=[[True]])
>>> (o == p).all(axis=-1)
masked_array(data = [--],
mask = [ True],
fill_value = True)I'd expect your example to return a 1D array, not a scalar, but I assume that's just a mistake.
There was a problem hiding this comment.
I think I'm with @eric-wieser here, but maybe best to postpone further discussion until I have the next version up; @ahaldane's example shows that this one is more broken than I had thought (though an improvement on what was there!). In the meantime, I also found that other parts of np.ma are relying on comparison of two masked items being returned as masked True...
There was a problem hiding this comment.
By the way, I just noticed there are unused, half-implemented methods _get_recordmask and _set_recordmask defined for MaskedArray...
There was a problem hiding this comment.
A lot of work an exploration was put into it, which hopefully we can learn from in a future tidying/reimplementation. (incidentally, if more numpy members who know about maskedarrays are joining recently, maybe we can make a reimplementation happen soon)
With respect to this PR: I recognize that, as often happens, this PR might be snowballing now that the unit tests are failing upon further changes. I might be in favor of getting this PR through now as a partial fix (at least m == m[0] will work). What do you think?
|
Besides the Also, congrats on becoming a numpy member @mhvk ! I'm very happy to have someone else who often works on MaskedArrays here. |
|
@ahaldane - hmm, that's a nasty problem you point out: ideally, we would ignore any masked values in the comparison, but that is not how it was done (or done in my PR). Needs some thought... |
|
@ahaldane - on the "snowballing" -- I think it is not too bad, but I agree that if I don't get to something better by the end of the week, we probably should take this as at least an initial step. |
|
@mhvk all right, sounds good |
1be0dc1 to
6b84bdb
Compare
|
@ahaldane, @eric-wieser - I pushed an updated version of |
numpy/ma/tests/test_core.py
Outdated
There was a problem hiding this comment.
I personally think that every single one of these tests should return np.masked
There was a problem hiding this comment.
In fact, can any two mvoid objects return np.masked on comparison with this patch?
There was a problem hiding this comment.
My argument would be that there are 3 cases here:
- No matter what values the masked fields take, two objects are always equal
- No matter what values the masked fields take, two objects are always not equal
- The equality depends on the values of the masked fields
Since the first two of these cases are represented by True and False, surely we should use masked to represent the third case?
There was a problem hiding this comment.
Yes, mvoid comparisons can yield masked, but only if all elements are masked (in at least one of the two):
m = mvoid((1, 2.), mask=(0, 1), dtype='i4,f4')
p = mvoid((1, 2.), mask=(1, 0), dtype=m.dtype)
m == p
# masked
I probably should add that to the test cases...
|
@eric-wieser - replying in the main thread so things do not get lost: I was actually working by an analogy (I think) you made earlier, that the structured arrays should work the same way as would be the case if one tested equality over an axis. Now for that, masked elements are simply ignored: I think this is actually reasonably logical given the interpretation that masking elements means one should ignore them: if both arrays mask a given element, just don't care, if one does not mask it, it cannot possibly be equal to a masked one. |
|
Actually, @eric-wieser, your real example was: while with my PR, so I'm not consistent. Darn. |
|
@mhvk : Perhaps the test suite should try every permutation of mask / value, and compare structured vs unstructured comparison? |
|
This is a little of a further-out idea, but I'd like to try it out on you. Maybe we want masked structure comparison to be fundamentally different from plain structure comparison. What I mean is, maybe instead of returning a single True/False value per element, comparison should return a new structure similar to the Incidentally, plain structure comparison is itself somewhat poorly defined and gives you lots of deprecation warnings in various cases. Its conceivable we could also change plain structure comparisons to do a similar thing (just without the masks). However I haven't thought it through, and that might be too big a change. |
7c1847a to
ed86a0b
Compare
|
@eric-wieser, @ahaldane - a new version, which now explicitly tests that the structured comparison gives the same result as doing |
numpy/ma/tests/test_core.py
Outdated
There was a problem hiding this comment.
This is the test I had to change (as well as its equivalent for ne).
|
@ahaldane - now that |
ed86a0b to
7f191b5
Compare
|
Ok, I like the behavior now and I am happy that the code seems quite elegant now compared to what was there before! I'll merge in an in an hour or two, if there are no further comments. (We can keep the "more logical" possible way for another time, if ever. It's such a niche case I don't think its a priority) |
7f191b5 to
400e0a6
Compare
|
@ahaldane - yes, I liked as well that the code ended up becoming reasonably elegant! Note that I realised I had not updated the docstring for |
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
This seems inconsistent with the duck-typing we seem to use elsewhere
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
I think this might be clearer as just sbroadcast, odatabroadcast = np.broadcast_arrays(self, odata, subok=True)
There was a problem hiding this comment.
Yes, on second thought, I agree clarity beats speed here (I essentially wrote out what np.broadcast_arrays does, but skipping broadcasting odata).
There was a problem hiding this comment.
@mhvk: I'm also suggesting you pass odatabroadcast rather than odata into filled, for clarity
There was a problem hiding this comment.
In fact, I'm starting to think that this should be the default behaviour of filled, allowing filling with a larger fill-value than the dimensions
There was a problem hiding this comment.
actually, that breaks tests, it turns out: broadcast_arrays is not guaranteed to return a new instance, and since I set the mask right after, I effectively change self. Not sure that is desired behaviour, but outside of the scope of this PR...
There was a problem hiding this comment.
And looking into this, I realise mask is not properly broadcast either (and never was): this fails
np.ma.array([0, 1], mask=[0, 1]) == np.ma.array([[0, 2]])
(it fails in __repr__ but would otherwise be bad too)
There was a problem hiding this comment.
@eric-wieser - as you'll see, I had to leave the broadcasting alone, since otherwise I had to make a copy. I agree with you that, ideally, filled would just do the right thing itself. Even better if the filled function could take a mask rather than just calling the method. But I think that's best left for another PR!
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
This is probably better as if np.ndim(check) == 0
There was a problem hiding this comment.
Oh, unless this is deliberately trying to return 0d arrays and not scalars
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
There's a function somewhere within this file for picking the appropriate subclass of MaskedArray
c540303 to
04f83e5
Compare
|
colons inserted... |
|
Great. I'll leave this a day in case @ahaldane has any more comments or the tests fail, then looks good to merge. Thanks for tolerating my hole-picking! |
|
Nice catches @eric-wieser ! I just read through it again, and nothing sticks out to me. Feel free to merge, if you don't I'll do it tomorrow. |
numpy/ma/core.py
Outdated
There was a problem hiding this comment.
Are there any tests of this underlying bool data?
There was a problem hiding this comment.
Yes, note how the tests all do something like test = (?? != ??) and then test both the values and the mask.
|
I think this needs a release note, since we actually did change behaviour - it was a narrow and buggy enough case that probably no one is affected, but if they are, then it would be nice to have a release note to point at |
|
Yes, fine to add a quick note to the release: @charris - this is a bug fix mainly still, which I think should go in 1.12.1. But do I add a note to 1.12.1-notes.rst or to |
numpy/ma/tests/test_core.py
Outdated
There was a problem hiding this comment.
If this is trying to test the underlying values, I think it needs a .data
There was a problem hiding this comment.
You're presuming that numpy functions correctly account for subclasses.... ;-)
In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True])
In [4]: test = np.ma.MaskedArray([True, False], mask=[False, True])
In [5]: assert_equal(test, [True, False])
(they should, of course, but that's for another time....)
There was a problem hiding this comment.
Wanna bet they don't?
In [2]: test = np.ma.MaskedArray([True, True], mask=[False, True])
In [3]: assert_equal(test, [True, True]) # correct, as before
In [4]: assert_equal(test, [True, False]) # uh oh...
In [5]: assert_equal(test, [True, np.array("oh dear", dtype=object)]) # oh geezThere was a problem hiding this comment.
Grrrrrr.... OK, will change.
04f83e5 to
7e8bfd6
Compare
|
@mhvk When this showed up I was tempted to make it a backport, but because of the changes in tested behavior decided not to. I'd like to branch 1.13 by the end of next month, so the delay should not be excessive. |
|
As to where to make release note additions for maintenance releases, it varies ;) Generally it should start in current master, but I did it the other way round in the 1.12.0 release because of the numerous updates. As a more practical matter, the 1.12.1 notes don't yet exist in the 1.12.x branch. |
7e8bfd6 to
3e6a69f
Compare
|
OK, added a changelog entry. |
doc/release/1.13.0-notes.rst
Outdated
There was a problem hiding this comment.
regular would be better as un-structured here, since regular falsely implies "not masked"
3e6a69f to
f49708b
Compare
In the process of trying to fix the "questionable behaviour in `MaskedArray.__eq__`" (numpygh-8589), it became clear that the code was buggy. E.g., `ma == ma[0]` failed if `ma` held a structured dtype; multi-d structured dtypes failed generally; and, more worryingly, a masked scalar comparison could be wrong: `np.ma.MaskedArray(1, mask=True) == 0` yields True. This commit solves these problems, adding tests to prevent regression. In the process, it also ensures that the results for structured arrays always equals what one would get by logically combining the results over individual parts of the structure.
f49708b to
3435dd9
Compare
|
OK, done. I cancelled the builds, since this passed already. |
|
Not sure I'm comfortable merging this without the builds... Problem with rewriting history is I have no way of knowing if you accidentally messed something up, because I can't just see the one-word diff you intended to make. I'm sure you didn't, but it'd be nice to have a passed test to prove it. |
|
Thanks for your patience, @mhvk! |
|
Thanks for all the comments, @eric-wieser -- the result is a much better new method than I had initially, one that does more at the same time as being smaller. |
In the process of trying to fix the "questionable behaviour in
MaskedArray.__eq__" (#8589), it became clear that the code was more than a little buggy. E.g.,ma == ma[0]failed ifmaheld a structured dtype; multi-d structured dtypes failed generally; and, more worryingly, a masked scalar comparison could be wrong:It doesn't help to do tests on data filled with 0 if one doesn't consistently check the mask after... (for the rest, see the new test cases)