Skip to content

Conversation

@mhvk
Copy link
Contributor

@mhvk mhvk commented Dec 3, 2020

Split off from #10423, to include only regular arrays and Quantity and its subclasses. From #10423, the general comments still hold:

It may be worth pointing out two specific things:

  • I very much on purpose decided to treat reductions different from MaskedArray - any time there is any items in the reduction masked, the result is masked as well. So, e.g., for .sum(), masked elements are not implicitly replaced by zero. The logic is that one cannot interpret the sum meaningfully without knowing how many elements actually contributed. Conversely, following that same logic, for methods where skipping a masked element does make sense (like .mean(), where one divides by the number of unmasked elements), I do use it.
  • Currently, the implementation does not support structured arrays. The question for those, which I haven't really found a "best" answer for, is whether the mask should be per full item, or whether the mask itself should also be a structured array, i.e., allow masking individual pieces. MaskedArray does the latter and it is quite a hassle to keep track, but perhaps better. EDIT: followed MaskedArray and have a structured mask.

This is definitely at a point where some further design choices have to be made, and it would be good to step back a bit and decide what exactly we want masking to mean.

Part of the difficulty is that it cannot immediately used for MaskedColumn, since some design choices which are different from MaskedArray, so it would need a kind of compatibility layer if we want to keep exact one-to-one matches, and/or some way to decide which type one wants.

fixes #1852, #1857, #7367, #10792

@mhvk
Copy link
Contributor Author

mhvk commented Dec 7, 2020

@taldcroft - I added brief documentation as well as more comments and docstrings to the code. I still need to think a bit about the auto-generation with Masked(UnmaskedClass)...

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk - thanks, the docs are a good place to start some API discussions, and a gentle introduction to reviewing this. :-)

@taldcroft
Copy link
Member

Question - will this be able to support setting the mask via __setitem__ in a way like mq[5] = np.ma.masked? I like this idiom better than mq.mask[5] = True.

@taldcroft
Copy link
Member

Question 2 - what's your sense of how much it would take to apply this to a non-ndarray class like SkyCoord (and even Time if Masked really works out well). I'm not especially attached to the Nan in jd2 implementation, except that it basically works.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 7, 2020

Thanks, very useful!

On your questions:

will this be able to support setting the mask via __setitem__ in a way like mq[5] = np.ma.masked? I like this idiom better than mq.mask[5] = True.

Yes, it does (I like it too!). It is why __setitem__ looks slightly odd (_data_mask returns None, True for np.ma.masked):

    def __setitem__(self, item, value):
        value, mask = self._data_mask(value)
        if value is not None:
            self.unmasked[item] = value
        self.mask[item] = mask

what's your sense of how much it would take to apply this to a non-ndarray class like SkyCoord (and even Time if Masked really works out well). I'm not especially attached to the Nan in jd2 implementation, except that it basically works.

I think it should not be too hard, mostly because in the end it is quite similar to using nan - just need something that collects and sets the masks. I am hopeful that it will need just a single MaskedShapedLikeNDArray class, with everything that depends on ShapedLikeNDArray then automatically able to use it.

But it may still take some effort as one does need to ensure that masks are shared or otherwise kept consistent between attributes, and verifying that there are no functions or ufuncs required for the transformations which the current code doesn't deal properly with.

Aside: something I have spent too much time on (but is fun) is to get this to work reliably:
image

But it is different between an ANSI-emulating console and, say, a jupyter notebook in a browser:
image

@mhvk
Copy link
Contributor Author

mhvk commented Dec 8, 2020

@taldcroft - I updated the documentation page with your suggestions, and also made two changes that you had suggested elsewhere:

  • Ensure that, say, MaskedQuantity can be initialized just like a Quantity, but with a mask argument (keyword only);
  • Allow Masked(Quantity), etc., as a short-cut to create a "default" masked class.

Both were fairly easy and will extend without problem to container classes like SkyCoord (once those work in the first place, of course...). So, that gives some more confidence that this route is actually a good one!

I'm trying to think what point is best to stop for this PR. One option would be about where it is, i.e., with a PR that does not yet touch anything else.

Another option might be to see to what extent the new MaskedQuantity can be used in a QTable wherever now MaskedColumn is used in a Table.

I'm also interested in experimenting with Time, which might well be easier that representations and coordinates because of the infrastructure you made there already, but that is almost certainly much better done in follow-up....

@mhvk
Copy link
Contributor Author

mhvk commented Dec 9, 2020

Just for amusement, some things already work better than I had thought possible:

In [1]: from astropy.utils.masked import Masked

In [2]: from astropy.coordinates import SkyCoord

In [3]: sc = SkyCoord([0., 10.], [20., 30.], unit='deg')

In [4]: msc = sc._apply(Masked, mask=[True, False])

In [5]: msc.ra, msc.dec
Out[5]: (<MaskedLongitude [———, 10.0] deg>, <MaskedLatitude [ ———, 30.0] deg>)

In [6]: msc
Out[6]: 
<SkyCoord (ICRS): (ra, dec) in deg
    [( 0., 20.), (10., 30.)]>

Latter is because display somehow loses the mask...

Also, propagation will need better support of gufuncs:

In [16]: msc.cartesian
TypeError: operand type(s) all returned NotImplemented from __array_ufunc__(<ufunc 's2c'>, '__call__', <MaskedLongitude [———, 10.0] deg>, <MaskedLatitude [ ———, 30.0] deg>, out=(<Quantity [[ 0., 20.,  1.],
           [10., 30.,  1.]]>,)): 'MaskedLongitude', 'MaskedLatitude', 'Quantity'

@taldcroft
Copy link
Member

@mhvk - amazing progress. I need to find a block of time to give this a deeper review, but to be honest I expect that to be more functional than deep line-by-line of all the new stuff. Overall I agree this may be a decent waypoint to get this into master in order to facilitate subsequent smaller, more manageable, PRs.

Just one comment which may disappoint you, but I'm not a huge fan of the strikethrough. There is this unresolvable philosophical schism of whether the data are "masked" or "invalid/missing". From the database perspective it is typically the latter. If you do a table join and there are missing elements, the data behind the mask is junk and should not be displayed. If you want a configuration option to use strikethrough that would be OK, but overall I'd prefer to not rock the boat quite so much and just stick with -- for missing / masked data.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 9, 2020

OK, sounds good! Right now, in a separate branch where I've been trying, it is definitely only an option to use strikethrough - indeed, beyond your good point that usually you don't actually want to see what is behind the mask, it seems sphinx/.rst does not support strikethrough so it will be difficult to document it!

I don't think it needs to hold up the PR, but what I still do hope to ensure is that all erfa ufunc work (right now some of the gufunc among them do not - one needs to infer what axis is being acted on so one act on the mask similarly; not super-difficult, but a bit tricky and tedious). I'm really hopeful, though, that the follow-up PR for containers like SkyCoord will turn out to be quite easy!

@mhvk
Copy link
Contributor Author

mhvk commented Dec 10, 2020

Just made a few further updates to ensure that representations, coordinates and times can propagate masks if they are present (for Time, needs a bit of a hack to initialize).

@taldcroft
Copy link
Member

@mhvk - just to let you know that I've been a bit slammed by day-job stuff but this is still on my radar. Holidays are coming!

@mhvk
Copy link
Contributor Author

mhvk commented Dec 27, 2020

@taldcroft - no worries if this takes some time... it is a very big PR.

In the meantime, I added support for structured dtype, since that is needed for Time.

I also have branch that uses this internally in Time, which works well (except I still have to get saving to files to work). One question: In principle, I could make it such that if you input np.ma.MaskedArray, you also get that class out for all .to_value(), etc. Is this worth doing? I worry it will break more than it helps, since np.ma.MaskedArray does not work for quite a few cases (e.g., all erfa ufuncs...). My tendency would be to accept it on input, but always use Masked inside.

@mhvk
Copy link
Contributor Author

mhvk commented Jan 4, 2021

Rebased and updated with substantially increased coverage of numpy functions. Somewhat better documentation of functions where behaviour is non-obvious.

@taldcroft
Copy link
Member

My tendency would be to accept it (np.ma.MaskedArray) on input, but always use Masked inside.

Agreed. I think it would be confusing / fragile to do otherwise.

@mhvk mhvk force-pushed the utils-masked-class branch from 2801218 to 0389951 Compare April 13, 2021 20:02
utils:
- cextern/expat/**/*
- any: ['**/utils/**/*', '!astropy/utils/iers/**/*', '!docs/utils/iers.rst']
- any: ['**/utils/**/*',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pllim - do these additional changes to the labeller make sense? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's correct.

p.s. 100 commits and +5,427 −112 😱

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this was quite a bit of work. Though most of the lines is somewhat boilerplate in function_helpers and lots of tests...

@taldcroft
Copy link
Member

taldcroft commented Apr 13, 2021

I vote for 4.3 based on precedent from modeling. Essentially all of this is in a new sub-subpackage and it has no impact on existing code. At least that is my understanding. Getting more testing on this from a broad audience will be crucial to flushing out issues and potentially API details.

@taldcroft
Copy link
Member

For 5.0 we should have Masked fully integrated with the rest of astropy! No more mixin column table operations exceptions for SkyCoord and Quantity.

@mhvk
Copy link
Contributor Author

mhvk commented Apr 13, 2021

OK, sounds good! Since #11210 is in, and it seems the rebase was OK too, I'll merge, even though I am sure that if I went over the code again, I'd find some things to improve. But let that be follow-up that is more reviewable!

Thanks for the reviews, @taldcroft and @nstarman.

@mhvk mhvk merged commit dc3927e into astropy:main Apr 13, 2021
@mhvk mhvk deleted the utils-masked-class branch April 13, 2021 20:51
@pllim
Copy link
Member

pllim commented Apr 13, 2021

I think this definitely needs a What's New entry. 😉

@mhvk mhvk mentioned this pull request Apr 13, 2021
@taldcroft
Copy link
Member

🎉 💥

@pllim
Copy link
Member

pllim commented Apr 14, 2021

I think this broke pyinstaller job. See https://github.com/astropy/astropy/runs/2339389527?check_suite_focus=true

@mhvk
Copy link
Contributor Author

mhvk commented Apr 14, 2021

Darn. But that's an easy fix at least (tells you how long ago this PR started, that I still had relative imports...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement MaskedQuantity

5 participants