Skip to content

Conversation

@mhvk
Copy link
Contributor

@mhvk mhvk commented Jun 1, 2020

EDIT: the base implementation is in #11127 - this one tries to use Masked for columns and gives a sense how it would work for representations.

@astrofrog - as per your request in #10119, here's what I have. I got things to working quite well with Quantity and all its subclasses; the first thing to look at may be the quite detailed tests (which should be useful for whatever we end up implementing).

I had started on MaskedColumn and had also tried to move the information up in representations (to go for coordinates) but that is not finished. Part of the difficulty with MaskedColumn is that I made some design choices which are different from MaskedArray, so it would need a kind of compatibility layer if we want to keep exact one-to-one matches, and/or some way to decide which type one wants.

It may be worth pointing out two specific things:

  • I very much on purpose decided to treat reductions different from MaskedArray - any time there is any items in the reduction masked, the result is masked as well. So, e.g., for .sum(), masked elements are not implicitly replaced by zero. The logic is that one cannot interpret the sum meaningfully without knowing how many elements actually contributed. Conversely, following that same logic, for methods where skipping a masked element does make sense (like .mean(), where one divides by the number of unmasked elements), I do use it.
  • Currently, the implementation does not support structured arrays. The question for those, which I haven't really found a "best" answer for, is whether the mask should be per full item, or whether the mask itself should also be a structured array, i.e., allow masking individual pieces. MaskedArray does the latter and it is quite a hassle to keep track, but perhaps better.

Anyway, suggestions very welcome. Indeed, this is definitely at a point where some further design choices have to be made, and it would be good to step back a bit and decide what exactly we want masking to mean.

cc @taldcroft

EDIT: Masked column was very much a work in progress. Tests still passed without the 3 last commits (i.e., at 379a881)

@mhvk mhvk marked this pull request as draft June 1, 2020 15:00
@mhvk mhvk mentioned this pull request Jun 1, 2020
@mhvk mhvk force-pushed the utils-masked-class-applications branch from b2e2212 to 1f7b14f Compare June 1, 2020 19:36
@taldcroft
Copy link
Member

@mhvk - without actually looking at this, can you talk to the rationale for trying to apply this to MaskedColumn in a single PR? Offhand this seems like a huge PR and maybe it would be best to try to break it into smaller bits starting with only Quantity.

@mhvk
Copy link
Contributor Author

mhvk commented Dec 3, 2020

Yes, I think the reason I pushed this here was just to show that it could be done, but would involve some API changes. But I kept separate branches even locally. I am in the process of rebasing and will push the quantity-only one separately. There's a bit of a chicken-and-egg problem in that ideally there be more documentation than the tests, but to find the energy to write that, I'd like to get a sense that this is actually going to get used...

@mhvk
Copy link
Contributor Author

mhvk commented Dec 3, 2020

@taldcroft - see #11127. Most important would be to decide on the points noted there...

@mhvk
Copy link
Contributor Author

mhvk commented Dec 3, 2020

@taldcroft - I'm wondering if going the route of an APE would be good to get our thoughts properly organized and checked. Would you be willing to think along? We could perhaps start with a wiki page or a google docs or so.

E.g., another design decision I made was to keep masking close to ndarray, always below other classes like Quantity. Like with using nan for masked, this means a lot of things go automatic, and it means the masking code does not have to know about details of Angle, Time, SkyCoord, etc. (but unlike for nan, where one would need np.nanmean, etc., the performance cost for masking is always born just by the masked classes).

mhvk added 3 commits December 3, 2020 16:47
Makes almost all of test_column.py pass. The exception is a structured
dtype, which MaskedNDArray cannot yet handle.
@mhvk mhvk force-pushed the utils-masked-class-applications branch from 1f7b14f to a84aeb9 Compare December 3, 2020 21:52
@taldcroft
Copy link
Member

taldcroft commented Dec 5, 2020

@mhvk - looking again at this and #11127, I realize there is a lot of value in this PR that shows both the base implementation and how it gets used in practice. That (with the tests) is the effective first version of documentation for developers, and it proves that the infrastructure supports the intended use cases within astropy.

Some points / questions:

At this point it seems the only way to make a MaskedQuantity is with something like Masked([1,2]*u.m). With this the MaskedQuantity class is generated on the fly. While I see the nice generalizability of that idiom, it seems like for general usability there need to be fixed class initializers like MaskedQuantity (and indeed Masked<mixin_class> for all the mixin classes). This will allow isinstance tests and initializing with a mask kwarg.

If I look at the MRO of Masked([1,2]*u.m) it shows astropy.utils.masked.core.MaskedQuantity but that doesn't actually exist. This is confusing.

About MaskedColumn and the API changes from MaskedArray. I think that given the long heritage of MaskedColumn, we are going to need to leave MaskedColumn untouched for now. Even if you can get a Masked(Column) version passing our tests, I'm not convinced those tests are covering all the use in the wild. People are pretty creative...

My suggestion would be to get to the point where the new infrastructure is passing our tests (already there, I guess?), and then back out changes to MaskedColumn before merging. After a release then introduce an opt-in version of MaskedColumn based on Masked which people can use via a configuration option. The main reason for users to do this would be better consistency for reduction operations. (Others drivers?)

For me the next step here would be to modify table/operations.py to support all operations for all mixins! If you are somewhat careful about commits and keep the "applications" separate from the infrastructure (basically the new astropy.utils.masked subpackage, then I think it would make sense to merge the infrastructure commits without the applications. (I guess you can keep on cherry-picking infrastructure commits into #11127.) Since this is a new subpackage we should definitely try to break into manageable chunks and merge as soon as we are convinced it will be accepted (even if incomplete and there are follow-on actions like docs). I'm OK with an new/experimental package in astropy not being perfect.

There is of course the elephant in the room of whether this functionality is so widely useful that it should be released as a separate package. 😄 Anyway, thanks for the amazing work.

@taldcroft
Copy link
Member

Can we have this work:

MaskedQuantity = Masked(Quantity)

where this accepts all the Quantity initializers and adds mask?

@taldcroft
Copy link
Member

One thing that isn't quite working now:

In [21]: sc                                                                     
Out[21]: 
<SkyCoord (ICRS): (ra, dec) in deg
    [(1., 3.), (2., 4.)]>

In [22]: Masked(sc)                                                             
Out[22]: 
MaskedNDArray([<SkyCoord (ICRS): (ra, dec) in deg
    (1., 3.)>,
               <SkyCoord (ICRS): (ra, dec) in deg
    (2., 4.)>], dtype=object)

@mhvk
Copy link
Contributor Author

mhvk commented Dec 5, 2020

@taldcroft - thanks for the broad review and thoughts! I agree with the plan to start with just the infrastructure, so that initially nothing truly depends on it and we can experiment without feeling we've locked in the API (including e.g. structured dtype). With that, I think the first goal would then be to get #11127 ready for merger, without including any of the extra commits here.

We probably should discuss in #11127 what is still missing - looking at it after nearly half a year myself it would be good to have a basic design description for developers (can be just in the module docstring) as well as perhaps a very brief overview in the docs (which may include a short list of differences for people used to MaskedArray).

Explicit MaskedQuantity, etc., would definitely be good but probably for a follow-up PR, so that review can focus on how that actually works in practice (e.g., info might want to gain a mask attribute to generalize between options). An explicit goal would then be for all the tests that now work with unit-full MaskedColumn to work with the new MaskedQuantity....

I very much like the idea of Masked(Quantity) creating/returning the MaskedQuantity class, and have that have an initializer which just adds a mask keyword argument to the standard one! One might also consider creating some kind of @maskable class decorator which would check for a mask argument and, if present, return the Masked version while just passing through to the regular class if not. (This of course does not preclude the explicit version.) Note that it is already possible to get a MaskedQuantity by passing in masked array input, as in u.Quantity(Masked(data, mask), u.m) or ma * u.m.

I also agree with the suggestion to drop the idea of making a direct replacement for MaskedColumn, and instead aim for an configuration option to switch it out for a new Masked(Column) which behaves just like the other Masked instances.

On more complicated mixins: Masked as written only works on ndarray, since it inserts itself just above the ndarray. In a way, it is very much like your nan-for-masked version, which I realized had the enormous benefit that Masked only had to know how to deal with arrays, and that this was also all that was needed, since the behaviour in Quantity and other subclasses basically does not depend on whether items are masked. But it does mean that higher-order classes like SkyCoord will need some general machinery that collects masks from the underlying arrays, similar to what you implemented for Time. I think it may well be possible to generalize that, just like reshaping is generalized via ShapedLikeNDArray, e.g., by inheriting from a MaskableNDArrayContainer. Indeed, it may well be possible to make Masked(SkyCoord) work by recognizing not just subclasses of ndarray but also of ShapedLikeNDArray!

One nice thing is that since the erfa functions are all ufuncs, mask propagation for those is already working!

@mhvk
Copy link
Contributor Author

mhvk commented Apr 13, 2021

Going to close this one, since now that the base implementation is merged, it is better to just slowly add pieces.

@mhvk mhvk closed this Apr 13, 2021
@mhvk mhvk deleted the utils-masked-class-applications branch April 13, 2021 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants