
Allow longdouble values through input validation #9368

Merged
mhvk merged 2 commits into astropy:master from aarchiba:flexible-time-input
Oct 23, 2019

Conversation

@aarchiba
Contributor

Description

This pull request is to make it possible for third-party code to take higher-precision floating-point values as input, in order to more easily work with the extended precision available from astropy Time objects. This would be relevant to #9361 for example.

Fixes #9341

@astropy-bot astropy-bot bot added the time label Oct 15, 2019
@pllim pllim requested review from mhvk and taldcroft October 15, 2019 15:19
Contributor

@mhvk mhvk left a comment


Thanks! Some smaller comments, but also a general one: ideally, one could just put in higher-precision floats and it would "just work". I think this may be possible by changing day_frac to special-case higher precision, e.g. by adjusting two_sum with some calls that would be no-ops for regular float (NOTE: NOT TESTED!)

def two_sum(a, b):
    x = np.asanyarray(a + b, float)
    eb = x - a  # bvirtual in Shewchuk
    ea = x - eb  # avirtual in Shewchuk
    eb = b - eb  # broundoff in Shewchuk
    ea = a - ea  # aroundoff in Shewchuk
    return x, (ea + eb).astype(float, copy=False)

It may be safer to just pre-condition the arguments, a bit like done in quantity_day_frac, doing a careful divmod(val[12], 1) before adding up the values and residuals.
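The preconditioning idea could look something like the following minimal sketch (my own illustration, not astropy code; `precondition` is a hypothetical name). Splitting off the integer part with divmod before downcasting means the integer part survives the float64 cast exactly (up to 2**53), and only the sub-unit fraction is exposed to rounding:

```python
import numpy as np

def precondition(val):
    # Hypothetical helper, not astropy code: split a (possibly
    # higher-precision) value into an exact integer part and a small
    # fractional remainder before downcasting to float64.  The integer
    # part is exactly representable; only the |frac| < 1 remainder
    # suffers float64 rounding.
    day, frac = np.divmod(val, 1)
    return np.float64(day), np.float64(frac)

day, frac = precondition(np.longdouble("58000.12345678901234567"))
```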

@pytest.mark.parametrize("f", ["mjd", "unix", "cxcsec"])
def test_existing_types_refuse_longdoubles(f):
    with pytest.raises(ValueError):
        Time(np.longdouble(58000), format=f)
Contributor

Not sure we'd want these tests to fail!

Contributor Author

Ah. Well. Hmm. I want precision loss to require explicit conversion, and none of these types preserve long double precision (yet?).

Contributor

I think all of them start with day_frac (in TimeFromEpoch.set_jds) so they might all support it if that routine is adjusted.

Contributor Author

The tests now all permit either outright rejection of long doubles or acceptance provided that the precision is preserved.

Member

I'm not sure about precision loss requiring explicit conversion (else raising an unavoidable exception). This will outright break existing code that happens to be using long doubles, and I'm almost always against breaking code without a strong reason.

In general I prefer to be pragmatic about things like this, letting users be responsible for decisions about data types and, when it matters, expecting them to have some cognizance of the implementation details. The second paragraph of the Time documentation starts with "All time manipulations and arithmetic operations are done internally using two 64-bit floats to represent time." That is a pretty good starting point for someone to infer that providing a higher-precision input might not be helping.

But as a compromise it might be fine to issue a warning in the case of potential precision loss.

Contributor Author

Our code currently emits a blizzard of warnings even for perfectly normal operation, so in fact people grow used to ignoring warnings. This is seen with LaTeX as well - too many warnings and they are ignored.

What is the use case where users pass long doubles in to these functions and would prefer lost precision to exceptions? Recall that manual downconversion is as easy as applying float.

Contributor

If we are not fixing the problem here, then raising an exception seems reasonable.

Member

I think of exceptions as being for things that are definitely wrong or erroneous, and warnings for things that might lead to unexpected results. I don't think "some code or package issues a blizzard of warnings" is a good argument for ignoring the entire Python warning system.

If I do Time(np.longdouble(1), format='cxcsec'), there is no reason that should fail to execute. It is perfectly legal and gives the result that I expect. But I do think it is reasonable to issue a warning about precision, which an informed user can take into consideration.

It may be that somebody is already using long-double for some reason that we don't know about, and they understand what is happening, and they don't want their code to suddenly break for no good reason with astropy 4.0. I sometimes do things that might seem irrational or downright stupid to someone else, but we all have our reasons.

Contributor

Ah, convinced the other way again. Let's just agree to fix it!

Contributor Author

I agree on this - let's get something like #9361 merged and then there won't be much question left what to do with longdoubles because they will work.

@aarchiba
Contributor Author

> Thanks! Some smaller comments, but also a general one: ideally, one could just put in higher precision floats and it would "just work". I think it may be possible to do so by changing day_frac to special-case higher-precision - it may be possible to do so by adjusting two_sum with some calls that would be no-op for regular float (NOTE: NOT TESTED!)
>
> def two_sum(a, b):
>     x = np.asanyarray(a + b, float)
>     eb = x - a  # bvirtual in Shewchuk
>     ea = x - eb  # avirtual in Shewchuk
>     eb = b - eb  # broundoff in Shewchuk
>     ea = a - ea  # aroundoff in Shewchuk
>     return x, (ea + eb).astype(float, copy=False)
>
> It may be safer to just pre-condition the arguments, a bit like done in quantity_day_frac, doing a careful divmod(val[12], 1) before adding up the values and residuals.

This would absolutely call for testing with hypothesis, once we figured out exactly what the criterion was for successful computation.

Well, okay, I think two_sum should do exactly what it does regardless of the type it is passed: compute the sum to the maximum precision available with those types. But making day_frac operate appropriately would take some doing, especially if we want it to behave well when also passed Quantities. And definitely testing.

@mhvk
Contributor

mhvk commented Oct 15, 2019

Yes, makes sense to have two_sum be type-agnostic. There is already a quantity_day_frac so there is precedent... But, yes, substantial testing needed...

@aarchiba
Contributor Author

> Yes, makes sense to have two_sum be type-agnostic. There is already a quantity_day_frac so there is precedent... But, yes, substantial testing needed...

Sorry for the repetition, but my experience with hypothesis in getting the kinks out of PINT's numerical conversion code was extremely positive. It nailed down some difficult-to-spot bugs - including finding leap-second-day problems before I had even told it about leap second days - and programming it was easy (and had that delightful pieces-clicking-together feeling I get with Haskell).
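As a concrete illustration of the kind of property test being discussed (my own sketch, not PINT's or astropy's actual test suite), hypothesis can check two_sum's defining invariant, that the returned pair represents a + b exactly, using exact Fraction arithmetic as the oracle:

```python
from fractions import Fraction

from hypothesis import given
from hypothesis.strategies import floats

def two_sum(a, b):
    # Shewchuk's error-free transformation, float64-only in this sketch.
    x = a + b
    eb = x - a
    ea = x - eb
    eb = b - eb
    ea = a - ea
    return x, ea + eb

# Bound the magnitudes so a + b cannot overflow to infinity, where the
# error term would degenerate to nan.
finite = floats(min_value=-1e300, max_value=1e300)

@given(finite, finite)
def test_two_sum_is_exact(a, b):
    x, err = two_sum(a, b)
    # Fraction arithmetic is exact, so this checks x + err == a + b
    # with no rounding at all.
    assert Fraction(x) + Fraction(err) == Fraction(a) + Fraction(b)

test_two_sum_is_exact()  # calling the decorated function runs hypothesis
```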

@mhvk
Contributor

mhvk commented Oct 15, 2019

I'm all for trying hypothesis on Time - I think it is a much easier target than units.

@aarchiba
Contributor Author

Maybe this is off-topic for this PR, but I also found myself poking around to confirm that there is no data validation on the jd1 and jd2 values set in Time objects, and no validation of the types returned by .value - yet there are strict and hard-to-explain requirements on both. As a result, implementing custom time formats involves a lot of impenetrable error messages. I'd like to add validation on both ends: ensure that a Time object's jd1 and jd2 are always float64 arrays of the same shape no matter what a custom format tries to set them to, and ensure that .value always returns an array of that same shape (or a scalar, as appropriate); also that implementors of custom time formats get helpful error messages when their .value methods don't do this correctly. (It would be nice to be able to provide element-by-element conversion functions that would automatically be broadcast; these would be too slow for genuinely numeric arrays, but for object arrays they're no slower, and handling arrays of arbitrary shape by hand is awkward; see https://github.com/nanograv/PINT/blob/a4b33eae5fce21a6010e8665bfeecf9caec8cf96/src/pint/pulsar_mjd.py#L501 for the mess.)
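As a rough sketch of what such validation might look like (entirely hypothetical, names and all; not proposed astropy API), the setter-side check is just a dtype and shape comparison that fails with a readable message:

```python
import numpy as np

def validate_jd_pair(jd1, jd2):
    # Hypothetical validator, not astropy code: enforce the documented
    # but currently unchecked contract that jd1 and jd2 are float64
    # arrays of the same shape, with readable errors on violation.
    jd1 = np.asarray(jd1)
    jd2 = np.asarray(jd2)
    if jd1.dtype != np.float64 or jd2.dtype != np.float64:
        raise TypeError(
            f"jd1 and jd2 must be float64, got {jd1.dtype} and {jd2.dtype}")
    if jd1.shape != jd2.shape:
        raise ValueError(
            f"jd1 shape {jd1.shape} does not match jd2 shape {jd2.shape}")
    return jd1, jd2
```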

@aarchiba
Contributor Author

So uh.

Right now we quietly ignore it if someone passes a val2 when creating a Time object for a string format. That is, the user provides some information in val2, and astropy throws it away and doesn't tell them. I do not approve.

I added some ValueErrors in this case, only to find that astropy has a test for explicitly this behaviour! Astropy's tests demand that we silently throw away information explicitly provided by the user. What? Can I change that? Please?

@mhvk
Contributor

mhvk commented Oct 15, 2019

@aarchiba - I agree ignoring val2 rather than checking it is None as appropriate is bad, and have meant to address that myself. So, happy to change, but let's do that in a separate PR!

@taldcroft and I also discussed that somewhat of a design mistake is to have jd1 and jd2 on the TimeFormat class - it would make more sense to just let it do the transformations to and from input.

p.s. Your mjds_to_str would seem rather like Time._shaped_like_input...

@aarchiba
Contributor Author

> @aarchiba - I agree ignoring val2 rather than checking it is None as appropriate is bad, and have meant to address that myself. So, happy to change, but let's do that in a separate PR!

I'll pull it out. I have a bad habit of fixing minor problems as they fly past, which gets them fixed but makes my PRs messy.

> @taldcroft and I also discussed that somewhat of a design mistake is to have jd1 and jd2 on the TimeFormat class - it would make more sense to just let it do the transformations to and from input.

This does seem awkward. We are sort of stuck with it, though properties allow for some deviousness.

> p.s. Your mjds_to_str would seem rather like Time._shaped_like_input...

Ah, yes, probably. I've had to write similar code too many times now to get functions to behave like ufuncs in terms of correctly maintaining array-ness (scalar->scalar, array->array including zero-dimensional) and shape. It's worse when there are multiple input arrays that must be broadcast together. I thought numpy.vectorize might get me this for free, but at least for strings it didn't quite. Modern numpy.vectorize can at least let a vector-capable function act on an entire one-dimensional array, which broadens its applicability.

@mhvk
Contributor

mhvk commented Oct 15, 2019

On ufuncs, see numpy/numpy#13105 - would definitely be good to have a solution...

@aarchiba
Contributor Author

> On ufuncs, see numpy/numpy#13105 - would definitely be good to have a solution...

I'm not too hopeful, I remember frustration and argument about numpy zero-dimensional arrays versus scalars on the numpy mailing lists more than a decade ago. But it sure would be nice.

@aarchiba
Contributor Author

Validation moved to #9373

@aarchiba
Contributor Author

> On ufuncs, see numpy/numpy#13105 - would definitely be good to have a solution...
>
> I'm not too hopeful, I remember frustration and argument about numpy zero-dimensional arrays versus scalars on the numpy mailing lists more than a decade ago. But it sure would be nice.

One of my ancient-times fixes was to record the input array shape (or the fact that it wasn't an array), reshape the input into a 1-d array, do the vectorized calculation, then reshape the result back to the original shape (unpacking if the input wasn't an array). We could do that, though since we allow Time objects to be indexed and reshaped it would be more complicated. We could at least record whether the input was an array or not, so as to know whether to strip off the array boxing on output.
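The shape round-trip described above can be sketched like this (a hypothetical helper of my own, not astropy's _shaped_like_input): record array-ness, flatten, compute, reshape, and unbox scalars on the way out.

```python
import numpy as np

def apply_elementwise(func, val):
    # Hypothetical helper: apply a scalar -> scalar ``func`` while
    # preserving array-ness and shape: plain scalar in, plain scalar
    # out; array in (including 0-d), array of the same shape out.
    was_array = isinstance(val, np.ndarray)
    arr = np.asarray(val)
    flat = arr.ravel()
    out = np.array([func(x) for x in flat]).reshape(arr.shape)
    if not was_array:
        return out[()]  # unbox the 0-d result back to a scalar
    return out
```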

@mhvk
Contributor

mhvk commented Oct 15, 2019

I think the shaping, etc., and possibly returning to np.float64 instead of python float, could be done as part of _shaped_like_input. Though I'm not sure in the Time case it matters all that much whether it is a np.float64 or a float - I can only see arguments for never returning a python type, just to ensure output always has a shape, etc.

@mhvk
Contributor

mhvk commented Oct 16, 2019

@aarchiba - on day_frac: like two_sum, it currently works fine on any type of float - it may make sense to do the conversion at the end of it - in particular, just before where the second rounding is done:

day = day.astype(np.float64, copy=False)
frac = frac.astype(np.float64, copy=False)
excess = np.round(frac)
...

@aarchiba
Contributor Author

> @aarchiba - on day_frac: like two_sum, it currently works fine on any type of float - it may make sense to do the conversion at the end of it - in particular, just before where the second rounding is done:

The day_frac with multiplication/division is not type-agnostic, because two_prod is not - it has the "split" function with a magic constant that would need to be adapted for each type (which I think is not too hard).
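For reference, the "magic constant" in Dekker's split is 2**ceil(p/2) + 1, where p is the significand precision in bits, so a type-agnostic version could derive it from np.finfo rather than hard-coding the float64 value. This is a sketch of the idea only, not astropy's actual two_sum/two_prod code:

```python
import numpy as np

def splitter(ftype):
    # Dekker's split constant 2**ceil(p/2) + 1 for a float type with
    # p significand bits (np.finfo(...).nmant counts the stored bits,
    # so p = nmant + 1).  For float64 this gives 2**27 + 1 = 134217729.
    p = np.finfo(ftype).nmant + 1
    return ftype(2 ** ((p + 1) // 2) + 1)

def split(a):
    # Dekker's algorithm: express a exactly as hi + lo, each half
    # carrying roughly half of the significand bits.
    ftype = type(a)
    c = splitter(ftype) * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo
```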

More locally, I discovered in writing #9375 that you can't run jd1,jd2 through day_frac upon storage because some tests check for (and presumably some users rely upon?) their not being copied.

@mhvk
Contributor

mhvk commented Oct 16, 2019

Ah, good point about two_prod.

assert getattr(t, custom_format_name) != getattr(t2, custom_format_name)


def test_mjd_unit_validation():
Member

These two unit conversion tests are unrelated to the code change, but presuming they are just there to improve coverage, it's fine to leave them.

Contributor Author

It was not at all obvious to me that the code did the right thing in these cases so I wrote test cases, and I concluded that it did but that the added test cases were harmless. I'm not actually sure how they impact coverage numbers.

@aarchiba aarchiba force-pushed the flexible-time-input branch from b300306 to a044ee4 Compare October 18, 2019 17:46
Contributor

@mhvk mhvk left a comment


Getting longdouble to work generically in #9361 needs this, so since the only remaining comment is about a possibly superfluous test, I'm going to merge this and rebase #9361 on top.

Thanks, @aarchiba!

@mhvk mhvk merged commit 99f4f59 into astropy:master Oct 23, 2019
@aarchiba aarchiba deleted the flexible-time-input branch October 23, 2019 20:34


Development

Successfully merging this pull request may close these issues.

Impossible to input np.longdouble into custom time formats

5 participants