Skip to content

Allow subformats for Numeric Time formats like MJD, JD#9361

Merged
taldcroft merged 22 commits intoastropy:masterfrom
mhvk:time-output-subformats
Oct 27, 2019
Merged

Allow subformats for Numeric Time formats like MJD, JD#9361
taldcroft merged 22 commits intoastropy:masterfrom
mhvk:time-output-subformats

Conversation

@mhvk
Copy link
Contributor

@mhvk mhvk commented Oct 13, 2019

Inspired by #9358, a trial to allow Decimal and str input for format='mjd', and also allow one to select such output. This could be easily extended to jd, etc.

With this PR (EDIT: updated to str, bytes)

In [1]: from astropy.time import Time

In [2]: from decimal import Decimal

In [3]: Time('54321.01234567890123456789', format='mjd')
Out[3]: <Time object: scale='utc' format='mjd' value=54321.012>

In [4]: Time(b'54321.01234567890123456789', format='mjd', precision=19)
Out[4]: <Time object: scale='utc' format='mjd' value=b'54321.0123456789012345580'>

In [5]: Time(Decimal('54321.01234567890123456789'), format='mjd')
Out[5]: <Time object: scale='utc' format='mjd' value=54321.01234567890123455802254>

In [6]: t = Time(54321., [0., 1e-9, 1e-12], format='mjd', precision=13)

In [7]: t.value
Out[7]: array([54321., 54321., 54321.])

In [8]: t.out_subfmt = 'str'

In [9]: t.value
Out[9]: 
array(['54321.0000000000000', '54321.0000000010000',
       '54321.0000000000010'], dtype='<U19')

In [10]: t.out_subfmt = 'object'

In [11]: t.value
Out[11]: 
array([Decimal('54321.0'), Decimal('54321.00000000100000002722922'),
       Decimal('54321.00000000000099997787828')], dtype=object)

In [12]: t.out_subfmt = 'float128'

In [13]: t.value
Out[13]: array([54321., 54321., 54321.], dtype=float128)

In [14]: t.value - t.value[0]
Out[14]: array([0.00000000e+00, 1.00000008e-09, 9.98312544e-13], dtype=float128)

cc @aarchiba, @taldcroft

@aarchiba
Copy link
Contributor

I do like that this is fairly compact but is there a way to convert a Time to string or long without either modifying or copying it?

@mhvk
Copy link
Contributor Author

mhvk commented Oct 13, 2019

Right now, the best one could do is something like Time(<time object>, format='mjd', out_subfmt='U').value

@aarchiba
Copy link
Contributor

Right now, the best one could do is something like Time(<time object>, format='mjd', out_subfmt='U').value

That requires a copy of the entire array, right? I mean, it's about to turn into a string array, so the cost isn't necessarily prohibitive (though it's more unpleasant for longdoubles) but it's unpleasant and the syntax is cumbersome.

Can I also add my disapproval of "U" as a way to indicate "string"? The letter doesn't even appear!

@mhvk
Copy link
Contributor Author

mhvk commented Oct 13, 2019

I don't mean this to be final in any way - certainly, one can pick better names of the subfmt - right now they are just numpy dtype strings - and one could imagine extending the t.<format> nomenclature to allow t.<format>_<subfmt> (since that is interpreted within __getattr__ anyway).

The main advantage I see for this technique is that it allows general flexibility on input, yet also the ability to insist on a given format (using in_subfmt).

@mhvk
Copy link
Contributor Author

mhvk commented Oct 13, 2019

p.s. By default, instantiating a Time from another one does not copy the jd1 and jd2 arrays - a leftover from when Time was immutable.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 14, 2019

OK, change subfmt to bytes and string; Decimal still is O, which is hardly logical, but can obviously be changed as well.

@aarchiba
Copy link
Contributor

I don't mean this to be final in any way - certainly, one can pick better names of the subfmt - right now they are just numpy dtype strings - and one could imagine extending the t.<format> nomenclature to allow t.<format>_<subfmt> (since that is interpreted within __getattr__ anyway).

I would feel a lot better about this. One wouldn't want it to conflict with format names containing underscores ("pulsar_mjd" springs to mind) but this seems like quite a manageable issue.

The main advantage I see for this technique is that it allows general flexibility on input, yet also the ability to insist on a given format (using in_subfmt).

I can see it in that light - we can't really allow guessing across formats, but the MJD format has a quite restricted range of sensible inputs, over which guessing is quite reasonable. I'm not totally comfortable with the "stickiness" of subformats in this context - if you just ask a Time object t for t.mjd, you get a string or long or decimal or float depending on how it was constructed. (I'm not sure changing the subformat on read is a good answer here, in terms of consistency with the rest of astropy.)

(Now I'm wondering whether we should be talking about pulsar_utc as a scale rather than pulsar_mjd as a format? Not really relevant here, except in that it would be nice if the multi-subformat code could be re-used for other formats, presumably including JD.)

What about mjd_pair? One could use autodetection to choose mjd_pair as the subformat for Times that were built with two floats instead of one (presumably indicating that the user cares about extra precision), which would result in a default output subformat that also retained the pair-of-floats representation. And on output it would provide a less exotic way to retain the precision as well.

I realize these formats appear to be proliferating, but we are starting from a position of not really having a good way to import/export extended time precision at all. And there's no really canonical way to handle extended precision in python, especially as we provide more than londouble can hande.

Also: what do we do on platforms like MSVC where longdouble == float? Do we still implement the _long versions? It would also be nice to test the longdouble stuff somewhere longdouble == binary128 (e.g. power8 or arm64, yes my telephone has quad precision), and this might actually be possible on Travis:

@mhvk
Copy link
Contributor Author

mhvk commented Oct 14, 2019

I also wasn't sure about keeping the input format for the output - though perhaps it is most consistent with what Time does when it auto-guesses the string format.

And while I think auto-guessing may be OK for .value, I don't think it is very good for .mjd.

Quite possibly this is still better done with subclasses, i.e.,

class TimeMJDString(TImeMJD, TimeDecimalString)
    pass

class TImeMJDLong(TimeMJD, TimeLong):
    pass

But it still seems easier to just share code directly.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 15, 2019

OK, next commit allows

In [2]: t = Time(54321., [0., 1e-9, 1e-12], format='mjd', precision=13)

In [3]: t.mjd_str
Out[3]: 
array(['54321.0000000000000', '54321.0000000010000',
       '54321.0000000000010'], dtype='<U19')

Also removed the part where out_subfmt was set to whatever the input was. Bad idea.
(Note: not really supporting Decimal yet, and also need to think what exactly to do with, e.g., mjd_float, mjd_float64, etc.)

@aarchiba
Copy link
Contributor

aarchiba commented Oct 15, 2019

This may be a side issue, but does setting precision allow the user to generate non-exact string representations? How can the user ensure exact string representations? That is, I want Time->string->Time to be exact, or at least within a couple times np.finfo(float).eps*u.day, and the strings reasonably short.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 15, 2019

It does, like for iso and isot. It might be an idea to try to define a precision -1 or so that uses the number of digits needed to represent the time uniquely. But perhaps for another PR...

@mhvk mhvk changed the title Allow subformats for MJD format. Allow subformats for Numeric Time formats like MJD, JD Oct 16, 2019
@mhvk
Copy link
Contributor Author

mhvk commented Oct 16, 2019

OK, extended to all numerical time formats. Still not quite sure this is the right way to go about it... @taldcroft? Also, some other things are more important to try to wrap up before the 4.0 feature freeze (masked quantities and quantities with uncertainties...).

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk - this looks impressive!

I guess I have one big picture question. For the most part it seems the real point of supporting strings or Decimal for input/output is to get around the 64-bit precision for val. Would it be sufficient if 128-bit inputs were correctly handled (no precision loss) and if there was a convenient way (TBD, maybe your attr modifier or maybe something more object-y) to get 128-bit outputs for all the numerical formats?

Somehow it seems simpler and if users need Decimal or string on either end they can deal with that themselves as long as precision out to 128-bits is maintained (again, on input and with .value or .mjd).

def precision(self, val):
del self.cache
if not isinstance(val, int) or val < 0 or val > 9:
if not isinstance(val, int) or val < 0 or val > 19:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ERFA (or something along the way) doesn't like precision > 9:

In [2]: t = Time.now()

In [3]: t.iso
Out[3]: '2019-10-18 15:52:50.713'

In [4]: t.precision = 18

In [5]: t.iso
Out[5]: '2019-10-18 15:52:50.-00000001060580486'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be ERFA, I ran into this. Unfortunately, precision=9 only gets us nanoseconds in this format, and we need greater precision for MJDs (but ERFA is generally not needed there).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm sorry this is just a horrible hack. Probably the best solution is to let the format deal with validation of precision (and arguably the *_sub* attributes as well).

@aarchiba
Copy link
Contributor

I guess I have one big picture question. For the most part it seems the real point of supporting strings or Decimal for input/output is to get around the 64-bit precision for val. Would it be sufficient if 128-bit inputs were correctly handled (no precision loss) and if there was a convenient way (TBD, maybe your attr modifier or maybe something more object-y) to get 128-bit outputs for all the numerical formats?

Somehow it seems simpler and if users need Decimal or string on either end they can deal with that themselves as long as precision out to 128-bits is maintained (again, on input and with .value or .mjd).

I don't speak for Marten but I do use extended precision for times, and:

The problem is that almost no users actually have a 128-bit precision type. Intel machines do not provide such; np.float128 only has 80 bits of precision, which is not enough to capture all the precision in a Time object when working with MJDs (and definitely not for JDs). MSVC users do not have an extended-precision floating-point type at all. Users on arm64 machines do actually have long doubles that are binary128, quad precision, but astropy has not as far as I know been tested on such a platform at all.

A generic double-double type would also solve the problem, providing exactly the precision of a Time object, but although there are software libraries, none has as far as I know been made to work as a general-purpose numeric dtype, let alone is available in an existing dependency of astropy.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 18, 2019

@taldcroft - the main argument is @aarchiba's: right now in python the only platform-independent ways to support high precision are Decimal, str, and int. Of those, int doesn't really make sense for most formats. I did wonder, though¸ whether we needed to support Decimal explicitly, or whether just str is good enough.

Copy link
Contributor

@aarchiba aarchiba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this approach. It seems efficient and effective. My suggestions are almost all minor fixes. Is there a way I can help supply the code for those?

return cache[attr]

elif attr in TIME_SCALES: # allowed ones done above (self.SCALES)
elif '_' in attr:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this correctly handle pulsar_mjd? That is, custom time formats that contain underscores but not as used to specify subformats? What about pulsar_mjd_string? (Custom format with an underscore, and a subformat.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be a dunder so mjd__float128 or whatever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's meant to be user-visible, I think one underscore is plenty.

Copy link
Contributor

@aarchiba aarchiba Oct 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correct way to do that is to first check if the format as-is is known, then pull off the rightmost _thing and see if what's left is known, then see if the after-_ part is a known subformat. "is known" is just a "s in dict" check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dunder conveys a (much-more) unambiguous separation between the attribute name and the subformat, both programmatically and to the human reader. I think that mjd_str is a pretty good example of where this gets ambiguous on both counts since somebody might reasonably make their own format where MJD is represented as a string and call it mjd_str. There is precedence, e.g. Django uses dunder in search queries to separate the name of the search field from the operation (https://docs.djangoproject.com/en/2.2/topics/db/search/).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed @aarchiba's logic, first check whether it is known.

But I like @taldcroft's suggestion of a general Time.to_value(fmt, subfmt) even better - indeed if we default to Quantity for TimeDelta, as a side-effect that solves a long-standing annoyance of being able to do .to(unit) but not to_value(unit).

bytes_to_twoval = np.vectorize(bytes_to_twoval1)


def twoval_to_float128(val1, val2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not a good idea to use float128: on 32-bit platforms it is less efficient than float96 and therefore is not the default. np.longdouble generally accomplishes the right thing, with the exception that on Windows/MSVC np.longdouble == np.float while np.float128 doesn't exist.

Also float128 is misleading.

twoval_to_string = np.vectorize(twoval_to_string1, excluded='fmt')


def twoval_to_bytes1(val1, val2, fmt):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are bytes versions really needed? Wouldn't it be better to let the user generate a string and then encode/decode as appropriate for the rest of their byte-generation process?

return val1.astype(np.float128) + val2.astype(np.float128)


def twoval_to_decimal1(val1, val2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is important to beware of the interaction between the precision specified in the object and the precision with which Decimal objects are represented - the user could have an unpleasant surprise if the Decimal precision were not sufficient. There is a context manager that can set Decimal precision.

This is particularly an issue with the string conversions, where users may not realize that they are going through Decimal.

Some tests with deviant Decimal contexts would be a good idea.

@taldcroft
Copy link
Member

right now in python the only platform-independent ways to support high precision are Decimal, str, and int.

I have some naive / stupid questions or comments:

  1. Does the astropy Time class itself have sufficient precision for astronomy (pulsar) research? Right now Time is indeed a platform-independent way to support high precision, and I had been under the impression the double-double internal repr in Time is good enough.

  2. It makes sense that str is a way to represent an MJD time (starting from a Time object) with sufficient precision to be lossless. But I don't quite get what you do with that representation. You can't do math on it. Of course you can (with this PR) turn it back into a Time object, or store it in a file to round-trip, but you can also do those with a native Time. I understand a little more the use case for a Decimal MJD since you can actually do math on it.

  3. From what @aarchiba says (which all sounds reasonable) the whole business of representing a Time to high precision in any single floating point value is a minefield of problems. So the stuff in here about representing floating point value as some other type like np.longdouble or float128 is going to be quite platform dependent, and anyway maybe not even good enough to be worth the trouble (i.e. the 80 bits of precision comment).

@bsipocz
Copy link
Member

bsipocz commented Oct 18, 2019

My suggestions are almost all minor fixes. Is there a way I can help supply the code for those?

@aarchiba - If the suggested changes are only affecting a single line at a time, then doing the github suggestion (~step no6 here: https://help.github.com/en/articles/reviewing-proposed-changes-in-a-pull-request) would work. My understanding is that the ability to suggest for multiple lines is coming in a few months time.
For more complex suggestions, you could add Marten's fork as a remote to your local clone, work the changes on this branch and then open a PR against the branch in his fork.

@aarchiba
Copy link
Contributor

aarchiba commented Oct 19, 2019

  1. Does the astropy Time class itself have sufficient precision for astronomy (pulsar) research? Right now Time is indeed a platform-independent way to support high precision, and I had been under the impression the double-double internal repr in Time is good enough.

Yes. A (normal) Time object has 0.02 ns resolution, while an MJD represented in long double (on Intel/GCC, so 80 bits) has 0.6 ns, which is uncomfortably close to the 50 ns level of uncertainty we get; a long double JD has a resolution of 23 ns, which is not acceptable.

  1. It makes sense that str is a way to represent an MJD time (starting from a Time object) with sufficient precision to be lossless. But I don't quite get what you do with that representation. You can't do math on it. Of course you can (with this PR) turn it back into a Time object, or store it in a file to round-trip, but you can also do those with a native Time. I understand a little more the use case for a Decimal MJD since you can actually do math on it.

We need to be able to read and write MJD strings while losing as little accuracy as possible because many of our file formats use these. We can also do subtraction of Time objects to get TimeDeltas with the same resolution as Time, and our time differences tend not to be quite as large as 60_000 days, so conversion of these deltas to long double is a little more accurate than converting straight MJDs. This is good, because we do fairly elaborate calculations with these quantities; for example, we combine the time difference with spin frequency and some number of its time derivatives to compute pulsar rotational phase, which is our primary observable. So we need extended precision for the time, the spin frequency, and the rotational phase. I would love to have more precision, but software quad precision is about 25 times slower, and arbitrary precision (like Decimal) is usually substantially slower than that. So Decimal can be more accurate than extended precision but its performance is almost certainly insufficient. (We find ourselves with tens of pulsars times hundreds or thousands of measurements, evaluated at millions of parameter values as we run MCMC fitting procedures.)

  1. From what @aarchiba says (which all sounds reasonable) the whole business of representing a Time to high precision in any single floating point value is a minefield of problems. So the stuff in here about representing floating point value as some other type like np.longdouble or float128 is going to be quite platform dependent, and anyway maybe not even good enough to be worth the trouble (i.e. the 80 bits of precision comment).

The fact is that 80 bits is the state of the art, and it is by far the best available balance of precision and performance. The existing program TEMPO2 uses 80 bits internally and is the standard tool for precision pulsar timing. I recently carried out a test of gravity using precision timing of a single pulsar (in a triple system) that involved direct integration of the equations of motion, and I compared 64-bit hardware floats, 80-bit hardware floats, and 128-bit software quad precision; 64-bit floats were absolutely not adequate, and I measured a factor of 25 slowdown for the quad precision, which made them infeasible for the primary work of the project, given we were using a hundred cores on a modern cluster and were at the limits of parallelizability of the problem.

The double-double representation inside Time objects was chosen to meet a real need, and our desire to have it interoperate with hardware extended precision as well as to have reasonable IO and a higher-precision check format are all based on real, practical needs of existing software.

@aarchiba
Copy link
Contributor

This specific PR and issue #9358 arise from work I did on PINT, which takes a different approach to the problem: https://github.com/nanograv/PINT/blob/master/src/pint/pulsar_mjd.py

You can also look at the (extremely messy) gravity testing code: https://github.com/aarchiba/triplesystem ; I draw your attention to the lengths I went to to have quad precision available as a check, for example: https://github.com/aarchiba/triplesystem/blob/master/code-v1/quad_integrate.pyx not to mention the use of Boost's templated C++ ODE integrators. I did not much use astropy.time.Time because without a way to read and write the full accuracy as strings or convert to long doubles it didn't help me much. So I wrote my own code to read in strings, subtract off an MJD at the beginning of the observation campaign, and convert what was left to long double. I am pushing for astropy to have these tools so I won't have to do that any more times.

@taldcroft
Copy link
Member

@aarchiba - thanks for the excellent answers, this helps a lot and clarifies some of the recent work and discussions! I wonder if any of this would be helpful to others, maybe in a section on Time performance or Time and real-world high-precision applications. Just a thought...

@taldcroft
Copy link
Member

taldcroft commented Oct 19, 2019

I agree the attribute munging to quickly get subformat seems handy. It should apply to all formats and subformats, so one can do tm.iso__date_hm. This is also an example where the single-underscore logic proposed earlier won't work (tm.iso_date_hm would assume format='iso_date').

@mhvk mhvk force-pushed the time-output-subformats branch from b23745c to 2823c18 Compare October 27, 2019 16:55
@mhvk
Copy link
Contributor Author

mhvk commented Oct 27, 2019

@taldcroft - OK, I made format a required argument and used '*' as the default for subfmt. I still needed to keep the option of passing in None to signal to use self.out_subfmt since otherwise getting the format it __getattr__ became tricky: at that point, I cannot just do self.to_value(self.fmt, self.out_subfmt), since the format change may also change out_subfmt.

I think it is fine to have it as an option though. If you think it is better, I could remote its description from the docstring.

@taldcroft
Copy link
Member

Looks good now, merging!!

@taldcroft taldcroft merged commit e75d34e into astropy:master Oct 27, 2019
@mhvk mhvk deleted the time-output-subformats branch October 27, 2019 23:00
taldcroft added a commit to taldcroft/astropy that referenced this pull request Jan 17, 2020
taldcroft added a commit to taldcroft/astropy that referenced this pull request Jan 17, 2020
taldcroft added a commit to taldcroft/astropy that referenced this pull request Feb 20, 2020
taldcroft added a commit to taldcroft/astropy that referenced this pull request Feb 20, 2020
taldcroft added a commit to taldcroft/astropy that referenced this pull request Feb 21, 2020
- Fix some issues in astropy#9361
- Change construct_from_dict and related to allow out_subfmt validation
- Improve exception handling for bad subfmt
- Validate subfmt at Format class level
- Update changelog
@taldcroft taldcroft mentioned this pull request Oct 22, 2023
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants