Allow subformats for Numeric Time formats like MJD, JD#9361
Allow subformats for Numeric Time formats like MJD, JD#9361taldcroft merged 22 commits intoastropy:masterfrom
Conversation
|
I do like that this is fairly compact but is there a way to convert a Time to string or long without either modifying or copying it? |
|
Right now, the best one could do is something like |
That requires a copy of the entire array, right? I mean, it's about to turn into a string array, so the cost isn't necessarily prohibitive (though it's more unpleasant for longdoubles) but it's unpleasant and the syntax is cumbersome. Can I also add my disapproval of "U" as a way to indicate "string"? The letter doesn't even appear! |
|
I don't mean this to be final in any way - certainly, one can pick better names of the The main advantage I see for this technique is that it allows general flexibility on input, yet also the ability to insist on a given format (using |
|
p.s. By default, instantiating a |
|
OK, change |
I would feel a lot better about this. One wouldn't want it to conflict with format names containing underscores ("pulsar_mjd" springs to mind) but this seems like quite a manageable issue.
I can see it in that light - we can't really allow guessing across formats, but the MJD format has a quite restricted range of sensible inputs, over which guessing is quite reasonable. I'm not totally comfortable with the "stickiness" of subformats in this context - if you just ask a Time object t for t.mjd, you get a string or long or decimal or float depending on how it was constructed. (I'm not sure changing the subformat on read is a good answer here, in terms of consistency with the rest of astropy.) (Now I'm wondering whether we should be talking about pulsar_utc as a scale rather than pulsar_mjd as a format? Not really relevant here, except in that it would be nice if the multi-subformat code could be re-used for other formats, presumably including JD.) What about mjd_pair? One could use autodetection to choose mjd_pair as the subformat for Times that were built with two floats instead of one (presumably indicating that the user cares about extra precision), which would result in a default output subformat that also retained the pair-of-floats representation. And on output it would provide a less exotic way to retain the precision as well. I realize these formats appear to be proliferating, but we are starting from a position of not really having a good way to import/export extended time precision at all. And there's no really canonical way to handle extended precision in python, especially as we provide more than londouble can hande. Also: what do we do on platforms like MSVC where longdouble == float? Do we still implement the _long versions? It would also be nice to test the longdouble stuff somewhere longdouble == binary128 (e.g. power8 or arm64, yes my telephone has quad precision), and this might actually be possible on Travis: |
|
I also wasn't sure about keeping the input format for the output - though perhaps it is most consistent with what And while I think auto-guessing may be OK for Quite possibly this is still better done with subclasses, i.e., But it still seems easier to just share code directly. |
|
OK, next commit allows Also removed the part where |
|
This may be a side issue, but does setting |
|
It does, like for |
|
OK, extended to all numerical time formats. Still not quite sure this is the right way to go about it... @taldcroft? Also, some other things are more important to try to wrap up before the 4.0 feature freeze (masked quantities and quantities with uncertainties...). |
taldcroft
left a comment
There was a problem hiding this comment.
@mhvk - this looks impressive!
I guess I have one big picture question. For the most part it seems the real point of supporting strings or Decimal for input/output is to get around the 64-bit precision for val. Would it be sufficient if 128-bit inputs were correctly handled (no precision loss) and if there was a convenient way (TBD, maybe your attr modifier or maybe something more object-y) to get 128-bit outputs for all the numerical formats?
Somehow it seems simpler and if users need Decimal or string on either end they can deal with that themselves as long as precision out to 128-bits is maintained (again, on input and with .value or .mjd).
astropy/time/core.py
Outdated
| def precision(self, val): | ||
| del self.cache | ||
| if not isinstance(val, int) or val < 0 or val > 9: | ||
| if not isinstance(val, int) or val < 0 or val > 19: |
There was a problem hiding this comment.
ERFA (or something along the way) doesn't like precision > 9:
In [2]: t = Time.now()
In [3]: t.iso
Out[3]: '2019-10-18 15:52:50.713'
In [4]: t.precision = 18
In [5]: t.iso
Out[5]: '2019-10-18 15:52:50.-00000001060580486'
There was a problem hiding this comment.
It'll be ERFA, I ran into this. Unfortunately, precision=9 only gets us nanoseconds in this format, and we need greater precision for MJDs (but ERFA is generally not needed there).
There was a problem hiding this comment.
Yes, I'm sorry this is just a horrible hack. Probably the best solution is to let the format deal with validation of precision (and arguably the *_sub* attributes as well).
I don't speak for Marten but I do use extended precision for times, and: The problem is that almost no users actually have a 128-bit precision type. Intel machines do not provide such; np.float128 only has 80 bits of precision, which is not enough to capture all the precision in a Time object when working with MJDs (and definitely not for JDs). MSVC users do not have an extended-precision floating-point type at all. Users on arm64 machines do actually have long doubles that are binary128, quad precision, but astropy has not as far as I know been tested on such a platform at all. A generic double-double type would also solve the problem, providing exactly the precision of a Time object, but although there are software libraries, none has as far as I know been made to work as a general-purpose numeric dtype, let alone is available in an existing dependency of astropy. |
|
@taldcroft - the main argument is @aarchiba's: right now in python the only platform-independent ways to support high precision are |
aarchiba
left a comment
There was a problem hiding this comment.
I like this approach. It seems efficient and effective. My suggestions are almost all minor fixes. Is there a way I can help supply the code for those?
astropy/time/core.py
Outdated
| return cache[attr] | ||
|
|
||
| elif attr in TIME_SCALES: # allowed ones done above (self.SCALES) | ||
| elif '_' in attr: |
There was a problem hiding this comment.
Does this correctly handle pulsar_mjd? That is, custom time formats that contain underscores but not as used to specify subformats? What about pulsar_mjd_string? (Custom format with an underscore, and a subformat.)
There was a problem hiding this comment.
Maybe this should be a dunder so mjd__float128 or whatever.
There was a problem hiding this comment.
It's meant to be user-visible, I think one underscore is plenty.
There was a problem hiding this comment.
I think the correct way to do that is to first check if the format as-is is known, then pull off the rightmost _thing and see if what's left is known, then see if the after-_ part is a known subformat. "is known" is just a "s in dict" check.
There was a problem hiding this comment.
The dunder conveys a (much-more) unambiguous separation between the attribute name and the subformat, both programmatically and to the human reader. I think that mjd_str is a pretty good example of where this gets ambiguous on both counts since somebody might reasonably make their own format where MJD is represented as a string and call it mjd_str. There is precedence, e.g. Django uses dunder in search queries to separate the name of the search field from the operation (https://docs.djangoproject.com/en/2.2/topics/db/search/).
There was a problem hiding this comment.
I followed @aarchiba's logic, first check whether it is known.
But I like @taldcroft's suggestion of a general Time.to_value(fmt, subfmt) even better - indeed if we default to Quantity for TimeDelta, as a side-effect that solves a long-standing annoyance of being able to do .to(unit) but not to_value(unit).
astropy/time/formats.py
Outdated
| bytes_to_twoval = np.vectorize(bytes_to_twoval1) | ||
|
|
||
|
|
||
| def twoval_to_float128(val1, val2): |
There was a problem hiding this comment.
It's probably not a good idea to use float128: on 32-bit platforms it is less efficient than float96 and therefore is not the default. np.longdouble generally accomplishes the right thing, with the exception that on Windows/MSVC np.longdouble == np.float while np.float128 doesn't exist.
Also float128 is misleading.
astropy/time/formats.py
Outdated
| twoval_to_string = np.vectorize(twoval_to_string1, excluded='fmt') | ||
|
|
||
|
|
||
| def twoval_to_bytes1(val1, val2, fmt): |
There was a problem hiding this comment.
Are bytes versions really needed? Wouldn't it be better to let the user generate a string and then encode/decode as appropriate for the rest of their byte-generation process?
astropy/time/formats.py
Outdated
| return val1.astype(np.float128) + val2.astype(np.float128) | ||
|
|
||
|
|
||
| def twoval_to_decimal1(val1, val2): |
There was a problem hiding this comment.
Perhaps it is important to beware of the interaction between the precision specified in the object and the precision with which Decimal objects are represented - the user could have an unpleasant surprise if the Decimal precision were not sufficient. There is a context manager that can set Decimal precision.
This is particularly an issue with the string conversions, where users may not realize that they are going through Decimal.
Some tests with deviant Decimal contexts would be a good idea.
I have some naive / stupid questions or comments:
|
@aarchiba - If the suggested changes are only affecting a single line at a time, then doing the github suggestion (~step no6 here: https://help.github.com/en/articles/reviewing-proposed-changes-in-a-pull-request) would work. My understanding is that the ability to suggest for multiple lines is coming in a few months time. |
Yes. A (normal) Time object has 0.02 ns resolution, while an MJD represented in long double (on Intel/GCC, so 80 bits) has 0.6 ns, which is uncomfortably close to the 50 ns level of uncertainty we get; a long double JD has a resolution of 23 ns, which is not acceptable.
We need to be able to read and write MJD strings while losing as little accuracy as possible because many of our file formats use these. We can also do subtraction of Time objects to get TimeDeltas with the same resolution as Time, and our time differences tend not to be quite as large as 60_000 days, so conversion of these deltas to long double is a little more accurate than converting straight MJDs. This is good, because we do fairly elaborate calculations with these quantities; for example, we combine the time difference with spin frequency and some number of its time derivatives to compute pulsar rotational phase, which is our primary observable. So we need extended precision for the time, the spin frequency, and the rotational phase. I would love to have more precision, but software quad precision is about 25 times slower, and arbitrary precision (like Decimal) is usually substantially slower than that. So Decimal can be more accurate than extended precision but its performance is almost certainly insufficient. (We find ourselves with tens of pulsars times hundreds or thousands of measurements, evaluated at millions of parameter values as we run MCMC fitting procedures.)
The fact is that 80 bits is the state of the art, and it is by far the best available balance of precision and performance. The existing program TEMPO2 uses 80 bits internally and is the standard tool for precision pulsar timing. I recently carried out a test of gravity using precision timing of a single pulsar (in a triple system) that involved direct integration of the equations of motion, and I compared 64-bit hardware floats, 80-bit hardware floats, and 128-bit software quad precision; 64-bit floats were absolutely not adequate, and I measured a factor of 25 slowdown for the quad precision, which made them infeasible for the primary work of the project, given we were using a hundred cores on a modern cluster and were at the limits of parallelizability of the problem. The double-double representation inside Time objects was chosen to meet a real need, and our desire to have it interoperate with hardware extended precision as well as to have reasonable IO and a higher-precision check format are all based on real, practical needs of existing software. |
|
This specific PR and issue #9358 arise from work I did on PINT, which takes a different approach to the problem: https://github.com/nanograv/PINT/blob/master/src/pint/pulsar_mjd.py You can also look at the (extremely messy) gravity testing code: https://github.com/aarchiba/triplesystem ; I draw your attention to the lengths I went to to have quad precision available as a check, for example: https://github.com/aarchiba/triplesystem/blob/master/code-v1/quad_integrate.pyx not to mention the use of Boost's templated C++ ODE integrators. I did not much use astropy.time.Time because without a way to read and write the full accuracy as strings or convert to long doubles it didn't help me much. So I wrote my own code to read in strings, subtract off an MJD at the beginning of the observation campaign, and convert what was left to long double. I am pushing for astropy to have these tools so I won't have to do that any more times. |
|
@aarchiba - thanks for the excellent answers, this helps a lot and clarifies some of the recent work and discussions! I wonder if any of this would be helpful to others, maybe in a section on Time performance or Time and real-world high-precision applications. Just a thought... |
|
I agree the attribute munging to quickly get subformat seems handy. It should apply to all formats and subformats, so one can do |
For TimeDelta, adjust the existing method to take either a unit or a format.
Just need to use .to_value()
Plus quite a few more tests.
Plus a bit of clarification.
Also set the default for subfmt to be '*', but allow passing in None to get the instance's out_subfmt (needed for backward consistency).
b23745c to
2823c18
Compare
|
@taldcroft - OK, I made I think it is fine to have it as an option though. If you think it is better, I could remote its description from the docstring. |
|
Looks good now, merging!! |
- Fix some issues in astropy#9361 - Change construct_from_dict and related to allow out_subfmt validation - Improve exception handling for bad subfmt - Validate subfmt at Format class level - Update changelog
Inspired by #9358, a trial to allow
Decimalandstrinput forformat='mjd', and also allow one to select such output. This could be easily extended tojd, etc.With this PR (EDIT: updated to
str, bytes)cc @aarchiba, @taldcroft