Implement mechanism for more flexible unit parsing#16892
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
|
At first glance, the idea seems very good - but I need to look a bit better! One slight worry is the performance hit from setting up the warnings filter -- I'll run the benchmarks just in case. |
Agreed.
Just wondering how ExceptionGroups might fit into this discussion. Could they be useful here? |
|
Benchmark "very simple unit parse" is substantially worse: |
A necessary (but far from sufficient) condition for using exception groups is that
If the performance of this code path is considered to be very important then we can avoid the performance decrease by moving the warning filtering to the except block at astropy/astropy/units/format/generic.py Lines 578 to 584 in 70e3b3a But then we would have to change Lines 2096 to 2097 in 70e3b3a to try:
return f.parse(s, parse_strict=parse_strict)and the additional argument can break any downstream unit formatters that implement parsing (unless we implement the equivalent of except TypeError:
try:
return f.parse(s)somehow). try:
return f.parse(s, parse_strict=parse_strict) if f.supports_parse_strict else f.parse(s)(EDIT: that wouldn't work because downstream formatters could inherit the attribute while still implementing their own non-compliant But I think the added functionality is worth it, even with the slowdown, and such complications are not necessary. The referred case is still a factor of few faster than parsing even the simplest of composite units. |
|
I agree that your solution is far more elegant than the alternatives, and it is nice that anyone subclassing the formatters does not have to worry about it (though not so likely those exist, or that we haven't broken those at least for output...). Separately, I have been wondering whether there should be a simple cache for |
astropy/units/core.py
Outdated
| "enable it with 'u.add_enabled_units'. " | ||
| "For details, see " | ||
| "https://docs.astropy.org/en/latest/units/combining_and_defining.html" | ||
| with warnings.catch_warnings(): |
There was a problem hiding this comment.
Since our filter is simple, I think we can speed things up a bit by passing in arguments here rather than setting up a filterwarnings (saves almost a factor 2 in my tests). And maybe even better if we do something like:
WARNING_FILTER = {
"warn": "default",
"raise": "error",
"silent": "ignore",
}
try:
with warnings.catchwarnings(action=WARNING_FILTER[parse_strict], category=UnitParserWarning):
return f.parse(s)
except KeyError:
if parse_strict not in WARNING_FILTER:
raise ValueError(
"'parse_strict' must be 'warn', 'raise' or 'silent'"
) from None
raise
except NotImplementedError:
raise
...
EDIT: WARNING_FILTER to be defined outside of this scope for speed, obviously. And ideally with a better name!
There was a problem hiding this comment.
What you are suggesting does not work with Python 3.10: https://docs.python.org/3.11/library/warnings.html#available-context-managers
There was a problem hiding this comment.
Maybe we should bite the bullet and set the minimum version to python 3.11 - we're dropping support for it in the next release anyway... (#16886 was also annoying because of this...)
There was a problem hiding this comment.
See #16900 for an issue about dropping python 3.10...
There was a problem hiding this comment.
The suggestion we are discussing here is about an implementation detail that can affect the performance of the code, but not the API. Is the proposed API good enough that I should start working on the documentation or is there anything that still needs to be discussed about the proposed API?
There was a problem hiding this comment.
Personally I'm significantly more in favor of the "Alternative API" you discuss. Avoiding / removing custom meta classes is almost always an improvement. Adding the kwarg will be a bit backwards incompatible, but we are coming up on a major version update, so that seems fine to me.
|
Taking this to the main thread: I think I actually like the approach implemented here somewhat better than passing something down to But then I realized I maybe misunderstand the alternative: would Anyway, more directly to @eerovaher's question: if we go this route, then, yes, I think it is fine not to wait on #16900 but to start adding documentation. |
|
I think having the parsers emit warnings in case of non-critical errors is a good idea for two reasons:
If we do choose to use warnings then where should the warning handling take place? I've opened this pull request with the code placed in
How detailed the warnings hierarchy should be in |
|
I agree with keeping the machinery in |
dfc838a to
cab2e81
Compare
mhvk
left a comment
There was a problem hiding this comment.
This is very nice! Only small comments.
astropy/units/core.py
Outdated
| except NotImplementedError: | ||
| raise | ||
| except UnitParserWarning as err: | ||
| raise ValueError(str(err)) |
There was a problem hiding this comment.
should this be from err? Or from None?
Not perhaps really needed, but if we chain, we can also have some text here, perhaps noting that this error can be dealt with.
There was a problem hiding this comment.
We don't end up in this except block with critical errors that the parser couldn't handle, so adding a remark about parse_strict to the error message is a good idea.
astropy/units/core.py
Outdated
| @@ -2164,12 +2169,15 @@ class Unit(NamedUnit, metaclass=_UnitMetaClass): | |||
| The optional ``parse_strict`` keyword controls what happens when an | |||
| unrecognized unit string is passed in. It may be one of the following: | |||
There was a problem hiding this comment.
We need to change this bit too: "... what happens when a unit string that does not conform to the standard or is unrecognized passed in"
docs/units/format.rst
Outdated
| In case of more serious standard violations an | ||
| :class:`~astropy.units.UnrecognizedUnit` instance is returned instead. | ||
| In such cases it can sometimes be helpful to register additional unit aliases | ||
| (for misspelt units) with :func:`~astropy.units.set_enabled_aliases` (e.g., |
astropy/units/core.py
Outdated
| "'parse_strict' must be 'warn', 'raise' or 'silent'" | ||
| ) from None | ||
| try: | ||
| with warnings.catch_warnings( |
There was a problem hiding this comment.
Looking at the docs I see that catch_warnings isn't thread-safe. Was the previous implementation?
There was a problem hiding this comment.
Currently astropy.units does not use catch_warnings() (outside tests).
Do you have any ideas how to implement this functionality without using catch_warnings() that would still be compatible with ply?
There was a problem hiding this comment.
We've had quite a few PRs dealing with thread-safety adding locks, etc. So, my guess is that, as is, we were thread-safe (or at least "safe enough"). It also seems this is a real problem: the risk is that a thread gets our changed state, i.e., if we use parse_strict="raise" (the default), it would suddenly start failing on all warnings after unit parsing...
From https://stackoverflow.com/questions/56043644/python-locally-suppress-warnings-in-thread-safe-way, it looks like the only solution is to add a lock. That's not very complicated and adds negligible overhead compared to the cost we have anyway in setting up the warning filter (0.2 us vs 1.9 us). So, that seems the simplest solution.
However, seeing the time to set up catch_warnings makes me wonder again if we cannot do better another way, since it is a significant overhead, more than doubling the parsing time for simple units (on the same laptop, u.Unit("m") takes 1.5 us currently).
I guess that gets us back to whether we should just bite the bullet and pass on parse_strict after all. Though in a way that just pushes the problem to the parser: the warnings are raised in the middle of the action, inside bits that only have access to bits of the string, so one still needs a process of letting any warnings bubble up anyway, i.e., something like a lock (but at least it would not increase the overhead of the generic parser!). The same would be the case if we implement something that, say, attaches an attribute to the returning unit of whether there was a warning during parsing.
The other way this perhaps can be approached is to have a utility function or method for the warnings, i.e., we call the parser with, effectively, our own catch_warnings (with a lock), and the parser has to use the units utility function (say, via a new cls.warn(...) method), which would get the locked units warning item, and add something to it. I think this may be possible to implement this relatively easily with a context variable. (Don't really know details, but became aware of it via numpy/numpy#26846).
What do you both think?
There was a problem hiding this comment.
I've opened #17004 to speed up parsing simple unit strings with Unit. If that is merged then this pull request wouldn't slow down parsing simple strings at all. It would slightly slow down parsing composite units, but those are much slower to parse anyways, so the relative slowdown would be small.
I still have to think about thread safety.
There was a problem hiding this comment.
Sounds good! It keeps things simple from the parser perspective, which was always the big incentive for the approach here.
There was a problem hiding this comment.
context variables are cool! I didn't know about them. Thanks 👍
There was a problem hiding this comment.
Just saw this because of the backlink from the NumPy PR. I think these days you can think of context variables as a "modern" async-aware (but still thread-safe) version of threading.local. I also just learned today that gevent treats context variables as locals to a green thread, so even async frameworks outside of the built-in ones use them now. You get lots of parallel/concurrent safety "for free" if you put state that could be stored in a global in a context variable.
cab2e81 to
071c592
Compare
071c592 to
ae9a625
Compare
|
The first commit of the updated pull request implements the new mechanism, adds tests and updates documentation. The main concern @mhvk had was about the impact of the new mechanism on parsing non-composite units. That was solved by #17004, which allows simple units to be parsed before the new mechanism is invoked at all. There is still a slight slow-down in parsing composite units, but that is much slower anyways, so the relative slow-down is smaller, and the new functionality is worth it. However, I will point out that this slow-down might cause some performance related change log entries to be misleading, so we might have to review them before 7.0 is released. @nstarman brought up the question of thread-safety, which is addressed in the second commit by introducing a |
nstarman
left a comment
There was a problem hiding this comment.
This all LGTM!
AFAIK this solution does not preclude the more intensive refactor discussed above, which would have an API impact. Given the performance concerns re catch_warnings we should definitely discuss more performant solutions, but IMO this can be merged first. Thanks @eerovaher
mhvk
left a comment
There was a problem hiding this comment.
This looks good - I still really like the idea and agree the performance issues I worried most about are now gone, so am happy to proceed. Just two suggestions for the implementation, though I can go with what you have now too.
astropy/units/core.py
Outdated
| pass | ||
|
|
||
| warning_actions = {"silent": "ignore", "warn": "default", "raise": "error"} | ||
| if parse_strict not in warning_actions: |
There was a problem hiding this comment.
How about adding this inside the try/except:
except KeyError as err:
if parse_strict not in warning_actions:
raise ValueError(...) from None
else:
raise
...
Saves a bit of time.
Also, is python smart enough not to re-evaluate the dict each time? Or is it better to define it outside?
There was a problem hiding this comment.
I believe it re-evaluates unless the function is compiled in some way, e.g. numba / jax.
It would be nice if it could bind Final variables into the function definition itself.
There was a problem hiding this comment.
I had placed the if parse_strict not in warning_actions: check outside the try-except code because I wanted to exclude the possibility of accidentally catching a wrong KeyError, but putting the check inside the except KeyError block achieves that too.
`Unit()` accepts the `parse_strict` argument, which is supposed to let users choose if unit parsing problems cause errors, warnings or are suppressed silently. This has so far been done by letting unit parsers raise errors, which are handled based on the `parse_strict` value. However, raising an error forces parsers to stop early, meaning that every problem becomes a critical error and `Unit` has no choice but to return an `UnrecognizedUnit`. Therefore `parse_strict="raise"` cannot be very strict because otherwise `parse_strict="warn"` and `"silent"` can cause problems for users who might be e.g. reading in files they have little control over. An improvement is to let the unit parsers emit warnings for non-critical problems and to let the machinery in `Unit` handle those warnings based on the `parse_strict` value. Warnings do not force parsers to stop early, so it will be possible to make the `parse_strict="raise"` option stricter without forcing `UnrecognizedUnit` instances with `"warn"` or `"silent"`. Unit parsers still have the option of immediately raising an error in case of serious problems. The new implementation places the warning handling in `_UnitMetaClass` so it is inherited by `Unit`. The other possible implementation without code duplication would be to place it in `Generic._do_parse()`. However, that would require passing the `parse_strict` value through the unit formatter `parse()` methods, which would be a backwards-incompatible change. A subtle (and fortunately minor) change is that if a string can be parsed without problems, but the `parse_strict` value is invalid, then the updated code raises an error. This is because the new code uses the `parse_strict` value to set up appropriate warning filters, which must be done before calling the unit parsers.
Parsing units from strings with `Unit()` can involve `warnings.catch_warnings()`, which is not thread-safe. This could cause problems because the machinery can be invoked when reading files and speeding up I/O with threading is common enough. The solution is to use a `threading.RLock`. `RLock` is preferable over `Lock` because e.g. the `Generic` parser calls `Unit()`, so there might be more than one attempt per thread to obtain the lock.
ae9a625 to
f39e2ef
Compare
|
With my last mistaken comment resolved, let's get this in. Thanks, @eerovaher! |
Description
Unit()accepts theparse_strictargument, which is supposed to let users choose if unit parsing problems cause errors, warnings or are suppressed silently. This has so far been done by letting the unit parsers raise errors, which are caught byUnitand handled based on theparse_strictvalue. However, raising an error forces the parser to stop early, meaning that every problem becomes a critical error andUnithas no choice but to return anUnrecognizedUnitinstance. This in turn means thatparse_strict="raise"cannot be very strict because otherwiseparse_strict="warn"andparse_strict="silent"could start causing problems for users who might be e.g. reading in files they have little control over.An improvement is to let the unit parsers emit warnings for non-critical problems and to let the machinery in
Unithandle those warnings based on theparse_strictvalue. Emitting warnings does not force parsers to stop early, so it will be possible to make theparse_strict="raise"option stricter without being forced to return anUnrecognizedUnitwith"warn"or"silent". Unit parsers still have the option of raising an error instead of a warning in case of serious problems. Raising warnings is also a good approach because we useplyfor parsing, so we are not fully in control of the parser API, but the new warnings do not rely on the API of an external package.Backwards compatibility
A subtle (and fortunately minor) change is that if a string can be parsed without problems, but the
parse_strictvalue is invalid, then the previous code did not raise errors but the new code does. This is because the new code uses theparse_strictvalue to set up appropriate warning filters, which must be done before calling the unit parsers, whereas the old code only looked at theparse_strictvalue in an exception catching block if an exception was actually raised.Alternative
Another option would be to place the new mechanism in
Generic._do_parse()(which is called by all theastropyunit parsers) instead of_UnitMetaClass. However, that would require passing theparse_strictvalue through the unit formatterparse()methods, which would be a backwards-incompatible API change for anyone downstream who has implemented a unit formatter with parsing capabilities (unless their implementation usesparse(..., **kwargs)).Further remarks
This pull request should also update documentation and the change log, but I'm opening this as a draft to allow the implementation to be reviewed before spending time and effort on documenting it. Reviewers can look at the updated tests to see what the changes here mean in practice.
It should be pointed out that if the unit parser raises a warning then the
Unitmachinery will not add the... If this is meant to be a custom unit, ...text to the message. But if the unit parser can recognize the problem then the message it generates is likely to be more relevant anyways.Future possibilities
The only the parser rule updated here is the rule regarding invalid multiplication in the OGIP format, so that it could be used to test the new mechanism. It is the best parser rule for this purpose because it was implemented just recently in 3f562e5, meaning there are no
astropyreleases with this rule so we do not have to worry too much about backwards compatibility. Other parser rules can be updated in follow-up pull requests. Once the mechanism is in place then the remaining parser rules can be updated to make use of it independently from each other.Perhaps the default
parse_strictvalue should be changed from"raise"to "warn". This might be in scope here, or it could be done in a separate follow-up.It might be a good idea to introduce
UnitParserWarningsubclasses for each of the unit parsers so that the unit format name could be included in the warning class name (e.g.FITSParserWarning) instead of having to shoehorn the format name to the warning message. This can be done in a follow-up pull request in a backwards-compatible manner.If the previous idea is adopted then it might be a good idea to introduce warning classes for the individual parser rules (e.g.
OGIPMultiplicationWarning), so that constructing the messages could be implemented in the warning classes rather than in the parsers.