Skip to content

Implement mechanism for more flexible unit parsing#16892

Merged
mhvk merged 2 commits intoastropy:mainfrom
eerovaher:units-parse_strict
Oct 7, 2024
Merged

Implement mechanism for more flexible unit parsing#16892
mhvk merged 2 commits intoastropy:mainfrom
eerovaher:units-parse_strict

Conversation

@eerovaher
Copy link
Copy Markdown
Member

Description

Unit() accepts the parse_strict argument, which is supposed to let users choose if unit parsing problems cause errors, warnings or are suppressed silently. This has so far been done by letting the unit parsers raise errors, which are caught by Unit and handled based on the parse_strict value. However, raising an error forces the parser to stop early, meaning that every problem becomes a critical error and Unit has no choice but to return an UnrecognizedUnit instance. This in turn means that parse_strict="raise" cannot be very strict because otherwise parse_strict="warn" and parse_strict="silent" could start causing problems for users who might be e.g. reading in files they have little control over.

An improvement is to let the unit parsers emit warnings for non-critical problems and to let the machinery in Unit handle those warnings based on the parse_strict value. Emitting warnings does not force parsers to stop early, so it will be possible to make the parse_strict="raise" option stricter without being forced to return an UnrecognizedUnit with "warn" or "silent". Unit parsers still have the option of raising an error instead of a warning in case of serious problems. Raising warnings is also a good approach because we use ply for parsing, so we are not fully in control of the parser API, but the new warnings do not rely on the API of an external package.

Backwards compatibility

A subtle (and fortunately minor) change is that if a string can be parsed without problems, but the parse_strict value is invalid, then the previous code did not raise errors but the new code does. This is because the new code uses the parse_strict value to set up appropriate warning filters, which must be done before calling the unit parsers, whereas the old code only looked at the parse_strict value in an exception catching block if an exception was actually raised.

Alternative

Another option would be to place the new mechanism in Generic._do_parse() (which is called by all the astropy unit parsers) instead of _UnitMetaClass. However, that would require passing the parse_strict value through the unit formatter parse() methods, which would be a backwards-incompatible API change for anyone downstream who has implemented a unit formatter with parsing capabilities (unless their implementation uses parse(..., **kwargs)).

Further remarks

This pull request should also update documentation and the change log, but I'm opening this as a draft to allow the implementation to be reviewed before spending time and effort on documenting it. Reviewers can look at the updated tests to see what the changes here mean in practice.

It should be pointed out that if the unit parser raises a warning then the Unit machinery will not add the ... If this is meant to be a custom unit, ... text to the message. But if the unit parser can recognize the problem then the message it generates is likely to be more relevant anyways.

Future possibilities

The only the parser rule updated here is the rule regarding invalid multiplication in the OGIP format, so that it could be used to test the new mechanism. It is the best parser rule for this purpose because it was implemented just recently in 3f562e5, meaning there are no astropy releases with this rule so we do not have to worry too much about backwards compatibility. Other parser rules can be updated in follow-up pull requests. Once the mechanism is in place then the remaining parser rules can be updated to make use of it independently from each other.

Perhaps the default parse_strict value should be changed from "raise" to "warn". This might be in scope here, or it could be done in a separate follow-up.

It might be a good idea to introduce UnitParserWarning subclasses for each of the unit parsers so that the unit format name could be included in the warning class name (e.g. FITSParserWarning) instead of having to shoehorn the format name to the warning message. This can be done in a follow-up pull request in a backwards-compatible manner.

If the previous idea is adopted then it might be a good idea to introduce warning classes for the individual parser rules (e.g. OGIPMultiplicationWarning), so that constructing the messages could be implemented in the warning classes rather than in the parsers.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@eerovaher eerovaher added this to the v7.0.0 milestone Aug 27, 2024
@eerovaher eerovaher requested review from mhvk and nstarman August 27, 2024 18:08
@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@eerovaher eerovaher marked this pull request as draft August 27, 2024 18:09
@mhvk mhvk added the benchmark Run benchmarks for a PR label Aug 27, 2024
@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Aug 27, 2024

At first glance, the idea seems very good - but I need to look a bit better! One slight worry is the performance hit from setting up the warnings filter -- I'll run the benchmarks just in case.

@nstarman
Copy link
Copy Markdown
Member

At first glance, the idea seems very good

Agreed.

However, raising an error forces the parser to stop early, meaning that every problem becomes a critical error and Unit has no choice but to return an UnrecognizedUnit instance

Just wondering how ExceptionGroups might fit into this discussion. Could they be useful here?

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Aug 27, 2024

Benchmark "very simple unit parse" is substantially worse:

| Change   | Before [70e3b3a9]    | After [38031947]    |   Ratio | Benchmark (Parameter)             |
|----------|----------------------|---------------------|---------|-----------------------------------|
| +        | 2.50±0.01μs          | 4.71±0.01μs         |    1.88 | units.time_very_simple_unit_parse |

@eerovaher
Copy link
Copy Markdown
Member Author

eerovaher commented Aug 27, 2024

@nstarman asked

Just wondering how ExceptionGroups might fit into this discussion. Could they be useful here?

A necessary (but far from sufficient) condition for using exception groups is that ply supports them. The version of ply we bundle is old enough that I feel confident in saying that it does not, even without looking up its documentation.

@mhvk commented

Benchmark "very simple unit parse" is substantially worse: ...

If the performance of this code path is considered to be very important then we can avoid the performance decrease by moving the warning filtering to the except block at

@classmethod
def _do_parse(cls, s: str, debug: bool = False) -> UnitBase:
try:
# This is a short circuit for the case where the string
# is just a single unit name
return cls._parse_unit(s, detailed_exception=False)
except ValueError as e:

But then we would have to change

astropy/astropy/units/core.py

Lines 2096 to 2097 in 70e3b3a

try:
return f.parse(s)

to

try:
    return f.parse(s, parse_strict=parse_strict)

and the additional argument can break any downstream unit formatters that implement parsing (unless we implement the equivalent of

except TypeError:
    try:
        return f.parse(s)

somehow). One more possibility would be to add a Boolean class attribute (False by default) to unit formatters so that we could call something like

try:
    return f.parse(s, parse_strict=parse_strict) if f.supports_parse_strict else f.parse(s)

(EDIT: that wouldn't work because downstream formatters could inherit the attribute while still implementing their own non-compliant parse()).

But I think the added functionality is worth it, even with the slowdown, and such complications are not necessary. The referred case is still a factor of few faster than parsing even the simplest of composite units.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Aug 27, 2024

I agree that your solution is far more elegant than the alternatives, and it is nice that anyone subclassing the formatters does not have to worry about it (though not so likely those exist, or that we haven't broken those at least for output...).

Separately, I have been wondering whether there should be a simple cache for Unit - most codes I can think of just keep on parsing the same units...

"enable it with 'u.add_enabled_units'. "
"For details, see "
"https://docs.astropy.org/en/latest/units/combining_and_defining.html"
with warnings.catch_warnings():
Copy link
Copy Markdown
Contributor

@mhvk mhvk Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since our filter is simple, I think we can speed things up a bit by passing in arguments here rather than setting up a filterwarnings (saves almost a factor 2 in my tests). And maybe even better if we do something like:

WARNING_FILTER = {
    "warn": "default",
    "raise": "error",
    "silent": "ignore",
}
try:
    with warnings.catchwarnings(action=WARNING_FILTER[parse_strict], category=UnitParserWarning):
        return f.parse(s)
    except KeyError:
        if parse_strict not in WARNING_FILTER:
            raise ValueError(
                        "'parse_strict' must be 'warn', 'raise' or 'silent'"
                    ) from None
        raise
    except NotImplementedError:
        raise
...

EDIT: WARNING_FILTER to be defined outside of this scope for speed, obviously. And ideally with a better name!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you are suggesting does not work with Python 3.10: https://docs.python.org/3.11/library/warnings.html#available-context-managers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should bite the bullet and set the minimum version to python 3.11 - we're dropping support for it in the next release anyway... (#16886 was also annoying because of this...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #16900 for an issue about dropping python 3.10...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion we are discussing here is about an implementation detail that can affect the performance of the code, but not the API. Is the proposed API good enough that I should start working on the documentation or is there anything that still needs to be discussed about the proposed API?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm significantly more in favor of the "Alternative API" you discuss. Avoiding / removing custom meta classes is almost always an improvement. Adding the kwarg will be a bit backwards incompatible, but we are coming up on a major version update, so that seems fine to me.

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Sep 2, 2024

Taking this to the main thread: I think I actually like the approach implemented here somewhat better than passing something down to _do_parse(), because we need the try/except anyway to produce an UnrecognizedUnit -- this seems something parsers should not have to know about that. It seems quite clean for the API of _do_parse() to raise on anything impossible to parse, and warn on things not quite up to the standard.

But then I realized I maybe misunderstand the alternative: would _do_parse() get the full parse_strict (i.e., also parse_strict="silent", which means they do have to know about UnrecognizedUnit), or another keyword argument?

Anyway, more directly to @eerovaher's question: if we go this route, then, yes, I think it is fine not to wait on #16900 but to start adding documentation.

@eerovaher
Copy link
Copy Markdown
Member Author

I think having the parsers emit warnings in case of non-critical errors is a good idea for two reasons:

  1. we can use Python's warning handling to pass the warnings along, suppress them silently, or turn them into errors, which is what we want to do with them in the end anyways,
  2. it does not rely on implementation details of ply, which is an external package we do not have full control over.

If we do choose to use warnings then where should the warning handling take place? I've opened this pull request with the code placed in _UnitMetaClass. I've mentioned Generic._do_parse() as an alternative because that's the other location where the warning handling could be implemented without having to duplicate it, but the more I think about it, the less I like it. Moving the warning handling to _do_parse() is a big backwards-incompatible API change for both the astropy unit formatters and also any downstream formatters that implement parsing. We would still need to retain the error handling in _UnitMetaClass because the parsers should not now about UnrecognizedUnit (relevant in case of critical errors).

@nstarman said

Personally I'm significantly more in favor of the "Alternative API" you discuss. Avoiding / removing custom meta classes is almost always an improvement.

How detailed the warnings hierarchy should be in units is completely orthogonal to whether the warning handling should take place in _UnitMetaClass or Generic._do_parse().

@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Sep 5, 2024

I agree with keeping the machinery in Unit - but note that it is something we could move fairly easily to _do_parse() if we were to decide that it is better there. Given that python 3.10 is now going to be dropped (#16903), I think we can go ahead and use the fast version of warnings enabled by 3.11, and indeed document things.

@eerovaher eerovaher marked this pull request as ready for review September 11, 2024 17:52
Copy link
Copy Markdown
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice! Only small comments.

except NotImplementedError:
raise
except UnitParserWarning as err:
raise ValueError(str(err))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be from err? Or from None?

Not perhaps really needed, but if we chain, we can also have some text here, perhaps noting that this error can be dealt with.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't end up in this except block with critical errors that the parser couldn't handle, so adding a remark about parse_strict to the error message is a good idea.

@@ -2164,12 +2169,15 @@ class Unit(NamedUnit, metaclass=_UnitMetaClass):
The optional ``parse_strict`` keyword controls what happens when an
unrecognized unit string is passed in. It may be one of the following:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to change this bit too: "... what happens when a unit string that does not conform to the standard or is unrecognized passed in"

In case of more serious standard violations an
:class:`~astropy.units.UnrecognizedUnit` instance is returned instead.
In such cases it can sometimes be helpful to register additional unit aliases
(for misspelt units) with :func:`~astropy.units.set_enabled_aliases` (e.g.,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, reads well!

"'parse_strict' must be 'warn', 'raise' or 'silent'"
) from None
try:
with warnings.catch_warnings(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the docs I see that catch_warnings isn't thread-safe. Was the previous implementation?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently astropy.units does not use catch_warnings() (outside tests).

Do you have any ideas how to implement this functionality without using catch_warnings() that would still be compatible with ply?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had quite a few PRs dealing with thread-safety adding locks, etc. So, my guess is that, as is, we were thread-safe (or at least "safe enough"). It also seems this is a real problem: the risk is that a thread gets our changed state, i.e., if we use parse_strict="raise" (the default), it would suddenly start failing on all warnings after unit parsing...

From https://stackoverflow.com/questions/56043644/python-locally-suppress-warnings-in-thread-safe-way, it looks like the only solution is to add a lock. That's not very complicated and adds negligible overhead compared to the cost we have anyway in setting up the warning filter (0.2 us vs 1.9 us). So, that seems the simplest solution.

However, seeing the time to set up catch_warnings makes me wonder again if we cannot do better another way, since it is a significant overhead, more than doubling the parsing time for simple units (on the same laptop, u.Unit("m") takes 1.5 us currently).

I guess that gets us back to whether we should just bite the bullet and pass on parse_strict after all. Though in a way that just pushes the problem to the parser: the warnings are raised in the middle of the action, inside bits that only have access to bits of the string, so one still needs a process of letting any warnings bubble up anyway, i.e., something like a lock (but at least it would not increase the overhead of the generic parser!). The same would be the case if we implement something that, say, attaches an attribute to the returning unit of whether there was a warning during parsing.

The other way this perhaps can be approached is to have a utility function or method for the warnings, i.e., we call the parser with, effectively, our own catch_warnings (with a lock), and the parser has to use the units utility function (say, via a new cls.warn(...) method), which would get the locked units warning item, and add something to it. I think this may be possible to implement this relatively easily with a context variable. (Don't really know details, but became aware of it via numpy/numpy#26846).

What do you both think?

Copy link
Copy Markdown
Member Author

@eerovaher eerovaher Sep 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #17004 to speed up parsing simple unit strings with Unit. If that is merged then this pull request wouldn't slow down parsing simple strings at all. It would slightly slow down parsing composite units, but those are much slower to parse anyways, so the relative slowdown would be small.

I still have to think about thread safety.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! It keeps things simple from the parser perspective, which was always the big incentive for the approach here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

context variables are cool! I didn't know about them. Thanks 👍

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw this because of the backlink from the NumPy PR. I think these days you can think of context variables as a "modern" async-aware (but still thread-safe) version of threading.local. I also just learned today that gevent treats context variables as locals to a green thread, so even async frameworks outside of the built-in ones use them now. You get lots of parallel/concurrent safety "for free" if you put state that could be stored in a global in a context variable.

@astropy astropy deleted a comment from mhvk Sep 12, 2024
@astropy astropy deleted a comment from pllim Sep 12, 2024
@eerovaher
Copy link
Copy Markdown
Member Author

The first commit of the updated pull request implements the new mechanism, adds tests and updates documentation. The main concern @mhvk had was about the impact of the new mechanism on parsing non-composite units. That was solved by #17004, which allows simple units to be parsed before the new mechanism is invoked at all. There is still a slight slow-down in parsing composite units, but that is much slower anyways, so the relative slow-down is smaller, and the new functionality is worth it. However, I will point out that this slow-down might cause some performance related change log entries to be misleading, so we might have to review them before 7.0 is released.

@nstarman brought up the question of thread-safety, which is addressed in the second commit by introducing a threading.RLock. Locking is an implementation detail that can be changed in the future without breaking our public API.

Copy link
Copy Markdown
Member

@nstarman nstarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all LGTM!
AFAIK this solution does not preclude the more intensive refactor discussed above, which would have an API impact. Given the performance concerns re catch_warnings we should definitely discuss more performant solutions, but IMO this can be merged first. Thanks @eerovaher

Copy link
Copy Markdown
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good - I still really like the idea and agree the performance issues I worried most about are now gone, so am happy to proceed. Just two suggestions for the implementation, though I can go with what you have now too.

pass

warning_actions = {"silent": "ignore", "warn": "default", "raise": "error"}
if parse_strict not in warning_actions:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding this inside the try/except:

except KeyError as err:
    if parse_strict not in warning_actions:
        raise ValueError(...) from None
    else:
        raise
...

Saves a bit of time.

Also, is python smart enough not to re-evaluate the dict each time? Or is it better to define it outside?

Copy link
Copy Markdown
Member

@nstarman nstarman Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it re-evaluates unless the function is compiled in some way, e.g. numba / jax.
It would be nice if it could bind Final variables into the function definition itself.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had placed the if parse_strict not in warning_actions: check outside the try-except code because I wanted to exclude the possibility of accidentally catching a wrong KeyError, but putting the check inside the except KeyError block achieves that too.

@eerovaher eerovaher removed the benchmark Run benchmarks for a PR label Oct 7, 2024
`Unit()` accepts the `parse_strict` argument, which is supposed to let
users choose if unit parsing problems cause errors, warnings or are
suppressed silently. This has so far been done by letting unit parsers
raise errors, which are handled based on the `parse_strict` value.
However, raising an error forces parsers to stop early, meaning that
every problem becomes a critical error and `Unit` has no choice but to
return an `UnrecognizedUnit`. Therefore `parse_strict="raise"` cannot be
very strict because otherwise `parse_strict="warn"` and `"silent"` can
cause problems for users who might be e.g. reading in files they have
little control over.

An improvement is to let the unit parsers emit warnings for non-critical
problems and to let the machinery in `Unit` handle those warnings based
on the `parse_strict` value. Warnings do not force parsers to stop
early, so it will be possible to make the `parse_strict="raise"` option
stricter without forcing `UnrecognizedUnit` instances with `"warn"` or
`"silent"`. Unit parsers still have the option of immediately raising an
error in case of serious problems.

The new implementation places the warning handling in `_UnitMetaClass`
so it is inherited by `Unit`. The other possible implementation without
code duplication would be to place it in `Generic._do_parse()`. However,
that would require passing the `parse_strict` value through the unit
formatter `parse()` methods, which would be a backwards-incompatible
change.

A subtle (and fortunately minor) change is that if a string can be
parsed without problems, but the `parse_strict` value is invalid, then
the updated code raises an error. This is because the new code uses the
`parse_strict` value to set up appropriate warning filters, which must
be done before calling the unit parsers.
Parsing units from strings with `Unit()` can involve
`warnings.catch_warnings()`, which is not thread-safe. This could cause
problems because the machinery can be invoked when reading files and
speeding up I/O with threading is common enough. The solution is to use
a `threading.RLock`. `RLock` is preferable over `Lock` because e.g. the
`Generic` parser calls `Unit()`, so there might be more than one attempt
per thread to obtain the lock.
@mhvk
Copy link
Copy Markdown
Contributor

mhvk commented Oct 7, 2024

With my last mistaken comment resolved, let's get this in. Thanks, @eerovaher!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants