Implement mechanism for more flexible unit parsing by eerovaher · Pull Request #16892 · astropy/astropy

eerovaher · 2024-08-27T18:08:17Z

Description

Unit() accepts the parse_strict argument, which is supposed to let users choose if unit parsing problems cause errors, warnings or are suppressed silently. This has so far been done by letting the unit parsers raise errors, which are caught by Unit and handled based on the parse_strict value. However, raising an error forces the parser to stop early, meaning that every problem becomes a critical error and Unit has no choice but to return an UnrecognizedUnit instance. This in turn means that parse_strict="raise" cannot be very strict because otherwise parse_strict="warn" and parse_strict="silent" could start causing problems for users who might be e.g. reading in files they have little control over.

An improvement is to let the unit parsers emit warnings for non-critical problems and to let the machinery in Unit handle those warnings based on the parse_strict value. Emitting warnings does not force parsers to stop early, so it will be possible to make the parse_strict="raise" option stricter without being forced to return an UnrecognizedUnit with "warn" or "silent". Unit parsers still have the option of raising an error instead of a warning in case of serious problems. Raising warnings is also a good approach because we use ply for parsing, so we are not fully in control of the parser API, but the new warnings do not rely on the API of an external package.

Backwards compatibility

A subtle (and fortunately minor) change is that if a string can be parsed without problems, but the parse_strict value is invalid, then the previous code did not raise errors but the new code does. This is because the new code uses the parse_strict value to set up appropriate warning filters, which must be done before calling the unit parsers, whereas the old code only looked at the parse_strict value in an exception catching block if an exception was actually raised.

Alternative

Another option would be to place the new mechanism in Generic._do_parse() (which is called by all the astropy unit parsers) instead of _UnitMetaClass. However, that would require passing the parse_strict value through the unit formatter parse() methods, which would be a backwards-incompatible API change for anyone downstream who has implemented a unit formatter with parsing capabilities (unless their implementation uses parse(..., **kwargs)).

Further remarks

This pull request should also update documentation and the change log, but I'm opening this as a draft to allow the implementation to be reviewed before spending time and effort on documenting it. Reviewers can look at the updated tests to see what the changes here mean in practice.

It should be pointed out that if the unit parser raises a warning then the Unit machinery will not add the ... If this is meant to be a custom unit, ... text to the message. But if the unit parser can recognize the problem then the message it generates is likely to be more relevant anyways.

Future possibilities

The only the parser rule updated here is the rule regarding invalid multiplication in the OGIP format, so that it could be used to test the new mechanism. It is the best parser rule for this purpose because it was implemented just recently in 3f562e5, meaning there are no astropy releases with this rule so we do not have to worry too much about backwards compatibility. Other parser rules can be updated in follow-up pull requests. Once the mechanism is in place then the remaining parser rules can be updated to make use of it independently from each other.

Perhaps the default parse_strict value should be changed from "raise" to "warn". This might be in scope here, or it could be done in a separate follow-up.

It might be a good idea to introduce UnitParserWarning subclasses for each of the unit parsers so that the unit format name could be included in the warning class name (e.g. FITSParserWarning) instead of having to shoehorn the format name to the warning message. This can be done in a follow-up pull request in a backwards-compatible manner.

If the previous idea is adopted then it might be a good idea to introduce warning classes for the individual parser rules (e.g. OGIPMultiplicationWarning), so that constructing the messages could be implemented in the warning classes rather than in the parsers.

By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

github-actions · 2024-08-27T18:08:43Z

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

Do the proposed changes actually accomplish desired goals?
Do the proposed changes follow the Astropy coding guidelines?
Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

mhvk · 2024-08-27T18:19:22Z

At first glance, the idea seems very good - but I need to look a bit better! One slight worry is the performance hit from setting up the warnings filter -- I'll run the benchmarks just in case.

nstarman · 2024-08-27T18:49:52Z

At first glance, the idea seems very good

Agreed.

However, raising an error forces the parser to stop early, meaning that every problem becomes a critical error and Unit has no choice but to return an UnrecognizedUnit instance

Just wondering how ExceptionGroups might fit into this discussion. Could they be useful here?

mhvk · 2024-08-27T18:58:52Z

Benchmark "very simple unit parse" is substantially worse:

| Change   | Before [70e3b3a9]    | After [38031947]    |   Ratio | Benchmark (Parameter)             |
|----------|----------------------|---------------------|---------|-----------------------------------|
| +        | 2.50±0.01μs          | 4.71±0.01μs         |    1.88 | units.time_very_simple_unit_parse |

eerovaher · 2024-08-27T20:33:23Z

@nstarman asked

Just wondering how ExceptionGroups might fit into this discussion. Could they be useful here?

A necessary (but far from sufficient) condition for using exception groups is that ply supports them. The version of ply we bundle is old enough that I feel confident in saying that it does not, even without looking up its documentation.

@mhvk commented

Benchmark "very simple unit parse" is substantially worse: ...

If the performance of this code path is considered to be very important then we can avoid the performance decrease by moving the warning filtering to the except block at

astropy/astropy/units/format/generic.py

Lines 578 to 584 in 70e3b3a

    
           @classmethod 
        
           def _do_parse(cls, s: str, debug: bool = False) -> UnitBase: 
        
               try: 
        
                   # This is a short circuit for the case where the string 
        
                   # is just a single unit name 
        
                   return cls._parse_unit(s, detailed_exception=False) 
        
               except ValueError as e:

But then we would have to change

astropy/astropy/units/core.py

Lines 2096 to 2097 in 70e3b3a

    
           try: 
        
               return f.parse(s)

to

try:
    return f.parse(s, parse_strict=parse_strict)

and the additional argument can break any downstream unit formatters that implement parsing (unless we implement the equivalent of

except TypeError:
    try:
        return f.parse(s)

somehow). ~~One more possibility would be to add a Boolean class attribute (False by default) to unit formatters so that we could call something like~~

try:
    return f.parse(s, parse_strict=parse_strict) if f.supports_parse_strict else f.parse(s)

(EDIT: that wouldn't work because downstream formatters could inherit the attribute while still implementing their own non-compliant parse()).

But I think the added functionality is worth it, even with the slowdown, and such complications are not necessary. The referred case is still a factor of few faster than parsing even the simplest of composite units.

mhvk · 2024-08-27T20:53:51Z

I agree that your solution is far more elegant than the alternatives, and it is nice that anyone subclassing the formatters does not have to worry about it (though not so likely those exist, or that we haven't broken those at least for output...).

Separately, I have been wondering whether there should be a simple cache for Unit - most codes I can think of just keep on parsing the same units...

mhvk · 2024-08-28T18:05:41Z

astropy/units/core.py

-                        "enable it with 'u.add_enabled_units'. "
-                        "For details, see "
-                        "https://docs.astropy.org/en/latest/units/combining_and_defining.html"
+            with warnings.catch_warnings():


Since our filter is simple, I think we can speed things up a bit by passing in arguments here rather than setting up a filterwarnings (saves almost a factor 2 in my tests). And maybe even better if we do something like:

WARNING_FILTER = { "warn": "default", "raise": "error", "silent": "ignore", } try: with warnings.catchwarnings(action=WARNING_FILTER[parse_strict], category=UnitParserWarning): return f.parse(s) except KeyError: if parse_strict not in WARNING_FILTER: raise ValueError( "'parse_strict' must be 'warn', 'raise' or 'silent'" ) from None raise except NotImplementedError: raise ...

EDIT: WARNING_FILTER to be defined outside of this scope for speed, obviously. And ideally with a better name!

What you are suggesting does not work with Python 3.10: https://docs.python.org/3.11/library/warnings.html#available-context-managers

Maybe we should bite the bullet and set the minimum version to python 3.11 - we're dropping support for it in the next release anyway... (#16886 was also annoying because of this...)

See #16900 for an issue about dropping python 3.10...

The suggestion we are discussing here is about an implementation detail that can affect the performance of the code, but not the API. Is the proposed API good enough that I should start working on the documentation or is there anything that still needs to be discussed about the proposed API?

Personally I'm significantly more in favor of the "Alternative API" you discuss. Avoiding / removing custom meta classes is almost always an improvement. Adding the kwarg will be a bit backwards incompatible, but we are coming up on a major version update, so that seems fine to me.

mhvk · 2024-09-02T21:14:43Z

Taking this to the main thread: I think I actually like the approach implemented here somewhat better than passing something down to _do_parse(), because we need the try/except anyway to produce an UnrecognizedUnit -- this seems something parsers should not have to know about that. It seems quite clean for the API of _do_parse() to raise on anything impossible to parse, and warn on things not quite up to the standard.

But then I realized I maybe misunderstand the alternative: would _do_parse() get the full parse_strict (i.e., also parse_strict="silent", which means they do have to know about UnrecognizedUnit), or another keyword argument?

Anyway, more directly to @eerovaher's question: if we go this route, then, yes, I think it is fine not to wait on #16900 but to start adding documentation.

eerovaher · 2024-09-03T21:31:53Z

I think having the parsers emit warnings in case of non-critical errors is a good idea for two reasons:

we can use Python's warning handling to pass the warnings along, suppress them silently, or turn them into errors, which is what we want to do with them in the end anyways,
it does not rely on implementation details of ply, which is an external package we do not have full control over.

If we do choose to use warnings then where should the warning handling take place? I've opened this pull request with the code placed in _UnitMetaClass. I've mentioned Generic._do_parse() as an alternative because that's the other location where the warning handling could be implemented without having to duplicate it, but the more I think about it, the less I like it. Moving the warning handling to _do_parse() is a big backwards-incompatible API change for both the astropy unit formatters and also any downstream formatters that implement parsing. We would still need to retain the error handling in _UnitMetaClass because the parsers should not now about UnrecognizedUnit (relevant in case of critical errors).

@nstarman said

Personally I'm significantly more in favor of the "Alternative API" you discuss. Avoiding / removing custom meta classes is almost always an improvement.

How detailed the warnings hierarchy should be in units is completely orthogonal to whether the warning handling should take place in _UnitMetaClass or Generic._do_parse().

mhvk · 2024-09-05T14:19:41Z

I agree with keeping the machinery in Unit - but note that it is something we could move fairly easily to _do_parse() if we were to decide that it is better there. Given that python 3.10 is now going to be dropped (#16903), I think we can go ahead and use the fast version of warnings enabled by 3.11, and indeed document things.

mhvk

This is very nice! Only small comments.

mhvk · 2024-09-11T20:30:50Z

astropy/units/core.py

            except NotImplementedError:
                raise
+            except UnitParserWarning as err:
+                raise ValueError(str(err))


should this be from err? Or from None?

Not perhaps really needed, but if we chain, we can also have some text here, perhaps noting that this error can be dealt with.

We don't end up in this except block with critical errors that the parser couldn't handle, so adding a remark about parse_strict to the error message is a good idea.

mhvk · 2024-09-11T20:37:12Z

astropy/units/core.py

@@ -2164,12 +2169,15 @@ class Unit(NamedUnit, metaclass=_UnitMetaClass):
      The optional ``parse_strict`` keyword controls what happens when an
      unrecognized unit string is passed in.  It may be one of the following:


We need to change this bit too: "... what happens when a unit string that does not conform to the standard or is unrecognized passed in"

mhvk · 2024-09-11T20:40:25Z

docs/units/format.rst

+In case of more serious standard violations an
+:class:`~astropy.units.UnrecognizedUnit` instance is returned instead.
+In such cases it can sometimes be helpful to register additional unit aliases
+(for misspelt units) with :func:`~astropy.units.set_enabled_aliases` (e.g.,


Thanks, reads well!

nstarman · 2024-09-12T14:37:01Z

astropy/units/core.py

+                    "'parse_strict' must be 'warn', 'raise' or 'silent'"
+                ) from None
+            try:
+                with warnings.catch_warnings(


Looking at the docs I see that catch_warnings isn't thread-safe. Was the previous implementation?

Currently astropy.units does not use catch_warnings() (outside tests).

Do you have any ideas how to implement this functionality without using catch_warnings() that would still be compatible with ply?

We've had quite a few PRs dealing with thread-safety adding locks, etc. So, my guess is that, as is, we were thread-safe (or at least "safe enough"). It also seems this is a real problem: the risk is that a thread gets our changed state, i.e., if we use parse_strict="raise" (the default), it would suddenly start failing on all warnings after unit parsing...

From https://stackoverflow.com/questions/56043644/python-locally-suppress-warnings-in-thread-safe-way, it looks like the only solution is to add a lock. That's not very complicated and adds negligible overhead compared to the cost we have anyway in setting up the warning filter (0.2 us vs 1.9 us). So, that seems the simplest solution.

However, seeing the time to set up catch_warnings makes me wonder again if we cannot do better another way, since it is a significant overhead, more than doubling the parsing time for simple units (on the same laptop, u.Unit("m") takes 1.5 us currently).

I guess that gets us back to whether we should just bite the bullet and pass on parse_strict after all. Though in a way that just pushes the problem to the parser: the warnings are raised in the middle of the action, inside bits that only have access to bits of the string, so one still needs a process of letting any warnings bubble up anyway, i.e., something like a lock (but at least it would not increase the overhead of the generic parser!). The same would be the case if we implement something that, say, attaches an attribute to the returning unit of whether there was a warning during parsing.

The other way this perhaps can be approached is to have a utility function or method for the warnings, i.e., we call the parser with, effectively, our own catch_warnings (with a lock), and the parser has to use the units utility function (say, via a new cls.warn(...) method), which would get the locked units warning item, and add something to it. I think this may be possible to implement this relatively easily with a context variable. (Don't really know details, but became aware of it via numpy/numpy#26846).

What do you both think?

I've opened #17004 to speed up parsing simple unit strings with Unit. If that is merged then this pull request wouldn't slow down parsing simple strings at all. It would slightly slow down parsing composite units, but those are much slower to parse anyways, so the relative slowdown would be small.

I still have to think about thread safety.

Sounds good! It keeps things simple from the parser perspective, which was always the big incentive for the approach here.

context variables are cool! I didn't know about them. Thanks 👍

Just saw this because of the backlink from the NumPy PR. I think these days you can think of context variables as a "modern" async-aware (but still thread-safe) version of threading.local. I also just learned today that gevent treats context variables as locals to a green thread, so even async frameworks outside of the built-in ones use them now. You get lots of parallel/concurrent safety "for free" if you put state that could be stored in a global in a context variable.

eerovaher · 2024-10-07T14:38:49Z

The first commit of the updated pull request implements the new mechanism, adds tests and updates documentation. The main concern @mhvk had was about the impact of the new mechanism on parsing non-composite units. That was solved by #17004, which allows simple units to be parsed before the new mechanism is invoked at all. There is still a slight slow-down in parsing composite units, but that is much slower anyways, so the relative slow-down is smaller, and the new functionality is worth it. However, I will point out that this slow-down might cause some performance related change log entries to be misleading, so we might have to review them before 7.0 is released.

@nstarman brought up the question of thread-safety, which is addressed in the second commit by introducing a threading.RLock. Locking is an implementation detail that can be changed in the future without breaking our public API.

nstarman

This all LGTM!
AFAIK this solution does not preclude the more intensive refactor discussed above, which would have an API impact. Given the performance concerns re catch_warnings we should definitely discuss more performant solutions, but IMO this can be merged first. Thanks @eerovaher

mhvk

This looks good - I still really like the idea and agree the performance issues I worried most about are now gone, so am happy to proceed. Just two suggestions for the implementation, though I can go with what you have now too.

mhvk · 2024-10-07T15:48:38Z

astropy/units/core.py

                pass

+            warning_actions = {"silent": "ignore", "warn": "default", "raise": "error"}
+            if parse_strict not in warning_actions:


How about adding this inside the try/except:

except KeyError as err: if parse_strict not in warning_actions: raise ValueError(...) from None else: raise ...

Saves a bit of time.

Also, is python smart enough not to re-evaluate the dict each time? Or is it better to define it outside?

I believe it re-evaluates unless the function is compiled in some way, e.g. numba / jax.
It would be nice if it could bind Final variables into the function definition itself.

I had placed the if parse_strict not in warning_actions: check outside the try-except code because I wanted to exclude the possibility of accidentally catching a wrong KeyError, but putting the check inside the except KeyError block achieves that too.

astropy/units/core.py

`Unit()` accepts the `parse_strict` argument, which is supposed to let users choose if unit parsing problems cause errors, warnings or are suppressed silently. This has so far been done by letting unit parsers raise errors, which are handled based on the `parse_strict` value. However, raising an error forces parsers to stop early, meaning that every problem becomes a critical error and `Unit` has no choice but to return an `UnrecognizedUnit`. Therefore `parse_strict="raise"` cannot be very strict because otherwise `parse_strict="warn"` and `"silent"` can cause problems for users who might be e.g. reading in files they have little control over. An improvement is to let the unit parsers emit warnings for non-critical problems and to let the machinery in `Unit` handle those warnings based on the `parse_strict` value. Warnings do not force parsers to stop early, so it will be possible to make the `parse_strict="raise"` option stricter without forcing `UnrecognizedUnit` instances with `"warn"` or `"silent"`. Unit parsers still have the option of immediately raising an error in case of serious problems. The new implementation places the warning handling in `_UnitMetaClass` so it is inherited by `Unit`. The other possible implementation without code duplication would be to place it in `Generic._do_parse()`. However, that would require passing the `parse_strict` value through the unit formatter `parse()` methods, which would be a backwards-incompatible change. A subtle (and fortunately minor) change is that if a string can be parsed without problems, but the `parse_strict` value is invalid, then the updated code raises an error. This is because the new code uses the `parse_strict` value to set up appropriate warning filters, which must be done before calling the unit parsers.

Parsing units from strings with `Unit()` can involve `warnings.catch_warnings()`, which is not thread-safe. This could cause problems because the machinery can be invoked when reading files and speeding up I/O with threading is common enough. The solution is to use a `threading.RLock`. `RLock` is preferable over `Lock` because e.g. the `Generic` parser calls `Unit()`, so there might be more than one attempt per thread to obtain the lock.

astropy/units/core.py

mhvk · 2024-10-07T17:48:33Z

With my last mistaken comment resolved, let's get this in. Thanks, @eerovaher!

eerovaher added units Enhancement labels Aug 27, 2024

eerovaher added this to the v7.0.0 milestone Aug 27, 2024

eerovaher requested review from mhvk and nstarman August 27, 2024 18:08

eerovaher marked this pull request as draft August 27, 2024 18:09

mhvk added the benchmark Run benchmarks for a PR label Aug 27, 2024

mhvk reviewed Aug 28, 2024

View reviewed changes

mhvk mentioned this pull request Aug 28, 2024

Request: drop python 3.10 support now (since it will not be needed for 7.0) #16900

Closed

eerovaher force-pushed the units-parse_strict branch from dfc838a to cab2e81 Compare September 11, 2024 17:52

eerovaher marked this pull request as ready for review September 11, 2024 17:52

mhvk reviewed Sep 11, 2024

View reviewed changes

nstarman reviewed Sep 12, 2024

View reviewed changes

eerovaher force-pushed the units-parse_strict branch from cab2e81 to 071c592 Compare September 12, 2024 15:22

astropy deleted a comment from mhvk Sep 12, 2024

astropy deleted a comment from pllim Sep 12, 2024

mhvk mentioned this pull request Sep 12, 2024

Speed up parsing simple unit strings with Unit #17004

Merged

1 task

eerovaher force-pushed the units-parse_strict branch from 071c592 to ae9a625 Compare October 7, 2024 14:15

nstarman approved these changes Oct 7, 2024

View reviewed changes

mhvk approved these changes Oct 7, 2024

View reviewed changes

eerovaher removed the benchmark Run benchmarks for a PR label Oct 7, 2024

eerovaher added 2 commits October 7, 2024 18:50

eerovaher force-pushed the units-parse_strict branch from ae9a625 to f39e2ef Compare October 7, 2024 16:52

mhvk reviewed Oct 7, 2024

View reviewed changes

astropy/units/core.py Show resolved Hide resolved

mhvk merged commit b447aed into astropy:main Oct 7, 2024

eerovaher deleted the units-parse_strict branch October 7, 2024 17:55

eerovaher mentioned this pull request Oct 8, 2024

Add a warning class for invalid OGIP multiplications #17138

Closed

1 task

neutrinoceros mentioned this pull request Oct 9, 2024

TST: adapt a test to pytest<8 #17143

Merged

1 task

mhvk mentioned this pull request Oct 22, 2024

Make parsing implicit custom units with "vounit" format obey parse_strict #17232

Merged

1 task

		@@ -2164,12 +2169,15 @@ class Unit(NamedUnit, metaclass=_UnitMetaClass):
		The optional ``parse_strict`` keyword controls what happens when an
		unrecognized unit string is passed in. It may be one of the following:

Uh oh!

Conversation

eerovaher commented Aug 27, 2024

Description

Backwards compatibility

Alternative

Further remarks

Future possibilities

Uh oh!

github-actions bot commented Aug 27, 2024

Uh oh!

mhvk commented Aug 27, 2024

Uh oh!

nstarman commented Aug 27, 2024

Uh oh!

mhvk commented Aug 27, 2024

Uh oh!

eerovaher commented Aug 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mhvk commented Aug 27, 2024

Uh oh!

mhvk Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mhvk commented Sep 2, 2024

Uh oh!

eerovaher commented Sep 3, 2024

Uh oh!

mhvk commented Sep 5, 2024

Uh oh!

mhvk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eerovaher Sep 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eerovaher commented Oct 7, 2024

Uh oh!

nstarman left a comment

Choose a reason for hiding this comment

Uh oh!

mhvk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nstarman Oct 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

eerovaher commented Aug 27, 2024 •

edited

Loading

mhvk Aug 28, 2024 •

edited

Loading

eerovaher Sep 12, 2024 •

edited

Loading

nstarman Oct 7, 2024 •

edited

Loading