Recognize units written with unicode symbols also in composites#17011
Recognize units written with unicode symbols also in composites#17011nstarman merged 5 commits intoastropy:mainfrom
Conversation
|
Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.
|
mhvk
left a comment
There was a problem hiding this comment.
Some small clarifications to help review.
| elif not s.isascii(): | ||
| if s[0] == "\N{MICRO SIGN}": | ||
| s = "u" + s[1:] | ||
| elif s[0] == "°": |
There was a problem hiding this comment.
With the change in what is recognized as units, conversion of degree sign can now be done here.
|
|
||
| @classmethod | ||
| def _convert_deg(cls, m: Match[str]) -> str: | ||
| if len(m.string) == 1: |
There was a problem hiding this comment.
This was actually wrong, it should have been m.group(). Hence, for "°C/s", "°C" would be substituted with "deg_C/s" giving a full string of "deg_C/s/s".
| ("\N{GREEK SMALL LETTER MU}g", u.microgram), | ||
| ("g\N{MINUS SIGN}1", u.g ** (-1)), | ||
| ("m\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}", 1 / u.m), | ||
| ("m\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}", u.m**-1), |
There was a problem hiding this comment.
Just ensuring we create a unit to compare with, not a Quantity.
3b3a693 to
5974f89
Compare
|
Now also includes a fix for odd CDS units (which end in 0). None very likely to be encountered, but it really brings out the risk of using multiple paths to determine what unit is intended. |
eerovaher
left a comment
There was a problem hiding this comment.
All the other parsing formatters inherit _regex_deg and _convert_deg() from Generic, so removing them simplifies the other classes too.
| @classmethod | ||
| def _convert_superscript(cls, m: Match[str]) -> str: | ||
| return f"({m.group().translate(cls._superscript_translations)})" |
There was a problem hiding this comment.
Do you think it would be a good idea to refactor the code in a follow-up pull request to get rid of this method too?
There was a problem hiding this comment.
Not sure it can be captured as easily in an if statement, but in principle, yes. It would also be good to ensure mixes of superscript and regular digits do not work...
docs/changes/units/17011.bugfix.rst
Outdated
| Units involving special unicode symbols can now be parsed also if they are | ||
| part of a composite unit, like "L☉/pc²". | ||
|
|
||
| Similarly, CDS units ending in a 0 can now be part of composite units. | ||
|
|
||
| Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2". |
There was a problem hiding this comment.
I think users would appreciate if the common part of these three cases is summarized in a single sentence and the details are in an explicit list:
| Units involving special unicode symbols can now be parsed also if they are | |
| part of a composite unit, like "L☉/pc²". | |
| Similarly, CDS units ending in a 0 can now be part of composite units. | |
| Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2". | |
| The unit parsers are now much better at recognizing several unusual composite units: | |
| - units involving special unicode symbols, like "L☉/pc²", | |
| - units that include CDS units ending in a 0, like "eps0/s", | |
| - units including "°C". For example, "°C/s" is no longer incorrectly interpreted as "°C/s^2". |
There was a problem hiding this comment.
Thanks, much better; used with slight modifications.
5974f89 to
467fa58
Compare
|
OK, pushed the new version. |
| (["eps0/mu0"], cds.eps0 / cds.mu0), | ||
| (["a0.s"], cds.a0 * u.s), |
There was a problem hiding this comment.
No harm in adding a comment here saying that these two are regression tests for #17010.
There was a problem hiding this comment.
Not strictly true, just found it while investigating that. But more relevantly, this either means running all of CI again or an extra commit with ci skip - neither seems quite worth it to me, but happy to be told otherwise.
There was a problem hiding this comment.
I'm not a units maintainer, @nstarman should do the final review.
There was a problem hiding this comment.
I think in this case the git blame is good.
It would be good to find a way to move towards using Hypothesis, e.g. with these as @example
There was a problem hiding this comment.
We're using hypothesis on Time, where I found it rather painful and noisy (though obviously it does help correctness, but just felt that we ended up having to increase the tolerance slightly many times...).
nstarman
left a comment
There was a problem hiding this comment.
This all looks good.
Also looking forward to continued simplifications by @eerovaher to make units/forma a little easier to grok 😆.
| # are a few special ones (\h is Planch constant) and three that end in 0. | ||
| def t_UNIT(t): | ||
| r"\%|°|\\h|((?!\d)\w)+" | ||
| r"%|°|\\h|(a|eps|mu)0|((?!\d)\w)+" |
There was a problem hiding this comment.
These are just documentation, right?
There was a problem hiding this comment.
No! This docstring is used by the lexer as a regex to capture things that are to be considered units. Since !\d excludes digits, I had to put in the special case for a0, etc.
There was a problem hiding this comment.
Ok! I recall this now from when I last looked through the formatting code in detail. I thought then as I think now that it's surprising to have docstrings with real code effects. It's not unreasonable in this circumstance, but perhaps a good refactor is to reformat this to be an instance of a class with a __call__ method.
class LexerFunc:
regex: ...
func: Callable[[T], R]
__slots__ = ("regex", "func")
def __call__(self, t): return self.func(t)
def _t_Unit(t):
t.value = cls._get_unit(t)
return t
t_Unit = LexerFunc(r"%|°|\\h|(a|eps|mu)0|((?!\d)\w)+", _t_Unit)(units/format/ appears to be a good place to apply mypyc and significantly reduce python function overhead)
This is not to be done in this PR, of course!
There was a problem hiding this comment.
(though we should probably move this to a separate Issue) @eerovaher, you've been upgrading units.format recently. Thoughts?
There was a problem hiding this comment.
A separate issue may be a good idea - at the same time, we should also really update to the latest PLY - #13399
There was a problem hiding this comment.
The requirement that the regexes must be in docstrings comes from PLY. Although we bundle it, it's still an external package, so we don't have full control over the API here.
| (["eps0/mu0"], cds.eps0 / cds.mu0), | ||
| (["a0.s"], cds.a0 * u.s), |
There was a problem hiding this comment.
I think in this case the git blame is good.
It would be good to find a way to move towards using Hypothesis, e.g. with these as @example
|
@nstarman - is this one good to go? |
|
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
Recognize units written with unicode symbols also in composites (cherry picked from commit 2611b0b)
Backport PR #17011 on branch v6.1.x (Recognize units written with unicode symbols also in composites)
Units involving special unicode symbols can now be parsed also if they are
part of a composite unit, like "L☉/pc²".
Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2".
Fixes #17010