Skip to content

Recognize units written with unicode symbols also in composites#17011

Merged
nstarman merged 5 commits intoastropy:mainfrom
mhvk:units-composite-unicode
Sep 29, 2024
Merged

Recognize units written with unicode symbols also in composites#17011
nstarman merged 5 commits intoastropy:mainfrom
mhvk:units-composite-unicode

Conversation

@mhvk
Copy link
Copy Markdown
Contributor

@mhvk mhvk commented Sep 14, 2024

Units involving special unicode symbols can now be parsed also if they are
part of a composite unit, like "L☉/pc²".

Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2".

Fixes #17010

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@github-actions
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

Copy link
Copy Markdown
Contributor Author

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small clarifications to help review.

elif not s.isascii():
if s[0] == "\N{MICRO SIGN}":
s = "u" + s[1:]
elif s[0] == "°":
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the change in what is recognized as units, conversion of degree sign can now be done here.


@classmethod
def _convert_deg(cls, m: Match[str]) -> str:
if len(m.string) == 1:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually wrong, it should have been m.group(). Hence, for "°C/s", "°C" would be substituted with "deg_C/s" giving a full string of "deg_C/s/s".

("\N{GREEK SMALL LETTER MU}g", u.microgram),
("g\N{MINUS SIGN}1", u.g ** (-1)),
("m\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}", 1 / u.m),
("m\N{SUPERSCRIPT MINUS}\N{SUPERSCRIPT ONE}", u.m**-1),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just ensuring we create a unit to compare with, not a Quantity.

@mhvk mhvk requested a review from eerovaher September 14, 2024 14:06
@mhvk mhvk modified the milestones: v6.1.4, v7.0.0 Sep 14, 2024
@mhvk mhvk force-pushed the units-composite-unicode branch from 3b3a693 to 5974f89 Compare September 14, 2024 21:11
@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Sep 14, 2024

Now also includes a fix for odd CDS units (which end in 0). None very likely to be encountered, but it really brings out the risk of using multiple paths to determine what unit is intended.

Copy link
Copy Markdown
Member

@eerovaher eerovaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other parsing formatters inherit _regex_deg and _convert_deg() from Generic, so removing them simplifies the other classes too.

Comment on lines 538 to 540
@classmethod
def _convert_superscript(cls, m: Match[str]) -> str:
return f"({m.group().translate(cls._superscript_translations)})"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be a good idea to refactor the code in a follow-up pull request to get rid of this method too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it can be captured as easily in an if statement, but in principle, yes. It would also be good to ensure mixes of superscript and regular digits do not work...

Comment on lines +1 to +6
Units involving special unicode symbols can now be parsed also if they are
part of a composite unit, like "L☉/pc²".

Similarly, CDS units ending in a 0 can now be part of composite units.

Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2".
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think users would appreciate if the common part of these three cases is summarized in a single sentence and the details are in an explicit list:

Suggested change
Units involving special unicode symbols can now be parsed also if they are
part of a composite unit, like "L☉/pc²".
Similarly, CDS units ending in a 0 can now be part of composite units.
Furthermore, a bug was fixed where "°C/s" was incorrectly interpreted as "°C/s^2".
The unit parsers are now much better at recognizing several unusual composite units:
- units involving special unicode symbols, like "L☉/pc²",
- units that include CDS units ending in a 0, like "eps0/s",
- units including "°C". For example, "°C/s" is no longer incorrectly interpreted as "°C/s^2".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, much better; used with slight modifications.

@mhvk mhvk force-pushed the units-composite-unicode branch from 5974f89 to 467fa58 Compare September 16, 2024 18:14
@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Sep 16, 2024

OK, pushed the new version.

Comment on lines +127 to +128
(["eps0/mu0"], cds.eps0 / cds.mu0),
(["a0.s"], cds.a0 * u.s),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No harm in adding a comment here saying that these two are regression tests for #17010.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not strictly true, just found it while investigating that. But more relevantly, this either means running all of CI again or an extra commit with ci skip - neither seems quite worth it to me, but happy to be told otherwise.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a units maintainer, @nstarman should do the final review.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case the git blame is good.
It would be good to find a way to move towards using Hypothesis, e.g. with these as @example

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using hypothesis on Time, where I found it rather painful and noisy (though obviously it does help correctness, but just felt that we ended up having to increase the tolerance slightly many times...).

@eerovaher eerovaher requested a review from nstarman September 17, 2024 18:11
Copy link
Copy Markdown
Member

@nstarman nstarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good.
Also looking forward to continued simplifications by @eerovaher to make units/forma a little easier to grok 😆.

# are a few special ones (\h is Planch constant) and three that end in 0.
def t_UNIT(t):
r"\%|°|\\h|((?!\d)\w)+"
r"%|°|\\h|(a|eps|mu)0|((?!\d)\w)+"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are just documentation, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! This docstring is used by the lexer as a regex to capture things that are to be considered units. Since !\d excludes digits, I had to put in the special case for a0, etc.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I recall this now from when I last looked through the formatting code in detail. I thought then as I think now that it's surprising to have docstrings with real code effects. It's not unreasonable in this circumstance, but perhaps a good refactor is to reformat this to be an instance of a class with a __call__ method.

class LexerFunc:
    regex: ...
    func: Callable[[T], R]
    __slots__ = ("regex", "func")
    def __call__(self, t): return self.func(t)

def _t_Unit(t):
    t.value = cls._get_unit(t)
    return t

t_Unit = LexerFunc(r"%|°|\\h|(a|eps|mu)0|((?!\d)\w)+", _t_Unit)

(units/format/ appears to be a good place to apply mypyc and significantly reduce python function overhead)

This is not to be done in this PR, of course!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(though we should probably move this to a separate Issue) @eerovaher, you've been upgrading units.format recently. Thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate issue may be a good idea - at the same time, we should also really update to the latest PLY - #13399

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requirement that the regexes must be in docstrings comes from PLY. Although we bundle it, it's still an external package, so we don't have full control over the API here.

Comment on lines +127 to +128
(["eps0/mu0"], cds.eps0 / cds.mu0),
(["a0.s"], cds.a0 * u.s),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case the git blame is good.
It would be good to find a way to move towards using Hypothesis, e.g. with these as @example

@pllim pllim modified the milestones: v6.1.4, v6.1.5 Sep 27, 2024
@mhvk
Copy link
Copy Markdown
Contributor Author

mhvk commented Sep 28, 2024

@nstarman - is this one good to go?

Copy link
Copy Markdown
Member

@nstarman nstarman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@nstarman nstarman modified the milestones: v6.1.5, v7.0.0 Sep 29, 2024
@nstarman nstarman merged commit 2611b0b into astropy:main Sep 29, 2024
@lumberbot-app
Copy link
Copy Markdown

lumberbot-app bot commented Sep 29, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout v6.1.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 2611b0be0535b5fce583ac7c15823964e504a185
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #17011: Recognize units written with unicode symbols also in composites'
  1. Push to a named branch:
git push YOURFORK v6.1.x:auto-backport-of-pr-17011-on-v6.1.x
  1. Create a PR against branch v6.1.x, I would have named this PR:

"Backport PR #17011 on branch v6.1.x (Recognize units written with unicode symbols also in composites)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mhvk mhvk deleted the units-composite-unicode branch September 29, 2024 21:32
mhvk pushed a commit to mhvk/astropy that referenced this pull request Sep 29, 2024
Recognize units written with unicode symbols also in composites

(cherry picked from commit 2611b0b)
pllim added a commit that referenced this pull request Sep 30, 2024
Backport PR #17011 on branch v6.1.x (Recognize units written with unicode symbols also  in composites)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Composite units with strange unicode symbols do not work

4 participants