Editorial: Remove special-casing of U+200C and U+200D #3074

Merged
ljharb merged 1 commit into tc39:main from mathiasbynens:unicode-15.1.0 on Feb 21, 2024

Conversation

@mathiasbynens
Member

@mathiasbynens mathiasbynens commented May 25, 2023

Unicode v15.1.0 makes both U+200C and U+200D `ID_Continue` characters, meaning we no longer need to explicitly special-case them for them to match `IdentifierPart`.

Fixes #3073
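
As a rough illustration of the change described above, the following sketch approximates the `IdentifierPart` character check before and after this PR using Unicode property escapes. The names `oldIdentifierPart`, `newIdentifierPart`, and `compare` are hypothetical helpers, not spec text, and the sketch ignores `\u` escape sequences inside identifiers. What `\p{ID_Continue}` matches depends on the Unicode version the engine ships, so the `after` result for U+200C and U+200D is only expected to be `true` on engines with Unicode 15.1.0 or later data.

```javascript
// Sketch only: approximates the IdentifierPart character check.
// All names here are hypothetical helpers, not part of the specification.

// Before this PR: ID_Continue, plus $, plus explicit ZWNJ and ZWJ.
const oldIdentifierPart = /^(?:\p{ID_Continue}|\$|\u200C|\u200D)$/u;

// After this PR: ID_Continue plus $ only.
const newIdentifierPart = /^(?:\p{ID_Continue}|\$)$/u;

// Report whether a code point matches each version of the check.
function compare(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  return {
    codePoint: 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0'),
    before: oldIdentifierPart.test(ch),
    after: newIdentifierPart.test(ch),
  };
}

// The `after` value depends on the engine's Unicode data version.
for (const cp of [0x200C, 0x200D]) {
  console.log(compare(cp));
}
```

If both fields are `true` for both code points, the explicit special-casing is redundant on that engine, which is the premise of this PR.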

Member

@gibson042 gibson042 left a comment

This is excellent, and I have confirmed its consistency with the Unicode Standard, Version 15.1.0 draft. But we probably shouldn't merge until it is actually released.

@mathiasbynens added the "needs data" label (This PR needs more information; such as web compatibility data, “web reality” (what all engines do)…) on May 25, 2023
@mathiasbynens
Member Author

> This is excellent, and I have confirmed its consistency with the Unicode Standard, Version 15.1.0 draft. But we probably shouldn't merge until it is actually released.

Indeed.

There is no “do not merge (yet)” issue label (should we add that?), so I’ve added “needs data” for now.

@bakkot
Member

bakkot commented May 25, 2023

> There is no “do not merge (yet)” issue label (should we add that?)

You could mark it as a draft? But the comment will do fine also.

<p>|IdentifierName| and |ReservedWord| are tokens that are interpreted according to the Default Identifier Syntax given in Unicode Standard Annex #31, Identifier and Pattern Syntax, with some small modifications. |ReservedWord| is an enumerated subset of |IdentifierName|. The syntactic grammar defines |Identifier| as an |IdentifierName| that is not a |ReservedWord|. The Unicode identifier grammar is based on character properties specified by the Unicode Standard. The Unicode code points in the specified categories in the latest version of the Unicode Standard must be treated as in those categories by all conforming ECMAScript implementations. ECMAScript implementations may recognize identifier code points defined in later editions of the Unicode Standard.</p>
<emu-note>
- <p>This standard specifies specific code point additions: U+0024 (DOLLAR SIGN) and U+005F (LOW LINE) are permitted anywhere in an |IdentifierName|, and the code points U+200C (ZERO WIDTH NON-JOINER) and U+200D (ZERO WIDTH JOINER) are permitted anywhere after the first code point of an |IdentifierName|.</p>
+ <p>This standard specifies specific code point additions: U+0024 (DOLLAR SIGN) and U+005F (LOW LINE) are permitted anywhere in an |IdentifierName|.</p>
Member

Underscores are already a part of ID_Continue. Let's call out just our divergences here.

Suggested change:
- <p>This standard specifies specific code point additions: U+0024 (DOLLAR SIGN) and U+005F (LOW LINE) are permitted anywhere in an |IdentifierName|.</p>
+ <p>This standard specifies specific code point additions: U+005F (LOW LINE) is permitted as the first code point of an |IdentifierName| and U+0024 (DOLLAR SIGN) is permitted anywhere in an |IdentifierName|.</p>

Member

That's not related to this PR, and shouldn't be part of it.


Separately, I think that's more confusing than the current wording (and also it is unrelated to this PR).

With the current wording, you might get misled about what's in Unicode, but you won't get misled about what's in this specification. Whereas with your suggested wording it's easy to read it as suggesting that LOW LINE is only permitted as the first code point in an IdentifierName, as opposed to being permitted anywhere.

And it's more important that this clearly convey what's in this specification than what's in Unicode.

We could probably find some other wording that is unambiguous about both, but maybe not without adding more complexity than is warranted.
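
To make the distinction in this thread concrete, here is a small sketch that checks how Unicode itself classifies the two code points under discussion. The helper names `isIDStart` and `isIDContinue` are hypothetical. U+005F (LOW LINE) has the `ID_Continue` property but not `ID_Start`, and U+0024 (DOLLAR SIGN) has neither, so the only ECMAScript divergences from Unicode are `_` as a first code point and `$` anywhere.

```javascript
// Sketch only: hypothetical helpers wrapping Unicode property escapes.
const isIDStart = (ch) => /^\p{ID_Start}$/u.test(ch);
const isIDContinue = (ch) => /^\p{ID_Continue}$/u.test(ch);

// Show how Unicode classifies the two code points the note calls out.
for (const ch of ['_', '$']) {
  console.log(JSON.stringify(ch), {
    idStart: isIDStart(ch),
    idContinue: isIDContinue(ch),
  });
}
```

This is why the existing wording ("permitted anywhere") is accurate about the specification even though part of it overlaps with what Unicode already allows.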

@ljharb marked this pull request as draft on May 25, 2023, 18:31
@mathiasbynens marked this pull request as ready for review on September 15, 2023, 06:27
@mathiasbynens
Member Author

Unicode 15.1.0 was released earlier this week. Marking this PR as officially ready for review.

@bakkot added the "editor call" label (to be discussed in the next editor call) on Sep 15, 2023
@michaelficarra removed the "editor call" label on Sep 19, 2023
@michaelficarra removed the "needs data" label on Sep 19, 2023
@michaelficarra added the "ready to merge" label (Editors believe this PR needs no further reviews, and is ready to land.) on Feb 21, 2024
Unicode v15.1.0 makes both U+200C and U+200D `ID_Continue` characters, meaning we no longer need to explicitly special-case them for them to match `IdentifierPart`.

Issue: tc39#3073

Labels

ready to merge Editors believe this PR needs no further reviews, and is ready to land.

Development

Successfully merging this pull request may close these issues.

[Informative] Observable changes because of Unicode 15.1

5 participants