Recognize supplementary (non-BMP) punctuations & symbols #1072

Merged
puzrin merged 9 commits into markdown-it:master from tats-u:non-bmp
May 20, 2026

Conversation

tats-u commented Dec 15, 2024
Contributor

Fixes #1071

@tats-u tats-u changed the title from "Recognize non-BMP punctuations & symbols" to "Recognize supplementary (non-BMP) punctuations & symbols" Mar 17, 2025
@tats-u tats-u marked this pull request as draft May 20, 2026 14:59
tats-u commented May 20, 2026
Contributor Author

If you merge #1074, I'm going to have to resolve a conflict once again.

puzrin commented May 20, 2026
Member

done

@tats-u tats-u marked this pull request as ready for review May 20, 2026 15:14
tats-u commented May 20, 2026
Contributor Author

I believe I have fixed all "PR notes" in #1071 (comment) other than bench/profiling.

tats-u commented May 20, 2026
Contributor Author

Can we use String.prototype.codePointAt?

puzrin commented May 20, 2026
Member

> Can we use String.prototype.codePointAt?

Everything with the word "point" in its name is for "fresh" browsers. Don't worry about that; I will polish it if needed. These days, with AI, that's not a problem at all. The right design is much more significant.

Need to go for 2-4 hours, then will return and continue.

Comment on lines +92 to +94
const codePoint = str.codePointAt(pos - 2)
// undefined > 0xffff = false, so we don't need extra check here
return codePoint > 0xffff ? codePoint : charCode

tats-u May 20, 2026
Contributor Author

FYI, version without codePointAt:

Suggested change
- const codePoint = str.codePointAt(pos - 2)
- // undefined > 0xffff = false, so we don't need extra check here
- return codePoint > 0xffff ? codePoint : charCode
+ const probablyHighSurrogate = str.charCodeAt(pos - 2)
+ // NaN >> 10 = 0, so we don't need extra check here.
+ // Both 0xD800 >> 10 and 0xDBFF >> 10 are 54. 0xD7FF >> 10 and 0xDC00 >> 10 are 53 and 55 respectively.
+ // Both 0xDC00 >> 10 and 0xDFFF >> 10 are 55, so charCode >> 10 === 55 means a low surrogate.
+ return probablyHighSurrogate >> 10 === 54 && charCode >> 10 === 55 ? (((probablyHighSurrogate & 0x3FF) << 10) | (charCode & 0x3FF)) + 0x10000 : charCode

Please press "Commit suggestion" if you prefer this.
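
For comparison, here is a minimal, self-contained sketch of the two approaches discussed in this thread. It is not the PR's actual code: the function names `prevCodePointA` and `prevCodePointB` are hypothetical, and both assume the caller passes `pos >= 1`. Each returns the code point that ends just before `pos`, combining a surrogate pair when one is present.

```javascript
// Hypothetical helpers for illustration only; not part of markdown-it.

// Variant A: relies on String.prototype.codePointAt.
function prevCodePointA (str, pos) {
  const charCode = str.charCodeAt(pos - 1)
  const codePoint = str.codePointAt(pos - 2)
  // codePointAt returns undefined for out-of-range indexes, and
  // undefined > 0xffff is false, so we fall through to charCode.
  return codePoint > 0xffff ? codePoint : charCode
}

// Variant B: only charCodeAt and bit arithmetic.
function prevCodePointB (str, pos) {
  const charCode = str.charCodeAt(pos - 1)
  const high = str.charCodeAt(pos - 2)
  // NaN >> 10 is 0, so an out-of-range index never looks like a surrogate.
  const isHigh = high >> 10 === 0x36 // 0xD800..0xDBFF
  const isLow = charCode >> 10 === 0x37 // 0xDC00..0xDFFF
  if (isHigh && isLow) {
    // Parenthesize the OR before adding: + binds tighter than |.
    return (((high & 0x3ff) << 10) | (charCode & 0x3ff)) + 0x10000
  }
  return charCode
}
```

Both variants return the plain UTF-16 code unit when the two preceding units do not form a valid surrogate pair, for example a lone high surrogate followed by an ordinary character.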

Member

Oh, seems I forgot too much https://github.com/markdown-it/markdown-it/blob/master/CHANGELOG.md#changed-1

v14 officially dropped support for ancient browsers. Now codePointAt and other functions of this kind are ok. I'm awfully sorry for the buzz.

Contributor Author

I'm glad to know that. Should I continue to use the homemade fromCodePoint?

I forgot to update the conditional expression. I have just updated it.

tats-u commented May 20, 2026
Contributor Author

> Need to go for 2-4 hours, then will return and continue.

Sure. I'm going to bed now. Good night.

tats-u commented May 20, 2026
Contributor Author

If you want me to revert 5803c7a, please tell me.

@puzrin puzrin merged commit 2c1e305 into markdown-it:master May 20, 2026
1 check passed
puzrin commented May 21, 2026
Member

Merged to move forward. I don't think it's rational to waste your time on code polishing instead of on really important things. I will tweak the rest.

What are the next steps? How can I help you?

I've seen your efforts in promoting global CJK support. That's perfect. And I certainly wish this package could provide good CJK support somehow (via plugin, for example, with direct reference in the readme).

puzrin added a commit that referenced this pull request May 21, 2026
@tats-u tats-u deleted the non-bmp branch May 21, 2026 03:36
tats-u commented May 21, 2026
Contributor Author

Thank you for the merge. Regarding CJK support, I don't currently see any immediate need to make further changes to the core markdown-it code unless we integrate CJK-Friendly Emphasis directly into markdown-it.

Even if markdown-it plans to natively support only CommonMark-compliant syntax, I believe merging this PR completes the preparation for CJK-Friendly Emphasis. Unlike in other Markdown parsers, markdown-it's implementation makes it easy to retrieve the characters two positions beyond a delimiter run, so future updates that follow CommonMark specification revisions shouldn't face significant obstacles.

I think introducing markdown-it-cjk-friendly in the README would be a good idea.

puzrin commented May 21, 2026
Member

FYI: 59955f2. Update to guarantee stable types and no exceptions with broken surrogate pairs.
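
As background on why broken surrogate pairs and unstable types matter here, the following is a minimal sketch of a defensive wrapper. It is not the code from the referenced commit: `safeFromCodePoint` is a hypothetical name, and returning an empty string for invalid input is an assumption made for illustration.

```javascript
// Hypothetical helper for illustration; not the code from the commit above.
function safeFromCodePoint (code) {
  // String.fromCodePoint throws a RangeError for non-integers (including
  // NaN and undefined) and for values outside 0..0x10FFFF. Validating first
  // keeps the return type a string and avoids exceptions.
  if (!Number.isInteger(code) || code < 0 || code > 0x10ffff) return ''
  return String.fromCodePoint(code)
}

// A lone (unpaired) high surrogate is read back as its own code unit,
// so callers can see surrogate values that are not valid characters.
const loneSurrogate = '\uD83D'.codePointAt(0)
```

Note that `String.fromCodePoint` itself accepts surrogate values such as `0xD83D` and returns a one-unit string, so a wrapper like this guards against exceptions but does not by itself reject unpaired surrogates.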

How I see the situation in the scope of this package

  • In an "ideal world," it would be nice to have integrated CJK support.
    • Right now - Impossible directly until CM spec lands it.
    • Possible via plugin(s) somehow + instruction at top of readme.
  • AFAIK, you promote CJK in multiple directions (packages). So, it will be best for all to route traffic to your repos, instead of trying to keep everything in this org.
  • I'm ok to accept PR with short block "how to add CJK support". Anytime.
    • Note, it's better to focus not on "strange words" like emphasis, but on action like "For CJK support install [this] and [this] plugins". Or "For CJK support read the [additional instruction]". With any link(s) you wish. AFAIK, strikethrough and breaks should be patched too.

Technical points:

  • The punctuation helper still lacks astral CJK support. That shouldn't be a problem for the punctuation character group, but please keep it in mind.
    // Currently without astral characters support.
    function isPunctChar (ch) {
      return ucmicro.P.test(ch) || ucmicro.S.test(ch)
    }
    function isPunctCharCode (code) {
      return isPunctChar(fromCodePoint(code))
    }
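
For illustration, an astral-aware check could be sketched with JavaScript's built-in Unicode property escapes. This is an assumption, not the package's approach: the regular expression below stands in for the `ucmicro.P` and `ucmicro.S` tables, and `isPunctCodePoint` is a hypothetical name.

```javascript
// Assumption: Unicode property escapes replace the uc.micro P and S tables.
// The u flag makes the pattern match whole code points, including astral ones.
const PUNCT_OR_SYMBOL = /^[\p{P}\p{S}]$/u

// Hypothetical helper: takes a code point (not a UTF-16 code unit).
function isPunctCodePoint (code) {
  return PUNCT_OR_SYMBOL.test(String.fromCodePoint(code))
}

console.log(isPunctCodePoint(0x21))    // '!' (punctuation)
console.log(isPunctCodePoint(0x1F600)) // emoji (symbol, astral)
console.log(isPunctCodePoint(0x20000)) // CJK ideograph (letter, astral)
```

Whether the built-in property data matches the Unicode version that uc.micro was generated from is not verified here, so results may differ for recently added characters.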

A side problem.

Another pain with CJK is proper auto-linking markdown-it/linkify-it#15 (comment)

Currently, I plan to extend punctuation with appropriate CJK chars, which should help, but not in all cases. If you can help to create a list of resolvable real-world patterns about links in CJK, that will help significantly.

tats-u referenced this pull request May 22, 2026
puzrin commented May 22, 2026
Member

Hey, I got emails with changelog comments, but I don't see those when opening links. For clarity:

  1. Astral characters === supplementary characters. Unofficial spoken form.
  2. I understand this PR is not directly about CJK. Just wished to show my respect to the "same author".

Anyway, if you feel something is not good enough, feel free to open a PR with an update; I will accept it.

tats-u commented May 22, 2026
Contributor Author
>   • I'm ok to accept PR with short block "how to add CJK support". Anytime.
>     • Note, it's better to focus not on "strange words" like emphasis, but on action like "For CJK support install [this] and [this] plugins". Or "For CJK support read the [additional instruction]". With any link(s) you wish. AFAIK, strikethrough and breaks should be patched too.

At most, we could add something like: "Note that this package and its associated standards alone may not provide adequate CJK language support. For better CJK support, you might want to consider using extensions like markdown-it-cjk-breaks or markdown-it-cjk-friendly." in the very last section of the Syntax extension in README.

> Another pain with CJK is proper auto-linking markdown-it/linkify-it#15 (comment)
>
> Currently, I plan to extend punctuation with appropriate CJK chars, which should help, but not in all cases. If you can help to create a list of resolvable real-world patterns about links in CJK, that will help significantly.

The ongoing nature of this issue is shamefully attributable to GitHub's failure to properly address it and its ultimate abandonment of the task. The GitHub-specific auto-linking specification appears haphazardly written compared to standard CommonMark feature specifications. We outside contributors should rigorously restandardize this in a far more satisfactory and consistent manner.

Actually, my first encounter with this problem came when a Streamdown maintainer asked me if I had anything to address it: vercel/streamdown#327

> Hey, I got emails with changelog comments, but I don't see those when opening links. For clarity:
>
>   1. Astral characters === supplementary characters. Unofficial spoken form.
>   2. I understand this PR is not directly about CJK. Just wished to show my respect to the "same author".

https://www.quora.com/Why-are-unicode-characters-outside-the-BMP-called-astral

I didn't know that. You can proceed with the release as it is now. My apologies for the confusion.

tats-u commented May 22, 2026
Contributor Author

I marked my comments there Resolved.

puzrin commented May 22, 2026
Member

> At most, we could add something like: "Note that this package and its associated standards alone may not provide adequate CJK language support. For better CJK support, you might want to consider using extensions like markdown-it-cjk-breaks or markdown-it-cjk-friendly." in the very last section of the Syntax extension in README.

I'm ok to place this on top, if that will be a short, direct instruction (not "highly-likely", and not a recommendation to use another package instead)

Since the CommonMark specification lacks CJK support, this package lacks it too.
But you can fix most of the issues by installing the plugin `<link_to_a_plugin_from_your_repo>`.

Why:

  • Users will get a direct path on how to solve a problem, without the need to think about details (no need to care separately about breaks, emphasis, and so on).
  • The maintenance/feedback will be split properly.
  • You will get the maximum possible traffic as an independent developer and CJK promoter.

Let's be honest: the chances of landing CJK support in the CM spec quickly are so-so. But this should not stop us :). That's exactly why this package was created: in the real world, a specification alone is not enough for practical needs.


> The ongoing nature of this issue is shamefully attributable to GitHub's failure to properly address it and its ultimate abandonment of the task...

Probably my explanation was not good, and there is some confusion. linkify-it is not related to GitHub/GFM in any way. It's an independent companion package, shipped with markdown-it, and collects all rules from scratch. But since markdown-it users report CJK problems, I'm interested in solving those as much as possible.

The common problem with all heuristic approaches is that they should avoid false positives. That's not about coding; it's about understanding the frequency of cases and the probability of errors. Since I don't read CJK websites, I don't know how to estimate both metrics, and all issues can stay stale for a long time. Not because I don't like fixing, but because I don't like risking breaking something else.

If you decide to participate, create a separate issue with the desired patterns to catch. Then I'll focus only on your [trusted] proposals, with high priority.

Development

Successfully merging this pull request may close these issues.

StateInline.prototype.scanDelims should recognize non-BMP punctuations & symbols

2 participants