Recognize supplementary (non-BMP) punctuations & symbols by tats-u · Pull Request #1072 · markdown-it/markdown-it

tats-u · 2024-12-15T07:25:50Z

tats-u · 2026-01-15T14:06:53Z

marked has already fixed this issue:

https://marked.js.org/demo/?text=a**a%E2%88%87**a%0A%0Aa**%E2%88%87a**a%0A%0Aa**a%F0%9D%9C%B5**a%0A%0Aa**%F0%9D%9C%B5a**a&options=%7B%0A%20%22async%22%3A%20false%2C%0A%20%22breaks%22%3A%20false%2C%0A%20%22extensions%22%3A%20null%2C%0A%20%22gfm%22%3A%20true%2C%0A%20%22hooks%22%3A%20null%2C%0A%20%22pedantic%22%3A%20false%2C%0A%20%22silent%22%3A%20false%2C%0A%20%22tokenizer%22%3A%20null%2C%0A%20%22walkTokens%22%3A%20null%0A%7D&version=17.0.1

It's now your turn, markdown-it.
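The demo URL above packs its four test inputs into the percent-encoded `text` query parameter. A minimal sketch of decoding them to see what is being tested; the variable names are illustrative and not part of marked or markdown-it:

```javascript
// The `text` query parameter copied from the marked demo URL above.
const encoded = 'a**a%E2%88%87**a%0A%0Aa**%E2%88%87a**a%0A%0Aa**a%F0%9D%9C%B5**a%0A%0Aa**%F0%9D%9C%B5a**a'

// Paragraphs are separated by a blank line, so split on two newlines.
const cases = decodeURIComponent(encoded).split('\n\n')

for (const text of cases) {
  // Spreading a string iterates by code point, so a surrogate pair is one item.
  const symbol = [...text].find(ch => ch.codePointAt(0) > 0x7f)
  const hex = symbol.codePointAt(0).toString(16).toUpperCase()
  console.log(`${text}  (symbol U+${hex}, ${symbol.length} UTF-16 code unit(s))`)
}
```

The first two cases use ∇ (U+2207), which lies in the BMP and occupies one UTF-16 code unit. The last two use 𝜵 (U+1D735), a supplementary-plane symbol stored as a surrogate pair, which is the case this PR targets.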

tats-u · 2026-05-20T15:01:52Z

If you merge #1074, I'm going to have to resolve a conflict once again.

puzrin · 2026-05-20T15:09:10Z

done

tats-u · 2026-05-20T15:20:31Z

I believe I have fixed all "PR notes" in #1071 (comment) other than bench/profiling.

tats-u · 2026-05-20T15:22:08Z

Can we use String.prototype.codePointAt?

puzrin · 2026-05-20T15:28:57Z

> Can we use String.prototype.codePointAt?

Everything with the word "point" in it is for "fresh" browsers. Don't worry about that; I will polish it if needed. These days, with AI, that's not a problem at all. The right design is much more significant.

Need to go for 2-4 hours, then will return and continue.

tats-u · 2026-05-20T15:40:30Z

+  const codePoint = str.codePointAt(pos - 2)
+  // undefined > 0xffff = false, so we don't need extra check here
+  return codePoint > 0xffff ? codePoint : charCode


FYI, version without codePointAt:

Suggested change

-  const codePoint = str.codePointAt(pos - 2)
-  // undefined > 0xffff = false, so we don't need extra check here
-  return codePoint > 0xffff ? codePoint : charCode
+  const probablyHighSurrogate = str.charCodeAt(pos - 2)
+  // NaN >> 10 = 0, so we don't need extra check here.
+  // Both 0xD800 >> 10 and 0xDBFF >> 10 are 54. 0xD7FF >> 10 and 0xDC00 >> 10 are 53 and 55 respectively.
+  // The OR is parenthesized before adding 0x10000 because `+` binds tighter than `|`.
+  return probablyHighSurrogate >> 10 === 54 ? (((probablyHighSurrogate & 0x3FF) << 10) | (charCode & 0x3FF)) + 0x10000 : charCode

Please press "Commit suggestion" if you prefer this.
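To compare the two variants side by side, here is a minimal sketch that wraps each in a function. The function names are hypothetical, `str` and `pos` mirror the snippet above, and `charCode` is assumed to be the code unit at `pos - 1`. The explicit low-surrogate check in the second function is also an assumption added here so that both variants agree on broken pairs; the surrounding code in the PR may already guarantee it.

```javascript
// Variant using the built-in codePointAt.
function prevWithCodePointAt (str, pos) {
  const charCode = str.charCodeAt(pos - 1)
  const codePoint = str.codePointAt(pos - 2)
  // undefined > 0xffff is false, so an out-of-range index falls through.
  return codePoint > 0xffff ? codePoint : charCode
}

// Variant using only charCodeAt and bit arithmetic.
function prevWithBitMath (str, pos) {
  const charCode = str.charCodeAt(pos - 1)
  const high = str.charCodeAt(pos - 2)
  // NaN >> 10 is 0, so an out-of-range index needs no separate check.
  const isHigh = high >> 10 === 54     // 0xD800..0xDBFF
  const isLow = charCode >> 10 === 55  // 0xDC00..0xDFFF (assumption added here)
  if (!isHigh || !isLow) return charCode
  // Combine the two 10-bit halves first, then add the 0x10000 offset.
  // `+` binds tighter than `|`, so the inner parentheses are required.
  return (((high & 0x3FF) << 10) | (charCode & 0x3FF)) + 0x10000
}

console.log(prevWithCodePointAt('a\u{1D735}', 3).toString(16)) // 1d735
console.log(prevWithBitMath('a\u{1D735}', 3).toString(16))     // 1d735
console.log(prevWithBitMath('\u{20000}', 2).toString(16))      // 20000
```

The U+20000 case is worth keeping in any test: without the inner parentheses, the offset would be OR-ed into a bit that the high half already sets, and the result would come out as 0x10000.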

puzrin · 2026-05-20

Oh, it seems I forgot too much: https://github.com/markdown-it/markdown-it/blob/master/CHANGELOG.md#changed-1

v14 officially dropped support for ancient browsers. Now codePointAt and other functions of this kind are ok. I'm awfully sorry for the buzz.

tats-u · 2026-05-20

I'm glad to know that. Should I continue to use the homemade fromCodePoint?

I forgot to update the conditional expression. I have just updated it.

tats-u · 2026-05-20T15:42:28Z

> Need to go for 2-4 hours, then will return and continue.

Sure. I'm going to bed now. Good night.

tats-u · 2026-05-20T23:12:14Z

If you want me to revert 5803c7a, please tell me.

puzrin · 2026-05-21T00:04:50Z

Merged to move forward. I don't think it's rational to waste your time on code polishing instead of on really important things. I will tweak the rest.

What are the next steps? How can I help you?

I've seen your efforts in promoting global CJK support. That's perfect. And I certainly wish this package could provide good CJK support somehow (via plugin, for example, with direct reference in the readme).

tats-u · 2026-05-21T03:44:31Z

Thank you for the merge. Regarding CJK support, I don't currently see any immediate need to make further changes to the core markdown-it code unless we integrate CJK-Friendly Emphasis directly into markdown-it.

Even if markdown-it plans to natively support only CommonMark-compliant syntax, I believe merging this PR completes the groundwork for CJK-Friendly Emphasis. Unlike other Markdown parsers, markdown-it's implementation makes it easy to retrieve the characters two positions away from a delimiter run, so future updates that follow CommonMark specification revisions shouldn't face significant obstacles.

I think introducing markdown-it-cjk-friendly in the README would be a good idea.

puzrin · 2026-05-21T05:15:19Z

FYI: 59955f2. Update to guarantee stable types and no exceptions with broken surrogate pairs.
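As an illustration of what "stable types and no exceptions" can mean around surrogates, here is a minimal sketch. It is not the code from commit 59955f2; both helper names, the `-1` fallback, and the choice of U+FFFD as a replacement are assumptions made for this example.

```javascript
// Always returns a number: `fallback` when `index` is out of range,
// instead of the `undefined` that codePointAt returns.
function codePointAtOr (str, index, fallback = -1) {
  const cp = str.codePointAt(index)
  return cp === undefined ? fallback : cp
}

// Always returns a string and never throws. String.fromCodePoint raises a
// RangeError for values outside 0..0x10FFFF or non-integers, so those (and
// lone surrogate values) are mapped to U+FFFD REPLACEMENT CHARACTER here.
function safeFromCodePoint (code) {
  const valid = Number.isInteger(code) && code >= 0 && code <= 0x10FFFF
  const isSurrogate = code >= 0xD800 && code <= 0xDFFF
  return valid && !isSurrogate ? String.fromCodePoint(code) : '\uFFFD'
}

console.log(codePointAtOr('a', 5))                      // -1
console.log(codePointAtOr('\uD835', 0).toString(16))    // d835 (lone high surrogate)
console.log(safeFromCodePoint(0x110000) === '\uFFFD')   // true
```

A lone surrogate is returned by `codePointAt` as its own code unit value, so callers still receive a number and can classify it as "not punctuation" without special handling.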

How I see the situation in the scope of this package

- In an "ideal world," it would be nice to have integrated CJK support.
  - Right now: impossible directly until the CM spec lands it.
  - Possible via plugin(s) somehow, plus an instruction at the top of the readme.
- AFAIK, you promote CJK in multiple directions (packages). So it will be best for all to route traffic to your repos, instead of trying to keep everything in this org.
- I'm ok to accept a PR with a short block "how to add CJK support". Anytime.
  - Note, it's better to focus not on "strange words" like emphasis, but on an action like "For CJK support install [this] and [this] plugins". Or "For CJK support read the [additional instruction]". With any link(s) you wish. AFAIK, strikethrough and breaks should be patched too.

Technical points:

Punctuation helper still stays without Astral CJK support. It shouldn't be a problem for the punctuation character group, but please keep in mind.

markdown-it/lib/common/utils.mjs

Lines 177 to 184 in 7769621

    
    // Currently without astral characters support.
    function isPunctChar (ch) {
      return ucmicro.P.test(ch) || ucmicro.S.test(ch)
    }

    function isPunctCharCode (code) {
      return isPunctChar(fromCodePoint(code))
    }
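For comparison, a minimal sketch of an astral-aware check. This is not how the helper above is implemented: it swaps the `ucmicro` patterns for the built-in Unicode property escapes (ES2018, `u` flag), and the function name is hypothetical.

```javascript
// With the `u` flag the character class matches a whole code point,
// so a surrogate pair is tested as one character.
const PUNCT_OR_SYMBOL = /^[\p{P}\p{S}]$/u

// Assumes `code` is a valid code point (an integer in 0..0x10FFFF);
// String.fromCodePoint throws a RangeError otherwise.
function isPunctCodePoint (code) {
  return PUNCT_OR_SYMBOL.test(String.fromCodePoint(code))
}

console.log(isPunctCodePoint(0x2207))   // true: ∇ (BMP, category Sm)
console.log(isPunctCodePoint(0x1D735))  // true: 𝜵 (supplementary, category Sm)
console.log(isPunctCodePoint(0x20000))  // false: a CJK ideograph (category Lo)
// Half of a surrogate pair on its own is never punctuation:
console.log(PUNCT_OR_SYMBOL.test('\uD835')) // false
```

The last line shows why a check that only ever sees one UTF-16 code unit cannot recognize supplementary symbols: each half of the pair is a surrogate, not a symbol.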

A side problem.

Another pain with CJK is proper auto-linking markdown-it/linkify-it#15 (comment)

Currently, I plan to extend punctuation with appropriate CJK chars, which should help, but not in all cases. If you can help to create a list of resolvable real-world patterns about links in CJK, that will help significantly.

puzrin · 2026-05-22T14:11:25Z

Hey, I got emails with changelog comments, but I don't see those when opening links. For clarity:

Astral characters === supplementary characters. Unofficial spoken form.
I understand this PR is not directly about CJK. Just wished to show my respect to the "same author".

Anyway, if you feel something is not good enough, feel free to PR update; I will accept it.

tats-u · 2026-05-22T15:07:57Z

> I'm ok to accept PR with short block "how to add CJK support". Anytime.
>
> Note, it's better to focus not on "strange words" like emphasis, but on action like "For CJK support install [this] and [this] plugins". Or "For CJK support read the [additional instruction]". With any link(s) you wish. AFAIK, strikethrough and breaks should be patched too.

At most, we could add something like: "Note that this package and its associated standards alone may not provide adequate CJK language support. For better CJK support, you might want to consider using extensions like markdown-it-cjk-breaks or markdown-it-cjk-friendly." in the very last section of the Syntax extension in README.

> Another pain with CJK is proper auto-linking markdown-it/linkify-it#15 (comment)
>
> Currently, I plan to extend punctuation with appropriate CJK chars, which should help, but not in all cases. If you can help to create a list of resolvable real-world patterns about links in CJK, that will help significantly.

The fact that this issue is still ongoing is, shamefully, attributable to GitHub's failure to address it properly and its eventual abandonment of the task. The GitHub-specific autolink specification appears haphazardly written compared with the specifications of standard CommonMark features. We outside contributors should rigorously restandardize it in a far more satisfactory and consistent manner.

- Exclude Chinese Punctuation in GFM Autolinks github/cmark-gfm#83 (An absolute joke that it's been ignored for 8 years! GitHub folks are being so lazy!)
- Recognize following non-ASCII punctuation in extended www autolink github/cmark-gfm#377
- Exclude non-ASCII path or backslash in extended www autolink github/cmark-gfm#384

Actually, my first encounter with this problem came when a Streamdown maintainer asked me if I had anything to address it: vercel/streamdown#327

> Hey, I got emails with changelog comments, but I don't see those when opening links. For clarity:
>
> Astral characters === supplementary characters. Unofficial spoken form.
>
> I understand this PR is not directly about CJK. Just wished to show my respect to the "same author".

https://www.quora.com/Why-are-unicode-characters-outside-the-BMP-called-astral

I didn't know that. You can proceed with the release as it is now. My apologies for the confusion.

tats-u · 2026-05-22T15:09:44Z

I marked my comments there Resolved.

puzrin · 2026-05-22T20:10:19Z

> At most, we could add something like: "Note that this package and its associated standards alone may not provide adequate CJK language support. For better CJK support, you might want to consider using extensions like markdown-it-cjk-breaks or markdown-it-cjk-friendly." in the very last section of the Syntax extension in README.

I'm ok to place this on top, if that will be a short, direct instruction (not "highly-likely", and not a recommendation to use another package instead)

Since the CommonMark specification lacks CJK support, this package lacks it too.
But you can fix most of the issues by installing the plugin `<link_to_a_plugin_from_your_repo>`.

Why:

- Users will get a direct path on how to solve a problem, without the need to think about details (no need to care separately about breaks, emphasis, and so on).
- The maintenance/feedback will be split properly.
- You will get the maximum possible traffic as an independent developer and CJK promoter.

Let's be honest: the chances of landing CJK support in the CM spec quickly are so-so. But this should not stop us :). That's exactly why this package was created: in the real world, the specification alone is not enough for practical needs.

> The ongoing nature of this issue is shamefully attributable to GitHub's failure to properly address it and ultimately abandon the task...

Probably my explanation was not good, and there is some confusion. linkify-it is not related to GitHub/GFM in any way. It's an independent companion package, shipped with markdown-it, and collects all rules from scratch. But since markdown-it users report CJK problems, I'm interested in solving those as much as possible.

The common problem with all heuristic approaches is that they should avoid false positives. That's not about coding; it's about understanding the frequency of cases and the probability of errors. Since I don't read CJK websites, I don't know how to estimate either metric, so issues can stay stale for a long time. Not because I don't like fixing, but because I don't like risking breaking something else.

If you decide to participate, create a separate issue with the desired patterns to catch. Then I'll focus only on your [trusted] proposals, with high priority.

tats-u added 4 commits December 15, 2024 16:25

Recognize non-BMP punctuations & symbols

5e78c00

Add comment

7a6d58a

Fix comment

a091ed9

codePointAt is excessive

1321d2e

tats-u mentioned this pull request Jan 22, 2025

Recognize non-BMP punctuation & symbols (to prepare for CJK support in the future) micromark/micromark#189

Open

4 tasks

tats-u changed the title ~~Recognize non-BMP punctuations & symbols~~ Recognize supplementary (non-BMP) punctuations & symbols Mar 17, 2025

tats-u mentioned this pull request Mar 23, 2025

Add supplementary (non-BMP) currency symbol in Unicode symbol example commonmark/commonmark-spec#794

Merged

Tweak comment

e2d36b5

Merge branch 'master' into non-bmp

13d2890

tats-u marked this pull request as draft May 20, 2026 14:59

Use homemade fromCodePoint

5803c7a

Merge branch 'master' into non-bmp

9651fc1

Move getLastCharCode to top level

4238595

tats-u marked this pull request as ready for review May 20, 2026 15:14

tats-u commented May 20, 2026

View reviewed changes

puzrin merged commit 2c1e305 into markdown-it:master May 20, 2026
1 check passed

puzrin added a commit that referenced this pull request May 21, 2026

Polish PRs #1072, #1074

59955f2

tats-u deleted the non-bmp branch May 21, 2026 03:36

tats-u referenced this pull request May 22, 2026

Changelog update

c471b55

