Recognize supplementary (non-BMP) punctuations & symbols #1072
|
It's now your turn, |
|
If you merge #1074, I'm going to have to resolve a conflict once again. |
|
done |
|
I believe I have fixed all "PR notes" in #1071 (comment) other than bench/profiling. |
|
Can we use |
Everything with the word Need to go for 2-4 hours, then will return and continue. |
| const codePoint = str.codePointAt(pos - 2) | ||
| // undefined > 0xffff = false, so we don't need extra check here | ||
| return codePoint > 0xffff ? codePoint : charCode |
FYI, version without codePointAt:
| const codePoint = str.codePointAt(pos - 2) | |
| // undefined > 0xffff = false, so we don't need extra check here | |
| return codePoint > 0xffff ? codePoint : charCode | |
| const probablyHighSurrogate = str.charCodeAt(pos - 2) | |
| // NaN >> 10 = 0, so we don't need extra check here. | |
| // Both 0xD800 >> 10 and 0xDBFF >> 10 are 54. 0xD7FF >> 10 and 0xDC00 >> 10 are 53 and 55 respectively. | |
| return probablyHighSurrogate >> 10 === 54 ? ((((probablyHighSurrogate & 0x3FF) << 10) | (charCode & 0x3FF)) + 0x10000) : charCode
Please press "Commit suggestion" if you prefer this.
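For reference, here is a minimal standalone sketch of the surrogate-pair arithmetic discussed in this suggestion. The function name `codePointBefore` and its parameters are hypothetical and are not part of markdown-it. Unlike the suggestion, it also checks that the trailing code unit is a low surrogate, which is an assumption about the desired handling of broken pairs.

```javascript
// Hypothetical helper (not markdown-it's API): returns the code point of the
// character that ends just before index `pos`, decoding a surrogate pair by hand.
function codePointBefore (str, pos) {
  const low = str.charCodeAt(pos - 1)
  const high = str.charCodeAt(pos - 2)
  // NaN >> 10 is 0, so an out-of-range index never looks like a surrogate.
  // High surrogates (0xD800..0xDBFF) >> 10 give 54,
  // low surrogates (0xDC00..0xDFFF) >> 10 give 55.
  if (high >> 10 === 54 && low >> 10 === 55) {
    // `+` binds tighter than `|`, so the two 10-bit halves are combined
    // inside their own parentheses before 0x10000 is added.
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000
  }
  // Not a valid pair: fall back to the single code unit
  // (NaN if `pos - 1` is out of range).
  return low
}
```

With this sketch, `codePointBefore('a😀b', 3)` gives `0x1F600`, the same value as `'a😀b'.codePointAt(1)`, while a lone high surrogate followed by an ordinary character yields that character's code unit.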
Oh, seems I forgot too much https://github.com/markdown-it/markdown-it/blob/master/CHANGELOG.md#changed-1
v14 officially dropped support for ancient browsers. Now codePointAt and other functions of this kind are ok. I'm awfully sorry for the buzz.
I'm glad to know that. Should I continue to use the homemade fromCodePoint?
I forgot to update the conditional expression. I have just updated it.
Sure. I'm going to bed now. Good night. |
|
If you want me to revert 5803c7a, please tell me. |
|
Merged to move forward. I don't think it's rational to waste your time on code polishing instead of on really important things. I will tweak the rest. What are the next steps? How can I help you? I've seen your efforts in promoting global CJK support. That's perfect. And I certainly wish this package could provide good CJK support somehow (via plugin, for example, with direct reference in the readme). |
|
Thank you for the merge. Regarding CJK support, I don't currently see any immediate need to make further changes to the core markdown-it code unless we integrate CJK-Friendly Emphasis directly into markdown-it. Even if markdown-it plans to natively support only CommonMark-compliant extensions, I believe this PR's merge completes the preparation for CJK-Friendly Emphasis. Unlike other Markdown parsers, markdown-it's implementation makes it easy to retrieve characters two positions beyond a delimiter run, so future updates that follow CommonMark specification revisions shouldn't face significant obstacles. I think introducing markdown-it-cjk-friendly in the README would be a good idea. |
|
FYI: 59955f2. Update to guarantee stable types and no exceptions with broken surrogate pairs. How I see the situation in the scope of this package
Technical points:
A side problem: another pain with CJK is proper auto-linking, see markdown-it/linkify-it#15 (comment). Currently, I plan to extend punctuation with appropriate CJK chars, which should help, but not in all cases. If you can help create a list of resolvable real-world patterns for links in CJK text, that will help significantly. |
|
Hey, I got emails with changelog comments, but I don't see those when opening links. For clarity:
Anyway, if you feel something is not good enough, feel free to open a PR with an update; I will accept it. |
At most, we could add something like: "Note that this package and its associated standards alone may not provide adequate CJK language support. For better CJK support, you might want to consider using extensions like markdown-it-cjk-breaks or markdown-it-cjk-friendly." in the very last section of the Syntax extension in README.
The ongoing nature of this issue is shamefully attributable to GitHub's failure to properly address it and its ultimate abandonment of the task. The GitHub-specific auto-linking specification appears haphazardly written compared to standard CommonMark feature specifications. We outside contributors should rigorously restandardize this in a far more satisfactory and consistent manner.
Actually, my first encounter with this problem came when a Streamdown maintainer asked me if I had anything to address it: vercel/streamdown#327
https://www.quora.com/Why-are-unicode-characters-outside-the-BMP-called-astral I didn't know that. You can proceed with the release as it is now. My apologies for the confusion. |
|
I marked my comments there Resolved. |
I'm ok to place this on top, if that will be a short, direct instruction (not "highly-likely", and not a recommendation to use another package instead) Why:
Let's be honest - the chances of landing CJK support in the CM spec quickly are so-so. But this should not stop us :). That's exactly why this package was created - because in the real world, a specification alone is not enough for practical needs.
Probably my explanation was not good, and there is some confusion. The common problem with all heuristic approaches is that they should avoid false positives. That's not about coding; it's about understanding the frequency of cases and the probability of errors. Since I don't read CJK websites, I don't know how to estimate both metrics, and all issues can stay stale for a long time. Not because I don't like fixing, but because I don't like risking breaking something else. If you decide to participate, create a separate issue with the desired patterns to catch. Then I'll focus only on your [trusted] proposals, with high priority. |
Fixes #1071