-
-
Notifications
You must be signed in to change notification settings - Fork 227
Description
Abstract: this code isn't up to unicode spec, partially incorrect (and can probably be replaced by a simple combination of lowercase/uppercase in modern js engines).
So I was looking at this test the other day: commonmark/commonmark-spec#582
And I was trying to figure out what's the difference between "naive" lowercasing and actual Unicode case folding as implemented in commonmark.
So I checked it with the spec published here:
http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
And surely enough, I found a difference:
AB73; C; 13A3; # CHEROKEE SMALL LETTER O
They are corresponding lower and upper case letter, even javascript knows that:
$ node -e 'console.log("\uab73".toUpperCase().codePointAt(0).toString(16));'
13a3
$ node -e 'console.log("\u13a3".toLowerCase().codePointAt(0).toString(16));'
ab73
But commonmark.js doesn't treat them as the same character:
$ echo -e '[\u13a3]\n\n[\uab73]: /test' | ./bin/commonmark
<p>[Ꭳ]</p>
Funnily enough, original implementation (https://github.com/mathiasbynens/swapcase) supports this:
$ node -e "console.log(require('swapcase')('\uab73').codePointAt(0).toString(16))"
13a3
So it looks like what we're having here is not simply outdated, but was never correct to begin with.
It is my estimation that around 330 other characters in current implementation are processed incorrectly, as opposed to 170 in original swapcase, which are most likely due to changes in unicode between now and 5 years ago when it was published.
I have a better solution. Use simple and naive str.toLowerCase(). It only gets 125 codepoints wrong, which is better.
Or use str.toUpperCase(). It is actually almost perfect, as it only gets 6 characters incorrectly (list of exceptions: İ, ϴ, ẞ, Ω, K, Å - those are already uppercased, but have differently uppercased versions).
If we combine those approaches, str.toLowerCase().toUpperCase() is observed to normalize ALL characters in the same way as case fold spec (as tested on node 12, earlier v8 versions have had some issues).
So... why not use that, and save on 30kb of javascript code and its future maintenance time?