Skip to content

Incorrect implementation of unicode case folding #168

@rlidwka

Description

@rlidwka

Abstract: this code isn't up to unicode spec, partially incorrect (and can probably be replaced by a simple combination of lowercase/uppercase in modern js engines).


So I was looking at this test the other day: commonmark/commonmark-spec#582

And I was trying to figure out what's the difference between "naive" lowercasing and actual Unicode case folding as implemented in commonmark.

So I checked it with the spec published here:
http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt

And surely enough, I found a difference:

AB73; C; 13A3; # CHEROKEE SMALL LETTER O

They are corresponding lower and upper case letter, even javascript knows that:

$ node -e 'console.log("\uab73".toUpperCase().codePointAt(0).toString(16));'
13a3
$ node -e 'console.log("\u13a3".toLowerCase().codePointAt(0).toString(16));'
ab73

But commonmark.js doesn't treat them as the same character:

$ echo -e '[\u13a3]\n\n[\uab73]: /test' | ./bin/commonmark
<p>[Ꭳ]</p>

Funnily enough, original implementation (https://github.com/mathiasbynens/swapcase) supports this:

$ node -e "console.log(require('swapcase')('\uab73').codePointAt(0).toString(16))"
13a3

So it looks like what we're having here is not simply outdated, but was never correct to begin with.

It is my estimation that around 330 other characters in current implementation are processed incorrectly, as opposed to 170 in original swapcase, which are most likely due to changes in unicode between now and 5 years ago when it was published.


I have a better solution. Use simple and naive str.toLowerCase(). It only gets 125 codepoints wrong, which is better.

Or use str.toUpperCase(). It is actually almost perfect, as it only gets 6 characters incorrectly (list of exceptions: İ, ϴ, ẞ, Ω, K, Å - those are already uppercased, but have differently uppercased versions).

If we combine those approaches, str.toLowerCase().toUpperCase() is observed to normalize ALL characters in the same way as case fold spec (as tested on node 12, earlier v8 versions have had some issues).

So... why not use that, and save on 30kb of javascript code and its future maintenance time?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions