Conversation
k4b7
left a comment
Nice start... requesting more tests and a polyfill for String.prototype.normalize.
});

it("should parse Latin-1 outside \\text{}", function() {
    expect('ÀàÇçÉéÏïÖöÛû').toParse();
Can we keep these tests but only check for Çç? Eventually we'll want to test for œ, Œ, æ, Æ, and ß when they get added.
This PR actually removes support for Çç. I believe they were added prematurely when forcing support for Latin-1. I just verified that none of the Latin-1 glyphs, Çç included, are in the KaTeX fonts. (I believe this actually makes sense, and the purpose of this PR is to map them into the accent macros, like LaTeX would.) Here's an example of incorrect looking Çç vs Cc in the current KaTeX:
I opened https://github.com/KaTeX/katex-fonts/issues/2 to track the cedilla issue.
expect("ÀÁÂÃÄÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäèéêëìíîïñòóôõöùúûüýÿ")
    .toParseLike(
        "\\grave A\\acute A\\hat A\\tilde A\\ddot A" +
        "\\grave E\\acute E\\hat E\\ddot E" +
I guess expanding to \grave A makes sense. Can you add a MathML fixture test? I'm curious to see if we recombine the accent with the letter when we render to MathML.
They render exactly how accent commands do, which I learned is via <mover accent="true">. I think this makes sense... Test added.
const commentRegexString = "%[^\n]*[\n]";
const controlWordRegexString = "\\\\[a-zA-Z@]+";
const controlSymbolRegexString = "\\\\[^\uD800-\uDFFF]";
const combiningDiacriticalMarkString = "[\u0300-\u036f]";
This contains many more characters than we support. Where do we generate an error if someone uses an unsupported combining mark? Maybe add a test for this.
I added an error test (in error-spec.js because I wanted toFailWithParseError). Incidentally, most of my tests are in katex-spec.js because I wanted toParseLike, but they would make more sense in unicode-spec.js... Not sure how to best share code between these test modules.
I've created #1021 to extract helper functions into test-utils.js. I can do that this weekend and then we could rebase this with the tests in the proper suite.
src/Parser.js
Outdated
}
// Decompose symbol into base and combining characters
// if the combined symbol is not recognized.
text = text.normalize('NFD');
This is super cool. I wish I would've known about this sooner. There's no support for this in IE and Safari only got this in version 10. We might consider using https://github.com/walling/unorm instead.
Ah, I should have checked the browser compatibility matrix for String.normalize. The only problem with a polyfill is that it increases the katex.min.js build by 60%, from 226,836 bytes to 363,641 bytes (as any polyfill requires a huge table of Unicode symbols).
So I think this raises a serious question of whether we should:
- Include the polyfill.
- Not include the polyfill, and support Unicode in all supported browsers except IE.
- Rewrite the PR to avoid String.normalize, at least in the built form. We actually only want symbols that expand to accents within our supported set, so the symbol table we could generate at build time could be smaller than unorm's.
- Drop this PR altogether (though this is probably a desirable feature, and IE probably isn't evolving).

I think option 2 or 3 is best. Thoughts? I can look into option 3.
It's in Edge so my vote would be for option 2. If people need to support IE then they can add the polyfill themselves. We should document this requirement.
Sounds good. That's the easiest! I'll add a fallback when normalize isn't implemented, maybe with a warning, and add documentation.
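A minimal sketch of such a fallback, under the assumption that feature detection plus a pass-through is acceptable (the helper name is hypothetical, not the actual Parser.js code):

```javascript
// Hypothetical helper, not the actual KaTeX code: decompose via NFD when
// the runtime supports String.prototype.normalize, else warn and pass through.
function normalizeIfAvailable(text) {
    if (typeof String.prototype.normalize === "function") {
        return text.normalize("NFD");
    }
    // Without normalize (e.g. IE), precomposed characters such as "é"
    // simply won't be decomposed into base letter + combining accent.
    console.warn("String.prototype.normalize is not supported; " +
        "accented Unicode characters may not parse");
    return text;
}

// "é" (U+00E9) becomes "e" + U+0301 on runtimes with normalize support.
console.log(normalizeIfAvailable("\u00e9").length);  // 2
```

On an IE-like runtime the input string comes back untouched, so unsupported characters fail later with an ordinary parse error rather than a crash.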
src/Parser.js
Outdated
} else if (cjkRegex.test(text)) {
    this.consume();
    return newArgument(
        new ParseNode("textord", text, this.mode, nucleus));
If we bail here, do we need to include the combiningDiacriticalMarkString with surrogate pairs in the regex in Lexer.js?
combiningDiacriticalMarkString is already included in Lexer's tokenRegex both after a single codepoint and after a surrogate pair.
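As a rough, simplified sketch of the structure being described (the real tokenRegex has more alternatives, e.g. for comments and control words):

```javascript
// Simplified sketch of how Lexer.js assembles its token regex: combining
// marks are permitted both after a single BMP code point and after a
// surrogate pair. Control-word and comment alternatives are omitted here.
const combiningDiacriticalMarkString = "[\u0300-\u036f]";
const tokenRegexString =
    "([ \r\n\t]+)|" +                                  // whitespace
    "([!-\\[\\]-\u2027\u202A-\uD7FF\uF900-\uFFFF]" +   // single code point...
    combiningDiacriticalMarkString + "*" +             // ...plus combining marks
    "|[\uD800-\uDBFF][\uDC00-\uDFFF]" +                // or a surrogate pair...
    combiningDiacriticalMarkString + "*)";             // ...plus combining marks
const tokenRegex = new RegExp(tokenRegexString);

// A decomposed "é" (e + U+0301) lexes as one two-code-unit token:
console.log(tokenRegex.exec("e\u0301")[0].length);  // 2
```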
We have to lex the tokens before we can test the tokens against the cjkRegex here, d'oh.
const accent = match[0][i];
if (!accents[accent]) {
    throw new ParseError(`Unknown accent '${accent}'`, nucleus);
}

defineSymbol(math, main, mathord, ch, ch);
defineSymbol(text, main, textord, ch, ch);
}

// Transform combining characters into accents
if (match) {
    for (let i = 0; i < match[0].length; i++) {
        const accent = match[0][i];
Are there any unicode glyphs that include multiple accents? If so, we should test that this works correctly.
I believe the answer is "no" -- all the examples I've seen (e.g. here) show up to one accent in the first character, and the rest obtained with combining characters. But given the combiningDiacriticalMarkString in the Lexer, there could be arbitrarily many combining characters at this point in the code.
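A quick illustration of that second point, runnable in plain Node (nothing KaTeX-specific): NFD really can emit more than one combining mark after a single base letter.

```javascript
// U+1E17 (ḗ, e with macron and acute) decomposes under NFD into a base
// letter followed by two combining marks, so a decomposition loop can see
// multiple accents arising from a single input character.
const decomposed = "\u1e17".normalize("NFD");
console.log(decomposed.length);  // 3
console.log([...decomposed].map((c) => c.codePointAt(0).toString(16)));
// e (65), combining macron (304), combining acute (301)
```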
@edemaine this is looking pretty good. It looks like some of the screenshot tests use …
src/symbols.js
Outdated
const mathTextSymbols = "0123456789/@.\"";
for (let i = 0; i < mathTextSymbols.length; i++) {
    const ch = mathTextSymbols.charAt(i);
for (const ch of "0123456789/@.\"") {
Nice cleanup, but I think this may actually cause the bundle size to increase quite a bit b/c it requires a polyfill for the string iterator, which I believe uses .codePointAt(), which also ends up being polyfilled. Can you generate a build and see how much the bundle size has increased?
I don't think there are polyfills involved, but this is what it expands to:
var _iteratorNormalCompletion = true;
var _didIteratorError = false;
var _iteratorError = undefined;
try {
for (var _iterator = (0, _getIterator3.default)("0123456789/@.\""), _step; !(_iteratorNormalCompletion = (_step = _iterator.next()).done); _iteratorNormalCompletion = true) {
var _ch = _step.value;
defineSymbol(math, main, textord, _ch, _ch);
}
// All of these are textords in text mode
} catch (err) {
_didIteratorError = true;
_iteratorError = err;
} finally {
try {
if (!_iteratorNormalCompletion && _iterator.return) {
_iterator.return();
}
} finally {
if (_didIteratorError) {
throw _iteratorError;
}
}
}

Or minified:
var M=true;var _=false;var S=undefined;try{for(var z=(0,n.default)('0123456789/@."'),T;!(M=(T=z.next()).done);M=true){var A=T.value;u(o,f,k,A,A)}}catch(e){_=true;S=e}finally{try{if(!M&&z.return){z.return()}}finally{if(_){throw S}}}var C=true;var O=false;var N=undefined;try{for(var j=(0,n.default)('0123456789!@*()-=+[]<>|";:?/.,'),L;!(C=(L=j.next()).done);C=true){var E=L.value;u(s,f,k,E,E)}}catch(e){O=true;N=e}finally{try{if(!C&&j.return){j.return()}}finally{if(O){throw N}}}
So it does add a few hundred bytes. Shall I switch back?
(To clarify, _getIterator3 gets defined even without these lines.)
The code that it generates is pretty awful. We should experiment with loose mode for the for-of plugin for Babel sometime; see https://babeljs.io/docs/plugins/transform-es2015-for-of. Interestingly, when iterating over an array literal it will output a basic for loop. Maybe we could do for (const letter of "abc...".split("")) { ... } to force this optimization.
How about the following, which Babel seems to leave as-is?
"0123456789/@.\"".split("").forEach(function (ch) {
    defineSymbol(math, main, textord, ch, ch);
});

Wow, this is actually fastest! jsperf benchmark
right, b/c it still has to call charCodeAt... yeah, let's go back to regular for loops.
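For reference, the plain indexed loop being reverted to looks like this; the stubs for defineSymbol and its arguments are stand-ins for the real definitions in src/symbols.js:

```javascript
// Stubs standing in for the real definitions in src/symbols.js.
const symbols = {};
function defineSymbol(mode, font, group, replace, name) {
    symbols[mode + ":" + name] = {font, group, replace};
}
const math = "math";
const main = "main";
const textord = "textord";

// The indexed loop: no iterator protocol involved, so Babel emits it
// unchanged and no string-iterator polyfill gets pulled into the bundle.
const mathTextSymbols = "0123456789/@.\"";
for (let i = 0; i < mathTextSymbols.length; i++) {
    const ch = mathTextSymbols.charAt(i);
    defineSymbol(math, main, textord, ch, ch);
}
console.log(Object.keys(symbols).length);  // 14
```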
@kevinbarabash Looking at the history, I realize that the "weak" support for Latin-1 characters (using the wrong font) is actually really old, so we should keep support for those characters for backwards compatibility. I added a TODO comment about how they're currently supported "incorrectly". I also updated the Unicode screenshot, which shows proper support for many characters but not all (like …).
Not sure why Travis isn't running the tests. I had to restart the tests manually.
I added a test and warning for …
I think this is ready for final review. We can move the tests around later (?).
@edemaine sorry I've been holding off on approving this. I've been doing some more thinking about making people add their own polyfill for IE support, and I wonder how many people are still supporting IE. For the record, KA is still supporting IE10+. Although we haven't upgraded KaTeX recently, if we were to do so we probably wouldn't want to include the whole …
That makes sense. Maybe I should try to get it working with normalize only during build, and not requiring it at run time, then.
Hi, …
@Syndesi I'd suggest opening a separate issue (or PR) for this idea. Actually, there might be one; I'd need to search. Rendering repeatedly in the browser to measure box sizes is the general approach of MathJax, which is the main reason (I believe) it's so slow in comparison to KaTeX. It also prevents server-side rendering, which KaTeX currently supports. So it's unlikely that this will happen, but feel free to open/find a discussion about it.
@edemaine were you thinking of using …
@kevinbarabash Yes, exactly. Maybe there's a …

I'd set this up only to recognize accents that KaTeX can handle, so it'd hopefully be not too large a list. We can still handle combining characters in exactly the way of this PR, just skipping the normalization. Instead, I think I'd define (in …
OK, I took a crack at this: multi-accented characters were challenging to get right (glad we added a test for them!), and it turns out there are characters with two accents. I checked and there aren't any triple-accented characters. I'm not sure of the best location for …
@@ -0,0 +1,290 @@
// This file is GENERATED by unicodeMake.js. DO NOT MODIFY.

export const unicodeSymbols = {
    "\u00e1": "\u0061\u0301",  // á = \'{a}
k4b7
left a comment
I have a few minor concerns/questions, but other than that it looks really solid.
src/unicodeMake.js
Outdated
console.log("// This file is GENERATED by unicodeMake.js. DO NOT MODIFY.");
console.log("");
console.log("export const unicodeSymbols = {");
Do we need this in addition to the export default?
I'm still somewhat new to ES6, so I'm not sure what's normal here. I still find it weird that export default const is forbidden. I believe we could just do export default {...} if we don't need to name the object, so maybe that's best? The current system allows importing in two different ways (default or nondefault), but that's probably not necessary.
I find the lack of support for export default const weird as well. You can do:
export { foo as default };
I guess it's a moot point now since there no longer seems to be an export default in this file.
src/unicodeMake.js
Outdated
// Read supported accents from Parser.js
const Parser = fs.readFileSync('./Parser.js', {encoding: 'utf8'});
const match = /accents = [^;]*;/.exec(Parser);
eval(match[0]);
Maybe we could have Parser.js export accents instead.
I should add a comment about this. I originally tried it that way, but Node doesn't support import, and I'm running unicodeMake.js in Node, so Babel would have to get involved, and the accents are not available in built katex.js. I'm open to better suggestions! We could, for example, put accents here in unicodeMake.js and/or unicodeSymbols.js and import them in Parser.js -- perhaps that's best.
OK, I tried moving them. It's not especially pretty -- I still eval'd the code in unicodeMake, which led to extra \ escaping in the accents definition... Let me know if you see a cleaner way! But at least I'm no longer parsing Parser.js with a regex!
for (const accent of Object.getOwnPropertyNames(accents)) {
    const combined = letter + accent;
    const normalized = combined.normalize('NFC');
    if (normalized.length === 1) {
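The generator's core trick can be shown in isolation. This is a self-contained sketch with a toy three-accent list, not the real unicodeMake.js tables:

```javascript
// Toy subset of the accent table: grave, acute, and circumflex marks.
const accents = ["\u0300", "\u0301", "\u0302"];
const unicodeSymbols = {};
for (const letter of ["a", "e", "q"]) {
    for (const accent of accents) {
        const combined = letter + accent;
        const normalized = combined.normalize("NFC");
        // A length-1 result means Unicode defines a precomposed character,
        // so we can map it back to its base + combining-mark decomposition.
        if (normalized.length === 1) {
            unicodeSymbols[normalized] = combined;
        }
    }
}
console.log(Object.keys(unicodeSymbols).length);  // 6: àáâ and èéê; q composes with none
```

Letters with no precomposed form (like q here) simply produce no table entry, which is why the generated file stays small.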
});

it("should parse 'ΓΔΘΞΠΣΦΨΩ'", function() {
    expect("ΓΔΘΞΠΣΦΨΩ").toParse();
I see, these got moved down to unicode-spec.js.
// TODO(edemaine): Unsupported Latin-1 letters in text: ÆÇÐØÞßæçðøþ
expect("\\text{ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïñòóôõöùúûüýÿ}")
    .toParseLike(
        "\\text{\\`A\\'A\\^A\\~A\\\"A\\r A" +
\\\" caught me off-guard until I realized that the " needs to be escaped as well.
test/unicode-spec.js
Outdated
expect('여보세요').toNotParse();
it("should parse CJK outside \\text{}", function() {
    expect('私はバナナです。').toParse();
    expect('여보세요').toParse();
This change isn't really related to accents and should probably be in a separate PR.
It's relevant in that this PR rewrites how we parse Unicode. The relevant line change is from if (this.mode === "text" && cjkRegex.test(nucleus.text)) to if (cjkRegex.test(text)). I can't make a separate PR now because they'd both change that line, but if you like, I can remove that change from here and add a second PR once this one is accepted.
FWIW, this one-line change would fix #895.
I can remove that change from here and add a second PR once this one is accepted.
I think that makes the most sense. Thanks. :)
// This is an internal Node tool, not part of the KaTeX distribution,
// whose purpose is to generate unicodeSymbols.js in this directory.
// In this way, only this tool, and not the distribution/browser,
// needs String's normalize function.
Can you include when this tool should be run?
unicode:
	cd src && $(NODE) unicodeMake.js >unicodeSymbols.js
src/unicodeSymbols.js: unicode
I wonder if we can run this as part of the build and then not even bother with committing the result. That way we don't have to worry about whether people are running it when they should.
My motivation for committing it is so that people wouldn't worry about when to run it. (You need to run it whenever we add support for a new accent, or when the Unicode spec changes, neither of which I imagine being very common.) Another advantage is that we can directly see what characters are supported, so it's easy to complain about a missing one, and for PRs to show what additional characters are being supported.
If we do it during build, would we also have to do it when running the test server...? This approach felt more complicated to me, but I'm happy to have a specific suggestion.
Let's leave it like this for now. I definitely like committing it from a documentation point of view. I don't think we have much work left to do on accents, so there shouldn't be too many PRs affected by this.
src/unicodeMake.js
Outdated
    '\\u030a': {text: '\\\\r'},
    '\\u030b': {text: '\\\\H'},
}`;
eval(accentsCode);
I'm confused why the eval is necessary, can't we just console.log(accentsCode); this as is?
A better solution might be to do:
const {accents} = require('./unicodeSymbols.js');
const encodedAccents = {};
for (const [key, value] of Object.entries(accents)) {
    const encodedEntry = {};
    encodedAccents[encode(key)] = encodedEntry;
    if (value.text) {
        encodedEntry.text = encode(value.text);
    }
    if (value.math) {
        encodedEntry.math = encode(value.math);
    }
}
This way we have one source of truth for the accents.
First, the eval is necessary to load this data into the current module. console.log is what outputs the same code to unicodeSymbols.js. In this way, there is already just one source, accentsCode in unicodeMake.js, which generates unicodeSymbols.js which has a copy of the data, but that file should never be modified...
Your bootstrapping approach above is neat, but feels awkward. You're asking unicodeMake.js to first import unicodeSymbols.js, but then modify unicodeSymbols.js... In particular, it won't work with the > piping as is, and it's fragile: if unicodeSymbols.js ever got corrupted (e.g. from a bug in unicodeMake.js), you'd lose the accents table.
I have a third approach: make a unicodeAccents.js that just gives the accents list, and is written with Node-style modules (module.exports = ... instead of export). Then it should be both requireable in unicodeMake.js and (I think) importable by the rest of KaTeX.
@edemaine do you know what package we need to load to render the …
@kevinbarabash I think mainly we need to use …

Looks like I'll have to include some more packages to get Cyrillic and CJK fonts. There's also the known "missing …
I tried putting the accents in their own Node-style module, …

(I also rebased, so it will look like the old commits repeating themselves. Only the last 3 are new.)
The good news: …

Areas for improvement: …

The items that are starred are things that should probably be resolved before merging this PR.
@edemaine thanks for refactoring …
@kevinbarabash I guess for backwards compatibility, we could just let …

I'm guessing that the squishing of the accents on capital letters is also some kind of font issue (another symbol that we need), but I don't actually see where it's implemented in LaTeX... I just see …

The horizontal spacing is definitely something that can be fixed. It's an existing issue with …
This is important for multi-accented characters.
New texcmp with #1033 merged (via rebase):

Ready to merge? Still the cedilla issue and accent-on-capital-letter issue, which we can add. (Neither is specific to this PR.)

I also added support for accented Greek letters, as I recalled/noticed there are a handful of those as well. (I didn't add e.g. Greek Capital Letter A(lpha) with acute accent because we don't support Greek Capital Letter A(lpha) yet.)

@ronkok Does …
@edemaine, I am not aware of a stand-alone Unicode character for Greek capital alpha with acute accent, so …
I think so. We can deal with the cedilla and …
@edemaine I removed some of the letters from the math-mode test since we aren't rendering them with the correct font anyway.
@edemaine thanks for all of the hard work on this. I'm glad it's finally merged.
@ronkok I was indeed referring to U+0386 Ά which is "Greek Capital Letter A with acute accent" according to this list. (I assume A means Alpha here, though I'm no expert.) I see that other lists call it "GREEK CAPITAL LETTER ALPHA WITH TONOS", but it does canonicalize into U+0391 (Greek Capital Alpha) followed by U+0301 (combining acute accent). There are similar "Latin-looking" (and thus no-LaTeX-macro) characters Έ Ή Ί Ό Ύ. But if LaTeX doesn't support them, fine for us not to. @kevinbarabash Thanks for pushing and helping to make this PR so much better! Together with @ronkok's Unicode symbol support, the next KaTeX release will be really Unicode friendly!
Very cool stuff. I'll start to write it into the KaTeX function support page. @edemaine, when I submit the PR for the updated function support page, I would appreciate it if you would look it over. I'm not sure that I have picked up on everything that you and @kevinbarabash are doing here.
@ronkok Sounds good! For this PR, the new single symbols should all be in src/unicodeSymbols.js. They currently work in both text and math mode (but I'm not sure they should work in math mode -- see #1046). In addition, the combining characters for all the supported text and math accents work, as listed in src/unicodeAccents.js. In addition, #1030 added …

Thanks for continuing to keep the documentation up-to-date!
For now, I'll write it up to suggest that they work only in text mode. I would like to keep math mode out of the docs until we're sure that it won't break future work. We can always change the docs later to include math mode. |





This PR fixes #801.
- parseSymbol now recognizes both combined and uncombined forms of Unicode accents, and builds accent objects just like the accent functions.
- Added CJK support to math mode (not just text mode), fixing "KaTeX parse error: Expected 'EOF', got '<bla>' at position 1" when the input is pure Chinese character(s) (#895).