
Unicode accents #992

Merged
k4b7 merged 28 commits into KaTeX:master from edemaine:accents2
Dec 29, 2017

Conversation

@edemaine (Member) commented Nov 25, 2017

This PR fixes #801.

@k4b7 (Member) left a comment


Nice start... requesting more tests and a polyfill for String.prototype.normalize.

});

it("should parse Latin-1 outside \\text{}", function() {
expect('ÀàÇçÉéÏïÖöÛû').toParse();
Member

Can we keep these tests but only check for Çç? Eventually we'll want to test for œ, Œ, æ, Æ, and ß when they get added.

Member Author

This PR actually removes support for Çç. I believe they were added prematurely when forcing support for Latin-1. I just verified that none of the Latin-1 glyphs, Çç included, are in the KaTeX fonts. (I believe this actually makes sense, and the purpose of this PR is to map them into the accent macros, like LaTeX would.) Here's an example of the incorrect-looking Çç vs. Cc in the current KaTeX:

image

Member Author

I opened https://github.com/KaTeX/katex-fonts/issues/2 to track the cedilla issue.

expect("ÀÁÂÃÄÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäèéêëìíîïñòóôõöùúûüýÿ")
.toParseLike(
"\\grave A\\acute A\\hat A\\tilde A\\ddot A" +
"\\grave E\\acute E\\hat E\\ddot E" +
Member

I guess expanding to \grave A makes sense. Can you add a MathML fixture test? I'm curious to see if we recombine the accent with the letter when we render to MathML.

Member Author

They render exactly how accent commands do, which I learned is via <mover accent="true">. I think this makes sense... Test added.

const commentRegexString = "%[^\n]*[\n]";
const controlWordRegexString = "\\\\[a-zA-Z@]+";
const controlSymbolRegexString = "\\\\[^\uD800-\uDFFF]";
const combiningDiacriticalMarkString = "[\u0300-\u036f]";
Member

This contains many more characters than we support. Where do we generate an error if someone uses an unsupported combining mark? Maybe add a test for this.

Member Author

I added an error test (in error-spec.js because I wanted toFailWithParseError). Incidentally, most of my tests are in katex-spec.js because I wanted toParseLike, but they would make more sense in unicode-spec.js... Not sure how to best share code between these test modules.

Member

I've created #1021 to extract helper functions into test-utils.js. I can do that this weekend and then we could rebase this with the tests in the proper suite.

src/Parser.js Outdated
}
// Decompose symbol into base and combining characters
// if the combined symbol is not recognized.
text = text.normalize('NFD');
Member

This is super cool. I wish I would've known about this sooner. There's no support for this in IE, and Safari only got it in version 10. We might consider using https://github.com/walling/unorm instead.
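For context, a minimal sketch (not KaTeX source; `decompose` is an illustrative name) of what NFD normalization does here, with a guard for engines that lack `String.prototype.normalize`:

```javascript
// Illustrative only: NFD splits a precomposed character into its base
// letter plus combining marks, which the parser can then map onto
// accent macros.
function decompose(text) {
    // Guard for engines (e.g. IE) without String.prototype.normalize;
    // fall back to returning the input unchanged.
    if (typeof String.prototype.normalize !== "function") {
        return text;
    }
    return text.normalize("NFD");
}

const parts = decompose("\u00e9"); // "é"
// parts is "e" followed by U+0301 COMBINING ACUTE ACCENT
```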

Member Author

Ah, I should have checked the browser compatibility matrix for String.normalize. The only problem with a polyfill is that it increases the katex.min.js build by 60%, from 226,836 bytes to 363,641 bytes (as any polyfill requires a huge table of Unicode symbols).

So I think this raises a serious question of whether we should:

  1. Include the polyfill.
  2. Not include the polyfill, and support Unicode in all supported browsers except IE.
  3. Rewrite the PR to avoid String.normalize, at least in the built form. We actually only want symbols that expand to accents within our supported set, so the symbol table we could generate at build time could be smaller than unorm's.
  4. Drop this PR altogether (though this is probably a desirable feature, and IE probably isn't evolving).

I think option 2 or 3 is best. Thoughts? I can look into option 3.

Member

It's in Edge so my vote would be for option 2. If people need to support IE then they can add the polyfill themselves. We should document this requirement.

Member Author

Sounds good. That's the easiest! I'll add a fallback when normalize isn't implemented, maybe with a warning, and add documentation.

src/Parser.js Outdated
} else if (cjkRegex.test(text)) {
this.consume();
return newArgument(
new ParseNode("textord", text, this.mode, nucleus));
Member

If we're bailing here, do we need to include the combiningDiacriticalMarkString with surrogate pairs in the regex in Lexer.js?

Member Author

combiningDiacriticalMarkString is already included in Lexer's tokenRegex both after a single codepoint and after a surrogate pair.
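To illustrate the point (a simplified pattern, not the actual Lexer.js regex, which has more alternatives): a token is one base character, either a single code unit or a surrogate pair, followed by any run of combining marks in U+0300..U+036F:

```javascript
// Simplified sketch of a lexer pattern that attaches trailing combining
// diacritical marks to the preceding base character.
const combiningDiacriticalMarkString = "[\\u0300-\\u036f]";
const tokenRegex = new RegExp(
    "([\\uD800-\\uDBFF][\\uDC00-\\uDFFF]" + // surrogate pair, or
    "|[^\\uD800-\\uDFFF])" +                // single code unit
    combiningDiacriticalMarkString + "*",   // plus combining marks
    "g");

const tokens = "e\u0301x".match(tokenRegex);
// "e" + combining acute lexes as one token; "x" is a separate token
```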

Member

We have to lex the tokens before we can test the tokens against the cjkRegex here, d'oh.

const accent = match[0][i];
if (!accents[accent]) {
throw new ParseError(`Unknown accent '${accent}'`, nucleus);
}
Member

Good enough for me.

defineSymbol(math, main, mathord, ch, ch);
defineSymbol(text, main, textord, ch, ch);
}

Member

Nice clean up!

// Transform combining characters into accents
if (match) {
for (let i = 0; i < match[0].length; i++) {
const accent = match[0][i];
Member

Are there any unicode glyphs that include multiple accents? If so, we should test that this works correctly.

Member Author

I believe the answer is "no" -- all the examples I've seen (e.g. here) show up to one accent in the first character, and the rest obtained with combining characters. But given the combiningDiacriticalMarkString in the Lexer, there could be arbitrarily many combining characters at this point in the code.

@k4b7 (Member) commented Dec 13, 2017

@edemaine this is looking pretty good. It looks like some of the screenshot tests use Ç and ç so those will have to be changed or we could add support for them even though we don't have the correct glyphs and fix that issue in a separate PR. Your call.

src/symbols.js Outdated
const mathTextSymbols = "0123456789/@.\"";
for (let i = 0; i < mathTextSymbols.length; i++) {
const ch = mathTextSymbols.charAt(i);
for (const ch of "0123456789/@.\"") {
Member

Nice cleanup, but I think this may actually cause the bundle size to increase quite a bit b/c it requires a polyfill for the string iterator, which I believe uses .codePointAt(), which also ends up being polyfilled. Can you generate a build and see how much the bundle size has increased?

Member Author (edemaine), Dec 14, 2017

I don't think there are polyfills involved, but this is what it expands to:

var _iteratorNormalCompletion = true;
var _didIteratorError = false;
var _iteratorError = undefined;

try {
    for (var _iterator = (0, _getIterator3.default)("0123456789/@.\""), _step; !(_iteratorNormalCompletion = (_step = _iterator.next()).done); _iteratorNormalCompletion = true) {
        var _ch = _step.value;

        defineSymbol(math, main, textord, _ch, _ch);
    }

    // All of these are textords in text mode
} catch (err) {
    _didIteratorError = true;
    _iteratorError = err;
} finally {
    try {
        if (!_iteratorNormalCompletion && _iterator.return) {
            _iterator.return();
        }
    } finally {
        if (_didIteratorError) {
            throw _iteratorError;
        }
    }
}

Or minified:

var M=true;var _=false;var S=undefined;try{for(var z=(0,n.default)('0123456789/@."'),T;!(M=(T=z.next()).done);M=true){var A=T.value;u(o,f,k,A,A)}}catch(e){_=true;S=e}finally{try{if(!M&&z.return){z.return()}}finally{if(_){throw S}}}var C=true;var O=false;var N=undefined;try{for(var j=(0,n.default)('0123456789!@*()-=+[]<>|";:?/.,'),L;!(C=(L=j.next()).done);C=true){var E=L.value;u(s,f,k,E,E)}}catch(e){O=true;N=e}finally{try{if(!C&&j.return){j.return()}}finally{if(O){throw N}}}

So it does add a few hundred bytes. Shall I switch back?

(To clarify, _getIterator3 gets defined even without these lines.)

Member

The code that it generates is pretty awful. We should experiment with loose mode for the for-of plugin for babel sometime, see https://babeljs.io/docs/plugins/transform-es2015-for-of. Interestingly, when iterating over an array literal it will output a basic for loop. Maybe we could do for (const letter of "abc...".split()) { ... } to force this optimization.

Member Author

How about the following, which Babel seems to leave as-is?

"0123456789/@.\"".split().forEach(function () {
    defineSymbol(math, main, textord, ch, ch);
});

Wow, this is actually fastest! jsperf benchmark

jsperf output

Member

I wasn't expecting that. Awesome!

Member Author

Oops, that's because I was using split wrong. It's now the slowest... So perhaps I should go back to a regular for loop.

Corrected jsperf output
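For the record, the pitfall is that split with no separator returns the whole string as a one-element array, so the loop body only runs once. A quick sketch:

```javascript
// split() with no argument does NOT split into characters:
const wrong = "0123456789/@.\"".split();   // one element: the whole string
const right = "0123456789/@.\"".split(""); // one element per character
```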

Member

right, b/c it still has to call charCodeAt... yeah, let's go back to regular for loops.

@edemaine (Member Author)

@kevinbarabash Looking at the history, I realize that the "weak" support for Latin-1 characters (using the wrong font) is actually really old, so we should keep support for those characters for backwards compatibility. I added a TODO comment about how they're currently supported "incorrectly".

I also updated the Unicode screenshot, which shows proper support for many characters but not all (like Çç). It also highlights an issue I've known about but we should probably discuss: ï renders with an accent on top of a dot (and similarly for all accented i's and j's). To fix this in text mode, we need \i and \j support (https://github.com/KaTeX/katex-fonts/issues/2), but I added a fix for it now in math mode because we have \imath and \jmath.

@k4b7 (Member) commented Dec 14, 2017

Not sure why Travis isn't running the tests. I had to restart the tests manually.

@edemaine (Member Author) commented Dec 14, 2017

I added a test and warning for normalize(), and tested that it occurs when using IE (but not otherwise). And switched back to vanilla for loops.

@edemaine (Member Author)

I think this is ready for final review. We can move the tests around later (?).

@k4b7 (Member) commented Dec 17, 2017

@edemaine sorry I've been holding off on approving this. I've been doing some more thinking about making people add their own polyfill for IE support, and I wonder how many people are still supporting IE. For the record, KA is still supporting IE10+. Although we haven't upgraded KaTeX recently, if we were to do so we probably wouldn't want to include the whole normalize polyfill, and would end up writing our own polyfill covering what we use in KaTeX. At that point we may as well include the code in KaTeX. Later, we could add a build setting to leave it out for people who aren't supporting IE.

@edemaine (Member Author)

That makes sense. Maybe I should try to get it working with normalize only during build, and not requiring it at run time, then.

@Syndesi commented Dec 17, 2017

Hi,
if I understand correctly, the widths and heights of supported characters are currently hardcoded, so unknown characters either don't get rendered or break the formula.
Wouldn't it be better if the major characters were hardcoded (to improve performance) and all other strings were put into an invisible paragraph that is then measured for its width and height? With this solution it should be possible to use every language, even those whose character widths depend on context (e.g. Arabic).

@edemaine (Member Author)

@Syndesi I'd suggest opening a separate issue (or PR) for this idea. Actually, there might be one; I'd need to search. Rendering repeatedly in the browser to measure box sizes is the general approach of MathJax, which is the main reason (I believe) it's so slow in comparison to KaTeX. It also prevents server-side rendering, which KaTeX currently supports. So it's unlikely that this will happen, but feel free to open/find a discussion about it.

@k4b7 (Member) commented Dec 17, 2017

Maybe I should try to get it working with normalize only during build, and not requiring it at run time, then.

@edemaine were you thinking of using normalize at build time to generate a table or something containing the split characters?

@edemaine (Member Author)

@kevinbarabash Yes, exactly. Maybe there's a src/unicode.js, which can be recompiled (e.g. if Unicode changes) via make src/unicode.js, which runs some code (e.g. src/make-unicode.js, or located elsewhere) not included in the KaTeX build, and only the latter uses normalize/unorm. Similar to fontMetrics making, I suppose -- I can look more at how that's done.

I'd set this up only to recognize accents that KaTeX can handle, so it'd hopefully be not too large a list. We can still handle combining characters in exactly the way of this PR, just skipping the normalization. Instead, I think I'd define (in src/unicode.js) macros that map e.g. á to a followed by an acute combining character.

@edemaine (Member Author) commented Dec 18, 2017

OK, I took a crack at this: src/unicodeMake.js generates src/unicodeSymbols.js via make unicode or make src/unicodeSymbols.js.

Multi-accented characters were challenging to get right (glad we added a test for them!), and it turns out there are characters with two accents. I checked and there aren't any triple-accented characters.

I'm not sure of the best location for unicodeMake.js. We don't have a util directory, which would be a natural place. It could also be in the root directory. But in src it immediately gets linted and such.

@@ -0,0 +1,290 @@
// This file is GENERATED by unicodeMake.js. DO NOT MODIFY.
Member

Cool!

// This file is GENERATED by unicodeMake.js. DO NOT MODIFY.

export const unicodeSymbols = {
"\u00e1": "\u0061\u0301", // á = \'{a}
Member

I love the comments.

@k4b7 (Member) left a comment

I have a few minor concerns/questions, but other than that it looks really solid.


console.log("// This file is GENERATED by unicodeMake.js. DO NOT MODIFY.");
console.log("");
console.log("export const unicodeSymbols = {");
Member

Do we need this in addition to the export default?

Member Author

I'm still somewhat new to ES6, so I'm not sure what's normal here. I still find it weird that export default const is forbidden. I believe we could just do export default {...} if we don't need to name the object, so maybe that's best? The current system allows importing in two different ways (default or nondefault), but that's probably not necessary.

Member

I find the lack of support for export default const weird as well. You can do:

export { foo as default };

I guess it's a moot point now since there no longer seems to be an export default in this file.

// Read supported accents from Parser.js
const Parser = fs.readFileSync('./Parser.js', {encoding: 'utf8'});
const match = /accents = [^;]*;/.exec(Parser);
eval(match[0]);
Member

Maybe we could have Parser.js export accents instead.

Member Author

I should add a comment about this. I originally tried it that way, but Node doesn't support import, and I'm running unicodeMake.js in Node, so Babel would have to get involved, and the accents are not available in built katex.js. I'm open to better suggestions! We could, for example, put accents here in unicodeMake.js and/or unicodeSymbols.js and import them in Parser.js -- perhaps that's best.

Member Author

OK, I tried moving them. It's not especially pretty -- I still eval'd the code in unicodeMake, which led to extra \ escaping in the accents definition... Let me know if you see a cleaner way! But at least I'm no longer parsing Parser.js with a regex!

for (const accent of Object.getOwnPropertyNames(accents)) {
const combined = letter + accent;
const normalized = combined.normalize('NFC');
if (normalized.length === 1) {
Member

This is super cool.
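The idea, in a self-contained form (accent list and variable names are illustrative, not the actual unicodeMake.js code): compose each letter with each supported combining accent, and record a mapping whenever NFC collapses the pair into a single precomposed code point:

```javascript
// Hypothetical subset of accents for illustration: combining acute and grave.
const accents = {"\u0301": "\\'", "\u0300": "\\`"};
const table = {};
for (const letter of ["a", "e", "A"]) {
    for (const accent of Object.getOwnPropertyNames(accents)) {
        const combined = letter + accent;
        const normalized = combined.normalize("NFC");
        if (normalized.length === 1) {
            // e.g. "\u00e1" (á) maps to "a" + U+0301
            table[normalized] = combined;
        }
    }
}
```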

});

it("should parse 'ΓΔΘΞΠΣΦΨΩ'", function() {
expect("ΓΔΘΞΠΣΦΨΩ").toParse();
Member

I see, these got moved down to unicode-spec.js.

// TODO(edemaine): Unsupported Latin-1 letters in text: ÆÇÐØÞßæçðøþ
expect("\\text{ÀÁÂÃÄÅÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝàáâãäåèéêëìíîïñòóôõöùúûüýÿ}")
.toParseLike(
"\\text{\\`A\\'A\\^A\\~A\\\"A\\r A" +
Member

\\\" caught me off-guard until I realized that the " need to be escaped as well.

expect('여보세요').toNotParse();
it("should parse CJK outside \\text{}", function() {
expect('私はバナナです。').toParse();
expect('여보세요').toParse();
Member

This change isn't really related to accents and should probably be in a separate PR.

Member Author (edemaine), Dec 19, 2017

It's relevant in that this PR rewrites how we parse Unicode. The relevant line change is from if (this.mode === "text" && cjkRegex.test(nucleus.text)) to if (cjkRegex.test(text)). I can't make a separate PR now because they'd both change that line, but if you like, I can remove that change from here and add a second PR once this one is accepted.

FWIW, this one-line change would fix #895.

Member

I can remove that change from here and add a second PR once this one is accepted.

I think that makes the most sense. Thanks. :)

Member Author

Removed.

// This is an internal Node tool, not part of the KaTeX distribution,
// whose purpose is to generate unicodeSymbols.js in this directory.
// In this way, only this tool, and not the distribution/browser,
// needs String's normalize function.
Member

Can you include when this tool should be run?

Member Author

Added.


unicode:
	cd src && $(NODE) unicodeMake.js >unicodeSymbols.js

src/unicodeSymbols.js: unicode
Member

I wonder if we can run this as part of the build and then not even bother with committing the result. That way we don't have to worry about whether people are running it when they should or not.

Member Author

My motivation for committing it is so that people wouldn't worry about when to run it. (You need to run it whenever we add support for a new accent, or when the Unicode spec changes, neither of which I imagine being very common.) Another advantage is that we can directly see what characters are supported, so it's easy to complain about a missing one, and for PRs to show what additional characters are being supported.

If we do it during build, would we also have to do it when running the test server...? This approach felt more complicated to me, but I'm happy to have a specific suggestion.

Member

Let's leave it like this for now. I definitely like committing it from a documentation point of view. I don't think we have much work left to do on accents, so there shouldn't be too many PRs affected by this.

'\\u030a': {text: '\\\\r'},
'\\u030b': {text: '\\\\H'},
}`;
eval(accentsCode);
Member

I'm confused why the eval is necessary, can't we just console.log(accentsCode); this as is?

Member

A better solution might be to do:

const {accents} = require('./unicodeSymbols.js');

const encodedAccents = {};
for (const [key, value] of Object.entries(accents)) {
    const encodedEntry = {};
    encodedAccents[encode(key)] = encodedEntry;
    if (value.text) {
        encodedEntry.text = encode(value.text);
    }
    if (value.math) {
        encodedEntry.math = encode(value.math);
    }
}

This way we have one source of truths for the accents.

Member Author

First, the eval is necessary to load this data into the current module. console.log is what outputs the same code to unicodeSymbols.js. In this way, there is already just one source, accentsCode in unicodeMake.js, which generates unicodeSymbols.js which has a copy of the data, but that file should never be modified...

Your bootstrapping approach above is neat, but feels awkward. You're asking unicodeMake.js to first import unicodeSymbols.js, but then modify unicodeSymbols.js... In particular, it won't work with the > piping as is, and is fragile if e.g. unicodeSymbols.js ever got corrupted (e.g. from a bug in unicodeMake.js), then you'd lose the accents table.

I have a third approach: make a unicodeAccents.js that just gives the accents list, and is written with Node-style modules (module.export = ... instead of export). Then it should be both requireable in unicodeMake.js and (I think) importable by the rest of KaTeX.
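A minimal sketch of that third approach (the entries are illustrative, in the {text, math} shape used earlier in the thread): a plain CommonJS module that the Node generator can require() and the Babel-compiled sources can import:

```javascript
// unicodeAccents.js sketch: map combining characters to accent macros.
// Node-style export so it works both under require() in unicodeMake.js
// and, via Babel's CommonJS interop, under ES-module import in KaTeX.
const accents = {
    "\u0301": {text: "\\'", math: "\\acute"}, // combining acute
    "\u0300": {text: "\\`", math: "\\grave"}, // combining grave
};

module.exports = accents;
```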

@k4b7 (Member) commented Dec 20, 2017

@edemaine do you know what package we need to load to render the Unicode test case in LaTeX?

@edemaine (Member Author) commented Dec 20, 2017

@kevinbarabash I think mainly we need to use xelatex instead of pdflatex. So I don't think we can easily get texcmp to do the work for us (at least not in the Docker, which doesn't have xelatex installed). Running node texcmp.js locally with a modified texcmp.js, I got this image:

unicode

Looks like I need to include some more packages to get Cyrillic and CJK fonts. There's also the known "missing \i" and "missing cedilla" issues, along with some kind of horizontal spacing issue I'm not sure about. I don't think the other differences are specific to this PR, though, but rather our existing accent mechanism.

@edemaine (Member Author) commented Dec 20, 2017

I tried putting the accents in their own Node-style module, unicodeAccents.js. This seems to avoid all the messiness (the only blemish is one use of module.exports = instead of export default, but they behave the same). Should be good to go now.

(I also rebased, so it will look like the old commits repeating themselves. Only the last 3 are new.)

@k4b7 (Member) commented Dec 22, 2017

The good news:

  • the base characters are using computer modern now, yay!
  • grave, acute, and circumflex accents have better shapes than before

Areas for improvement:

  • grave, acute, and circumflex accents on capitals need to be rotated or squished a bit
  • umlaut circles seem a bit too large
  • i should use the dotless i when it has an accent*
  • the horizontal spacing around i and I is too much*

The items that are starred are things that should probably be resolved before merging this PR.

@k4b7 (Member) commented Dec 22, 2017

@edemaine thanks for refactoring unicodeAccents and reverting the CJK change. :)

@edemaine (Member Author)

@kevinbarabash I guess for backwards compatibility, we could just let ï pass through as a character and render in Times. Would that be acceptable? We can't otherwise fix the double accent issue in text mode (it already is fixed in math mode) until https://github.com/KaTeX/katex-fonts/issues/2 gets solved.

I'm guessing that the squishing of the accents on capital letters is also some kind of font issue (another symbol that we need), but I don't actually see where it's implemented in LaTeX... I just see \DeclareTextAccentDefault{\'}{OT1}. Dot size is presumably also a font issue...

The horizontal spacing is definitely something that can be fixed. It's an existing issue with \acute{i}, \acute{\imath}, and \text{\'i} in KaTeX, not new to this PR (but this PR exacerbates it to apply to ï as well). I think we just never handled the case of the innards being less wide than the accent. I'll take a look... But we could also leave this to another PR, if we hackily let ï pass through.

@edemaine (Member Author) commented Dec 28, 2017

New texcmp with #1033 merged (via rebase):

image

Ready to merge? Still the cedilla issue and accent-on-capital letter issue, which we can add. (Neither specific to this PR.)

I also added support for accented Greek letters, as I recalled/noticed there are a handful of those as well. (I didn't add e.g. Greek Capital Letter A(lpha) with acute accent because we don't support Greek Capital Letter A(lpha) yet.) @ronkok Does unicode-math or something alias that character to a Latin A?

@ronkok (Collaborator) commented Dec 29, 2017

@edemaine, I am not aware of a stand-alone Unicode character for Greek capital alpha with acute accent, so unicode-math does not have such an alias. There is Ά, that is U+0386, Greek capital alpha with tonos, but unicode-math does not have an alias for it, either.

@k4b7 (Member) commented Dec 29, 2017

Ready to merge? Still the cedilla issue and accent-on-capital letter issue, which we can add. (Neither specific to this PR.)

I think so. We can deal with the cedilla and \H issues separately.

@k4b7 (Member) commented Dec 29, 2017

@edemaine I removed some of the letters from the math mode test since we aren't rendering them with the correct font anyways.

k4b7 merged commit 484d44e into KaTeX:master on Dec 29, 2017

@k4b7 (Member) commented Dec 29, 2017

@edemaine thanks for all of the hard work on this. I'm glad it's finally merged.

@edemaine (Member Author)

@ronkok I was indeed referring to U+0386 Ά which is "Greek Capital Letter A with acute accent" according to this list. (I assume A means Alpha here, though I'm no expert.) I see that other lists call it "GREEK CAPITAL LETTER ALPHA WITH TONOS", but it does canonicalize into U+0391 (Greek Capital Alpha) followed by U+0301 (combining acute accent). There are similar "Latin-looking" (and thus no-LaTeX-macro) characters Έ Ή Ί Ό Ύ. But if LaTeX doesn't support them, fine for us not to.

@kevinbarabash Thanks for pushing and helping to make this PR so much better! Together with @ronkok's Unicode symbol support, the next KaTeX release will be really Unicode friendly!

@ronkok (Collaborator) commented Dec 30, 2017

Very cool stuff. I'll start to write it into the KaTeX function support page.

@edemaine, when I submit the PR for the updated function support page, I would appreciate if you would look it over. I'm not sure that I have picked up on everything that you and @kevinbarabash are doing here.

@edemaine (Member Author) commented Dec 30, 2017

@ronkok Sounds good! For this PR, the new single symbols should all be in src/unicodeSymbols.js. They currently work in both text and math mode (but I'm not sure they should work in math mode -- see #1046). In addition, the combining characters for all the supported text and math accents work, as listed in src/unicodeAccents.js. In addition, #1030 added \ae, \AE, \oe, \OE, \o, \O, \ss and their Unicode equivalents to text mode.

Thanks for continuing to keep the documentation up-to-date!!

@ronkok (Collaborator) commented Dec 30, 2017

I'm not sure they should work in math mode

For now, I'll write it up to suggest that they work only in text mode. I would like to keep math mode out of the docs until we're sure that it won't break future work. We can always change the docs later to include math mode.

Successfully merging this pull request may close these issues: Latin character support.

4 participants