Conversation
gagern
left a comment
There was a problem hiding this comment.
Definitely a great first step, thanks! We will need to add a screenshot test for this, though. And I've added some suggestions you might want to consider, although none of them is a strict requirement as far as I am concerned.
src/Lexer.js
Outdated
| "([ \r\n\t]+)|" + // whitespace | ||
| "([!-\\[\\]-\u2027\u202A-\uD7FF\uF900-\uFFFF]" + // single codepoint | ||
| "|[\uD800-\uDBFF][\uDC00-\uDFFF]" + // surrogate pair | ||
| "|\\\\verb\\*([^\0]).*?(?:\\3|\n|$)" + // \verb* |
There was a problem hiding this comment.
Why [^\0]? If you want any character, including newline, you may write that as [^]. If there is some special semantics for \0 it should be explained in a comment or the PR discussion.
There was a problem hiding this comment.
I confirmed that \0 didn't work in LaTeX, with error message Text line contains an invalid character. But that's probably just TeX being unhappy with the character in general. I think it's better to allow \0, but I didn't know how. Thanks for the tip!
src/Lexer.js
Outdated
| "([!-\\[\\]-\u2027\u202A-\uD7FF\uF900-\uFFFF]" + // single codepoint | ||
| "|[\uD800-\uDBFF][\uDC00-\uDFFF]" + // surrogate pair | ||
| "|\\\\verb\\*([^\0]).*?(?:\\3|\n|$)" + // \verb* | ||
| "|\\\\verb([^*a-zA-Z]).*?(?:\\4|\n|$)" + // \verb unstarred |
There was a problem hiding this comment.
How about just writing .*?\\4 (and likewise for \\3)? The use of . already excludes \n, as well as \r, \u2028 and \u2029. I think that is a reasonable fit for how TeX does this, but [\n]*?\\4 would be conceivable as well. If the terminating delimiter is not found before the end of line, then this would simply fall through to the function name getting matched against \verb, so we could define a function of that name to report the error.
There was a problem hiding this comment.
Detecting unmatched delimiters was my goal, but that approach works too. Changed.
src/Parser.js
Outdated
| new ParseNode("textord", nucleus.text, this.mode, nucleus), | ||
| false, nucleus); | ||
| } else if (nucleus.text.slice(0, 5) === "\\verb" && | ||
| !nucleus.text.charAt(5).match(/[a-zA-Z]/)) { |
There was a problem hiding this comment.
How about /^\\verb[^a-zA-Z]/.test(nucleus.text)?
src/buildCommon.js
Outdated
| var makeVerb = function(group, options) { | ||
| var text = group.value.body; | ||
| if (group.value.star) { | ||
| text = text.replace(/ /g, '\u2423'); // Open box |
There was a problem hiding this comment.
Is that symbol included in the tt font? Come to think of it, do we want to complain if the input contains any symbol not present in our tt font, as we can't guarantee consistent rendering or correct metrics for these?
There was a problem hiding this comment.
It seems that the symbol is the wrong size, so probably not in the font... I thought that the usual "no metrics" warning would trigger in this case, but it seems not to, because I'm passing an entire string instead of individual characters. I changed the main code to look up characters one at a time, then try to combine them, as in text.
|
I implemented your comments. (Thanks @gagern!) The last issue seems to be the space character in |
src/Parser.js
Outdated
| // Lexer's tokenRegex is constructed to always have matching | ||
| // first/last characters. | ||
| if (arg.length < 2 || arg.charAt(0) !== arg.slice(-1)) { | ||
| throw new ParseError("\\verb failure to parse"); |
There was a problem hiding this comment.
What input would trigger this? Actually it might make sense to document error conditions in test/errors-spec.js.
There was a problem hiding this comment.
No inputs would trigger this. I was treating it as an assertion before the next line which strips those characters. But I can also remove it if you prefer. I'll add an error test; thanks for pointing me to that.
|
While I have some rough ideas of how the font generation works behind the scenes, I haven't actually done that. What I understand is roughly this: The TeX fonts were taken, thickened, got unicode mappings and were saved to various font formats. The metrics were derived directly from the texmf metrics descriptions, but with some replacements. That part is included in our By the way, your branch as it stands is difficult to review due to the merge from master. Can you rebase that on current master? |
|
Rebased. Personally, I think we could merge this PR into master now, and turn the missing character into a separate issue. But I understand if you'd rather wait, or remove |
|
I would be fine with this approach too, if we think we can get the character before the next release. I wouldn't want an incomplete feature in a release, since people may depend on behavior which won't be forward compatible. @kevinbarabash, can you judge how difficult adding that character will be? Should we disable the |
|
It isn't too difficult, here are the steps:
It's been a while since I've done it so hopefully it all still just works. I checked my system and |
|
@edemaine give it a try. Once you check out https://github.com/Khan/MathJax-dev you should be able to run the command that the docker runs locally to test your changes out which should be: |
|
@kevinbarabash I think we should have these steps documented somewhere. What do you suggest, Wiki or a README file in our repository or something else? |
|
@kevinbarabash I get the following error: caused by the following code in I can confirm that there is no |
|
I see the Dockerfile and the corresponding README are in the KaTeX repo. Makes my request to have this documented boil down to adding a hint on how to add symbols. I see that the Dockerfile does reference Google Code in a number of places. I'm currently trying to modernize that. |
|
Check out #624 if you want to tweak the fonts. There you can use your own fork of the MathJax-dev repository, instead of the official one. It should even be possible to include |
|
I've included a font that includes the space symbol, based on Khan/MathJax-dev#2. The |
src/buildHTML.js
Outdated
| var i; | ||
| for (i = 0; i < text.length; i++) { | ||
| body.push(buildCommon.makeSymbol( | ||
| text[i], "Main-Regular", group.mode, options, ["mathtt"])); |
There was a problem hiding this comment.
It feels a bit weird that \verb can handle input symbols which other commands can not, e.g. \mathtt{γ} doesn't work yet, but \verb|γ| does, with a warning to console, and \verb|“| doesn't even warn. Some might argue that incorrectly displaying a glyph is better than not displaying it at all, while others might worry about breaking existing content if we start supporting these symbols in the future.
So far (and in the context of unicode input) KaTeX chose the latter approach: don't support input unless you can do so in a compatible fashion. So perhaps we should filter for those symbols we actually have Typewriter metrics for?
If we take the other approach, and deliberately allow all symbols, even those we don't have fonts or metrics for, then it might be a good idea to consider surrogate pairs in this loop here. If the input contains a surrogate pair, then we want to keep the pair intact while building a symbol node from it.
There was a problem hiding this comment.
I don't have strong feelings either way. Maybe I'm used to the warnings for unsupported symbols because of my experience with #618. 😄 I also feel like it's not crazy to pass through text of \verb mostly verbatim, but warning when that might do bad things. (If you like, that warning could be modified to warn of possible incompatibilities with future versions, i.e., don't rely on this behavior.) Anyway, let me know what you think is best and we can work on a tweak either way. (I'm afraid I don't actually know how to do the surrogate pair detection, or to fail when characters are not in the font, but I'm guessing that is done elsewhere in the code that I can mimic/re-use.)
test/screenshotter/ss_data.yaml
Outdated
| errorColor: "#dd4c4c" | ||
| nolatex: deliberately does not compile | ||
| Verb: | ||
| \verb \verb , \verb|\verb|, \verb!<x> & </y>! |
There was a problem hiding this comment.
Now that we have ␣, there should be a \verb* in there as well. Perhaps use an array environment to provide multiple lines of content in a single screenshot.
src/buildHTML.js
Outdated
| text[i], "Main-Regular", group.mode, options, ["mathtt"])); | ||
| } | ||
| buildCommon.tryCombineChars(body); | ||
| return makeSpan(["mord", "verb"], body, options); |
There was a problem hiding this comment.
Why the "verb" class? I don't see any specific styling for this. And the more classes we use in KaTeX, the more classes may not be usable by the surrounding page for their own purpose. Personally I'd even suggest prefixing all our classes with KaTeX- or some such, in order to avoid conflicts. Unless we do that, I'd be extra careful to only introduce new classes when we have to.
There was a problem hiding this comment.
Both good points. I was imagining a user might want to style \verbs in some special way (like Github here does with backquotes), but it's a pretty hypothetical situation, and maybe not appropriate for LaTeX formatting, so probably not worth keeping this class. I'll remove it.
|
@gagern Thanks for updating the font! The I agree that the space in |
|
We've taught eslint to like (and in fact enforce) trailing commas again, in #622. Unfortunately the file is now in conflict, presumably with this. I feel all those comma changes were a pretty annoying detour, and will be happy once we no longer have any active branches without trailing commas. Until then we'll need to rebase – again. |
|
Indeed, looks like characters 2018 and 2019 were added to the typewriter font by someone else in the meanwhile. I assume those should be preserved. (Have you all been keeping your For reference, here's the diff: which I'm merging to the union |
|
I was still getting "No character metrics" warning for Somewhat related, I also get that |
|
Oops, Typewriter font needs character metrics defined for space (32) and for no-break space (160). (I realized that |
|
I did some digging to see what LaTeX is doing for spaces in The glue for |
|
I also tried which seems to indicate that spaces in |
|
@kevinbarabash |
|
I didn't realize there was a |
This is based on Khan/MathJax-dev#2 which hasn't been accepted yet at the time this commit is made.
|
Revitalizing this PR:
Here's a texcmp: I did not check whether characters are in the font, and rather rely on the warning that KaTeX produces about missing characters (see @gagern's old review, first comment). I'm not sure this is best. If someone agrees, I'll work on it. (I think I'd need to make a very simple custom lexer for the |
|
I was really confused as to why this was working b/c the visible space (open box) character doesn't show up in Apple's Font Book font viewing utility even though KaTeX appears to be rendering the glyph just fine. I tried opening KaTeX_Typewriter-regular in https://opentype.js.org/glyph-inspector.html and it shows up correctly. Not sure why Font Book wasn't displaying it. |
|
In the font file the glyph is at position 0x20 which is normally the space. I looked at some other fonts and Font Book doesn't render the space in any of them which I guess makes sense b/c in other fonts it's blank so there's nothing to look at. |
src/buildHTML.js
Outdated
| // 0.524995 is the width of a texttt character in LaTeX. | ||
| // It automatically gets scaled by the font size. | ||
| const rule = makeSpan(["mord", "rule"], [], newOptions); | ||
| rule.style.marginLeft = "0.524995em"; |
There was a problem hiding this comment.
This could be 0.525em since the advance is 525 in font units.
| // The space character isn't in the Typewriter-Regular font, | ||
| // so we implement it as a kern of the same size as a character. | ||
| // 0.524995 is the width of a texttt character in LaTeX. | ||
| // It automatically gets scaled by the font size. |
There was a problem hiding this comment.
This is unfortunate. One thing we might look at doing in the future is add a blank space character to the font so that we don't have to do this. That should allow us to use buildExpression like groupTypes.text does.
There was a problem hiding this comment.
You suggested the reverse in #574. 😄 FWIW, TeX implements spaces as skips instead of characters, but I agree it'd be more symmetric (and simpler) to use a space character. I'm not sure where to collect together font issues -- perhaps on new font repo? Added KaTeX/katex-fonts#1
There was a problem hiding this comment.
I forgot about that comment. I think the difference here is that \verb is kind of like \text and there we use real spaces. I think what you have here is fine.
| if (group.value.star) { | ||
| text = text.replace(/ /g, '\u2423'); // Open Box | ||
| } else { | ||
| text = text.replace(/ /g, '\xA0'); // No-Break Space |
There was a problem hiding this comment.
I'm curious why a non-breaking space vs. a normal space.
There was a problem hiding this comment.
\verb|hello world| will not break in LaTeX, and furthermore renders as multiple spaces, whereas the usual behavior for invisible spaces is that they combine into one. (Even within backtick mode in GFM, I had to use nonbreaking spaces to get them to not merge.) Added a comment to this effect.
I'm less clear on why I write it as \xA0 here and (formerly) in buildHTML.js. Probably the former is clearer, so I've changed the latter. Not sure if we have a preferred style on this matter.
src/Parser.js
Outdated
| if (arg.length < 2 || arg.charAt(0) !== arg.slice(-1)) { | ||
| throw new ParseError("\\verb failure to parse"); | ||
| } | ||
| arg = arg.slice(1, -1); |
There was a problem hiding this comment.
I forgot for a second that negative indices to slice behave like negative indices in python.
src/Parser.js
Outdated
| // Lexer's tokenRegex is constructed to always have matching | ||
| // first/last characters. | ||
| if (arg.length < 2 || arg.charAt(0) !== arg.slice(-1)) { | ||
| throw new ParseError("\\verb failure to parse"); |
There was a problem hiding this comment.
This could error message could be more precise. In the case of arg.length < 2 the error could be:
`\\verb missing matching delimiter '${arg.charAt(0)}'`
and in the case of arg.charAt(0) !== arg.slice(-1) it could be:
`'${arg.slice(-1)}' doesn't match initial delimiter '${arg.charAt(0)}'`
There was a problem hiding this comment.
This is actually more of an assertion: the regex should guarantee that that error is never thrown. The actual error that will get thrown is \\verb ended by end of line instead of matching delimiter which is more descriptive. (It doesn't mention the delimiter character, but that is not trivial to add... Let me know if I should work on it.)
There was a problem hiding this comment.
I thought the first and last characters of the arg were the delimiter so it seemed like a trivial change to add those characters to the error message. If this case should never happen it's probably more important to say in the error that this is a bug and that it should be reported as such.
There was a problem hiding this comment.
Rewrote the error message.
| /** | ||
| * Combine as many characters as possible in the given array of characters | ||
| * via their tryCombine method. | ||
| */ |
There was a problem hiding this comment.
Maybe this should be:
const tryCombineTextNodes = function(nodes) {
It's possible that we might also get passed a documentFragment so we should probably add a tryCombine method on that class as well just in case.
| it("complains about mismatched \\verb with end of line", function() { | ||
| expect("\\verb|hello\nworld|").toFailWithParseError( | ||
| "\\verb ended by end of line instead of matching delimiter"); | ||
| }); |
There was a problem hiding this comment.
Is there a way to test the exception being throw in Parser.js?
There was a problem hiding this comment.
I'm confused -- isn't that what this is doing? Or do you mean confirming that the error is actually of type ParseError? We could probably rewrite toFailWithParseError to do this.
There was a problem hiding this comment.
This is the error that's being thrown in functions.js. The error thrown in Parser.js has the text "\\verb failure to parse".
There was a problem hiding this comment.
From your comment above it sounds like it should never happen though, so I guess we can't test for it.
There was a problem hiding this comment.
Oh, I see. As described above, that error can never be thrown.
|
@edemaine friendly ping. I left some comments and there are a few merge conflicts, but I think this PR is pretty close to being done. |
|
@kevinbarabash Sorry for the delay (been tied up lately), and thanks for the reminder. Just merged with master, and added one commit addressing your comments. Hopefully good to go now. |
|
The screenshots need to be regenerated... probably b/c of that change from |
|
Thanks! Sorry for the screenshot oversight. |


Here's an implementation of
\verb, following #588. As @gagern suggested, the main parsing happens in the lexer, followed by code inparseSymbolto handle commands starting with\verb.Playing around with LaTeX, I found (and implemented) the following behavior:
\verbwill treat a following newline character as the start/end character.\verbargument cannot have a newline in it.\verbwill treat a following space as the start/end character.I'm pretty new to generating output in MathML and HTML, and struggled to get the
mathttfont, so please check that code in particular...