Skip to content

feat(macro): improve argument parsing#2085

Merged
ylemkimon merged 45 commits intomasterfrom
arg-parser
Sep 6, 2020
Merged

feat(macro): improve argument parsing#2085
ylemkimon merged 45 commits intomasterfrom
arg-parser

Conversation

@ylemkimon
Copy link
Member

@ylemkimon ylemkimon commented Aug 21, 2019

1. Primitive v.s. Macro
LaTeX has two types of command (control sequence): TeX primitive and macro. TeX primitives are provided by TeX engine. Macros, in contrast, are defined using primitives in the LaTeX kernel (latex.ltx) and packages such as amsmath. (Macro may expand into primitive commands without taking arguments. They are considered as primitives hereinafter.) Primitives scan argument as it goes, i.e., expanding macros, whereas macros take a properly nested group or non-blank token without expanding. For instance,

\def\foo{12}
\sqrt\foo
\underline\foo

is equivalent to

\sqrt12
\underline{12}

However, KaTeX uses primitive argument parsing for functions. This PR introduces a new argument type primitive, which uses the existing method, and uses it for parsing arguments of primitive commands. It also changes other group parsers such as parseArgumentGroup (new) and parseStringGroup to use macro argument parsing method, which scans the argument and denotes the end of the argument at the MacroExpander level before parsing. This fixes #924, fixes #1027, fixes #1720, and fixes #2319. Closes #611.

2. Macro
Delimited parameters in macro and brace handling are implemented in the new consumeArg function. Braced control sequence and unbraced replacement text no longer works.[1]

Optional arguments are implemented using a delimited parameter at the MacroExpander level. This fixes #1419 and #2288.

Whitespace handling is also improved to consume only in undelimited macro arguments and required/optional arguments. This broke \\ and this PR fixes it along with fix #2119.


Some changes may result in BREAKING CHANGEs such as [1], and issues linked above, however they probably were originally wrong behavior.

", expected '" + (delims && isDelimited ? delims[match] : "}") +
"'", tok);
}
if (delims && isDelimited) {
Copy link
Member Author

@ylemkimon ylemkimon Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delims && is required for Flow to recognize delims != null.

src/Parser.js Outdated
if (isOptional) {

// \sqrt is a macro only if an optional argument exists
if (funcData.type === "sqrt" && i === 1 && optArgs[0] == null) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find a better way to handle \sqrt than hard-coding it.

@ylemkimon ylemkimon requested review from edemaine and k4b7 October 11, 2019 15:55
@ylemkimon ylemkimon marked this pull request as ready for review October 11, 2019 17:20
@ylemkimon
Copy link
Member Author

This breaks the example at the main page:

% \f is defined as f(#1) using the macro
\f{x} = \int_{-\infty}^\infty
    \hat \f\xi\,e^{2 \pi i \xi x}
    \,d\xi

Any suggestion to replace it is welcome!

@codecov-commenter
Copy link

codecov-commenter commented Jul 13, 2020

Codecov Report

Merging #2085 into master will decrease coverage by 0.01%.
The diff coverage is 98.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2085      +/-   ##
==========================================
- Coverage   93.83%   93.81%   -0.02%     
==========================================
  Files          84       84              
  Lines        6112     6128      +16     
  Branches     1254     1260       +6     
==========================================
+ Hits         5735     5749      +14     
- Misses        346      348       +2     
  Partials       31       31              
Impacted Files Coverage Δ
src/functions/delimsizing.js 97.39% <ø> (ø)
src/functions/kern.js 84.21% <ø> (ø)
src/functions/mathchoice.js 94.11% <ø> (ø)
src/functions/mclass.js 98.03% <ø> (ø)
src/functions/op.js 98.91% <ø> (ø)
src/functions/styling.js 100.00% <ø> (ø)
src/parseNode.js 73.33% <ø> (ø)
src/Parser.js 96.26% <95.00%> (-0.73%) ⬇️
src/MacroExpander.js 97.82% <100.00%> (+0.49%) ⬆️
src/defineFunction.js 95.65% <100.00%> (+0.41%) ⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8578d74...a2df241. Read the comment docs.

@ylemkimon ylemkimon changed the title Improve macro argument parsing feat(macro): improve argument parsing Jul 17, 2020
@k4b7
Copy link
Member

k4b7 commented Aug 2, 2020

@ylemkimon in #2288 you commented:

Actually, #2085 will add support for delimited macros.

but earlier in this PR you also commented:

I wanted to demonstrate KaTeX supports a delimited macro, but it's currently not possible to define a delimited macro. But I think this is out of this PR's scope. I've also added a migration guide.

Looking at the code it looks delimiters are supported. Is that correct? If so, could you add some tests that demonstrate this?

@ylemkimon
Copy link
Member Author

@kevinbarabash I meant it's not possible to define a delimited macro using options.macros.

Comment on lines +3192 to +3198
it("\\gdef defines macros with delimited parameter", function() {
expect`\gdef\foo|#1||{#1}\text{\foo| x y ||}`.toParseLike`\text{ x y }`;
expect`\gdef\foo#1|#2{#1+#2}\foo 1 2 |34`.toParseLike`12+34`;
expect`\gdef\foo#1#{#1}\foo1^{23}`.toParseLike`1^{23}`;
expect`\gdef\foo|{}\foo`.not.toParse();
expect`\gdef\foo#1|{#1}\foo1`.not.toParse();
expect`\gdef\foo#1|{#1}\foo1}|`.not.toParse();
Copy link
Member Author

@ylemkimon ylemkimon Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests for delimited macro are already added here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering what the | were for. I'm so used to brackets and parens as delimiters, I forgot that | is a delimiter as well. 🙃

@edemaine edemaine mentioned this pull request Aug 3, 2020
Copy link
Member

@k4b7 k4b7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a partial review. I'll pick up at def.js when I have some more time.

Comment on lines +22 to +24
## `\newline` and `\cr`
`\newline` and `\cr` no longer takes an optional size argument. To specify vertical
spacing, `\\` should be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change in behaviour arise from the macro/primitive changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash Partly, yes. Those changed space handling and they are updated to match their definitions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a late comment, sorry. But there is a breaking change in that \raisebox previously accepted unbraced dimensions and now it does not. That might be worth a note in the migration guide.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronkok Thank you for finding it! That was an unexpected change, but might be fixed with #2139. I'll add a note to the migration guide.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is anything to "fix". LaTeX \raisebox does not accept unbraced dimensions, so the new behavior is not a bug. It's just needs a note in the migration guide.

tokens.shift();
}
tokens.reverse(); // to fit in with stack order
return {tokens, start, end: tok};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start and end can be outside the range of tokens, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash No, they are always the first and last or the second and second to last tokens.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the case where there's a { and a } at the start and end of tokens are the start and end the braces and tokens the stuff inside the braces (after the pop() and shift())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash Sorry, my mistake. end can be outside the range of tokens, when it's delimited. In that case, end is the last delimiter.

Comment on lines +722 to +723
const expression = this.parseExpression(false, "EOF");
this.expect("EOF"); // expect the end of the argument
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use a different token to mark the end of arguments. Overloading EOF seems like it could be a bit confusing. Maybe something for a future PR given that there's already so much going on in this one.

Copy link
Member Author

@ylemkimon ylemkimon Aug 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash I agree. This is my major concern, too. This will probably cause a lot of issues that are hard to debug, but I can't seem to find a simple better way. I'll add TODO for now.

Comment on lines +231 to +233
export const normalizeArgument = function(arg: AnyParseNode): AnyParseNode {
return arg.type === "ordgroup" && arg.body.length === 1 ? arg.body[0] : arg;
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that arg could be a series of nested ordgroups? If so, would we want to unwrap each of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash I'm not sure whether it's possible, but that's not caused by changes in this PR.

Comment on lines +65 to +66
// \cr is equivalent to \\ without the optional size argument (see below)
// TODO: provide helpful error when \cr is used outside the array environment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you raise a ticket for this TODO?

Comment on lines +132 to +136
// \def\Let@{\let\\\math@cr}
// \def\math@cr{...\math@cr@}
// \def\math@cr@{\new@ifnextchar[\math@cr@@{\math@cr@@[\z@]}}
// \def\math@cr@@[#1]{...\math@cr@@@...}
// \def\math@cr@@@{\cr}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this collapse multiple \crs in a row or something like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash No, they are just how \\ is defined in LaTeX and amsmath.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'll ever get used to reading .dtx files. 😓

Copy link
Member

@k4b7 k4b7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that LaTeX treated primitives differently from macros. The summary you posted at the top of this PR was quite enlightening. Thank you for noticing this discrepancy and figuring out how to correct it. I've finished my first pass. I don't see any blocking issues, but I do have a fair number of questions about various things so I'm holding off on approving until you have a chance to respond to them.

Comment on lines -12 to +10
// same signature, we implement them as one megafunction, with newRow
// indicating whether we're in the \cr case, and newLine indicating whether
// to break the line in the \newline case.

// \DeclareRobustCommand\\{...\@xnewline}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the ... stand for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash They are omitted commands for clarity.

Comment on lines +107 to +110
// If the very last character of the <parameter text> is #, so that
// this # is immediately followed by {, TeX will behave as if the {
// had been inserted at the right end of both the parameter text
// and the replacement text.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of this situation would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash

expect`\gdef\foo#1#{#1}\foo1^{23}`.toParseLike`1^{23}`;

} else if (tok.text === "EOF") {
throw new ParseError("Expected a macro definition");
} else {
delimiters[numArgs].push(tok.text);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can tok.text be any string or should we be checking it against a list of allowed delimiters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash AFAIK, every valid TeX token can be used as delimiters.

Comment on lines +129 to +130
} else if (tok.text === "EOF") {
throw new ParseError("Expected a macro definition");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this case arise?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash If the end is reached without the macro body, e.g., \def\abc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really obvious now. 😅

Comment on lines +180 to +181
"Unexpected end of input in a macro argument," +
" expected '}' at end of input: …r{#ffffff{text}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the position information may make it hard to track down errors in some cases. The underlining was also helpful because it made it really easy to see the range of the issue. Looking at some of the test cases below, we still have both of these features in those test cases. Is it hard to maintain those features in these test cases because how macro expansion has changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash In case of macros, arguments should be scanned (fetched) first, so scanning errors are detected before parsing errors, hence changes in error position.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is just how we've been reporting parse for some time now.

expect`\textcolor{1a2}{foo}`.toFailWithParseError(
"Invalid color: '1a2'" +
" at position 12: \\textcolor{1̲a̲2̲}{foo}");
" at position 11: \\textcolor{̲1̲a̲2̲}̲{foo}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there an extra space inside of \textcolor{} in the error message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash It is probably combining underline not shown properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an extra space anymore. 🤷‍♂️

Comment on lines -2569 to -2594
describe("A document fragment", function() {
it("should have paddings applied inside an extensible arrow", function() {
const markup = katex.renderToString("\\tiny\\xrightarrow\\textcolor{red}{x}");
expect(markup).toContain("x-arrow-pad");
});

it("should have paddings applied inside an enclose", function() {
const markup = katex.renderToString(r`\fbox\textcolor{red}{x}`);
expect(markup).toContain("boxpad");
});

it("should have paddings applied inside a square root", function() {
const markup = katex.renderToString(r`\sqrt\textcolor{red}{x}`);
expect(markup).toContain("padding-left");
});
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these test cases removed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinbarabash They're test cases added in #1707, which relied on edge-case and previous (wrong) behaviors.

Copy link
Member

@k4b7 k4b7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for answering all my questions and for all of your hard work on this PR.

tokens.shift();
}
tokens.reverse(); // to fit in with stack order
return {tokens, start, end: tok};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the case where there's a { and a } at the start and end of tokens are the start and end the braces and tokens the stuff inside the braces (after the pop() and shift())?

"need to increase maxExpand setting");
}
let tokens = expansion.tokens;
const args = this.consumeArgs(expansion.numArgs, expansion.delimiters);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't realize there could be multiple character delimiters. Thanks for the explanation.

Comment on lines +585 to +586
argToken.text = str;
return argToken;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case, maybe we should have an assertion that argToken.text.length === str.length and throw if that's not the case. The error reporting locality would be better than waiting for an EOF mismatch.

Comment on lines +132 to +136
// \def\Let@{\let\\\math@cr}
// \def\math@cr{...\math@cr@}
// \def\math@cr@{\new@ifnextchar[\math@cr@@{\math@cr@@[\z@]}}
// \def\math@cr@@[#1]{...\math@cr@@@...}
// \def\math@cr@@@{\cr}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I'll ever get used to reading .dtx files. 😓

Comment on lines +129 to +130
} else if (tok.text === "EOF") {
throw new ParseError("Expected a macro definition");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really obvious now. 😅

Comment on lines +180 to +181
"Unexpected end of input in a macro argument," +
" expected '}' at end of input: …r{#ffffff{text}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this is just how we've been reporting parse for some time now.

expect`\textcolor{1a2}{foo}`.toFailWithParseError(
"Invalid color: '1a2'" +
" at position 12: \\textcolor{1̲a̲2̲}{foo}");
" at position 11: \\textcolor{̲1̲a̲2̲}̲{foo}");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an extra space anymore. 🤷‍♂️

@ylemkimon ylemkimon merged commit dc5f97a into master Sep 6, 2020
@ylemkimon
Copy link
Member Author

@kevinbarabash Thank you for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment