
Create VarElem type replacing TextElem in math (WIP)#1779

Closed
damaxwell wants to merge 9 commits into typst:main from damaxwell:mathtext

@damaxwell
Contributor

This PR is a work in progress and is a partial implementation of option (2) of RFC #6, #1125 concerning text vs. math in formulas in an attempt to shake down what the core issues are. Sometimes you need to implement something to find out where the tricky parts lie. In this case, some of the (initial) hard stuff turns out to be considerations not discussed in #1125: symbols, and regex. The PR is a draft, and will likely just serve as guidance toward whatever the end solution is.

The core issue is that when displaying mathematics

  • some text should be easy to typeset in the same style as the document
  • mathematical text is handled completely separately:
    • Most fonts cannot be used for math typesetting.
    • Math fonts typically have limited kerning/glyph substitutions, making them unsuitable for ordinary text.
    • The mechanisms for making things bold or italic are entirely different for the two contexts (use a different codepoint vs use a different face)
    • Math fonts have optical sizing features missing from ordinary fonts.
    • Etc.

Currently Typst has only one notion of text. In a formula environment, text consisting of a single character is treated very specially and corresponds to mathematical text. Runs of numbers have their own, analogous, special treatment as mathematical text, and everything else is essentially treated as ordinary text. Ordinary text is run through the same mechanism Typst uses to lay out text in a paragraph, except that it's using a suboptimal math font to do ordinary text typesetting.

One natural attempt to remedy this is to introduce two different kinds of text that can appear simultaneously: ordinary and math. Given $ x "w" $, we want x to be math text and "w" to be ordinary, and the first thing to settle is which of these corresponds to the currently existing text element function. However, if the math text occupies the role of text, I'm not sure how to get "w" to escape back to the text settings when the formula was started so that the document's text properties can be inherited (much less be altered on the fly).

So it seems that the right thing to do is have text mean the same thing inside and outside of a formula. This means there needs to be a different entity to represent mathematical text. This is option (2) of #1125, and the name of the entity for the sake of argument is math.var.

Basic design

The PR introduces two new element functions, VarElem and SymbolElem, though only VarElem is exposed (as math.var). When parsing a formula, material that used to parse as SyntaxKind::Text now becomes SyntaxKind::MathVar; that would be singleton-characters and numbers. No other changes are made at the parser level and no syntax sugar is available for a VarElem.

Upon evaluation, a SyntaxKind::MathVar entity becomes a VarElem. Previously, raw symbol values (generated via the symbol function, e.g.) used to evaluate directly as TextElem. This isn't an option anymore because mathematical symbols would then be forever stuck as text. The mushy solution landed upon is the introduction of the SymbolElem. When rendered into a document outside of a formula, it converts itself into a TextElem, and inside a formula it becomes a VarElem. Sounds reasonable enough, but see below.
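A minimal sketch of the context-dependent behavior of SymbolElem described above. The `Context` and `Elem` types and the `realize` method are hypothetical stand-ins invented for illustration; they are not the PR's actual types or API.

```rust
/// Where a piece of content ends up being realized (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Context {
    Document,
    Formula,
}

/// Simplified stand-ins for TextElem, VarElem and SymbolElem.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Elem {
    Text(String),
    Var(String),
    Symbol(char),
}

impl Elem {
    /// A symbol turns into text outside a formula and into a var inside one.
    /// Text and var keep their flavor regardless of context.
    pub fn realize(self, ctx: Context) -> Elem {
        match (self, ctx) {
            (Elem::Symbol(c), Context::Document) => Elem::Text(c.to_string()),
            (Elem::Symbol(c), Context::Formula) => Elem::Var(c.to_string()),
            (other, _) => other,
        }
    }
}
```

In this model `Elem::Symbol('∑').realize(Context::Formula)` gives a var while the same symbol realized in `Context::Document` gives text, which is exactly the "neither fish nor fowl" behavior that causes trouble later on.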

There were a number of places where TextElems were inserted into formulas on the fly (e.g. dif). These seem to be all stamped out now and are all VarElems.

When laying out a math formula, there are now two pathways for rendering text. When a TextElem is encountered one path is chosen; for now it is similar to what Typst did before, except that there is no special casing for single characters or runs of numbers, and no math styling is applied to the characters to make them, e.g., fraktur. This is just text. Note, however, that text is still laid out with the math font. When a VarElem is encountered, there is a separate pathway that accounts for large operators changing size, math styling to calligraphic or fraktur or bold, etc., and conversion to optical sized glyphs. Singleton characters still get special treatment so as not to break too much all at once, and all other runs of text are run through Typst's standard mechanism, except that the font's script is set to math so that math features like optical sizing can be turned on.
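To make the "different codepoint vs different face" distinction concrete, here is a rough sketch of the two pathways. It assumes that the var path maps ASCII letters to the Unicode Mathematical Italic block while the text path leaves characters alone. The function and type names are hypothetical, and real math styling covers far more cases (digits, Greek, bold, fraktur, and so on).

```rust
/// Map an ASCII letter to its Unicode "mathematical italic" codepoint.
/// Anything else is returned unchanged.
pub fn math_italic(c: char) -> char {
    match c {
        // The italic small h slot is unassigned; U+210E (Planck constant) is used.
        'h' => '\u{210E}',
        'a'..='z' => char::from_u32(0x1D44E + (c as u32 - 'a' as u32)).unwrap_or(c),
        'A'..='Z' => char::from_u32(0x1D434 + (c as u32 - 'A' as u32)).unwrap_or(c),
        _ => c,
    }
}

/// Simplified stand-ins for TextElem and VarElem.
pub enum Elem {
    Text(String),
    Var(String),
}

/// Characters that would be handed to the shaper on each pathway.
pub fn styled_chars(elem: &Elem) -> String {
    match elem {
        // Text path: no math styling, the characters pass through as-is.
        Elem::Text(s) => s.clone(),
        // Var path: styling is done by substituting codepoints.
        Elem::Var(s) => s.chars().map(math_italic).collect(),
    }
}
```

With this sketch a var "x" becomes U+1D465 (𝑥), while the text "iff" is untouched and would get its italic or bold look from a different font face instead.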

Even with this partial progress, there are some wins.

  • Ordinary text "a" vs "aa" doesn't bounce around between italic or not.
  • Runs of numbers and operator names now participate in optical sizing. Currently, only singleton characters are optically sized. Now:
$ e^e^(sin x) $
image
  • Text and math are starting to separate style-wise.
#set text(fill:blue)
#show math.var: set text(fill:red)
$ x' < y' "iff" y' > x' $
image

Regex

Ugh. I didn't see this one coming. Typst allows for regex-style show rules, and now there are two different kinds of text (three if you count the fact that symbols are a third entity that are neither fish nor fowl). In the current implementation, a regex pattern is neutral, and can be specified by a symbol, in which case the pattern is the character. When matching, text, var or a symbol can all be matches. When matching on a symbol, it's just replaced. When matching on a text or var, the remaining fragments have the same type as you started with.

I have not kicked the tires on this too much. The tests pass. I saw some evidence of fragility (infinite recursion), but I think that some of this was already present. For example,

#show "a": [a b]
a

breaks on main already.

Tests

A number of reference images changed. Many are duplicates of the changes from pending PR #1774 (there has been some accidental mixing of CM Regular and Book faces by default in Typst). Some are things like newly optically sized numbers in exponents. I saw no degradations -- the changes are either very slight or are improvements.

Five .typ test files needed small updates. The updates were all cases where a string (or something that became a string) needed to be wrapped in var so that it was laid out as math, not text. I didn't figure out how to get a var to construct from both a string and an int (I haven't wrapped my head around the cast system yet), so it only takes strings in this implementation. A couple of the changes to the .typ files would have been smaller if something like var(3) were possible, which presumably is easy enough to implement if you know what you are doing.

Thoughts

Having two kinds of text adds model complexity. Sprinkling a few vars around makes programming Typst just a bit harder. On the other hand, the cases where this was needed in the test files already had some sophistication.

I don't like the solution in this implementation for symbols. They are the only thing that bridges the gap between text and var and they introduce all kinds of complexity. For example, consider $ [a #sym.sum b] $. Is the sum going to be rendered as text or var? In fact, the a and b are text and the sum is var. Ugh.

The worst parts of this PR relate to the mixing of text and var via symbols and via regex. Given this experience, I think a not awful path forward would be to just embrace the fact that there are two kinds of text, and there are along with them two kinds of symbols. Already, we have emoji and sym which have little semantically in common. So introduce a math symbol as a primitive type. All the sym things are one of these, and all the emoji are the other. Upon evaluation, the sym things become var and the emoji things become text (as symbols do currently). There is a math.symbol function to create math symbols, and it's entirely analogous to the current symbol function, which would now create text symbols.

Once this is committed to, for regex I think there is text regex and var regex. Not 100% on this.

As for var, it needs to grow up. It should become its own flavor of text and wrap characters to render in an OpenType math font in the way that text wraps characters to render in a usual OpenType font. So var has some properties that text has (like size and fill -- both of which can be set to auto, which falls back to the text values instead) and it doesn't have parameters that are irrelevant to math layout (looking at you, hyphenate) but in the course of time is the seat of things like cal and frak and bb. With this in play, it would then be possible to do things like easily change the font used for calligraphic letters only.

#show math.var.where(style:"cal"): set math.var(font:...)

Some of the big work needed to make var grow up would be to figure out how to repurpose the text rendering from shaping.rs and par.rs to the new setting. Maybe hijacking shape_range in par.rs would suffice.

Anyway, lots to consider here.

@damaxwell
Contributor Author

damaxwell commented Jul 25, 2023

var grew up, and now owns its size, fill, and font. 🎉

The size and fill are tied to that of text when set to auto, the default.
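The auto fallback can be sketched as follows. This is only an illustration of the resolution rule stated above: the `Smart` enum, the two style structs and the `resolve` function are simplified, hypothetical stand-ins and not the PR's actual style-chain code.

```rust
/// Either "inherit" (auto) or an explicit value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Smart<T> {
    Auto,
    Custom(T),
}

impl<T> Smart<T> {
    /// Use the explicit value if present, otherwise the fallback.
    pub fn unwrap_or(self, fallback: T) -> T {
        match self {
            Smart::Auto => fallback,
            Smart::Custom(value) => value,
        }
    }
}

/// The properties of ordinary text that var can inherit (simplified).
pub struct TextStyle {
    pub size_pt: f64,
    pub fill: &'static str,
}

/// The var properties; both default to auto.
pub struct VarStyle {
    pub size_pt: Smart<f64>,
    pub fill: Smart<&'static str>,
}

/// Effective size and fill for a var: its own settings win,
/// auto falls back to whatever text currently uses.
pub fn resolve(var: &VarStyle, text: &TextStyle) -> (f64, &'static str) {
    (var.size_pt.unwrap_or(text.size_pt), var.fill.unwrap_or(text.fill))
}
```

With text set to 18pt blue, a var with both fields on auto resolves to 18pt blue, and a var with explicit 11pt red resolves to 11pt red, matching the two examples that follow.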

Normal size.\
#set text(fill:blue, size:18pt)
$e^(pi i)$
image

Settings of size and fill on var override the corresponding settings on text.

Normal size.\
#set text(fill:blue, size:18pt)
#set math.var(fill:red, size: 11pt)
$e^(pi i)$
image

The text font in math is independent of the math font. If you really want Comic Sans, you can have it.

#set text(font:"Comic Sans MS");
$"whenever" x in RR$
image

Comic Sans is a bit tall compared to Computer Modern. We'll make it smaller.

#set text(font:"Comic Sans MS");
#show math.equation: it => { 
    show text: set text(size:10pt)
    it
}
$"whenever" x in RR$
image

That's better! Note that #show math.equation: set text(size:10pt) would not have sufficed: it would have set the text font size for the whole equation, and var would have inherited the small size as well.

Comic Sans is silly. How about something more respectable: the default font of Linux Libertine with STIX Two Math.

#set math.var(font:"STIX Two Math")
$"whenever" x in RR$
image

Just because we can:

#set math.var(font:"STIX Two Math", fill:green)
#set text(fill:blue);
$"whenever" x in RR$
image

The math font is not fixed for all time at the start of the equation. You can change math fonts on the fly.

#let gyrecal(x) =  {
    set math.var(font:"TeX Gyre Termes Math")
    $cal(#x)$
}
$cal(G) "vs" gyrecal(G)$
image

Note that the "vs" is in Libertine. 😁

Notes

To get a quick implementation up and running I took a shortcut. The var text rendering spoofs itself as text. That is, it builds a brand new style from scratch each time it lays out a var with just the settings it needs to get the right effect (empty style, then add in the desired text properties). The right thing to do is to make some adjustments to shaping.rs so that it isn't using hard-coded TextElem style calls and then call it directly for vars. It's a bit of work, but not so bad, I think.

I lost a battle with the borrow checker. The MathContext used to keep some pointers around tied to the math Font (some of which involved computation, so caching was a good plan). This was OK if the Font was fixed throughout the duration of math layout, but if the Font can change unpredictably, Rust (reasonably) makes it hard to cache pointers to ephemeral data. I bet someone more fluent than me could work around this. For now, it's a bit of extra computation per character.
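One conventional way around this kind of borrow problem, sketched here purely as an illustration and not as what the PR does, is to cache owned values keyed by a font identifier rather than holding references into the font. Everything below (`FontId`, `MathConstants`, `ConstantsCache`) is hypothetical and assumes fonts can be given a cheap, hashable id.

```rust
use std::collections::HashMap;

/// Hypothetical cheap handle identifying a loaded font.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FontId(pub u32);

/// Hypothetical owned copy of the per-font data that math layout needs.
#[derive(Debug, Clone, PartialEq)]
pub struct MathConstants {
    pub axis_height: f64,
}

/// Cache of owned per-font data. Because it stores values, not references
/// into a font, it stays valid when the math font changes mid-formula.
pub struct ConstantsCache {
    entries: HashMap<FontId, MathConstants>,
    /// Number of times the expensive computation actually ran.
    pub computations: usize,
}

impl ConstantsCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new(), computations: 0 }
    }

    /// Return the cached constants for `id`, computing them on first use.
    pub fn get(&mut self, id: FontId, compute: impl FnOnce() -> MathConstants) -> &MathConstants {
        let computations = &mut self.computations;
        self.entries.entry(id).or_insert_with(|| {
            *computations += 1;
            compute()
        })
    }
}
```

The trade-off in this sketch is a hash lookup per access and a copy of the data per font, in exchange for not tying the cache's lifetime to any single font.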

There are a number of "FIXME"s for minor things sprinkled in the code.

Aside from the test file var.typ, which was expanded, none of the other test source files were changed, and all of the tests pass. No reference images aside from var.png were changed. To make this happen, the preamble to the tests was set up with a global show rule, effectively

show math.equation: set text(font:"Computer Modern Math", weight: 450)

This is what math.equation is essentially doing in standard Typst. Note that this is setting the text font, not the var font, which already has the right default. In the real world, nobody should be setting a math font for their text font. If you want CM as a text font:

#set text(font:"New Computer Modern")
$"difficult:" x divides Phi $
image

Compare with using a math font for the text, Typst's current behavior.

#set text(font:"New Computer Modern Math")
$"difficult:" x divides Phi $
image

The ligatures are gone.

I'll remove the preamble show rule in a subsequent commit, but I thought it was important to exhibit that the current update (allowing var to own its own font independent of text) could be done with zero additional changes to the test files.

One of the tests inserted raw text (stuff in backticks) into an equation and was initially failing. It turns out that it's hard (impossible?) to tell from a content stream whether or not it's supposed to be laid out with spacing according to math vs. text conventions. In the case of raw text, it just expands in the end to some styles that change the text font, some colors, and a few other text settings, and then some text elements. All of this could have come from a math setting. So how is it supposed to be spaced? Math or text? Could be either. In Typst's current implementation, it uses a font change as the marker to start standard content layout. That's a neat hack, but it's not robust when you can have math and text fonts changing on the fly. Moreover, the text font is now different from the math font, so it might not even change when a switch to normal content is supposed to begin.

To deal with this, I introduced one more element function, math.ord. It's a marker that the content it contains is to be laid out with ordinary text conventions. When the math typesetter sees it, it passes the content off to the standard content layout routines instead. Conversely, when standard document layout sees an ord, it simply expands it to its content. Now raw text wraps itself in an ord to protect itself. There are probably a few other places in the standard library that need an ord wrapper.
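The role of the ord marker can be sketched as a pair of mutually aware layout routines. This is a toy model under stated assumptions: the `Content` and `MathItem` types and both functions are hypothetical, strings stand in for laid-out frames, and the treatment of a var outside math is a guess made only to keep the match exhaustive.

```rust
/// Toy content tree (hypothetical, much simpler than Typst's Content).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Content {
    Text(String),
    Var(String),
    /// Marker: lay out the inner content with ordinary text conventions.
    Ord(Box<Content>),
    Seq(Vec<Content>),
}

/// What the math typesetter produces in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MathItem {
    /// Laid out with the math font and math spacing.
    Var(String),
    /// Text that is still spaced as part of the math run.
    Text(String),
    /// A unit produced by the standard document layout.
    Ordinary(String),
}

/// Standard document layout: an ord simply expands to its content.
pub fn layout_document(content: &Content) -> String {
    match content {
        // Assumption for this sketch: outside math a var contributes its characters.
        Content::Text(s) | Content::Var(s) => s.clone(),
        Content::Ord(inner) => layout_document(inner),
        Content::Seq(items) => items.iter().map(layout_document).collect(),
    }
}

/// Math layout: an ord is handed off wholesale to the document layout.
pub fn layout_math(content: &Content, out: &mut Vec<MathItem>) {
    match content {
        Content::Var(s) => out.push(MathItem::Var(s.clone())),
        Content::Text(s) => out.push(MathItem::Text(s.clone())),
        Content::Ord(inner) => out.push(MathItem::Ordinary(layout_document(inner))),
        Content::Seq(items) => {
            for item in items {
                layout_math(item, out);
            }
        }
    }
}
```

The point of the explicit marker is that the hand-off no longer depends on detecting a font change, so it keeps working when text and math fonts vary independently.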

Initially I was going to make ord a hidden thing, but after playing around, I've come to the conclusion that it's handy to have as a tool. If there is ambiguity about text vs math, you have the possibility to wrap in an ord and guarantee text layout. So math.ord is a thing.

Next?

This is starting to feel like a thing that you can kick the tires on and see how it goes. There's still no syntax sugar, but that's fine for now.

The regex/symbol awkwardness has not been addressed. I'm still inclined to have a hard split between text/math symbols. But that's not a priority compared to getting math vs text interactions shaken down.

The next thing to do to clean this PR up would be to eliminate the spoofing and have proper, dedicated shaping.

@damaxwell
Contributor Author

No more text spoofing. The shaper can get properties from either text or var in a run.

@laurmaedje
Member

This is really great work! I don't want to rush the review here since the topic is very intricate and I really want to think through the problem and your solution. Since I'm currently quite busy working on web app issues, I have to postpone it a bit, but I'll try to get to reviewing it soon.

@PgBiel
Contributor

PgBiel commented Jul 27, 2023

This is amazing work! Thanks for the detailed explanation as well!

As a side-note, please don't worry too much about the custom syntax. Considering it involves changes to the syntax (duh), and not to the layouting logic, I'm fairly sure this should only be implemented in a future, distant PR (assuming we get to a consensus about that). Having a functional var element should certainly be the top priority at first, as that's what will solve most of the relevant issues outlined in the RFC (if not all).

@damaxwell
Contributor Author

No worries about the delay reviewing this, @laurmaedje. I had noticed that you've been focussed lately on the web app side. This is a big change (if indeed something like it gets accepted) and should be examined carefully. I've got a little more refining I want to do, and will document the updates with comments here.

As for dedicated syntax, @PgBiel , I've come to the conclusion that there shouldn't be any. It's actually rare that you make a var, unless you are coding and reinterpreting some text or some numbers as math, in which case dedicated syntax isn't useful anyway. Here are the common kinds of text in math

  • single characters to be typeset as var (which we have dedicated syntax for already)
  • ordinary text in the document font, which is in double quotes
  • numbers to be typeset as math, which the parser already converts to var. Thanks, parser!
  • Operator names like sin and friends.

It's only the last one that we don't have syntax for; we have op instead, which is now just shorthand:

op(x) = class("large", upright(var(x)))

It's not var's job to make things upright, nor to change the class of its text. It just says "this text is to be typeset with the math font and conventions, thank you!". By default, it's always italic. Otherwise, if you apply a regex rule, you can split your var into pieces. If one of the pieces has length 1, you don't want it changing style on you. Since most vars are length 1, they should be italic. So all vars are. It's not often you want an all-italic word in math. So you don't need the syntax.

I could imagine a case to be made for syntax for op. But op isn't bad for one-offs, and you can always make a function for your own custom operators.

Today's update

Bug fix, so that var(font:"new-font","x") works. Previously, set math.var(font:"new-font") worked fine. But for single characters, the former was broken.

var can now construct from both an int and a string. This simplified most of the changes to the test files that needed changes, which are now all of the form "wrap something in var, because it would otherwise be text".

Some cleanup of the var interface. Most importantly, it used to have a property weight, that was analogous to the text property. But you really only end up using that property to find the math font (it distinguishes between New CM Math and New CM Math Book, for example). When typesetting, the thing that acts like weight for text isn't actually affected by the weight parameter. So it's renamed: base-weight. As in:

set math.var(font:"New Computer Modern Math", base-weight:450)

to get CM Math Book. It's the same as

set math.var(font:"New Computer Modern Math Book")

@PgBiel
Contributor

PgBiel commented Jul 27, 2023

As for dedicated syntax, @PgBiel , I've come to the conclusion that there shouldn't be any. It's actually rare that you make a var, unless you are coding and reinterpreting some text or some numbers as math, in which case dedicated syntax isn't useful anyway.

I think this is a fair point overall. As I see it, we should probably release it without the syntax when it's ready, and collect user feedback regarding usage of var, to make sure we made the right decision.

@damaxwell damaxwell marked this pull request as draft July 27, 2023 21:20
@damaxwell
Contributor Author

The last update made good on the principle that var has no effect on styling (i.e. it doesn't become upright depending on its length). I'll leave this PR alone for a while now.

I'm happy with the parts of the PR that are solely about transferring back and forth between math and written text. The interface for this feels easy and natural. There is one major outstanding issue that needs to be addressed: regex. If it weren't for that, I'd be really happy with where this ended up.

The issues (in the easiest case of show "stuff": <content>):

  • The pattern is a generic string, possibly represented by a symbol, which encodes its character.
  • The replacement is not generic. It is content, where the text comes in both flavors, and the symbols are straddling the fence.
  • The replacement content has no way of knowing which flavor it is replacing. The remnants of the match will be the same flavor you started with, but the replacement can be full of things that are unrelated. To replace a snippet of a var with a var, you'd use show "x": var("y"), but if that matches against text "axe" you get text a, var y and text e. Not so useful.
  • I don't see a good way in the current setup to match just text or just var with all the same semantics and outcomes otherwise.
  • Symbols being hybrid things makes reasoning about them trickier.
  • One could split symbols into text or math and have them be text or var. But there are a bunch of symbols that are in sym and aren't really math. How about dagger. Not very math-y. Or emdash. Or comma. That one seems like it gets used in both math and text from time to time.
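The mixed-flavor outcome from the third bullet can be reproduced with a small model. This is a sketch only: `Run` and `apply_show_rule` are hypothetical names, and a literal substring search stands in for real regex matching.

```rust
/// A run of characters in one of the two flavors.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Run {
    Text(String),
    Var(String),
}

/// Apply `show pattern: replacement` to one run. Each match is swapped for
/// the replacement as-is, and the leftover pieces keep the flavor of the
/// run they came from.
pub fn apply_show_rule(run: &Run, pattern: &str, replacement: &Run) -> Vec<Run> {
    if pattern.is_empty() {
        return vec![run.clone()];
    }
    let (source, is_var) = match run {
        Run::Text(s) => (s.as_str(), false),
        Run::Var(s) => (s.as_str(), true),
    };
    // Rebuild a remnant in the same flavor as the original run.
    let rebuild = |piece: &str| {
        if is_var {
            Run::Var(piece.to_string())
        } else {
            Run::Text(piece.to_string())
        }
    };
    let mut out = Vec::new();
    let mut rest = source;
    while let Some(pos) = rest.find(pattern) {
        if pos > 0 {
            out.push(rebuild(&rest[..pos]));
        }
        out.push(replacement.clone());
        rest = &rest[pos + pattern.len()..];
    }
    if !rest.is_empty() {
        out.push(rebuild(rest));
    }
    out
}
```

Applying the rule with pattern "x" and replacement `Run::Var("y")` to `Run::Text("axe")` yields text a, var y, text e: the replacement has no way to adapt to the flavor it landed in.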

I think what's needed at this point is a concrete specification of how regex should work in the presence of two flavors of text, possibly with symbols that render to either depending on whether the context is math or not. What should be the rules? I don't have a solution in mind that's good enough that I'd want to implement it.

One possibility is that bare strings and regex just match on text. There is a math.regex that matches instead on var. But then how about symbols? Do symbols match on text regex? They might become text, but they might become var, and I think you don't know during realization which context you are in. I'm a little stumped.

@laurmaedje
Member

laurmaedje commented Jul 28, 2023

This reminds me that regex matches in raw/code blocks are also a problem that comes up from time to time. Also things like hyphenation affecting raw blocks. Is raw yet another flavor of text? One possible route would be regex flags for var/text/raw, and I guess normal strings would be just text, but I'm not sure I like it. Also show var("x") would be nice. But allowing arbitrary content to be replaced is also very complex.

@damaxwell
Contributor Author

Ok, so here's a potential approach. There are still things not to like, but this might be good enough for now.

  1. Symbol primitives go back to their old behavior. Upon evaluation, they are text, nothing more. sym.integral means text-flavored integral, if that's what you really want.
  2. For text regex, all of the following are legit:
    • show "hello": <replacement>
    • show regex(<pattern>): <replacement>
    • show sym.check: <replacement>
  3. In all these cases, only text matches the given pattern, never var.
  4. There is a new primitive type that most people don't need to know exists, mathsymbol, say.
  5. Upon evaluation, a mathsymbol becomes var in the same way that an ordinary symbol becomes text.
  6. The math module no longer imports sym, but instead contains entities for everything in sym, except as mathsymbols. If you enter $theta$, you're getting a var. Similarly, math.theta gets you the var theta, whereas sym.theta gets you its text flavor.
  7. There is a separate math regex function, accessed as math.regex. It only matches on var. A mathsymbol is a legitimate selector, and is shorthand for math.regex(str(symbol)).
  8. So the following are all ok (once issues with recursion are addressed more generally):
    • show math.theta: math.theta.alt
    • show math.regex("/"): math.class("normal",$/$)
  9. There is a math.symbol function that mirrors the one for text, but creates mathsymbols, so you can make your own, and they participate in math regex just fine.
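The matching rules in this proposal can be modeled roughly as selectors that carry a flavor. This is a sketch of the proposal only, not of any implemented API: `Flavor`, `Selector` and its constructors are hypothetical names, and substring containment stands in for regex matching.

```rust
/// The two flavors of text in the proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Flavor {
    Text,
    Var,
}

/// A show-rule selector restricted to a single flavor.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Selector {
    pub flavor: Flavor,
    pub pattern: String,
}

impl Selector {
    /// Models a bare string, regex(..) or text symbol: matches text only.
    pub fn text(pattern: &str) -> Self {
        Self { flavor: Flavor::Text, pattern: pattern.to_string() }
    }

    /// Models math.regex(..): matches var only.
    pub fn math_regex(pattern: &str) -> Self {
        Self { flavor: Flavor::Var, pattern: pattern.to_string() }
    }

    /// Models a mathsymbol used as a selector: shorthand for
    /// math_regex applied to the symbol's character.
    pub fn math_symbol(symbol: char) -> Self {
        Self::math_regex(&symbol.to_string())
    }

    /// A selector never matches content of the other flavor.
    pub fn matches(&self, flavor: Flavor, content: &str) -> bool {
        self.flavor == flavor && content.contains(self.pattern.as_str())
    }
}
```

In this model a `math_regex("/")` selector matches a var containing a slash but never text containing one, and `math_symbol('∫')` is the same selector as `math_regex("∫")`.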

Things I like:

  • There is a consistent theme: stuff in double quotes invokes ordinary text. This is true both for regex and for material appearing in formulas.
  • Another consistent theme: regex selectors for formulas generally start with math (via math.regex or via a premade math symbol).
  • Symbols don't dance around (text? var?) and are easier to reason about. A math.Sigma is var, no matter what context you write it in.
  • When writing formulas, no change.
  • When doing regex for math symbols, the change is minor and intuitive: sym -> math.
  • show math.integral: and show math.regex("∫"): are identical.

Things I don't like:

  • The layer of math symbols vs text symbols may take an explanation for beginners: "math.theta likes to present itself in math mode, sym.theta likes to present itself as document text"
  • there is no math analog of show "hello": "howdy". This could be added in the future if there were notation for var. For the sake of argument, suppose @"stuff" is notation for var. Then show @"/": math.class("normal", @"/") becomes legit. This application is the strongest argument I can think of for having dedicated notation for var.
  • As it stands, show math.regex("/"): math.class("normal",$/$) is awkward, in that the slash on the left is in quotes and the one on the right is in dollar signs. If you write show math.regex("/"): math.class("normal", "/") then your var slashes are going to get replaced with text, because "/" is text in math mode. You could also write show math.regex("/"): math.class("normal",var("/")), which is wordier but has all the slashes in quotes at least.

While there is merit to show var("x"), it's complicated by the fact that var has properties. Then you have to untangle show var(x, fallback:false). I'd rather see, if needed, syntax for unstyled var in the same way that a string is unstyled text.

I kinda like show math("/"): as the var analog of show "/":, but this really overloads what math means. It's already happy just being a module.

At any rate, if the principle going forward is that "math text and ordinary text have separate flavors, and regex is restricted to just one or the other", then I think this proposal allows it to happen with the most significant downside that simple var matching requires math.regex. But this can be addressed later on if notation for var is decided upon.

@laurmaedje laurmaedje added the waiting-on-review This PR is waiting to be reviewed. label Aug 8, 2023
@laurmaedje laurmaedje added waiting-on-design This PR or issue is blocked by design work. and removed waiting-on-review This PR is waiting to be reviewed. labels Nov 21, 2023
@laurmaedje
Member

I'm closing this as there's still design work to do and PRs are not a great place to do it. It needs more quick discussion. I've opened a new topic in the contributors forum on Discord.

3 participants