Pass the inferred writing script to HarfBuzz, making locl effective #7415

Open
YDX-2147483647 wants to merge 8 commits into typst:main from YDX-2147483647:locl-script

Contributor

@YDX-2147483647 YDX-2147483647 commented Nov 19, 2025

This PR lets the inline shaping engine infer the writing script (e.g., hani, latn, …) from the context surrounding the text being shaped, and pass it to HarfBuzz where appropriate.

This makes the OpenType locl (Localized Forms) feature effective in edge cases as well.
For instance, the second period mark in the example below uses the wrong glyph (corner-justified form) with v0.14.0, but will use the correct glyph (centered form) with this PR.

#set text(lang: "zh", region: "TW", font: "Noto Serif CJK SC")
#set heading(numbering: "1")
= Heading <a>

句號。@a@a 何故?

Before:

Image

After: (the screenshot is trimmed)

Image
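
For background, here is a toy model of why the script matters. OpenType layout features such as locl are registered per script and language system, so a shaper that is not told a specific script presumably looks them up under the default script instead. The sketch below is a simplified illustration, not HarfBuzz's actual lookup: the table in `lookup` is made up and does not describe Noto Serif CJK SC, and `lookup`, `features_for`, and `has_locl` are hypothetical names. Only the tags (`DFLT`, `dflt`, `hani`, `ZHT `, `locl`, `vert`) are real OpenType tags.

```rust
/// Hypothetical feature registry of a font, keyed by (script tag, language tag).
/// The contents are invented for illustration only.
fn lookup(script: &str, lang: &str) -> Option<&'static [&'static str]> {
    match (script, lang) {
        ("DFLT", "dflt") => Some(&["vert"]),
        ("hani", "dflt") => Some(&["vert"]),
        ("hani", "ZHT ") => Some(&["locl", "vert"]),
        _ => None,
    }
}

/// Features for a script and language, falling back to the script's
/// default language system, then to no features at all.
fn features_for(script: &str, lang: &str) -> &'static [&'static str] {
    lookup(script, lang)
        .or_else(|| lookup(script, "dflt"))
        .unwrap_or(&[])
}

/// Whether `locl` would be available for this script and language.
fn has_locl(script: &str, lang: &str) -> bool {
    features_for(script, lang).contains(&"locl")
}
```

With this toy table, `has_locl("hani", "ZHT ")` is true while `has_locl("DFLT", "ZHT ")` is false, which mirrors the symptom: the language is set correctly, but without the script the localized form is never found.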

Fixes #7396

Updating Noto CJK in dev assets

Note

No test is added because typst-dev-assets has no font supporting locl.
The latest Noto CJK supports locl, but the version currently in typst-dev-assets is too old. See #7396 (comment) for details.

Discussions were moved to typst/typst-dev-assets#18.

@YDX-2147483647 YDX-2147483647 marked this pull request as ready for review November 19, 2025 11:13
@laurmaedje laurmaedje added the waiting-on-review This PR is waiting to be reviewed. label Nov 19, 2025
@laurmaedje laurmaedje added the text Related to the text category, which is all about text handling, shaping, etc. label Dec 4, 2025
Comment on lines +740 to +751
// If all characters in the range have generic scripts, search beyond
// the range to determine a specific script. If it still fails, use the
// original (though generic) script.
let prior_script = std::iter::once(prior_script)
    .chain(
        // First backward then forward
        (text[..range.start].chars().rev())
            .chain(text[range.end..].chars())
            .map(|c| c.script()),
    )
    .find(|&sc| !is_generic_script(sc))
    .unwrap_or(prior_script);
Member

Is there some precedent for this particular fallback approach in other software? Sometimes, we have to go with a custom approach, but generally I would always try to look for prior art.

Member

@laurmaedje laurmaedje Feb 10, 2026

Generally speaking, since we already do script segmentation below, we probably don't need to rely on HarfBuzz's segment property guessing at all, and could always provide the appropriate script here. Though I'm not sure whether that would have unintended consequences if applied naively.

Contributor Author

> Is there some precedent for this particular fallback approach

Well, I am not sure…

Regarding the overall direction, I discussed the issue briefly by email with a member of the W3C Chinese Layout Task Force. I didn't show him the code or any detailed algorithm, but his view was that it should be fixed by the high-level engine, not by the font or HarfBuzz.

As for the specific fallback approach, to be honest, I (kind of) came up with it on my own.

The root cause of the issue is that markup can interrupt the segmentation algorithm, so shape_range(text, range, ...) needs information from beyond range. If there were no markup, those ranges would be passed to shape_range as a whole and everything would be fine.

To get around that, this PR incorporates information from beyond range into shape_range and shape_segment, as if the ranges had never been split by markup. In this sense, this PR does not propose any new approach.
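
To illustrate the idea, here is a minimal, self-contained sketch. It is not the PR's actual code: `Script`, `script_of`, and `infer_script` are hypothetical names, the classifier covers only a few characters instead of real Unicode script data, and it assumes that characters inside the range are consulted before the surrounding text.

```rust
use std::ops::Range;

/// Coarse stand-in for the Unicode script property (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Common,
    Latin,
    Han,
}

/// Toy classifier; real code would use full Unicode script data.
fn script_of(c: char) -> Script {
    match c {
        'A'..='Z' | 'a'..='z' => Script::Latin,
        '\u{4E00}'..='\u{9FFF}' => Script::Han,
        _ => Script::Common,
    }
}

/// Picks a script for `text[range]`: first from inside the range, then
/// from the text before it (scanning backward), then from the text after
/// it. Returns `Script::Common` if no specific script exists anywhere.
/// `range` must lie on character boundaries of `text`.
fn infer_script(text: &str, range: Range<usize>) -> Script {
    text[range.clone()]
        .chars()
        .chain(text[..range.start].chars().rev())
        .chain(text[range.end..].chars())
        .map(script_of)
        .find(|&script| script != Script::Common)
        .unwrap_or(Script::Common)
}
```

For example, if markup splits the ideographic full stop off from "句號。" so that it is shaped as the byte range 6..9 on its own, `infer_script("句號。", 6..9)` finds no specific script inside the range, scans backward, and yields `Script::Han`.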

Nonetheless, AI agents (Claude in English and Kimi in Chinese) claim that the current approach matches GTK/Pango, ICU, Skia, Qt QTextEngine, etc. I cannot verify their claims. Here's what I know:

> we probably don't need to rely on HarfBuzz's segment property guessing

I agree. WebKit developers ran into a related bug in 2013: “Leaving direction to HarfBuzz to guess is really bad, but will do for now.”

However, considering that there are currently not many related issues, I think it's not worth refactoring at this time. It would be better to keep it as it is.

Copy link
Copy Markdown

Letting HarfBuzz guess any segment properties is really bad; production code should never do that (in my opinion, it is even a mistake that HarfBuzz exposes this API).

Script segmentation should be done on the full paragraph text at once, just like bidi, so inline markup should not interrupt it.

Another implementation that might be easier to follow is Raqm's.
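
As a rough sketch of what paragraph-level script segmentation could look like (a simplified illustration, not Raqm's or Typst's actual algorithm): the whole paragraph is walked once, and characters with a generic script join the neighboring run instead of starting their own. `Script`, `script_of`, and `script_runs` are hypothetical names, the classifier is a toy, and real implementations also handle things like paired brackets and Script_Extensions, which are ignored here.

```rust
use std::ops::Range;

/// Coarse stand-in for the Unicode script property (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Common,
    Latin,
    Han,
}

/// Toy classifier; real code would use full Unicode script data.
fn script_of(c: char) -> Script {
    match c {
        'A'..='Z' | 'a'..='z' => Script::Latin,
        '\u{4E00}'..='\u{9FFF}' => Script::Han,
        _ => Script::Common,
    }
}

/// Splits a whole paragraph into (byte range, script) runs.
/// Generic characters extend the run before them; a leading run of
/// generic characters adopts the first specific script that follows.
fn script_runs(text: &str) -> Vec<(Range<usize>, Script)> {
    let mut runs: Vec<(Range<usize>, Script)> = Vec::new();
    for (start, c) in text.char_indices() {
        let end = start + c.len_utf8();
        let script = script_of(c);
        if let Some((range, current)) = runs.last_mut() {
            let compatible = script == Script::Common
                || script == *current
                || *current == Script::Common;
            if compatible {
                range.end = end;
                if *current == Script::Common {
                    *current = script;
                }
                continue;
            }
        }
        runs.push((start..end, script));
    }
    runs
}
```

With this sketch, `script_runs("句號。abc 何故?")` produces three runs: bytes 0..9 as Han (including the full stop), 9..13 as Latin (including the space), and 13..20 as Han (including the question mark). Shaping a sub-range cut out by markup could then reuse the script of the run it falls into rather than guessing again.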

@laurmaedje laurmaedje added waiting-on-author Pull request waits on author and removed waiting-on-review This PR is waiting to be reviewed. labels Feb 10, 2026
Copilot AI added a commit to YDX-2147483647/typst that referenced this pull request Mar 11, 2026
typst#7415 review questions)

Co-authored-by: YDX-2147483647 <73375426+YDX-2147483647@users.noreply.github.com>
@laurmaedje laurmaedje added waiting-on-review This PR is waiting to be reviewed. and removed waiting-on-author Pull request waits on author labels Mar 11, 2026

Labels

text Related to the text category, which is all about text handling, shaping, etc. waiting-on-review This PR is waiting to be reviewed.

Development

Successfully merging this pull request may close these issues.

The inferred writing script is not passed to HarfBuzz, making locl ineffective

3 participants