Pass the inferred writing script to HarfBuzz, making locl effective #7415
YDX-2147483647 wants to merge 8 commits into typst:main
    // If all characters in the range have generic scripts, search beyond
    // the range to determine a specific script. If it still fails, use the
    // original (though generic) script.
    let prior_script = std::iter::once(prior_script)
        .chain(
            // First backward then forward
            (text[..range.start].chars().rev())
                .chain(text[range.end..].chars())
                .map(|c| c.script()),
        )
        .find(|&sc| !is_generic_script(sc))
        .unwrap_or(prior_script);
Is there some precedent for this particular fallback approach in other software? Sometimes, we have to go with a custom approach, but generally I would always try to look for prior art.
Generally speaking, since we already do script segmentation below, we probably don't need to rely on HarfBuzz's segment property guessing at all and could always provide the appropriate script here. Though I'm not sure whether that has any unintended consequences if applied naively.
> Is there some precedent for this particular fallback approach
Well, I am not sure…
Regarding the overall direction, I discussed the issue briefly with a member of W3C Chinese Layout Task Force via email. I didn’t show him the code or any detailed algorithm, but he inferred that it should be fixed by the high-level engine, not the font or HarfBuzz.
As for the specific fallback approach, to be honest, I (kind of) came up with it on my own.
The root cause of the issue is that markup can interrupt the segmentation algorithm, so shape_range(text, range, ...) needs information from beyond range. If there were no markup, those ranges would be passed to shape_range as a whole and everything would be okay.
To get around that, this PR incorporates information from beyond range into shape_range and shape_segment, as if the ranges had never been split by markup. In this sense, this PR does not propose any new approach.
Nonetheless, AI agents (Claude in English and Kimi in Chinese) claim that the current approach matches GTK/Pango, ICU, Skia, Qt QTextEngine, etc. I am unable to verify their claims. Here's what I know:
- UAX #24: Unicode Script Property:
  > A value of Inherited means that the character is treated as if it had the Script property value of a preceding base character.

  This PR matches the above description, except that it also looks at subsequent characters. That's necessary because typst-layout/src/inline/shaping.rs does not distinguish between Inherited and Common, and ( in “(示例……)” is Common and hence requires looking forward.
- ICU4C uscript_nextRun does not consider information beyond the range, and it has a subtle treatment for paired characters.
- Pango's codebase is too complicated for me to understand, but it looks like it also has special treatment for paired characters.
  How the script of 。 is determined in <strong>示例</strong>。? - DeepWiki | Search (in Chinese)
> we probably don't need to rely on Harfbuzz's segment property guessing

I agree. WebKit developers ran into a bug about this in 2013: “Leaving direction to HarfBuzz to guess is really bad, but will do for now.”
However, considering that there are currently not many related issues, I think it's not worth refactoring at this time. It would be better to keep it as it is.
Letting HarfBuzz guess any segment properties is really bad; production code should never do that (in my opinion it is even a mistake that HarfBuzz exposes this API).
Script segmentation should be done on the full paragraph text at once, just like bidi. So inline markup should not interrupt it.
Another implementation that might be easier to follow is Raqm's.
This PR lets the inline shaping engine infer the writing script (e.g., hani/latn/…) from the surrounding context of the text being shaped, and pass it to HarfBuzz if appropriate.
This makes the OpenType locl (Localized Forms) feature effective in edge cases as well.
For instance, the second period mark in the example below uses the wrong glyph (corner-justified form) with v0.14.0, but will use the correct glyph (centered form) with this PR.
Before:
After: (the screenshot is trimmed)
Fixes #7396
Updating Noto CJK in dev assets
Note
No test is added because typst-dev-assets has no font supporting locl.
The latest Noto CJK supports locl, but the current version in typst-dev-assets is too old. See #7396 (comment) for details.
Discussions were moved to typst/typst-dev-assets#18.