Use texts of the first occurrences for `/ToUnicode` CMap by YDX-2147483647 · Pull Request #4585 · typst/typst

YDX-2147483647 · 2024-07-19T16:29:41Z

Resolves #4582
…just by deleting `improve_glyph_sets`!

There are two sources of information for `/ToUnicode`: the `glyph_set` recorded during `write_text`, and the `cmap` tables of the font.
`improve_glyph_sets` leverages the font. It was refactored into a function in #4154, but the underlying code predates even ed6550f (2 years ago).

`improve_glyph_sets` was necessary before ad34763, when a `glyph_set` was a list of glyphs and we had to search the font (again) for their texts. (Each glyph represents a text, which is either a single Unicode code point or a sequence of code points, e.g. a ligature.)
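To make the parenthetical concrete: destination strings in a `/ToUnicode` CMap are UTF-16BE, so a single entry can carry one code point or several. The sketch below only illustrates that encoding. The helper name `to_unicode_hex` is hypothetical and is not taken from the Typst code base.

```rust
/// Hypothetical helper: format a text as the UTF-16BE hex string that would
/// appear as the destination of a `bfchar` entry in a /ToUnicode CMap.
fn to_unicode_hex(text: &str) -> String {
    // Each UTF-16 code unit becomes four uppercase hex digits, most
    // significant byte first (big-endian).
    text.encode_utf16().map(|unit| format!("{unit:04X}")).collect()
}

fn main() {
    // A single code point (CJK unified U+529B).
    println!("<{}>", to_unicode_hex("\u{529B}"));
    // A sequence of code points, as for an "fi" ligature glyph.
    println!("<{}>", to_unicode_hex("fi"));
    // The single compatibility code point for the same ligature (U+FB01).
    println!("<{}>", to_unicode_hex("\u{FB01}"));
}
```

Because the destination is just a string, a ligature glyph needs no special casing: its text is written as a longer hex string.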

In ad34763, the `glyph_set` was refactored into a map from glyphs to texts.
Now we have enough information for the `/ToUnicode` CMap, so there is no need to search the font.
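A minimal sketch of the idea, assuming the glyph set is a map from glyph ID to text that keeps whatever was recorded first. The `GlyphSet` alias, the `record_glyph` function, and the glyph IDs are made up for illustration; the real types in Typst may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a per-font glyph set: glyph ID -> text.
type GlyphSet = BTreeMap<u16, String>;

/// Record the text a glyph was written for. Only the first occurrence is
/// kept; later occurrences of the same glyph leave the entry untouched.
fn record_glyph(glyph_set: &mut GlyphSet, glyph_id: u16, text: &str) {
    glyph_set.entry(glyph_id).or_insert_with(|| text.to_string());
}

fn main() {
    let mut glyph_set = GlyphSet::new();
    // Glyph IDs 7 and 42 are invented for this example.
    record_glyph(&mut glyph_set, 7, "\u{529B}"); // CJK unified, seen first
    record_glyph(&mut glyph_set, 7, "\u{F98A}"); // CJK compatibility, ignored
    record_glyph(&mut glyph_set, 42, "fi"); // ligature glyph -> sequence

    // The map already holds everything a /ToUnicode CMap needs.
    for (glyph_id, text) in &glyph_set {
        println!("glyph {glyph_id} -> {text:?}");
    }
}
```

With a map like this, writing the CMap is a plain iteration over the entries, with no lookup in the font.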

Changes

If the glyph…

- …represents a single character…
  - …and is mapped from only one code point:
    No change.
  - …and is shared by multiple code points (e.g. CJK unified/compatibility):
    `/ToUnicode` changes from the largest code point to the first occurrence, which fixes #4582 (“/ToUnicode in PDF can be wrong if a glyph is mapped from multiple code points”).
- …represents a sequence of characters (e.g. a ligature)…
  - …and they are also encoded as a single code point for compatibility (e.g. “fi”/ﬁ):
    `/ToUnicode` changes from the single compatibility code point (ﬁ) to the sequence (fi).
    The behaviour in PDF viewers usually does not change.
  - …and is not encoded in Unicode (e.g. “Th” in Linux Libertine):
    No change.
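The cases above can be sketched as a comparison of two lookups. This is a simplified model, not the removed code: `text_from_cmap` assumes the old result was the largest code point mapping to the glyph (as described above), `text_from_first_use` models the new behaviour, and the glyph IDs and the toy `cmap` are invented.

```rust
use std::collections::BTreeMap;

/// Simplified model of the old behaviour: among all code points that the
/// font's cmap maps to `glyph`, the largest one wins.
fn text_from_cmap(cmap: &BTreeMap<char, u16>, glyph: u16) -> Option<String> {
    // A BTreeMap iterates in ascending key order, so `last` is the largest.
    cmap.iter()
        .filter(|(_, g)| **g == glyph)
        .map(|(c, _)| c.to_string())
        .last()
}

/// Model of the new behaviour: the text recorded the first time the glyph
/// was written, in document order.
fn text_from_first_use(uses: &[(u16, &str)], glyph: u16) -> Option<String> {
    uses.iter()
        .find(|(g, _)| *g == glyph)
        .map(|(_, text)| text.to_string())
}

fn main() {
    // Toy cmap: three code points share glyph 7; U+FB01 maps to glyph 42.
    let mut cmap = BTreeMap::new();
    cmap.insert('\u{2F12}', 7u16); // Kangxi radical
    cmap.insert('\u{529B}', 7); // CJK unified
    cmap.insert('\u{F98A}', 7); // CJK compatibility
    cmap.insert('\u{FB01}', 42); // "fi" compatibility ligature

    // Glyphs in the order they were written, with their source text.
    let uses = [(7, "\u{529B}"), (42, "fi"), (7, "\u{F98A}")];

    for glyph in [7u16, 42] {
        println!(
            "glyph {glyph}: old {:?}, new {:?}",
            text_from_cmap(&cmap, glyph),
            text_from_first_use(&uses, glyph)
        );
    }
}
```

In this toy setup, glyph 7 moves from U+F98A to U+529B and glyph 42 moves from U+FB01 to the sequence "fi", matching the two "changes" cases in the list.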

Testcase

(AFAIK, `/ToUnicode` is not included in testit.)

```typ
#set page(width: 30em, height: auto)

ffi Th ﬃ
#set text(ligatures: false)
ffi Th ﬃ

$integral_x y dif z$

#set text(lang: "zh", region: "CN", font: "Source Han Serif", fallback: false)

#let example-Han = [
  地球发动机是人类建造的力量最大的机器，比如我们所在的华北794号，全功率运行时能向大地产生150亿吨的推力。

  - ⼒ U+2F12 (Kangxi radical)
  - 力 U+529B (CJK unified)
  - 力 U+F98A (CJK compatibility)
]

#example-Han

#set text(fill: gradient.linear(..color.map.rainbow))
#example-Han

#set text(stroke: red)
#example-Han
```

TODO

Remove `unicode_properties`. (It is still an indirect dependency through rustybuzz.)

Future work

Write `/ActualText`. (Tracked by #4225, “PDF text extraction can fail in complex shaping scenarios”, and #526, “Different characters that use the same glyph result in the same character when copied from PDFs”.)
~~Use the most common instead of the first occurrence.~~ I don't think we should use CJK unified even if the author always writes CJK compatibility.


laurmaedje · 2024-07-20T13:05:26Z

I agree that this makes sense. The reason I kept the cmap search in ad34763 was probably just cautiousness.

laurmaedje · 2024-07-20T14:13:09Z

Thank you!

YDX-2147483647 force-pushed the cmap branch from 4b55e8a to 2e1e355 on July 19, 2024 16:44

YDX-2147483647 changed the title from ~~Use texts of first occurrences for /ToUnicode CMap~~ to Use texts of the first occurrences for /ToUnicode CMap on Jul 19, 2024

YDX-2147483647 force-pushed the cmap branch 2 times, most recently from 360c1cd to a7123b5, on July 19, 2024 17:38

YDX-2147483647 force-pushed the cmap branch from a7123b5 to ce40f9f on July 19, 2024 17:43

YDX-2147483647 marked this pull request as ready for review on July 19, 2024 17:51

Merge branch 'main' into cmap

ab72d4e

laurmaedje added this pull request to the merge queue Jul 20, 2024

Merged via the queue into typst:main with commit 9b001e2 Jul 20, 2024

YDX-2147483647 deleted the cmap branch July 20, 2024 16:49

LaurenzV mentioned this pull request Jul 25, 2024

Don't "improve" glyph sets typst/svg2pdf#79

Merged

laurmaedje mentioned this pull request Oct 2, 2024

Fix default ignorables #5099

Merged

laurmaedje merged 2 commits into typst:main from YDX-2147483647:cmap
