Use texts of the first occurrences for /ToUnicode CMap #4585

Merged: laurmaedje merged 2 commits into typst:main from YDX-2147483647:cmap on Jul 20, 2024
YDX-2147483647 (Contributor) commented Jul 19, 2024

Resolves #4582
…just by deleting improve_glyph_sets!

There are two sources of information for /ToUnicode: the glyph_set recorded during write_text, and the cmap tables of the font.
improve_glyph_sets uses the latter. It was extracted into a function in #4154, but the underlying code predates even ed6550f (2 years ago).

improve_glyph_sets was necessary before ad34763, when a glyph_set was a list of glyphs and we had to search the font (again) for their texts. (Each glyph represents a text, which is either a single Unicode code point or a sequence of code points, e.g. for a ligature.)

In ad34763, glyph_set was refactored into a map from glyphs to texts.
Now we have enough information for the /ToUnicode CMap, so there is no need to search the font.
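As a minimal sketch of the idea (not the actual Typst code), a glyph-to-text map that keeps the text of the first occurrence could look like the following. The names `GlyphSet`, `record_glyph` and `collect_glyph_set` are hypothetical, and glyph IDs are assumed to be plain `u16` values.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the glyph set: glyph ID -> text it first rendered.
type GlyphSet = BTreeMap<u16, String>;

/// Records `text` for `glyph_id` unless an earlier occurrence already did.
fn record_glyph(glyph_set: &mut GlyphSet, glyph_id: u16, text: &str) {
    // `or_insert_with` only inserts when the key is absent,
    // so later occurrences never overwrite the first one.
    glyph_set.entry(glyph_id).or_insert_with(|| text.to_string());
}

/// Builds a glyph set from (glyph ID, text) pairs given in document order.
fn collect_glyph_set(occurrences: &[(u16, &str)]) -> GlyphSet {
    let mut glyph_set = GlyphSet::new();
    for &(glyph_id, text) in occurrences {
        record_glyph(&mut glyph_set, glyph_id, text);
    }
    glyph_set
}
```

For example, if one glyph is first used for U+529B and later for U+F98A, `collect_glyph_set(&[(7, "\u{529B}"), (7, "\u{F98A}")])` keeps only the U+529B text for glyph 7, with no lookup in the font.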

Changes

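If the glyph…

  • …represents a single character…
    • …and is mapped from only one code point:
      No change.
    • …and is shared by multiple code points (e.g. CJK unified/compatibility):
      /ToUnicode changes from the largest code point to the first occurrence, which fixes #4582.
  • …represents a sequence of characters (e.g. ligature)…
    • …and they are also encoded as a single code point for compatibility (e.g. “fi”/ﬁ, U+FB01):
      /ToUnicode changes from the single compatibility code point (ﬁ, U+FB01) to the sequence (f, i).
      The behaviour in PDF viewers usually does not change.
    • …and is not encoded in Unicode (e.g. “Th” in Linux Libertine):
      No change.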
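To make the ligature case concrete, here is a rough sketch of how a glyph's text could be rendered as a `bfchar` entry of a /ToUnicode CMap, where the destination is a UTF-16BE hex string. The helpers `to_unicode_hex` and `bfchar_line` are hypothetical and not taken from Typst; 2-byte glyph codes are an assumption.

```rust
/// Formats `text` as the UTF-16BE hex string used as the destination
/// of a bfchar entry (surrogate pairs are emitted for non-BMP characters).
fn to_unicode_hex(text: &str) -> String {
    text.encode_utf16()
        .map(|unit| format!("{:04X}", unit))
        .collect()
}

/// Renders one bfchar entry, assuming 2-byte glyph codes (an assumption here).
fn bfchar_line(glyph_id: u16, text: &str) -> String {
    format!("<{:04X}> <{}>", glyph_id, to_unicode_hex(text))
}
```

With these helpers, a ligature glyph recorded with the sequence "fi" yields `<0012> <00660069>`, whereas the compatibility code point U+FB01 would yield `<0012> <FB01>`. Viewers typically extract equivalent text from either form, which matches the note that behaviour usually does not change.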

Testcase

(AFAIK, /ToUnicode is not included in testit.)

#set page(width: 30em, height: auto)

ffi Th ffi
#set text(ligatures: false)
ffi Th ffi

$integral_x y dif z$

#set text(lang: "zh", region: "CN", font: "Source Han Serif", fallback: false)

#let example-Han = [
  地球发动机是人类建造的力量最大的机器,比如我们所在的华北794号,全功率运行时能向大地产生150亿吨的推力。

  - ⼒ U+2F12 (Kangxi radical)
  - 力 U+529B (CJK unified)
  - 力 U+F98A (CJK compatibility)
]

#example-Han

#set text(fill: gradient.linear(..color.map.rainbow))
#example-Han

#set text(stroke: red)
#example-Han

TODO

  • Remove unicode_properties. (still an indirect dependency through rustybuzz)

Future work

YDX-2147483647 changed the title from "Use texts of first occurrences for /ToUnicode CMap" to "Use texts of the first occurrences for /ToUnicode CMap" on Jul 19, 2024
YDX-2147483647 force-pushed the cmap branch 2 times, most recently from 360c1cd to a7123b5, on July 19, 2024 17:38
Resolves typst#4582
…just by deleting `improve_glyph_sets`!

There are two sources of information for `/ToUnicode`: the `glyph_set` recorded while `write_text`, and `cmap` tables of the font.
`improve_glyph_sets` leverages the font. It was refactored into a function in typst#4154, but the real code even predates ed6550f (2 years ago).

`improve_glyph_sets` was necessary before ad34763, when a `glyph_set` was a list of glyphs, and we had to search the font (again) for their texts. (Each glyph represents a text, which is a Unicode code point or a sequence of code points (e.g. ligature).)

In ad34763, the `glyph_set` is refactored to a map from glyphs to texts.
Now we have enough information for `/ToUnicode` CMap—no need to search the font.

If the glyph…
- …represents a single character…
    - …and is mapped from only one code point:
        No change.
    - …and is shared by multiple code points (e.g. CJK unified/compatibility):
        `/ToUnicode` changes from the largest code points to the first occurrence, and fixes typst#4582.
- …represents a sequence of characters (e.g. ligature)…
    - …and they are also encoded as a single code point for compatibility (e.g. “fi”/ﬁ, U+FB01):
        `/ToUnicode` changes from a single compatibility code point (ﬁ, U+FB01) to the sequence (f, i).
        The behaviour in PDF viewers usually does not change.
    - …and is not encoded in Unicode (e.g. “Th” in Linux Libertine):
        No change.
@YDX-2147483647 YDX-2147483647 marked this pull request as ready for review July 19, 2024 17:51
laurmaedje (Member) commented

I agree that this makes sense. The reason I kept the cmap search in ad34763 was probably just caution.

@laurmaedje laurmaedje added this pull request to the merge queue Jul 20, 2024
laurmaedje (Member) commented

Thank you!

Merged via the queue into typst:main with commit 9b001e2 Jul 20, 2024
@YDX-2147483647 YDX-2147483647 deleted the cmap branch July 20, 2024 16:49
Development

Successfully merging this pull request may close these issues.

/ToUnicode in PDF can be wrong if a glyph is mapped from multiple code points
