/ToUnicode in PDF can be wrong if a glyph is mapped from multiple code points #4582

@YDX-2147483647

Description

Compile the following on typst.app, download the PDF, and copy the text from it.

#set text(lang: "zh", region: "CN", font: "Noto Serif CJK SC")
力量

You will get 力量 (U+F98A U+F97E in CJK Compatibility Ideographs), instead of 力量 (U+529B U+91CF in CJK Unified Ideographs).

CJK Compatibility Ideographs often look bizarre, because they fall back to the system font.

Screenshot of notepad.exe (image not included).

Continues #3416/#3435
Might relate to typst/webapp-issues#48

Cause

/ToUnicode in PDF maps glyphs to Unicode code points, making it possible to copy and search text.
However, sometimes a glyph in a font is shared by multiple code points.
For example, three code points (U+2F12 Kangxi radical, U+529B CJK unified, and U+F98A CJK compatibility) map to the glyph cid11384 in SourceHanSerifCN-Regular.otf (downloaded from Region Specific Subset OTFs Simplified Chinese (简体中文)), as the following Python script shows:

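from unicodedata import name

from fontTools.ttLib import TTFont

font = TTFont("SourceHanSerifCN-Regular.otf")
# Best available Unicode cmap: code point -> glyph name.
cmap = font["cmap"].getBestCmap()

the_character = "\u529b"  # CJK unified ideograph
the_glyph = cmap[ord(the_character)]

# List every code point that maps to the same glyph.
for c, g in cmap.items():
    if g == the_glyph:
        print(f"{chr(c)} (U+{c:X} {name(chr(c))}): {g}")

print("---")

# Kangxi radical, CJK unified, CJK compatibility. Escapes are used so that
# text normalization cannot silently merge the three characters.
for c in "\u2f12\u529b\uf98a":
    print(f"{c} (U+{ord(c):X} {name(c)}): {ord(c) in cmap = }")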
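The script above inspects a single glyph. To see how widespread the ambiguity is, the same idea can be applied to the whole cmap by grouping code points per glyph. This is only a sketch: `shared_glyphs` and `toy_cmap` are names made up for illustration, the glyph name `cid00034` is invented, and the toy dictionary stands in for the result of `font["cmap"].getBestCmap()`.

```python
from collections import defaultdict


def shared_glyphs(cmap):
    """Return {glyph: sorted code points} for glyphs mapped from 2+ code points."""
    by_glyph = defaultdict(list)
    for code_point, glyph in cmap.items():
        by_glyph[glyph].append(code_point)
    return {g: sorted(cps) for g, cps in by_glyph.items() if len(cps) > 1}


# Toy cmap standing in for font["cmap"].getBestCmap().
# "cid00034" is a made-up glyph name; cid11384 is the glyph from this issue.
toy_cmap = {
    0x41: "cid00034",
    0x2F12: "cid11384",
    0x529B: "cid11384",
    0xF98A: "cid11384",
}

for glyph, cps in shared_glyphs(toy_cmap).items():
    print(glyph, " ".join(f"U+{cp:04X}" for cp in cps))
# prints: cid11384 U+2F12 U+529B U+F98A
```

Every glyph this reports needs a deliberate choice when /ToUnicode is written, because inverting the cmap is not unique for it.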

Only one mapping per glyph can be written to /ToUnicode, and Typst currently chooses U+F98A 力 (CJK compatibility).

Adobe’s implementation

Adobe explicitly maps the glyph to the most commonly used code point, U+529B 力 (CJK unified).
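Adobe does this with a hand-curated mapping file. Without such a table, one possible heuristic is to prefer code points that are unchanged by NFKC normalization, since Kangxi radicals and CJK compatibility ideographs both normalize to their unified counterparts. The sketch below is an assumption about how a fix could look, not Adobe's method and not Typst's implementation; `is_nfkc_stable` and `pick_to_unicode` are hypothetical names.

```python
import unicodedata


def is_nfkc_stable(code_point):
    """True if NFKC normalization leaves the character unchanged."""
    ch = chr(code_point)
    return unicodedata.normalize("NFKC", ch) == ch


def pick_to_unicode(code_points):
    """Pick one code point for /ToUnicode from all that share a glyph.

    Prefers NFKC-stable code points; ties (and the case where none is
    stable) are broken by taking the lowest code point.
    """
    return min(code_points, key=lambda cp: (not is_nfkc_stable(cp), cp))


candidates = [0x2F12, 0x529B, 0xF98A]  # the three code points sharing cid11384
print(f"U+{pick_to_unicode(candidates):04X}")  # U+529B
```

The heuristic does not measure how often a character is used, so it will not always agree with a curated table, but it does select U+529B for the glyph in this report.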

To UVS, Or Not To UVS - CJK Type Blog | Adobe:

A ToUnicode mapping file does exactly what its name suggests: it maps CIDs to Unicode code points, or to code point sequences. Unlike CMap resources that map Unicode code points to CIDs, or 'cmap' tables that map code points to GIDs that may also be CIDs, a ToUnicode mapping file specifies the inverse mapping. Some omissions and ambiguities can arise, either because a glyph is represented as a sequence, or it is mapped from multiple code points.

An excellent example of the latter is Adobe-Japan1-7 CID+1200, which is mapped from U+2F00 ⼀ KANGXI RADICAL ONE and U+4E00 一 (a CJK Unified Ideograph). If CID+1200 is included in a PDF, one would naturally expect U+4E00 一 to be copied, not U+2F00 ⼀ as its use is more obscure. The Adobe-Japan1-UCS2 ToUnicode mapping file makes this mapping preference explicit (04b0 is the zero-padded hexadecimal form of decimal 1200):

<04b0> <4e00>
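The entry format in the quoted example can be reproduced with a few lines of Python. This is a minimal sketch that assumes two-byte CIDs and encodes the target as UTF-16BE, which is how ToUnicode CMaps represent Unicode text; `to_unicode_entry` is a hypothetical helper, not part of any library.

```python
def to_unicode_entry(cid, code_point):
    """Format one CID -> Unicode mapping as a '<cid> <utf16be>' hex pair."""
    cid_hex = f"{cid:04x}"  # zero-padded two-byte CID
    # Code points above U+FFFF become a surrogate pair in UTF-16BE.
    text_hex = chr(code_point).encode("utf-16-be").hex()
    return f"<{cid_hex}> <{text_hex}>"


print(to_unicode_entry(1200, 0x4E00))  # <04b0> <4e00>
```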

https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-GB1-6/cid2code.txt#L27-L29:

There may be cases of single CIDs being referenced in multiple encoding points within a single CMap file.
These cases are comma-delimited, within the same column.

Further example

#set page(height: auto, width: 30em)
#set text(lang: "zh", region: "CN", fallback: false)

#let fonts = (
  // "Linux Libertine", // no glyph
  // "Noto Serif", // no glyph
  "Noto Serif CJK SC",
  "Noto Sans CJK SC",
)

#for f in fonts {
  text(font: f)[
    #f

    - ⼒ U+2F12 Kangxi radical
    - 力 U+529B CJK unified
    - 力 U+F98A CJK compatibility

    地球发动机是人类建造的力量最大的机器,比如我们所在的华北794号,全功率运行时能向大地产生150亿吨的推力。

    #pagebreak()
  ]
}

Screenshot of the result in a nerd font (image not included).

Reproduction URL

No response

Operating system

Web app, Windows

Typst version

  • I am using the latest version of Typst