`/ToUnicode` in PDF can be wrong if a glyph is mapped from multiple code points

### Description

Compile the following on typst.app, download PDF, and copy the text.

```typst
#set text(lang: "zh", region: "CN", font: "Noto Serif CJK SC")
力量
```

You will get 力量 (U+F98A U+F97E in CJK *Compatibility* Ideographs), instead of [力量](https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%E5%8A%9B%E9%87%8F) (U+529B U+91CF in CJK *Unified* Ideographs).

CJK Compatibility Ideographs often look out of place when pasted, because the application falls back to a system font for them.

Screenshot of `notepad.exe`:
![notepad screenshot](https://github.com/user-attachments/assets/b6adf4b2-f701-43d1-931d-d82fda6c2acb)

Continues #3416/#3435
Might relate to https://github.com/typst/webapp-issues/issues/48

## Cause

`/ToUnicode` in PDF maps glyphs to Unicode code points, making it possible to copy and search text.
However, sometimes **a glyph** in a font is shared by **multiple code points**.
For example, three code points map to the glyph `cid11384` in `SourceHanSerifCN-Regular.otf` (downloaded from [Region Specific Subset OTFs Simplified Chinese (简体中文)](https://github.com/adobe-fonts/source-han-serif/releases/download/2.002R/14_SourceHanSerifCN.zip)):

- ⼒ ([U+2F12 KANGXI RADICAL POWER](https://util.unicode.org/UnicodeJsps/character.jsp?a=2F12)): cid11384
- 力 ([U+529B CJK UNIFIED IDEOGRAPH-529B](https://util.unicode.org/UnicodeJsps/character.jsp?a=529B)): cid11384
- 力 ([U+F98A CJK COMPATIBILITY IDEOGRAPH-F98A](https://util.unicode.org/UnicodeJsps/character.jsp?a=F98A)): cid11384

<details>
<summary>(generated by a Python script)</summary>

```python
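from unicodedata import name

from fontTools.ttLib import TTFont

font = TTFont("SourceHanSerifCN-Regular.otf")
# Best available Unicode cmap: {code point: glyph name}.
cmap = font["cmap"].getBestCmap()

the_character = "\u529b"  # U+529B (CJK unified)
the_glyph = cmap[ord(the_character)]

# List every code point that maps to the same glyph.
for c, g in cmap.items():
    if g == the_glyph:
        print(f"{chr(c)} (U+{c:X} {name(chr(c))}): {g}")

print("---")

# Check that the font's cmap covers all three code points
# (written as escapes so that editors cannot normalize them).
for c in "\u2f12\u529b\uf98a":
    print(f"{c} (U+{ord(c):X} {name(c)}): {ord(c) in cmap = }")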
```

</details>

Only one code point can be written to `/ToUnicode`, and Typst chooses U+F98A 力 (CJK compatibility).
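
One possible way to pick among the candidates is to prefer a code point that NFKC normalization leaves unchanged: U+2F12 and U+F98A both normalize to U+529B, while U+529B normalizes to itself. The sketch below is only a heuristic under that assumption, not what Typst or Adobe actually does. `preferred_code_points` is a hypothetical helper that takes a cmap-like dict of the shape returned by `getBestCmap()`.

```python
from collections import defaultdict
from unicodedata import normalize


def preferred_code_points(cmap):
    """Pick one code point per glyph from a {code point: glyph name} dict.

    Heuristic (an assumption, not Typst's behaviour): prefer code points
    that are stable under NFKC, then the lowest code point as a tie-breaker.
    """
    by_glyph = defaultdict(list)
    for code_point, glyph in cmap.items():
        by_glyph[glyph].append(code_point)

    def rank(code_point):
        ch = chr(code_point)
        # False sorts before True, so NFKC-stable code points win.
        return (normalize("NFKC", ch) != ch, code_point)

    return {glyph: min(cps, key=rank) for glyph, cps in by_glyph.items()}


# Hand-written stand-in for the relevant part of the font's cmap.
sample = {0x2F12: "cid11384", 0x529B: "cid11384", 0xF98A: "cid11384"}
for glyph, cp in preferred_code_points(sample).items():
    print(f"{glyph} -> U+{cp:04X}")
```

With the three code points listed above, this selects U+529B for `cid11384`.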

## Adobe’s implementation

Adobe explicitly maps the glyph to the most commonly used code point, U+529B 力 (CJK unified).

[To UVS, Or Not To UVS - CJK Type Blog | Adobe](https://ccjktype.fonts.adobe.com/2019/05/to-uvs-or-not-to-uvs.html):

> A ToUnicode mapping file does exactly what its name suggests: it maps CIDs to Unicode code points, or to code point sequences. Unlike CMap resources that map Unicode code points to CIDs, or 'cmap' tables that map code points to GIDs that may also be CIDs, a ToUnicode mapping file specifies the inverse mapping. Some omissions and ambiguities can arise, either because a glyph is represented as a sequence, or it is mapped from multiple code points.

> An excellent example of the latter is Adobe-Japan1-7 CID+1200, which is mapped from U+2F00 ⼀ KANGXI RADICAL ONE and U+4E00 一 (a CJK Unified Ideograph). If CID+1200 is included in a PDF, one would naturally expect U+4E00 一 to be copied, not U+2F00 ⼀ as its use is more obscure. The Adobe-Japan1-UCS2 ToUnicode mapping file **makes this mapping preference explicit** (04b0 is the zero-padded hexadecimal form of decimal 1200):
>
> ```
> <04b0> <4e00>
> ```

https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-GB1-6/cid2code.txt#L27-L29:

> There may be cases of single CIDs being referenced in multiple encoding points within a single CMap file.
> These cases are comma-delimited, within the same column.

CMap

- Chinese (https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-GB1-6/cid2code.txt#L2629C60-L2629C69):

  ```
  CID	…	UniGB-UCS2
  2543	…	2f12,529b
  ```

- Korean (https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-KR-9/cid2code.txt#L14627C28-L14627C43):

  ```
  CID	…	UniAKR-UTF16
  14593	…	2f12,529b,f98a	
  ```
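
To make the comma-delimited convention concrete, here is a small sketch that splits one such cell into code points. `parse_code_cell` is a hypothetical helper; it only handles plain hexadecimal cells like the two shown above, and any other notation used in `cid2code.txt` is out of scope.

```python
def parse_code_cell(cell):
    """Split a comma-delimited cell such as "2f12,529b" into code points."""
    # strip() drops surrounding whitespace such as a trailing tab.
    return [int(part, 16) for part in cell.strip().split(",")]


for cell in ("2f12,529b", "2f12,529b,f98a"):
    print([f"U+{cp:04X}" for cp in parse_code_cell(cell)])
```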

`/ToUnicode`

- Chinese (https://github.com/adobe-type-tools/mapping-resources-pdf/blob/2dd5e53fb74a01718b9dfd448a0d1cce6fff2aa5/pdf2unicode/Adobe-GB1-UCS2#L1862):

  ```
  <09ef> <529b>
  # <09ef> is 2543 in hex.
  ```

- Korean (https://github.com/adobe-type-tools/mapping-resources-pdf/blob/2dd5e53fb74a01718b9dfd448a0d1cce6fff2aa5/pdf2unicode/Adobe-KR-UCS2#L3525):

  ```
  <3901> <529b>
  # <3901> is 14593 in hex.
  ```
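
The hexadecimal forms in these entries can be checked with a short sketch. `to_unicode_entry` is a hypothetical helper that formats one line in the style shown above; it assumes a two-byte CID and writes the destination as UTF-16BE, which is how `/ToUnicode` values are encoded.

```python
def to_unicode_entry(cid, code_point):
    """Format one CID-to-Unicode line, e.g. "<09ef> <529b>"."""
    src = f"{cid:04x}"  # zero-padded hexadecimal CID
    dst = chr(code_point).encode("utf-16-be").hex()  # UTF-16BE code units
    return f"<{src}> <{dst}>"


print(to_unicode_entry(2543, 0x529B))   # Adobe-GB1 CID from the table above
print(to_unicode_entry(14593, 0x529B))  # Adobe-KR CID from the table above
```

The two printed lines match the entries quoted from `Adobe-GB1-UCS2` and `Adobe-KR-UCS2`.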

## Further example

```typst
#set page(height: auto, width: 30em)
#set text(lang: "zh", region: "CN", fallback: false)

#let fonts = (
  // "Linux Libertine", // no glyph
  // "Noto Serif", // no glyph
  "Noto Serif CJK SC",
  "Noto Sans CJK SC",
)

#for f in fonts {
  text(font: f)[
    #f

    - ⼒ U+2F12 Kangxi radical
    - 力 U+529B CJK unified
    - 力 U+F98A CJK compatibility

    地球发动机是人类建造的力量最大的机器，比如我们所在的华北794号，全功率运行时能向大地产生150亿吨的推力。

    #pagebreak()
  ]
}
```

Screenshot of the result in a nerd font:
![](https://github.com/user-attachments/assets/90fc79d1-e8d7-4d5b-a409-c55d024ae276)

> [!NOTE]
> This issue cannot be reproduced with `testit cjk --pdf` and https://github.com/typst/typst/blob/42754477886f6a12afbabfd2a64d8c787a57bc03/tests/suite/layout/inline/cjk.typ, because https://github.com/typst/typst-dev-assets/blob/48a924d9de82b631bc775124a69384c8d860db04/files/fonts/NotoSerifCJKsc-Regular.otf contains neither U+2F12 ⼒ (Kangxi radical) nor U+F98A 力 (CJK compatibility).

### Reproduction URL

_No response_

### Operating system

Web app, Windows

### Typst version

- [X] I am using the latest version of Typst
