`ToUnicode` in PDF for a Type 0 CID Font might be Wrong

[Source Han Serif (思源宋体)](https://github.com/adobe-fonts/source-han-serif/tree/release#downloading-source-han-serif) and other CJK fonts are embedded as Type 0 CID Fonts in PDF. Typst 0.10.0 (70ca0d25) may generate an incorrect `ToUnicode` CMap for them.

> [!NOTE]
>
> Typst.app with default fonts is not affected.
>
> It looks like typst.app uses `NotoSansCJKjp-Regular-Identity-H` for CJK characters. [Noto Sans CJK](https://github.com/notofonts/noto-cjk/) is identical to [Source Han Sans (思源黑体)](https://github.com/adobe-fonts/source-han-sans). ~~Nonetheless, `typst compile` with either font does not work on my computer…~~ Language-specific OTFs work, but region-specific Subset OTFs do not.
>
> [HaranoAjiMincho](https://github.com/trueroad/HaranoAjiFonts), implied by Source Serif Pro, also works on typst.app.

> [!TIP]
>
> What is `ToUnicode`?
>
> In PDF, it is often the case that text is not encoded in Unicode. However, modern
> applications usually want them represented in Unicode to make it usable as text
> information. The `ToUnicode` CMap (**C**haracter **Map**) is a bridge between PDF text string encodings and Unicode encodings,
> and makes it possible to extract text in PDF files as
> Unicode encoded strings. It is important to make resulting PDF search‐able and
> copy‐and‐past‐able.
>
> —§1.4.2 of [The Dvipdfmx User’s Manual](https://mirrors.ctan.org/systems/doc/dvipdfmx/dvipdfmx.pdf)

## To Reproduce on Typst.app

1. Download `14_SourceHanSerifCN.zip` from [Region Specific **Subset** OTFs Simplified Chinese (简体中文)](https://github.com/adobe-fonts/source-han-serif/releases/download/2.002R/14_SourceHanSerifCN.zip).

2. Create a new project in typst.app and upload `SubsetOTF/CN/SourceHanSerifCN-Regular.otf` in that ZIP.

3. Compile the following.

   ```typst
   #set page(height: auto)
   
   #let fonts = (
     "Linux Libertine",
     "Source Han Serif",
   )
   
   #for f in fonts {
     text(font: f, size: 3em)[
       #f
   
       #h(1fr)ABC 孔乙己
   
     ]
   }
   ```
  
   [minimal-web-with-otf.pdf](https://github.com/typst/typst/files/14283404/minimal-web-with-otf.pdf)

4. Download the PDF, open it, copy all text and paste.

   - Adobe Acrobat / SumatraPDF:

     ```
     Linux Libertine
     ABC 孔乙己
     Source Han Serif
     ABC ���
     ```

   - Firefox (pdf.js) / MS Edge:

     ```
     Linux Libertine
     ABC 孔乙己
     Source Han Serif
     ABC 㯰□䄬
     ```

   Expected: `ABC 孔乙己` or `ABC孔乙己` (no space).

## To Debug

1. Upload the PDF to [PDF Object Browser](https://brendandahl.github.io/pdf.js.utils/browser/).

2. Go to Trailer → Root → Pages → Resources → Font:

   - …

   - F1

     - Base Font: `/UVRTQX+NotoSansCJKjp-Regular-Identity-H`

     - `ToUnicode`:

       ```
       …
       <2581> <4E59>  # <2581> ↦ U+4E59 (乙)
       <3BFE> <5B54>  # <3BFE> ↦ U+5B54 (孔)
       <4115> <5DF1>  # <4115> ↦ U+5DF1 (己)
       …
       ```

   - F2

     - Base Font: `/TLJDGM+SourceHanSerifCN-Regular-Identity-H`

     - `ToUnicode`:

       ```
       …
       <22A1> <4E59>  # <22A1> ↦ U+4E59 (乙)
       <2F8B> <5B54>  # <2F8B> ↦ U+5B54 (孔)
       <3227> <5DF1>  # <3227> ↦ U+5DF1 (己)
       …
       ```

3. Go to Trailer → Root → Pages → Kids → 0 → Contents, view contents:

   ```
   …
   /F1 33 Tf  # Use font F1 (Noto Sans CJK)
       …
       BT  # Begin text
       …
       [<3BFE25814115>] TJ # Show text
       ET  # End text
   /F2 33 Tf  # Use font F2 (Source Han Serif CN)
       …
       BT  # Begin text
       …
       [<3BF025A1412C>] TJ  # Show text
       ET  # End text
   ```

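   In the first case, F1’s `ToUnicode` has an entry for each 2-byte code in `<3BFE25814115>` (`<3BFE>`, `<2581>`, `<4115>`), so all readers convert the string to `孔乙己` as expected.

   In the second case, F2’s `ToUnicode` has no entry for any of the codes in `<3BF025A1412C>` (`<3BF0>`, `<25A1>`, `<412C>`). Adobe Acrobat and SumatraPDF give up and output replacement characters (`���`), while Firefox and MS Edge appear to interpret the codes directly as Unicode code points: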

   ```python
   >>> '\u3BF0\u25A1\u412C'
   '㯰□䄬'
   ```

4. Manually edit the wrong `ToUnicode` in the PDF so that its keys match the codes in the content stream; the copied text then becomes `孔乙己`.
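
The lookup described in the debugging steps above can be sketched in Python. This is only an illustration built from the excerpts shown there: the function names are made up for this example, only `<code> <unicode>` lines are parsed, and the "strict" and "naive" fallbacks are assumptions about how the readers behave, not their actual implementations.

```python
import re

# Excerpts of the two /ToUnicode CMaps, as shown in the PDF Object Browser.
F1_TO_UNICODE = """
<2581> <4E59>
<3BFE> <5B54>
<4115> <5DF1>
"""
F2_TO_UNICODE = """
<22A1> <4E59>
<2F8B> <5B54>
<3227> <5DF1>
"""

# One mapping line: a 2-byte source code and a UTF-16BE target string.
MAPPING_LINE = re.compile(r"<([0-9A-Fa-f]{4})>\s*<([0-9A-Fa-f]+)>")


def parse_to_unicode(cmap_text):
    """Return a dict from 2-byte code to the Unicode string it maps to."""
    table = {}
    for code, target in MAPPING_LINE.findall(cmap_text):
        table[int(code, 16)] = bytes.fromhex(target).decode("utf-16-be")
    return table


def split_codes(hex_string):
    """Split the hex operand of a TJ operator into 2-byte codes."""
    return [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]


def strict_decode(hex_string, table):
    """Unknown codes become U+FFFD (assumed Acrobat/SumatraPDF-like result)."""
    return "".join(table.get(c, "\ufffd") for c in split_codes(hex_string))


def naive_decode(hex_string, table):
    """Unknown codes are read as code points (assumed pdf.js/Edge-like result)."""
    return "".join(table.get(c, chr(c)) for c in split_codes(hex_string))


f1 = parse_to_unicode(F1_TO_UNICODE)
f2 = parse_to_unicode(F2_TO_UNICODE)
print(strict_decode("3BFE25814115", f1))  # F1: every code has an entry
print(strict_decode("3BF025A1412C", f2))  # F2: no code has an entry
print(naive_decode("3BF025A1412C", f2))   # F2: codes reinterpreted as Unicode
```

With F1, every code in the string is a key of the table. With F2, none of the three codes is, which matches the two kinds of broken output seen when copying.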

## To Reproduce in the Repo

1. `testit cjk --pdf`

2. Open `tests/pdf/layout/cjk-punctuation-adjustment.pdf`, copy all text and paste.
   
   ![main](https://github.com/typst/typst/assets/73375426/30ea589b-c94c-451a-87ae-13bf9b12f007)

   Expected:

   ![to-unicode](https://github.com/typst/typst/assets/73375426/2091c4b9-5a28-465d-8ae0-8f2cc991bd75)

## Relevant Links

- #479: `ToUnicode` of ligatures

  https://github.com/typst/pdf-writer/issues/17#issuecomment-1717823906

  Ligatures like `ffi` for Source Han Serif are also broken.

- [Fix embedding of CID-keyed fonts into PDF · typst/typst@`dad7c88` · GitHub](https://github.com/typst/typst/commit/dad7c88576224f636f9292fd60f0f65dd4b3a043#diff-ce330df346cda0de39a4acd4b23dfe2f6c702b0a8947bc70d3adf0913d8f72b1R243), from [History for `crates/typst-pdf/src/font.rs`](https://github.com/typst/typst/commits/main/crates/typst-pdf/src/font.rs)

  [Make ligatures copyable and searchable · typst/typst@`ad34763`](https://github.com/typst/typst/commit/ad347632ab95e29eb5180b27142f5c264dfc611a)

- <details><summary>Code base</summary>

  https://github.com/typst/typst/blob/79e37ccbac080212dc42e996d760664c75d1a56f/crates/typst-pdf/src/font.rs#L200-L209

  https://github.com/typst/typst/blob/79e37ccbac080212dc42e996d760664c75d1a56f/crates/typst-pdf/src/font.rs#L147-L150

  https://github.com/typst/typst/blob/79e37ccbac080212dc42e996d760664c75d1a56f/crates/typst-pdf/src/lib.rs#L79-L85

  https://github.com/typst/typst/blob/79e37ccbac080212dc42e996d760664c75d1a56f/crates/typst-pdf/src/font.rs#L239-L263

  </details>

- [*Basics*](https://pdfa.org/wp-content/uploads/2023/08/PDF-Basics-CheatSheet.pdf) and [*Common Objects*](https://pdfa.org/wp-content/uploads/2023/08/PDF-CommonObjects-CheatSheet.pdf) in [PDF Cheat Sheets – PDF Association](https://pdfa.org/resource/pdf-cheat-sheets/)

- §9.7 Composite Fonts in [`PDF32000_2008.pdf`](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf), downloaded from [ISO 32000 (PDF) – PDF Association](https://pdfa.org/resource/iso-32000-pdf/)

- [The Type — 文字 / 设计 / 文化 » 字谈字畅 183：康熙怎么又来了](https://www.thetype.com/typechat/ep-183/): Kangxi radicals (康熙部首), e.g. `⼰` (U+2F30, Kangxi Radical Oneself) ≠ `己` (U+5DF1, CJK Unified Ideograph)

- [数字世界中的纸张——理解 PDF - neverland](https://type.cyhsu.xyz/2018/09/understanding-pdf-the-digitalized-paper/)

- [PDF 复制中的文字重复问题 - neverland](https://type.cyhsu.xyz/2018/12/why-do-pdf-copy-results-in-preview-app-have-redundant-characters/)

- [Analyzing documents with the Preflight tool (Adobe Acrobat Pro)](https://helpx.adobe.com/acrobat/using/analyzing-documents-preflight-tool-acrobat.html)

- [康熙来了 - neverland](https://type.cyhsu.xyz/2022/07/kangxi-radicals-weibo/)

- [全球文种的字体与布局](https://7sdream.github.io/fonts-and-layout-zhCN/chapters/04-opentype/exploring/cmap.html)

- [PDF转Word，为啥那么费劲？（PDF·文字篇）- 哔哩哔哩](https://www.bilibili.com/video/BV1Vi4y1C71M/)

- [Advanced typography in PDF - PP\_Advanced\_typography\_in\_PDF-compressed.pdf | iText PDF](https://itextpdf.com/sites/default/files/2018-12/PP_Advanced_typography_in_PDF-compressed.pdf)

- [To UVS, Or Not To UVS - CJK Type Blog | Adobe](https://ccjktype.fonts.adobe.com/2019/05/to-uvs-or-not-to-uvs.html)
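
The Kangxi radical pitfall mentioned in the list above (`⼰` U+2F30 versus `己` U+5DF1) can be checked with Python's standard `unicodedata` module. The helper `fold_kangxi_radicals` is a hypothetical name for this sketch. It relies on the Kangxi Radicals block (U+2F00 to U+2FD5) having compatibility decompositions to unified ideographs, and it is not something Typst does.

```python
import unicodedata

RADICAL = "\u2f30"    # ⼰ KANGXI RADICAL ONESELF
IDEOGRAPH = "\u5df1"  # 己 CJK UNIFIED IDEOGRAPH-5DF1


def fold_kangxi_radicals(text):
    """Replace Kangxi radicals with the unified ideographs they decompose to."""
    out = []
    for ch in text:
        if "\u2f00" <= ch <= "\u2fd5":
            # NFKC applies the <compat> decomposition of the radical.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


print(RADICAL == IDEOGRAPH)            # the two look alike but differ
print(unicodedata.name(RADICAL))
print(unicodedata.name(IDEOGRAPH))
print(fold_kangxi_radicals("孔乙" + RADICAL) == "孔乙" + IDEOGRAPH)
```

A search for `己` will not match text copied as `⼰` unless one side is folded like this, so a correct-looking paste can still be wrong.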

## Acknowledgment / Anecdotes

I noticed this issue a few days after I first tried Typst, but I could not report it in an actionable form until I read [*Color gradients and my gradual descent into madness* on Typst Blog](https://typst.app/blog/2023/color-gradients#debugging-pdf). Thanks to Sébastien d'Herbais de Thun and the community!

Besides, KaiTi (楷体), a font preinstalled on Windows that is embedded as a Type 2 CID Font, is copied as `孔\r\n乙\r\n己` from SumatraPDF. Those `\r\n` are unwanted (similar to #526). However, “KaiTi + Acrobat/Firefox/Edge” and “typst.app default fonts + SumatraPDF” both give the expected `孔乙己`, so that part might be [SumatraPDF’s issue](https://github.com/sumatrapdfreader/sumatrapdf/issues/4430).

	/// Create a /ToUnicode CMap.
	fn create_cmap(
	ttf: &ttf_parser::Face,
	glyph_set: &mut BTreeMap<u16, EcoString>,
	) -> UnicodeCmap {
	// For glyphs that have codepoints mapping to them in the font's cmap table,
	// we prefer them over pre-existing text mappings from the document. Only
	// things that don't have a corresponding codepoint (or only a private-use
	// one) like the "Th" in Linux Libertine get the text of their first
	// occurrences in the document instead.

	// Write the /ToUnicode character map, which maps glyph ids back to
	// unicode codepoints to enable copying out of the PDF.
	let cmap = create_cmap(ttf, glyph_set);
	ctx.pdf.cmap(cmap_ref, &cmap.finish());

	/// For each font a mapping from used glyphs to their text representation.
	/// May contain multiple chars in case of ligatures or similar things. The
	/// same glyph can have a different text representation within one document,
	/// then we just save the first one. The resulting strings are used for the
	/// PDF's /ToUnicode map for glyphs that don't have an entry in the font's
	/// cmap. This is important for copy-paste and searching.
	glyph_sets: HashMap<Font, BTreeMap<u16, EcoString>>,

	/// Get the CID for a glyph id.
	///
	/// When writing text into a PDF, we have to specify CIDs (character ids) not
	/// GIDs (glyph IDs).
	///
	/// Most of the time, the mapping between these two is an identity mapping. In
	/// particular, for TrueType fonts, the mapping is an identity mapping because
	/// of this line above:
	/// ```ignore
	/// cid.cid_to_gid_map_predefined(Name(b"Identity"));
	/// ```
	///
	/// However, CID-keyed CFF fonts may have a non-identity mapping defined in
	/// their charset. For those, we must map the glyph IDs in a `TextItem` to CIDs.
	/// The font defines the map through its charset. The charset usually maps
	/// glyphs to SIDs (string ids) specifying the glyph's name. Not for CID-keyed
	/// fonts though! For these, the SIDs are CIDs in disguise. Relevant quote from
	/// the CFF spec:
	///
	/// > The charset data, although in the same format as non-CIDFonts, will
	/// > represent CIDs rather than SIDs, [...]
	///
	/// This function performs the mapping from glyph ID to CID. It also works for
	/// non CID-keyed fonts. Then, it will simply return the glyph ID.
	pub(super) fn glyph_cid(font: &Font, glyph_id: u16) -> u16 {
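
The doc comments quoted above suggest one possible mechanism, sketched here in Python. It is a guess, not a confirmed diagnosis. The `CHARSET` pairing is hypothetical: it simply pairs the keys seen in F2's `ToUnicode` with the codes seen in the content stream. The function names mirror the Rust ones only loosely.

```python
# Hypothetical charset excerpt of a CID-keyed CFF font: glyph ID -> CID.
# The pairing is an assumption derived from the codes observed in the PDF.
CHARSET = {0x2F8B: 0x3BF0, 0x22A1: 0x25A1, 0x3227: 0x412C}

# Text each used glyph stands for, keyed by glyph ID (compare `glyph_sets`).
GLYPH_TEXT = {0x2F8B: "孔", 0x22A1: "乙", 0x3227: "己"}


def glyph_cid(charset, glyph_id):
    """Map a glyph ID to a CID; identity when the charset has no entry."""
    return charset.get(glyph_id, glyph_id)


def content_codes(glyph_ids, charset):
    """Codes written into the content stream: CIDs, not glyph IDs."""
    return [glyph_cid(charset, gid) for gid in glyph_ids]


def to_unicode_keyed_by_gid(glyph_text):
    """Suspected buggy variant: the table keeps glyph IDs as keys."""
    return dict(glyph_text)


def to_unicode_keyed_by_cid(glyph_text, charset):
    """Consistent variant: keys go through the same GID -> CID mapping."""
    return {glyph_cid(charset, gid): text for gid, text in glyph_text.items()}


def extract(codes, table):
    """What a reader recovers; codes without an entry become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in codes)


codes = content_codes([0x2F8B, 0x22A1, 0x3227], CHARSET)
print([f"{c:04X}" for c in codes])
print(extract(codes, to_unicode_keyed_by_gid(GLYPH_TEXT)))
print(extract(codes, to_unicode_keyed_by_cid(GLYPH_TEXT, CHARSET)))
```

Under this assumption, text extraction only works when the `ToUnicode` keys and the content-stream codes live in the same code space. For fonts with an identity mapping both variants coincide, which would explain why fonts such as Linux Libertine are unaffected.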
