ToUnicode in PDF for a Type 0 CID Font might be Wrong #3416

@YDX-2147483647

Source Han Serif (思源宋体) and other CJK fonts are embedded as Type 0 CID Fonts in PDF. Typst 0.10.0 (70ca0d2) might generate an incorrect ToUnicode CMap for them.

Note

Typst.app with default fonts is not affected.

It looks like typst.app uses NotoSansCJKjp-Regular-Identity-H for CJK characters. Noto Sans CJK is identical to Source Han Sans (思源黑体). Nonetheless, typst compile works with neither font on my computer… The language-specific OTFs work, but the region-specific Subset OTFs do not.

HaranoAjiMincho, implied by Source Serif Pro, also works on typst.app.

Tip

What is ToUnicode?

In PDF, it is often the case that text is not encoded in Unicode. However, modern
applications usually want them represented in Unicode to make it usable as text
information. The ToUnicode CMap (Character Map) is a bridge between PDF text string encodings and Unicode encodings,
and makes it possible to extract text in PDF files as
Unicode encoded strings. It is important to make resulting PDF search‐able and
copy‐and‐past‐able.

—§1.4.2 of The Dvipdfmx User’s Manual
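
As a rough illustration of what a ToUnicode CMap carries, the sketch below parses simple `<source> <destination>` entries into a Python dict. The helper name `parse_bfchar` is hypothetical, and the sample entries are the F1 ones quoted in the debugging section of this issue. A real ToUnicode CMap also contains range sections and other syntax that this sketch ignores.

```python
import re

# Matches one "<src> <dst>" pair of hex strings, e.g. "<3BFE> <5B54>".
BFCHAR_LINE = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")

def parse_bfchar(cmap_text: str) -> dict:
    """Hypothetical helper: map each source code to its Unicode string."""
    mapping = {}
    for src, dst in BFCHAR_LINE.findall(cmap_text):
        # The destination side is hex-encoded UTF-16BE.
        mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

sample = """
<2581> <4E59>
<3BFE> <5B54>
<4115> <5DF1>
"""
to_unicode = parse_bfchar(sample)
print(to_unicode[0x3BFE])  # 孔
```

A PDF reader needs an entry like these for every character code used in the page content. Without one, it cannot recover the original text when copying or searching.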

To Reproduce on Typst.app

  1. Download 14_SourceHanSerifCN.zip from Region Specific Subset OTFs Simplified Chinese (简体中文).

  2. Create a new project in typst.app and upload SubsetOTF/CN/SourceHanSerifCN-Regular.otf in that ZIP.

  3. Compile the following.

    #set page(height: auto)
    
    #let fonts = (
      "Linux Libertine",
      "Source Han Serif",
    )
    
    #for f in fonts {
      text(font: f, size: 3em)[
        #f
    
        #h(1fr)ABC 孔乙己
    
      ]
    }

    minimal-web-with-otf.pdf

  4. Download the PDF, open it, copy all text and paste.

    • Adobe Acrobat / SumatraPDF:

      Linux Libertine
      ABC 孔乙己
      Source Han Serif
      ABC ���
      
    • Firefox (pdf.js) / MS Edge:

      Linux Libertine
      ABC 孔乙己
      Source Han Serif
      ABC 㯰□䄬
      

    Expected: ABC 孔乙己 or ABC孔乙己 (no space).

To Debug

  1. Upload the PDF to PDF Object Browser.

  2. Go to Trailer → Root → Pages → Resources → Font:

    • F1

      • Base Font: /UVRTQX+NotoSansCJKjp-Regular-Identity-H

      • ToUnicode:

        …
        <2581> <4E59>  # <2581> ↦ U+4E59 (乙)
        <3BFE> <5B54>  # <3BFE> ↦ U+5B54 (孔)
        <4115> <5DF1>  # <4115> ↦ U+5DF1 (己)
        …
        
    • F2

      • Base Font: /TLJDGM+SourceHanSerifCN-Regular-Identity-H

      • ToUnicode:

        …
        <22A1> <4E59>  # <22A1> ↦ U+4E59 (乙)
        <2F8B> <5B54>  # <2F8B> ↦ U+5B54 (孔)
        <3227> <5DF1>  # <3227> ↦ U+5DF1 (己)
        …
        
  3. Go to Trailer → Root → Pages → Kids → 0 → Contents, view contents:

    …
    /F1 33 Tf  # Use font F1 (Noto Sans CJK)
        …
        BT  # Begin text
        …
        [<3BFE25814115>] TJ # Show text
        ET  # End text
    /F2 33 Tf  # Use font F2 (Source Han Serif CN)
        …
        BT  # Begin text
        …
        [<3BF025A1412C>] TJ  # Show text
        ET  # End text
    

    In the first case, F1’s ToUnicode has an entry for every code in <3BFE25814115> (<3BFE>, <2581> and <4115>), so all readers convert the text to 孔乙己 as expected.

    In the second case, F2’s ToUnicode has no entries for the codes in <3BF025A1412C>. Adobe Acrobat and SumatraPDF just say they don’t know (���), while Firefox and MS Edge interpret the codes directly as Unicode code points:

    >>> '\u3BF0\u25A1\u412C'
    '㯰□䄬'
  4. Manually edit the wrong ToUnicode in the PDF, and the copied text becomes 孔乙己.
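
The lookup described in step 3 can be sketched in Python. This is only an illustration of the observed behaviour, not how any of the readers is actually implemented. The two dicts contain just the ToUnicode entries quoted in step 2, and the function names are made up for this example.

```python
# Entries copied from the ToUnicode excerpts in step 2; the real CMaps contain more.
F1_TO_UNICODE = {0x2581: "乙", 0x3BFE: "孔", 0x4115: "己"}
F2_TO_UNICODE = {0x22A1: "乙", 0x2F8B: "孔", 0x3227: "己"}

def split_codes(hex_string):
    """Split an Identity-H hex string into 2-byte character codes."""
    raw = bytes.fromhex(hex_string)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def extract(hex_string, to_unicode):
    """Look up each code; emit U+FFFD where the CMap has no entry."""
    return "".join(to_unicode.get(c, "\ufffd") for c in split_codes(hex_string))

def naive(hex_string):
    """Treat the codes as Unicode code points, as Firefox and Edge appear to do."""
    return "".join(chr(c) for c in split_codes(hex_string))

print(extract("3BFE25814115", F1_TO_UNICODE))  # 孔乙己
print(extract("3BF025A1412C", F2_TO_UNICODE))  # three U+FFFD characters
print(naive("3BF025A1412C"))                   # 㯰□䄬
```

None of the codes shown for F2 (<3BF0>, <25A1>, <412C>) appears among the quoted ToUnicode entries, which matches the two kinds of broken output listed under "To Reproduce on Typst.app".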

To Reproduce in the Repo

  1. testit cjk --pdf

  2. Open tests/pdf/layout/cjk-punctuation-adjustment.pdf, copy all text and paste.

    main

    Expected:

    to-unicode

Relevant Links

Acknowledgment / Anecdotes

I noticed this issue a few days after I first met Typst, but I could not report it as a practical issue until I read “Color gradients and my gradual descent into madness” on the Typst Blog. Thanks to Sébastien d'Herbais de Thun and the community!

Besides, text set in KaiTi (楷体), the default font on Windows, which is embedded as a Type 2 CID Font, is copied as 孔\r\n乙\r\n己 in SumatraPDF. Those \r\n are not desired (similar to #526). Even so, “KaiTi + Acrobat/Firefox/Edge” and “typst.app default fonts + SumatraPDF” give the expected 孔乙己. Therefore it might be an issue in SumatraPDF.
