Source Han Serif (思源宋体) and other CJK fonts are embedded as Type 0 CID Fonts in PDF. Typst 0.10.0 (70ca0d2) may generate an incorrect ToUnicode CMap for them.
Note
Typst.app with default fonts is not affected.
It looks like typst.app uses NotoSansCJKjp-Regular-Identity-H for CJK characters. Noto Sans CJK is identical to Source Han Sans (思源黑体). Nonetheless, typst compile works with neither font on my computer… Language-specific OTFs work, but region-specific Subset OTFs do not.
HaranoAjiMincho, implied by Source Serif Pro, also works on typst.app.
Tip
What is ToUnicode?
In PDF, it is often the case that text is not encoded in Unicode. However, modern
applications usually want them represented in Unicode to make it usable as text
information. The ToUnicode CMap (Character Map) is a bridge between PDF text string encodings and Unicode encodings,
and makes it possible to extract text in PDF files as
Unicode encoded strings. It is important to make resulting PDF search‐able and
copy‐and‐past‐able.
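To make the role of ToUnicode concrete, here is a minimal Python sketch of how a reader might apply it. The CMap fragment is hypothetical: it is built only from the three codes that this issue reports for font F1 and the text 孔乙己 they should produce, and a real ToUnicode stream may use bfrange entries and other operators that this sketch ignores.

```python
import re

# Hypothetical ToUnicode fragment (not copied from the actual PDF).
CMAP = """
3 beginbfchar
<3BFE> <5B54>
<2581> <4E59>
<4115> <5DF1>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from beginbfchar/endbfchar blocks."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # The destination is UTF-16BE encoded text written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def extract_text(hex_string, mapping):
    """Split a hex string into 2-byte codes and map each one to Unicode."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    # Codes without an entry are shown as U+FFFD (replacement character).
    return "".join(mapping.get(code, "\ufffd") for code in codes)

print(extract_text("3BFE25814115", parse_bfchar(CMAP)))
```

When a code has no entry, as in the second lookup one could try with `3BF025A1412C`, the sketch falls back to the replacement character, which is one plausible reader behavior.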
Go to Trailer → Root → Pages → Kids → 0 → Contents, view contents:
…
/F1 33 Tf # Use font F1 (Noto Sans CJK)
…
BT # Begin text
…
[<3BFE25814115>] TJ # Show text
ET # End text
/F2 33 Tf # Use font F2 (Source Han Serif CN)
…
BT # Begin text
…
[<3BF025A1412C>] TJ # Show text
ET # End text
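The hex strings above can be split into per-font code lists with a few lines of Python. This is only a sketch: it assumes 2-byte codes (consistent with the Identity-H base font names), one operator per line, and a simplified copy of the stream with the elided parts and annotations left out. The function name `codes_by_font` is made up for this example.

```python
import re

# Simplified copy of the content stream shown above (elided parts omitted).
CONTENT = """\
/F1 33 Tf
BT
[<3BFE25814115>] TJ
ET
/F2 33 Tf
BT
[<3BF025A1412C>] TJ
ET
"""

def codes_by_font(content):
    """Return {font name: [2-byte codes]} for every TJ operator in the stream."""
    result = {}
    font = None
    for line in content.splitlines():
        match = re.match(r"/(\S+)\s+[\d.]+\s+Tf", line)
        if match:
            font = match.group(1)  # remember the font selected by Tf
            continue
        if line.rstrip().endswith("TJ"):
            for hexstr in re.findall(r"<([0-9A-Fa-f]+)>", line):
                codes = [int(hexstr[i:i + 4], 16) for i in range(0, len(hexstr), 4)]
                result.setdefault(font, []).extend(codes)
    return result

for name, codes in codes_by_font(CONTENT).items():
    print(name, [f"{code:04X}" for code in codes])
```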
In the first case, F1’s ToUnicode contains <3BFE25814115>, so all readers convert to 孔乙己 as expected.
In the second case, F2’s ToUnicode does not contain <3BF025A1412C>. Adobe Acrobat and SumatraPDF just say they don’t know (���), while Firefox and MS Edge parse them directly as Unicode:
>>> '\u3BF0\u25A1\u412C'
'㯰□䄬'
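The two behaviors can be modeled side by side. This is a sketch of the observed output only, not of how any of these readers is actually implemented; the empty mapping stands in for a ToUnicode that lacks the three codes, and both function names are invented for the example.

```python
# Codes shown with font F2 in the content stream above.
F2_CODES = [0x3BF0, 0x25A1, 0x412C]

# Assumption for illustration: F2's ToUnicode has no entry for these codes.
F2_TO_UNICODE = {}

def strict_reader(codes, to_unicode):
    """Model of Acrobat/SumatraPDF: unmapped codes become U+FFFD."""
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)

def lenient_reader(codes, to_unicode):
    """Model of Firefox/Edge: unmapped codes are reused as code points."""
    return "".join(to_unicode.get(code, chr(code)) for code in codes)

print(strict_reader(F2_CODES, F2_TO_UNICODE))
print(lenient_reader(F2_CODES, F2_TO_UNICODE))
```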
After manually editing the wrong ToUnicode in the PDF, the copied text becomes 孔乙己.
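The issue does not show the exact bytes that were edited. As an assumption, corrected entries for the three F2 codes could be generated like this, pairing each code from the content stream with the character it should yield; `bfchar_block` is a hypothetical helper, and a full ToUnicode stream needs more surrounding structure than this block.

```python
def bfchar_block(pairs):
    """Format (code, character) pairs as a ToUnicode bfchar block."""
    lines = [f"{len(pairs)} beginbfchar"]
    for code, char in pairs:
        # Destination values are UTF-16BE in hex, as in ToUnicode CMaps.
        target = char.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:04X}> <{target}>")
    lines.append("endbfchar")
    return "\n".join(lines)

# Codes from F2's text string, paired with the expected characters.
print(bfchar_block([(0x3BF0, "孔"), (0x25A1, "乙"), (0x412C, "己")]))
```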
To Reproduce in the Repo
testit cjk --pdf
Open tests/pdf/layout/cjk-punctuation-adjustment.pdf, copy all text and paste.
The description of ToUnicode quoted in the Tip above is from §1.4.2 of The Dvipdfmx User’s Manual.
To Reproduce on Typst.app
Download 14_SourceHanSerifCN.zip from Region Specific Subset OTFs Simplified Chinese (简体中文).
Create a new project in typst.app and upload SubsetOTF/CN/SourceHanSerifCN-Regular.otf from that ZIP.
Compile the following.
minimal-web-with-otf.pdf
Download the PDF, open it, copy all text and paste.
Adobe Acrobat / SumatraPDF:
Firefox (pdf.js) / MS Edge:
Expected: ABC 孔乙己 or ABC孔乙己 (no space).
To Debug
Upload the PDF to PDF Object Browser.
Go to Trailer → Root → Pages → Resources → Font:
…
F1
Base Font: /UVRTQX+NotoSansCJKjp-Regular-Identity-H
ToUnicode:
F2
Base Font: /TLJDGM+SourceHanSerifCN-Regular-Identity-H
ToUnicode:
The page content stream that uses F1 and F2 (Trailer → Root → Pages → Kids → 0 → Contents), the resulting ToUnicode analysis, and the steps to reproduce in the repo are given above.
Relevant Links
Many glyphs are not searchable in the PDF #479: ToUnicode of ligatures
Can provide an example of loading a font from a TTF file? pdf-writer#17 (comment)
Ligatures like ffi for Source Han Serif are also broken.
Fix embedding of CID-keyed fonts into PDF · typst/typst@dad7c88 · GitHub, from History for crates/typst-pdf/src/font.rs
Make ligatures copyable and searchable · typst/typst@ad34763
Code base
typst/crates/typst-pdf/src/font.rs, Lines 200 to 209 in 79e37cc
typst/crates/typst-pdf/src/font.rs, Lines 147 to 150 in 79e37cc
typst/crates/typst-pdf/src/lib.rs, Lines 79 to 85 in 79e37cc
typst/crates/typst-pdf/src/font.rs, Lines 239 to 263 in 79e37cc
Basics and Common Objects in PDF Cheat Sheets – PDF Association
§9.7 Composite Fonts in PDF32000_2008.pdf, downloaded from ISO 32000 (PDF) – PDF Association
The Type — 文字 / 设计 / 文化 » 字谈字畅 183:康熙怎么又来了 (Type Chat 183: Why is Kangxi back again?): Kangxi radicals (康熙部首), e.g. ⼰ (U+2F30, Kangxi Radical Oneself) ≠ 己 (U+5DF1, CJK Unified Ideograph)
数字世界中的纸张——理解 PDF (Paper in the digital world: understanding PDF) - neverland
PDF 复制中的文字重复问题 (Duplicated text when copying from PDF) - neverland
Analyzing documents with the Preflight tool (Adobe Acrobat Pro)
康熙来了 (Here comes Kangxi) - neverland
全球文种的字体与布局 (Fonts and Layout for Global Scripts)
PDF转Word,为啥那么费劲?(PDF·文字篇) (Why is converting PDF to Word so hard? PDF, text part) - 哔哩哔哩 (Bilibili)
Advanced typography in PDF - PP_Advanced_typography_in_PDF-compressed.pdf | iText PDF
To UVS, Or Not To UVS - CJK Type Blog | Adobe
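The Kangxi radical distinction mentioned in the links above can be checked with Python's standard unicodedata module. This small sketch only inspects the two code points named there; it says nothing about which of them a particular PDF actually contains.

```python
import unicodedata

radical = "\u2F30"    # ⼰, the Kangxi radical named in the links above
ideograph = "\u5DF1"  # 己, the CJK unified ideograph

def describe(char):
    """Return the code point and Unicode character name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

print(describe(radical))
print(describe(ideograph))

# The two look alike but are different code points, so a plain search
# for one does not match the other.
print(radical == ideograph)

# Compatibility normalization (NFKC) folds the radical into the ideograph.
print(unicodedata.normalize("NFKC", radical) == ideograph)
```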
Acknowledgment / Anecdotes
I noticed the issue a few days after I first tried Typst, but I could not report it as a concrete issue until I read Color gradients and my gradual descent into madness on the Typst Blog. Thanks to Sébastien d'Herbais de Thun and the community!
Besides, KaiTi (楷体), the default font on Windows, embedded as a Type 2 CID Font, turns out to be 孔\r\n乙\r\n己 in SumatraPDF. Those \r\n are not desired. (Similar to #526) Even so, “KaiTi + Acrobat/Firefox/Edge” and “typst.app default fonts + SumatraPDF” give the expected 孔乙己. Therefore it might be SumatraPDF’s issue.