Description
Compile the following on typst.app, download the PDF, and copy the text from it.
#set text(lang: "zh", region: "CN", font: "Noto Serif CJK SC")
力量
You will get 力量 (U+F98A U+F97E in CJK Compatibility Ideographs), instead of 力量 (U+529B U+91CF in CJK Unified Ideographs).
CJK Compatibility Ideographs often look bizarre because they fall back to the system font.
Screenshot of notepad.exe:

Continues #3416/#3435
Might relate to typst/webapp-issues#48
Cause
/ToUnicode in PDF maps glyphs to Unicode code points, making it possible to copy and search text.
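To make the mapping concrete, here is a minimal sketch of how one `bfchar` section of a ToUnicode CMap could be rendered. The function name `bfchar_section` is hypothetical, and the sample pair (CID 1200, U+4E00) is the one from Adobe's blog post quoted in this issue. It is not how Typst writes its PDFs, and it ignores the limit of 100 entries per section.

```python
def bfchar_section(mapping):
    """Render {glyph_id: code_point} as one bfchar section (illustrative only)."""
    lines = [f"{len(mapping)} beginbfchar"]
    for gid, cp in sorted(mapping.items()):
        # The destination is the UTF-16BE encoding of the code point, in hex.
        dst = chr(cp).encode("utf-16-be").hex()
        lines.append(f"<{gid:04x}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)


print(bfchar_section({1200: 0x4E00}))
```

Each glyph ID appears once on the left-hand side, so each glyph can carry only one destination.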
However, sometimes a glyph in a font is shared by multiple code points.
There are 3 code points mapping to the glyph cid11384 in SourceHanSerifCN-Regular.otf (downloaded from Region Specific Subset OTFs Simplified Chinese (简体中文)):
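(generated by the following Python script)
from unicodedata import name

from fontTools.ttLib import TTFont

font = TTFont("SourceHanSerifCN-Regular.otf")
cmap = font["cmap"].getBestCmap()  # code point -> glyph name

the_character = "\u529b"  # U+529B (CJK unified)
the_glyph = cmap[ord(the_character)]

# List every code point that maps to the same glyph.
for c, g in cmap.items():
    if g == the_glyph:
        print(f"{chr(c)} (U+{c:X} {name(chr(c))}): {g}")

print("---")

# Check whether each variant is present in the font's cmap.
# U+2F12 Kangxi radical, U+529B CJK unified, U+F98A CJK compatibility
for c in "\u2f12\u529b\uf98a":
    print(f"{c} (U+{ord(c):X} {name(c)}): {ord(c) in cmap = }")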
Only one code point can be written to /ToUnicode, and Typst chooses U+F98A 力 (CJK compatibility).
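One possible tie-breaking rule, sketched below, is to prefer a code point that has no Unicode decomposition. U+F98A decomposes canonically to U+529B and U+2F12 has a compatibility decomposition to it, while U+529B itself has none. The function name `pick_preferred` is hypothetical, and this is only an assumption about what a reasonable rule could look like, not what Typst or Adobe actually do.

```python
import unicodedata


def pick_preferred(code_points):
    """Pick one code point for /ToUnicode from all those sharing a glyph."""
    def rank(cp):
        # Code points with a decomposition (compatibility ideographs,
        # Kangxi radicals) sort after those without one.
        has_decomposition = unicodedata.decomposition(chr(cp)) != ""
        return (has_decomposition, cp)

    return min(code_points, key=rank)


print(f"U+{pick_preferred([0x2F12, 0x529B, 0xF98A]):04X}")
```

When several candidates tie, the sketch falls back to the lowest code point, which is an arbitrary choice.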
Adobe’s implementation
Adobe explicitly maps the glyph to the most used one, U+529B 力 (CJK unified).
To UVS, Or Not To UVS - CJK Type Blog | Adobe:
A ToUnicode mapping file does exactly what its name suggests: it maps CIDs to Unicode code points, or to code point sequences. Unlike CMap resources that map Unicode code points to CIDs, or 'cmap' tables that map code points to GIDs that may also be CIDs, a ToUnicode mapping file specifies the inverse mapping. Some omissions and ambiguities can arise, either because a glyph is represented as a sequence, or it is mapped from multiple code points.
An excellent example of the latter is Adobe-Japan1-7 CID+1200, which is mapped from U+2F00 ⼀ KANGXI RADICAL ONE and U+4E00 一 (a CJK Unified Ideograph). If CID+1200 is included in a PDF, one would naturally expect U+4E00 一 to be copied, not U+2F00 ⼀ as its use is more obscure. The Adobe-Japan1-UCS2 ToUnicode mapping file makes this mapping preference explicit (04b0 is the zero-padded hexadecimal form of decimal 1200):
https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-GB1-6/cid2code.txt#L27-L29:
There may be cases of single CIDs being referenced in multiple encoding points within a single CMap file.
These cases are comma-delimited, within the same column.
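As a rough illustration of the comma-delimited convention, the sketch below splits one cell into its code points. The function name `parse_code_cell` and the sample cell are hypothetical and not copied from cid2code.txt. It assumes each part is a plain hexadecimal code point.

```python
def parse_code_cell(cell):
    """Split a comma-delimited cell of hex code points into integers."""
    return [int(part.strip(), 16) for part in cell.split(",")]


# Hypothetical cell for a glyph shared by U+4E00 and U+2F00.
for cp in parse_code_cell("4e00,2f00"):
    print(f"U+{cp:04X}")
```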
CMap
/ToUnicode
Further example
#set page(height: auto, width: 30em)
#set text(lang: "zh", region: "CN", fallback: false)
#let fonts = (
// "Linux Libertine", // no glyph
// "Noto Serif", // no glyph
"Noto Serif CJK SC",
"Noto Sans CJK SC",
)
#for f in fonts {
text(font: f)[
#f
- ⼒ U+2F12 Kangxi radical
- 力 U+529B CJK unified
- 力 U+F98A CJK compatibility
地球发动机是人类建造的力量最大的机器,比如我们所在的华北794号,全功率运行时能向大地产生150亿吨的推力。
#pagebreak()
]
}
Screenshot of the result in a nerd font:

Reproduction URL
No response
Operating system
Web app, Windows
Typst version
Sources for the CMap and /ToUnicode entries under "Adobe’s implementation":
CMap
Chinese: https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-GB1-6/cid2code.txt#L2629C60-L2629C69
Korean: https://github.com/adobe-type-tools/cmap-resources/blob/f5cf3bca7fdfeaceb77aa82847e974f2306c20b4/Adobe-KR-9/cid2code.txt#L14627C28-L14627C43
/ToUnicode
Chinese: https://github.com/adobe-type-tools/mapping-resources-pdf/blob/2dd5e53fb74a01718b9dfd448a0d1cce6fff2aa5/pdf2unicode/Adobe-GB1-UCS2#L1862
Korean: https://github.com/adobe-type-tools/mapping-resources-pdf/blob/2dd5e53fb74a01718b9dfd448a0d1cce6fff2aa5/pdf2unicode/Adobe-KR-UCS2#L3525
Note
This issue cannot be reproduced with testit cjk --pdf and https://github.com/typst/typst/blob/42754477886f6a12afbabfd2a64d8c787a57bc03/tests/suite/layout/inline/cjk.typ, because https://github.com/typst/typst-dev-assets/blob/48a924d9de82b631bc775124a69384c8d860db04/files/fonts/NotoSerifCJKsc-Regular.otf does not contain U+2F12 ⼒ (Kangxi radical) or U+F98A 力 (CJK compatibility).