Use texts of the first occurrences for /ToUnicode CMap#4585
Merged
laurmaedje merged 2 commits intotypst:mainfrom Jul 20, 2024
Merged
Use texts of the first occurrences for /ToUnicode CMap#4585laurmaedje merged 2 commits intotypst:mainfrom
/ToUnicode CMap#4585laurmaedje merged 2 commits intotypst:mainfrom
Conversation
/ToUnicode CMap/ToUnicode CMap
360c1cd to
a7123b5
Compare
Resolves typst#4582 …just by deleting `improve_glyph_sets`! There are two sources of information for `/ToUnicode`: the `glyph_set` recorded while `write_text`, and `cmap` tables of the font. `improve_glyph_sets` leverages the font. It was refactored into a function in typst#4154, but the real code even predates ed6550f (2 years ago). `improve_glyph_sets` was necessary before ad34763, when a `glyph_set` was a list of glyphs, and we had to search the font (again) for their texts. (Each glyph represents a text, which is a Unicode code point or a sequence of code points (e.g. ligature).) In ad34763, the `glyph_set` is refactored to a map from glyphs to texts. Now we have enough information for `/ToUnicode` CMap—no need to search the font. If the glyph… - …represents a single character… - …and is mapped from only one code point: No change. - …and is shared by multiple code points (e.g. CJK unified/compatibility): `/ToUnicode` changes from the largest code points to the first occurrence, and fixes typst#4582. - …represents a sequence of characters (e.g. ligature)… - …and they are also encoded as a single code point for compatibility (e.g “fi”/fi): `/ToUnicode` changes from a single compatibility code point (fi) to the sequence (fi). The behaviour in PDF viewers usually does not change. - …and is not encoded in Unicode (e.g. “Th” in Linux Libertine): No change.
Member
|
I agree that this makes sense. The reason I kept the cmap search in ad34763 was probably just cautiousness. |
Member
|
Thank you! |
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resolves #4582
…just by deleting
improve_glyph_sets!There are two sources of information for
/ToUnicode: theglyph_setrecorded whilewrite_text, andcmaptables of the font.improve_glyph_setsleverages the font. It was refactored into a function in #4154, but the real code even predates ed6550f (2 years ago).improve_glyph_setswas necessary before ad34763, when aglyph_setwas a list of glyphs, and we had to search the font (again) for their texts. (Each glyph represents a text, which is a Unicode code point or a sequence of code points (e.g. ligature).)In ad34763, the
glyph_setis refactored to a map from glyphs to texts.Now we have enough information for
/ToUnicodeCMap—no need to search the font.Changes
If the glyph…
No change.
/ToUnicodechanges from the largest code points to the first occurrence, and fixes/ToUnicodein PDF can be wrong if a glyph is mapped from multiple code points #4582./ToUnicodechanges from a single compatibility code point (fi) to the sequence (fi).The behaviour in PDF viewers usually does not change.
No change.
Testcase
(AFAIK,
/ToUnicodeis not included intestit.)TODO
unicode_properties. (still an indirect dependency through rustybuzz)Future work
/ActualText. (tracked by PDF text extraction can fail in complex shaping scenarios #4225 and Different characters that use the same glyph result in the same character when copied from PDFs #526)Use the most common instead of the first occurrence.I don't think we should use CJK unified even if the author always writes CJK compatibility.