ENH: CID font resource from font file to encode more characters#3652
ENH: CID font resource from font file to encode more characters#3652PJBrs wants to merge 24 commits intopy-pdf:mainfrom
Conversation
175a542 to
e43c57d
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3652 +/- ##
========================================
Coverage 97.41% 97.42%
========================================
Files 55 55
Lines 9989 10172 +183
Branches 1833 1863 +30
========================================
+ Hits 9731 9910 +179
- Misses 150 152 +2
- Partials 108 110 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke. Codecov shows that quite some new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real truetype fonts from file to see if they parse correctly. That, however, would seem, to me, to be beyond the purposes of this PR. Conversely, it would seem a shame to me not to parse these flags. How should I continue? One final thing: I can also still improve this, if wanted. |
160d8d5 to
cf9b10e
Compare
5b3cd93 to
cbc9ee4
Compare
|
I clearly must need to learn more about fonts in order to get this PR sufficient. I've learnt the following now: What we have in character_map actually is pypdf's representation of a /ToUnicode character mapping. Reflection by Google Gemini:
|
|
I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well. |
OK, I'll forge on then, using this PR for note taking. I'm finding now that it's not even very wrong, it's just overly simplistic. What I've learned in the interim: Font.encoding and Font.character_map can be said to map the same thing: decoded text (int in the context of simple fonts, str in the context of CID fonts) to unicode code points. EDIT LIKE THE /ToUnicode DICT IN A FONT RESOURCE This is helpful for extracting text from pdfs and it can also map different glyph substitutes to single unicode code points, e.g., for Arabic text. For this reason, character_widths actually is wrong, because it maps unicode code points to widths, which does not work if multiple glyphs map to the same unicode code point (e.g., Arabic). Furthermore, it doesn't really work for the reverse logic of producing text. In this PR, I think that I populated character_map in from_truetype_font_file in reverse, mapping unicode code points to character IDs. Otherwise, it actually seems to work, with the caveats that the character_widths are lossy when one unicode code point maps to multiple glyphs. So, for the purposes of abstraction I could actually merge map_dict and encoding without losing any information or functionality:
It would seem to that the underlying pypdf font / encoding / character_map architecture can be improved in three ways:
For this PR, it doesn't matter too much, I can just clean up the logic in character_map and then it should work for fonts where the CIDtoGID map is contiguous (this is another caveat). |
|
@stefan6419846 OK, final verdict for now - character_widths need to be keyed by CID for cid fonts or by character code for simple fonts. My decision to use the character widths code from the new Font class was unfortunately incorrect. I can still fix the above PR, I think, at least logically, and it will also mostly work, but not for any text that needs to be run through a text shaper. (I now, finally, understand that many fonts contain glyphs without a unicode code point, such as ligatures. You cannot address these using a unicode code point, and you also cannot get their widths through a unicode code point. Instead, you need to read their widths and glyphs by CID (for CID fonts) / character code (for simple fonts). This was what the old build_font_width_map code did in _cmap.py.) I'll fix this PR according to the new logic, but then I'll revert the character widths to the old logic that was in _cmap.py, port the old text extraction code back to it (should be simple) and port the layout text extraction code and the appearance stream code to it (will be harder). And that will be a good basis for generating arabic code. |
6919ab5 to
744c593
Compare
|
Something's still weird, font is not listed as embedded... |
|
OK, I may have reached somewhat of a breakthrough. I can now fully embed a font and associated font resource and encode new text. I needed to add a character_map after all, but not in the way that I thought. I can also do so while reusing a compatible font resource. Main remaining problems include:
|
|
I think that this PR now starts to get somewhat useful. As far as I can tell, it now no longer matters whether I create a new /FontDescriptor resource or use the old one. Also, visual text now corresponds with copy-pasted text. I'm going to change the api a little bit so that I can actually embed a font from a ttf file using writer.add_font(). In that way, it is easier to test the new font methods and different encodings. Tests all fail because I didn't fix the new test after adding a character_map. Probably needs another week of work. |
|
I fixed the test, now I'm just filling the form and then extracting the text using PdfReader. This is nice, because nothing changed in the code for text extraction, which means that the new code would be logically the same as the old. |
This patch more comprehensivel tries to detect font flags. Furthermore, it adds some checks to deal with missing tables in truetype fonts. It is a bit of a question what to do when the cmap itself is missing. In this version, we just continue, but perhaps we should raise a warning or even an error, because, in practice, it would mean that the font that results isn't usable.
This patch adds a test and a file with some sample font resources that all have specific font flags and/ or specific missing tables, to test all the if conditions in _font.py. The font resources were added using pypdf itself, and lifted from pdf files used as part of the current test suite.
This patch adds a method to produce a pdf font descriptor resource. For now, we assume that an embedded font file will be a TrueType font.
This enables generating a new unicode font resource in case of text widget values that cannot be encoded with existing font resources.
Also refactor to reduce complexity i Please enter the commit message for your changes. Lines starting
This patch adds back some code that got removed earlier and, at that time, did not see any test coverage. With new code that enables adding fonts, I've finally understood that, in some cases, a -1 key will be added to font.character_map. This will cause an encoding failure when generating font_glyph_byte_map.
|
This is beginning to finalise. Couple of points. First, we should only escape parentheses if we encode literal strings, not when we use Hexadecimal Strings. Second, I still need to change the font resource name to avoid clashes. Third, I notice that font changes when filling out forms only show up when I flatten text... |
…nt file" This reverts commit 5f9a113cbbc6771ba93b9c64612dc489caf5f55a.
…urce code in one place
|
This, I hope, is ready for review now. The main infrastructure that I added is all in _font.py, both for initialising a Font from an embedded font file, and for producing a CID font resource when we have a font file. The second bit of infrastructure all is in _appearance_stream.py, where we now better check whether we actually can encode some text, and if not, we try to create a better font resource that can. I also noticed that we should not, when we use a CID font resource, escape parentheses anymore, so I fixed that as well. I also refactored some code here, to put all code dealing with getting font font_name and font_resource in one routine. This was necessary to have all necessary information in place before we can decide whether to escape parentheses. The third bit of infrastructure is in _writer.py. Especially when we add new font resources, we need to make several bits indirect objects, such as the font descriptor, the font file and the /ToUnicode stream. We cannot do this in _appearance_stream.py, because we don't have a PdfWriter object in there. Here, again, I refactored some code, both to reduce code complexity and to remove some superfluous code. I ended up also adding a very small add_font() method to PdfWriter, both for testing convenience and to finally make the option of setting a font when filling forms more useful. Usage example: Almost all the PR is covered with tests, barring a couple of lines in AppearanceStream.py. I can add a test for this later as well. I did notice one thing - when setting a font while filling out a form, the result does only show when flattening the annotations, so I should need to look in to that as well. EDIT This was much easier to fix than I thought, see last patch. On the bright side, I could manually add a font and use it to fill a form, which is nice! I did use Gemini for some of the code, especially the code that produces the /ToUnicode stream and the code that produces the /W array in _font.py. In the end, this PR did get quite elaborate. I can also split it by removing the fix and refactoring associated with escaping parentheses, and the comments I added to _cmap.py are entirely unnecessary, although they do help understanding what is going on in there. |
|
@stefan6419846 Should I divide this PR in smaller sets of commits? I could also start just with the code that initialises a font from file. |
|
The smaller and self-explanatory a PR this, the higher are the chances of getting it merged. At the moment, reviewing larger PRs which are non-trivial might take a bit longer due to other more important stuff - like the current "flood" of security-related issues. |
This PR adds a new method to _font.py,
from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, to enable producing a CID TrueType font resource that enables encoding more characters than a TrueType font resource.This fixes #3361.
Contributes to fixing #3514.
Might be related to #3318. EDIT, it is not.
Includes all work from #3602.
EDIT.
How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
Some notes:
I think that the more elegant way would be produce a short embedded font resource with only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.