
ENH: CID font resource from font file to encode more characters#3652

Open
PJBrs wants to merge 24 commits into py-pdf:main from PJBrs:fontwork

Conversation

@PJBrs
Contributor

@PJBrs PJBrs commented Feb 19, 2026

This PR adds a new method to _font.py, from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, to enable producing a CID TrueType font resource that enables encoding more characters than a TrueType font resource.

This fixes #3361.

Contributes to fixing #3514.

Might be related to #3318. EDIT, it is not.

Includes all work from #3602.

EDIT.

How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
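The decision described above can be sketched roughly as follows. This is only an illustration, not the code in the PR: `can_encode`, `choose_font_strategy`, and the `unicode_to_code` mapping are hypothetical names, and the mapping is assumed to go from Unicode characters to character codes of the existing font resource.

```python
from __future__ import annotations


def can_encode(text: str, unicode_to_code: dict[str, int]) -> bool:
    """Return True if every character of ``text`` has a character code."""
    return all(char in unicode_to_code for char in text)


def choose_font_strategy(
    text: str,
    unicode_to_code: dict[str, int],
    has_embedded_truetype: bool,
) -> str:
    """Pick how to render ``text`` (hypothetical strategy labels)."""
    if can_encode(text, unicode_to_code):
        # The existing font resource already covers the whole value.
        return "reuse-existing"
    if has_embedded_truetype:
        # Build a new CID font resource from the embedded font file.
        return "build-cid-resource"
    # Nothing better is available; keep the existing resource.
    return "keep-existing"
```

With a Latin-only mapping, a value such as "aé" would fall through to the second branch whenever an embedded TrueType font is available.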

Some notes:
I think that the more elegant way would be to produce a short embedded font resource with only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.

@PJBrs PJBrs marked this pull request as draft February 19, 2026 16:43
@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 175a542 to e43c57d Compare February 21, 2026 13:45
@PJBrs PJBrs marked this pull request as ready for review February 21, 2026 14:54
@codecov

codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 98.66071% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.42%. Comparing base (5a9a0da) to head (078c92b).

Files with missing lines | Patch % | Lines
pypdf/generic/_appearance_stream.py | 91.42% | 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #3652    +/-   ##
========================================
  Coverage   97.41%   97.42%            
========================================
  Files          55       55            
  Lines        9989    10172   +183     
  Branches     1833     1863    +30     
========================================
+ Hits         9731     9910   +179     
- Misses        150      152     +2     
- Partials      108      110     +2     


@PJBrs
Contributor Author

PJBrs commented Feb 21, 2026

This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke.

Codecov shows that quite a bit of new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real TrueType fonts from file to see if they parse correctly. That, however, seems to me to be beyond the scope of this PR. Conversely, it would be a shame not to parse these flags. How should I continue?

One final thing:

NameObject("/Registry"): TextStringObject("Adobe"),  # Should be something read from font file

I can also still improve this, if wanted.

@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 160d8d5 to cf9b10e Compare February 21, 2026 15:50
@PJBrs PJBrs marked this pull request as draft February 22, 2026 10:34
@PJBrs PJBrs force-pushed the fontwork branch 3 times, most recently from 5b3cd93 to cbc9ee4 Compare February 22, 2026 11:26
@PJBrs PJBrs marked this pull request as ready for review February 22, 2026 11:40
@PJBrs PJBrs marked this pull request as draft February 22, 2026 19:07
@PJBrs
Contributor Author

PJBrs commented Feb 24, 2026

@stefan6419846

I clearly need to learn more about fonts to get this PR into shape. Here is what I've learnt so far:
In CID fonts, one Unicode code point may refer to several different glyphs, especially in Arabic. The Font class, however, just maps widths to one Unicode code point, which means that it can only store the width for one character variant. So, ideally, a Font should map character codes to GIDs, where one character code (EDIT: no, one Unicode code point) might refer to multiple GIDs. For non-CID fonts, we just map Unicode code points as a fallback. In both cases, character_widths should be keyed by the values of character_map. (EDIT: this is incorrect, character_widths should be keyed by the keys of character_map.) And this ought to be sufficient for both text extraction and producing appearance streams.

What we have in character_map actually is pypdf's representation of a /ToUnicode character mapping.

Reflection by Google Gemini:

  1. The Ideal Mapping Structure
    To be "complete," your Font object needs a hierarchy that recognizes that a character is an idea, but a glyph is ink.

Logical Level: character_map should be dict[str, list[int]]. It maps one Unicode character (the "idea") to one or more GIDs (the "ink").

Physical Level: character_widths should be keyed by GID (int), not by the character string.

  2. Why this solves both problems
    Goal | How your "Ideal" logic handles it
    Appearance Streams | You use a shaper (or logic) to pick the correct GID from the list. Since your widths are keyed by GID, you get the exact pixel-perfect spacing for that specific shape.
    Text Extraction | You use the /ToUnicode map (which you build by reversing your GID map). When the viewer sees a GID, it looks up which "Idea" (Character) it belongs to.
  3. Non-CID Fonts (The Fallback)
    For "Simple" fonts (Type 1 or standard TrueType with WinAnsi), the character_map is effectively a 1-to-1 dictionary. In this case, your list[int] just happens to have a length of 1. By treating everything as a list of potential GIDs, your code becomes universal.

  4. The "Key" Realization
    You said: "character_widths should be keyed by the values of character_map."

Bingo. If character_map values are GIDs (integers), and character_widths keys are GIDs (integers), you have a bulletproof system.

If the user enters Arabic Meem, your logic knows that for this specific instance, you are using GID 502.

You look up character_widths[502].

The appearance stream stays perfect.

The /ToUnicode map still tells the PDF viewer that GID 502 = "Meem", so copy-paste still works.

The "pypdf" Disconnect
The reason the pypdf code you shared looks "incorrect" to you is that it tries to force the GID back into a Character key.

It does character_widths[ord_map[cid]] = width.

If ord_map[501], 502, and 503 all point to the same character, they keep overwriting each other in that dictionary.

Result: pypdf loses the unique widths of the contextual shapes.

Your Path Forward
To implement this "Slowly but Completely," you should modify your Font class to store:

unicode_to_gids: dict[str, list[int]]

widths: dict[int, int] (Keyed by GID)
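To make the overwriting problem concrete, here is a minimal sketch. The GIDs 501 to 503 are the ones from the example above; the widths are made-up numbers, and `gid_to_unicode`, `gid_widths`, and `lossy_widths` are hypothetical names rather than pypdf attributes.

```python
from __future__ import annotations

# Three contextual forms of the same Arabic letter (widths are invented).
MEEM = "\u0645"
gid_to_unicode = {501: MEEM, 502: MEEM, 503: MEEM}
gid_widths = {501: 520, 502: 610, 503: 480}

# Keyed by Unicode code point: each glyph overwrites the previous one,
# so only the width of the last glyph survives.
lossy_widths: dict[str, int] = {}
for gid, char in gid_to_unicode.items():
    lossy_widths[char] = gid_widths[gid]

# Keyed by GID: every width is kept, and the reverse map lists all
# candidate glyphs for a code point so a shaper can pick one.
unicode_to_gids: dict[str, list[int]] = {}
for gid, char in gid_to_unicode.items():
    unicode_to_gids.setdefault(char, []).append(gid)

print(lossy_widths)      # one entry only
print(unicode_to_gids)   # all three glyphs remain addressable
```

The sketch deliberately leaves out shaping itself; picking the right GID from the list would still need a text shaper.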

@stefan6419846
Collaborator

I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.

@PJBrs
Contributor Author

PJBrs commented Feb 26, 2026

> I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.

OK, I'll forge on then, using this PR for note taking. I'm finding now that it's not even very wrong, it's just overly simplistic.

What I've learned in the interim:

Font.encoding and Font.character_map can be said to map the same thing: decoded text (int in the context of simple fonts, str in the context of CID fonts) to Unicode code points. (EDIT: like the /ToUnicode dict in a font resource.) This is helpful for extracting text from PDFs and it can also map different glyph substitutes to single Unicode code points, e.g., for Arabic text. For this reason, character_widths actually is wrong, because it maps Unicode code points to widths, which does not work if multiple glyphs map to the same Unicode code point (e.g., Arabic).

Furthermore, it doesn't really work for the reverse logic of producing text. In this PR, I think that I populated character_map in from_truetype_font_file in reverse, mapping Unicode code points to character IDs. Otherwise, it actually seems to work, with the caveat that the character_widths are lossy when one Unicode code point maps to multiple glyphs.

So, for the purposes of abstraction I could actually merge map_dict and encoding without losing any information or functionality:

Technically, you can merge them into a single abstraction, but with one critical architectural "gotcha": Collision Handling.

In a merged structure, you are essentially creating a unified Character Code → Unicode lookup table. However, because encoding and map_dict represent two different layers of the PDF spec, merging them requires a specific priority logic.

The Unified Abstraction
If you merge them, your new structure would look like this:

Why you have to be careful
The PDF specification (specifically §5.9.1 in version 1.7) states that if a /ToUnicode map exists, it supersedes the encoding for those specific characters.

If you simply combine them into one dictionary, you must ensure the merge follows these rules:

Type Consistency: encoding uses integers, while map_dict uses strings (often chr(x) for 1-byte codes). You would need to decide on a consistent key type—likely strings—to handle both simple 8-bit codes and multi-byte CID codes.

The "Identity" Problem: In some CID fonts (Identity-H), the "character code" is actually a Glyph ID (GID). In these cases, the encoding is often just a dummy "Identity" map, and the map_dict is the only source of truth. Merging them blindly might lead to using a raw GID as a character if the map_dict is missing an entry.

The Overlap: As seen in the get_encoding function in your snippet:

The code already attempts a form of "syncing" between the two.

The Verdict
Yes, you can merge them, provided your abstraction follows a "Shadowing" pattern:

Initialize your map with the encoding (Base).

Overwrite/Update with map_dict (ToUnicode).

Ensure all keys are converted to a common type (e.g., str representing the raw byte sequence).

By doing this, you've essentially created a Virtual Font Map. This is actually how many high-level PDF text extractors (like pdfplumber or fitz/PyMuPDF) handle it internally to simplify the text reconstruction process.
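The "Shadowing" pattern described above can be sketched in a few lines. This is an illustration under assumptions, not pypdf's implementation: `merge_font_maps` is a hypothetical helper, `encoding` is assumed to map integer character codes to Unicode strings, and `map_dict` is assumed to map string-typed character codes to Unicode strings.

```python
from __future__ import annotations


def merge_font_maps(
    encoding: dict[int, str],
    map_dict: dict[str, str],
) -> dict[str, str]:
    """Merge a base encoding with /ToUnicode-style entries."""
    # Base layer: normalise integer codes to one-character string keys.
    merged = {chr(code): char for code, char in encoding.items()}
    # Shadow layer: /ToUnicode entries win on collision. Non-string keys
    # are skipped, assuming they are bookkeeping rather than codes.
    merged.update(
        {code: char for code, char in map_dict.items() if isinstance(code, str)}
    )
    return merged


encoding = {65: "A", 66: "B"}
map_dict = {"B": "\u03b2", "\u0100": "fi"}
print(merge_font_maps(encoding, map_dict))
```

Here the entry for code "B" comes from `map_dict`, while "A" falls back to the base encoding, which mirrors the priority rule quoted from the specification.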

It would seem to me that the underlying pypdf font / encoding / character_map architecture can be improved in three ways:

  1. Merge encoding and character_map to one attribute.
  2. Add a reverse character_map
  3. Have character_widths map all glyphs <-- This information might already have been present in the code that I removed earlier when merging all font code. sigh Then again, it wasn't as essential for text extraction, and reverting the logic shouldn't be too hard.

For this PR, it doesn't matter too much, I can just clean up the logic in character_map and then it should work for fonts where the CIDtoGID map is contiguous (this is another caveat).

@PJBrs
Contributor Author

PJBrs commented Feb 26, 2026

@stefan6419846 OK, final verdict for now - character_widths need to be keyed by CID for CID fonts or by character code for simple fonts. My decision to use the character widths code from the new Font class was unfortunately incorrect. I can still fix the above PR, I think, at least logically, and it will also mostly work, but not for any text that needs to be run through a text shaper.

(I now, finally, understand that many fonts contain glyphs without a unicode code point, such as ligatures. You cannot address these using a unicode code point, and you also cannot get their widths through a unicode code point. Instead, you need to read their widths and glyphs by CID (for CID fonts) / character code (for simple fonts). This was what the old build_font_width_map code did in _cmap.py.)
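For CID fonts, widths keyed by CID are what the /W array of a CIDFont dictionary stores. A minimal sketch of reading such an array, using plain Python lists as a stand-in for pypdf's array objects (`widths_by_cid` is a hypothetical helper, not the old build_font_width_map code):

```python
from __future__ import annotations


def widths_by_cid(w_array: list) -> dict[int, int]:
    """Expand a /W array into a mapping from CID to glyph width."""
    widths: dict[int, int] = {}
    i = 0
    while i < len(w_array):
        first = w_array[i]
        second = w_array[i + 1]
        if isinstance(second, list):
            # Form "c [w1 w2 ...]": consecutive CIDs starting at c.
            for offset, width in enumerate(second):
                widths[first + offset] = width
            i += 2
        else:
            # Form "c_first c_last w": one width for a whole CID range.
            width = w_array[i + 2]
            for cid in range(first, second + 1):
                widths[cid] = width
            i += 3
    return widths


print(widths_by_cid([1, [500, 600], 10, 12, 250]))
```

A ligature glyph without a Unicode code point still gets an entry here, because the key is the CID and not the character.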

I'll fix this PR according to the new logic, but then I'll revert the character widths to the old logic that was in _cmap.py, port the old text extraction code back to it (should be simple) and port the layout text extraction code and the appearance stream code to it (will be harder). And that will be a good basis for generating Arabic text.

@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 6919ab5 to 744c593 Compare February 26, 2026 21:59
@PJBrs PJBrs marked this pull request as ready for review February 26, 2026 21:59
@PJBrs
Contributor Author

PJBrs commented Feb 28, 2026

Something's still weird, font is not listed as embedded...

@PJBrs PJBrs marked this pull request as draft February 28, 2026 16:59
@PJBrs
Contributor Author

PJBrs commented Mar 1, 2026

OK, I may have reached somewhat of a breakthrough. I can now fully embed a font and associated font resource and encode new text. I needed to add a character_map after all, but not in the way that I thought.

I can also do so while reusing a compatible font resource. Main remaining problems include:

  • I'm adding a new font resource for every annotation, and that really slows things down
  • Loads of other stuff that I can't quite remember right now.

@PJBrs
Contributor Author

PJBrs commented Mar 1, 2026

I think that this PR now starts to get somewhat useful. As far as I can tell, it now no longer matters whether I create a new /FontDescriptor resource or use the old one. Also, visual text now corresponds with copy-pasted text.

I'm going to change the API a little bit so that I can actually embed a font from a TTF file using writer.add_font(). That way, it is easier to test the new font methods and different encodings.

Tests all fail because I didn't fix the new test after adding a character_map. Probably needs another week of work.

@PJBrs
Contributor Author

PJBrs commented Mar 2, 2026

I fixed the test, now I'm just filling the form and then extracting the text using PdfReader. This is nice, because nothing changed in the code for text extraction, which means that the new code would be logically the same as the old.

PJBrs added 17 commits March 12, 2026 21:46
This patch more comprehensively tries to detect font flags. Furthermore,
it adds some checks to deal with missing tables in truetype fonts. It
is a bit of a question what to do when the cmap itself is missing. In
this version, we just continue, but perhaps we should raise a warning
or even an error, because, in practice, it would mean that the font
that results isn't usable.
This patch adds a test and a file with some sample font resources that
all have specific font flags and/ or specific missing tables, to test
all the if conditions in _font.py. The font resources were added using
pypdf itself, and lifted from pdf files used as part of the current
test suite.
This patch adds a method to produce a pdf font descriptor resource.
For now, we assume that an embedded font file will be a TrueType font.
This enables generating a new unicode font resource in case of
text widget values that cannot be encoded with existing font
resources.
Also refactor to reduce complexity

This patch adds back some code that got removed earlier and,
at that time, did not see any test coverage. With new code
that enables adding fonts, I've finally understood that, in
some cases, a -1 key will be added to font.character_map.
This will cause an encoding failure when generating
font_glyph_byte_map.
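
The failure mode in the last commit message can be illustrated with a small sketch. It assumes that the -1 entry is an integer bookkeeping key sitting next to string-typed character codes, and that codes are encoded as UTF-16BE; `build_glyph_byte_map` is a hypothetical helper, not the function in the PR.

```python
from __future__ import annotations


def build_glyph_byte_map(character_map: dict) -> dict[str, bytes]:
    """Map each Unicode string to the bytes of its character code."""
    glyph_bytes: dict[str, bytes] = {}
    for code, char in character_map.items():
        if not isinstance(code, str):
            # An int key such as -1 has no .encode() and would raise
            # AttributeError, so it is skipped here.
            continue
        glyph_bytes[char] = code.encode("utf-16-be")
    return glyph_bytes


print(build_glyph_byte_map({-1: 2, "A": "A"}))
```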
@PJBrs
Contributor Author

PJBrs commented Mar 12, 2026

This is nearing completion. A couple of points.

First, we should only escape parentheses if we encode literal strings, not when we use Hexadecimal Strings.

Second, I still need to change the font resource name to avoid clashes.

Third, I notice that font changes when filling out forms only show up when I flatten text...
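
On the first point, a minimal sketch of why escaping only applies to literal strings: in a literal string the parentheses are delimiters, whereas a hexadecimal string carries every byte as two hex digits. `literal_string` and `hex_string` are hypothetical helpers, not functions from pypdf.

```python
def literal_string(data: bytes) -> bytes:
    """Serialise bytes as a PDF literal string, escaping \\, ( and )."""
    escaped = (
        data.replace(b"\\", b"\\\\")  # backslash first, to avoid double escaping
        .replace(b"(", b"\\(")
        .replace(b")", b"\\)")
    )
    return b"(" + escaped + b")"


def hex_string(data: bytes) -> bytes:
    """Serialise bytes as a PDF hexadecimal string; nothing to escape."""
    return b"<" + data.hex().upper().encode("ascii") + b">"


print(literal_string(b"a(b)"))
print(hex_string(b"\x00("))
```

Escaping the parenthesis before hex-encoding would add a backslash byte to the string and so change the character codes that end up in the content stream.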

@PJBrs PJBrs marked this pull request as ready for review March 13, 2026 11:01
@PJBrs
Contributor Author

PJBrs commented Mar 13, 2026

@stefan6419846

This, I hope, is ready for review now.

The main infrastructure that I added is all in _font.py, both for initialising a Font from an embedded font file, and for producing a CID font resource when we have a font file.

The second bit of infrastructure is all in _appearance_stream.py, where we now better check whether we actually can encode some text, and if not, we try to create a better font resource that can. I also noticed that we should no longer escape parentheses when we use a CID font resource, so I fixed that as well. I also refactored some code here, to put all code dealing with getting font_name and font_resource in one routine. This was necessary to have all the required information in place before we can decide whether to escape parentheses.

The third bit of infrastructure is in _writer.py. Especially when we add new font resources, we need to make several bits indirect objects, such as the font descriptor, the font file and the /ToUnicode stream. We cannot do this in _appearance_stream.py, because we don't have a PdfWriter object in there. Here, again, I refactored some code, both to reduce code complexity and to remove some superfluous code.

I ended up also adding a very small add_font() method to PdfWriter, both for testing convenience and to finally make the option of setting a font when filling forms more useful. Usage example:

from pypdf import PdfReader, PdfWriter


def _make_pdf_pypdf(fields: dict, src_pdf: str, basename: str, flatten: bool = False) -> None:
    """
    Write the dictionary values to the PDF. Supports text and checkboxes.

    Does so by updating each individual annotation with the contents of the fields.
    """
    reader = PdfReader(src_pdf)
    # get_fields() returns None when the document has no form fields.
    form_fields = reader.get_fields() or {}
    writer = PdfWriter()
    writer.append(reader)
    # add_font() is the new method from this PR: font file, resource name, target font dictionary.
    writer.add_font("/usr/share/fonts/TTF/Kalam-Regular.ttf", "/Kalam", writer._root_object["/AcroForm"]["/DR"]["/Font"])

    for key, value in fields.items():
        if key in form_fields:
            # The tuple is (text value, font name, font size).
            writer.update_page_form_field_values(
                writer.pages[0], {key: (value, "/Kalam", 0)},
                auto_regenerate=False, flatten=flatten,
            )
    if flatten:
        # Remove the widgets once their appearance has been flattened into the page.
        writer.remove_annotations(subtypes="/Widget")

    with open(f"{basename}.pdf", "wb") as output_stream:
        writer.write(output_stream)

Almost all of the PR is covered by tests, barring a couple of lines in _appearance_stream.py. I can add a test for these later as well.

I did notice one thing - when setting a font while filling out a form, the result only shows when flattening the annotations, so I need to look into that as well. EDIT This was much easier to fix than I thought, see last patch. On the bright side, I could manually add a font and use it to fill a form, which is nice!

I did use Gemini for some of the code, especially the code that produces the /ToUnicode stream and the code that produces the /W array in _font.py.
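
For reviewers unfamiliar with the format, this is roughly what producing a /ToUnicode CMap involves. The sketch below is not the code in _font.py; `to_unicode_cmap` is a hypothetical helper that assumes two-byte character codes and emits only bfchar entries.

```python
from __future__ import annotations


def to_unicode_cmap(cid_to_unicode: dict[int, str]) -> str:
    """Build the text of a /ToUnicode CMap for two-byte character codes."""
    entries = [
        f"<{cid:04X}> <{text.encode('utf-16-be').hex().upper()}>"
        for cid, text in sorted(cid_to_unicode.items())
    ]
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000> <FFFF>",
        "endcodespacerange",
    ]
    # A single bfchar section may hold at most 100 entries.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines)


print(to_unicode_cmap({1: "A", 2: "\u0645"}))
```

Target values are written as UTF-16BE, so a ligature glyph can map to several code points in a single entry.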

In the end, this PR did get quite elaborate. I can also split it by removing the fix and refactoring associated with escaping parentheses, and the comments I added to _cmap.py are entirely unnecessary, although they do help understanding what is going on in there.

@PJBrs
Contributor Author

PJBrs commented Mar 23, 2026

@stefan6419846 Should I divide this PR into smaller sets of commits? I could also start with just the code that initialises a font from file.

@stefan6419846
Collaborator

The smaller and more self-explanatory a PR is, the higher the chances of getting it merged. At the moment, reviewing larger, non-trivial PRs might take a bit longer due to other more important stuff - like the current "flood" of security-related issues.

Development

Successfully merging this pull request may close these issues.

Corrupted unicode characters in form field

2 participants