Corrupted unicode characters in form field #3361

@ematt

Hello,
I'm using pypdf to fill out a form and generate a printable PDF. Everything works fine, except when I use Unicode strings: the text appears corrupted in the output PDF, regardless of the PDF viewer I use. I tried Adobe Reader, SumatraPDF and Brave.

Image
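
A possible first check, offered as a sketch rather than a diagnosis: list which characters in each field value fall outside a single-byte Western encoding. This assumes the form's text fields use a font with a WinAnsi-style encoding, which the report does not confirm. Python's `cp1252` codec is used here as a stand-in for that encoding, and `chars_outside_encoding` is a hypothetical helper, not part of pypdf.

```python
# Hypothetical diagnostic helper; not part of pypdf.
# Assumption: the form font covers roughly what cp1252 can encode.
def chars_outside_encoding(text: str, encoding: str = "cp1252") -> list[str]:
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# A few of the values from the report, keyed by field name.
sample = {
    "cnp_cui": "123456789",
    "localitatea": "Comuna Roșia-Nouă",
    "adresa_judet": "Конференция",
}

for name, value in sample.items():
    # An empty list means every character fits the assumed encoding.
    print(name, chars_outside_encoding(value))
```

If only the values that report missing characters render incorrectly, that would point at font or encoding coverage rather than at the stored field value itself.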

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
# Windows-10-10.0.26100-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
# pypdf==5.7.0, crypt_provider=('cryptography', '45.0.5'), PIL=11.3.0

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO

import pypdf
from pypdf.generic import NameObject, NumberObject, BooleanObject, IndirectObject
import pypdf.generic
import pypdf.types


data = {
    "subsemnatul": "Σὲ γνωρίζω ἀπὸ τὴν κόψη",
    "cnp_cui": "123456789",
    "localitatea": "Comuna Roșia-Nouă",
    "strada": "Căpitan Nicolae Licăreț",
    "adresa_nr": "12",
    "adresa_bl": "A",
    "adresa_sc": "1",
    "adresa_et": "5",
    "adresa_ap": "123",
    "adresa_judet": "Конференция",
}

# https://stackoverflow.com/a/55302753
def fill_with_pypdf(file, data):
    """
    Used to fill PDF with PyPDF.
    To fill, PDF form must have field name values that match the dictionary keys

    :param file: The PDF being written to
    :param data: The data dictionary being written to the PDF Fields
    :return: The filled and encrypted PDF as bytes
    """
    
    with open(file, "rb") as input_stream:
        pdf_reader = pypdf.PdfReader(input_stream)

        if "/AcroForm" in pdf_reader.trailer["/Root"]:
            pdf_reader.trailer["/Root"]["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

        writer = pypdf.PdfWriter(pdf_reader)
        # alter NeedAppearances
        try:
            catalog = writer._root_object
            # get the AcroForm tree and add the "/NeedAppearances" attribute
            if "/AcroForm" not in catalog:
                writer._root_object.update({
                    NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

            need_appearances = NameObject("/NeedAppearances")
            writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        except Exception as e:
            print('set_need_appearances_writer() catch : ', repr(e))

        if "/AcroForm" in writer._root_object:
            # Acro form is form field, set needs appearances to fix printing issues
            writer._root_object["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

        # loop over all pages
        for page_num in range(len(pdf_reader.pages)):
            # writer.add_page(pdf_reader.pages[page_num])
            page = writer.pages[page_num]
            # loop over annotations, but ensure they are there first...
            if page.get('/Annots'):
                # update field values
                writer.update_page_form_field_values(page, data, auto_regenerate=False)
                for j in range(0, len(page['/Annots'])):
                    writer_annot = page['/Annots'][j].get_object()
                    # mark every field read-only by setting /Ff to 1 (bit 1 = ReadOnly);
                    # this locks the fields but does not flatten them, and it
                    # overwrites any other field flags that were already set.
                    writer_annot.update({
                        NameObject("/Ff"): NumberObject(1)  # /Ff = 1: only the ReadOnly bit
                    })
                    
        output_stream = BytesIO()
        # lock fields
        permissions = pypdf.constants.UserAccessPermissions(
            pypdf.constants.UserAccessPermissions.PRINT | 
            pypdf.constants.UserAccessPermissions.PRINT_TO_REPRESENTATION |
            pypdf.constants.UserAccessPermissions.EXTRACT_TEXT_AND_GRAPHICS |
            pypdf.constants.UserAccessPermissions.EXTRACT
            )
        # set /NeedAppearances before writing; a call made after write() cannot affect the output
        writer.set_need_appearances_writer(True)
        writer.encrypt(
            user_password="",
            owner_password="my-secret-password",
            algorithm="AES-256",
            use_128bit=False,
            permissions_flag=permissions,
        )
        writer.write(output_stream)
        return output_stream.getvalue()

out = fill_with_pypdf("forms/CERERE INMATRICULARE form.pdf", data)

with open("output_pypdf.pdf", "wb") as f:
    f.write(out)
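
A side note on the `/Ff` assignment in the loop above, as a sketch under stated assumptions: in the PDF field-flag definition, bit position 1 is ReadOnly, so writing `NumberObject(1)` locks a field rather than flattening it, and it replaces whatever flags the field already had (for example Multiline, bit position 13). The helper `with_read_only` below is hypothetical and uses plain integers; in the reproducer it would presumably be applied to the annotation's current `/Ff` value, which may also be inherited from a parent field.

```python
# Bit positions from the PDF field-flag (/Ff) definition; positions are 1-based.
READ_ONLY = 1 << 0   # bit position 1
MULTILINE = 1 << 12  # bit position 13 (text fields)


def with_read_only(flags: int) -> int:
    """Return `flags` with the ReadOnly bit set and every other bit preserved."""
    return flags | READ_ONLY


# A multiline text field stays multiline after being locked.
existing = MULTILINE
locked = with_read_only(existing)
print(bool(locked & READ_ONLY), bool(locked & MULTILINE))
```

Whether this has any bearing on the corrupted text is not established by the report; it only keeps the locking step from changing other field behaviour.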

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

output_pypdf.pdf

CERERE INMATRICULARE form.pdf

Traceback

This is the complete traceback I see:

# TODO: Your traceback goes here (if applicable)

Labels

is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-forms: From a user's perspective, forms is the affected feature/workflow
