Impossible to decode XFormObject ... #1269

@DL6ER

I'm trying to convert a large number of random PDFs found on the web to plain text for further analysis of potential statistical abnormalities.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("p.a. trento.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)
  full_content = " ".join([page.extractText() for page in pdfreader.pages])

PDF used above: p.a. trento.pdf
Another example: qqplots.pdf

I can search for further files with errors, if needed (the two examples above are both plot files). I will obviously participate in testing and verifying any proposed bugfixes.

Traceback

There is no crash; however, there are 4164 warnings like

 impossible to decode XFormObject /M0
[...]
 impossible to decode XFormObject /M3
[...]
 impossible to decode XFormObject /M5
[...]
 impossible to decode XFormObject /F1-DejaVuSans-minus

What do I expect?

I'd like to just get the text without flooding my log file with warnings (during a simple test on a few hundred files, the log file grew into the gigabytes).
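
As a possible stopgap until the warnings are fixed at the source, the log volume could be reduced on the caller's side with the standard `logging` module. The sketch below assumes that PyPDF2 emits these messages through `logging` under loggers whose names start with `PyPDF2` (for example `PyPDF2._page`); that naming is an assumption here, not something confirmed in this report. `quiet_logger` and `DropXFormWarnings` are hypothetical helper names introduced only for illustration.

```python
import logging


def quiet_logger(name: str = "PyPDF2", level: int = logging.ERROR) -> logging.Logger:
    """Raise the threshold of a logger so lower-severity records are dropped.

    Assumption: the library logs under `name` or one of its child loggers.
    Child loggers without their own level inherit this threshold.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


class DropXFormWarnings(logging.Filter):
    """Handler-level filter that discards only the XFormObject messages."""

    NEEDLE = "impossible to decode XFormObject"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; all other records pass through.
        return self.NEEDLE not in record.getMessage()


if __name__ == "__main__":
    # Option 1: silence every warning from the (assumed) PyPDF2 logger tree.
    quiet_logger()

    # Option 2: keep other warnings, drop only the XFormObject ones.
    # The filter is attached to a handler so it also applies to records
    # that propagate up from child loggers.
    handler = logging.StreamHandler()
    handler.addFilter(DropXFormWarnings())
    logging.getLogger().addHandler(handler)
```

The first option is coarser but needs only one call; the second keeps unrelated warnings visible. Neither changes what text is extracted, so the underlying decoding problem remains.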
