I'm trying to convert a large number of arbitrary PDFs found on the web to plain text for further analysis of potential statistical abnormalities.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
import PyPDF2

with open("p.a. trento.pdf", "rb") as f:
    pdfreader = PyPDF2.PdfFileReader(f, strict=False)
    full_content = " ".join([page.extractText() for page in pdfreader.pages])
PDF used above: p.a. trento.pdf
Another example: qqplots.pdf
I can search for further files with errors, if needed (the two examples above are both plot files). I will obviously participate in testing and verifying any proposed bugfixes.
Traceback
There is no crash; however, 4164 warnings like the following are emitted:
impossible to decode XFormObject /M0
[...]
impossible to decode XFormObject /M3
[...]
impossible to decode XFormObject /M5
[...]
impossible to decode XFormObject /F1-DejaVuSans-minus
What do I expect?
I'd like to get just the text without flooding my log file with warnings (during a simple test on a few hundred files, the log file grew to gigabytes in size).
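As a possible stopgap, the warnings could be filtered on the logging side. The sketch below assumes that PyPDF2 emits these messages through the standard `logging` module under logger names beginning with `PyPDF2` (I have not verified this for every code path in 2.10.3). The child logger name `PyPDF2._page` and the messages are stand-ins used only to illustrate the effect without needing a PDF:

```python
import io
import logging

# Capture log output in memory so the effect is visible without a real log file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.WARNING)

# Assumption: PyPDF2 logs under names such as "PyPDF2._page", so raising the
# level on the parent "PyPDF2" logger also applies to its children.
logging.getLogger("PyPDF2").setLevel(logging.ERROR)

# Stand-in for the library's internal logger (hypothetical name).
child = logging.getLogger("PyPDF2._page")
child.warning("impossible to decode XFormObject /M0")  # dropped
child.error("a real error still gets through")         # kept

captured = stream.getvalue()
```

This only hides the messages; it does not address why the XForm objects cannot be decoded in the first place.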