PyPDF2 forever spinning at 100% CPU #1285
Closed
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
nf-performance: Non-functional change: Performance
nf-security: Non-functional change: Security
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow
I want to read this PDF file, but PyPDF2 hangs while reading it, spinning at 100% CPU indefinitely.
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3
Code + PDF
This is a minimal, complete example that shows the issue:
PDF used above: The lean times in the Peruvian economy.pdf
Output of the script
At this point, the script started spinning at 100% CPU and kept doing so for more than half an hour, until I terminated it manually.
Preliminary code analysis
The code is spinning in this loop:
https://github.com/py-pdf/PyPDF2/blob/84460f54aa4721db36452fe510f8063838e358d5/PyPDF2/_cmap.py#L273-L282
with a very large value of b = 438093348969. After roughly one minute, a had grown by 9430662, suggesting this loop would run for more than 32 days. For any other page in this PDF, b never exceeds 0xFFFD, which would make this loop finish in about 0.4s.
The lack of comments and the non-descriptive variable names prevent any further debugging attempts from my side, but hopefully this gives the maintainers a hint as to what they should be looking at.
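For reference, the runtime estimates above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the loop advances a by one per iteration from near zero up to b, and that the rate measured during the first minute stays constant. The constant and function names are made up for illustration and do not come from PyPDF2.

```python
# Figures reported in this issue.
B_PROBLEM_PAGE = 438_093_348_969    # value of `b` on the page that hangs
B_OTHER_PAGES_MAX = 0xFFFD          # largest `b` seen on any other page
ITERATIONS_PER_MINUTE = 9_430_662   # growth of `a` observed in roughly one minute


def estimated_seconds(upper_bound: int,
                      rate_per_minute: int = ITERATIONS_PER_MINUTE) -> float:
    """Estimate the loop duration in seconds.

    Assumption: `a` starts near 0, increases by 1 per iteration,
    and the iteration rate stays constant.
    """
    return upper_bound / rate_per_minute * 60


problem_days = estimated_seconds(B_PROBLEM_PAGE) / 86_400  # seconds per day
other_seconds = estimated_seconds(B_OTHER_PAGES_MAX)

print(f"problem page: about {problem_days:.1f} days")
print(f"other pages:  about {other_seconds:.2f} s")
```

The gap between the two results is about seven orders of magnitude, which is why only this one page appears to hang.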