Comparing changes

We want to track performance over time only for what is actually in main. Closes #761

Closes #574 Closes #801 Co-authored-by: Craig Jones <craig@k6nnl.com>

Closes #161 Closes #308

This doesn't solve the issue, but it might make it less severe. See #520 See #268 See virantha/pypdfocr#59 sfneal@3558a69 Co-authored-by: danniesim <geemee@gmail.com>

When PdfFileReader tries to find the xref marker, the readNextEndLine method builds a line by reading the file byte by byte. Every time a new byte is read, it is concatenated with the line read so far. Because Python strings (and byte strings) are immutable, each concatenation copies the whole line, which leads to quadratic O(n²) runtime, where n is the size of the file. For files where the xref marker cannot be found at the end, this takes an enormous amount of time:

* 1 MB of zeros at the end: 45.54 seconds
* 2 MB of zeros at the end: 357.04 seconds

(measured on a laptop made in 2015)

This pull request changes the relevant section of the code to run in linear time O(n), leading to a run time of less than a second for both cases mentioned above. Furthermore, this PR adds a regression test.
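The difference between the two approaches can be illustrated with a minimal sketch. This is not the PyPDF2 code: the function names `read_line_backwards_quadratic` and `read_line_backwards_linear` are hypothetical, and the sketch assumes a seekable binary stream that is read backwards until a line break is found.

```python
import io


def read_line_backwards_quadratic(stream):
    """Hypothetical helper: read backwards to the previous line break.

    Each byte is prepended to an immutable bytes object, so the whole
    line is copied on every iteration (quadratic in the line length).
    """
    line = b""
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)  # step back onto the previous byte
        byte = stream.read(1)         # read it (moves forward again)
        stream.seek(-1, io.SEEK_CUR)  # step back so the next loop moves on
        if byte in (b"\n", b"\r"):
            break
        line = byte + line            # copies everything read so far
    return line


def read_line_backwards_linear(stream):
    """Hypothetical helper: same result, but bytes are collected in a
    list and joined once at the end (linear in the line length)."""
    chunks = []
    while stream.tell() > 0:
        stream.seek(-1, io.SEEK_CUR)
        byte = stream.read(1)
        stream.seek(-1, io.SEEK_CUR)
        if byte in (b"\n", b"\r"):
            break
        chunks.append(byte)           # amortized constant time
    chunks.reverse()                  # bytes were collected back to front
    return b"".join(chunks)


def last_line(data, reader):
    """Run one of the readers from the end of an in-memory buffer."""
    stream = io.BytesIO(data)
    stream.seek(0, io.SEEK_END)
    return reader(stream)
```

Both helpers return the same bytes; only the way the line is accumulated differs, which is what matters once the trailing line grows to megabytes.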

Closes #151

PyPDF2 now takes positive / negative spaces between text blocks into account. Not very elegant, but the result looks way better than before.

…339) People stumbled over this inconsistency:

* #40
* https://stackoverflow.com/a/42991101/562769

This was also tested with: https://stackoverflow.com/questions/42941742/pypdf2-nested-bookmarks-with-same-name-not-working/42991101#comment73249244_42991101

If no '/ID' key is present in self.trailer an array of two empty bytestrings is used in place of an '/ID'. This is how Apache PDFBox handles this case. This makes PyPDF2 more robust to malformed PDFs. Closes #608 Closes #610 Full credit for this one to Richard Millson - Martin Thoma only fixed a merge conflict Co-authored-by: Richard Millson <8217613+richardmillson@users.noreply.github.com>
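The fallback described above can be sketched as follows. This is only an illustration, not the PyPDF2 implementation: `get_document_id` is a hypothetical helper, and the trailer is assumed to behave like a plain dictionary.

```python
def get_document_id(trailer):
    """Hypothetical helper: return the trailer's '/ID' array.

    If the key is missing (a malformed PDF), fall back to an array of
    two empty byte strings instead of raising KeyError.
    """
    return trailer.get("/ID", [b"", b""])


# A trailer without '/ID' still yields a usable two-element array.
malformed_trailer = {"/Root": "1 0 R", "/Encrypt": "5 0 R"}
first_id = get_document_id(malformed_trailer)[0]
```

With this fallback, code that needs the first ID element can proceed with an empty byte string rather than failing on a missing key.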

Fix: Convert decimal to int before passing it to datetime Closes #774
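A minimal sketch of this kind of fix, assuming the date components arrive as `decimal.Decimal` values: the `datetime` constructor expects integers, so each component is converted with `int()` first. The helper name `build_datetime` is hypothetical and is not the function from the PyPDF2 source.

```python
from datetime import datetime
from decimal import Decimal


def build_datetime(year, month, day, hour, minute, second):
    """Hypothetical helper: build a datetime from numeric components.

    Components may be Decimal values; int() converts them (truncating
    any fractional part) before they are passed to datetime.
    """
    parts = [int(p) for p in (year, month, day, hour, minute, second)]
    return datetime(*parts)


parsed = build_datetime(
    Decimal("2022"), Decimal("4"), Decimal("22"),
    Decimal("10"), Decimal("30"), Decimal("5"),
)
```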

Closes #626

A change I would like to highlight is the performance improvement for large PDF files (#808) 🎉

New Features (ENH):
- Add papersizes (#800)
- Allow setting permission flags when encrypting (#803)
- Allow setting form field flags (#802)

Bug Fixes (BUG):
- TypeError in xmp._converter_date (#813)
- Improve spacing for text extraction (#806)
- Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
- Use null ID when encrypted but no ID given (#812)
- Handle recursion error (#804)

Documentation (DOC):
- CMaps (#811)
- The PDF Format + commit prefixes (#810)
- Add compression example (#792)

Developer Experience (DEV):
- Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
- Validate PDF magic byte in strict mode (#814)
- Make PdfFileMerger.addBookmark() behave like PdfFileWriters' (#339)
- Quadratic runtime while parsing reduced to linear (#808)

Testing (TST):
- Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9

Commits on Apr 22, 2022

DOC: Add compression example (#792)

MartinThoma authored Apr 22, 2022

668869f

Commits on Apr 21, 2022

Commits on Apr 23, 2022

Commits on Apr 24, 2022
