
BUG: Improve spacing for text extraction #806

Merged
MartinThoma merged 11 commits into main from spacing on Apr 23, 2022
Conversation

@MartinThoma
Member

No description provided.

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Apr 23, 2022
@codecov-commenter

codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #806 (fb4a895) into main (d4c8cab) will increase coverage by 0.01%.
The diff coverage is 75.00%.

@@            Coverage Diff             @@
##             main     #806      +/-   ##
==========================================
+ Coverage   75.22%   75.24%   +0.01%     
==========================================
  Files          11       11              
  Lines        3516     3522       +6     
  Branches      810      814       +4     
==========================================
+ Hits         2645     2650       +5     
  Misses        658      658              
- Partials      213      214       +1     
Impacted Files | Coverage Δ
PyPDF2/pdf.py  | 81.85% <75.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@MartinThoma MartinThoma merged commit d1be80d into main Apr 23, 2022
@MartinThoma MartinThoma deleted the spacing branch April 23, 2022 20:49
MartinThoma added a commit that referenced this pull request Apr 24, 2022
A change I would like to highlight is the performance improvement for
large PDF files (#808) 🎉

New Features (ENH):
-  Add papersizes (#800)
-  Allow setting permission flags when encrypting (#803)
-  Allow setting form field flags (#802)

Bug Fixes (BUG):
-  TypeError in xmp._converter_date (#813)
-  Improve spacing for text extraction (#806)
-  Fix PDFDocEncoding Character Set (#809)

Robustness (ROB):
-  Use null ID when encrypted but no ID given (#812)
-  Handle recursion error (#804)

Documentation (DOC):
-  CMaps (#811)
-  The PDF Format + commit prefixes (#810)
-  Add compression example (#792)

Developer Experience (DEV):
-  Add Benchmark for Performance Testing (#781)

Maintenance (MAINT):
-  Validate PDF magic byte in strict mode (#814)
-  Make PdfFileMerger.addBookmark() behave like PdfFileWriter's (#339)
-  Quadratic runtime while parsing reduced to linear (#808)

Testing (TST):
-  Newlines in text extraction (#807)

Full Changelog: 1.27.8...1.27.9
VictorCarlquist pushed a commit to VictorCarlquist/PyPDF2 that referenced this pull request Apr 29, 2022
PyPDF2 now takes positive / negative spaces between text blocks into account. Not very elegant, but the result looks way better than before.
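
To illustrate the idea in the commit message above, here is a minimal sketch of how numeric adjustments between text blocks can drive spacing. It is not the code from this PR. It assumes the PDF `TJ` convention, where numbers between strings are offsets in thousandths of a text-space unit and a negative number shifts the following glyphs further right. The function name `join_tj_items` and the `gap_threshold` default of 200 are hypothetical choices made for this example only.

```python
from typing import List, Union

# A TJ-style operand array mixes text fragments with numeric adjustments.
TJItem = Union[str, int, float]


def join_tj_items(items: List[TJItem], gap_threshold: float = 200.0) -> str:
    """Join text fragments, inserting a space for large rightward shifts.

    Assumption: numbers are in thousandths of a text-space unit and are
    subtracted from the current position, so a negative number widens the
    gap and a positive number tightens it (kerning).
    """
    parts: List[str] = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif -item > gap_threshold:
            # Wide gap: treat it as a word break, but never emit a leading
            # space or double an existing trailing space.
            if parts and not parts[-1].endswith(" "):
                parts.append(" ")
        # Small or positive adjustments are kerning and add no space.
    return "".join(parts)


print(join_tj_items(["Hello", -300, "World"]))  # wide gap becomes a space
print(join_tj_items(["W", 80, "orld"]))         # kerning only, no space
```

A fixed threshold is a crude heuristic: the right cut-off depends on the font and font size, which is consistent with the commit message's own remark that the approach is not very elegant.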