ENH: Process XRefStm by pubpub-zz · Pull Request #1297 · py-pdf/pypdf

pubpub-zz · 2022-08-28T22:00:58Z

Fixes #1273
Fixes #1279
Fixes #1292
Fixes #1294
Fixes #1295

ROB: Cope with xref starting on \r\n
ROB: Escaped octal code followed by decimal int
ROB: Cope with some corrupted entries in xref table
ROB: Extend xref autorepair cases
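
To illustrate the "escaped octal code followed by decimal int" case: in a PDF literal string, a backslash may be followed by one to three octal digits, so an input such as `\18` is the octal escape `\1` followed by the literal character `8`. The following is a minimal sketch of that rule, not the code from this PR. The helper name `decode_octal_escapes` is hypothetical, and it deliberately ignores every other escape sequence (`\n`, `\\`, `\(` and so on).

```python
OCTAL_DIGITS = b"01234567"


def decode_octal_escapes(data: bytes) -> bytes:
    """Hypothetical helper: decode only backslash + 1-3 octal digits."""
    out = bytearray()
    i = 0
    while i < len(data):
        is_backslash = data[i] == 0x5C
        if is_backslash and i + 1 < len(data) and data[i + 1] in OCTAL_DIGITS:
            j = i + 1
            # Consume at most three octal digits; '8' and '9' end the escape.
            while j < len(data) and j < i + 4 and data[j] in OCTAL_DIGITS:
                j += 1
            # Keep only the low-order byte if the value exceeds 255.
            out.append(int(data[i + 1:j], 8) & 0xFF)
            i = j
        else:
            # Any other byte (including other escapes) is copied unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)
```

With this sketch, `decode_octal_escapes(b"\\18")` yields the byte `0x01` followed by the character `8`, instead of failing on the non-octal digit.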

fixes py-pdf#1295 includes test file adjustment

codecov · 2022-08-29T07:42:33Z

Codecov Report

Merging #1297 (4edf6f8) into main (3b74312) will decrease coverage by 0.40%.
The diff coverage is 82.01%.

@@            Coverage Diff             @@
##             main    #1297      +/-   ##
==========================================
- Coverage   95.07%   94.67%   -0.41%     
==========================================
  Files          30       30              
  Lines        4973     5106     +133     
  Branches     1023     1052      +29     
==========================================
+ Hits         4728     4834     +106     
- Misses        139      157      +18     
- Partials      106      115       +9

| Impacted Files | Coverage Δ | |
|---|---|---|
| PyPDF2/_reader.py | `89.49% <72.52%> (-2.19%)` | ⬇️ |
| PyPDF2/_page.py | `94.36% <100.00%> (+<0.01%)` | ⬆️ |
| PyPDF2/_writer.py | `91.04% <100.00%> (-0.51%)` | ⬇️ |
| PyPDF2/generic/_base.py | `100.00% <100.00%> (+1.02%)` | ⬆️ |
| PyPDF2/generic/_utils.py | `100.00% <100.00%> (ø)` | |
| PyPDF2/types.py | `100.00% <100.00%> (ø)` | |
| PyPDF2/_codecs/adobe_glyphs.py | `100.00% <0.00%> (ø)` | |
| ... and 1 more | | |

pubpub-zz · 2022-08-29T07:54:04Z

@MartinThoma,
Ready for review

pubpub-zz · 2022-08-29T09:25:06Z

Standby

fixes py-pdf#1279 / Status_v1_Reviewers-Guide.pdf

fixes py-pdf#1294 and may be others

fix py-pdf#1292

* if the chained xref/trailer are not good
* if the object header ('id' 'gen' obj) is not good, or if the object is not present in the xref table, search the file for the object

fixes py-pdf#1273
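
The fallback described above (searching the file for an object when the xref table cannot be trusted) can be sketched as a scan for the `id gen obj` header in the raw bytes. This is an illustrative sketch under stated assumptions, not the implementation merged in this PR: the function name `find_object_offset` is hypothetical, and returning the first match is a simplification.

```python
import re
from typing import Optional


def find_object_offset(data: bytes, idnum: int, generation: int) -> Optional[int]:
    """Hypothetical helper: locate 'idnum generation obj' in raw PDF bytes."""
    # The negative lookbehind prevents "12 0 obj" from matching idnum=2.
    pattern = re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (idnum, generation))
    match = pattern.search(data)
    # Return the byte offset of the header, or None if it is not in the file.
    return match.start() if match else None
```

A caller could use the returned offset in place of a missing or corrupted xref entry. A file with incremental updates may contain several copies of the same object, so which match to prefer is a design decision this sketch leaves open.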

MartinThoma · 2022-08-31T20:20:04Z

+        reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))
        reader.xmp_metadata
-    assert exc.value.args[0].startswith("XML in XmpInformation was invalid")
+    assert exc.value.args[0].startswith("Stream length not defined")


Why did this change? I guess the reader.xmp_metadata isn't even touched, is it?

Before this PR, one could at least get the number of pages:

assert len(reader.pages) == 5

I guess with this PR it no longer works?

I had to modify the test result. I did not analyze further

> Before this PR, one could at least get the number of pages:
>
> assert len(reader.pages) == 5
>
> I guess with this PR it no longer works?

under analysis

The PDF was corrupted: the XRef object had a corrupted /Length key. I've changed the code to discard the unreadable XRef object so that the main program can recover as much information as possible: you can now get the metadata 😊
Access to the number of pages is (still?) possible.
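
The recovery strategy described here (drop an unreadable xref section and keep going with whatever else can be read) can be sketched generically. Everything in this block is an assumption made for illustration: the `load_xref_sections` name, the simplified `XrefTable` type, the use of `ValueError` as the failure signal, and the newest-first ordering are not taken from the PyPDF2 code.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Simplified stand-in for an xref section: object id -> byte offset.
XrefTable = Dict[int, int]


def load_xref_sections(
    loaders: Iterable[Callable[[], XrefTable]],
) -> Tuple[XrefTable, List[str]]:
    """Hypothetical helper: merge xref sections, skipping unreadable ones.

    Loaders are assumed to be ordered newest first, so the first offset
    seen for an object id is the one that is kept.
    """
    merged: XrefTable = {}
    warnings: List[str] = []
    for load in loaders:
        try:
            section = load()
        except ValueError as exc:
            # Discard the broken section instead of aborting the whole read.
            warnings.append(f"xref section discarded: {exc}")
            continue
        for obj_id, offset in section.items():
            merged.setdefault(obj_id, offset)
    return merged, warnings
```

The point of the sketch is the trade-off: a strict reader fails on the first corrupted section, while a lenient one records a warning and still exposes the objects it could locate.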

discard non readable XRef object to try to do your best

pubpub-zz · 2022-08-31T21:36:49Z

I had to merge iss_1292 to have a global PR.

this PR is now complete

complete TODO test

Co-authored-by: Martin Thoma <info@martin-thoma.de>

pubpub-zz · 2022-09-03T07:54:57Z

5 sec before me 😝

MartinThoma · 2022-09-03T08:01:05Z

I'll look into applying black automatically in the CI as an extra commit today 😄

Also, I want to make flake8 run in parallel to the tests and mypy after pytest so that I can still see issues there in a failed run.

pubpub-zz · 2022-09-03T08:06:16Z

I don't think it's worth it.
The missing line came from the code review.
One thing I've noticed is that the 3.10 check is performed twice. Do you know why? (Asking for energy saving.)

MartinThoma · 2022-09-03T08:09:09Z

> One thing I've noticed is that 3.10 check is performed twice.

It's a different test scenario. pycryptodome is removed in that test run.

Version 2.10.5, 2022-09-04
--------------------------

New Features (ENH):
- Process XRefStm (#1297)
- Auto-detect RTL for text extraction (#1309)

Bug Fixes (BUG):
- Avoid scaling cropbox twice (#1314)

Robustness (ROB):
- Fix offset correction in revised PDF (#1318)
- Crop data of /U and /O in encryption dictionary to 48 bytes (#1317)
- MultiLine bfrange in cmap (#1299)
- Cope with 2 digit codes in bfchar (#1310)
- Accept '/annn' charset as ASCII code (#1316)
- Log errors during Float / NumberObject initialization (#1315)
- Cope with corrupted entries in xref table (#1300)

Documentation (DOC):
- Migration guide (PyPDF2 1.x ➔ 2.x) (#1324)
- Creating a coverage report (#1319)
- Fix AnnotationBuilder.free_text example (#1311)
- Fix usage of page.scale by replacing it with page.scale_by (#1313)

Developer Experience (DEV):
- Only run coverage for PyPDF2

Maintenance (MAINT):
- PdfReaderProtocol (#1303)
- Throw PdfReadError if Trailer can't be read (#1298)
- Remove catching OverflowException (#1302)

Full Changelog: 2.10.4...2.10.5

pubpub-zz added 2 commits August 28, 2022 23:59

ENH : Process XRefStm

2dc76c0

fixes py-pdf#1295 includes test file adjustment

mypy

1c767d2

pubpub-zz added 12 commits August 29, 2022 13:32

ROB : cope with xref starting on \r\n

147b69e

fixes py-pdf#1279 / Status_v1_Reviewers-Guide.pdf

ROB : escaped octal code followed by decimal int

bc4cbc8

fixes py-pdf#1294 and may be others

PERF: simplify criteria

7bc1691

add test

bb8e317

ROB : cope with some corrupted entries in xref table

e6c1d7a

fix py-pdf#1292

flake8

55d58ae

mypy + improv

0a342b1

typo

0bb8079

mypy2

1bf4b61

typo

964079e

mypy 3

041ea87

ROB : extend xref autorepair cases

9e97efc

* if the chained xref/trailer are not good
* if the object header ('id' 'gen' obj) is not good, or if the object is not present in the xref table, search the file for the object

fixes py-pdf#1273

pubpub-zz mentioned this pull request Aug 31, 2022

TypeError: 'NumberObject' object is not subscriptable #1273

Closed

MartinThoma reviewed Aug 31, 2022

pubpub-zz added 3 commits August 31, 2022 22:51

Merge branch 'iss_1292' into XRefStm

0540730

adjust and extend test

9854643

ROB : improve extraction

1051f65

discard non readable XRef object to try to do your best

pubpub-zz added 6 commits August 31, 2022 23:38

Merge branch 'main' into XRefStm

9aa019a

flake8 + merge

8cc00ae

cope with stream without getbuffer() + restore tst_xmp

26678b1

mypy

44bf5bd

extend test

1b17f72

complete TODO test

Merge branch 'iss_1294' into XRefStm

091e62f

pubpub-zz mentioned this pull request Sep 1, 2022

KeyError: b'1' in read_string_from_stream #1294

Closed

MartinThoma changed the title ~~ENH : Process XRefStm~~ ENH: Process XRefStm Sep 2, 2022

MartinThoma reviewed Sep 2, 2022

Comment thread tests/test_xmp.py Outdated

MartinThoma reviewed Sep 2, 2022

Comment thread tests/test_reader.py

MartinThoma reviewed Sep 2, 2022

Comment thread tests/test_reader.py Outdated

pubpub-zz and others added 2 commits September 3, 2022 09:41

Update tests/test_reader.py

b71c071

Co-authored-by: Martin Thoma <info@martin-thoma.de>

Update tests/test_reader.py

cfa33ee

Co-authored-by: Martin Thoma <info@martin-thoma.de>

MartinThoma reviewed Sep 3, 2022

Comment thread tests/test_reader.py

Flake8: Add missing newline

27fbc7f

rollback iaw review

4edf6f8

MartinThoma merged commit 1252a49 into py-pdf:main Sep 3, 2022

pubpub-zz deleted the XRefStm branch September 3, 2022 19:53
