ENH: Added command update-offsets to adjust offsets and lengths. by srogmann · Pull Request #15 · py-pdf/pdfly

srogmann · 2022-08-28T10:01:38Z

This command adjusts the /Length entries of stream objects and the xref offsets
in simple PDF files (ASCII only, one xref section only) to support writing PDF
files with a text editor.
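As a rough illustration of what "adjusting xref offsets" involves, the sketch below locates the byte offset of each object header and of the `xref` keyword in a tiny hand-written PDF fragment. The names `OBJ_HEADER`, `find_object_offsets`, and `find_xref_offset` are hypothetical and are not taken from update_offsets.py; the sketch also assumes that object headers start at the beginning of a line, which is reasonable only for simple, hand-edited files.

```python
import re

# Assumption: an object header such as "4 0 obj" starts at the beginning of a line.
OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)


def find_object_offsets(pdf: bytes) -> dict:
    """Map each object number to the byte offset of its "N G obj" header."""
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(pdf)}


def find_xref_offset(pdf: bytes) -> int:
    """Byte offset of the "xref" keyword, the value that belongs after startxref."""
    m = re.search(rb"^xref\b", pdf, re.MULTILINE)
    if m is None:
        raise ValueError("no xref section found")
    return m.start()


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"xref\n0 3\n"
)
offsets = find_object_offsets(sample)
xref_offset = find_xref_offset(sample)
```

Working on bytes rather than decoded text matters here, because xref offsets count bytes from the start of the file.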

I replaced the camelCase variables with snake_case variables.

srogmann · 2024-05-24T21:12:08Z

Two hours ago I edited a larger PDF file created by

qpdf --stream-data=uncompress

I remembered this old PR and used update_offsets.py to fix the offsets. I added

len_stream = None

to fix a bug that occurred when a file contains several streams.
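The kind of bug described here can be sketched as follows: if the remembered stream length is not cleared when a new object begins, a /Length value from an earlier object leaks into a later one. This is a simplified illustration, not the code from update_offsets.py; apart from `len_stream`, all names are hypothetical, and the regular expressions assume one token group per line.

```python
import re

OBJ_HEADER = re.compile(rb"^(\d+) (\d+) obj\b")
LENGTH_ENTRY = re.compile(rb"/Length\s+(\d+)")


def declared_stream_lengths(lines):
    """Return {object number: declared /Length or None} for objects with a stream."""
    result = {}
    obj_number = None
    len_stream = None
    for line in lines:
        header = OBJ_HEADER.match(line)
        if header:
            obj_number = int(header.group(1))
            # Reset per-object state. Without this line, the /Length of an
            # earlier object would be reused for a later stream.
            len_stream = None
        length = LENGTH_ENTRY.search(line)
        if length:
            len_stream = int(length.group(1))
        if line.strip() == b"stream" and obj_number is not None:
            result[obj_number] = len_stream
    return result


lines = [
    b"4 0 obj\n", b"<< /Length 44 >>\n", b"stream\n",
    b"BT /F1 12 Tf ET\n", b"endstream\n", b"endobj\n",
    b"5 0 obj\n", b"<< >>\n", b"stream\n",
    b"q Q\n", b"endstream\n", b"endobj\n",
]
lengths = declared_stream_lengths(lines)
```

In this hand-made input, object 5 has no /Length entry, so its value stays `None` instead of inheriting 44 from object 4.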

Lucas-C · 2024-10-29T22:03:23Z

I think this would be a great addition to pdfly 👍

Are you still willing to work on this @srogmann? 🙂

srogmann · 2024-10-30T20:50:07Z

@Lucas-C
The essential work on update_offsets.py was completed in August 2022. As of August 2022, I was able to automatically supplement offsets in simple PDFs. I don't know why the PR was not merged at the time; perhaps I should have made it clearer that, in my view, it was ready.

As mentioned above, in May 2024, I remembered pdfly and used update-offsets to correct a manually edited PDF file. In my view, the PR was also ready in May 2024.

An example is in the attached file-in.pdf; I used it to test text extraction from documents with Tm operators.

With update-offsets, the xref section

xref
0 7
0000000000 65535 f 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
0000000015 00000 n 
trailer << /Size 7 /Info 2 0 R /Root 1 0 R >>
startxref
000000
%%EOF

will be converted into

xref
0 7
0000000000 65535 f 
0000000015 00000 n 
0000000081 00000 n 
0000000158 00000 n 
0000000217 00000 n 
0000000380 00000 n 
0000001005 00000 n 
trailer << /Size 7 /Info 2 0 R /Root 1 0 R >>
startxref
1086
%%EOF
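The corrected table in the listing above can be reproduced with a few lines of Python. This is a minimal sketch assuming the offsets are already known; `format_xref_entry` and `format_xref_section` are hypothetical helper names. Each entry is a 10-digit offset, a 5-digit generation number, and a flag (`n` for in use, `f` for free), followed by a two-byte end-of-line (here a space and LF) so that every entry is exactly 20 bytes long.

```python
def format_xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Format one 20-byte cross-reference entry."""
    flag = b"n" if in_use else b"f"
    return b"%010d %05d %s \n" % (offset, generation, flag)


def format_xref_section(object_offsets) -> bytes:
    """object_offsets: byte offsets of objects 1..n, in object-number order."""
    parts = [b"xref\n", b"0 %d\n" % (len(object_offsets) + 1)]
    # Object 0 is always the head of the free list, with generation 65535.
    parts.append(format_xref_entry(0, 65535, in_use=False))
    parts.extend(format_xref_entry(offset) for offset in object_offsets)
    return b"".join(parts)


section = format_xref_section([15, 81, 158, 217, 380, 1005])
print(section.decode("ascii"))
```

The fixed entry width is what allows a PDF reader to jump directly to the entry for a given object number.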

The "target audience" for update_offsets is simple PDF documents that have been manually created using an editor. It is not suitable for complex or obfuscated PDFs.

Lucas-C · 2024-11-01T21:59:46Z

Thank you for your detailed answer @srogmann 👍

I'll be happy to review & merge this PR, but could you rebase it and solve the minor merge conflict, please?

Lucas-C

Could you please:

- add a mention of the new command in README.md
- add some unit tests in tests/test_update_offsets.py

pdfly/cli.py

pdfly/update_offsets.py

srogmann · 2024-11-03T21:57:37Z

@Lucas-C
I updated this PR.

During testing, I noticed an issue (specifically with pytest on Unix). In the Makefile, the test directory is written as 'Tests' with a capital 'T', but in the filesystem, it is written in lowercase.

pdfly/cli.py

pdfly/update_offsets.py

tests/test_update_offsets.py

Lucas-C · 2024-11-04T13:16:05Z

> During testing, I noticed an issue (specifically with pytest on Unix). In the Makefile, the test directory is written as 'Tests' with a capital 'T', but in the filesystem, it is written in lowercase.

Thank you for reporting this problem 👍

Could you please fix this as part of this PR?

Lucas-C · 2024-11-04T13:17:11Z

PS: I myself wrote a similar script some time ago: https://github.com/Lucas-C/dotfiles_and_notes/blob/master/languages/python/set_pdf_xref.py

I'm really happy to include this feature in pdfly, it's a very good idea and would come in handy for many people 👍 🙂


Lucas-C · 2024-11-04T19:10:23Z

The GitHub Actions pipeline is currently failing because black, which is used to ensure a consistent code style, has not been applied to some files:

$ black --extend-exclude sample-files .
would reformat /home/runner/work/pdfly/pdfly/pdfly/cli.py
would reformat /home/runner/work/pdfly/pdfly/tests/test_update_offsets.py
would reformat /home/runner/work/pdfly/pdfly/pdfly/update_offsets.py


srogmann · 2024-11-06T22:25:34Z

@Lucas-C
I have slightly revised the implementation of update_offsets.py. The lines are now read in binary mode so that line breaks are not distorted. This is important when reading binary data in streams. As a result, a PDF file like 002-trivial-libre-office-writer.pdf now matches the original byte for byte after processing.
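The effect described here can be demonstrated with a small experiment. The sketch below is an illustration only and does not reproduce the code in update_offsets.py: it writes a few bytes with mixed line endings to a temporary file, then reads them back once in text mode and once in binary mode.

```python
import os
import tempfile

# A fragment mixing CRLF line endings with a bare CR inside the stream data.
data = b"1 0 obj\r\n<< /Length 4 >>\r\nstream\r\nq\rQ\n\r\nendstream\r\nendobj\r\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    with open(path, "wb") as f:
        f.write(data)
    # Text mode with the default newline handling translates CR and CRLF to LF.
    with open(path, "r", encoding="latin-1") as f:
        text_roundtrip = f.read().encode("latin-1")
    # Binary mode returns the bytes unchanged; keepends=True preserves each EOL.
    with open(path, "rb") as f:
        binary_lines = f.read().splitlines(keepends=True)

print(text_roundtrip == data)          # line endings were rewritten
print(b"".join(binary_lines) == data)  # byte-for-byte identical
```

Because xref offsets and /Length values count bytes, any silent change to line endings would invalidate them, which is why a byte-exact round trip is the useful property to test.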

In the tests, I have commented out four PDF documents that cannot be correctly processed with the current implementation. The current implementation is quite simple and works with regular expressions; it was originally intended for fixing up PDF documents that were hand-edited in a text editor. The more accurate the script needs to be, the more appropriate it would be to parse the tokens according to chapter 3 of the PDF specification. Technically, this is possible, but it would far exceed the original goal of my implementation.
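To make the regular-expression approach and its limits concrete, here is a minimal sketch of /Length detection that accepts any PDF white-space character and distinguishes a direct value from an indirect reference. The pattern and the names `PDF_WS`, `LENGTH`, and `parse_length` are assumptions for illustration, not the expressions used in update_offsets.py.

```python
import re

# The six PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WS = rb"[\x00\t\n\x0c\r ]"
LENGTH = re.compile(
    rb"/Length" + PDF_WS + rb"+(\d+)"
    rb"(?:" + PDF_WS + rb"+(\d+)" + PDF_WS + rb"+R)?"
)


def parse_length(dictionary: bytes):
    """Return ("direct", n), ("reference", object_number), or None."""
    m = LENGTH.search(dictionary)
    if m is None:
        return None
    if m.group(2) is not None:
        # "/Length 5 0 R": the length is stored in object 5.
        return ("reference", int(m.group(1)))
    return ("direct", int(m.group(1)))


print(parse_length(b"<< /Length 44 >>"))
print(parse_length(b"<</Length\r\n5 0 R/Filter/FlateDecode>>"))
print(parse_length(b"<< /Length1 1234 >>"))
```

A pattern like this cannot tell whether `/Length` appears inside a string, a comment, or stream data, which is the kind of case a real tokenizer would handle and a regular expression does not.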

Lucas-C · 2024-11-07T15:58:27Z

> I have slightly revised the implementation of update_offsets.py. The lines are now read in binary mode so that line breaks are not distorted. This is important when reading binary data in streams. As a result, a PDF file like 002-trivial-libre-office-writer.pdf now matches the original byte for byte after processing.

Awesome! Good job 🙂

> In the tests, I have commented out four PDF documents that cannot be correctly processed with the current implementation. The current implementation is quite simple and works with regular expressions; it was originally intended for fixing up PDF documents that were hand-edited in a text editor. The more accurate the script needs to be, the more appropriate it would be to parse the tokens according to chapter 3 of the PDF specification. Technically, this is possible, but it would far exceed the original goal of my implementation.

That's fine really.
You did an excellent job on this, thank you for your contribution! 👍

I added a commit on the branch to fix some minor typing-related issues.

## What's new

### New Features (ENH)
- New `booklet` command to adjust offsets and lengths ([PR #77](#77))
- New `uncompress` command ([PR #75](#75))
- New `update-offsets` command to adjust offsets and lengths ([PR #15](#15))
- New `rm` command ([PR #59](#59))
- `metadata`: now also displaying CreationDate, Creator, Keywords & Subject ([PR #73](#73))
- Add warning for out-of-bounds page range in pdfly `cat` command ([PR #58](#58))

### Bug Fixes (BUG)
- `2-up` command, that only showed one page per sheet, on the left side, with blank space on the right ([PR #78](#78))

[Full Changelog](0.3.3...0.4.0)

Lucas-C · 2024-12-08T09:24:05Z

This has been released in version 0.4.0: https://pypi.org/project/pdfly/#history

Lucas-C requested changes Nov 1, 2024 (pdfly/cli.py, outdated)

Lucas-C reviewed Nov 1, 2024 (pdfly/update_offsets.py, outdated)

Lucas-C reviewed Nov 1, 2024 (pdfly/update_offsets.py, outdated)

srogmann force-pushed the update_offsets branch from 3f6b60e to 2f4e11d on November 3, 2024 20:42

Lucas-C reviewed Nov 4, 2024 (pdfly/cli.py, outdated)

Lucas-C reviewed Nov 4, 2024 (pdfly/update_offsets.py, outdated)

Lucas-C reviewed Nov 4, 2024 (tests/test_update_offsets.py)

Lucas-C reviewed Nov 4, 2024 (tests/test_update_offsets.py, outdated)

Lucas-C mentioned this pull request Nov 4, 2024

Implement uncompress functionality for PDF files Closes #38 #66

Closed

srogmann and others added 8 commits November 4, 2024 19:42

ENH: Added command update-offsets to adjust offsets and lengths.

b18c8e0

This command adjusts /Length-entries of stream objects and the xref-offsets in simple PDF files (ASCII only, one xref section only).

BUG: Clear stream-length at new object.

25f0ccd

DEV: Logging migrated from Python's built-in logging to .

d8f6669

TST: Added test of update-offsets using hello.pdf.

37b9b9d

MAINT: Regex uppercase module constants.

be39e9b

DOC: Add update-offsets command

2465ffe

MAINT: Added suggested help-attribute.

8838ca5

Minor fixups & adding test_update_offsets_on_all_reference_files()

3429c2f

Lucas-C force-pushed the update_offsets branch from e0e405e to 3429c2f on November 4, 2024 18:42

srogmann and others added 4 commits November 4, 2024 22:01

MAINT: Bugfix help-attribute of x2pdf

10ce504

ENH: Support of referenced lengths.

9b0138a

TST: Renamed test PDF file..

e0a32ff

Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>

TST: Renamed test PDF file.

4f003e5

srogmann added 9 commits November 5, 2024 22:57

TST: rich.console introduces line-breaks in output.

47b16d4

MAINT: Changed /Length detection to support GeoTopo-komprimiert.pdf

dd1be3b

MAINT: Changed /Length detection to support output_with_metadata_pymu…

9146897

…pdf.pdf

MAINT: Changed /Length detection (PDF ref 3.1 white-space characters)

6d72f5a

MAINT: Don't replace pseudo line-breaks in binary parts of a pdf file.

657955b

MAINT: EOL can be CR, LF or CRLF.

5c3b92c

TST: Disabled some documents which are not supported.

68a352f

MAINT: black (code formatting)

51ed725

DEV: directory tests is lower-case.

c3a6c88

Lucas-C force-pushed the update_offsets branch 3 times, most recently from b62f298 to 5032317 on November 7, 2024 16:13

Pleasing mypy & typing imports under Python 3.8

fc42eb4

Lucas-C force-pushed the update_offsets branch from 5032317 to fc42eb4 on November 7, 2024 16:14

Lucas-C approved these changes Nov 7, 2024

View reviewed changes

Lucas-C merged commit da75816 into py-pdf:main Nov 7, 2024
