ENH: Added command update-offsets to adjust offsets and lengths. #15
Lucas-C merged 22 commits into py-pdf:main
|
Two hours ago I edited a larger PDF file created by I remembered this old PR and used update_offsets.py to fix the offsets. I added to fix a bug occurring if there are several streams. |
|
I think this would be a great addition to Are you still willing to work on this @srogmann? 🙂 |
|
@Lucas-C As mentioned above, in May 2024 I remembered pdfly and used update-offsets to correct a manually edited PDF file. In my view, the PR was already ready in May 2024. An example is in the attached file-in.pdf, which I used to test the text extraction of documents with Tm operators. By update-offsets the XREF-section will be converted into The "target audience" for update_offsets is simple PDF documents that have been created manually in an editor. It is not suitable for complex or obfuscated PDFs. |
|
Thank you for your detailed answer @srogmann 👍 I'll be happy to review & merge this PR, but could you rebase it and solve the minor merge conflict, please? |
Lucas-C
left a comment
Could you please:
- add a mention of the new command in README.md
- add some unit tests in tests/test_update_offsets.py
3f6b60e to 2f4e11d
|
@Lucas-C During testing, I noticed an issue (specifically with pytest on Unix). In the |
Thank you for reporting this problem 👍 Could you please fix this as part of this PR? |
|
PS: I myself wrote a similar script some time ago: https://github.com/Lucas-C/dotfiles_and_notes/blob/master/languages/python/set_pdf_xref.py I'm really happy to include this feature in |
This command adjusts /Length-entries of stream objects and the xref-offsets in simple PDF files (ASCII only, one xref section only).
e0e405e to 3429c2f
|
The GitHub Actions pipeline is currently failing due to |
Co-authored-by: Lucas Cimon <925560+Lucas-C@users.noreply.github.com>
|
@Lucas-C In the tests, I have commented out four PDF documents that cannot be processed correctly with the current implementation. The current implementation is quite simple and works with regular expressions; it was originally intended for fixing up PDF documents that were hand-edited in an editor. The more accurately the script needs to work, the more appropriate it would be to parse the tokens as described in chapter 3 of the PDF specification. Technically, this is possible, but it would go far beyond the original goal of my implementation. |
Awesome! Good job 🙂
That's fine really. I added a commit on the branch to fix some minor typing related issues. |
b62f298 to 5032317
5032317 to fc42eb4
## What's new

### New Features (ENH)
- New `booklet` command to adjust offsets and lengths ([PR #77](#77))
- New `uncompress` command ([PR #75](#75))
- New `update-offsets` command to adjust offsets and lengths ([PR #15](#15))
- New `rm` command ([PR #59](#59))
- `metadata`: now also displaying CreationDate, Creator, Keywords & Subject ([PR #73](#73))
- Add warning for out-of-bounds page range in pdfly `cat` command ([PR #58](#58))

### Bug Fixes (BUG)
- `2-up` command, that only showed one page per sheet, on the left side, with blank space on the right ([PR #78](#78))

[Full Changelog](0.3.3...0.4.0)
|
This has been released in version |
This command adjusts /Length-entries of stream objects and the xref-offsets
in simple PDF files (ASCII only, one xref section only) to support writing PDF
files by means of a text editor.
I replaced the camelCase variables with snake_case variables.
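
For readers who want a feel for what "adjusting /Length entries and xref offsets" involves, here is a rough regex-based sketch. It is not the code from this PR: the function names `fix_stream_lengths`, `rebuild_xref` and `update_offsets_sketch` are made up for illustration, and it assumes an ASCII-only PDF with flat stream dictionaries, `N G obj` headers at the start of a line, and a single xref section.

```python
import re


def fix_stream_lengths(pdf: bytes) -> bytes:
    """Set each /Length to the byte count between 'stream' and 'endstream'."""
    def repl(m):
        data = m.group(3)
        head = re.sub(rb"/Length\s+\d+", b"/Length %d" % len(data), m.group(1))
        return head + m.group(2) + data + m.group(4)

    # Assumption: a flat dictionary (no nested << >>) directly precedes the stream.
    pattern = rb"(<<[^>]*>>\s*)(stream\r?\n)(.*?)(\r?\nendstream)"
    return re.sub(pattern, repl, pdf, flags=re.DOTALL)


def rebuild_xref(pdf: bytes) -> bytes:
    """Rewrite the single xref table and the startxref value."""
    # Map object number -> (byte offset, generation) for every "N G obj" line.
    objects = {
        int(m.group(1)): (m.start(), int(m.group(2)))
        for m in re.finditer(rb"^(\d+) (\d+) obj\b", pdf, re.MULTILINE)
    }
    xref_pos = pdf.rindex(b"\nxref") + 1
    trailer_pos = pdf.index(b"trailer", xref_pos)
    size = max(objects) + 1  # simplification: /Size in the trailer is not touched

    # Every entry is 20 bytes: 10-digit offset, 5-digit generation, flag, EOL.
    entries = [b"0000000000 65535 f \n"]
    for num in range(1, size):
        if num in objects:
            entries.append(b"%010d %05d n \n" % objects[num])
        else:
            entries.append(b"0000000000 65535 f \n")  # simplified free entry

    table = b"xref\n0 %d\n" % size + b"".join(entries)
    body = pdf[:xref_pos] + table + pdf[trailer_pos:]
    # The xref keyword itself does not move, because nothing before it changed.
    return re.sub(rb"startxref\s+\d+", lambda m: b"startxref\n%d" % xref_pos, body)


def update_offsets_sketch(pdf: bytes) -> bytes:
    # Lengths first: changing a /Length value can shift later object offsets.
    return rebuild_xref(fix_stream_lengths(pdf))
```

The order matters: stream lengths are corrected before offsets are computed, since a changed digit count in `/Length` moves everything after it. Anything beyond these assumptions (nested dictionaries, binary streams, several xref sections) would need a real tokenizer, as discussed in the conversation above.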