ENH: Add support for page labels by MartinThoma · Pull Request #1519 · py-pdf/pypdf

MartinThoma · 2022-12-28T21:36:49Z

Introduce a new PdfReader property page_labels that returns a list of strings.

In most cases, the list will just be

['1', '2', '3', '4', '5']

or similar, but sometimes it will be:

['i', 'ii', 'iii', 'iv', '1', '2', '3', '4', '5']
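
To illustrate where a mixed list like this comes from: in the PDF format, the catalog's /PageLabels entry maps a starting page index to a labeling range with a style (/S), an optional prefix (/P), and an optional start value (/St). The following is a minimal sketch of that expansion on plain Python data, not the implementation in this PR; the function names and the tuple layout are hypothetical, and only the decimal and lowercase Roman styles are covered.

```python
from typing import List, Tuple

# (first page index, style, prefix, start value); style "D" = decimal,
# "r" = lowercase Roman, mirroring the /S values of a PDF page-label range.
LabelRange = Tuple[int, str, str, int]


def to_roman(number: int) -> str:
    """Convert a positive integer to a lowercase Roman numeral."""
    pairs = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    result = ""
    for value, symbol in pairs:
        while number >= value:
            result += symbol
            number -= value
    return result


def expand_labels(ranges: List[LabelRange], page_count: int) -> List[str]:
    """Expand labeling ranges into one label per page."""
    labels = []
    ordered = sorted(ranges)
    for page in range(page_count):
        # The applicable range is the last one starting at or before this page.
        current = None
        for candidate in ordered:
            if candidate[0] <= page:
                current = candidate
        if current is None:
            labels.append(str(page + 1))  # fallback: plain 1-based numbering
            continue
        first, style, prefix, start = current
        number = start + (page - first)
        text = to_roman(number) if style == "r" else str(number)
        labels.append(prefix + text)
    return labels


# Four front-matter pages in Roman numerals, then decimal numbering from 1.
print(expand_labels([(0, "r", "", 1), (4, "D", "", 1)], page_count=9))
# ['i', 'ii', 'iii', 'iv', '1', '2', '3', '4', '5']
```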

Evidence for User Need

Stack Overflow: Retrieve Custom page labels from document with pyPDF

codecov · 2022-12-28T22:45:19Z

Codecov Report

Base: 92.03% // Head: 91.91% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (23f4065) compared to base (cfed01f).
Patch coverage: 80.32% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1519      +/-   ##
==========================================
- Coverage   92.03%   91.91%   -0.12%     
==========================================
  Files          32       33       +1     
  Lines        5976     6037      +61     
  Branches     1163     1180      +17     
==========================================
+ Hits         5500     5549      +49     
- Misses        312      318       +6     
- Partials      164      170       +6

Impacted Files	Coverage Δ
pypdf/_protocols.py	`80.00% <ø> (ø)`
pypdf/_page_labels.py	`78.94% <78.94%> (ø)`
pypdf/_reader.py	`90.36% <100.00%> (+0.04%)`	⬆️

MartinThoma · 2022-12-29T15:33:11Z

@MasterOdin @pubpub-zz What do you think about this one?

There are two TODOs because I haven't seen an example with Kids / Limits and I don't understand how it should work. But despite those, I think this PR already provides value. I would merge it, if you're ok with that.
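
For context on the two TODOs: the PDF specification stores page labels in a number tree, where a node either holds its entries directly in /Nums or points to child nodes through /Kids, and non-root nodes carry /Limits with the smallest and largest key below them. A rough sketch of how such a tree could be flattened, using plain dictionaries in place of PDF objects (the helper name and the sample tree are hypothetical, and this is not the code in this PR):

```python
from typing import Any, Dict, List, Tuple


def flatten_number_tree(node: Dict[str, Any]) -> List[Tuple[int, Any]]:
    """Collect (key, value) pairs from a number tree in ascending key order."""
    pairs: List[Tuple[int, Any]] = []
    if "/Nums" in node:
        nums = node["/Nums"]
        # /Nums is a flat array: key1, value1, key2, value2, ...
        for i in range(0, len(nums) - 1, 2):
            pairs.append((nums[i], nums[i + 1]))
    for kid in node.get("/Kids", []):
        pairs.extend(flatten_number_tree(kid))
    # /Limits only speeds up lookups; it is not needed to flatten the tree.
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical tree: a root with two leaf nodes, each holding one range.
tree = {
    "/Kids": [
        {"/Limits": [0, 0], "/Nums": [0, {"/S": "/r"}]},
        {"/Limits": [4, 4], "/Nums": [4, {"/S": "/D"}]},
    ]
}
print(flatten_number_tree(tree))
# [(0, {'/S': '/r'}), (4, {'/S': '/D'})]
```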

MasterOdin

I think it's fine to merge work that isn't fully done, as long as supporting the additional cases won't require redoing the whole thing. Maybe create an issue to track support for those cases, with an ask to the community for a PDF that has those features.

josephjury123 · 2023-02-13T04:40:38Z

> @MasterOdin @pubpub-zz What do you think about this one?
>
> There are two TODOs because I haven't seen an example with Kids / Limits and I don't understand how it should work. But despite those, I think this PR already provides value. I would merge it, if you're ok with that.

Not sure if this is the correct place to post this, but I have been using pypdf to extract bookmarks from multiple PDFs in a folder and found one PDF that wouldn't work and gave me a message to share the PDF here. The message was "/Kids or /Limits found in PageLabels." The PDF file is 87 MB, so it is too big to share here directly.

MartinThoma · 2023-02-13T06:36:52Z

Thank you! If there is nothing confidential / copyright protected in there, could you maybe share it in another way?

Maybe compression (zip / bzip2) helps?

You could also send it to me via email: info@martin-thoma.de (I hope my mail server doesn't reject it)

I'm opening this issue again so I don't forget about that part

loganpowell · 2023-02-13T19:33:24Z

I am also coming up against the /Kids or /Limits found... warning during outline processing. I'm parsing a very large tax document (16MB - no images). Is there a way to bypass?

MartinThoma · 2023-02-13T20:37:12Z

If you simply don't want to display the warning: https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html#warnings
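
As a sketch of what the linked page describes, assuming the message is emitted through Python's standard logging module on a logger inside the "pypdf" namespace (the child logger name below is an assumption used only for illustration):

```python
import logging

# Assumption: the message is emitted via the standard logging module on a
# logger inside the "pypdf" namespace (for example "pypdf._page_labels").
logging.getLogger("pypdf").setLevel(logging.ERROR)

# Child loggers without their own level inherit it from "pypdf",
# so WARNING records are dropped while ERROR records still pass.
child = logging.getLogger("pypdf._page_labels")
print(child.isEnabledFor(logging.WARNING))  # False
print(child.isEnabledFor(logging.ERROR))    # True
```

This only hides the message; the page labels for the affected PDF may still be incomplete.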

MartinThoma · 2023-02-13T20:37:49Z

@loganpowell Is there any PDF you can share that causes this warning?

josephjury123 · 2023-02-14T03:05:32Z

@MartinThoma I have just shared the PDF with you via Google Drive; you should be able to download a copy.

MartinThoma · 2023-02-14T07:28:04Z

Nice!

I'm on a business trip until Saturday. I hope I'll get a chance to look at it on Sunday :-)

loganpowell · 2023-02-14T13:53:05Z

@MartinThoma I could share a subset of the document (a couple of pages). Would that suffice?

MartinThoma · 2023-02-14T16:44:05Z

If it still causes the warnings: sure!

mfitzz10 · 2023-07-12T13:08:21Z

@MartinThoma Hello, I'm running into the same /Kids or /Limits issue above, which causes page_labels not to be read correctly. Can you please explain how to resolve it?

MartinThoma · 2023-07-12T20:28:08Z

Which version of pypdf do you use (print(pypdf.__version__))?
Can you share code / a PDF that triggers the issue? Can you share a traceback?

taynotfound · 2023-11-07T09:37:05Z

> @MartinThoma Hello, I'm running into the same /Kids or /Limits issue above, which causes page_labels not to be read correctly. Can you please explain how to resolve it?

I have the same issue, but I have 1.3 GB of PDFs, so I don't know which PDF it is.

stefan6419846 · 2023-11-07T09:40:05Z

> I have the same issue, but I have 1.3 GB of PDFs, so I don't know which PDF it is.

Wouldn't you be able to use appropriate logging on your side to pinpoint the offending file?
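
One way this could look, as a minimal sketch: attach a temporary logging handler while each file is processed and record which files produced warnings. The helper names are hypothetical, and the default logger name "pypdf" is an assumption about where the warning is emitted.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, Iterable, List


class _Collector(logging.Handler):
    """Logging handler that stores the text of every record it receives."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def find_noisy_files(
    paths: Iterable[Path],
    process: Callable[[Path], None],
    logger_name: str = "pypdf",  # assumed logger namespace
) -> Dict[Path, List[str]]:
    """Run `process` on each path and map paths to the warnings they caused."""
    logger = logging.getLogger(logger_name)
    offenders: Dict[Path, List[str]] = {}
    for path in paths:
        collector = _Collector()
        logger.addHandler(collector)
        try:
            process(path)
        finally:
            # Always detach the handler, even if processing raised.
            logger.removeHandler(collector)
        if collector.messages:
            offenders[path] = collector.messages
    return offenders
```

With pypdf, `process` could be something like `lambda p: PdfReader(p).page_labels`, and `paths` could come from `Path("some_folder").glob("*.pdf")`; the keys of the returned dictionary are then the files that triggered messages.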

taynotfound · 2023-11-07T09:45:12Z

I use LlamaIndex's SimpleDirectoryLoader, so I don't know how best to log that.

jpvan4 · 2023-11-10T14:39:40Z

I came across a PDF with the '/Kids or /Limits found in PageLabels' warning. I can send it through if you are still looking for examples.

MartinThoma · 2023-11-10T15:39:04Z

Yes, please! If it's ok to have it public, you can post it here.

Otherwise, you can send it to me ( info@martin-thoma.de )

beevelop · 2023-12-14T21:59:43Z

@MartinThoma I stumbled upon this warning message for the following PDF file: https://www.bk.admin.ch/dam/bk/de/dokumente/terminologie/publikation_25_jahre_rtd.pdf.download.pdf/Terminologie_Epochen,%20Schwerpunkte,%20Umsetzungen.pdf

kyrakangaa · 2024-01-02T03:35:54Z

@MartinThoma Hi Martin, I was using SimpleDirectoryReader to load a 175 MB file and came across the same issue: /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: #1519. Can you please help me understand if there is a solution yet?

stefan6419846 · 2024-01-02T08:44:03Z

@kyrakangaa As long as you are using the latest pypdf version (see https://pypi.org/project/pypdf/#files) and still receive this warning without seeing a corresponding PR linked here, this most likely remains unresolved for now.

SimpleDirectoryReader is not part of the pypdf package, thus you might want to try using pypdf directly, although the warning probably will be the same. If you are able to publicly share your file here, feel free to do so.

JackTrapper · 2024-02-17T23:32:58Z

Here's a PDF you can use that reproduces the issue:

http://6502.org/documents/publications/dr_dobbs_journal/dr_dobbs_journal_vol_06.pdf

Which triggers bazillions of:

...snip...
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
...snip...

khukharev · 2024-02-26T23:12:30Z

I have another PDF that triggers /Kids or /Limits:
Coatue_Next_Decade_in_FinTech_Oct-22.pdf.zip
Coatue_Next_Decade_in_FinTech_Oct-22.pdf

ltorsini · 2024-02-26T23:19:02Z

> I have another /Kids or /Limits

Has someone explained this error? I had a few special-character errors that were fixable by modifying the document, but this is not the case with /Kids or /Limits.

G:\My Drive\PDF Library\Chinese Traditional Religion\chinese_traditional_religion_pure_land_sutras.pdf with error: Invalid Elementary Object starting with b'\xce' @0: b'\xce\xc5'. Skipping...

MartinThoma · 2024-03-29T21:12:30Z

@khukharev I cannot reproduce:

from io import BytesIO

from pypdf import PdfReader, __version__
from tests import get_data_from_url  # helper from the pypdf repository's test suite

url = "https://github.com/py-pdf/pypdf/files/14412329/Coatue_Next_Decade_in_FinTech_Oct-22.pdf"

print(f"pypdf=={__version__}")
# Download the attached PDF and read its page labels.
reader = PdfReader(BytesIO(get_data_from_url(url, name="Coatue_Next_Decade_in_FinTech_Oct-22.pdf")))
print(reader.page_labels)

gives:

pypdf==4.1.0
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56']

MartinThoma · 2024-03-29T21:13:43Z

@ltorsini Without a PDF, we cannot help you.

MartinThoma · 2024-03-29T21:57:12Z

@JackTrapper http://6502.org/documents/publications/dr_dobbs_journal/dr_dobbs_journal_vol_06.pdf is 258 MB 😅

Still, seems to work:

pypdf==4.1.0
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', 
'302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', '517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578']

pubpub-zz · 2024-03-30T07:56:13Z

I would recommend that any new report be filed as a new issue with minimal test code and a PDF file.
This issue is now too old to keep track of regressions.

stefan6419846 · 2024-03-30T08:42:33Z

@pubpub-zz We should probably adjust the error message in this case which redirects here explicitly by opening a corresponding issue and validating the documents already supplied above. Maybe we already have enough examples and do not need to ask for more anyway.

pubpub-zz · 2024-03-30T08:44:59Z

> @pubpub-zz We should probably adjust the error message in this case which redirects here explicitly by opening a corresponding issue and validating the documents already supplied above. Maybe we already have enough examples and do not need to ask for more anyway.

Can you propose a PR?

MartinThoma · 2024-03-31T09:54:53Z

I'll lock the discussion to avoid new comments here. Please check #2560 instead.

MartinThoma force-pushed the page-labels branch 4 times, most recently from 5e6eca0 to e2ec897 Compare December 28, 2022 22:37

MartinThoma added the is-feature A feature request label Dec 28, 2022

ENH: Add support for page labels

413d4e2

MartinThoma force-pushed the page-labels branch from e2ec897 to 413d4e2 Compare December 28, 2022 23:03

Add another test

98fdb0a

MartinThoma requested review from MasterOdin and pubpub-zz December 29, 2022 15:30

MasterOdin approved these changes Dec 29, 2022

MartinThoma and others added 4 commits December 29, 2022 19:03

Review from MasterOdin

9ade37e

Update pypdf/_page_labels.py

6b53c5c

Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

fix mypy

a022470

Document missing implementation

23f4065

MartinThoma merged commit 8c9505c into main Dec 29, 2022

MartinThoma deleted the page-labels branch December 29, 2022 19:06

MartinThoma added a commit that referenced this pull request Dec 31, 2022

REL: 3.2.0

c2c4be6

Performance Improvement (PI):
- Help the specializing adaptive interpreter (#1522)

New Features (ENH):
- Add support for page labels (#1519)

Bug Fixes (BUG):
- upgrade clone_document_root (#1520)

py-pdf deleted a comment from Barath-S-0412 Mar 29, 2024

py-pdf deleted a comment from pubpub-zz Mar 29, 2024

This was referenced Mar 30, 2024

Add support for /Kids and /Limits in page labels #2560

Closed

DEV: Remove page labels PR link from message #2561

Merged

py-pdf locked as resolved and limited conversation to collaborators Mar 31, 2024
