ENH: Add support for page labels #1519

Merged
MartinThoma merged 6 commits into main from page-labels
Dec 29, 2022
@MartinThoma
Member

@MartinThoma MartinThoma commented Dec 28, 2022

Introduce a new PdfReader property page_labels that returns a list of strings.

In most cases, the list will just be

['1', '2', '3', '4', '5']

or similar, but sometimes it will be:

['i', 'ii', 'iii', 'iv', '1', '2', '3', '4', '5']
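
For illustration only, here is a rough sketch of how such a list can be derived from page-label ranges as the PDF specification describes them: each range has a starting page index, a numbering style, an optional prefix, and a first number. This is not pypdf's implementation; the tuple layout and the function names (`to_roman`, `to_letters`, `page_labels`) are made up for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (first page index, style, prefix, first number).
# Styles follow the PDF spec: "D" decimal, "r"/"R" roman, "a"/"A" letters,
# None means "prefix only".
LabelRange = Tuple[int, Optional[str], str, int]


def to_roman(number: int) -> str:
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    parts = []
    for value, symbol in pairs:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letters(number: int) -> str:
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    letter = chr(ord("a") + (number - 1) % 26)
    return letter * ((number - 1) // 26 + 1)


def format_number(number: int, style: Optional[str]) -> str:
    if style is None:
        return ""
    if style == "D":
        return str(number)
    if style in ("r", "R"):
        text = to_roman(number)
    elif style in ("a", "A"):
        text = to_letters(number)
    else:
        raise ValueError(f"unknown numbering style: {style!r}")
    return text.upper() if style.isupper() else text


def page_labels(page_count: int, ranges: List[LabelRange]) -> List[str]:
    """Build one label per page; without ranges, fall back to '1', '2', ..."""
    if not ranges:
        return [str(index + 1) for index in range(page_count)]
    ordered = sorted(ranges, key=lambda item: item[0])
    if ordered[0][0] != 0:
        raise ValueError("the first range must start at page index 0")
    labels: List[str] = []
    for position, (start, style, prefix, first) in enumerate(ordered):
        # A range ends where the next one starts, or at the last page.
        is_last = position + 1 == len(ordered)
        end = page_count if is_last else min(ordered[position + 1][0], page_count)
        for offset in range(max(0, end - start)):
            labels.append(prefix + format_number(first + offset, style))
    return labels


print(page_labels(9, [(0, "r", "", 1), (4, "D", "", 1)]))
# ['i', 'ii', 'iii', 'iv', '1', '2', '3', '4', '5']
```

With pypdf itself, the list comes from the `page_labels` property of a `PdfReader` instance, as described above.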

Evidence for User Need

@MartinThoma MartinThoma force-pushed the page-labels branch 4 times, most recently from 5e6eca0 to e2ec897 Compare December 28, 2022 22:37
@MartinThoma MartinThoma added the is-feature A feature request label Dec 28, 2022
@codecov

codecov bot commented Dec 28, 2022

Codecov Report

Base: 92.03% // Head: 91.91% // Decreases project coverage by -0.11% ⚠️

Coverage data is based on head (23f4065) compared to base (cfed01f).
Patch coverage: 80.32% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1519      +/-   ##
==========================================
- Coverage   92.03%   91.91%   -0.12%     
==========================================
  Files          32       33       +1     
  Lines        5976     6037      +61     
  Branches     1163     1180      +17     
==========================================
+ Hits         5500     5549      +49     
- Misses        312      318       +6     
- Partials      164      170       +6     
Impacted Files          Coverage Δ
pypdf/_protocols.py     80.00% <ø> (ø)
pypdf/_page_labels.py   78.94% <78.94%> (ø)
pypdf/_reader.py        90.36% <100.00%> (+0.04%) ⬆️

@MartinThoma
Member Author

@MasterOdin @pubpub-zz What do you think about this one?

There are two TODOs because I haven't seen an example with Kids / Limits and I don't understand how it should work. But despite those, I think this PR already provides value. I would merge it, if you're ok with that.
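
For context, the PDF specification stores page labels in a number tree: a node either holds the entries directly in /Nums (a flat array of key, value, key, value, ...) or points to child nodes through /Kids, and each child carries /Limits with the smallest and largest key found beneath it. The sketch below flattens such a tree. It uses plain Python dicts as stand-ins for PDF objects and is only meant to illustrate the structure, not how pypdf handles (or should handle) it; `flatten_number_tree` is a made-up name.

```python
from typing import Any, Dict, List, Tuple


def flatten_number_tree(node: Dict[str, Any]) -> List[Tuple[int, Any]]:
    """Collect all (key, value) pairs of a number tree, sorted by key."""
    entries: List[Tuple[int, Any]] = []
    # Leaf content: /Nums is [key1, value1, key2, value2, ...].
    nums = node.get("/Nums", [])
    for index in range(0, len(nums) - 1, 2):
        entries.append((nums[index], nums[index + 1]))
    # Intermediate content: recurse into every child. /Limits is only a
    # lookup hint ([least key, greatest key]) and is not needed here.
    for kid in node.get("/Kids", []):
        entries.extend(flatten_number_tree(kid))
    return sorted(entries, key=lambda entry: entry[0])


# Hypothetical tree: roman numerals from page index 0, decimals from index 4.
root = {
    "/Kids": [
        {"/Limits": [0, 0], "/Nums": [0, {"/S": "/r"}]},
        {"/Limits": [4, 4], "/Nums": [4, {"/S": "/D"}]},
    ]
}
print(flatten_number_tree(root))
# [(0, {'/S': '/r'}), (4, {'/S': '/D'})]
```

Once flattened, the entries look the same as those of a tree that stores everything in a single /Nums array.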

Member

@MasterOdin MasterOdin left a comment

I think it's fine to merge stuff that's not fully done so long as it's not like the entire thing will have to be completely redone to support the additional cases. Maybe create an issue or so to track support for those / with an ask to the community to see if can get a PDF that has those features.

@MartinThoma MartinThoma merged commit 8c9505c into main Dec 29, 2022
@MartinThoma MartinThoma deleted the page-labels branch December 29, 2022 19:06
MartinThoma added a commit that referenced this pull request Dec 31, 2022
Performance Improvements (PI):
-  Help the specializing adaptive interpreter (#1522)

New Features (ENH):
-  Add support for page labels (#1519)

Bug Fixes (BUG):
-  upgrade clone_document_root (#1520)
@josephjury123

josephjury123 commented Feb 13, 2023

> @MasterOdin @pubpub-zz What do you think about this one?
>
> There are two TODOs because I haven't seen an example with Kids / Limits and I don't understand how it should work. But despite those, I think this PR already provides value. I would merge it, if you're ok with that.

Not sure if this is the correct place to post this, but I have been using pypdf to extract bookmarks from multiple PDFs in a folder and found one PDF that wouldn't work and gave me a message asking me to share the PDF here. The message was "/Kids or /Limits found in PageLabels." The PDF file is 87 MB, so it is too big to share here directly.

@MartinThoma
Member Author

Thank you! If there is nothing confidential / copyright protected in there, could you maybe share it in another way?

Maybe compression (zip / bzip2) helps?

You could also send it to me via email: info@martin-thoma.de (I hope my mail server doesn't reject it)

I'm opening this issue again so I don't forget about that part

@loganpowell

I am also running into the "/Kids or /Limits found..." warning during outline processing. I'm parsing a very large tax document (16 MB, no images). Is there a way to bypass it?

@MartinThoma
Member Author

If you simply don't want to display the warning: https://pypdf.readthedocs.io/en/latest/user/suppress-warnings.html#warnings
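
As a minimal sketch of what that can look like, assuming pypdf emits this warning through Python's standard `logging` module on loggers under the `pypdf` name (the child logger name `pypdf._page_labels` below is an assumption based on the module name in the coverage report):

```python
import logging


def silence_pypdf_warnings() -> logging.Logger:
    """Raise the threshold of the "pypdf" logger so warnings are dropped."""
    logger = logging.getLogger("pypdf")
    logger.setLevel(logging.ERROR)
    return logger


silence_pypdf_warnings()

# Child loggers without their own level inherit the threshold from "pypdf".
child = logging.getLogger("pypdf._page_labels")
print(child.isEnabledFor(logging.WARNING))  # False
print(child.isEnabledFor(logging.ERROR))    # True
```

Note that this hides all pypdf warnings, not only the /Kids or /Limits one, and it does not change how the page labels are read.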

@MartinThoma
Member Author

@loganpowell Is there any PDF you can share that causes this warning?

@josephjury123

@MartinThoma I have just shared the PDF with you via google drive, you should just be able to download a copy.

@MartinThoma
Member Author

Nice!

I'm on a business trip until Saturday. I hope I'll get a chance to look at it on Sunday :-)

@loganpowell

@MartinThoma I could share a subset of the document (a couple of pages). Would that suffice?

@MartinThoma
Member Author

If it still causes the warnings: sure!

@mfitzz10

mfitzz10 commented Jul 12, 2023

@MartinThoma Hello, I'm running into the same /Kids or /Limits issue described above, which causes page_labels not to be read correctly. Can you please explain how to resolve it?

@MartinThoma
Member Author

  • Which version of pypdf do you use (print(pypdf.__version__))?
  • Can you share code / a PDF that triggers the issue? Can you share a traceback?

@taynotfound

> @MartinThoma Hello, I'm running into the same /Kids or /Limits issue described above, which causes page_labels not to be read correctly. Can you please explain how to resolve it?

I have the same issue, but I have 1.3 GB of PDFs, so I don't know which PDF it is.

@stefan6419846
Collaborator

> I have the same issue, but I have 1.3 GB of PDFs, so I don't know which PDF it is.

Wouldn't you be able to use appropriate logging on your side to pinpoint the offending file?
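
One way this could look, as a sketch under assumptions: it presumes the warning is emitted through the standard `logging` module on a logger under the `pypdf` name, and that the files can be processed one at a time. The helper names (`captured_warnings`, `find_offending_files`) and the stand-in `fake_process` function are made up for this example.

```python
import logging
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List


class _Collector(logging.Handler):
    """Logging handler that keeps the formatted messages in a list."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


@contextmanager
def captured_warnings(logger_name: str = "pypdf") -> Iterator[List[str]]:
    """Temporarily collect warnings logged under the given logger name."""
    logger = logging.getLogger(logger_name)
    handler = _Collector()
    logger.addHandler(handler)
    try:
        yield handler.messages
    finally:
        logger.removeHandler(handler)


def find_offending_files(
    paths: Iterable[Path],
    process: Callable[[Path], object],
    needle: str = "/Kids or /Limits",
) -> Dict[str, List[str]]:
    """Map each file whose processing logged `needle` to those messages."""
    hits: Dict[str, List[str]] = {}
    for path in paths:
        with captured_warnings() as messages:
            process(path)
        matching = [message for message in messages if needle in message]
        if matching:
            hits[str(path)] = matching
    return hits


def fake_process(path: Path) -> None:
    # Stand-in for real PDF processing: pretend that only b.pdf warns.
    if path.name == "b.pdf":
        logging.getLogger("pypdf._page_labels").warning(
            "/Kids or /Limits found in PageLabels."
        )


print(find_offending_files([Path("a.pdf"), Path("b.pdf")], fake_process))
# {'b.pdf': ['/Kids or /Limits found in PageLabels.']}
```

With pypdf installed, `process` could be a function that opens the file with `PdfReader` and reads `page_labels`; for a folder, `Path(folder).glob("*.pdf")` would supply the paths.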

@taynotfound

I use LlamaIndex's SimpleDirectoryLoader, so I don't know how best to log that.

@jpvan4

jpvan4 commented Nov 10, 2023

I came across a PDF with the '/Kids or /Limits found in PageLabels' warning. I can send it through if you are still looking for examples.

@MartinThoma
Member Author

Yes, please! If it's ok to have it public, you can post it here.

Otherwise, you can send it to me ( info@martin-thoma.de )

@beevelop

@kyrakangaa

@MartinThoma Hi Martin, I was using SimpleDirectoryReader to load a 175 MB file and came across the same issue: /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: #1519. Can you please help me understand if there is a solution yet?

@stefan6419846
Collaborator

@kyrakangaa As long as you are using the latest pypdf version (see https://pypi.org/project/pypdf/#files) and still receive this warning without seeing a corresponding PR linked here, this most likely remains unresolved for now.

SimpleDirectoryReader is not part of the pypdf package, thus you might want to try using pypdf directly, although the warning probably will be the same. If you are able to publicly share your file here, feel free to do so.

@JackTrapper

Here's a PDF you can use that reproduces the issue:

  • http://6502.org/documents/publications/dr_dobbs_journal/dr_dobbs_journal_vol_06.pdf

Which triggers bazillions of:

...snip...
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
[02/17/2024-18:32:27] /Kids or /Limits found in PageLabels. Please share this PDF with pypdf: https://github.com/py-pdf/pypdf/pull/1519
...snip...

@khukharev

@ltorsini

ltorsini commented Feb 26, 2024

I have another /Kids or /Limits

Has someone explained this error? I had a few special-character errors that were fixable by modifying the document, but this is not the case with /Kids or /Limits.

G:\My Drive\PDF Library\Chinese Traditional Religion\chinese_traditional_religion_pure_land_sutras.pdf with error: Invalid Elementary Object starting with b'\xce' @0: b'\xce\xc5'. Skipping...

@MartinThoma
Member Author

@khukharev I cannot reproduce:

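# get_data_from_url is imported from the tests package, so run this from a pypdf checkout.
from io import BytesIO

from pypdf import PdfReader, __version__
from tests import get_data_from_url

url = "https://github.com/py-pdf/pypdf/files/14412329/Coatue_Next_Decade_in_FinTech_Oct-22.pdf"

print(f"pypdf=={__version__}")
# Fetch the PDF and read it from memory instead of from a file on disk.
reader = PdfReader(BytesIO(get_data_from_url(url, name="Coatue_Next_Decade_in_FinTech_Oct-22.pdf")))
print(reader.page_labels)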

gives:

pypdf==4.1.0
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56']

@py-pdf py-pdf deleted a comment from Barath-S-0412 Mar 29, 2024
@py-pdf py-pdf deleted a comment from pubpub-zz Mar 29, 2024
@MartinThoma
Member Author

@ltorsini Without a PDF, we cannot help you.

@MartinThoma
Member Author

@JackTrapper http://6502.org/documents/publications/dr_dobbs_journal/dr_dobbs_journal_vol_06.pdf is 258 MB 😅

Still, seems to work:

pypdf==4.1.0
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', 
'302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', '517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578']

@pubpub-zz
Collaborator

pubpub-zz commented Mar 30, 2024

I would recommend that any new report be filed as a new issue with minimal test code and a PDF file.
This issue is now too old to keep track of regressions.

@stefan6419846
Collaborator

@pubpub-zz We should probably adjust the error message in this case which redirects here explicitly by opening a corresponding issue and validating the documents already supplied above. Maybe we already have enough examples and do not need to ask for more anyway.

@pubpub-zz
Collaborator

> @pubpub-zz We should probably adjust the error message in this case which redirects here explicitly by opening a corresponding issue and validating the documents already supplied above. Maybe we already have enough examples and do not need to ask for more anyway.

Can you propose a PR?

@MartinThoma
Member Author

MartinThoma commented Mar 31, 2024

I'll lock the discussion to avoid new comments here. Please check #2560 instead.

@py-pdf py-pdf locked as resolved and limited conversation to collaborators Mar 31, 2024