Page Label Access Fails when PDF PageLabels Number Tree Does Not Contain "/S" #1560

@jonahmajumder

Explanation

The PDF spec supports page labeling (distinct from page indexing), and I am happy to see that this has been incorporated into the PdfReader class via the page_labels property.

However, the current implementation raises a KeyError in the edge case where "/S" is not defined in a page label dictionary. Per the PDF spec, a missing "/S" means the label has no numeric portion and consists solely of the prefix, so the page index should not appear in the label at all. The current implementation is also incomplete in that it ignores the "/P" (label prefix) and "/St" (starting number) keys in page label dictionaries.
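To make the edge case concrete, here is a minimal sketch of how a page index maps to its page label dictionary. It uses plain Python tuples and dicts as stand-ins for the flattened "/Nums" array of the PageLabels number tree; the names `NUMS` and `find_range` are hypothetical and are not part of pypdf, and the sample entries are invented for illustration.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical flattened /Nums array: (first page index, label dictionary).
# The first entry has no "/S" key, which is legal: its label is the prefix only.
NUMS: List[Tuple[int, Dict[str, object]]] = [
    (0, {"/P": "Cover"}),             # page 0: prefix only, no numeric portion
    (1, {"/S": "/r"}),                # pages 1-4: lowercase roman numerals
    (5, {"/S": "/D", "/St": 1}),      # pages 5+: decimal, starting at 1
]


def find_range(nums: List[Tuple[int, Dict[str, object]]], index: int):
    """Return the (start_index, dictionary) pair whose range covers `index`."""
    starts = [start for start, _ in nums]
    # The covering range is the last one whose start is <= index.
    pos = bisect_right(starts, index) - 1
    if pos < 0:
        raise ValueError(f"no page label range covers index {index}")
    return nums[pos]


start, entry = find_range(NUMS, 0)
# entry["/S"] would raise KeyError here, which mirrors the reported failure;
# entry.get("/S") returns None instead.
print(start, entry.get("/S"), entry.get("/P", ""))
```

Looking the key up with `.get` rather than subscripting is the essential difference: the dictionary for page 0 is valid even though it has no "/S" entry.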

Environment

$ python -m platform
macOS-10.16-x86_64-i386-64bit

$ python --version
Python 3.8.5

$ python -c "import pypdf;print(pypdf.__version__)"
3.2.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

reader = PdfReader('Numerical Mathematics.pdf') # my pdf file with somewhat strange (but legal) page labeling

print(reader.page_labels)

PDF file:
Numerical Mathematics.pdf

Traceback

This is the complete Traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_page_labels.py", line 164, in index2label
    return m[value["/S"]](index - start_index + 1)
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/generic/_data_structures.py", line 274, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/S'

Proposed Solution

The two modifications I would propose are:

  1. In pypdf._page_labels, on line 164, the dictionary found for a page index in the PageLabels number tree should be read using the PDF spec defaults: S = value.get("/S"), P = value.get("/P", ""), and St = value.get("/St", 1). The label should then be built from these values regardless of whether the keys were present: return P + m[S](index - start_index + St)

  2. In pypdf._page_labels, in the dictionary m (defined starting on line 153), an entry should be added for the case where the "/S" key is absent, i.e.: None: lambda n: '' so that when "/S" is not included, the page index is simply ignored and only the prefix is returned.
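The two modifications above can be sketched together as follows. This is a standalone illustration under stated assumptions, not pypdf's actual code: plain dicts stand in for pypdf dictionary objects, and the names `STYLES`, `label_for`, `to_roman`, and `to_letters` are hypothetical. The letter style follows the spec's description (A to Z, then AA to ZZ, and so on).

```python
from typing import Callable, Dict, Optional


def to_roman(n: int) -> str:
    """Uppercase roman numeral for a positive integer."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, then AAA..ZZZ, and so on."""
    repeat, offset = divmod(n - 1, 26)
    return chr(ord("A") + offset) * (repeat + 1)


# Numbering styles keyed by the "/S" value. The None entry covers a missing
# "/S": the label then has no numeric portion.
STYLES: Dict[Optional[str], Callable[[int], str]] = {
    None: lambda n: "",
    "/D": str,
    "/R": to_roman,
    "/r": lambda n: to_roman(n).lower(),
    "/A": to_letters,
    "/a": lambda n: to_letters(n).lower(),
}


def label_for(entry: dict, index: int, start_index: int) -> str:
    """Build the label for page `index` in the range starting at `start_index`."""
    style = entry.get("/S")        # no default style
    prefix = entry.get("/P", "")   # default: no prefix
    first = entry.get("/St", 1)    # default: numbering starts at 1
    return prefix + STYLES[style](index - start_index + first)


print(label_for({"/P": "Cover"}, 0, 0))
print(label_for({"/S": "/r"}, 2, 0))
print(label_for({"/S": "/D", "/P": "A-", "/St": 8}, 5, 4))
```

With the defaults applied through `.get`, a dictionary that omits "/S" no longer raises and yields just its prefix, while "/P" and "/St" take effect whenever they are present.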

It would also be great to see page labeling incorporated into the PdfWriter class to support writing PDFs with custom page labeling, but I realize that is more work and probably lower priority than fixing the PdfReader page labeling functionality.

Metadata

    Labels

    is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
