Explanation
The PDF spec supports page labeling (distinct from page indexing), and I am happy to see that this has been incorporated into the PdfReader class via the page_labels property.
However, the current implementation raises a KeyError in the edge case where "/S" is not defined in a page label dictionary (in which case the label should contain no representation of the current page index). The current implementation is also incomplete in that it ignores the "/P" and "/St" keys in page label dictionaries.
Environment
$ python -m platform
macOS-10.16-x86_64-i386-64bit
$ python --version
Python 3.8.5
$ python -c "import pypdf;print(pypdf.__version__)"
3.2.1
Code + PDF
This is a minimal, complete example that shows the issue:
from pypdf import PdfReader
reader = PdfReader('Numerical Mathematics.pdf') # my pdf file with somewhat strange (but legal) page labeling
print(reader.page_labels)
PDF file:
Numerical Mathematics.pdf
Traceback
This is the complete Traceback I see:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_page_labels.py", line 164, in index2label
    return m[value["/S"]](index - start_index + 1)
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/generic/_data_structures.py", line 274, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/S'
Proposed Solution
The two modifications I would propose are:
- In pypdf._page_labels, on line 164, the dictionaries corresponding to page indices in the PageLabels number tree should be parsed in a way that incorporates the PDF spec defaults: S = value.get("/S"), P = value.get("/P", ""), and St = value.get("/St", 1). These values should then be used for labeling regardless of whether their keys were present: return P + m[S](index - start_index + St)
- In pypdf._page_labels, in the dictionary m (defined starting on line 153), an entry should be added for the case where the "/S" key is not included, i.e. None: lambda n: '', so that when "/S" is not included, the page index is simply ignored.
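To make the two modifications concrete, here is a minimal standalone sketch of the labeling logic I have in mind. It is not pypdf's actual code: it uses plain Python dicts in place of pypdf's dictionary objects, and the names STYLES, to_roman, to_alpha, and label_for are hypothetical, chosen only for illustration. The style keys follow my reading of the PDF spec ("/D" decimal, "/R" and "/r" Roman, "/A" and "/a" letters).

```python
def to_roman(n: int) -> str:
    """Uppercase Roman numeral for a positive integer."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    parts = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def to_alpha(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, then AAA..ZZZ, and so on."""
    repeat, offset = divmod(n - 1, 26)
    return chr(ord("A") + offset) * (repeat + 1)


# Hypothetical stand-in for the dictionary m in pypdf._page_labels.
# The None entry handles a page label dictionary without "/S":
# the numeric portion of the label is then empty.
STYLES = {
    None: lambda n: "",
    "/D": str,
    "/R": to_roman,
    "/r": lambda n: to_roman(n).lower(),
    "/A": to_alpha,
    "/a": lambda n: to_alpha(n).lower(),
}


def label_for(value: dict, index: int, start_index: int) -> str:
    """Label for page `index`, given the label dictionary `value` whose
    range starts at page `start_index` (both indices are zero-based)."""
    style = value.get("/S")        # no default style: prefix-only label
    prefix = value.get("/P", "")   # default: no prefix
    start = value.get("/St", 1)    # default: numbering starts at 1
    return prefix + STYLES[style](index - start_index + start)


# A range starting at page index 4 with prefix "A-" and numbering from 8:
print(label_for({"/S": "/D", "/P": "A-", "/St": 8}, 5, 4))  # A-9
# A range with only a prefix and no "/S" key, which currently raises KeyError:
print(label_for({"/P": "Cover"}, 0, 0))  # Cover
```

Using .get() with defaults means a dictionary that omits any of the three keys is handled by the same code path as one that defines them all.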
It would also be great to see page labeling incorporated into the PdfWriter class to support writing PDFs with custom page labeling, but I realize that is more work and probably lower priority than fixing the PdfReader page labeling functionality.