Page Label Access Fails when PDF PageLabels Number Tree Does Not Contain "/S"

## Explanation

The PDF spec supports page labeling (distinct from page _indexing_), and I am happy to see that this has been incorporated into the `PdfReader` class via the `page_labels` property.

However, the current implementation raises a `KeyError` in the edge case where `"/S"` is not defined in a page label dictionary (in which case the label should have no numeric portion at all). The implementation is also incomplete in that it ignores the `"/P"` (label prefix) and `"/St"` (starting number) keys in page label dictionaries.

## Environment

```bash
$ python -m platform
macOS-10.16-x86_64-i386-64bit

$ python --version
Python 3.8.5

$ python -c "import pypdf;print(pypdf.__version__)"
3.2.1
```

## Code + PDF

This is a minimal, complete example that shows the issue:

```python
from pypdf import PdfReader

# A PDF with somewhat strange (but legal) page labeling
reader = PdfReader("Numerical Mathematics.pdf")

print(reader.page_labels)  # raises KeyError: '/S'
```
PDF file:
[Numerical Mathematics.pdf](https://github.com/py-pdf/pypdf/files/10447593/Numerical.Mathematics.pdf)

## Traceback

This is the complete Traceback I see:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in page_labels
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_reader.py", line 1067, in <listcomp>
    return [page_index2page_label(self, i) for i in range(len(self.pages))]
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/_page_labels.py", line 164, in index2label
    return m[value["/S"]](index - start_index + 1)
  File "~/.virtualenvs/pdf/lib/python3.8/site-packages/pypdf/generic/_data_structures.py", line 274, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/S'
```

## Proposed Solution

The two modifications I would propose are:

1. In `pypdf._page_labels`, on line 164, each label dictionary in the `PageLabels` number tree should be read using the PDF spec defaults: `S = value.get("/S")`, `P = value.get("/P", "")`, and `St = value.get("/St", 1)`. The label should then be built from these values whether or not the keys were present: `return P + m[S](index - start_index + St)`.

2. In `pypdf._page_labels`, the dictionary `m` (defined starting on line 153) should gain an entry for the case where the `"/S"` key is absent, i.e. `None: lambda n: ""`, so that the page index is simply ignored and the label consists only of the prefix, if any.
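
To make the two modifications concrete, here is a minimal standalone sketch of the proposed lookup. It is not pypdf's actual code: `STYLES`, `label_for`, `to_roman`, and `to_letters` are hypothetical names, and plain Python dicts and strings stand in for pypdf's `DictionaryObject` and `NameObject` values.

```python
from typing import Callable, Dict, Optional


def to_roman(n: int) -> str:
    """Convert a positive integer to an uppercase Roman numeral."""
    numerals = [
        (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
        (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
        (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
    ]
    out = []
    for value, symbol in numerals:
        count, n = divmod(n, value)
        out.append(symbol * count)
    return "".join(out)


def to_letters(n: int) -> str:
    """A..Z for 1..26, then AA..ZZ for 27..52, and so on."""
    repeat, offset = divmod(n - 1, 26)
    return chr(ord("A") + offset) * (repeat + 1)


# Hypothetical stand-in for the dictionary `m`, extended with a None entry
# for label dictionaries that have no "/S" key.
STYLES: Dict[Optional[str], Callable[[int], str]] = {
    None: lambda n: "",  # no "/S": the label has no numeric portion
    "/D": str,
    "/R": to_roman,
    "/r": lambda n: to_roman(n).lower(),
    "/A": to_letters,
    "/a": lambda n: to_letters(n).lower(),
}


def label_for(value: dict, index: int, start_index: int) -> str:
    """Build the label for page `index` in a range starting at `start_index`."""
    style = value.get("/S")         # may be absent
    prefix = value.get("/P", "")    # default: no prefix
    first = value.get("/St", 1)     # default: numbering starts at 1
    return prefix + STYLES[style](index - start_index + first)
```

With this sketch, `label_for({"/P": "Cover"}, 0, 0)` returns `"Cover"` instead of raising `KeyError`, and `label_for({"/S": "/D", "/P": "A-", "/St": 5}, 2, 2)` returns `"A-5"`.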

It would also be great to see page labeling incorporated into the PdfWriter class to support writing PDFs with custom page labeling, but I realize that is more work and probably lower priority than fixing the PdfReader page labeling functionality.

