PdfReadError when xref table contains comments before trailer keyword

## Description

PDFs produced by [Vectorizer.AI](https://vectorizer.ai/) insert legal PDF comments (`%` to end of line) between the last xref table entry and the `trailer` keyword:

```
xref
0 9
0000000000 65535 f
0000000035 00000 n
...
0000008599 00000 n

% Trailer identifies number of objs, plus Root and Info objs
trailer
<<
  /Size 9
  /Root 1 0 R
  /Info 6 0 R
>>
```

This is legal per the PDF specification (ISO 32000-1, §7.2.3): any `%` outside a string or stream introduces a comment that runs to the end of the line, and a conforming reader is expected to ignore comments and treat them as a single white-space character.

However, `_read_standard_xref_table` in `pypdf/_reader.py` calls `read_non_whitespace()` after the last xref entry, and that helper skips whitespace but not comments. The `%` character is then misinterpreted, and parsing eventually reaches `BooleanObject.read_from_stream()`, which tries to match `trai` (from `trailer`) as `true`/`false` and raises:

```
pypdf.errors.PdfReadError: Could not read Boolean object
```
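To illustrate the mechanism without depending on pypdf internals, here is a minimal standalone sketch of a whitespace-only skip. The helper name `first_non_whitespace` and the `PDF_WHITESPACE` constant are hypothetical stand-ins written for this issue, not pypdf's actual code; the point is only that a scanner which knows about whitespace but not comments stops on `%` rather than on the `t` of `trailer`.

```python
from io import BytesIO

# Whitespace bytes as listed in the PDF specification (NUL, TAB, LF, FF, CR, SPACE).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def first_non_whitespace(stream):
    """Return the first non-whitespace byte (hypothetical helper, not pypdf code)."""
    byte = stream.read(1)
    # The `byte and` guard stops the loop at end of file, where read() returns b"".
    while byte and byte in PDF_WHITESPACE:
        byte = stream.read(1)
    return byte


# The bytes that follow the last xref entry in the example above.
tail = BytesIO(b"\n% Trailer identifies number of objs\ntrailer\n<< /Size 9 >>\n")
found = first_non_whitespace(tail)
print(found)  # b'%'
```

Because the scan lands on `%` instead of `t`, whatever runs next is parsing from the wrong position.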

## Steps to reproduce

```python
from pypdf import PdfReader

# Any PDF with a comment line between xref entries and "trailer"
reader = PdfReader("vectorizer_ai_output.pdf")
# → PdfReadError: Could not read Boolean object
```

Minimal inline reproducer:

```python
from io import BytesIO
from pypdf import PdfReader

pdf_data = (
    b"%%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 1 /Kids [3 0 R] >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /MediaBox [0 0 100 100] /Parent 2 0 R >>\nendobj\n"
    b"xref\n0 4\n"
    b"0000000000 65535 f \n"
    b"%010d 00000 n \n"
    b"%010d 00000 n \n"
    b"%010d 00000 n \n"
    b"%% This is a legal PDF comment\n"
    b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
    b"startxref\n%d\n"
    b"%%%%EOF\n"
)
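# The template spells the header as "%%PDF" (escaped for %-formatting), which
# collapses to "%PDF" once formatted. Every offset found in the template is
# therefore one byte too large, hence the "- 1" below.
pdf_data = pdf_data % (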
    pdf_data.find(b"1 0 obj") - 1,
    pdf_data.find(b"2 0 obj") - 1,
    pdf_data.find(b"3 0 obj") - 1,
    pdf_data.find(b"xref") - 1,
)
reader = PdfReader(BytesIO(pdf_data))  # raises PdfReadError
```

## Environment

- pypdf version: 6.9.2 (also affects 5.3.1 and likely all versions)
- Python: 3.11+

## Fix

The fix is to call `skip_over_comment()` (which already exists in pypdf) after `read_non_whitespace()` in the xref-to-trailer transition. PR incoming.
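As a rough illustration of the intended behaviour (not the actual patch), the sketch below skips any mix of whitespace and `%` comments and leaves the stream positioned on the next real token. The function name `skip_whitespace_and_comments` and the `PDF_WHITESPACE` constant are hypothetical and standalone; the real change would presumably reuse pypdf's existing helpers inside `_read_standard_xref_table`.

```python
from io import BytesIO

# Whitespace bytes as listed in the PDF specification (NUL, TAB, LF, FF, CR, SPACE).
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace_and_comments(stream):
    """Advance past whitespace and %-comments (hypothetical helper).

    Returns the next significant byte without consuming it, or b"" at end of file.
    """
    while True:
        byte = stream.read(1)
        if not byte:
            return b""  # end of file
        if byte in PDF_WHITESPACE:
            continue
        if byte == b"%":
            # A comment runs to the end of the line; consume up to CR or LF.
            while byte and byte not in b"\r\n":
                byte = stream.read(1)
            continue
        # First significant byte: step back so the caller can read the token.
        stream.seek(-1, 1)
        return byte


tail = BytesIO(b"\n% first comment\n  % second comment\r\ntrailer\n<< /Size 9 >>\n")
next_byte = skip_whitespace_and_comments(tail)
keyword = tail.read(7)
print(next_byte, keyword)  # b't' b'trailer'
```

Looping rather than skipping a single comment handles several consecutive comment lines, as well as comments separated by blank lines.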
