PdfReadError when xref table contains comments before trailer keyword #3709

@rassie

Description

PDFs produced by Vectorizer.AI insert legal PDF comments (% to end of line) between the last xref table entry and the trailer keyword:

xref
0 9
0000000000 65535 f
0000000035 00000 n
...
0000008599 00000 n

% Trailer identifies number of objs, plus Root and Info objs
trailer
<<
  /Size 9
  /Root 1 0 R
  /Info 6 0 R
>>

This is legal per the PDF specification (ISO 32000-1, §7.2.3 "Comments"): any % outside a string or stream introduces a comment that runs to the end of the line, and a conforming reader is required to ignore comments, treating each one as a single white-space character.

However, _read_standard_xref_table in pypdf/_reader.py calls read_non_whitespace() after the last xref entry, and that helper skips only whitespace, not comments. The % character is then misinterpreted, and parsing eventually reaches BooleanObject.read_from_stream(), which tries to match trai (the first four bytes of trailer) against true/false and raises:

pypdf.errors.PdfReadError: Could not read Boolean object
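
The first step of that failure can be illustrated without pypdf. The sketch below uses a hypothetical stand-in, first_non_whitespace, that mimics what a whitespace-only skipper does; it is an assumption for illustration, not pypdf's actual code. Fed the bytes that follow the last xref entry, it stops on the % of the comment rather than on the t of trailer.

```python
from io import BytesIO

# PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def first_non_whitespace(stream):
    """Hypothetical stand-in for a whitespace-only skipper (not pypdf's code).

    Returns the first byte that is not PDF white space, or b"" at end of stream.
    """
    while True:
        byte = stream.read(1)
        # Check EOF explicitly: b"" counts as "in" any bytes object.
        if byte == b"" or byte not in PDF_WHITESPACE:
            return byte


# Bytes that follow the last xref entry in the Vectorizer.AI-style layout.
tail = BytesIO(b"\n% Trailer identifies number of objs\ntrailer\n<< /Size 9 >>\n")
found = first_non_whitespace(tail)
print(found)  # b'%'
```

Because the returned byte is % and not t, whatever code runs next sees comment text where it expects the trailer keyword.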

Steps to reproduce

from pypdf import PdfReader

# Any PDF with a comment line between xref entries and "trailer"
reader = PdfReader("vectorizer_ai_output.pdf")
# → PdfReadError: Could not read Boolean object

Minimal inline reproducer:

from io import BytesIO
from pypdf import PdfReader

pdf_data = (
    b"%%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 1 /Kids [3 0 R] >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /MediaBox [0 0 100 100] /Parent 2 0 R >>\nendobj\n"
    b"xref\n0 4\n"
    b"0000000000 65535 f \n"
    b"%010d 00000 n \n"
    b"%010d 00000 n \n"
    b"%010d 00000 n \n"
    b"%% This is a legal PDF comment\n"
    b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
    b"startxref\n%d\n"
    b"%%%%EOF\n"
)
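# Every literal "%" is doubled in the template for bytes %-formatting. Only the
# header's "%%" precedes the markers searched for below, and it collapses to a
# single "%", so each offset measured in the template is one byte too large.
pdf_data = pdf_data % (
    pdf_data.find(b"1 0 obj") - 1,
    pdf_data.find(b"2 0 obj") - 1,
    pdf_data.find(b"3 0 obj") - 1,
    pdf_data.find(b"xref") - 1,  # first match is the xref keyword, not "startxref"
)
reader = PdfReader(BytesIO(pdf_data))  # raises PdfReadError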

Environment

  • pypdf version: 6.9.2 (also affects 5.3.1 and likely all versions)
  • Python: 3.11+

Fix

The fix is to call skip_over_comment() (which already exists in pypdf) after read_non_whitespace() in the xref-to-trailer transition. PR incoming.
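
As a rough, standalone illustration of the behaviour the fix aims for (not the actual patch), the sketch below alternates between skipping white space and skipping comments until it reaches a significant byte. The helper name skip_whitespace_and_comments is hypothetical and deliberately avoids pypdf internals; the real change would reuse pypdf's existing skip_over_comment() as described above.

```python
from io import BytesIO

# PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace_and_comments(stream):
    """Hypothetical helper: leave `stream` positioned on the next byte that is
    neither white space nor part of a % comment (or at end of stream)."""
    while True:
        byte = stream.read(1)
        if byte == b"":
            return  # end of stream, nothing left to skip
        if byte in PDF_WHITESPACE:
            continue
        if byte == b"%":
            # Consume the rest of the comment, up to and including the first EOL byte.
            while byte not in (b"", b"\r", b"\n"):
                byte = stream.read(1)
            continue
        stream.seek(-1, 1)  # step back onto the significant byte
        return


tail = BytesIO(b"\n% first comment\n  % second comment\r\ntrailer\n<< /Size 9 >>\n")
skip_whitespace_and_comments(tail)
print(tail.read(7))  # b'trailer'
```

Looping rather than skipping a single comment handles several consecutive comment lines, which the specification equally permits.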

Labels: is-robustness-issue (from a user's perspective, this is about robustness)
