Description
PDFs produced by Vectorizer.AI insert legal PDF comments (% to end of line) between the last xref table entry and the trailer keyword:
xref
0 9
0000000000 65535 f
0000000035 00000 n
...
0000008599 00000 n
% Trailer identifies number of objs, plus Root and Info objs
trailer
<<
/Size 9
/Root 1 0 R
/Info 6 0 R
>>
This is legal per the PDF specification (§7.2.3: "a comment consists of the % character followed by any characters up to and including the next EOL character [...] Comments may appear anywhere in a PDF file").
However, _read_standard_xref_table in pypdf/_reader.py calls read_non_whitespace() after the last xref entry, and that helper skips only whitespace, not comments. The % character is then misinterpreted, and parsing eventually reaches BooleanObject.read_from_stream(), which tries to match trai (the first four bytes of trailer) as true or false and raises:
pypdf.errors.PdfReadError: Could not read Boolean object
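To illustrate why a whitespace-only skip is not enough here, the following is a minimal standalone sketch. The function first_non_whitespace is a hypothetical stand-in written for this issue, not pypdf's actual read_non_whitespace(), and the whitespace set is an assumption based on the PDF whitespace characters.

```python
from io import BytesIO

# Assumed PDF whitespace set: space, LF, CR, tab, form feed, NUL.
WHITESPACE = b" \n\r\t\x0c\x00"


def first_non_whitespace(stream):
    """Hypothetical stand-in: skip whitespace only and return the next byte."""
    byte = stream.read(1)
    while byte and byte in WHITESPACE:
        byte = stream.read(1)
    return byte


# Last xref entry (20 bytes), then a comment line, then the trailer keyword.
tail = BytesIO(b"0000008599 00000 n \n% Trailer identifies ...\ntrailer\n")
tail.seek(20)  # position just after the last xref entry

print(first_non_whitespace(tail))  # prints b'%', not b't'
```

The parser is left at the start of the comment rather than at trailer, which is where the misinterpretation described above begins.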
Steps to reproduce
from pypdf import PdfReader
# Any PDF with a comment line between xref entries and "trailer"
reader = PdfReader("vectorizer_ai_output.pdf")
# → PdfReadError: Could not read Boolean object
Minimal inline reproducer:
from io import BytesIO
from pypdf import PdfReader
pdf_data = (
b"%%PDF-1.4\n"
b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
b"2 0 obj\n<< /Type /Pages /Count 1 /Kids [3 0 R] >>\nendobj\n"
b"3 0 obj\n<< /Type /Page /MediaBox [0 0 100 100] /Parent 2 0 R >>\nendobj\n"
b"xref\n0 4\n"
b"0000000000 65535 f \n"
b"%010d 00000 n \n"
b"%010d 00000 n \n"
b"%010d 00000 n \n"
b"%% This is a legal PDF comment\n"
b"trailer\n<< /Size 4 /Root 1 0 R >>\n"
b"startxref\n%d\n"
b"%%%%EOF\n"
)
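# Offsets are measured on the unformatted template, where the header's "%%"
# is one byte longer than the final "%", hence the "- 1" on each offset.
pdf_data = pdf_data % (
    pdf_data.find(b"1 0 obj") - 1,
    pdf_data.find(b"2 0 obj") - 1,
    pdf_data.find(b"3 0 obj") - 1,
    pdf_data.find(b"xref") - 1,
)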
reader = PdfReader(BytesIO(pdf_data)) # raises PdfReadError
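For comparing the failing case against a control, an alternative way to build the same file is sketched below. The helper build_pdf and its with_comment flag are hypothetical names introduced for this issue; it computes the offsets while assembling the bytes instead of relying on the "- 1" adjustment.

```python
def build_pdf(with_comment: bool) -> bytes:
    """Build a tiny PDF, optionally with a comment between the xref entries and `trailer`."""
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Count 1 /Kids [3 0 R] >>\nendobj\n",
        b"3 0 obj\n<< /Type /Page /MediaBox [0 0 100 100] /Parent 2 0 R >>\nendobj\n",
    ]
    body = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(body))  # byte offset where this object starts
        body += obj

    xref_offset = len(body)
    xref = b"xref\n0 4\n0000000000 65535 f \n"
    for offset in offsets:
        xref += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    if with_comment:
        xref += b"% This is a legal PDF comment\n"

    # "%%%%" in a bytes format string yields the literal "%%" of the EOF marker.
    trailer = (
        b"trailer\n<< /Size 4 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset
    )
    return body + xref + trailer
```

Passing BytesIO(build_pdf(True)) to PdfReader should hit the same code path as the reproducer above, while build_pdf(False) gives a file that differs only by the comment line.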
Environment
- pypdf version: 6.9.2 (also affects 5.3.1 and likely all versions)
- Python: 3.11+
Fix
The fix is to call skip_over_comment() (which already exists in pypdf) after read_non_whitespace() in the xref-to-trailer transition. PR incoming.
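The intended behavior can be sketched independently of pypdf as follows. This is not the actual patch: skip_whitespace_and_comments is a hypothetical helper, and the real change would presumably reuse pypdf's own read_non_whitespace() and skip_over_comment() rather than reimplementing them.

```python
from io import BytesIO
from typing import BinaryIO

# Assumed PDF whitespace set: space, LF, CR, tab, form feed, NUL.
PDF_WHITESPACE = b" \n\r\t\x0c\x00"


def skip_whitespace_and_comments(stream: BinaryIO) -> None:
    """Advance `stream` to the next byte that is neither whitespace nor inside a comment."""
    while True:
        byte = stream.read(1)
        if not byte:
            return  # end of stream
        if byte in PDF_WHITESPACE:
            continue
        if byte == b"%":
            # Consume the rest of the comment line, up to an EOL or end of stream.
            while byte and byte not in b"\r\n":
                byte = stream.read(1)
            continue
        stream.seek(-1, 1)  # step back so the caller sees this byte
        return


stream = BytesIO(b"\n% Trailer identifies number of objs\ntrailer\n<< /Size 9 >>")
skip_whitespace_and_comments(stream)
print(stream.read(7))  # prints b'trailer'
```

Looping until a byte is neither whitespace nor a comment start also covers several consecutive comment lines, which a single skip call would not.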