BUG: Fix stale object cache from non-authoritative object streams by astahlman · Pull Request #3698 · py-pdf/pypdf

astahlman · 2026-03-25T15:32:24Z

The batch-parse optimization (#3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked obj_num in self.xref_objStm, but this passes for any compressed object — not just ones that belong to the current stream.

In incrementally-updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

In practice this causes filled-in form field values to silently disappear when reading PDFs saved by form-filling software.

Fix: one-line change — only cache when xref_objStm points the object at the stream being decompressed.
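To make the difference between the two guards concrete, here is a minimal, self-contained sketch. It is not pypdf's actual code: `cache_stream_objects`, `read_object_7`, the plain dicts, and the `strict_guard` flag are hypothetical stand-ins, and it assumes the xref table maps each compressed object number to a `(stream number, index)` pair.

```python
from typing import Any, Dict, Tuple

# Hypothetical model of the reader state; names echo the PR text only.
XrefObjStm = Dict[int, Tuple[int, int]]  # obj_num -> (stream_num, index), assumed layout


def cache_stream_objects(
    stream_num: int,
    stream_objects: Dict[int, Any],
    xref_objstm: XrefObjStm,
    cache: Dict[int, Any],
    strict_guard: bool,
) -> None:
    """Cache the objects found while decompressing one object stream."""
    for obj_num, value in stream_objects.items():
        if strict_guard:
            # Fixed guard: the xref must point this object at *this* stream.
            entry = xref_objstm.get(obj_num)
            authoritative = entry is not None and entry[0] == stream_num
        else:
            # Old guard: true for any compressed object, whichever stream owns it.
            authoritative = obj_num in xref_objstm
        if authoritative and obj_num not in cache:
            cache[obj_num] = value


# Object 7 appears in stream 10 (old revision) and stream 20 (current revision).
# The xref designates stream 20 as authoritative for object 7.
xref_objstm: XrefObjStm = {5: (10, 0), 7: (20, 0)}
old_stream = {5: "AcroForm dict", 7: "field value: <empty>"}
new_stream = {7: "field value: 'Alice'"}


def read_object_7(strict_guard: bool) -> Any:
    cache: Dict[int, Any] = {}
    # Reading object 5 first decompresses the stale stream 10.
    cache_stream_objects(10, old_stream, xref_objstm, cache, strict_guard)
    if 7 not in cache:
        cache_stream_objects(20, new_stream, xref_objstm, cache, strict_guard)
    return cache[7]


buggy = read_object_7(strict_guard=False)
fixed = read_object_7(strict_guard=True)
```

With the old guard, `buggy` holds the stale value from stream 10; with the stricter guard, `fixed` holds the current value from stream 20.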

The batch-parse optimization (added in py-pdf#3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked `obj_num in self.xref_objStm`, but this passes for any compressed object, not just ones that belong to the current stream.

In incrementally-updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

Fix: only cache when `xref_objStm` points the object at the stream being decompressed.

Closes py-pdf#3697

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use `Union[str, bytes]` instead of `str | bytes` since the file does not use `from __future__ import annotations`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
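For context on the annotation change: on Python 3.9, an annotation such as `str | bytes` is evaluated when the function is defined and raises `TypeError`, because the `|` operator on types only arrived in Python 3.10 (PEP 604). Postponing annotation evaluation with `from __future__ import annotations` avoids this; otherwise `typing.Union` is the portable spelling. A small sketch, where `to_bytes` is a hypothetical helper and not part of pypdf:

```python
from typing import Union, get_type_hints


def to_bytes(value: Union[str, bytes]) -> bytes:
    """Hypothetical helper: accept text or bytes, always return bytes."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


# The annotation resolves on every supported Python version,
# with or without postponed evaluation of annotations.
hints = get_type_hints(to_bytes)
```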

astahlman · 2026-03-25T15:55:31Z

Looks like a transient failure downloading one of the test inputs. Maybe this test is known to be flaky? I don't have permission to re-trigger the CI job, unfortunately.

=================================== FAILURES ===================================
______________________________ test_iss1615_1673 _______________________________
[gw2] linux -- Python 3.12.13 /opt/hostedtoolcache/Python/3.12.13/x64/bin/python

    @pytest.mark.enable_socket
    def test_iss1615_1673():
        """
        Test cases where /N is not indicating chains of objects
        test also where /N,... are not part of chains
        """
        # #1615
        url = "https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf"
        name = "graph_letter.pdf"
>       reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))

tests/test_generic.py:998: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/__init__.py:73: in get_data_from_url
    cache_path.write_bytes(_get_data_from_url(url))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

url = 'https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf'

    def _get_data_from_url(url: str) -> bytes:
        ssl._create_default_https_context = ssl._create_unverified_context
        attempts = 0
        while attempts < 3:
            try:
                with urllib.request.urlopen(  # noqa: S310
                        url
                ) as response:
                    return response.read()
            except HTTPError as e:
                if attempts < 3:
                    attempts += 1
                else:
                    raise e
>       raise ValueError(f"Unknown error handling {url}")
E       ValueError: Unknown error handling https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf

tests/__init__.py:37: ValueError

codecov · 2026-03-25T18:17:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.43%. Comparing base (88eb5be) to head (af8cf4d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3698   +/-   ##
=======================================
  Coverage   97.43%   97.43%           
=======================================
  Files          55       55           
  Lines       10008    10009    +1     
  Branches     1839     1839           
=======================================
+ Hits         9751     9752    +1     
  Misses        149      149           
  Partials      108      108

stefan6419846

Thanks.

astahlman mentioned this pull request Mar 25, 2026

Object stream batch-parse caches stale objects from non-authoritative streams #3697

Closed

Fix type annotation for Python 3.9 compat

add3eec

astahlman added 2 commits March 25, 2026 12:05

Trigger CI re-run

889ce66

Merge remote-tracking branch 'origin/main' into astahlman/fix-objstm-…

2e8e305

…stale-cache

stefan6419846 reviewed Mar 26, 2026

tests/test_reader.py (outdated, resolved)

stefan6419846 reviewed Mar 26, 2026

tests/test_reader.py (outdated, resolved)

improve comments

af8cf4d

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

stefan6419846 approved these changes Mar 26, 2026

stefan6419846 merged commit 4d8ebce into py-pdf:main Mar 26, 2026
30 of 32 checks passed

stefan6419846 merged 5 commits into py-pdf:main from astahlman:astahlman/fix-objstm-stale-cache
