BUG: Fix stale object cache from non-authoritative object streams #3698

Merged
stefan6419846 merged 5 commits into py-pdf:main from astahlman:astahlman/fix-objstm-stale-cache
Mar 26, 2026

Conversation

@astahlman
Contributor

Closes #3697

The batch-parse optimization (#3677) caches every object found when decompressing an object stream. The guard intended to skip overridden objects checked `obj_num in self.xref_objStm`, but this passes for any compressed object, not just ones that belong to the current stream.

In incrementally updated PDFs, the same object can appear in multiple object streams across revisions (per the PDF 1.7 spec, §7.5.6). The xref designates one stream as authoritative. Decompressing a stale stream (e.g. to read a co-located AcroForm dict) would cache the old version of the object, shadowing the current one.

In practice this causes filled-in form field values to silently disappear when reading PDFs saved by form-filling software.

Fix: a one-line change that only caches an object when `xref_objStm` points it at the stream being decompressed.
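
To make the difference between the two guards concrete, here is a minimal sketch that uses plain dictionaries instead of pypdf's reader. Everything in it is an assumption made for illustration: the helper `cache_stream_objects`, the `strict` flag, the sample data, and the layout of the xref table as `object number -> (owning stream number, index)`. It is not the code changed in this PR.

```python
# Hypothetical stand-in for the reader's state; not pypdf's actual code.
# The xref table maps object number -> (owning stream number, index in stream).


def cache_stream_objects(stream_num, parsed_objects, xref_objstm, cache, strict=True):
    """Cache objects parsed from one object stream.

    strict=False mimics a membership-only guard.
    strict=True also requires that the xref assigns the object to this stream.
    """
    for obj_num, value in parsed_objects.items():
        entry = xref_objstm.get(obj_num)
        if entry is None:
            continue  # not a compressed object according to the xref
        if strict and entry[0] != stream_num:
            continue  # stale copy: the xref points at a different stream
        # Simplification: an object that is already cached is never replaced.
        cache.setdefault(obj_num, value)


# Object 5 (a form field) was rewritten into stream 20 by a later revision,
# but an old copy still sits in stream 10 next to object 7.
XREF = {5: (20, 0), 7: (10, 1)}
STREAM_10 = {5: "old field value", 7: "AcroForm dict"}
STREAM_20 = {5: "new field value"}


def read_all(strict):
    cache = {}
    # Stream 10 is decompressed first, e.g. to read object 7.
    cache_stream_objects(10, STREAM_10, XREF, cache, strict)
    cache_stream_objects(20, STREAM_20, XREF, cache, strict)
    return cache


old_guard = read_all(strict=False)
new_guard = read_all(strict=True)
print(old_guard[5])  # old field value
print(new_guard[5])  # new field value
```

With the membership-only guard, the stale copy of object 5 is cached while stream 10 is being read for object 7, and it then shadows the current version. With the ownership check, object 5 is only cached from stream 20.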

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Union[str, bytes] instead of str | bytes since the file
does not use `from __future__ import annotations`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
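
For context on the `Union[str, bytes]` commit: without `from __future__ import annotations`, interpreters older than Python 3.10 evaluate `str | bytes` when the function is defined and raise a `TypeError`. The sketch below shows the portable spelling. The helper `to_bytes` is hypothetical and not part of this PR, and the assumption that the file must still import on such interpreters is not confirmed here.

```python
from typing import Union


def to_bytes(data: Union[str, bytes]) -> bytes:
    """Hypothetical helper: accept text or bytes and return bytes."""
    if isinstance(data, str):
        return data.encode("utf-8")
    return data


print(to_bytes("abc"))
print(to_bytes(b"abc"))
```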
@astahlman
Contributor Author

Looks like a transient failure downloading one of the test inputs. Maybe this test is known to be flaky? Unfortunately, I don't have permission to re-trigger the CI job.

=================================== FAILURES ===================================
______________________________ test_iss1615_1673 _______________________________
[gw2] linux -- Python 3.12.13 /opt/hostedtoolcache/Python/3.12.13/x64/bin/python

    @pytest.mark.enable_socket
    def test_iss1615_1673():
        """
        Test cases where /N is not indicating chains of objects
        test also where /N,... are not part of chains
        """
        # #1615
        url = "https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf"
        name = "graph_letter.pdf"
>       reader = PdfReader(BytesIO(get_data_from_url(url, name=name)))

tests/test_generic.py:998: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/__init__.py:73: in get_data_from_url
    cache_path.write_bytes(_get_data_from_url(url))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

url = 'https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf'

    def _get_data_from_url(url: str) -> bytes:
        ssl._create_default_https_context = ssl._create_unverified_context
        attempts = 0
        while attempts < 3:
            try:
                with urllib.request.urlopen(  # noqa: S310
                        url
                ) as response:
                    return response.read()
            except HTTPError as e:
                if attempts < 3:
                    attempts += 1
                else:
                    raise e
>       raise ValueError(f"Unknown error handling {url}")
E       ValueError: Unknown error handling https://github.com/py-pdf/pypdf/files/10671366/graph_letter.pdf

tests/__init__.py:37: ValueError

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.43%. Comparing base (88eb5be) to head (af8cf4d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3698   +/-   ##
=======================================
  Coverage   97.43%   97.43%           
=======================================
  Files          55       55           
  Lines       10008    10009    +1     
  Branches     1839     1839           
=======================================
+ Hits         9751     9752    +1     
  Misses        149      149           
  Partials      108      108           


Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Collaborator

@stefan6419846 stefan6419846 left a comment

Thanks.

@stefan6419846 stefan6419846 merged commit 4d8ebce into py-pdf:main Mar 26, 2026
30 of 32 checks passed

Development

Successfully merging this pull request may close these issues.

Object stream batch-parse caches stale objects from non-authoritative streams

2 participants