ROB: Deal with DictionaryObjects having streams of length 0 #3114

Closed
stefan6419846 wants to merge 1 commit into py-pdf:main from stefan6419846:issue3052

Conversation

@stefan6419846
Collaborator

Closes #3052.

@Likend
Contributor

Likend commented Oct 1, 2025

            if length > 0:
                data["__streamdata__"] = stream.read(length)
            elif length < 0:
                data["__streamdata__"] = read_until_regex(
                    stream, re.compile(b"endstream")
                )

...

        if "__streamdata__" in data:
            return StreamObject.initialize_from_dictionary(data)
        retval = DictionaryObject()
        retval.update(data)
        return retval

I think the problem lies in this piece of code. When the length is 0, neither branch runs, so "__streamdata__" is never added to data, and the object is parsed into a DictionaryObject instead of a StreamObject.

I ran pytest on Windows. The tests failed because tika-950337.pdf contains a length-0 content stream on page 2 (used by test_compress_raised in test_workflows.py), which is parsed as a DictionaryObject. However, compress_content_streams() needs to call the get_data() method of the content stream, so it fails.
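
To make the idea concrete, here is a minimal standalone sketch of one possible direction, not the actual pypdf code or the change in this PR. The helpers `read_stream_data` and `classify`, and the `has_stream_keyword` flag, are hypothetical names introduced only for illustration. The sketch assumes the parser already knows whether a `stream` keyword followed the dictionary, and treats a length of 0 as a valid, empty payload:

```python
import io
import re


def read_stream_data(stream, length):
    """Return the stream payload; a length of 0 yields an empty payload."""
    if length >= 0:
        # length == 0 returns b"", so the caller can still store the key.
        return stream.read(length)
    # Unknown length: take everything up to the ``endstream`` keyword.
    rest = stream.read()
    match = re.search(rb"endstream", rest)
    return rest if match is None else rest[: match.start()]


def classify(entries, stream, length, has_stream_keyword):
    """Hypothetical stand-in for the DictionaryObject/StreamObject decision."""
    data = dict(entries)
    if has_stream_keyword:
        # Always record the payload when a stream keyword was seen,
        # even if the payload is empty.
        data["__streamdata__"] = read_stream_data(stream, length)
    kind = "StreamObject" if "__streamdata__" in data else "DictionaryObject"
    return kind, data


kind, data = classify({"/Length": 0}, io.BytesIO(b"\nendstream"), 0, True)
print(kind, data["__streamdata__"])
```

The point of the sketch is only that the decision should depend on the presence of the key rather than on the payload being non-empty. A plain dictionary without a `stream` keyword is still classified as a dictionary.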

Development

Successfully merging this pull request may close these issues.

Length-0 streams are read incorrectly, which breaks some PDFs

2 participants