Comparing changes

base repository: py-pdf/pypdf
base: 6.5.0
head repository: py-pdf/pypdf
compare: 6.6.0
  • 16 commits
  • 54 files changed
  • 4 contributors

Commits on Dec 22, 2025

  1. b6dedd3
  2. 3a82cd3
  3. 1e04c1c
  4. ad99ef8

Commits on Dec 23, 2025

  1. 8e1ccea

Commits on Dec 31, 2025

  1. DEP: Block common page content modifications when assigned to reader (#3582)
    
    Closes #2260.
    
    This would previously lead to pages being written uncompressed, although
    the corresponding dictionary header declared the filter to be
    FlateDecode. As a PdfReader is considered to be read-only, this change
    seems like the most suitable one for fixing this.
    
    With this change, you might need to change your own code that previously
    relied on the more or less broken functionality, which is especially
    bad for shadow processing.
    
    Possible approaches to fix user code:
    
      * Use `PdfWriter(clone_from=...)` to add all pages to the writer.
      * Use the return value of `writer.add_page(page_from_reader)` which
        correctly belongs to the writer to apply possible modifications
        after adding the page itself (instead of before adding it).
    stefan6419846 authored Dec 31, 2025
    08e951d
  2. bda80a4
  3. 97d47a0

Commits on Jan 5, 2026

  1. MAINT: Fix compatibility with Pillow >= 12.1.0 (#3590)

    `PIL.Image.Image.getdata()` has been deprecated. Additionally, the type
    hints in `types-Pillow` were outdated and replaced by the official type
    hints.
    stefan6419846 authored Jan 5, 2026
    6951bb7
  2. a65708c

Commits on Jan 7, 2026

  1. MAINT: Converge on one shared Font class for text extraction and appearance streams (#3583)
    
    * MAINT: Port AppearanceStream to the Font class
    
    This patch ports the AppearanceStream to the new Font class.
    This is in hardly any way different from the original code,
    except making sure that a default width is set in the
    character widths for the 14 Adobe core fonts. This is not in
    fact necessary at this point, but will be when the Font class
    sets default width itself, and other code begins to depend on
    that.
    
    * ENH: Also collect character widths when encoding is a string
    
    This patch ensures that character widths are collected
    correctly also for fonts that have encoding defined as
    a string.
    
    * MAINT: Fix test output after fixing character widths
    
    Previously, character widths were not computed for Type1 and
    TrueType fonts when encoding was a string. For one test, that is,
    test_text_extraction_layout_mode in tests/test_workflows.py, this
    meant that all character widths were treated as one space width.
    
    Now, the real widths are used, which changes the output of the
    test significantly, but in keeping with the intended output. This
    patch implements the new output.
    
    To be sure, I counted the number of newlines and words in both
    versions, and they are exactly the same, so no spaces were
    accidentally omitted between words in the new version, nor were
    they added, since the new version has fewer spaces than the old
    one.
    
    * ENH: FontDescriptor: Add default width to character widths
    
    * ENH: Use space width from own calculations
    
    This patch makes sure that the Font class has a way
    to compute space_width.
    
    * MAINT: FontDescriptor: Remove superfluous if condition
    
    The FontDescriptor code deals with fonts by type. After
    having dealt with Type1, MMType1, Type3 and TrueType fonts,
    it is not necessary to check if the remaining fonts are
    CID or composite fonts, they all are.
    
    * MAINT: Layout mode text extraction: Port to new Font class
    
    This patch ports the layout mode text extraction code to the new font class.
    
    This introduces one test failure, which itself appears to derive from a
    misconception about space width in the original Font class.
    
    Previously, a layout mode font was initialized in _page.py as follows:
      fonts[font_name] = _layout_mode.Font(*cmap, font_dict)
    
    *cmap, in this case was the return value of build_char_map, which consists of:
      - Font sub-type;
      - Space_width criteria (50% of width);
      - Encoding map;
      - Character-map; and
      - Font-dictionary
    
    Notice that build_char_map does _not_ return the width of a space, but the
    width of _half_ a space. However, if we look at the arguments to the layout
    mode Font class, clearly the class expects to be passed the full width of a
    space. This is also clear from the word_width method in the layout mode Font
    class, which substitutes a missing width with 2 * space_width. It follows
    that the layout mode Font class _expected_ to be passed a full space width,
    but really was only passed the width of half a space.
    
    When porting to the new Font class, this becomes problematic when calculating
    text width, because the new Font class uses self.character_widths["default"]
    as a fallback for a missing width, which is approximately (and in many cases
    exactly) the width of two spaces. This in turn causes problems with text
    extraction in cases where the width of a space itself is missing ("The Crazy
    Ones"), and cases where a font with a missing character width is calculated
    wider than before.
    
    For the first issue, this patch introduces a work-around that also exists in
    the conventional text extraction code, that is, dealing with missing space
    width separately.
    
    The second issue causes one test to fail:
    test_layout_mode_text_state in tests/test_text_extraction.py. This is
    entirely due to the existence of a unicode private range character in the
    file.
    
    * MAINT: tests/test_text_extraction.py: Ignore whitespace in test_layout_mode_text_state
    
    For several reasons, the output of the test_layout_mode_text_state test has changed
    significantly with changing to the new Font class. Here's why:
    
    1. The original layout mode Font class set a space width that was actually only
    half a space wide. In computing word width, a default fallback value was used of
    "self.space_width * 2", which in reality was just the width of one space.
    
    2. The new Font class uses "self.character_widths["default"]" as a fallback value
    for calculating word width. This value is calculated as follows:
      - If a missing width is defined in a Font's font descriptor, set that as default
        width
      - Else if the width of a space is defined in a Font's character widths and it
        is not zero, set the width of two spaces as default width
      - Else calculate the average of all character widths and set that as default
        width
    
    For the document in test_layout_mode_text_state, this results in very different
    default character widths. The original Font class set a space width of
    125 and used 250 as a fallback width. The new Font class reads a value
    of 1000 from the missing width in a font descriptor.
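
The fallback order described in point 2 can be sketched as below. This is not the Font class itself: the function name, its parameters, and the handling of an empty width table are assumptions made for illustration.

```python
from typing import Optional


def default_character_width(
    character_widths: dict[str, float],
    missing_width: Optional[float] = None,
) -> float:
    """Pick a fallback width following the three-step order described above."""
    # 1. A missing width from the font descriptor wins if it is defined.
    if missing_width is not None:
        return missing_width
    # 2. Otherwise use the width of two spaces, if a non-zero space width exists.
    space_width = character_widths.get(" ", 0)
    if space_width != 0:
        return 2 * space_width
    # 3. Otherwise fall back to the average of all known character widths.
    if not character_widths:
        return 0.0  # assumption: nothing to average, so no usable default
    return sum(character_widths.values()) / len(character_widths)


# Hypothetical inputs exercising each branch in turn.
print(default_character_width({" ": 250}, missing_width=1000))
print(default_character_width({" ": 250}))
print(default_character_width({"a": 400, "b": 600}))
```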
    
    The document contains one character from a private unicode range, the width of
    which is not defined. This character appears a number of times throughout the
    document. As a result, this character's width is calculated much wider with the
    new code than with the old code. In all other respects, though, the output is the
    same. So, the test_layout_mode_text_state's test goal - seeing whether a font
    change within a q context is addressed correctly - still holds.
    
    The expected output of this test is stored as a user attachment on github.
    Instead of replacing the document, just remove the space characters from the
    rendered output and check the result. This makes the test pass while keeping
    its intended purpose.
    
    * MAINT: Remove specific Font class from layout mode
    
    * MAINT: _page.py: Add asdict for json dumps of fonts
    
    The layout-mode Font had a to_dict method that returned a json
    object of a Font object. Instead of directly copying that method,
    use asdict() from dataclasses in _page.py and use asdict in the
    associated test in test_text_extraction.
    
    I DID NOT TEST THE DEBUG CODE IN PAGE.PY MYSELF.
    
    * ROB: FontDescriptor: Don't raise error on invalid font widths
    
    Previously, when encountering an out-of-bounds width read
    while parsing character widths for a CID / composite font,
    we raised a ParseError. However, this is incompatible with
    the non-layout text extraction code. This patch emits a
    warning instead, just like _cmap.py does in the same case.
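
As a rough illustration of warning instead of raising, the sketch below parses a CID font `/W` array (either `c [w1 w2 ...]` or `c_first c_last w` groups) and stops with a logged warning when a group is truncated. The function name and the plain-list input are hypothetical; this is not the FontDescriptor code.

```python
import logging

logger = logging.getLogger(__name__)


def parse_cid_widths(w_array: list) -> dict[int, float]:
    """Parse a /W-style array, warning (not raising) on truncated input."""
    widths: dict[int, float] = {}
    index = 0
    while index < len(w_array):
        start = w_array[index]
        if index + 1 >= len(w_array):
            logger.warning("Truncated width array after code %s", start)
            break
        second = w_array[index + 1]
        if isinstance(second, list):
            # Form "c [w1 w2 ...]": consecutive codes starting at c.
            for offset, width in enumerate(second):
                widths[start + offset] = width
            index += 2
        else:
            # Form "c_first c_last w": one width for the whole range.
            if index + 2 >= len(w_array):
                logger.warning("Missing width for range %s-%s", start, second)
                break
            for code in range(start, second + 1):
                widths[code] = w_array[index + 2]
            index += 3
    return widths
```

Widths parsed before the broken group are kept, so extraction can continue with the default width for everything else.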
    
    * MAINT: Text extraction: Pass along fonts with cmaps
    
    This patch makes sure that Font classes are initialised and then
    passed along to the TextExtractor.
    
    * MAINT: TextExtractor: Set space_width from Font
    
    This patch sets space width from font instead of from cmap in the
    tf operator handler.
    
    Interestingly, this changes the output of one test from
    "Lorem ipsum " to "Lorem ipsum". This actually represents a revert
    from an earlier change in which we introduced the 14 core font
    character widths into the build_font_width_map code in _cmap.py.
    I'm now guessing that that earlier change was actually incorrect,
    because the CORE_FONT_METRICS - I think - use unicode characters
    in their character_widths, whereas the build_font_width_map just
    uses raw codes.
    
    * ENH: TextExtractor: Method for text width calculation based on Font
    
    * MAINT: TextExtractor: Use font for widths calculation
    
    * MAINT: TextExtractor: Remove _get_actual_font_widths method
    
    This method has become superfluous after switching to the new
    _get_actual_text_widths method.
    
    * MAINT: TextExtractor: Save font in cm_stack
    
    * MAINT: TextExtractor: use font instead of cmap in get_text_operands
    
    * MAINT: TextExtractor: Use font character map in get_display_str
    
    * MAINT: _cmap.py: Remove compute_font_width
    
    The compute_font_width method is no longer used and
    therefore obsolete.
    
    * ENH: Placeholder commit Type3 Font Descriptor
    
    * MAINT: Text extraction init: remove cmap
    
    * MAINT: TextExtractor: Remove cmap attribute
    
    * ROB: FontDescriptor: Add warning about invalid width
    
    This adds a warning to FontDescriptor that replicates a
    warning originally in the build_font_width_map method in
    cmap.py. In tests/test_cmap.py, test_function_in_font_widths
    specifically tests for this warning. With this warning added
    to FontDescriptor for the same problem case, the test keeps
    fulfilling its purpose, but now for the new Font class.
    
    * MAINT: TextExtractor: Remove cmaps, only pass font resources
    
    This patch stops collecting character maps, space widths and
    encodings to the TextExtractor, keeping only the font resource
    that is necessary in the TextExtractor class. All the other
    aspects are now covered with the Font class.
    
    Incidentally, this should reduce the number of times that
    font widths are collected during text extraction, which used
    to be once for every font resource (for collecting space
    width) and again during text extraction. Now it is only once,
    when the fonts are collected in page.py.
    
    * MAINT: _cmap.py: Remove build_font_width_map
    
    After moving the text extraction code to the font class, which
    collects its own font width map, this code is not needed anymore.
    
    * MAINT: _cmap.py: Remove unused code
    
    This removes three methods that have become obsolete since
    porting the non-layout text extraction code to the Font class.
    
    * MAINT: _cmap.py: Remove get_actual_str_key method
    
    * MAINT: _cmap.py: Remove compute_space_width
    
    * MAINT: test_cmaps.py: Port test for iss1533 to new Font code
    
    The test for iss1533 was based on the old build_char_map code.
    Now that that code is removed, port the test to the new Font
    class, which should cover the underlying issue just the same.
    
    * MAINT: font.py: Import _cmap's get_encoding in the normal way
    
    This does not cause a circular import anymore after refactoring.
    
    * ENH: TextExtractor: Separate old and new text for width calculation
    
    * MAINT: No separate method for font width calculation
    
    This reverts "ENH: TextExtractor: Separate old and new text
    for width calculation" and embeds the font widths calculation
    within the _handle_tj() method in _text_extractor.py and in
    get_display_str() in _text_extraction/__init__.py instead.
    
    This way, we get the character widths within the same loop
    in which we collect the unicode characters, without the need
    to keep track of old and new text, and having to add or
    separate these later on.
    
    Also, it actually takes so little code that this hardly
    justified the _get_actual_text_widths that did this before.
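
The idea of collecting characters and widths in the same loop can be sketched like this. The function and its arguments are hypothetical and much simpler than `_handle_tj()` or `get_display_str()`.

```python
def decode_with_width(
    raw_codes: str,
    character_map: dict[str, str],
    character_widths: dict[str, float],
    default_width: float,
) -> tuple[str, float]:
    """Map raw codes to characters and sum their widths in a single pass."""
    characters = []
    total_width = 0.0
    for code in raw_codes:
        # Unmapped codes are kept as-is (an assumption of this sketch).
        character = character_map.get(code, code)
        characters.append(character)
        # Missing widths fall back to the font's default width.
        total_width += character_widths.get(character, default_width)
    return "".join(characters), total_width
```

Because text and width come out of one pass, there is no need to track old and new text separately and reconcile them afterwards.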
    
    * MAINT: _layout_mode: Rename TD_offset td_offset
    
    * MAINT: _generate_appearance_stream_data: Don't type Optional
    
    * MAINT: _page.py: Update comment that mentioned old Font class
    PJBrs authored Jan 7, 2026
    d9ce594
  2. a29e532
  3. DOC: Add outlines documentation and link it in User Guide (#3511)

    Closes #3484.
    
    ---------
    
    Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
    mainuddin-md and stefan6419846 authored Jan 7, 2026
    f189f07

Commits on Jan 9, 2026

  1. 7126880
  2. SEC: Improve handling of partially broken PDF files (#3594)

    This change does indeed contain multiple fixes, but as all of them are
    about partially broken PDF files and possibly security-related, I
    decided to put them into one changeset.
    
    Besides renaming variables to make them more readable, this includes the
    following changes:
    
    1. When searching through a PDF file which does not define a `/Root`
       entry in the trailer but declares a large `/Size` value there, we
       would try to access each object number until the limit defined by
       `/Size` had been reached.
       This behavior can now be controlled by a new parameter to
       `PdfReader` which has a more sensible default.
    
    2. When a broken `startxref` table is discovered, we try to re-build
       it from scratch. This used a regex-based approach, which turned out
       to be problematic with files consisting of lots of whitespace
       characters. By replacing the regex-based approach with a manual search
       based upon `string.find()`, we were able to drastically improve the
       performance in such cases.
    
    3. When flattening the pages of a PDF file, having one of the `/Kids`
       of the `/Pages` catalog entry reference the `/Pages` entry again
       would run until Python detects a recursion error itself. This has
       been changed to explicitly check for such cyclic references.
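
Points 2 and 3 can be illustrated with a small standard-library sketch. It is not the pypdf implementation: the `N G obj` header scan built on `bytes.find()`, the dictionary-based page tree, and all function names are assumptions chosen for the example.

```python
WHITESPACE = b" \t\r\n"


def _skip_whitespace_backwards(data: bytes, end: int) -> int:
    """Return the index just after the last non-whitespace byte before `end`."""
    while end > 0 and data[end - 1] in WHITESPACE:
        end -= 1
    return end


def _read_int_backwards(data: bytes, end: int):
    """Parse the digits ending right before `end`; return (value, start) or None."""
    start = end
    while start > 0 and data[start - 1:start].isdigit():
        start -= 1
    return (int(data[start:end]), start) if start < end else None


def find_object_headers(data: bytes) -> dict:
    """Map (object number, generation) to the offset of each "N G obj" header."""
    headers = {}
    position = 0
    while (marker := data.find(b"obj", position)) != -1:
        position = marker + 3
        gen_end = _skip_whitespace_backwards(data, marker)
        generation = _read_int_backwards(data, gen_end) if gen_end < marker else None
        if generation is None:
            continue  # for example "endobj" or unrelated text
        num_end = _skip_whitespace_backwards(data, generation[1])
        number = _read_int_backwards(data, num_end) if num_end < generation[1] else None
        if number is not None:
            headers[(number[0], generation[0])] = number[1]
    return headers


def flatten_pages(objects: dict, node_id: int, ancestors: tuple = ()) -> list:
    """Collect leaf page ids, rejecting /Kids entries that point to an ancestor."""
    if node_id in ancestors:
        raise ValueError(f"Cyclic /Kids reference to object {node_id}")
    node = objects[node_id]
    if node["Type"] == "/Page":
        return [node_id]
    pages = []
    for kid_id in node.get("Kids", []):
        pages.extend(flatten_pages(objects, kid_id, ancestors + (node_id,)))
    return pages
```

The explicit ancestor check fails on the first repeated node instead of relying on Python's recursion limit.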
    stefan6419846 authored Jan 9, 2026
    2941657
  3. REL: 6.6.0

    ## What's new
    
    ### Security (SEC)
    - Improve handling of partially broken PDF files (#3594) by @stefan6419846
    
    ### Deprecations (DEP)
    - Block common page content modifications when assigned to reader (#3582) by @stefan6419846
    
    ### New Features (ENH)
    - Embellishments to generated text appearance streams (#3571) by @PJBrs
    
    ### Bug Fixes (BUG)
    - Do not consider multi-byte BOM-like sequences as BOMs (#3589) by @stefan6419846
    
    ### Robustness (ROB)
    - Avoid empty FlateDecode outputs without warning (#3579) by @stefan6419846
    
    ### Documentation (DOC)
    - Add outlines documentation and link it in User Guide (#3511) by @mainuddin-md
    
    ### Developer Experience (DEV)
    - Add PyPy 3.11 to test matrix and benchmarks (#3574) by @rassie
    
    ### Maintenance (MAINT)
    - Fix compatibility with Pillow >= 12.1.0 (#3590) by @stefan6419846
    
    [Full Changelog](6.5.0...6.6.0)
    stefan6419846 committed Jan 9, 2026
    10df9c7