
improved ExtractText(3)#969

Merged
MartinThoma merged 45 commits into py-pdf:main from pubpub-zz:ExtractText
Jun 13, 2022

Conversation

@pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Jun 10, 2022

New corrections for extract_text():
fixes extraction in CMaps
#953
#431
#242
#591 / #954 should be fixed as well, but I have doubts about Arabic

@MartinThoma
Member

There are two minor Flake8 issues:

./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_block_backwards' imported but unused
./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_previous_line' imported but unused

Do you prefer to fix them yourself or should I do it? (also as a general question)

@pubpub-zz
Collaborator Author

@MartinThoma,
I need your help !! 😥
I have an issue in test_utils.py: my changes work on tag 2.1.0, but I get regressions on main.

Can you have a look, please?

@MartinThoma
Member

@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think you only have two minor stylistic / mypy adjustments to make: #971

@MartinThoma
Member

I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-)

@MartinThoma
Member

MartinThoma commented Jun 10, 2022

Oh, if you worry about the code coverage: that's not so bad. It's certainly not a blocker for getting your improvements merged.

I will run various tests (especially https://github.com/py-pdf/benchmarks) to check that things have improved. I can live with it if coverage drops a bit (and I will have a more detailed look at the places which are not covered).

@pubpub-zz
Collaborator Author

@MartinThoma
If you look at the changed files, you'll see I had to drastically revert test_utils.py because I had major issues with it. Give me 5 min and I will confirm or rule out my issue.

@pubpub-zz
Collaborator Author

@MartinThoma
My problem is with this section of code:


import io

import pytest

from PyPDF2._utils import read_previous_line


@pytest.mark.parametrize(
    ("dat", "pos", "expected", "expected_pos"),
    [
        (b"abc", 1, b"a", 0),
        (b"abc", 2, b"ab", 0),
        (b"abc", 3, b"abc", 0),
        (b"abc\n", 3, b"abc", 0),
        (b"abc\n", 4, b"", 3),
        (b"abc\n\r", 4, b"", 3),
        (b"abc\nd", 5, b"d", 3),
        # Skip over multiple CR/LF bytes
        (b"abc\n\r\ndef", 9, b"def", 3),
        # Include a block full of newlines...
        (
            b"abc" + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE) + b"d",
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d",
            3,
        ),
        # Include a block full of non-newline characters
        (
            b"abc\n" + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            3,
        ),
        # Both
        (
            b"abcxyz"
            + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE)
            + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            4 * io.DEFAULT_BUFFER_SIZE + 6,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            6,
        ),
    ],
)
def test_read_previous_line(dat, pos, expected, expected_pos):
    s = io.BytesIO(dat)
    s.seek(pos)
    assert read_previous_line(s) == expected
    assert s.tell() == expected_pos
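For context, the contract these cases encode can be captured by a minimal backwards-line-reader sketch (an illustration only, not PyPDF2's actual `read_previous_line` implementation): collect bytes backwards until a CR/LF byte, return them in order, and leave the stream positioned at the first byte of the preceding CR/LF run (or at 0 if no newline is found).

```python
import io


def read_previous_line_sketch(stream):
    """Return the line before the current position; leave the stream at
    the start of the CR/LF run preceding that line (or at 0)."""
    pos = stream.tell()
    line = bytearray()
    # Walk backwards, collecting bytes until a CR/LF byte is hit.
    while pos > 0:
        stream.seek(pos - 1)
        byte = stream.read(1)
        if byte in (b"\r", b"\n"):
            break
        line.insert(0, byte[0])
        pos -= 1
    else:
        # Reached the start of the stream without seeing a newline.
        stream.seek(0)
        return bytes(line)
    # Skip backwards over the whole CR/LF run.
    while pos > 0:
        stream.seek(pos - 1)
        if stream.read(1) not in (b"\r", b"\n"):
            break
        pos -= 1
    stream.seek(pos)
    return bytes(line)


s = io.BytesIO(b"abc\n\r\ndef")
s.seek(9)
print(read_previous_line_sketch(s), s.tell())  # b'def' 3
```

The sketch is deliberately byte-at-a-time; the real implementation reads whole blocks backwards, which is exactly where the block-boundary test cases above come in.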

@MartinThoma
Member

Oh damn. That sounds as if it's related to #646

I'll have a closer look tomorrow

@pubpub-zz
Collaborator Author

I still have some work to do to fix text extraction when the paper is rotated.
Chinese / Russian / ... are working.
I have some doubts about Arabic, as the text is written right to left.

@codecov

codecov bot commented Jun 10, 2022

Codecov Report

Merging #969 (2aea3e9) into main (9c4e7f5) will increase coverage by 0.16%.
The diff coverage is 86.43%.

@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%     
==========================================
  Files          18       18              
  Lines        4115     4179      +64     
  Branches      868      887      +19     
==========================================
+ Hits         3467     3528      +61     
- Misses        465      468       +3     
  Partials      183      183              
Impacted Files     | Coverage Δ
-------------------|----------------------------
PyPDF2/_page.py    | 82.65% <ø> (+1.10%) ⬆️
PyPDF2/_cmap.py    | 76.43% <76.43%> (+3.70%) ⬆️
PyPDF2/generic.py  | 89.70% <81.81%> (-0.13%) ⬇️
PyPDF2/_utils.py   | 98.03% <98.03%> (ø)
PyPDF2/__init__.py | 100.00% <100.00%> (ø)
PyPDF2/filters.py  | 81.81% <0.00%> (+0.64%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c4e7f5...2aea3e9.

@MartinThoma
Member

@pubpub-zz I've added the test back, without any adjustment. It works: #971
Did you maybe solve the issue in between?

@MartinThoma
Member

I've set the ids because the auto-generated id includes all of the parameters, which was extremely long.

@MartinThoma
Member

Delivered just before dinner, and I've mowed the lawn

Good job 😁👍 I was just making burgers for my girlfriend and we will now have a relaxed evening 😊

@MartinThoma
Member

@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on the code as if the automatic merge had already been applied. The automatic merge didn't adjust the ids range: 0ba91aa

I'll try to go through the PR this evening / tonight :-)

@MartinThoma
Member

@pubpub-zz Looks good to me! I would squash-commit with the following text:

ENH: Text Extraction improvements

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V: gbk
- Support for /GBpc-EUC-H / /GBpc-EUC-V : gb2312
- Store default font space width for 18 commonly used fonts to improve
  whitespace extraction

Does that represent the changes well to users?
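One item in the list above is easy to demonstrate in isolation: under /Identity-H the character codes in a PDF string are 2 bytes wide, so decoding the raw bytes as UTF-16BE recovers the text. A stdlib-only sketch (the byte string is a made-up example, not taken from a real PDF):

```python
# Under /Identity-H, each 2-byte code in the string maps directly to a
# UTF-16BE code unit; this is what "Support for /Identity-H and
# /Identity-V: utf-16-be" refers to.
raw = b"\x00H\x00i"  # hypothetical 2-byte codes for "H" and "i"
text = raw.decode("utf-16-be")
print(text)  # Hi
```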

@MartinThoma
Member

Besides the two typos I've just commented on, there is one robustness change I would make: the .decode("utf-16-be") call fails 167 times for 22847 PDF files (0.7% of my dataset, so not too wild) with:

    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1125, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 221, in parse_to_unicode
    ] = unhexlify(sq).decode("utf-16-be")
  File "/home/moose/.pyenv/versions/3.10.2/lib/python3.10/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data

I would just wrap it in a try-except UnicodeDecodeError:

import logging

logger = logging.getLogger(__name__)

...

                while a <= b:
                    sq = fmt2 % c
                    key = unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be"
                    )
                    unhexlified = unhexlify(sq)
                    try:
                        decoded = unhexlified.decode("utf-16-be")
                    except UnicodeDecodeError:
                        logger.warning("UnicodeDecodeError when parsing cmap")
                        a += 1
                        c += 1
                        continue
                    map_dict[key] = decoded
                    int_entry.append(a)
                    a += 1
                    c += 1

pubpub-zz and others added 2 commits June 13, 2022 18:35
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@pubpub-zz
Collaborator Author

under analysis

@pubpub-zz
Collaborator Author

pubpub-zz commented Jun 13, 2022

ENH: Text Extraction improvements

  • Improvements around /Encoding / /ToUnicode
  • Extraction of CMaps improved
  • Fallback for font def missing
  • Support for /Identity-H and /Identity-V: utf-16-be
  • Support for /GB-EUC-H / /GB-EUC-V / /GBpc-EUC-H / /GBpc-EUC-V (beta release for evaluation)
  • Arabic (for evaluation)
  • whitespace extraction improvement

…end of data

use surrogatepass  in _cmap and _page
@pubpub-zz
Collaborator Author

pubpub-zz commented Jun 13, 2022

@MartinThoma
This latest mod fixed the 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data error.
This should close the issue for the 0.7% remaining.
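The failure mode and the surrogatepass fix can be reproduced in isolation: a lone UTF-16 high surrogate at the end of the data raises exactly this error under strict decoding, while the "surrogatepass" error handler lets it through (the hex value below is illustrative, not taken from the failing PDFs):

```python
from binascii import unhexlify

# "D835" is a lone UTF-16 high surrogate: strict decoding fails with
# "unexpected end of data", the error seen on 0.7% of the dataset.
try:
    unhexlify("D835").decode("utf-16-be")
    raise AssertionError("expected a UnicodeDecodeError")
except UnicodeDecodeError as exc:
    print(exc)

# With errors="surrogatepass", the surrogate is passed through instead
# of raising, which is the approach taken in _cmap and _page.
decoded = unhexlify("D835").decode("utf-16-be", "surrogatepass")
print(repr(decoded))  # '\ud835'
```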

@MartinThoma MartinThoma merged commit 72fcaae into py-pdf:main Jun 13, 2022
MartinThoma added a commit that referenced this pull request Jun 13, 2022
The 2.2.0 release improves text extraction again via (#969):

* Improvements around /Encoding / /ToUnicode
* Extraction of CMaps improved
* Fallback for font def missing
* Support for /Identity-H and /Identity-V: utf-16-be
* Support for /GB-EUC-H / /GB-EUC-V / GBp/c-EUC-H / /GBpc-EUC-V (beta release for evaluation)
* Arabic (for evaluation)
* Whitespace extraction improvements

Those changes should mainly improve the text extraction for non-ASCII alphabets,
e.g. Russian / Chinese / Japanese / Korean / Arabic.

Full Changelog: 2.1.1...2.2.0
@pubpub-zz pubpub-zz deleted the ExtractText branch June 14, 2022 18:04