
improved ExtractText(3)#969

Merged
MartinThoma merged 45 commits into py-pdf:main from pubpub-zz:ExtractText
Jun 13, 2022

Conversation

@pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Jun 10, 2022

New corrections for extract_text():
fixes extraction in CMaps
#953
#431
#242
#591 / #954 should be fixed as well, but I have doubts about Arabic

@MartinThoma
Member

There are two minor Flake8 issues:

./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_block_backwards' imported but unused
./tests/test_utils.py:7:1: F401 'PyPDF2._utils.read_previous_line' imported but unused

Do you prefer to fix them yourself or should I do it? (also as a general question)

@pubpub-zz
Collaborator Author

@MartinThoma,
I need your help !! 😥
I have an issue in test_utils.py: my changes work on tag 2.1.0, but I get regressions on main.

Can you have a look, please?

@MartinThoma
Member

@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think you only have two minor stylistic / mypy adjustments to make: #971

@MartinThoma
Member

I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-)

@MartinThoma
Member

MartinThoma commented Jun 10, 2022

Oh, if you worry about the code coverage: that's not so bad. It's certainly not a blocker for getting your improvements merged.

I will run various tests (especially https://github.com/py-pdf/benchmarks) to check that things have improved. I can live with it if coverage drops a bit (and I will have a more detailed look at the places which are not covered).

@pubpub-zz
Collaborator Author

@MartinThoma
If you look at the changed files, you'll see I had to drastically revert test_utils.py because I had major issues with it. Give me 5 min and I will confirm or rule out my issue.

@pubpub-zz
Collaborator Author

@MartinThoma
My problem is with this section of code:


import io

import pytest

from PyPDF2._utils import read_previous_line


@pytest.mark.parametrize(
    ("dat", "pos", "expected", "expected_pos"),
    [
        (b"abc", 1, b"a", 0),
        (b"abc", 2, b"ab", 0),
        (b"abc", 3, b"abc", 0),
        (b"abc\n", 3, b"abc", 0),
        (b"abc\n", 4, b"", 3),
        (b"abc\n\r", 4, b"", 3),
        (b"abc\nd", 5, b"d", 3),
        # Skip over multiple CR/LF bytes
        (b"abc\n\r\ndef", 9, b"def", 3),
        # Include a block full of newlines...
        (
            b"abc" + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE) + b"d",
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d",
            3,
        ),
        # Include a block full of non-newline characters
        (
            b"abc\n" + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            2 * io.DEFAULT_BUFFER_SIZE + 4,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            3,
        ),
        # Both
        (
            b"abcxyz"
            + b"\n" * (2 * io.DEFAULT_BUFFER_SIZE)
            + b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            4 * io.DEFAULT_BUFFER_SIZE + 6,
            b"d" * (2 * io.DEFAULT_BUFFER_SIZE),
            6,
        ),
    ],
)
def test_read_previous_line(dat, pos, expected, expected_pos):
    s = io.BytesIO(dat)
    s.seek(pos)
    assert read_previous_line(s) == expected
    assert s.tell() == expected_pos
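For context, the contract these cases encode can be captured by a minimal backwards-line-reader sketch (an illustration only, not PyPDF2's actual `read_previous_line` implementation): collect bytes backwards until a CR/LF byte, return them in order, and leave the stream positioned at the first byte of the preceding CR/LF run (or at 0 if no newline is found).

```python
import io


def read_previous_line_sketch(stream):
    """Return the line before the current position; leave the stream at
    the start of the CR/LF run preceding that line (or at 0)."""
    pos = stream.tell()
    line = bytearray()
    # Walk backwards, collecting bytes until a CR/LF byte is hit.
    while pos > 0:
        stream.seek(pos - 1)
        byte = stream.read(1)
        if byte in (b"\r", b"\n"):
            break
        line.insert(0, byte[0])
        pos -= 1
    else:
        # Reached the start of the stream without seeing a newline.
        stream.seek(0)
        return bytes(line)
    # Skip backwards over the whole CR/LF run.
    while pos > 0:
        stream.seek(pos - 1)
        if stream.read(1) not in (b"\r", b"\n"):
            break
        pos -= 1
    stream.seek(pos)
    return bytes(line)


s = io.BytesIO(b"abc\n\r\ndef")
s.seek(9)
print(read_previous_line_sketch(s), s.tell())  # b'def' 3
```

The sketch is deliberately byte-at-a-time; the real implementation reads whole blocks backwards, which is exactly where the block-boundary test cases above come in.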

@MartinThoma
Member

Oh damn. That sounds as if it's related to #646

I'll have a closer look tomorrow

@pubpub-zz
Collaborator Author

I still have some work to do to fix text extraction when the paper is rotated.
Chinese / Russian / ... are working.
I have some doubts about Arabic, as the text is written right to left.

@codecov

codecov bot commented Jun 10, 2022

Codecov Report

Merging #969 (2aea3e9) into main (9c4e7f5) will increase coverage by 0.16%.
The diff coverage is 86.43%.

@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%     
==========================================
  Files          18       18              
  Lines        4115     4179      +64     
  Branches      868      887      +19     
==========================================
+ Hits         3467     3528      +61     
- Misses        465      468       +3     
  Partials      183      183              
Impacted Files     | Coverage Δ
-------------------|----------------------------
PyPDF2/_page.py    | 82.65% <ø> (+1.10%) ⬆️
PyPDF2/_cmap.py    | 76.43% <76.43%> (+3.70%) ⬆️
PyPDF2/generic.py  | 89.70% <81.81%> (-0.13%) ⬇️
PyPDF2/_utils.py   | 98.03% <98.03%> (ø)
PyPDF2/__init__.py | 100.00% <100.00%> (ø)
PyPDF2/filters.py  | 81.81% <0.00%> (+0.64%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9c4e7f5...2aea3e9.

@MartinThoma
Member

@pubpub-zz I've added the test back, without any adjustment. It works: #971
Did you maybe solve the issue in between?

@MartinThoma
Member

I've set the ids because the auto-generated id includes all of the parameters, which was extremely long.

@MartinThoma
Member

Delivered just before dinner, and I've mowed the lawn

Good job 😁👍 I was just making burgers for my girlfriend and we will now have a relaxed evening 😊

@MartinThoma
Member

@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on the code as if the automatic merge had already been applied. The automatic merge didn't adjust the ids range: 0ba91aa

I'll try to go through the PR this evening / tonight :-)

@MartinThoma
Member

@pubpub-zz Looks good to me! I would squash-commit with the following text:

ENH: Text Extraction improvements

- Improvements around /Encoding / /ToUnicode
- Extraction of CMaps improved
- Fallback for font def missing
- Support for /Identity-H and /Identity-V: utf-16-be
- Support for /GB-EUC-H / /GB-EUC-V: gbk
- Support for /GBpc-EUC-H / /GBpc-EUC-V : gb2312
- Store default font space width for 18 commonly used fonts to improve
  whitespace extraction

Does that represent the changes well to users?
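One item in the list above is easy to demonstrate in isolation: under /Identity-H the character codes in a PDF string are 2 bytes wide, so decoding the raw bytes as UTF-16BE recovers the text. A stdlib-only sketch (the byte string is a made-up example, not taken from a real PDF):

```python
# Under /Identity-H, each 2-byte code in the string maps directly to a
# UTF-16BE code unit; this is what "Support for /Identity-H and
# /Identity-V: utf-16-be" refers to.
raw = b"\x00H\x00i"  # hypothetical 2-byte codes for "H" and "i"
text = raw.decode("utf-16-be")
print(text)  # Hi
```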

@MartinThoma
Member

Besides the two typos I've just commented on, there is one robustness change I would make: the .decode("utf-16-be") call fails 167 times for 22847 PDF files (0.7% of my dataset, so not too wild) with:

    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py", line 1125, in _extract_text
    cmaps[f] = build_char_map(f, space_width, obj)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 21, in build_char_map
    map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py", line 221, in parse_to_unicode
    ] = unhexlify(sq).decode("utf-16-be")
  File "/home/moose/.pyenv/versions/3.10.2/lib/python3.10/encodings/utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data

I would just wrap it in a try-except UnicodeDecodeError:

import logging

logger = logging.getLogger(__name__)

...

                while a <= b:
                    sq = fmt2 % c
                    key = unhexlify(fmt % a).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be"
                    )
                    unhexlified = unhexlify(sq)
                    try:
                        decoded = unhexlified.decode("utf-16-be")
                    except UnicodeDecodeError:
                        logger.warning("UnicodeDecodeError when parsing cmap")
                        a += 1
                        c += 1
                        continue
                    map_dict[key] = decoded
                    int_entry.append(a)
                    a += 1
                    c += 1

pubpub-zz and others added 2 commits June 13, 2022 18:35
Co-authored-by: Martin Thoma <info@martin-thoma.de>
Co-authored-by: Martin Thoma <info@martin-thoma.de>
@pubpub-zz
Collaborator Author

under analysis

@pubpub-zz
Collaborator Author

pubpub-zz commented Jun 13, 2022

ENH: Text Extraction improvements

  • Improvements around /Encoding / /ToUnicode
  • Extraction of CMaps improved
  • Fallback for font def missing
  • Support for /Identity-H and /Identity-V: utf-16-be
  • Support for /GB-EUC-H / /GB-EUC-V / /GBpc-EUC-H / /GBpc-EUC-V (beta release for evaluation)
  • Arabic (for evaluation)
  • whitespace extraction improvement

…end of data

use surrogatepass  in _cmap and _page
@pubpub-zz
Collaborator Author

pubpub-zz commented Jun 13, 2022

@MartinThoma
This latest mod fixed the 'utf-16-be' codec can't decode bytes in position 0-1: unexpected end of data error.
This should close the issue for the 0.7% remaining.
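The failure mode and the surrogatepass fix can be reproduced in isolation: a lone UTF-16 high surrogate at the end of the data raises exactly this error under strict decoding, while the "surrogatepass" error handler lets it through (the hex value below is illustrative, not taken from the failing PDFs):

```python
from binascii import unhexlify

# "D835" is a lone UTF-16 high surrogate: strict decoding fails with
# "unexpected end of data", the error seen on 0.7% of the dataset.
try:
    unhexlify("D835").decode("utf-16-be")
    raise AssertionError("expected a UnicodeDecodeError")
except UnicodeDecodeError as exc:
    print(exc)

# With errors="surrogatepass", the surrogate is passed through instead
# of raising, which is the approach taken in _cmap and _page.
decoded = unhexlify("D835").decode("utf-16-be", "surrogatepass")
print(repr(decoded))  # '\ud835'
```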

@MartinThoma MartinThoma merged commit 72fcaae into py-pdf:main Jun 13, 2022
MartinThoma added a commit that referenced this pull request Jun 13, 2022
The 2.2.0 release improves text extraction again via (#969):

* Improvements around /Encoding / /ToUnicode
* Extraction of CMaps improved
* Fallback for font def missing
* Support for /Identity-H and /Identity-V: utf-16-be
* Support for /GB-EUC-H / /GB-EUC-V / GBp/c-EUC-H / /GBpc-EUC-V (beta release for evaluation)
* Arabic (for evaluation)
* Whitespace extraction improvements

Those changes should mainly improve the text extraction for non-ASCII alphabets,
e.g. Russian / Chinese / Japanese / Korean / Arabic.

Full Changelog: 2.1.1...2.2.0
@pubpub-zz pubpub-zz deleted the ExtractText branch June 14, 2022 18:04