Conversation
TODO: add some missing encodings
|
There are two minor Flake8 issues: Do you prefer to fix them yourself or should I do it? (also as a general question) |
|
@MartinThoma, can you have a look, please? |
|
@pubpub-zz I might be sleepy-dumb, but I don't see what you mean. I think there are only minor stylistic / mypy adjustments you need to make: #971 |
|
I'll have a more detailed look tomorrow at all the goodness you're bringing to PyPDF2 this time :-) |
|
Oh, if you worry about the code coverage: That's not so bad. In particular, it's not a blocker for getting your improvements merged. I will run various tests (especially https://github.com/py-pdf/benchmarks ) to check that things have improved. I can live with coverage dropping a bit (and I will have a more detailed look at the places which are not covered) |
|
@MartinThoma |
|
@MartinThoma |
|
Oh damn. That sounds as if it's related to #646. I'll have a closer look tomorrow. |
|
I still have some work to do to fix text extraction when the page is rotated ("paper rotated").
Codecov Report
@@            Coverage Diff             @@
##             main     #969      +/-   ##
==========================================
+ Coverage   84.25%   84.42%   +0.16%
==========================================
  Files          18       18
  Lines        4115     4179      +64
  Branches      868      887      +19
==========================================
+ Hits         3467     3528      +61
- Misses        465      468       +3
  Partials      183      183
|
|
@pubpub-zz I've added the test back, without any adjustment. It works: #971 |
|
I've set the ids because the auto-generated id just takes all of the parameters, which was extremely long.
Good job 😁👍 I was just making burgers for my girlfriend and we will now have a relaxed evening 😊 |
|
@pubpub-zz I've updated the PR so that the tests run. It was weird that they didn't succeed ... apparently, the tests ran on the code as if the automatic merge had already been applied. The automatic merge didn't adjust the ids range: 0ba91aa. I'll try to go through the PR this evening / tonight :-) |
|
@pubpub-zz Looks good to me! I would squash-commit with the following text: Does that represent the changes well to users? |
|
Besides the two typos I've just commented on, there is one robustness change I would make: I would just wrap it in a try-except:

    import logging
    from binascii import unhexlify

    logger = logging.getLogger(__name__)
    ...
    while a <= b:
        sq = fmt2 % c
        # Source codes are single bytes ("charmap") or two-byte UTF-16-BE values.
        key = unhexlify(fmt % a).decode(
            "charmap" if map_dict[-1] == 1 else "utf-16-be"
        )
        unhexlified = unhexlify(sq)
        try:
            decoded = unhexlified.decode("utf-16-be")
        except UnicodeDecodeError:
            # Skip the broken entry instead of aborting the whole extraction.
            logger.warning("UnicodeDecodeError when parsing cmap")
            a += 1
            c += 1
            continue
        map_dict[key] = decoded
        int_entry.append(a)
        a += 1
        c += 1
|
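The skip-and-log pattern suggested above can be tried in isolation. This is only a minimal sketch, not the PyPDF2 code: `build_map` and its list of `(source_hex, target_hex)` pairs are hypothetical names made up for illustration; only `logging`, `binascii.unhexlify` and `bytes.decode` are real standard-library calls.

```python
import logging
from binascii import unhexlify
from typing import Dict, Iterable, Tuple

logger = logging.getLogger(__name__)


def build_map(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical helper: decode hex pairs, skipping undecodable targets."""
    result: Dict[str, str] = {}
    for source_hex, target_hex in pairs:
        key = unhexlify(source_hex).decode("utf-16-be")
        try:
            decoded = unhexlify(target_hex).decode("utf-16-be")
        except UnicodeDecodeError:
            # Log and move on, so one bad entry does not abort the whole map.
            logger.warning("UnicodeDecodeError when parsing cmap entry %s", source_hex)
            continue
        result[key] = decoded
    return result


# "D800" is a lone high surrogate, which strict UTF-16-BE decoding rejects.
demo = build_map([("0041", "0041"), ("0042", "D800"), ("0043", "0063")])
```

With this input, the middle entry is dropped and the other two survive, which is the trade-off of the approach: extraction continues, but the affected character is missing from the map.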
Co-authored-by: Martin Thoma <info@martin-thoma.de>
|
under analysis |
|
…end of data use surrogatepass in _cmap and _page
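For readers unfamiliar with the `surrogatepass` error handler named in the commit above, here is a minimal standard-library sketch of what it changes. It is not taken from `_cmap` or `_page`; the variable names are illustrative only.

```python
# A lone high surrogate (U+D800) encoded as UTF-16-BE bytes.
raw = bytes.fromhex("d800")

# Strict decoding rejects it ("unexpected end of data").
try:
    raw.decode("utf-16-be")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With "surrogatepass", the surrogate code point is kept as-is.
lenient = raw.decode("utf-16-be", "surrogatepass")

# The same handler is needed to turn the string back into bytes.
round_trip = lenient.encode("utf-16-be", "surrogatepass")
```

The resulting string contains an unpaired surrogate, so code that later encodes it (for example to UTF-8) needs to handle that case explicitly.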
|
@MartinThoma |
The 2.2.0 release improves text extraction again via (#969):
* Improvements around /Encoding / /ToUnicode
* Extraction of CMaps improved
* Fallback for font def missing
* Support for /Identity-H and /Identity-V: utf-16-be
* Support for /GB-EUC-H / /GB-EUC-V / /GBpc-EUC-H / /GBpc-EUC-V (beta release for evaluation)
* Arabic (for evaluation)
* Whitespace extraction improvements
Those changes should mainly improve the text extraction for non-ASCII alphabets, e.g. Russian / Chinese / Japanese / Korean / Arabic.
Full Changelog: 2.1.1...2.2.0
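To illustrate the "/Identity-H and /Identity-V: utf-16-be" item in the release notes above, here is a simplified sketch of reading two-byte codes as UTF-16-BE. `decode_identity_codes` is a hypothetical helper, not a PyPDF2 function, and it assumes the codes in the content stream happen to be Unicode values, which is not guaranteed for every PDF.

```python
def decode_identity_codes(hex_string: str) -> str:
    """Hypothetical helper: read a hex string of two-byte codes as UTF-16-BE."""
    return bytes.fromhex(hex_string).decode("utf-16-be")


# Six two-byte codes in the Cyrillic block: U+041F U+0440 U+0438 U+0432 U+0435 U+0442
text = decode_identity_codes("041F04400438043204350442")
```

Here `text` is the Russian word "Привет", which shows why this item matters for non-ASCII alphabets.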
New corrections for extract_text(): fixes extraction in cmap
#953
#431
#242
#591 / #954 should be good, but there are doubts about Arabic