ENH: Addition of optional visitor-functions in extract_text() #1252
MartinThoma merged 28 commits into py-pdf:main
You may use these callbacks to visit all operators and their arguments and to get the positions of the text-objects. In some PDF files this lets you extract the rectangles of a table and the texts in its cells.
The test in tests/test_page.py extracts the labels of the rectangles in Figure 2 of GeoBase_NHNC1_Data_Model_UML_EN.
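As a rough illustration of the callbacks described above, here is a minimal sketch of a collector that records operators and text positions. The class `PageLog` is hypothetical and not part of PyPDF2. The callback argument order (`op, args, cm, tm` and `text, cm, tm, font_dict, font_size`) and the keyword names in the commented-out call are assumed from the merged PR. The callbacks are driven with synthetic values so the block runs without a PDF.

```python
from typing import Any, List, Optional, Tuple


class PageLog:
    """Hypothetical helper (not part of PyPDF2) that records what the visitors see."""

    def __init__(self) -> None:
        self.operators: List[Tuple[bytes, List[Any]]] = []
        self.texts: List[Tuple[float, float, str]] = []

    def visitor_operand_before(
        self, op: bytes, args: List[Any], cm: List[float], tm: List[float]
    ) -> None:
        # Record every content-stream operator together with its arguments.
        self.operators.append((op, list(args)))

    def visitor_text(
        self,
        text: str,
        cm: List[float],
        tm: List[float],
        font_dict: Optional[Any],
        font_size: float,
    ) -> None:
        # tm[4] and tm[5] are the translation part of the text matrix;
        # treating them as the text position is a simplification.
        if text.strip():
            self.texts.append((tm[4], tm[5], text))


# Synthetic calls standing in for what extract_text() would pass.
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
log = PageLog()
log.visitor_operand_before(b"re", [10, 20, 100, 50], IDENTITY, IDENTITY)
log.visitor_text("Label A", IDENTITY, [1.0, 0.0, 0.0, 1.0, 72.0, 700.0], None, 12.0)
log.visitor_text("\n", IDENTITY, [1.0, 0.0, 0.0, 1.0, 72.0, 688.0], None, 12.0)

# With a real page (keyword names assumed from the merged PR):
# page.extract_text(
#     visitor_operand_before=log.visitor_operand_before,
#     visitor_text=log.visitor_text,
# )
```

Whitespace-only chunks are skipped here only to keep the log short; a real visitor might want to keep them to detect line breaks.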
|
Thank you for the contribution ❤️ I didn't expect that we could get text-tokens and their positions in the document in such a rather easy extension. Nice! I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective. In the meantime, would you mind running |
|
I executed black, it reformatted my changes in _page.py and test_page.py :-). |
The function extractTable(listTexts, listRects) uses the function extractTextAndRectangles(page, rectFilter) which uses the function extract_text with visitors to extract text in cells of a table.
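The idea of matching texts to table cells can be sketched roughly as follows. This is not the code from tests/test_page.py: `Rect`, `TextAt` and `texts_by_rect` are hypothetical names, and the sketch assumes the rectangles and positioned texts have already been collected by visitors and share one coordinate system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rect:
    """Hypothetical cell rectangle: lower-left corner plus width and height."""
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


@dataclass(frozen=True)
class TextAt:
    """Hypothetical positioned text chunk."""
    x: float
    y: float
    text: str


def texts_by_rect(texts: List[TextAt], rects: List[Rect]) -> Dict[Rect, str]:
    """Assign each text to the smallest rectangle that contains its position."""
    parts: Dict[Rect, List[str]] = {r: [] for r in rects}
    for t in texts:
        candidates = [r for r in rects if r.contains(t.x, t.y)]
        if candidates:
            # Prefer the innermost cell when rectangles are nested.
            smallest = min(candidates, key=lambda r: r.w * r.h)
            parts[smallest].append(t.text)
    return {r: " ".join(chunks) for r, chunks in parts.items()}


cells = [Rect(0, 0, 100, 20), Rect(100, 0, 100, 20)]
words = [TextAt(10, 5, "Name"), TextAt(110, 5, "Value"), TextAt(300, 300, "stray")]
table = texts_by_rect(words, cells)
```

Texts that fall outside every rectangle (like `"stray"` above) are simply dropped in this sketch.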
|
@srogmann, |
When executing extract_text(...) the optional visitor-function visitor_text gets the font-dictionary and the font-size. The font-dictionary contains the font-name and other font properties.
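A minimal sketch of how a `visitor_text` callback might use the font information, for example to pick out large text such as headings. The factory `make_heading_collector` and the threshold are hypothetical. The sketch assumes the font-dictionary behaves like a mapping with a `/BaseFont` entry, which not every font dictionary has, and that it may be `None`.

```python
from typing import Any, Callable, List, Mapping, Optional, Tuple

Heading = Tuple[str, float, str]


def make_heading_collector(
    min_size: float,
) -> Tuple[Callable[..., None], List[Heading]]:
    """Hypothetical factory: returns a visitor_text callback and its result list."""
    found: List[Heading] = []

    def visitor_text(
        text: str,
        cm: List[float],
        tm: List[float],
        font_dict: Optional[Mapping[str, Any]],
        font_size: float,
    ) -> None:
        # Ignore whitespace-only chunks and text below the size threshold.
        if not text.strip() or font_size < min_size:
            return
        # Assumption: the font name is stored under "/BaseFont" when present.
        name = str(font_dict.get("/BaseFont", "unknown")) if font_dict else "unknown"
        found.append((name, font_size, text.strip()))

    return visitor_text, found


# Synthetic calls standing in for what extract_text() would pass.
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
visitor, headings = make_heading_collector(14.0)
visitor("Title", IDENTITY, IDENTITY, {"/BaseFont": "/Helvetica-Bold"}, 18.0)
visitor("body text", IDENTITY, IDENTITY, {"/BaseFont": "/Helvetica"}, 10.0)
visitor("No font info", IDENTITY, IDENTITY, None, 16.0)
```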
|
@pubpub-zz [...] |
|
@pubpub-zz
|
|
@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test? Besides those, is the PR ready in your opinion? |
|
@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state. |
|
I'm sorry for the delay; I thought there still was something to be done 🙈 |
|
@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it. Each change of the output-result in _extract_text requires a visitor-call in _extract_text: There are quite a lot of variables describing the current state, cmap[3] contains the font-dictionary. Perhaps in future there will be one object describing the state. In tests/test_page.py there are some functions which might be of interest in PyPDF2 or pdfly in future to support using a visitor. |
|
@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong). |
Codecov Report
Base: 94.53% // Head: 94.10% // Decreases project coverage by 0.44%
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%
==========================================
  Files          28       28
  Lines        5035     5068      +33
  Branches     1035     1051      +16
==========================================
+ Hits         4760     4769       +9
- Misses        165      177      +12
- Partials      110      122      +12
|
@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-) |
Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help. |
|
@pubpub-zz Did you have a look? What do you think about the changes? If you're good with them as well, I would merge + release :-) |
|
This sounds good. I had some requests earlier that have been fulfilled. |
|
@srogmann Very nice work 🥳 I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it 🎉 |
|
@MartinThoma Thanks for merging! An addition to CONTRIBUTORS.html would be fine.
In tests/test_page.py there are two util-classes, PositionedText and Rectangle. After renaming, they might be useful for anyone who wants to write their own visitor. Documentation and typical examples would be nice. docs/user/ is part of the repository, too, so I can think about another pull request containing some documentation and examples (e.g. the util-classes mentioned and a sample to extract tables or to ignore page-headers). |
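The "ignore page-headers" idea could look roughly like the sketch below: a `visitor_text` callback that keeps only text whose vertical position lies inside a body band. The factory name and the bounds (50 and 720) are hypothetical and depend on the page layout. `tm[5]` is used as an approximate y position, which assumes an untransformed page.

```python
from typing import Any, Callable, List, Optional, Tuple


def make_body_collector(
    y_min: float, y_max: float
) -> Tuple[Callable[..., None], List[str]]:
    """Hypothetical factory: keep only text between y_min and y_max."""
    parts: List[str] = []

    def visitor_text(
        text: str,
        cm: List[float],
        tm: List[float],
        font_dict: Optional[Any],
        font_size: float,
    ) -> None:
        y = tm[5]  # vertical translation of the text matrix
        if y_min < y < y_max:
            parts.append(text)

    return visitor_text, parts


# Synthetic calls standing in for what extract_text() would pass.
IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
visitor_body, body_parts = make_body_collector(50.0, 720.0)
visitor_body("Page header", IDENTITY, [1.0, 0.0, 0.0, 1.0, 72.0, 780.0], None, 9.0)
visitor_body("Body line", IDENTITY, [1.0, 0.0, 0.0, 1.0, 72.0, 400.0], None, 11.0)
visitor_body("Page 3", IDENTITY, [1.0, 0.0, 0.0, 1.0, 72.0, 20.0], None, 9.0)
body_text = "".join(body_parts)
```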
New Features (ENH):
- Addition of optional visitor-functions in extract_text() (#1252)
- Add metadata.creation_date and modification_date (#1364)
- Add PageObject.images attribute (#1330)
Bug Fixes (BUG):
- Lookup index in _xobj_to_image can be ByteStringObject (#1366)
- 'IndexError: index out of range' when using extract_text (#1361)
- Errors in transfer_rotation_to_content() (#1356)
Robustness (ROB):
- Ensure update_page_form_field_values does not fail if no fields (#1346)
Testing (TST):
- read_string_from_stream performance (#1355)
Full Changelog: 2.10.9...2.11.0
This request adds optional visitor-callbacks in extract_text(). _extract_text() calls these visitor-methods while scanning the text-objects of a page, so one can analyze the operations in the page and the positions of the texts. tests/test_page.py extracts the texts of labels in a figure and serves as an example of how to use this enhancement.