ENH: Addition of optional visitor-functions in extract_text() by srogmann · Pull Request #1252 · py-pdf/pypdf

srogmann · 2022-08-18T21:39:02Z

This request adds optional visitor-callbacks in extract_text().

_extract_text() calls these visitor-methods while scanning the text-objects of a page, so one can analyze the operations in the page and the positions of the texts.

tests/test_page.py extracts the texts of labels in a figure and serves as an example of how to use this enhancement.

You may use these callbacks to visit all operators and their arguments and to get the positions of the text-objects. You may use this to extract the rectangles of a table and the texts in its cells in some PDF files.

The test extracts the labels of rectangles in Figure 2 of GeoBase_NHNC1_Data_Model_UML_EN.
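A minimal sketch of what such callbacks might look like. The five-argument text callback follows the signature discussed further down in this thread; the keyword name visitor_operand_before, its four-argument shape, and all collector names are assumptions for illustration. The callbacks are driven by hand with made-up values so the block runs without a PDF.

```python
from typing import Any, List, Tuple

# Hypothetical collectors; the names are illustrative only.
operators_seen: List[bytes] = []
texts_seen: List[Tuple[str, float, float]] = []


def visit_operand(
    op: bytes, args: List[Any], cm_matrix: List[float], tm_matrix: List[float]
) -> None:
    """Record every content-stream operator, e.g. b"re" or b"Tj"."""
    operators_seen.append(op)


def visit_text(
    text: str,
    cm_matrix: List[float],
    tm_matrix: List[float],
    font_dict: Any,
    font_size: float,
) -> None:
    """Record non-blank text with the translation part of the text matrix."""
    if text.strip():
        texts_seen.append((text, tm_matrix[4], tm_matrix[5]))


# With a real page the callbacks would be passed as keyword arguments
# (assumed keyword names):
#   page.extract_text(visitor_operand_before=visit_operand, visitor_text=visit_text)
# Here they are called by hand with made-up values.
identity = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
text_matrix = [1.0, 0.0, 0.0, 1.0, 72.0, 700.0]
visit_operand(b"Tj", ["Date"], identity, text_matrix)
visit_text("Date", identity, text_matrix, None, 9.96)
visit_text("  \n", identity, identity, None, 9.96)  # blank text is skipped
```

Keeping the collectors outside the callbacks mirrors the listTexts pattern used in the test code.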

MartinThoma · 2022-08-19T05:24:05Z

Thank you for the contribution ❤️

I didn't expect that we could get text-tokens and their positions in the document in such a rather easy extension. Nice!

I still need to think about this PR / check if there is a performance impact and look at it from a maintenance perspective.

In the meantime, would you mind running black .? You need pip install black; it's a code-formatter that fixes all of the Flake8 issues.

srogmann · 2022-08-19T11:09:38Z

I executed black; it reformatted my changes in _page.py and test_page.py :-).

The function extractTable(listTexts, listRects) uses the function extractTextAndRectangles(page, rectFilter) which uses the function extract_text with visitors to extract text in cells of a table.
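The idea behind the table extraction can be sketched as follows. This is not the code from tests/test_page.py: Rect, assign_texts_to_cells and rows_of are hypothetical names, and the sketch assumes that every cell is an axis-aligned rectangle and that cells of one row share exactly the same y value.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Rect:
    """Hypothetical table cell: lower-left corner, width, height, collected texts."""

    x: float
    y: float
    w: float
    h: float
    texts: List[str] = field(default_factory=list)

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


def assign_texts_to_cells(
    texts: Iterable[Tuple[str, float, float]], rects: List[Rect]
) -> List[Rect]:
    """Put each (text, x, y) into the first rectangle that contains its origin."""
    for text, x, y in texts:
        for rect in rects:
            if rect.contains(x, y):
                rect.texts.append(text)
                break
    return rects


def rows_of(rects: List[Rect]) -> List[List[Rect]]:
    """Group cells by y (top row first, since PDF y grows upwards), sort rows by x."""
    rows: Dict[float, List[Rect]] = {}
    for rect in rects:
        # Exact float comparison; a real file may need a tolerance here.
        rows.setdefault(rect.y, []).append(rect)
    return [sorted(rows[y], key=lambda r: r.x) for y in sorted(rows, reverse=True)]


cells = [Rect(0, 20, 50, 10), Rect(50, 20, 50, 10), Rect(0, 10, 50, 10), Rect(50, 10, 50, 10)]
found = [("Date", 5, 23), ("Name", 55, 23), ("2022", 5, 13), ("outside", 500, 500)]
table = [
    [" ".join(cell.texts) for cell in row]
    for row in rows_of(assign_texts_to_cells(found, cells))
]
```

In practice the rectangles would come from an operand visitor watching the re operator and the texts from a text visitor.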

pubpub-zz · 2022-08-21T07:42:19Z

@srogmann,
some extra parameters passed to the visitor functions could be useful for filtering: the BaseFont name and the font size (rescaled to the page). This would be useful for title extraction, for example.

When executing extract_text(...) the optional visitor-function visitor_text gets the font-dictionary and the font-size. The font-dictionary contains the font-name and other font properties.

srogmann · 2022-08-22T23:14:11Z

@pubpub-zz
I added the font-dictionary and the font-size in the text-visitor-function. I added the reference to the font-dictionary instead of the BaseFont because I didn't know what might be of further interest.

    def print_visi(text, cm_matrix, tm_matrix, font_dict, font_size):
        if text.strip() != "":
            listTexts.append(
                PositionedText(
                    text, tm_matrix[4], tm_matrix[5], font_dict, font_size
                )
            )

[...]

    # Check the fonts. We check: /F2 9.96 Tf [...] [(Dat)-2(e)] TJ
    textDatOfDate = listRows[0][0][0]
    assert textDatOfDate.font_dict is not None
    assert textDatOfDate.font_dict["/Name"] == "/F2"
    assert textDatOfDate.font_dict["/BaseFont"] == "/Arial,Bold"
    assert textDatOfDate.font_dict["/Encoding"] == "/WinAnsiEncoding"
    assert textDatOfDate.font_size == 9.96

srogmann · 2022-08-23T11:01:55Z

@pubpub-zz
One could add helper classes like PositionedText to support parsing of formatted texts. I used tests/test_page.py as some kind of incubator ;-).

    assert textDat.get_base_font() == "/Arial,Bold"
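A helper class along these lines might look like the sketch below. It is a guess at the shape, not the class from tests/test_page.py: the field layout follows the PositionedText(...) call shown above, while looks_like_heading and its threshold are hypothetical additions that illustrate the title-extraction use case. A plain dict stands in for the font-dictionary.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class PositionedText:
    """Text with its position and font information (sketch, not the test class)."""

    text: str
    x: float
    y: float
    font_dict: Optional[Mapping[str, Any]]
    font_size: float

    def get_base_font(self) -> Optional[str]:
        """Return the /BaseFont entry, or None if no font-dictionary is known."""
        if self.font_dict is None:
            return None
        return self.font_dict.get("/BaseFont")


def looks_like_heading(t: PositionedText, min_size: float = 9.0) -> bool:
    """Hypothetical filter: bold base font and at least min_size."""
    base_font = t.get_base_font() or ""
    return "Bold" in base_font and t.font_size >= min_size


date_label = PositionedText("Date", 72.0, 700.0, {"/BaseFont": "/Arial,Bold"}, 9.96)
no_font = PositionedText("x", 0.0, 0.0, None, 12.0)
```

In the test above the size equals the Tf operand (9.96); whether a given file needs the size rescaled by the matrices is not covered by this sketch.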

MartinThoma · 2022-09-14T04:15:18Z

@srogmann What is the state of this PR? Do you need help to resolve the merge conflicts / the failing test?

Besides those, is the PR ready in your opinion?

srogmann · 2022-09-14T10:47:57Z

@MartinThoma In my opinion the PR was ready 22 days ago. I will have a look at the current state.

MartinThoma · 2022-09-14T12:00:43Z

I'm sorry for the delay; I thought there still was something to be done 🙈

srogmann · 2022-09-14T20:23:57Z

@MartinThoma I merged and resolved conflicts. The PR should work. You may have a look at it.

Each change of the output-result in _extract_text requires a visitor-call in _extract_text:

                    if visitor_text is not None:
                        visitor_text(text, cm_matrix, tm_matrix, cmap[3], font_size)

Quite a lot of variables describe the current state; cmap[3] contains the font-dictionary. Perhaps in the future there will be one object describing the state.
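One way such a state object could look, purely as a sketch: TextState and adapt are hypothetical names and not part of PyPDF2. The adapter lets a one-argument visitor be used where the current five-argument call is expected.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass(frozen=True)
class TextState:
    """Hypothetical bundle of the values currently passed one by one."""

    text: str
    cm_matrix: List[float]
    tm_matrix: List[float]
    font_dict: Optional[Any]
    font_size: float

    @property
    def position(self) -> Tuple[float, float]:
        """Translation part of the text matrix."""
        return (self.tm_matrix[4], self.tm_matrix[5])


def adapt(visitor: Callable[[TextState], None]) -> Callable[..., None]:
    """Wrap a one-argument visitor so it accepts the five-argument call."""

    def five_arg_visitor(text, cm_matrix, tm_matrix, font_dict, font_size):
        visitor(TextState(text, cm_matrix, tm_matrix, font_dict, font_size))

    return five_arg_visitor


seen: List[TextState] = []
visitor_text = adapt(seen.append)
# Simulated call with made-up values, in the shape used inside _extract_text.
visitor_text("Date", [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 72.0, 700.0], None, 9.96)
```

A single object would also allow new fields to be added later without changing every visitor signature.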

In tests/test_page.py there are some functions which might be of interest for PyPDF2 or pdfly in the future to support using a visitor.
For example, one might try to create an SVG file:

    def exportSvgFile(listTexts, listRects, fileName):
        import svgwrite

        dwg = svgwrite.Drawing(fileName, profile="tiny")
        color = svgwrite.rgb(255, 0, 0, "%")
        for r in listRects:
            dwg.add(dwg.rect((r.x, r.y), (r.w, r.h), stroke=color, fill_opacity=0.05))
        for t in listTexts:
            dwg.add(dwg.text(t.text, insert=(t.x, t.y), fill="blue"))
        dwg.save()

srogmann · 2022-09-24T19:05:25Z

@MartinThoma I added DictionaryObject in cmaps (my last commit-comment is wrong).

codecov · 2022-09-24T19:08:46Z

Codecov Report

Base: 94.53% // Head: 94.10% // Decreases project coverage by -0.43% ⚠️

Coverage data is based on head (1969c9f) compared to base (2845c6d).
Patch coverage: 35.13% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   94.53%   94.10%   -0.44%     
==========================================
  Files          28       28              
  Lines        5035     5068      +33     
  Branches     1035     1051      +16     
==========================================
+ Hits         4760     4769       +9     
- Misses        165      177      +12     
- Partials      110      122      +12

Impacted Files	Coverage Δ
PyPDF2/_cmap.py	`95.08% <ø> (ø)`
PyPDF2/_page.py	`91.67% <35.13%> (-3.46%)`	⬇️

MartinThoma · 2022-09-25T06:03:26Z

@srogmann Thank you for your contribution! If you want, I can add you to https://pypdf2.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

MartinThoma · 2022-09-25T06:05:09Z

Quite a lot of variables describe the current state; cmap[3] contains the font-dictionary. Perhaps in the future there will be one object describing the state.

Yes, I think that would make sense. Also getting rid of the use of global / non-local variables and passing data explicitly around might help.

MartinThoma

Looks good to me!

MartinThoma · 2022-09-25T06:10:05Z

@pubpub-zz Did you have a look? What do you think about the changes?

If you're good with them as well, I would merge + release :-)

pubpub-zz · 2022-09-25T07:54:41Z

This sounds good. I had some requests earlier that have been fulfilled.
I agree that it is time to release it for user feedback.

MartinThoma · 2022-09-25T08:44:21Z

@srogmann Very nice work 🥳

I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples. I'm super excited to see how people will use it 🎉

srogmann · 2022-09-25T13:38:14Z

@MartinThoma Thanks for merging!

An addition to CONTRIBUTORS.html would be fine.

I think this has a lot of potential, but it's also pretty hard to use. We will need to add documentation and typical examples.

In tests/test_page.py there are two util-classes, PositionedText and Rectangle. After renaming they might be useful when one wants to write their own visitor. Documentation and typical examples would be nice. But docs/user/ is contained in the repository, too, so I can think about another pull-request containing some documentation and examples (e.g. the util-classes mentioned and a sample to extract tables or to ignore page-headers).
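A sample for ignoring page-headers could look roughly like this. It is a sketch under assumptions: make_body_visitor is a hypothetical helper, the thresholds are made up, and the y coordinate is read from tm_matrix[5] only, which ignores any offset in cm_matrix.

```python
from typing import Any, Callable, List


def make_body_visitor(
    parts: List[str], header_y: float, footer_y: float
) -> Callable[..., None]:
    """Build a text visitor that keeps text with footer_y < y < header_y."""

    def visitor_body(
        text: str,
        cm_matrix: Any,
        tm_matrix: List[float],
        font_dict: Any,
        font_size: float,
    ) -> None:
        y = tm_matrix[5]
        if footer_y < y < header_y:
            parts.append(text)

    return visitor_body


parts: List[str] = []
visitor = make_body_visitor(parts, header_y=720.0, footer_y=50.0)
# With a real page: page.extract_text(visitor_text=visitor)
# Simulated calls with made-up positions:
visitor("Page header", None, [1, 0, 0, 1, 72.0, 750.0], None, 10.0)
visitor("Body text", None, [1, 0, 0, 1, 72.0, 400.0], None, 10.0)
visitor("3", None, [1, 0, 0, 1, 300.0, 30.0], None, 10.0)
body = "".join(parts)
```

Suitable thresholds depend on the page size and layout of the particular document.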

New Features (ENH):
- Addition of optional visitor-functions in extract_text() (#1252)
- Add metadata.creation_date and modification_date (#1364)
- Add PageObject.images attribute (#1330)

Bug Fixes (BUG):
- Lookup index in _xobj_to_image can be ByteStringObject (#1366)
- 'IndexError: index out of range' when using extract_text (#1361)
- Errors in transfer_rotation_to_content() (#1356)

Robustness (ROB):
- Ensure update_page_form_field_values does not fail if no fields (#1346)

Testing (TST):
- read_string_from_stream performance (#1355)

Full Changelog: 2.10.9...2.11.0

srogmann added 2 commits August 18, 2022 22:10

ENH: Added visitor-callbacks in PageObject.extract_text(...).

76801d7

TST: Test of visitor-callbacks in extract_text().

39a9f08

STY: Executed black to format code (spaces, line-breaks, ...).

92c0cf8

srogmann added 2 commits August 19, 2022 19:56

Fetch main-Updates (_utils.py).

c320ea8

TST: Added function extractTable(...) to read text in cells of a table.

177fea2

srogmann added 8 commits August 22, 2022 22:08

STY: Updated some comments in test-code.

4389590

ENH: Added visitor-callbacks in PageObject.extract_text(...).

eccc779

TST: Test of visitor-callbacks in extract_text().

8297b13

STY: Executed black to format code (spaces, line-breaks, ...).

165b686

TST: Added function extractTable(...) to read text in cells of a table.

ed784e9

STY: Updated some comments in test-code.

9922f1c

ENH: visitor_text additionally gets font-dictionary and font-size.

ae7c993

Merge remote branch 'extract_text_visitors' into extract_text_visitors

4afa052

TST: Added function get_base_font() to get the BaseFont.

f83ae31

srogmann added 3 commits September 14, 2022 21:20

Merge branch 'main' into extract_text_visitors

18d2f4a

BUG: Merged output-changes into visitor-calls.

19003b3

TST: Updated text_visitor-test (line-break disappeared)

a5b8b44

MartinThoma reviewed Sep 17, 2022

tests/test_page.py (resolved)

MartinThoma reviewed Sep 17, 2022

tests/test_page.py (resolved)

flake8 fixes

17f2d61

MartinThoma reviewed Sep 18, 2022

PyPDF2/_page.py (outdated, resolved)

Missed a bracket

ab5d118