feat: added hOCR exporter #111

galz10 · 2023-04-26T16:35:53Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

dizcology · 2023-04-27T16:53:48Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+    if object.layout.bounding_poly.vertices:
+        min_x, min_y = object.layout.bounding_poly.vertices[0].x, object.layout.bounding_poly.vertices[0].y
+        max_x, max_y = object.layout.bounding_poly.vertices[2].x, object.layout.bounding_poly.vertices[2].y
+        return f"bbox {int(min_x)} {int(min_y)} {int(max_x)} {int(max_y)}"


Instead of this, we could introduce a wrapper for bounding_poly that knows how to represent itself as an hOCR-style string.

Are you suggesting a bounding_poly wrapper just for hOCR, or for all of the Document AI Toolbox?

For Document AI's bounding poly messages: https://github.com/googleapis/googleapis/blob/57b675c1534e6feb806ca9cb48a3c4a4023e91fe/google/cloud/documentai/v1/geometry.proto#L49. Alternatively, at least we should extract some of the code here to helper functions for simpler testing.
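A minimal sketch of the "extract to a helper" option, for illustration only. `Vertex` and `hocr_bbox` are hypothetical names, and `Vertex` is a plain stand-in for the Document AI vertex message rather than the real type. Taking the min and max over all vertices avoids assuming that `vertices[0]` and `vertices[2]` are opposite corners.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Vertex:
    """Hypothetical stand-in for a bounding-poly vertex (pixel coordinates)."""

    x: float
    y: float


def hocr_bbox(vertices: Sequence[Vertex]) -> str:
    """Format a polygon as an hOCR-style ``bbox x0 y0 x1 y1`` string."""
    if not vertices:
        raise ValueError("bounding poly has no vertices")
    xs = [v.x for v in vertices]
    ys = [v.y for v in vertices]
    # Use the extremes of all vertices instead of fixed corner positions.
    return f"bbox {int(min(xs))} {int(min(ys))} {int(max(xs))} {int(max(ys))}"
```

A helper like this can be unit-tested without building a full Document.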

google/cloud/documentai_toolbox/wrappers/hocr.py

holtskinner · 2023-05-01T14:43:27Z

In the description for this PR, (and the inline documentation) can you add context as to what hOCR is and how it's being used?

galz10 · 2023-05-02T13:21:42Z

In the description for this PR, (and the inline documentation) can you add context as to what hOCR is and how it's being used?

Yes, will do. This is just a draft to help finalize the design doc, but when I take it out of draft I'll add the context.

dizcology · 2023-05-02T16:48:28Z

google/cloud/documentai_toolbox/wrappers/document.py

+    def to_hocr(self,filename: str) -> str:
+        hocr = _Hocr(documentai_pages=self.pages, documentai_text=self.text,filename=filename)
+
+        return hocr.export_hocr()


This pattern (of instantiating an object only to call a method and then immediately discard the object itself) indeed suggests that we do not need to have a whole _Hocr class, but some helper functions that can carry out the calculation will do.

To be clear, I am not requesting to change this at this point, let's continue the discussion.

Right, a class for the main hOCR object might not be needed, but I still think we need Python representations of the hOCR objects, like hOCR_page, etc.

I've tried using only helper functions, without the object-loading steps, and the result is a very slow export. With this implementation the export returns a string in milliseconds; with the second implementation it takes seconds, up to a minute.

I understand that having private classes representing the hOCR components will make the maintenance simpler, but I do not understand how the performance would be impacted that way. If anything, "equivalent" functional code is likely to be (very) slightly faster than having to create the objects and load their attributes.

So I can try adding this logic inside the normal wrappers to see whether there is any effect; I believe there will be. I believe it would take more time because we would be adding another operation to each token, line, paragraph, block, and page wrapping step. So when you have N pages with N tokens, lines, paragraphs, and blocks, I believe the time added to each wrapping would be a lot (I think the time complexity would be O(n^5)), as opposed to only exporting the objects to hOCR when the user actually wants to export the wrapped document.

dizcology · 2023-05-02T16:58:44Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+    for line in lines:
+        start_index = line.layout.text_anchor.text_segments[0].start_index
+        end_index = line.layout.text_anchor.text_segments[0].end_index
+        words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index]


No, I meant the logic that calculates which wrapper Word belongs to which wrapper Line. This would have nothing to do with hOCR, but about the hierarchy that is (implicitly!) presented in Document AI Document components.

(But yes, if we drive that line of logic further, we will likely end up with a rather different design.)

To be clear, I am not requesting to change this at this point, let's continue the discussion.

dizcology · 2023-05-02T17:06:43Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+
+
+def _get_bounding_box(element_with_layout: ElementWithLayout, dimensions: documentai.Document.Page.Dimension):
+    if element_with_layout.layout.bounding_poly.vertices:


Looks like there are vertices and there are normalized_vertices. What are the situations where only normalized_vertices are populated but not vertices? (That looks like the only case where dimensions is used, and right now we pass this to the hOCR classes just for this purpose. This perhaps could be avoided.)

I've seen cases where the OCR doesn't populate vertices, but I'll test this to confirm that it sometimes happens. If it doesn't, dimensions is not needed.

Can we also confirm this with the service team?
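To make the question concrete, here is a rough sketch of the fallback being discussed, assuming normalized coordinates lie in the range 0 to 1 relative to the page size. `pixel_vertices` and the tuple-based `Point` type are hypothetical and stand in for the real message types; whether the service ever returns only normalized vertices is exactly what remains to be confirmed.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pixel_vertices(
    vertices: Sequence[Point],
    normalized_vertices: Sequence[Point],
    page_width: float,
    page_height: float,
) -> List[Point]:
    """Return pixel vertices, scaling normalized ones only when needed."""
    if vertices:
        # Pixel vertices are present, so the page dimensions are not used.
        return list(vertices)
    # Assumption: normalized coordinates are fractions of the page size.
    return [(x * page_width, y * page_height) for x, y in normalized_vertices]
```

If this fallback turns out to be unnecessary, the `dimensions` argument could be dropped from the hOCR classes entirely.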

dizcology · 2023-05-02T17:09:21Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+class _HocrWord:
+    text: str = dataclasses.field(repr=False)
+    documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False)
+    dimensions: documentai.Document.Page.Dimension = dataclasses.field(repr=False)


This might be an InitVar following the pattern introduced in #110

(The word does not need this as an attribute, and it is used only in __post_init__ to calculate the bounding box.)

Yeah, once the #110 changes are merged I'll modify this PR to use the new pattern.
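For readers unfamiliar with the pattern, a small sketch of `dataclasses.InitVar` follows. It is not the code from #110; `HocrWordWithInitVar`, `normalized_box`, and `page_size` are hypothetical names. An `InitVar` is accepted by `__init__` and handed to `__post_init__`, but it is not stored as a field on the instance.

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass
class HocrWordWithInitVar:
    text: str
    # Init-only inputs: passed to __post_init__, never stored on the instance.
    normalized_box: dataclasses.InitVar[Tuple[float, float, float, float]]
    page_size: dataclasses.InitVar[Tuple[int, int]]
    # Derived attribute, computed once in __post_init__.
    bounding_box: str = dataclasses.field(init=False)

    def __post_init__(self, normalized_box, page_size):
        x0, y0, x1, y1 = normalized_box
        width, height = page_size
        self.bounding_box = (
            f"bbox {int(x0 * width)} {int(y0 * height)} "
            f"{int(x1 * width)} {int(y1 * height)}"
        )
```

The word keeps only `text` and `bounding_box`; the page dimensions are used during construction and then discarded.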

dizcology · 2023-05-02T17:11:11Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+@dataclasses.dataclass
+class _HocrWord:
+    text: str = dataclasses.field(repr=False)
+    documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False)


Also this, perhaps just an InitVar since the hOCR word does not need it as an attribute.

Same as above

dizcology · 2023-05-02T17:11:50Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+
+    def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int):
+        f = ""
+        word_text = self.text.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")


There must be a library for this.

To do the text replacement or the hOCR output? The only thing I found for hOCR was not maintained and did not work properly. For the text replacement there might be one, but wouldn't importing a library just for this purpose be overkill?

https://stackoverflow.com/questions/1061697/whats-the-easiest-way-to-escape-html-in-python
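As a short illustration of the linked answer, the standard library's `html.escape` covers the three replacements in the diff and also escapes quotes, so no third-party dependency is needed. `escape_word` is a hypothetical helper name.

```python
import html


def escape_word(text: str) -> str:
    """Escape &, <, > and both quote characters for use in HTML."""
    # quote=True (the default) also escapes " and ', which matters when the
    # text could end up inside an attribute value.
    return html.escape(text, quote=True)
```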

dizcology · 2023-05-02T17:12:51Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+    def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int):
+        f = ""
+        word_text = self.text.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
+        f += f"<span class='ocrx_word' id='word_{pidx}_{bidx}_{paridx}_{lidx}_{widx}' title='{self.bounding_box}'>{word_text}</span>\n"       


Similarly, there are standard libraries for manipulating and creating XML: https://docs.python.org/3/library/xml.etree.elementtree.html

Oh, I did not know I could do this through a library. I'll take a look.
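A possible shape for the word span using `xml.etree.ElementTree`, shown as a sketch rather than the final design. `word_span` and its parameters are hypothetical. ElementTree escapes text and attribute values on serialization, so the manual `replace` chain is not needed.

```python
import xml.etree.ElementTree as ET


def word_span(word_id: str, bbox: str, text: str) -> str:
    """Build one hOCR word <span> and serialize it to a string."""
    span = ET.Element(
        "span", {"class": "ocrx_word", "id": word_id, "title": bbox}
    )
    # Special characters in the text are escaped when serialized.
    span.text = text
    return ET.tostring(span, encoding="unicode")
```

Building the page, block, paragraph, and line elements the same way (for example with `ET.SubElement`) would keep nesting and escaping consistent across the whole export.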

dizcology · 2023-05-02T17:17:44Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+        f += f"<meta name=\"ocr-number-of-pages\" content=\"{len(self.documentai_pages)}\" />\n"
+        f += "<meta name=\"ocr-capabilities\" content=\"ocr_page ocr_carea ocr_par ocr_line ocrx_word\" />\n"
+        f += "</head>\n"
+        f += "<body>\n"


Put everything that is constant (does not depend on the actual pages) into a single multiline f-string, maybe at the top of the file, to make this more readable.
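One way this suggestion could look, as a sketch. It uses `string.Template` instead of an f-string, because a module-level f-string would be evaluated at import time, before the page count is known. `_HOCR_HEAD` and `render_head` are hypothetical names, and only the lines visible in the diff above are included.

```python
import string

# Constant part of the document head, defined once at module level.
_HOCR_HEAD = string.Template(
    """\
<meta name="ocr-number-of-pages" content="$page_count" />
<meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word" />
</head>
<body>
"""
)


def render_head(page_count: int) -> str:
    """Fill in the only value that depends on the document."""
    return _HOCR_HEAD.substitute(page_count=page_count)
```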

dizcology · 2023-05-05T16:31:27Z

google/cloud/documentai_toolbox/converters/vision_helpers.py

 from google.cloud.vision import TextAnnotation, Symbol, Word, Paragraph, Block, Page
 from google.cloud import vision

+from google.cloud.documentai_toolbox.constants import ElementWithLayout


I noticed that we are not very consistent about importing modules versus importing names/classes. The general preference is to import modules. For example, the next engineer has to refer to the import line to know whether Page here is from the Toolbox itself or from google.cloud.vision.

(This does not have to be done in this PR, but please add an internal cleanup tracking issue to fix this.)
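A small standard-library illustration of the same point (not Toolbox code): `json` and `pickle` both export a function named `loads`, so importing the modules keeps every call site unambiguous, whereas `from json import loads` followed by `from pickle import loads` would silently shadow the first name.

```python
import json
import pickle

payload = {"page": 1}

# The module prefix tells the reader which loads() is meant at each call site.
as_json = json.loads(json.dumps(payload))
as_pickle = pickle.loads(pickle.dumps(payload))
```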

dizcology · 2023-05-05T16:32:23Z

google/cloud/documentai_toolbox/wrappers/document.py


 from pikepdf import Pdf

+from google.cloud.documentai_toolbox.wrappers.hocr import _Hocr


Rename to _hocr to signal that it is not part of the public API. Also import the module instead of the class here.

dizcology · 2023-05-05T16:40:33Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+
+
+def _get_text(element_with_layout: ElementWithLayout, document_text):
+    start_index = element_with_layout.layout.text_anchor.text_segments[0].start_index


Similarly here, we should consider either

(1) promoting ElementWithLayout to a proper class (and not just a type alias) so that we can put the code element_with_layout.layout.text_anchor.text_segments[0].start_index into an ElementWithLayout.start_index property. (And in fact, then, _get_text could be just a property of that class.)

or

(2) introduce a private wrapper class of Layout or TextAnchor.

(This refactoring appears to be purely internal and does not have to block the hOCR export work.)

Yeah, that's something we can do to simplify our usage of these across the Toolbox. I think it would be better to make this change in a separate PR rather than in the PR for the hOCR feature.
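Option (2) could look roughly like the sketch below. `TextSegment` and `TextAnchorView` are hypothetical stand-ins, not existing Toolbox or Document AI classes. Unlike the diff, which reads only `text_segments[0]`, this version joins every segment; that is an assumption about the desired behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextSegment:
    """Hypothetical stand-in for one text-anchor segment."""

    start_index: int
    end_index: int


@dataclass
class TextAnchorView:
    """Hypothetical private wrapper that hides the long attribute chain."""

    text_segments: List[TextSegment]

    @property
    def start_index(self) -> int:
        return self.text_segments[0].start_index

    @property
    def end_index(self) -> int:
        return self.text_segments[-1].end_index

    def text(self, document_text: str) -> str:
        # Join all segments instead of assuming there is exactly one.
        return "".join(
            document_text[s.start_index : s.end_index] for s in self.text_segments
        )
```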

dizcology · 2023-05-05T16:43:32Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+    return document_text[start_index:end_index].replace('/n', '')
+
+
+def _load_hocr_words(words: List[documentai.Document.Page.Token], document_text: str, dimensions: documentai.Document.Page.Dimension):


(1) It may be simpler to test if the method converts a single Document AI Token to a single hOCR Word, and let the caller handle the for loop.

(2) The function name could provide even more information, such as hocr_words_from_documentai_tokens, or the singular form (if you move the for loop out), hocr_word_from_documentai_token.

Side note: there are other alternatives that are more object-oriented, now that we have the classes. For example:

    class _HocrWord:
        ...

        @classmethod
        def from_word(cls, word: wrappers.Word) -> '_HocrWord':
            ...

This still requires figuring out whether a "word" needs to know about the page's dimensions (which does not make sense to me), but indeed, the text should be already contained in the word and having to pass both in seems like a sign of something else not working properly.
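A runnable sketch of the alternative-constructor idea, under the assumption that the wrapped word already carries its own text and bounding box. `WordSketch` and `HocrWordSketch` are hypothetical stand-ins for the Toolbox word wrapper and `_HocrWord`.

```python
from dataclasses import dataclass


@dataclass
class WordSketch:
    """Hypothetical stand-in for a Toolbox word wrapper."""

    text: str
    bbox: str


@dataclass
class HocrWordSketch:
    text: str
    bounding_box: str

    @classmethod
    def from_word(cls, word: WordSketch) -> "HocrWordSketch":
        # Everything needed comes from the word itself; no document text or
        # page dimensions are passed in separately.
        return cls(text=word.text.rstrip("\n"), bounding_box=word.bbox)
```

With a single-word constructor, the caller owns the loop, for example `[HocrWordSketch.from_word(w) for w in words]`, and the conversion can be tested one word at a time.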

dizcology · 2023-05-05T16:50:34Z

google/cloud/documentai_toolbox/wrappers/hocr.py

+    for line in lines:
+        start_index = line.layout.text_anchor.text_segments[0].start_index
+        end_index = line.layout.text_anchor.text_segments[0].end_index
+        words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index]


The Toolbox is the reasonable place to implement that logic (which we are doing here, but doing it in a way to be used only for hOCR export, while it is a more general concept that belongs to the Document AI Documents themselves). Should we not expect the user to want to iterate over all the lines on a Toolbox-wrapped Page, and do something about the lines?
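To illustrate what a general "which tokens belong to which line" helper might look like outside the hOCR exporter, here is a sketch that works on plain (start, end) text spans. `tokens_by_line` is a hypothetical name, and the code assumes both lists are sorted by start index and that lines do not overlap. It uses a binary search to find each line's first token instead of scanning every page token for every line, as the list comprehension in the diff does.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_index, end_index) into the document text


def tokens_by_line(lines: List[Span], tokens: List[Span]) -> Dict[Span, List[Span]]:
    """Group token spans under the line span that contains them."""
    grouped: Dict[Span, List[Span]] = {line: [] for line in lines}
    token_starts = [start for start, _ in tokens]
    for line_start, line_end in lines:
        # Jump to the first token that starts at or after the line start.
        first = bisect_left(token_starts, line_start)
        for start, end in tokens[first:]:
            if start >= line_end:
                break  # Remaining tokens belong to later lines.
            if end <= line_end:
                grouped[(line_start, line_end)].append((start, end))
    return grouped
```

The same grouping could serve any caller that iterates over lines on a wrapped page, not only the hOCR export.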

galz10 added 6 commits April 10, 2023 14:14

chore: edit get_storage_client to add module name

53742c5

added module name to get_bytes

ebe3a36

fixed failing test

00057e6

chore: added hocr

fe033ef

removed test files

9d264a3

Merge branch 'analytic-changes' of https://github.com/galz10/python-d…

61e5ed8

…ocumentai-toolbox into analytic-changes

product-auto-label bot added the size: xl Pull request size is extra large. label Apr 26, 2023

dizcology requested changes Apr 27, 2023

galz10 added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Apr 28, 2023

revised code per comments

93ca69b

galz10 requested a review from dizcology May 1, 2023 14:42

dizcology requested changes May 2, 2023

dizcology reviewed May 2, 2023

galz10 requested a review from dizcology May 2, 2023 20:22

dizcology requested changes May 5, 2023

galz10 requested a review from dizcology May 22, 2023 17:43

Merge branch 'googleapis:main' into analytic-changes

3c70d0a

galz10 mentioned this pull request Jun 15, 2023

feat: added hOCR export functionality #123

Merged

4 tasks

galz10 closed this Jun 15, 2023


