Skip to content

Conversation

@galz10
Copy link
Collaborator

@galz10 galz10 commented Apr 26, 2023

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added the size: xl Pull request size is extra large. label Apr 26, 2023
if object.layout.bounding_poly.vertices:
min_x, min_y = object.layout.bounding_poly.vertices[0].x, object.layout.bounding_poly.vertices[0].y
max_x, max_y = object.layout.bounding_poly.vertices[2].x, object.layout.bounding_poly.vertices[2].y
return f"bbox {int(min_x)} {int(min_y)} {int(max_x)} {int(max_y)}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this, we could introduce a wrapper for bounding_poly that knows how to represent itself in hocr-style string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you suggesting a bounding_poly wrapper just for hOCR or for all of docai toolbox.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Document AI's bounding poly messages: https://github.com/googleapis/googleapis/blob/57b675c1534e6feb806ca9cb48a3c4a4023e91fe/google/cloud/documentai/v1/geometry.proto#L49. Alternatively, at least we should extract some of the code here to helper functions for simpler testing.

@galz10 galz10 added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Apr 28, 2023
@galz10 galz10 requested a review from dizcology May 1, 2023 14:42
@holtskinner
Copy link
Member

In the description for this PR, (and the inline documentation) can you add context as to what hOCR is and how it's being used?

@galz10
Copy link
Collaborator Author

galz10 commented May 2, 2023

In the description for this PR, (and the inline documentation) can you add context as to what hOCR is and how it's being used?

Yes, will do this is just a draft to help finalize the design doc but when i make this not a draft i'll add the context.

def to_hocr(self,filename: str) -> str:
hocr = _Hocr(documentai_pages=self.pages, documentai_text=self.text,filename=filename)

return hocr.export_hocr()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern (of instantiating an object only to call a method and then immediately discard the object itself) indeed suggests that we do not need to have a whole _Hocr class, but some helper functions that can carry out the calculation will do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I am not requesting to change this at this point, let's continue the discussion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right a class for the main hOCR object might not be needed but i still think we need python representation of the hOCR objects like hOCR_page, etc...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to use only helper functions without the loading object steps and the result is a super slow export, with this implementation it takes milliseconds for the export to return a string and the second implementation it takes seconds up to a minute.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that having private classes representing the hOCR components will make the maintenance simpler, but I do not understand how the performance would be impacted that way. If anything, "equivalent" functional code is likely to be (very) slightly faster than having to create the objects and load their attributes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I can try to add this logic inside the normal wrappers to see if there is any affect I believe there will be. I believe the reason it would take more time is because we're adding another operation to each token, line , paragraph, block and page wrapping process. So when you have N pages with N tokens,lines,paragraphs and blocks the time added to each wrapping I believe would be a lot(I think the time complexity would be O(n^5). As opposed to only exporting the objects to hOCR when the user actually wants to do export the wrapped document.

for line in lines:
start_index = line.layout.text_anchor.text_segments[0].start_index
end_index = line.layout.text_anchor.text_segments[0].end_index
words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index]
Copy link
Collaborator

@dizcology dizcology May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant the logic that calculates which wrapper Word belongs to which wrapper Line. This would have nothing to do with hOCR, but about the hierarchy that is (implicitly!) presented in Document AI Document components.

(But yes, if we drive that line of logic further, we will likely end up with a rather different design.)

To be clear, I am not requesting to change this at this point, let's continue the discussion.



def _get_bounding_box(element_with_layout: ElementWithLayout, dimensions: documentai.Document.Page.Dimension):
if element_with_layout.layout.bounding_poly.vertices:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are vertices and there are normalized_vertices. What are the situations where only normalized_vertices are populated but not vertices? (That looks like the only case where dimensions is used, and right not we pass this to the hOCR classes just for this purpose. This perhaps could be avoided.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've seen cases where the OCR doesn't populate vertices, but i'll test this to validate that it happens sometimes otherwise if it doesn't happen the dimensions is not needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also confirm this with the service team?

class _HocrWord:
text: str = dataclasses.field(repr=False)
documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False)
dimensions: documentai.Document.Page.Dimension = dataclasses.field(repr=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an InitVar following the pattern introduced in #110

(The word does not need this as an attribute, and it is used only in __post_init__ to calculate the bounding box.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea once the #110 changes are merged i'll modify this PR to use the new pattern

@dataclasses.dataclass
class _HocrWord:
text: str = dataclasses.field(repr=False)
documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this, perhaps just an InitVar since the hOCR word does not need it as an attribute.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above


def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int):
f = ""
word_text = self.text.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There must be library for this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to do the text replace stuff or the hOCR stuff? the only thing i found about hOCR was not maintained and did not work properly. For the text replace stuff there might be but wouldn't importing a library just for this purpose be overkill ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int):
f = ""
word_text = self.text.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
f += f"<span class='ocrx_word' id='word_{pidx}_{bidx}_{paridx}_{lidx}_{widx}' title='{self.bounding_box}'>{word_text}</span>\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, there are standard libraries for manipulating and creating XML: https://docs.python.org/3/library/xml.etree.elementtree.html

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i did not know i could do this through a library i'll take a look.

f += f"<meta name=\"ocr-number-of-pages\" content=\"{len(self.documentai_pages)}\" />\n"
f += "<meta name=\"ocr-capabilities\" content=\"ocr_page ocr_carea ocr_par ocr_line ocrx_word\" />\n"
f += "</head>\n"
f += "<body>\n"
Copy link
Collaborator

@dizcology dizcology May 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put everything that is constant (does not depend on the actual pages) into a single multiline f-string, maybe at the top of the file, to make this more readable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do.

@galz10 galz10 requested a review from dizcology May 2, 2023 20:22
from google.cloud.vision import TextAnnotation, Symbol, Word, Paragraph, Block, Page
from google.cloud import vision

from google.cloud.documentai_toolbox.constants import ElementWithLayout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that we are not very consistent with importing modules versus importing names/classes. The general preference is to import modules. For example the next engineer has to refer to the import line to know if Page here is from the Toolbox itself or from google.cloud.vision.

(This does not have to be done in this PR, but please add a internal clean up tracking issue to fix this.)


from pikepdf import Pdf

from google.cloud.documentai_toolbox.wrappers.hocr import _Hocr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to _hocr to signal that it is not part of the public API. Also import the module instead of the class here.

def to_hocr(self,filename: str) -> str:
hocr = _Hocr(documentai_pages=self.pages, documentai_text=self.text,filename=filename)

return hocr.export_hocr()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that having private classes representing the hOCR components will make the maintenance simpler, but I do not understand how the performance would be impacted that way. If anything, "equivalent" functional code is likely to be (very) slightly faster than having to create the objects and load their attributes.



def _get_bounding_box(element_with_layout: ElementWithLayout, dimensions: documentai.Document.Page.Dimension):
if element_with_layout.layout.bounding_poly.vertices:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also confirm this with the service team?



def _get_text(element_with_layout: ElementWithLayout, document_text):
start_index = element_with_layout.layout.text_anchor.text_segments[0].start_index
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, we should consider either

(1) promoting ElementWithLayout to a proper class (and not just a type alias) so that we can put the code element_with_layout.layout.text_anchor.text_segments[0].start_index into an ElementWithLayout.start_index property. (And in fact, then, _get_text could be just a property of that class.)

or

(2) introduce a private wrapper class of Layout or TextAnchor.

(This refactoring appears to be purely internal and does not have to block the hOCR export work.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's something we can do to simplify our usage of these across the toolbox, I think it might be beneficial to add this change in a separate PR and not the PR for hOCR feature.

return document_text[start_index:end_index].replace('/n', '')


def _load_hocr_words(words: List[documentai.Document.Page.Token], document_text: str, dimensions: documentai.Document.Page.Dimension):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) It may be simpler to test if the method converts a single Document AI Token to a single hOCR Word, and let the caller handle the for loop.

(2) The function name could provide even more information, such as hocr_words_from_documentai_tokens, or the singular form (if you move the for loop out), hocr_word_from_documentai_token.

Side note: there are other alternatives that is more object-oriented, now that we have the classes. For example

class _HocrWord:
  ...
  @classmethod
  def from_word(cls, word: wrappers.Word) -> '_HocrWord':
    ...

This still requires figuring out whether a "word" needs to know about the page's dimensions (which does not make sense to me), but indeed, the text should be already contained in the word and having to pass both in seems like a sign of something else not working properly.

for line in lines:
start_index = line.layout.text_anchor.text_segments[0].start_index
end_index = line.layout.text_anchor.text_segments[0].end_index
words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Toolbox is the reasonable place to implement that logic (which we are doing here, but doing it in a way to be used only for hOCR export, while it is a more general concept that belongs to the Document AI Documents themselves). Should we not expect the user to want to iterate over all the lines on a Toolbox-wrapped Page, and do something about the lines?


def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int):
f = ""
word_text = self.text.replace("&","&amp;").replace("<","&lt;").replace(">","&gt;")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@galz10 galz10 requested a review from dizcology May 22, 2023 17:43
@galz10 galz10 mentioned this pull request Jun 15, 2023
4 tasks
@galz10 galz10 closed this Jun 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do not merge Indicates a pull request not ready for merge, due to either quality or timing. size: xl Pull request size is extra large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants