-
Notifications
You must be signed in to change notification settings - Fork 20
feat: added hOCR exporter #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if object.layout.bounding_poly.vertices: | ||
| min_x, min_y = object.layout.bounding_poly.vertices[0].x, object.layout.bounding_poly.vertices[0].y | ||
| max_x, max_y = object.layout.bounding_poly.vertices[2].x, object.layout.bounding_poly.vertices[2].y | ||
| return f"bbox {int(min_x)} {int(min_y)} {int(max_x)} {int(max_y)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, we could introduce a wrapper for bounding_poly that knows how to represent itself in hocr-style string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you suggesting a bounding_poly wrapper just for hOCR or for all of docai toolbox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Document AI's bounding poly messages: https://github.com/googleapis/googleapis/blob/57b675c1534e6feb806ca9cb48a3c4a4023e91fe/google/cloud/documentai/v1/geometry.proto#L49. Alternatively, at least we should extract some of the code here to helper functions for simpler testing.
|
In the description for this PR, (and the inline documentation) can you add context as to what hOCR is and how it's being used? |
Yes, will do this is just a draft to help finalize the design doc but when i make this not a draft i'll add the context. |
| def to_hocr(self,filename: str) -> str: | ||
| hocr = _Hocr(documentai_pages=self.pages, documentai_text=self.text,filename=filename) | ||
|
|
||
| return hocr.export_hocr() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern (of instantiating an object only to call a method and then immediately discard the object itself) indeed suggests that we do not need to have a whole _Hocr class, but some helper functions that can carry out the calculation will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I am not requesting to change this at this point, let's continue the discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right a class for the main hOCR object might not be needed but i still think we need python representation of the hOCR objects like hOCR_page, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to use only helper functions without the loading object steps and the result is a super slow export, with this implementation it takes milliseconds for the export to return a string and the second implementation it takes seconds up to a minute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that having private classes representing the hOCR components will make the maintenance simpler, but I do not understand how the performance would be impacted that way. If anything, "equivalent" functional code is likely to be (very) slightly faster than having to create the objects and load their attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I can try to add this logic inside the normal wrappers to see if there is any affect I believe there will be. I believe the reason it would take more time is because we're adding another operation to each token, line , paragraph, block and page wrapping process. So when you have N pages with N tokens,lines,paragraphs and blocks the time added to each wrapping I believe would be a lot(I think the time complexity would be O(n^5). As opposed to only exporting the objects to hOCR when the user actually wants to do export the wrapped document.
| for line in lines: | ||
| start_index = line.layout.text_anchor.text_segments[0].start_index | ||
| end_index = line.layout.text_anchor.text_segments[0].end_index | ||
| words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I meant the logic that calculates which wrapper Word belongs to which wrapper Line. This would have nothing to do with hOCR, but about the hierarchy that is (implicitly!) presented in Document AI Document components.
(But yes, if we drive that line of logic further, we will likely end up with a rather different design.)
To be clear, I am not requesting to change this at this point, let's continue the discussion.
|
|
||
|
|
||
| def _get_bounding_box(element_with_layout: ElementWithLayout, dimensions: documentai.Document.Page.Dimension): | ||
| if element_with_layout.layout.bounding_poly.vertices: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are vertices and there are normalized_vertices. What are the situations where only normalized_vertices are populated but not vertices? (That looks like the only case where dimensions is used, and right not we pass this to the hOCR classes just for this purpose. This perhaps could be avoided.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've seen cases where the OCR doesn't populate vertices, but i'll test this to validate that it happens sometimes otherwise if it doesn't happen the dimensions is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also confirm this with the service team?
| class _HocrWord: | ||
| text: str = dataclasses.field(repr=False) | ||
| documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False) | ||
| dimensions: documentai.Document.Page.Dimension = dataclasses.field(repr=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be an InitVar following the pattern introduced in #110
(The word does not need this as an attribute, and it is used only in __post_init__ to calculate the bounding box.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea once the #110 changes are merged i'll modify this PR to use the new pattern
| @dataclasses.dataclass | ||
| class _HocrWord: | ||
| text: str = dataclasses.field(repr=False) | ||
| documentai_word: documentai.Document.Page.Token = dataclasses.field(repr=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this, perhaps just an InitVar since the hOCR word does not need it as an attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
|
|
||
| def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int): | ||
| f = "" | ||
| word_text = self.text.replace("&","&").replace("<","<").replace(">",">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There must be library for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to do the text replace stuff or the hOCR stuff? the only thing i found about hOCR was not maintained and did not work properly. For the text replace stuff there might be but wouldn't importing a library just for this purpose be overkill ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int): | ||
| f = "" | ||
| word_text = self.text.replace("&","&").replace("<","<").replace(">",">") | ||
| f += f"<span class='ocrx_word' id='word_{pidx}_{bidx}_{paridx}_{lidx}_{widx}' title='{self.bounding_box}'>{word_text}</span>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, there are standard libraries for manipulating and creating XML: https://docs.python.org/3/library/xml.etree.elementtree.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i did not know i could do this through a library i'll take a look.
| f += f"<meta name=\"ocr-number-of-pages\" content=\"{len(self.documentai_pages)}\" />\n" | ||
| f += "<meta name=\"ocr-capabilities\" content=\"ocr_page ocr_carea ocr_par ocr_line ocrx_word\" />\n" | ||
| f += "</head>\n" | ||
| f += "<body>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put everything that is constant (does not depend on the actual pages) into a single multiline f-string, maybe at the top of the file, to make this more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
| from google.cloud.vision import TextAnnotation, Symbol, Word, Paragraph, Block, Page | ||
| from google.cloud import vision | ||
|
|
||
| from google.cloud.documentai_toolbox.constants import ElementWithLayout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that we are not very consistent with importing modules versus importing names/classes. The general preference is to import modules. For example the next engineer has to refer to the import line to know if Page here is from the Toolbox itself or from google.cloud.vision.
(This does not have to be done in this PR, but please add a internal clean up tracking issue to fix this.)
|
|
||
| from pikepdf import Pdf | ||
|
|
||
| from google.cloud.documentai_toolbox.wrappers.hocr import _Hocr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to _hocr to signal that it is not part of the public API. Also import the module instead of the class here.
| def to_hocr(self,filename: str) -> str: | ||
| hocr = _Hocr(documentai_pages=self.pages, documentai_text=self.text,filename=filename) | ||
|
|
||
| return hocr.export_hocr() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that having private classes representing the hOCR components will make the maintenance simpler, but I do not understand how the performance would be impacted that way. If anything, "equivalent" functional code is likely to be (very) slightly faster than having to create the objects and load their attributes.
|
|
||
|
|
||
| def _get_bounding_box(element_with_layout: ElementWithLayout, dimensions: documentai.Document.Page.Dimension): | ||
| if element_with_layout.layout.bounding_poly.vertices: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also confirm this with the service team?
|
|
||
|
|
||
| def _get_text(element_with_layout: ElementWithLayout, document_text): | ||
| start_index = element_with_layout.layout.text_anchor.text_segments[0].start_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, we should consider either
(1) promoting ElementWithLayout to a proper class (and not just a type alias) so that we can put the code element_with_layout.layout.text_anchor.text_segments[0].start_index into an ElementWithLayout.start_index property. (And in fact, then, _get_text could be just a property of that class.)
or
(2) introduce a private wrapper class of Layout or TextAnchor.
(This refactoring appears to be purely internal and does not have to block the hOCR export work.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's something we can do to simplify our usage of these across the toolbox, I think it might be beneficial to add this change in a separate PR and not the PR for hOCR feature.
| return document_text[start_index:end_index].replace('/n', '') | ||
|
|
||
|
|
||
| def _load_hocr_words(words: List[documentai.Document.Page.Token], document_text: str, dimensions: documentai.Document.Page.Dimension): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) It may be simpler to test if the method converts a single Document AI Token to a single hOCR Word, and let the caller handle the for loop.
(2) The function name could provide even more information, such as hocr_words_from_documentai_tokens, or the singular form (if you move the for loop out), hocr_word_from_documentai_token.
Side note: there are other alternatives that is more object-oriented, now that we have the classes. For example
class _HocrWord:
...
@classmethod
def from_word(cls, word: wrappers.Word) -> '_HocrWord':
...
This still requires figuring out whether a "word" needs to know about the page's dimensions (which does not make sense to me), but indeed, the text should be already contained in the word and having to pass both in seems like a sign of something else not working properly.
| for line in lines: | ||
| start_index = line.layout.text_anchor.text_segments[0].start_index | ||
| end_index = line.layout.text_anchor.text_segments[0].end_index | ||
| words = [word for word in page.tokens if word.layout.text_anchor.text_segments[0].start_index >= start_index if word.layout.text_anchor.text_segments[0].end_index <= end_index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Toolbox is the reasonable place to implement that logic (which we are doing here, but doing it in a way to be used only for hOCR export, while it is a more general concept that belongs to the Document AI Documents themselves). Should we not expect the user to want to iterate over all the lines on a Toolbox-wrapped Page, and do something about the lines?
|
|
||
| def to_hocr(self, pidx: int,bidx: int, paridx: int, lidx: int, widx: int): | ||
| f = "" | ||
| word_text = self.text.replace("&","&").replace("<","<").replace(">",">") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕