Skip to content

Conversation

@holtskinner
Copy link
Member

@holtskinner holtskinner commented Apr 19, 2023

  • Changed vision_helpers.py to input Document.Page to reduce extra imports.
  • Rewrote Entity image extraction to avoid storing the Pillow image in memory.
  • Added error checking for missing shards.
  • Added sorting of shards by index in case they are read out of order.
  • Fixed bug where Entity page numbers would be incorrect in multi-shard documents.
  • Simplified creation of Page sub classes by using __post_init__()
  • Renamed Page.text and Table.text to document_text to be more descriptive and match the other Page elements.
  • Rewrote table_sample.py to use more accurate/descriptive variable names.
  • Minor formatting of comments/docs

@holtskinner holtskinner requested review from a team as code owners April 19, 2023 21:43
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Apr 19, 2023
@holtskinner holtskinner added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 19, 2023
@yoshi-kokoro yoshi-kokoro removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Apr 19, 2023
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Apr 20, 2023
@holtskinner holtskinner requested a review from a team as a code owner April 20, 2023 21:59
@galz10 galz10 requested a review from dizcology April 25, 2023 14:53
@galz10
Copy link
Collaborator

galz10 commented Apr 25, 2023

@dizcology i'd like to get a second opinion on these change, please take a look when you have time.

@galz10 galz10 requested review from galz10 and removed request for balajismaniam April 25, 2023 14:54
- Add incomplete shard check
- refactor some of the vision helpers to simplify imports
- Fix documentation formatting issue
@dizcology dizcology mentioned this pull request May 2, 2023
4 tasks
- Avoids some backwards incompatibility issues
- Fixed bug where Entity page numbers would not line up with multi-shard documents
@holtskinner holtskinner changed the title refactor: Change classes to avoid extra copies refactor: Restructuring of classes to reduce duplication and increase readibility Jun 15, 2023
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 15, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 15, 2023
holtskinner and others added 3 commits June 20, 2023 14:15
- Added a more specific type for dictionary. `Dict[str, Union[str, List[str]]]`
- Rewrote Docstring for `Entity.page_offset`
- Wrote unit test for `page_offset`
@holtskinner holtskinner added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 20, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 20, 2023
@galz10 galz10 requested a review from dizcology June 23, 2023 17:20
@holtskinner holtskinner merged commit 778e433 into main Jun 23, 2023
@holtskinner holtskinner deleted the memory branch June 23, 2023 17:35
holtskinner added a commit that referenced this pull request Jun 30, 2023
holtskinner added a commit that referenced this pull request Jul 7, 2023
* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in #110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library
holtskinner added a commit that referenced this pull request Jul 20, 2023
…ciency and follow standard practices. (#139)

* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in #110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library

* chore: Ran isort and black

* chore: Ran no-implicit-optional

* refactor: Refactored document.py
- improve readability, follow python conventions, and improve efficiency
- Also, fixed a previously unknown bug where `Document.search_pages()` returned inaccurate results because it only searched paragraph.text, not page.text

* refactor: Refactor gcs_utilities for readability/pythonic style

* refactor: Refactor page.py to improve efficiency, readability and follow python conventions

* refactor: Rename `Entity.documentai_entity` to `Entity.documentai_object` to match the page.py file

* refactor: Move bounding box extraction to `docai_utilities.py`

* refactor: Major Refactoring of converter_helpers.py to simplify/organize functions, reduce complexity, and increase readability

* fix: Fixed refactor of export_images in document.py

* refactor: Cleanup of blocks.py using `getattr()`

* refactor: Refactoring of bbox_conversion.py to improve readability and efficiency

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore(deps): update dependency google-cloud-documentai to v2.16.1 (#138)

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore: Reran black formatting after merge conflict

* refactor: Minor refactoring of test_bbox_conversion.py to improve readability

* refactor: Changed blocks.py to block.py for consistency.

- Changed how `Block` is initialized.
- Changed `load_blocks_from_schema` into a `@classmethod` to simplify imports.

* fix: Added Missing type annotations to `document.py`

* fix: Add new filename for block.py into test_bbox_conversion.py

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* fix: Fix failing tests for Block class. Changed all fields to have types

* fix: Changed `converter._get_bytes` to return a Tuple

* chore: Addressed Code Review Comments

- Removed FILES_TO_IGNORE
- Simplification of logic in `_get_multiplier` `convert_bbox_to_docproto_bbox`
- Addressed other lint errors
- Adjusted function names to indicate not protected members.

* fix: Remove extra reference to metadata_blob

* fix: Change expected test output and remove references to `geometry`

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
parthea pushed a commit to googleapis/google-cloud-python that referenced this pull request Nov 24, 2025
* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in googleapis/python-documentai-toolbox#110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library
parthea pushed a commit to googleapis/google-cloud-python that referenced this pull request Nov 24, 2025
…ciency and follow standard practices. (#139)

* refactor: Reorganize hocr functions

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.

* samples: Added sample for convert to hocr

* refactor: Reordering of classes in page.py

* refactor: Re-added refactoring to remove extra `get_*()` methods in page.py

- Added in googleapis/python-documentai-toolbox#110 Lost in Merge

* fix: Moved `templates` directory into package.
- Required for template to work in installed library

* chore: Ran isort and black

* chore: Ran no-implicit-optional

* refactor: Refactored document.py
- improve readability, follow python conventions, and improve efficiency
- Also, fixed a previously unknown bug where `Document.search_pages()` returned inaccurate results because it only searched paragraph.text, not page.text

* refactor: Refactor gcs_utilities for readability/pythonic style

* refactor: Refactor page.py to improve efficiency, readability and follow python conventions

* refactor: Rename `Entity.documentai_entity` to `Entity.documentai_object` to match the page.py file

* refactor: Move bounding box extraction to `docai_utilities.py`

* refactor: Major Refactoring of converter_helpers.py to simplify/organize functions, reduce complexity, and increase readability

* fix: Fixed refactor of export_images in document.py

* refactor: Cleanup of blocks.py using `getattr()`

* refactor: Refactoring of bbox_conversion.py to improve readability and efficiency

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore(deps): update dependency google-cloud-documentai to v2.16.1 (#138)

* fix: Change _get_files() to send full gcs uri to _get_bytes()
- Also reduce wait_time in tests

* refactor: Move `converter_helpers.py` functions into `converter.py`

- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.

* chore: Reran black formatting after merge conflict

* refactor: Minor refactoring of test_bbox_conversion.py to improve readability

* refactor: Changed blocks.py to block.py for consistency.

- Changed how `Block` is initialized.
- Changed `load_blocks_from_schema` into a `@classmethod` to simplify imports.

* fix: Added Missing type annotations to `document.py`

* fix: Add new filename for block.py into test_bbox_conversion.py

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* fix: Fix failing tests for Block class. Changed all fields to have types

* fix: Changed `converter._get_bytes` to return a Tuple

* chore: Addressed Code Review Comments

- Removed FILES_TO_IGNORE
- Simplification of logic in `_get_multiplier` `convert_bbox_to_docproto_bbox`
- Addressed other lint errors
- Adjusted function names to indicate not protected members.

* fix: Remove extra reference to metadata_blob

* fix: Change expected test output and remove references to `geometry`

---------

Co-authored-by: Mend Renovate <bot@renovateapp.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: l Pull request size is large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants