Skip to content

Conversation

@holtskinner
Copy link
Member

@holtskinner holtskinner commented Jul 6, 2023

  • Shouldn't impact any user-facing functions

BEGIN_COMMIT_OVERRIDE

fix: Internal refactoring to improve efficiency and readability. No external-facing changes.

END_COMMIT_OVERRIDE

- Use more jinja templating instead of hardcoding strings
- Simplified bounding box function
- Changed parameter name for `_get_hocr_bounding_box` to `page_dimension` for more clarity.
- Required for template to work in installed library
- improve readability, follow python conventions, and improve efficiency
- Also, fixed a previously unknown bug where `Document.search_pages()` returned inaccurate results because it only searched paragraph.text, not page.text
…ize functions, reduce complexity, and increase readability
@product-auto-label product-auto-label bot added the size: xl Pull request size is extra large. label Jul 6, 2023
holtskinner and others added 3 commits July 6, 2023 16:27
- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.
renovate-bot and others added 9 commits July 7, 2023 12:19
- `converter.py` only had one external facing function that called an internal function with the same parameters.
- Not sure if there was a specific reason for this setup, can be undone if needed.
- Changed how `Block` is initialized.
- Changed `load_blocks_from_schema` into a `@classmethod` to simplify imports.
@holtskinner holtskinner marked this pull request as ready for review July 7, 2023 18:14
@holtskinner holtskinner requested review from a team as code owners July 7, 2023 18:14
def _convert_bbox_units(
coordinate, input_bbox_units, width=None, height=None, multiplier=1
coordinate: float,
input_bbox_units: str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can input_bbox_units be an enum instead of str?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would make more sense. However, since this value is provided by the user in a JSON file, I wonder if there is a straightforward way to make that conversion? (Besides changing the config value to be an integer) Or maybe if it would be better to use constants?


blocks: List[Block] = []
ens = _get_target_object(objects, entities)
for i in ens:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic of this for loop and its nested if-else clauses is difficult to understand. I get the impression that many of them could be simplified if a block b's attributes have reasonable defaults. For instance, can't

if bounding_width:
                b.bounding_width = _get_target_object(b.bounding_box, bounding_width)

be simplified to just

b.bounding_width = _get_target_object(b.bounding_box, bounding_width)

?

(Please ignore and file an internal clean up issue if this file was simply renamed.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a refactoring that I did as well. So I'm good to make some changes as needed.

FILES_TO_IGNORE = [".DS_Store"]


def _get_base_ocr(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these simply moved from converter_helpers? Let me know if these additions need a more detailed review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These were refactored as well, and everything was moved into this file.

You can see just the refactoring work in this commit fb517fb

holtskinner and others added 4 commits July 19, 2023 14:46
- Removed FILES_TO_IGNORE
- Simplification of logic in `_get_multiplier` `convert_bbox_to_docproto_bbox`
- Addressed other lint errors
- Adjusted function names to indicate not protected members.
@holtskinner holtskinner merged commit 82ac823 into main Jul 20, 2023
@holtskinner holtskinner deleted the linting branch July 20, 2023 21:15
@holtskinner holtskinner added the release-please:force-run To run release-please label Jul 27, 2023
@release-please release-please bot removed the release-please:force-run To run release-please label Jul 27, 2023
@holtskinner holtskinner added the release-please:force-run To run release-please label Jul 27, 2023
@release-please release-please bot removed the release-please:force-run To run release-please label Jul 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: xl Pull request size is extra large.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants