Add textract2page #160

Merged
stweil merged 5 commits into UB-Mannheim:master from bertsky:add-textract2page on May 5, 2023

@bertsky (Contributor) commented May 4, 2023

In contrast to all existing transformations, https://github.com/slub/textract2page must be given the image file. So I also tried to make it easier for the user to find out which script-args are possible or expected:

example calls for `--help-args`

> ocr-transform hocr page --help-args
        Usage: see http://www.saxonica.com/documentation/index.html#!using-xsl/commandline
        Options available: -? -a -catalog -config -cr -diag -dtd -ea -expand -explain -export -ext -im -init -it -jit -l -lib -license -m -nogo -now -ns -o -opt -or -outval -p -quit -r -relocate -repeat -s -sa -scmin -strip -t -T -target -TB -threads -TJ -Tlevel -Tout -TP -traceout -tree -u -val -versionmsg -warnings -x -xi -xmlversion -xsd -xsdversion -xsiloc -xsl -y --?
        Use -XYZ:? for details of option XYZ
        Params: 
          param=value           Set stylesheet string parameter
          +param=filename       Set stylesheet document parameter
          ?param=expression     Set stylesheet parameter using XPath
          !param=value          Set serialization parameter

> ocr-transform gcv hocr --help-args
    Extra arguments: <width> <height>

> ocr-transform page alto --help-args
    page-to-alto options:
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  --alto-version [4.2|4.1|4.0|3.1|3.0|2.1|2.0]
                                  Choose version of ALTO-XML schema to produce
                                  (older versions may not preserve all
                                  features)
  --check-words / --no-check-words
                                  Check whether PAGE-XML contains any Words
                                  and fail if not
  --check-border / --no-check-border
                                  Check whether PAGE-XML contains Border or
                                  PrintSpace
  --skip-empty-lines / --no-skip-empty-lines
                                  Whether to omit or keep empty lines in PAGE-
                                  XML
  --trailing-dash-to-hyp / --no-trailing-dash-to-hyp
                                  Whether to add a <HYP/> element if the last
                                  word in a line ends in "-"
  --dummy-textline / --no-dummy-textline
                                  Whether to create a TextLine for regions
                                  that have TextEquiv/Unicode but no TextLine
  --dummy-word / --no-dummy-word  Whether to create a Word for TextLine that
                                  have TextEquiv/Unicode but no Word
  --textequiv-index INTEGER       If multiple textequiv, use the n-th
                                  TextEquiv by @index
  --textequiv-fallback-strategy [raise|first|last]
                                  What to do if selected TextEquiv @index is
                                  not available: 'raise' will lead to a
                                  runtime error, 'first' will use the first
                                  TextEquiv, 'last' will use the last
                                  TextEquiv on the element
  --region-order [document|reading-order|reading-order-only]
                                  Order in which to iterate over the regions
  --textline-order [document|index|textline-order]
                                  Order in which to iterate over the textlines

> ocr-transform textract page --help-args
    textract2page arguments: <image-file>
    textract2page options:

@kba (Collaborator) left a comment

Great addition, thanks @bertsky and @rue-a.

Also the more detailed help for the script-args is an improvement.

You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

@bertsky (Contributor Author) commented May 5, 2023

> You need the image as an argument because the AWS Textract JSON does not contain the image (dimensions)?

Exactly. Textract uses floating point ratios (0..1) for all coordinates. So even if we could live with empty or bogus @imageFilename, we need width and height to calculate the absolute coordinates everywhere.
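
The scaling described here can be sketched in a few lines of Python. This is a minimal illustration, not the code of textract2page: the helper names `to_absolute` and `to_points_attribute` are hypothetical, the sample polygon only mimics the list of `X`/`Y` ratio points found under `Geometry.Polygon` in a Textract response, and rounding to the nearest pixel is an assumption.

```python
from typing import Dict, List, Tuple


def to_absolute(polygon: List[Dict[str, float]],
                width: int, height: int) -> List[Tuple[int, int]]:
    """Scale relative (0..1) points to absolute pixel coordinates.

    Hypothetical helper: X is multiplied by the image width,
    Y by the image height, then rounded to the nearest pixel.
    """
    return [(round(p["X"] * width), round(p["Y"] * height)) for p in polygon]


def to_points_attribute(points: List[Tuple[int, int]]) -> str:
    """Serialize pixel points as 'x,y x,y ...' (the Coords/@points style of PAGE-XML)."""
    return " ".join(f"{x},{y}" for x, y in points)


# Made-up sample polygon in Textract-style ratio coordinates.
polygon = [
    {"X": 0.125, "Y": 0.25},
    {"X": 0.5, "Y": 0.25},
    {"X": 0.5, "Y": 0.5},
    {"X": 0.125, "Y": 0.5},
]

# Width and height must come from outside the JSON, i.e. from the image.
points = to_absolute(polygon, width=2000, height=3000)
print(to_points_attribute(points))  # 250,750 1000,750 1000,1500 250,1500
```

In practice the width and height would be read from the image file itself, for example with Pillow's `Image.open(path).size`, which is why the image has to be passed as a script-arg.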

(BTW, gcv__hocr is another case which needs width and height, but apparently it cannot derive these from the image file, so I just added width and height as script-args there.)

@stweil merged commit dd38c29 into UB-Mannheim:master on May 5, 2023
@stweil (Member) commented May 5, 2023

Thank you!

@stweil (Member) commented May 5, 2023

I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3.
That's currently neither documented nor handled automatically in the Makefile.

@bertsky (Contributor Author) commented May 5, 2023

> I just noticed that this PR and also a previous commit ff11c35 require a virtual environment because of pip3. That's currently neither documented nor handled automatically in the Makefile.

Indeed. I did not notice that either. I would leave it to the user to set up a venv, virtualenv or conda environment, though. So we would only need a few remarks in the README, IMO.

@bertsky (Contributor Author) commented May 5, 2023

On the other hand, we already make users set up a $HOME/.local/bin installation. It would be nice if that sufficed even for Python. For example, we could detect whether VIRTUAL_ENV is already defined, and if not, create one under the same PREFIX at install-time and activate it within ocr-transform at run-time.
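
The idea in this comment could be sketched roughly as follows. This is only an illustration of the decision logic in Python, not what the Makefile or `ocr-transform` actually does: all function names and the `share/ocr-fileformat/venv` location under PREFIX are hypothetical, and the `bin` subdirectory assumes a POSIX system.

```python
import os
import venv
from pathlib import Path
from typing import Dict, Mapping, Optional

# Hypothetical location of a private venv, relative to PREFIX.
VENV_SUBDIR = Path("share") / "ocr-fileformat" / "venv"


def choose_venv(prefix: str, environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the active venv if VIRTUAL_ENV is set, else a path under PREFIX."""
    environ = os.environ if environ is None else environ
    active = environ.get("VIRTUAL_ENV")
    if active:
        return Path(active)
    return Path(prefix) / VENV_SUBDIR


def ensure_venv(path: Path) -> Path:
    """Install-time step: create the venv (with pip) unless it already exists."""
    if not (path / "pyvenv.cfg").exists():
        venv.create(path, with_pip=True)
    return path


def activated_environ(path: Path, environ: Mapping[str, str]) -> Dict[str, str]:
    """Run-time step: mimic 'activate' by setting VIRTUAL_ENV and
    putting the venv's bin directory first on PATH."""
    env = dict(environ)
    env["VIRTUAL_ENV"] = str(path)
    env["PATH"] = str(path / "bin") + os.pathsep + env.get("PATH", "")
    return env
```

At install-time one would call `ensure_venv(choose_venv(prefix))`. At run-time the wrapper would launch its Python tools with the environment returned by `activated_environ`, so that a user who already works inside a venv keeps it and everyone else gets the one under PREFIX.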

@bertsky (Contributor Author) commented May 5, 2023

#162
