
tesseract english data is now a separate package#2479

Merged
eikek merged 1 commit into eikek:master from programmerq:patch-1
Jan 30, 2024



@programmerq
Contributor

It looks like https://git.alpinelinux.org/aports/commit/community/tesseract-ocr?id=e1dc19b16f34ba3faeba489ea3412d3b3c67c12f split the English language data out into a separate package.

I noticed this error when trying to run OCR on a file for which I had selected English:

Error opening data file /usr/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'

In the docspell 0.40.0 joex image, the tesseract-ocr-5.2.0-r1 package includes eng.traineddata (English), equ.traineddata (math equation detection), and osd.traineddata (orientation and script detection). The tesseract-ocr-5.3.4-r0 package in the docspell 0.41.0 joex image includes none of them.

I don't believe the osd/equ variants are used, so I didn't include them in the PR.
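
To check whether an image is affected, a small script can look for the data files directly. This is only a rough sketch: the default directory is taken from the error message above, and the function name `missing_traineddata` is made up for illustration, not part of docspell or tesseract.

```python
from pathlib import Path


def missing_traineddata(languages, tessdata_dir="/usr/share/tessdata"):
    """Return the language codes that have no <code>.traineddata file.

    tessdata_dir defaults to the path from the error message above;
    this is an assumption and may differ on other images.
    """
    base = Path(tessdata_dir)
    return [
        lang
        for lang in languages
        if not (base / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Check the three data files mentioned in this PR description.
    print(missing_traineddata(["eng", "equ", "osd"]))
```

Run inside the joex container, an empty list would mean all three files are present; any code it prints is a language tesseract would fail to load.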

@eikek eikek added the docker All things regarding docker setup label Jan 30, 2024
@eikek
Owner

eikek commented Jan 30, 2024

Oh, thank you!! 🙏🏼

@eikek
Owner

eikek commented Jan 30, 2024

for reference #2374

@eikek eikek merged commit c8cb8b0 into eikek:master Jan 30, 2024
@eikek eikek added this to the Docspell 0.42.0 milestone Feb 7, 2024
@eikek eikek added the fix label May 30, 2024