GitHub - gojiplus/image-to-text: Images of Text to Text: Call Tesseract from Python and OCR a directory of pdfs

Image to Text

The script uses Tesseract to extract text from PDFs. It reads PDFs from a specified directory and writes text files to another directory. Tesseract works well for documents with a simple structure and easily parsed fonts but generally struggles with more complex layouts. To fix errors in the recovered text, you may want to use Edit Distance Based Search and Replace, which exploits the fact that OCR errors tend to be systematic.
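
As an illustration of the edit-distance idea, here is a minimal sketch that replaces each unknown token with the closest word from a known vocabulary. It is not the Edit Distance Based Search and Replace tool itself; the function names `levenshtein` and `correct_tokens`, the vocabulary, and the `max_distance` threshold are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def correct_tokens(text: str, vocabulary, max_distance: int = 1) -> str:
    """Replace tokens missing from the vocabulary with their nearest word.

    Hypothetical helper: tokens are split on whitespace, so runs of
    whitespace in the input are collapsed to single spaces.
    """
    vocab = set(vocabulary)
    if not vocab:
        return text
    corrected = []
    for token in text.split():
        if token in vocab:
            corrected.append(token)
            continue
        # Break ties alphabetically so the result is deterministic.
        best = min(vocab, key=lambda w: (levenshtein(token, w), w))
        if levenshtein(token, best) <= max_distance:
            corrected.append(best)
        else:
            corrected.append(token)  # too far from any known word; keep it
    return " ".join(corrected)


if __name__ == "__main__":
    print(correct_tokens("the c0mmittee met", ["the", "committee", "met"]))
```

A small `max_distance` keeps the replacement conservative: a typical OCR confusion such as `0` for `o` is one substitution away from the intended word, while unrelated tokens are left untouched.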

Rather than use Tesseract, you can also use Abbyy FineReader or Captricity. And to estimate the error rate of OCR, you may want to use recognize.
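
One common way to quantify OCR quality, when a hand-corrected reference transcription is available, is the character error rate: the edit distance between the OCR output and the reference, divided by the length of the reference. The sketch below shows that metric only; it is not a description of how recognize works, and the name `char_error_rate` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Character error rate of ocr_output relative to a trusted reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # One wrong character out of four.
    print(char_error_rate("abcd", "abed"))
```

Because insertions count as errors, the rate can exceed 1.0 when the OCR output is much longer than the reference.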

For a general overview of how to convert paper to digital and how to optimize that process, see A Quick Scan: From Paper to Digital.

Usage

pdf2txt.py [options] pdf_directory

Command Line Options:

  -h, --help            show this help message and exit
  -d DPI, --dpi=DPI     JPEG Resolution in DPI (default: 400)
  -j JPGDIR, --jpgdir=JPGDIR
                        JPEG output directory (default: jpg)
  -t TXTDIR, --textdir=TXTDIR
                        Text output directory (default: text)
  -r, --resume          Resume OCR to Text (default: False)

Example:

python pdf2txt.py pdf_dir

The script processes all PDF files in the pdf_dir directory and saves the output text files to the text directory.
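
To make the options concrete, here is a rough sketch of a PDF to JPEG to text pipeline. It is not the code in pdf2txt.py. It assumes that ImageMagick's `convert` and the `tesseract` command-line tool are on the PATH, that each PDF has a single page, and that `--resume` means "skip PDFs whose text file already exists". The helpers `plan_commands`, `pending_pdfs` and `ocr_directory` are hypothetical names.

```python
import subprocess
from pathlib import Path


def plan_commands(pdf_path, jpg_dir="jpg", txt_dir="text", dpi=400):
    """Build the two external commands for one single-page PDF."""
    pdf = Path(pdf_path)
    jpg = Path(jpg_dir) / (pdf.stem + ".jpg")
    txt_base = Path(txt_dir) / pdf.stem  # tesseract appends ".txt" itself
    # ImageMagick: rasterize the PDF at the requested resolution.
    convert_cmd = ["convert", "-density", str(dpi), str(pdf), str(jpg)]
    # Tesseract CLI: tesseract <image> <output base name>
    ocr_cmd = ["tesseract", str(jpg), str(txt_base)]
    return convert_cmd, ocr_cmd


def pending_pdfs(pdf_dir, txt_dir="text", resume=False):
    """List PDFs to process; with resume=True, skip ones already converted."""
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    if not resume:
        return pdfs
    return [p for p in pdfs if not (Path(txt_dir) / (p.stem + ".txt")).exists()]


def ocr_directory(pdf_dir, jpg_dir="jpg", txt_dir="text", dpi=400, resume=False):
    """Run both commands for every pending PDF (needs the external tools)."""
    Path(jpg_dir).mkdir(parents=True, exist_ok=True)
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for pdf in pending_pdfs(pdf_dir, txt_dir, resume):
        for cmd in plan_commands(pdf, jpg_dir, txt_dir, dpi):
            subprocess.run(cmd, check=True)
```

Calling `ocr_directory("pdf_dir")` would then roughly mirror the example invocation with its default options. Multi-page PDFs need extra handling, because `convert` writes one numbered image per page. On ImageMagick 7 the command is typically `magick` rather than `convert`.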

License

Scripts are released under the MIT License.
