Free OCR Tool — Extract Text from Images & PDFs

Optical Character Recognition — Powered by Tesseract

Extract selectable, searchable text from scanned PDFs, photos of documents, and image files. Supports 100+ languages via the Tesseract OCR engine.

100+ language support • Scanned PDF & image input • Outputs TXT, DOCX & searchable PDF

Drop your file here

or browse files

Up to 100 MB free • Outputs: TXT, DOCX, PDF

SSL Encrypted

Files deleted in 2h

Fast processing

No signup needed

Simple & Fast

How to Use OCR Tool

Convert your files in three simple steps — no software, no signup.

Upload Image or PDF

Drop a scanned PDF, JPG, PNG, or TIFF of a document. Colour, greyscale, and black-and-white scans all work.

Select Language

Choose the language of the text in your document. Multi-language documents can use "Auto-detect" mode.

Download Extracted Text

Download as plain TXT, editable DOCX, or searchable PDF with an invisible text layer over the original scan.

Features

Why Use OCR Tool?

100+ Languages

Tesseract supports over 100 languages, covering Latin, Cyrillic, and Arabic scripts as well as Chinese (Simplified/Traditional), Japanese, and Korean.

Searchable PDF Output

Create a searchable PDF where the original scan is visible but the extracted text layer is selectable and searchable.

DOCX Output

Export recognised text as a Microsoft Word document with basic paragraph structure preserved for further editing.

Image & PDF Input

Accepts JPG, PNG, TIFF, BMP, and multi-page scanned PDF files. Multi-page PDFs process all pages automatically.

Private Processing

Documents are never indexed or kept long-term. OCR runs in a temporary isolated environment, and files are deleted within 2 hours.

Pre-processing for Accuracy

Images are automatically deskewed, de-noised, and binarised before OCR to maximise recognition accuracy on real-world scans.

Compatibility

Supported Formats

All the formats you need, all in one place.

Format	Description	Extension	Best Used For
JPG / JPEG	Photo of a document or scanned page	`.jpg`	Smartphone photos of documents
PNG	High-contrast screenshot or scan	`.png`	Screenshots, receipts, invoices
TIFF	Multi-page TIFF scan from scanner	`.tiff`	Professional scanner output
PDF (scanned)	Image-only PDF without text layer	`.pdf`	Scanned contracts, books, forms
TXT output	Plain text extraction from document	`.txt`	Data extraction, NLP, databases
DOCX output	Editable Word document from OCR	`.docx`	Edit and format extracted content
PDF output (OCR)	Searchable PDF with invisible text layer	`.pdf`	Archival, full-text search, Ctrl+F

FAQ

Frequently Asked Questions

What is OCR and how does it work?

OCR (Optical Character Recognition) analyses the pixel patterns in an image to identify characters and words. Tesseract, the engine we use, relies on LSTM neural networks for high-accuracy recognition of printed text.

What accuracy can I expect?

Clean, high-contrast, well-aligned text (such as a laser-printed document) typically reaches 98–99% character accuracy. Handwritten text, low-resolution scans, and unusual fonts give lower accuracy.

Does it work on handwritten text?

Tesseract can recognise neat block handwriting, but accuracy is much lower than for printed text. For best results, use a high-resolution scan (300 dpi or more) and select the correct language.

What is a searchable PDF?

A searchable PDF keeps the original scanned image as the visible layer and adds an invisible text layer aligned with it. You can then search (Ctrl+F), select, and copy text in any PDF viewer.

What scan quality should I use?

Use at least 200 dpi for acceptable results; 300 dpi is recommended for best accuracy. Scanning at 600 dpi does not significantly improve accuracy but increases processing time.

Can I OCR a multi-page scanned PDF?

Yes. Multi-page PDFs are processed page by page: each page is deskewed and recognised individually, and the output PDF and DOCX contain all pages in order.

Real-World Uses

Who Uses OCR Tool?

From everyday users to professionals — see how people rely on this tool every day.

Field Worker

Extract Text from Photographed Docs

Photograph a printed form or sign with your phone, upload the image, and get editable text back in seconds.

Archivist

Make Scanned Archives Searchable

Add a searchable text layer to scanned historical documents, contracts, and records for full-text indexing.

Researcher

Extract Text from Academic Papers

OCR scanned journal articles or book chapters to copy passages, run citation searches, and feed into reference managers.

Accountant

Digitise Paper Receipts

Scan expense receipts and OCR them to extract amounts, dates, and merchant names for expense report automation.

Translator

Extract Text for Translation

OCR a scanned document to get editable text, then paste into a translation tool for multilingual document conversion.

Data Engineer

Feed Scanned Data into Pipelines

Convert batches of scanned invoice images to TXT output for automated data extraction and structured data pipelines.

Why HarmonyPal?

HarmonyPal vs. Alternatives

See how we compare to desktop software and other online converters.

Feature	HarmonyPal (our tool)	Adobe Acrobat	Other Online
100+ language support			Limited
Searchable PDF output
DOCX output with structure
Multi-page PDF input
Auto deskew & pre-processing
LSTM neural network engine			Varies
No watermark on output
Free to use

Under the Hood

Technical Specifications

Built on industry-standard open-source tools for maximum quality and reliability.

Limits & Restrictions

Max file size: 100 MB (free)
Input: JPG, PNG, TIFF, BMP, PDF (scanned)
Output: TXT, DOCX, searchable PDF


Conversion Engine

Tesseract 5.x OCR (LSTM neural network) · ImageMagick for pre-processing

Output Quality

98–99% character accuracy on clean 300 dpi+ printed text; lower on handwriting

Average Speed

Typical: 3–8 s per page; multi-page PDFs scale linearly per page

Data Security

HTTPS transfer · isolated temp dir · auto-purged in 2 h

Languages

100+ via Tesseract language packs

Input formats

JPG, PNG, TIFF, BMP, scanned PDF

Output formats

TXT, DOCX, searchable PDF

Complete Guide

How Tesseract OCR Works: LSTM Neural Networks, Pre-Processing, and Language Detection

Tesseract has been in development for over 30 years: it began at HP, was later sponsored by Google, and continues today as an open-source project. Its modern LSTM engine brings accuracy levels previously only available in commercial OCR products.

From Image to Text: The OCR Pipeline

OCR is a multi-stage process. First, the input image is pre-processed: converted to greyscale, binarised (thresholded to black and white), deskewed (rotation corrected), and de-noised. Next, Tesseract's layout analysis segments the image into text regions, lines, and words. Finally, its LSTM model reads each text line as a sequence, outputs character probabilities at every position along the line, and decodes the most likely character sequence using beam search.
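The final decoding step can be illustrated with a small sketch. It assumes the recogniser emits, for each time step along a text line, a probability for every character plus a CTC-style "blank" symbol, and it runs a prefix beam search over those probabilities. The function name `ctc_beam_search`, the toy alphabet, and the beam width are illustrative assumptions; Tesseract's own decoder is more elaborate and is not reproduced here.

```python
from collections import defaultdict


def ctc_beam_search(probs, alphabet, beam_width=4, blank=0):
    """Decode per-time-step character probabilities with CTC prefix beam search.

    probs: list of rows, one per time step; probs[t][c] is the probability
           of symbol index c at step t (index `blank` is the CTC blank).
    alphabet: list mapping symbol index -> character (blank entry unused).
    """
    # Each prefix tracks two probabilities:
    # (paths ending in blank, paths ending in a non-blank symbol).
    beam = {(): (1.0, 0.0)}
    for row in probs:
        candidates = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            for c, p in enumerate(row):
                if c == blank:
                    # A blank never changes the prefix.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (b + (p_b + p_nb) * p, nb)
                elif prefix and prefix[-1] == c:
                    # Same symbol again: it starts a new character only
                    # if a blank separated the two occurrences.
                    b, nb = candidates[prefix + (c,)]
                    candidates[prefix + (c,)] = (b, nb + p_b * p)
                    # Otherwise the repeat collapses into the same prefix.
                    b, nb = candidates[prefix]
                    candidates[prefix] = (b, nb + p_nb * p)
                else:
                    b, nb = candidates[prefix + (c,)]
                    candidates[prefix + (c,)] = (b, nb + (p_b + p_nb) * p)
        # Keep only the most probable prefixes for the next step.
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = dict(ranked[:beam_width])
    best = max(beam.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[c] for c in best)


# Toy line with two time steps over the symbols [blank, "a"].
toy = [[0.6, 0.4], [0.6, 0.4]]
decoded = ctc_beam_search(toy, alphabet=["", "a"])
```

In this toy input the blank is the most likely symbol at each individual step, yet "a" wins once the probabilities of all paths that collapse to "a" are summed. That is the reason for searching over sequences rather than picking the best symbol per step.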

LSTM vs Legacy OCR

Older OCR engines (including Tesseract 3.x) worked character by character: each glyph was segmented out and its shape matched against a library of stored prototypes. The LSTM engine, introduced in Tesseract 4 and carried into 5.x, instead uses a recurrent neural network trained on millions of text examples, learning contextual cues at the line level. This markedly improves accuracy on cursive text, unusual fonts, and degraded documents, where individual characters are ambiguous but their context makes them clear.

Image Pre-Processing for Better Results

Before passing an image to Tesseract, we apply several pre-processing steps via ImageMagick: deskewing corrects documents photographed at an angle; binarisation converts colour or grey images to clean black-and-white text; de-noising removes scanner noise and compression artefacts; contrast enhancement makes faint text darker. These steps can recover several percentage points of accuracy on poor-quality scans.
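To make the binarisation step concrete, here is a minimal pure-Python sketch of global thresholding. It uses Otsu's method to pick the threshold, which is an assumption for illustration: the page only says that pre-processing runs through ImageMagick and does not name the thresholding algorithm. The function names and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_score = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue  # one class is empty, so there is nothing to separate
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        score = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_level, best_score = level, score
    return best_level


def binarise(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    threshold = otsu_threshold(pixels)
    return [0 if value <= threshold else 255 for value in pixels]


# Hypothetical greyscale samples: three dark text pixels, five light background pixels.
faint_scan = [40, 45, 50, 200, 210, 220, 215, 205]
clean = binarise(faint_scan)
```

Here the dark text pixels (40 to 50) become 0 and the lighter background pixels become 255. A real pipeline would apply the same idea to a full two-dimensional image, often with locally adaptive thresholds for unevenly lit photos.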

Searchable PDF Output

A searchable PDF preserves the original scanned image as the visible layer, with an invisible text layer positioned precisely over each word. Generating this correctly requires mapping Tesseract's word bounding boxes (in image pixels, measured from the top-left corner) to PDF coordinates (in points of 1/72 inch, scaled by the scan's DPI). We use hOCR output from Tesseract, a structured HTML format with bounding box data, to place each word accurately in the output PDF.
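The coordinate mapping can be sketched as follows. hOCR stores each word's box in a `title` attribute as `bbox x0 y0 x1 y1`, in pixels from the top-left corner, while PDF user space is measured in points from the bottom-left corner. The helper names, the sample `title` string, and the page size are assumptions for illustration, not the service's actual code.

```python
import re

POINTS_PER_INCH = 72  # PDF user-space units per inch

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_hocr_bbox(title):
    """Extract (x0, y0, x1, y1) in pixels from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in title: {title!r}")
    return tuple(int(n) for n in match.groups())


def bbox_to_pdf_points(bbox, page_height_px, dpi):
    """Convert a top-left-origin pixel box to a bottom-left-origin PDF box."""
    x0, y0, x1, y1 = bbox
    # hOCR measures y downwards from the top edge; PDF measures upwards
    # from the bottom edge, so the vertical axis is flipped.
    return (
        x0 * POINTS_PER_INCH / dpi,
        (page_height_px - y1) * POINTS_PER_INCH / dpi,
        x1 * POINTS_PER_INCH / dpi,
        (page_height_px - y0) * POINTS_PER_INCH / dpi,
    )


# Hypothetical word box on a 3300-pixel-tall page scanned at 300 dpi.
title = "bbox 300 600 900 660; x_wconf 93"
box_pt = bbox_to_pdf_points(parse_hocr_bbox(title), page_height_px=3300, dpi=300)
```

At 300 dpi a 3300-pixel-tall page is 11 inches, or 792 points, and the sample word box lands at roughly (72, 633.6) to (216, 648) points. Getting the DPI wrong scales every box by the same factor, which is why the text layer drifts away from the visible words when a scan's resolution metadata is missing or incorrect.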

Language Support and Multi-Language Documents

Tesseract supports 100+ languages via separate trained data files. Selecting the correct language pack significantly improves accuracy: an English model run on French text will misidentify accented characters. For documents with mixed languages (common in legal documents, academic papers, or multilingual forms), select a combination of language packs (e.g. eng+fra), or use the auto-detect mode, which runs a language identification pass before recognition.
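For readers running Tesseract themselves, a combination such as eng+fra is passed through the command-line `-l` option. The sketch below only builds the argument list; the helper names and file paths are hypothetical, and the auto-detect mode described above is specific to this service and is not modelled here.

```python
def language_argument(codes):
    """Join Tesseract language codes into the value expected by -l."""
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(codes))
    if not unique or any(not code or "+" in code for code in unique):
        raise ValueError("expected one or more non-empty language codes")
    return "+".join(unique)


def tesseract_command(image_path, output_base, codes, output_format="txt"):
    """Build the argument list for a Tesseract CLI run."""
    if output_format not in {"txt", "pdf", "hocr"}:
        raise ValueError(f"unsupported output format: {output_format}")
    return [
        "tesseract",
        image_path,            # input image
        output_base,           # output path without extension
        "-l", language_argument(codes),
        output_format,         # config name selecting the output type
    ]


# Hypothetical bilingual contract, exported as a searchable PDF.
command = tesseract_command("contract.png", "contract_ocr", ["eng", "fra"], "pdf")
```

The resulting list can be handed to `subprocess.run(command, check=True)` on a machine where Tesseract and the relevant language packs are installed.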

Need More? Go Pro.

Unlock unlimited conversions, larger file sizes, priority processing, and cloud storage with HarmonyPal Pro.

Upgrade to Pro • Browse All Tools

No credit card for free tier • Cancel anytime • 12M+ users trust us

Frequently Asked Questions

What is OCR and how does it work?

What accuracy can I expect?

Does it work on handwritten text?

What is a searchable PDF?

What scan quality should I use?

Can I OCR a multi-page scanned PDF?
