OCR Tool

OCR Text Extraction

Extract text from scanned PDFs and images with Tesseract OCR.

Tesseract OCR · 100+ languages · Searchable PDF
Powered by Tesseract OCR (LSTM)
Up to 100 MB free • Output formats: TXT, PDF

Safe & Secure
Files deleted after 2 hours
Fast processing
No account needed
Blazing Fast

How to Convert

Convert any file in seconds: no software, no sign-up.

01
Upload

Upload a scanned PDF or image

02
Choose format

Choose the output format

03
Download

Get your converted file

Why EasyConv

Why Use EasyConv

Professional-grade conversion with features designed for real-world workflows.

All major formats

Supports all popular formats.

Quality control

Adjust quality settings.

DOCX Output

Export recognised text as a Microsoft Word document with basic paragraph structure preserved for further editing.

Metadata preserved

Metadata and tags are preserved.

Secure processing

Files are processed safely and securely.

Powered by Tesseract

Built on the proven Tesseract OCR engine.

300+ Formats

Supported Formats

Detailed breakdown of every format supported by this converter.

| Format | Description | Extension | Use Case |
|---|---|---|---|
| JPG / JPEG | Photo of a document or scanned page | .jpg | Smartphone photos of documents |
| PNG | High-contrast screenshot or scan | .png | Screenshots, receipts, invoices |
| TIFF | Multi-page TIFF scan from scanner | .tiff | Professional scanner output |
| PDF (scanned) | Image-only PDF without text layer | .pdf | Scanned contracts, books, forms |
| TXT output | Plain text extraction from document | .txt | Data extraction, NLP, databases |
| DOCX output | Editable Word document from OCR | .docx | Edit and format extracted content |
| PDF output (OCR) | Searchable PDF with invisible text layer | .pdf | Archival, full-text search, Ctrl+F |
3 Simple Steps

Frequently Asked Questions

Everything you need to know about this conversion tool.

OCR (Optical Character Recognition) analyses the pixel patterns in an image to identify characters and words. Tesseract, the engine behind this tool, uses LSTM neural networks for high-accuracy recognition of printed text.

Clean, high-contrast, well-aligned text (like a laser-printed document) achieves 98-99% character accuracy. Handwritten text, low-resolution scans, or unusual fonts will have lower accuracy.

Tesseract can recognise neat, block handwriting but accuracy is much lower than for printed text. For best results use a high-resolution scan (300 dpi+) and select the correct language.

A searchable PDF contains the original scanned image as the visible layer, with an invisible text layer underneath. You can then search (Ctrl+F), select, and copy text in any PDF viewer.

Minimum 200 dpi for acceptable results, 300 dpi recommended for best accuracy. Scans at 600 dpi do not significantly improve accuracy but increase processing time.

Multi-page PDFs are supported and processed page by page. Each page is de-skewed and recognised individually, and the output PDF and DOCX contain all pages in order.
Trusted by 12M+ users

Who Uses This Tool

Real-world use cases from professionals across different industries.

Field Worker
Extract Text from Photographed Docs

Photograph a printed form or sign with your phone, upload the image, and get editable text back in seconds.

Archivist
Make Scanned Archives Searchable

Add a searchable text layer to scanned historical documents, contracts, and records for full-text indexing.

Researcher
Extract Text from Academic Papers

OCR scanned journal articles or book chapters to copy passages, run citation searches, and feed into reference managers.

Accountant
Digitise Paper Receipts

Scan expense receipts and OCR them to extract amounts, dates, and merchant names for expense report automation.

Translator
Extract Text for Translation

OCR a scanned document to get editable text, then paste into a translation tool for multilingual document conversion.

Data Engineer
Feed Scanned Data into Pipelines

Convert batches of scanned invoice images to TXT output for automated data extraction and structured data pipelines.

Why EasyConv

Comparison

See how we compare with other solutions

Feature   EasyConv (our tool)   Adobe Acrobat   Other Online
100+ languages Limited
Searchable PDF
DOCX output
Multi-page PDF
Auto-deskew & pre-processing
LSTM neural network Varies
No watermark on output
Free
Technical Specifications

Detailed technical information about our conversion engine.

Limits
  • Max file size: 100 MB (free)
  • Input: JPG, PNG, TIFF, BMP, PDF (scanned)
  • Output: TXT, DOCX, searchable PDF
Tesseract OCR · Text Recognition · Scanned PDF · LSTM
Engine
Tesseract 5.x OCR (LSTM neural network) · ImageMagick for pre-processing
Quality
98–99% character accuracy on clean 300 dpi+ printed text; lower on handwriting
Speed
Typical: 3–8 s per page; multi-page PDFs scale linearly per page
Security
HTTPS transfer · isolated temp dir · auto-purged in 2 h
Languages
100+ via Tesseract language packs
Input formats
JPG, PNG, TIFF, BMP, scanned PDF
Output formats
TXT, DOCX, searchable PDF
Complete Guide

How Tesseract OCR Works: LSTM Neural Networks, Pre-Processing, and Language Detection

Tesseract has been developed for over 30 years: originally at HP, later under Google's sponsorship, and today as a community-maintained open-source project. Its modern LSTM engine brings accuracy levels previously only available in commercial OCR products.

From Image to Text: The OCR Pipeline

OCR is a multi-stage process. First, the input image is pre-processed: converted to greyscale, binarised (thresholded to black and white), deskewed (rotation corrected), and de-noised. Next, Tesseract's layout analysis segments the image into text regions, lines, and words. Finally, its LSTM model processes each line of text as a sequence of character probabilities and decodes the most likely character sequence using beam search.
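The final decoding stage can be illustrated with a generic CTC-style prefix beam search. This is a minimal sketch under stated assumptions, not Tesseract's actual decoder: the function name `ctc_beam_search`, the empty-string blank symbol, and the toy probabilities are all hypothetical, and a real engine works with far larger character sets.

```python
from collections import defaultdict

BLANK = ""  # hypothetical "no character" symbol used by CTC-style models


def ctc_beam_search(steps, beam_width=3):
    """Decode a list of per-timestep distributions (dict: symbol -> probability).

    Each beam entry maps a text prefix to a pair
    (probability of ending in blank, probability of ending in a character).
    """
    beams = {"": (1.0, 0.0)}
    for dist in steps:
        nxt = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_char) in beams.items():
            total = p_blank + p_char
            for sym, p in dist.items():
                if sym == BLANK:
                    # A blank keeps the prefix unchanged.
                    nxt[prefix][0] += total * p
                elif prefix and prefix[-1] == sym:
                    # A repeated symbol only adds a new character
                    # if a blank separated the two occurrences.
                    nxt[prefix + sym][1] += p_blank * p
                    nxt[prefix][1] += p_char * p
                else:
                    nxt[prefix + sym][1] += total * p
        # Keep only the most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: kv[1][0] + kv[1][1], reverse=True)
        beams = {prefix: (pb, pc) for prefix, (pb, pc) in ranked[:beam_width]}
    best_prefix, _ = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])
    return best_prefix
```

With the two-step input `[{"a": 0.4, "": 0.6}, {"a": 0.4, "": 0.6}]`, blank is the most likely symbol at every step, so choosing the best symbol per step would produce an empty string. The beam search instead sums over all alignments that collapse to "a" (total 0.64 against 0.36 for the empty string) and returns "a".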

LSTM vs Legacy OCR

Older OCR engines (including Tesseract 3.x) classified characters one at a time, comparing each segmented character shape against a library of trained prototypes. The LSTM engine, introduced in Tesseract 4 and carried forward into Tesseract 5, instead trains a recurrent neural network on millions of text examples, learning contextual cues at the line level. This markedly improves accuracy on cursive text, unusual fonts, and degraded documents where individual characters are ambiguous but their context makes them clear.

Image Pre-Processing for Better Results

Before passing an image to Tesseract, we apply several pre-processing steps via ImageMagick: deskewing corrects documents photographed at an angle; binarisation converts colour or grey images to clean black-and-white text; de-noising removes scanner noise and compression artefacts; contrast enhancement makes faint text darker. These steps can recover several percentage points of accuracy on poor-quality scans.
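Binarisation is the easiest of these steps to show in isolation. The sketch below uses Otsu's method, a common way to pick a global threshold. The source does not say which thresholding method the ImageMagick step uses, so treat the choice of Otsu and the helper names `otsu_threshold` and `binarise` as assumptions for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0..255) that best separates dark from light pixels.

    `pixels` is a flat list of integer grey values. The threshold chosen is the
    one that maximises the between-class variance of the two groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_dark = 0
    sum_dark = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

For example, `binarise([10, 12, 11, 200, 210, 205])` returns `[0, 0, 0, 255, 255, 255]`: the three dark values fall on one side of the threshold and the three light values on the other.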

Searchable PDF Output

A searchable PDF preserves the original scanned image as the visible layer, with an invisible text layer positioned precisely over each word. Generating this correctly requires mapping Tesseract's word bounding boxes (in image pixels) to PDF coordinates (in points of 1/72 inch, scaled by the scan's DPI). We use hOCR output from Tesseract, a structured HTML format with bounding box data, to place each word accurately in the output PDF.
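The coordinate mapping comes down to one scale factor and a vertical flip. hOCR stores each word's box in its `title` attribute as `bbox x0 y0 x1 y1`, measured in pixels from the top-left corner, while PDF user space is measured in points (1/72 inch) from the bottom-left. A minimal sketch, assuming a known scan DPI and the hypothetical helper names `parse_bbox` and `bbox_to_pdf_rect`:

```python
import re

POINTS_PER_INCH = 72.0


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) in pixels from an hOCR title attribute."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError("no bbox found in hOCR title: %r" % title)
    return tuple(int(value) for value in match.groups())


def bbox_to_pdf_rect(bbox, image_height_px, dpi):
    """Convert a pixel bbox (top-left origin) to a PDF rectangle in points.

    Returns (x, y, width, height) with the origin at the bottom-left,
    which is how PDF user space is laid out by default.
    """
    x0, y0, x1, y1 = bbox
    scale = POINTS_PER_INCH / dpi
    x = x0 * scale
    y = (image_height_px - y1) * scale  # flip the vertical axis
    return (x, y, (x1 - x0) * scale, (y1 - y0) * scale)
```

For a word at `bbox 300 600 900 660` on a 3300-pixel-tall page scanned at 300 dpi, the scale factor is 72 / 300 = 0.24, so the word lands 72 pt from the left edge and 633.6 pt from the bottom, and measures 144 pt by 14.4 pt.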

Language Support and Multi-Language Documents

Tesseract supports 100+ languages via separate trained data files. Selecting the correct language pack significantly improves accuracy: an English model applied to French text will misidentify accented characters. For documents with mixed languages (common in legal documents, academic papers, or multilingual forms), select a language combination (e.g. eng+fra) or use the auto-detect mode, which runs a language identification pass before recognition.
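On the Tesseract command line, languages are combined by joining their codes with `+` in the `-l` option. The helper below only assembles the argument list. `build_tesseract_command` is a hypothetical name, and running the command assumes a local `tesseract` install with the matching language data files. It does not cover the auto-detect mode, which is specific to this service.

```python
def build_tesseract_command(image_path, output_base, languages, output="txt"):
    """Build an argument list for the tesseract CLI.

    `languages` is a list of Tesseract language codes such as ["eng", "fra"].
    `output` names one of Tesseract's standard output configs.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    if output not in ("txt", "hocr", "pdf"):
        raise ValueError("unsupported output config: %r" % output)
    # Multiple languages are passed as a single "+"-joined value.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), output]


# Example (not executed here): produce out.pdf from an English/French scan.
# import subprocess
# subprocess.run(build_tesseract_command("scan.png", "out", ["eng", "fra"], "pdf"), check=True)
```

Building the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.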

Ready to convert?

Start converting for free, no sign-up required

No card needed for the free tier · Cancel anytime · Trusted by 12M+ users