A lightweight Ruby gem for extracting text from PDFs, including scanned PDFs using OCR.
This gem supports:
- PDFs with readable text
- Scanned PDFs using Tesseract OCR
- File objects, file paths, StringIO, and Rails/ActiveStorage uploads
- Fully Rails-independent
- Detect if PDF is scanned or text-based
- Extract text from normal PDFs using PDF::Reader
- Extract text from scanned PDFs using RTesseract and MiniMagick
- Automatic cleanup of temporary images
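The scanned-versus-text detection listed above can be pictured as a check on how much text the PDF's text layer yields. The following is a minimal sketch under stated assumptions, not the gem's actual implementation: `scanned_pdf?` and its `min_chars` threshold are hypothetical names, and the page strings are assumed to come from something like `PDF::Reader.new(path).pages.map(&:text)`.

```ruby
# Hypothetical helper, not part of pdf_ocr's documented API.
# page_texts: one string per page (nil entries are ignored).
# min_chars:  assumed threshold below which the text layer counts as empty.
def scanned_pdf?(page_texts, min_chars: 20)
  # Drop all whitespace so blank or nearly blank pages do not count as text.
  visible = page_texts.compact.join.gsub(/\s+/, "")
  visible.length < min_chars
end
```

A threshold rather than a strict emptiness test is used here because scanned PDFs sometimes carry a few stray characters in their text layer; the value 20 is only an illustrative guess.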
Add this line to your application's Gemfile:
gem 'pdf_ocr'
Or install directly:
gem install pdf_ocr
Dependencies:
- PDF::Reader
- RTesseract
- MiniMagick
- Tesseract OCR (system-level executable)
- pdftoppm from Poppler utils (for converting PDF pages to images)
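Because two of the dependencies above are system executables rather than gems, it can help to check for them before running an extraction. The sketch below is one possible pre-flight check using only the Ruby standard library; `missing_executables` is a hypothetical helper and not something the gem is known to provide.

```ruby
# Hypothetical pre-flight check, not part of pdf_ocr itself.
# Returns the subset of `names` that cannot be found as executables on PATH.
def missing_executables(names, path: ENV.fetch("PATH", ""))
  dirs = path.split(File::PATH_SEPARATOR).reject(&:empty?)
  names.reject do |name|
    dirs.any? do |dir|
      candidate = File.join(dir, name)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

# ImageMagick is left out here because its executable name differs
# between versions (`magick` in version 7, `convert` in version 6).
missing = missing_executables(%w[tesseract pdftoppm])
warn "Missing system dependencies: #{missing.join(', ')}" unless missing.empty?
```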
require 'ocr'
require 'stringio'
# From a File object (opened in binary mode and closed automatically)
File.open("path/to/document.pdf", "rb") do |file|
  result = Ocr::DataExtractor.new(file).call
  puts result["raw_text"] if result["success"]
end
# From a file path string
result = Ocr::DataExtractor.new("path/to/document.pdf").call
# From a StringIO object (in-memory PDF)
pdf_data = StringIO.new(File.binread("path/to/document.pdf")) # binread keeps the PDF bytes intact
result = Ocr::DataExtractor.new(pdf_data).call
- On success:
{
"success" => true,
"raw_text" => "Extracted text content from PDF ..."
}
- If OCR fails for a scanned PDF:
{
"success" => false,
"message" => "Unable to extract text using OCR"
}
- Ensure Tesseract OCR is installed on your system:
# Ubuntu/Debian
sudo apt install tesseract-ocr
# macOS (with Homebrew)
brew install tesseract
- Ensure pdftoppm is installed (for PDF-to-image conversion):
# Ubuntu/Debian
sudo apt install poppler-utils
# macOS (with Homebrew)
brew install poppler
- Ensure ImageMagick is installed (required by MiniMagick for image processing):
# Ubuntu/Debian
sudo apt install imagemagick
# macOS (with Homebrew)
brew install imagemagick
- This gem does not require Rails, but it works with Rails ActiveStorage objects that respond to .open.
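To illustrate how one entry point might accept a path, a File, a StringIO, or an object that responds to `.open`, here is a minimal sketch of input normalization. It is an assumption about how such handling could look, not the gem's actual code; `with_pdf_path` is a hypothetical name.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper, not pdf_ocr's actual code: yields a filesystem path
# for each supported input kind and removes any temp file afterwards.
def with_pdf_path(input)
  case input
  when String
    yield input # already a path
  when File
    yield input.path
  else
    if input.respond_to?(:read) # StringIO and other IO-like objects
      Tempfile.create(["pdf_ocr", ".pdf"]) do |tmp|
        tmp.binmode
        tmp.write(input.read)
        tmp.flush
        yield tmp.path
      end
    elsif input.respond_to?(:open) # e.g. an ActiveStorage blob
      input.open { |file| yield file.path }
    else
      raise ArgumentError, "unsupported input: #{input.class}"
    end
  end
end

# Usage: the block receives a path regardless of the input type.
with_pdf_path(StringIO.new("%PDF-1.4 ...")) { |path| puts File.size(path) }
```

Yielding the path to a block, rather than returning it, lets the helper delete the temporary file as soon as the caller is done with it.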
bundle install
bundle exec rspec
- PDFs with selectable text
- Scanned PDFs
- Malformed PDFs (fallback to OCR)
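The three cases above suggest a decision flow along these lines: try the text layer first, treat a parse error like a missing text layer, and fall back to OCR. This is a sketch of that flow under stated assumptions rather than the gem's real code. `extract_with_fallback` is a hypothetical name, and the two lambdas stand in for a text-layer reader (such as PDF::Reader) and an OCR pass (such as pdftoppm plus RTesseract).

```ruby
# Hypothetical decision flow, not pdf_ocr's actual implementation.
# read_text and run_ocr are callables that take a path and return a string.
def extract_with_fallback(path, read_text:, run_ocr:)
  text = begin
    read_text.call(path).to_s
  rescue StandardError
    "" # a malformed PDF is treated like one without a text layer
  end
  return { "success" => true, "raw_text" => text } unless text.strip.empty?

  ocr_text = run_ocr.call(path).to_s
  if ocr_text.strip.empty?
    { "success" => false, "message" => "Unable to extract text using OCR" }
  else
    { "success" => true, "raw_text" => ocr_text }
  end
end

# Usage with stand-in lambdas instead of real PDF and OCR libraries:
result = extract_with_fallback(
  "scan.pdf",
  read_text: ->(_path) { raise "malformed PDF" },
  run_ocr: ->(_path) { "Text recovered by OCR" }
)
puts result["raw_text"] if result["success"]
```

Passing the reader and the OCR step as arguments keeps the flow testable without Tesseract or Poppler installed.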
- Fork the repository
- Create your feature branch (git checkout -b your-feature)
- Commit your changes (git commit -am 'Add new feature')
- Push to the branch (git push origin your-feature)
- Open a Pull Request
Ravi Shankar Singhal
Senior Backend Developer — Ruby on Rails
📧 ravi.singhal2308@gmail.com
🌐 https://github.com/RaviShankarSinghal
MIT License © RaviShankarSinghal
The gem is available as open source under the terms of the MIT License.