Status: v1.6.1 — Production-ready with Vision OCR and AI text correction.
pdf22md extracts text and images from PDF files and converts them into clean Markdown documents. Built with Swift for macOS, it uses modern concurrency (async/await, GCD) to process multi-page documents quickly. Features include Vision-based OCR for scanned PDFs and optional AI-powered text correction via OpenAI or Apple Intelligence.
This tool is useful for:
- Students and Researchers: Convert academic papers, lecture notes, and research articles into editable Markdown for note-taking or further editing.
- Technical Writers and Developers: Extract content from PDF documentation for use in Markdown-based systems such as wikis or static site generators.
- Content Creators: Transform PDF reports, e-books, or brochures into Markdown format for web publishing.
- Anyone extracting PDF content: A straightforward solution for copying text and images out of PDFs.
Key features include:
- Speed: Uses all available CPU cores to process pages concurrently. Especially effective on large documents.
- Vision OCR: Extracts text from scanned PDFs and images using Apple's Vision framework. Works automatically when PDFKit text extraction returns empty content.
- AI Text Correction: Optional post-processing with OpenAI API or Apple Intelligence to fix OCR errors, improve formatting, and clean up extracted text.
- Smart Heading Detection: Analyzes font sizes and usage frequency to automatically format titles and headings (#, ##, ###) in the Markdown output.
- Image Extraction:
  - Pulls both raster (JPEG, PNG) and vector images from the PDF's XObject streams.
  - Saves images into a specified assets folder.
  - Links images in Markdown using this naming convention: <pdf-basename>-<page-number>-<asset-number>.<ext>.
- Intelligent Image Formatting: Chooses between JPEG and PNG based on image properties like transparency and color complexity to optimize file size and quality.
- Batch Processing: Process multiple PDFs in parallel with configurable job count.
- Password Support: Open password-protected PDFs.
- Flexible Input/Output:
  - Reads PDFs from file paths or stdin.
  - Writes Markdown to files or stdout.
- Custom DPI Rasterization: Converts vector graphics (charts, diagrams) into bitmaps at user-defined resolution. Default is 144 DPI.
- OCR Caching: Results are cached in ~/.cache/pdf22md/ocr/ to avoid re-processing unchanged PDFs.
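The heading detection described in the feature list can be pictured with a small sketch. This is not the code in FontStatistics.swift; it assumes that the font size covering the most characters is body text and that the three largest sizes above it map to #, ## and ###. The function names and the sample character counts are made up for illustration.

```swift
import Foundation

/// Maps font sizes to Markdown heading levels (1...3); body text gets no entry.
/// Assumption: the size used for the most characters is body text.
func headingLevels(forCharacterCounts counts: [Double: Int]) -> [Double: Int] {
    // The size that covers the most characters is treated as body text.
    guard let bodySize = counts.max(by: { $0.value < $1.value })?.key else { return [:] }

    // Larger-than-body sizes become headings, biggest first, capped at three levels.
    let headingSizes = counts.keys.filter { $0 > bodySize }.sorted(by: >).prefix(3)

    var levels: [Double: Int] = [:]
    for (index, size) in headingSizes.enumerated() {
        levels[size] = index + 1
    }
    return levels
}

/// Prefixes a line with the matching number of "#" characters, if any.
func markdownLine(_ text: String, fontSize: Double, levels: [Double: Int]) -> String {
    guard let level = levels[fontSize] else { return text }
    return String(repeating: "#", count: level) + " " + text
}

// Hypothetical statistics: font size -> number of characters set in that size.
let counts: [Double: Int] = [24: 40, 18: 120, 14: 300, 11: 9000, 9: 400]
let levels = headingLevels(forCharacterCounts: counts)
print(markdownLine("Introduction", fontSize: 18, levels: levels))  // prints "## Introduction"
```

A real implementation would probably round or bucket font sizes before counting, since extracted sizes are rarely exact.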
(Coming Soon) Install via Homebrew tap:
brew install twardoch/tap/pdf22md
Requires Xcode Command Line Tools.
- Clone the repository:
git clone https://github.com/twardoch/pdf22md.git
cd pdf22md
- Build the tool:
make build
The binary will be located at pdf22md/.build/release/pdf22md.
- Install system-wide (optional):
sudo make install
Basic syntax:
pdf22md [-i input.pdf] [-o output.md] [-a assets_folder] [-d dpi] [options]
Options:
- -i, --input <path>: Input PDF file. If omitted, reads from stdin.
- -o, --output <path>: Output Markdown file. If omitted, writes to stdout.
- -a, --assets <path>: Folder to save extracted images. Image extraction is skipped if not provided.
- -d, --dpi <value>: DPI for rasterizing vector graphics. Default: 144.0.
- -p, --password <pwd>: Password for protected PDFs.
Processing Modes:
- Default: Standard mode with Vision OCR for enhanced text extraction.
- --fast: Skip Vision OCR, use PDF text extraction only (faster but less accurate for scanned documents).
AI Options:
- --ai: Enable AI-based text correction (uses Apple Intelligence if --api is not specified).
- --api <config>: AI API in the format model:api_key@base_url (e.g., gpt-4o:sk-xxx@https://api.openai.com/v1).
- --ai-prompt <file>: Custom AI prompt template (JSON file).
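To make the model:api_key@base_url format concrete, here is a minimal parsing sketch. The `APIConfig` type and `parseAPIConfig` function are hypothetical and not taken from the pdf22md source. The sketch assumes the model name contains no colon and the key contains no "@", which holds for the examples in this README, including the empty-key form used for local servers.

```swift
import Foundation

/// Hypothetical container for the three parts of an --api value.
struct APIConfig: Equatable {
    let model: String
    let apiKey: String   // may be empty, e.g. for a local server
    let baseURL: URL
}

/// Parses "model:api_key@base_url"; returns nil if the shape does not match.
func parseAPIConfig(_ raw: String) -> APIConfig? {
    // Split at the first ":" only; the base URL contains colons of its own.
    guard let colon = raw.firstIndex(of: ":") else { return nil }
    let model = String(raw[..<colon])
    let rest = raw[raw.index(after: colon)...]

    // Split the remainder at the first "@" into key and base URL.
    guard let at = rest.firstIndex(of: "@") else { return nil }
    let apiKey = String(rest[..<at])
    let urlString = String(rest[rest.index(after: at)...])

    guard !model.isEmpty,
          let baseURL = URL(string: urlString),
          baseURL.scheme != nil else {
        return nil
    }
    return APIConfig(model: model, apiKey: apiKey, baseURL: baseURL)
}
```

Splitting at the first colon rather than the last matters here, because the URL part always carries at least one colon after its scheme.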
OCR Options:
- --languages <codes>: Languages for Vision OCR (comma-separated ISO 639 codes, e.g., en,fr,de). Default: en.
- --threshold <value>: Vision text preference threshold. Vision output is used when it is more than N times longer than the PDF text. Default: 1.5.
- --no-cache: Disable OCR result caching.
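The --threshold rule can be read as a simple length comparison. The sketch below is an interpretation of the option description, not the tool's actual code; the function name is hypothetical, and trimming whitespace before measuring is an added assumption.

```swift
import Foundation

/// Returns true when the OCR text should replace the PDFKit text for a page.
/// Assumption: lengths are compared after trimming surrounding whitespace.
func shouldPreferVisionText(pdfText: String, visionText: String, threshold: Double = 1.5) -> Bool {
    let pdfLength = pdfText.trimmingCharacters(in: .whitespacesAndNewlines).count
    let visionLength = visionText.trimmingCharacters(in: .whitespacesAndNewlines).count

    // With no PDFKit text at all (a scanned page), any OCR output wins.
    if pdfLength == 0 { return visionLength > 0 }

    // Otherwise Vision must be more than `threshold` times longer.
    return Double(visionLength) > threshold * Double(pdfLength)
}
```

With the default of 1.5, a page with 4 characters of PDF text needs more than 6 characters of OCR text before Vision is preferred.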
Output Control:
- -v, --verbose: Show additional warnings and debug info.
- -q, --quiet: Suppress all non-error output.
Batch Mode:
- --batch: Enable batch processing of multiple PDFs.
- -j, --jobs <n>: Number of parallel jobs in batch mode. Default: CPU core count.
Examples:
- Convert PDF with images:
pdf22md -i my_document.pdf -o my_document.md -a ./assets
- Use stdin/stdout:
cat report.pdf | pdf22md > report.md
- High DPI for better image quality:
pdf22md -i presentation.pdf -o slides.md -a ./images -d 300
- With AI text correction (OpenAI):
pdf22md -i scanned.pdf -o cleaned.md --ai --api "gpt-4o:sk-xxx@https://api.openai.com/v1"
- Fast mode (PDF text only, no Vision OCR):
pdf22md -i document.pdf -o output.md --fast
- Multi-language OCR:
pdf22md -i french_german.pdf -o output.md --languages fr,de
- Batch process multiple PDFs:
pdf22md --batch -i ./docs/ -o ./output/ -a ./assets/ -j 4
- Open password-protected PDF:
pdf22md -i protected.pdf -o output.md --password "secret"
The example.sh script converts sample PDFs using all four conversion methods for testing and comparison:
# Run all methods on testdata/pdf/*.pdf
./example.sh
# Quiet mode (summary only)
./example.sh -q
# Custom timeout and specific methods
./example.sh -t 60 -m fast,ultra
# Show help
./example.sh -h
Results are stored in testdata/{fast,standard,optimized,ultra}/ with extracted assets.
- macOS: 12.0 or later
- Swift: 5.7 or later (for building from source)
- Xcode Command Line Tools (for building)
This occurs when xcode-select points to Command Line Tools instead of Xcode.app:
# Check current setting
xcode-select -p
# Fix (requires admin)
sudo xcode-select -s /Applications/Xcode.app/Contents/Developer
- Ensure the PDF contains actual scanned images, not digital text.
- Try increasing DPI: --dpi 300
- Check language setting: --languages en,fr for multi-language documents
- Verify API key format: model:api_key@base_url
- Test with curl: curl -H "Authorization: Bearer YOUR_KEY" https://api.openai.com/v1/models
- For local models (Ollama), use an empty key: llama3:@http://localhost:11434/v1
- Process fewer pages at once: --max-pages 10
- Use --fast mode to skip Vision OCR
- Close other applications to free memory
- Use --fast for digital PDFs (skips Vision OCR)
- Reduce DPI for images: --dpi 72
- Use batch mode with parallel jobs: --batch -j 4
Designed for speed and efficiency:
- Parallel Processing: Uses Swift's async/await and TaskGroup to process PDF pages concurrently across all CPU cores.
- Memory Efficient: Handles large documents without excessive memory usage.
- Smart Algorithms: Applies intelligent font analysis and image processing to minimize overhead.
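The parallel page processing mentioned above follows a common TaskGroup pattern, sketched here under stated assumptions. `convertPage` is a hypothetical stand-in for the real per-page work (text extraction, OCR), and the way results are collected is one possible approach rather than a description of the converter's internals.

```swift
import Foundation

/// Hypothetical stand-in for real per-page work such as text extraction or OCR.
func convertPage(_ index: Int) async -> String {
    "Page \(index + 1) content"
}

/// Converts all pages concurrently and returns the results in page order.
func convertPages(count: Int) async -> [String] {
    await withTaskGroup(of: (Int, String).self, returning: [String].self) { group in
        for index in 0..<count {
            group.addTask {
                let markdown = await convertPage(index)
                return (index, markdown)
            }
        }
        // Child tasks finish in any order, so each result is placed by page index.
        var pages = [String](repeating: "", count: count)
        for await (index, markdown) in group {
            pages[index] = markdown
        }
        return pages
    }
}
```

Tagging each result with its page index keeps the output order stable without forcing the pages to finish in sequence.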
For fastest processing:
# Digital PDFs with embedded text - skip Vision OCR
pdf22md -i document.pdf -o output.md --fast
# Lower DPI for smaller images (default: 144)
pdf22md -i document.pdf -o output.md -a ./images --dpi 72
For batch processing:
# Process multiple PDFs with 4 parallel jobs
pdf22md --batch -i ./pdfs/ -o ./output/ -j 4
# Preview large PDFs - process first 3 pages only
pdf22md -i large.pdf -o preview.md --max-pages 3
For scanned PDFs (OCR-heavy):
# OCR results are cached by default (~/.cache/pdf22md/ocr/)
# Second run on same PDF is instant
# Disable cache for fresh OCR
pdf22md -i scanned.pdf -o output.md --no-cache
# Specify languages for better accuracy
pdf22md -i multilingual.pdf -o output.md --languages de,fr,en
Memory considerations:
- Large PDFs (100+ pages) process incrementally
- Use --max-pages to limit processing for previews
- Image extraction at high DPI uses more memory
Single unified converter using Swift's structured concurrency:
PDFMarkdownConverter.swift: Main converter with Vision OCR integration and optional AI processing. Uses async/await and TaskGroup for parallel page processing.
- Document Analysis & Font Statistics (FontStatistics.swift)
  - Detects headings by analyzing font size frequency and usage.
  - Sorts elements by page number and vertical position to preserve document flow.
- Content Modeling (PDFElement.swift)
  - TextElement: Stores text string, bounding box, page index, font size, and style (bold, italic).
  - ImageElement: Stores CGImage, bounds, page index, vector source status, and asset file path.
- Page Processing (PDFPageProcessor*.swift)
  - Extracts text and its attributes (font, size, style) using PDFKit.
  - Image Extraction (CGPDFImageExtractor.swift):
    - Pulls raster images from XObject streams.
    - Rasterizes vector graphics at specified DPI.
  - Creates TextElement and ImageElement instances with extracted data.
- Asset Pipeline (AssetExtractor.swift)
  - Saves images with naming convention: [pdf-basename]-[page_number]-[asset_index_on_page].[format].
  - Selects PNG for images with transparency or fewer colors; JPEG for complex color patterns.
  - Writes images to assets folder and returns correct paths for Markdown linking.
- Markdown Output (PDFMarkdownConverter*.swift)
  - Traverses sorted PDFElement list.
  - Converts TextElement to Markdown with proper formatting (bold, italic, headings).
  - Converts ImageElement to Markdown image links.
  - Inserts page breaks (---) between pages when needed.
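The last stages of this pipeline (ordering elements, naming assets, emitting Markdown) can be sketched in a few lines. The types below are simplified stand-ins, not the definitions in PDFElement.swift, and measuring vertical position as a distance from the top of the page is an assumption made to keep the sort readable.

```swift
import Foundation

// Simplified stand-ins for the element types; the real definitions may differ.
enum Element {
    case text(String, headingLevel: Int?)
    case image(assetIndex: Int, ext: String)
}

struct PlacedElement {
    let page: Int      // 1-based page number
    let top: Double    // assumed: distance from the top of the page, in points
    let element: Element
}

/// Builds "<pdf-basename>-<page-number>-<asset-number>.<ext>".
func assetName(pdfBasename: String, page: Int, asset: Int, ext: String) -> String {
    "\(pdfBasename)-\(page)-\(asset).\(ext)"
}

func renderMarkdown(_ elements: [PlacedElement], pdfBasename: String, assetsDir: String) -> String {
    // Reading order: by page, then top to bottom within a page.
    let ordered = elements.sorted { ($0.page, $0.top) < ($1.page, $1.top) }

    var lines: [String] = []
    var currentPage = ordered.first?.page
    for item in ordered {
        if item.page != currentPage {
            lines.append("---")   // page break between pages
            currentPage = item.page
        }
        switch item.element {
        case .text(let string, let level):
            let prefix = level.map { String(repeating: "#", count: $0) + " " } ?? ""
            lines.append(prefix + string)
        case .image(let index, let ext):
            let name = assetName(pdfBasename: pdfBasename, page: item.page, asset: index, ext: ext)
            lines.append("![](\(assetsDir)/\(name))")
        }
    }
    return lines.joined(separator: "\n\n")
}
```

PDF coordinates normally start at the bottom-left corner, so a real converter would flip the vertical axis (or sort descending) to get the same top-to-bottom order.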
- Pages processed in parallel using Swift's TaskGroup.
- Vision OCR requests execute concurrently per page.
- AI text correction processes pages sequentially with sliding window context.
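One way to picture the sequential AI pass with sliding window context is shown below. This is an illustration only: the window size, the choice of previously corrected pages as context, and the `correct` closure are assumptions, not details confirmed by the source.

```swift
/// Corrects pages one at a time, handing each call the last `window`
/// already-corrected pages as context (assumed meaning of "sliding window").
func correctSequentially(
    pages: [String],
    window: Int = 2,
    correct: (_ page: String, _ context: [String]) -> String
) -> [String] {
    var corrected: [String] = []
    for page in pages {
        // Only the most recent pages are passed along, which bounds prompt size.
        let context = Array(corrected.suffix(window))
        corrected.append(correct(page, context))
    }
    return corrected
}
```

Because each call depends on the output of the previous ones, this step cannot be parallelized the way page extraction can.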
- Automatically activates when PDFKit returns empty text (scanned PDFs).
- Uses VNRecognizeTextRequest with accurate recognition level.
- Processes page images at configurable DPI for optimal accuracy.
- Results integrated into the same TextElement model.
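For readers unfamiliar with Vision, a minimal text recognition call looks roughly like this. The sketch is not the converter's OCR code; the function name `recognizeText` is hypothetical, and it assumes the page has already been rendered to a `CGImage` and that the SDK is macOS 12 or later, where `results` is typed.

```swift
#if canImport(Vision)
import Vision
import CoreGraphics

/// Runs text recognition on one rendered page image and returns its lines.
func recognizeText(in image: CGImage, languages: [String] = ["en"]) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate      // slower but more precise than .fast
    request.recognitionLanguages = languages
    request.usesLanguageCorrection = true

    // perform(_:) runs synchronously and fills in request.results.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // Each observation is one detected line; keep its best candidate string.
    let observations = request.results ?? []
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
#endif
```

Each observation also carries a normalized bounding box, which is what would let recognized lines be placed into the same element model as PDFKit text.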
- Optional post-processing step using LLM APIs.
- OpenAI: GPT-4o-mini for efficient text cleanup.
- Apple Intelligence: On-device processing (macOS 15+).
- Corrects OCR errors, improves formatting, normalizes whitespace.
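As an illustration of the OpenAI path, the sketch below builds a chat completions request for one page of text. The endpoint path and JSON shape follow the public OpenAI-compatible API; the type names, function name, and system prompt are placeholders and are not taken from pdf22md.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on Linux
#endif

struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }

/// Builds (but does not send) a correction request for one page of text.
func makeCorrectionRequest(baseURL: URL, apiKey: String, model: String, pageText: String) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("chat/completions"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    // Local servers may accept an empty key, so the header is optional.
    if !apiKey.isEmpty {
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    }
    let body = ChatRequest(model: model, messages: [
        // Placeholder prompt; the tool's real prompt template is not shown here.
        ChatMessage(role: "system", content: "Fix OCR errors and normalize whitespace. Return only the corrected text."),
        ChatMessage(role: "user", content: pageText),
    ])
    request.httpBody = try JSONEncoder().encode(body)
    return request
}
```

The request can then be sent with `URLSession`, and the corrected text read from the first choice in the response.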
- Content Extraction Layer: Bridges PDFKit parsing with structured PDFElement representation.
- Vision OCR Layer: Fallback text extraction using Apple's Vision framework.
- AI Processing Layer: Optional LLM-based text enhancement and correction.
- Asset Management Layer: Links CGImage objects to disk files and manages folder organization.
We welcome contributions. Follow these guidelines for smooth collaboration.
- Focused Changes: Only modify code relevant to your feature or bug fix.
- Complete Code: No placeholders. Submit working implementations.
- Incremental Approach: Break complex problems into smaller steps.
- Clear Reasoning: Explain your solution with evidence from code or behavior.
- Follow AGENTS.md: Respect local directory guidelines if present.
- Swift: 5.7+
- macOS: 12.0+
- Package Manager: Swift Package Manager. Update Package.swift as needed.
- Concurrency: Use async/await and Actors appropriately. Ensure thread safety with GCD.
- Code Style: Follow Swift API Design Guidelines. Use SwiftFormat if config provided.
- Error Handling: Use Swift's Error protocol. Define custom errors (e.g., PDFConversionError) and propagate gracefully.
- Value Types: Prefer struct over class unless reference semantics are required.
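As a rough illustration of the error-handling guideline, a custom error type might look like the sketch below. The cases, messages, and the `validateDPI` helper are invented for the example; the project's actual PDFConversionError may be defined differently.

```swift
import Foundation

/// Hypothetical cases; shown only to illustrate the pattern.
enum PDFConversionError: Error, Equatable {
    case fileNotFound(path: String)
    case passwordRequired
    case invalidDPI(Double)
}

extension PDFConversionError: LocalizedError {
    var errorDescription: String? {
        switch self {
        case .fileNotFound(let path): return "No PDF found at \(path)."
        case .passwordRequired: return "This PDF is password-protected."
        case .invalidDPI(let value): return "DPI must be positive, got \(value)."
        }
    }
}

/// Throws instead of silently clamping, so callers decide how to recover.
func validateDPI(_ dpi: Double) throws -> Double {
    guard dpi > 0 else { throw PDFConversionError.invalidDPI(dpi) }
    return dpi
}
```

Conforming to `LocalizedError` lets the command-line layer print `error.localizedDescription` without knowing the individual cases.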
- All new code must include unit or integration tests.
- Tests located in pdf22md/Tests/PDF22MDTests/.
- Use XCTest framework.
- Fork and clone the repository.
- Create a branch (feature/your-feature or bugfix/issue-number).
- Implement changes following guidelines.
- Write tests and verify all pass:
swift test # or make test
- Build project:
swift build -c release # or make build
- Update documentation (README.md, CHANGELOG.md) if needed.
- Commit with clear messages.
- Push to your fork.
- Open PR to main branch of original repository.
For full details, see CONTRIBUTING.md.
- After updates:
  - Update CHANGELOG.md.
  - Review TODO.md: remove completed items, add new ones.
  - Build application to verify functionality.