file2md

Convert PDF and DOCX files to clean, grep-friendly Markdown optimized for AI tools and IDE workflows.

Features

PDF conversion — text extraction with PyMuPDF, page separators, scanned PDF detection
DOCX conversion — headings, lists, tables, bold/italic preserved as Markdown
Grep-friendly output — paragraph reflow, hyphenation fix, whitespace normalization
Drag-and-drop web UI — upload and convert files in your browser
Full CLI — single file, batch, and web server modes
Table extraction — PDF tables converted to GitHub-flavored Markdown
Header/footer removal — heuristic detection of repeating headers/footers
Metadata headers — source filename and conversion timestamp in output
YAML frontmatter — optional structured metadata for downstream tools

Installation

# CLI only (lightweight)
pip install file2md

# With web UI (quotes stop shells such as zsh from expanding the brackets)
pip install "file2md[web]"

# Development (all dependencies)
pip install "file2md[all]"

Quick Start

Web UI

file2md serve
# Then open http://127.0.0.1:8000 in your browser and drop your files onto the page

CLI — Single File

# Basic conversion
file2md convert document.pdf -o document.md

# With all enhancements
file2md convert report.pdf -o report.md --clean --frontmatter --page-labels --extract-tables

CLI — Batch

# Convert every PDF and DOCX file in a directory, including subdirectories
file2md batch ./documents --out-dir ./markdown --recursive

CLI Reference

`file2md convert`

Convert a single PDF or DOCX file to Markdown.

| Flag | Description |
| --- | --- |
| `-o / --output` | Output file path (defaults to input name with `.md`) |
| `--clean` | Normalize whitespace, reflow paragraphs, fix hyphenation |
| `--frontmatter` | Add YAML frontmatter (source, timestamp, converter) |
| `--page-labels` | Add `## Page N` headings (PDF only) |
| `--extract-tables` | Detect and convert tables to GFM (PDF only) |
| `--max-chars N` | Truncate output at N characters |
| `--overwrite` | Overwrite existing output file |
| `--quiet` | Suppress warnings |
| `--verbose` | Show detailed progress |
| `--json` | Machine-readable JSON output |

`file2md batch`

Batch convert all PDF/DOCX files in a directory.

| Flag | Description |
| --- | --- |
| `--out-dir` | Output directory (required) |
| `--recursive` | Process subdirectories |
| All flags from `convert` | Same options available |
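
As a rough illustration of what a batch run has to work out before converting anything, the sketch below collects PDF and DOCX files from a directory and pairs each one with an output path. The helper name `plan_batch` and the choice to mirror the input tree under the output directory are assumptions made for this example, not a description of file2md's internals.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".docx"}  # the two input types this README names


def plan_batch(in_dir, out_dir, recursive=False):
    """Return sorted (source, target) path pairs for a batch conversion.

    Hypothetical helper: mirrors the input tree under out_dir and swaps
    the suffix for .md. file2md's real layout rules may differ.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    candidates = in_dir.rglob("*") if recursive else in_dir.glob("*")
    pairs = []
    for src in sorted(candidates):
        # Compare suffixes case-insensitively so REPORT.PDF is picked up too.
        if src.is_file() and src.suffix.lower() in SUPPORTED_SUFFIXES:
            rel = src.relative_to(in_dir).with_suffix(".md")
            pairs.append((src, out_dir / rel))
    return pairs
```

Calling `plan_batch("./documents", "./markdown", recursive=True)` gives a list you can inspect before running the real `file2md batch` command.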

`file2md serve`

Start the web UI server.

| Flag | Description |
| --- | --- |
| `--host` | Host to bind to (default: `127.0.0.1`) |
| `--port` | Port to listen on (default: `8000`) |

Exit Codes

| Code | Meaning |
| --- | --- |
| `0` | Success |
| `2` | Unsupported file type |
| `3` | Extraction failed |
| `4` | Scanned PDF detected (no OCR) |
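
If you call the CLI from a script, the exit codes above can be turned into readable messages. This is a minimal sketch that assumes `file2md` is on your `PATH`; the function names are made up for the example, and any exit code not listed in the table is reported as undocumented rather than guessed at.

```python
import subprocess

# Exit codes exactly as listed in the table above.
EXIT_MEANINGS = {
    0: "Success",
    2: "Unsupported file type",
    3: "Extraction failed",
    4: "Scanned PDF detected (no OCR)",
}


def describe_exit(code):
    """Translate a file2md exit code into the README's wording."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit code {code}")


def convert(src, dest):
    """Run `file2md convert SRC -o DEST` and return (ok, message)."""
    result = subprocess.run(
        ["file2md", "convert", str(src), "-o", str(dest)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, describe_exit(result.returncode)
```

A caller might, for example, route files that come back with code 4 to an OCR step and retry them later.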

Output Conventions

Metadata

Every converted file includes a metadata comment:

<!-- source: document.pdf | converted: 2026-02-26T12:00:00Z | converter: file2md v0.1.0 -->

With --frontmatter:

---
source: document.pdf
converted: 2026-02-26T12:00:00Z
converter: file2md v0.1.0
---
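
Downstream tools can read the metadata comment back with a few lines of Python. The sketch below assumes the comment keeps the `key: value | key: value` shape shown above; `parse_metadata_comment` is a hypothetical helper, and a source filename containing `|` would confuse it.

```python
import re

# Matches the single-line HTML comment format shown above.
_COMMENT = re.compile(r"<!--\s*(.*?)\s*-->")


def parse_metadata_comment(line):
    """Parse 'key: value | key: value' pairs from the metadata comment.

    Returns an empty dict when the line holds no such comment.
    """
    match = _COMMENT.search(line)
    if not match:
        return {}
    fields = {}
    for part in match.group(1).split("|"):
        # Split on the first colon only: the timestamp contains colons too.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields
```

For the YAML frontmatter variant, a YAML parser is the safer choice than hand-rolled string splitting.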

PDF Page Separators

Pages are separated by a --- line. With --page-labels, each page also starts with a ## Page N heading:

## Page 1

Content of page 1...

---

## Page 2

Content of page 2...
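
Because the page headings are regular, output produced with --page-labels is easy to split back into pages. The following sketch keys on the `## Page N` headings rather than on `---`, since a horizontal rule could also appear inside page content. `split_pages` is an illustrative name, and the sketch assumes the layout shown above.

```python
import re

# Assumes --page-labels output: each page starts with a '## Page N' heading.
_PAGE_HEADING = re.compile(r"^## Page (\d+)[ \t]*$", re.MULTILINE)


def split_pages(markdown):
    """Return {page_number: page_text} for output produced with --page-labels."""
    pages = {}
    matches = list(_PAGE_HEADING.finditer(markdown))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        body = markdown[m.end():end].strip()
        # Drop the '---' separator that trails every page but the last.
        if body.endswith("---"):
            body = body[:-3].rstrip()
        pages[int(m.group(1))] = body
    return pages
```

This makes it straightforward to grep or index a single page without re-running the conversion.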

Clean Mode (`--clean`)

Paragraph reflow — undoes hard line wraps from PDF extraction
Hyphenation fix — merges hyphen-\nated words across lines
Header/footer removal — detects and removes repeating page headers/footers
Whitespace normalization — collapses extra spaces, limits blank lines
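
To make the steps above concrete, here is a simplified sketch of three of them (whitespace normalization, hyphenation fix, paragraph reflow). It is not the code in normalize.py: the function names are invented, header/footer removal is left out, and every block is treated as a plain paragraph, so headings, lists, and tables would need extra handling.

```python
import re


def normalize_whitespace(text):
    """Collapse extra spaces and limit consecutive blank lines to one."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    return re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line


def fix_hyphenation(text):
    """Merge words that were split with a hyphen at the end of a line.

    Heuristic: only merges when lowercase letters sit on both sides of
    the break. A true compound that happens to break at its hyphen is
    merged as well; telling the two apart would need a dictionary.
    """
    return re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)


def reflow(text):
    """Join hard-wrapped lines inside each paragraph into a single line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


def clean(text):
    """Apply the three steps in an order where each one helps the next."""
    return reflow(fix_hyphenation(normalize_whitespace(text)))
```

Whitespace is normalized first so that a stray trailing space after a line-end hyphen does not hide the split word from the hyphenation step.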

Architecture

src/file2md/
├── convert.py          # Main entry point — dispatches by file type
├── pdf.py              # PDF → Markdown (PyMuPDF)
├── docx_converter.py   # DOCX → Markdown (python-docx)
├── normalize.py        # Text cleanup (reflow, hyphenation, headers/footers)
├── cli.py              # Click CLI (convert, batch, serve)
├── web.py              # FastAPI web server
├── utils.py            # Shared types, validation, metadata
└── templates/
    └── index.html      # Drag-and-drop web UI
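
The "dispatches by file type" role of convert.py can be pictured as a lookup from file extension to converter. The sketch below is only an illustration of that pattern: `pick_converter`, `UnsupportedFileType`, and the converter names in the docstring are hypothetical and are not taken from the file2md source.

```python
from pathlib import Path


class UnsupportedFileType(ValueError):
    """Raised for inputs that are neither PDF nor DOCX."""


def pick_converter(path, converters):
    """Return the converter registered for the file's extension.

    `converters` maps a lowercase suffix to a callable, for example
    {".pdf": pdf_to_markdown, ".docx": docx_to_markdown} (hypothetical names).
    """
    suffix = Path(path).suffix.lower()
    try:
        return converters[suffix]
    except KeyError:
        raise UnsupportedFileType(f"unsupported file type: {suffix or path}") from None
```

In a CLI built this way, the unsupported-type error would be the natural place to return exit code 2.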

Development

git clone https://github.com/faizkhairi/file2md.git
cd file2md
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows
pip install -e ".[all]"

# Run tests
pytest

# Lint
ruff check src/ tests/

# Build
python -m build

Known Limitations

No OCR — scanned/image-only PDFs are detected and rejected with a clear error (exit code 4). OCR support is planned for a future release.
Complex PDF layouts — multi-column documents, sidebars, and footnotes may produce text in unexpected order.
Nested DOCX lists — only flat bullet/numbered lists are supported. Nested and mixed lists are not preserved.
Merged table cells — may produce duplicated or empty cells in the Markdown output.
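
The scanned-PDF check mentioned in the first limitation can be approximated with a simple text-length heuristic. The sketch below works on already-extracted page text (for instance, one string per page from PyMuPDF's `page.get_text()`); the function name and the `min_chars` threshold are assumptions for illustration, and file2md's own detection may use different rules.

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True when no page carries a meaningful amount of text.

    page_texts: one extracted-text string per page.
    min_chars: arbitrary threshold chosen for this sketch; pages with
    fewer non-whitespace-trimmed characters count as image-only.
    """
    return all(len(text.strip()) < min_chars for text in page_texts)
```

A document with even one text-bearing page is not flagged, which matches the "all pages appear to be scanned images" wording of the error.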

Troubleshooting

"All pages appear to be scanned images" — The PDF contains only images, no extractable text. You need to OCR the PDF first using a tool like ocrmypdf before converting.

Tables not appearing (PDF) — Use the --extract-tables flag. Table detection is off by default to keep output clean for text-heavy documents.
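
For reference, turning extracted rows into a GitHub-flavored Markdown table takes little code once the cells are known. This sketch assumes the table arrives as a list of rows with the header first; `rows_to_gfm` is a hypothetical helper and not file2md's API. Empty cells (given here as `None`, as some extractors emit for empty or merged cells) become blank columns.

```python
def rows_to_gfm(rows):
    """Render a list of rows (first row is the header) as a GFM table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Pad short rows, flatten newlines, and escape pipes inside cells.
        cells = ["" if c is None else str(c) for c in row]
        cells += [""] * (width - len(cells))
        cells = [c.replace("\n", " ").replace("|", "\\|").strip() for c in cells]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)
```

Escaping `|` inside cells matters because an unescaped pipe would be read as a column boundary.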

Output has hard line breaks — Use the --clean flag to enable paragraph reflow, which joins lines that were artificially broken by PDF formatting.

License

MIT
