Problem
The workspace knowledgebase only indexes text, Markdown, and code files. PDF and DOCX are common document formats, but they are currently not indexed.
Research
- PrivateGPT: PDFReader (llama-index), DocxReader. Supports PDF, DOCX, EPUB, PPTX, images (OCR), IPython notebooks
- Khoj: PyMuPDF (fitz) for PDF with pdfplumber fallback, python-docx for DOCX
- LocalGPT: PDFMinerLoader or UnstructuredPDFLoader
Proposed approach
Add two new parser plugins following the existing workspace plugin contract:
PDF parser: plugins/workspace/parsers/pdf/
- Use PyMuPDF (fitz) as primary, pdfplumber as fallback
- Extract text per page with page markers
- Optional dep: pip install pymupdf
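The PDF steps above could be sketched roughly as follows. This is a minimal illustration, not the plugin contract itself: the function names `join_pages` and `extract_pdf_text`, the `[Page N]` marker format, and the choice to fall back to pdfplumber only when PyMuPDF is not importable are all assumptions made for the example.

```python
from typing import Iterable, List


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page text, prefixing each page with a marker line.

    The "[Page N]" marker format is an assumption for this sketch.
    """
    parts: List[str] = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"[Page {number}]\n{text.strip()}")
    return "\n\n".join(parts)


def extract_pdf_text(path: str) -> str:
    """Extract text with PyMuPDF; use pdfplumber if PyMuPDF is missing."""
    try:
        import fitz  # PyMuPDF, optional dependency
    except ImportError:
        fitz = None

    if fitz is not None:
        with fitz.open(path) as doc:
            # The generator is fully consumed before the document closes.
            return join_pages(page.get_text() for page in doc)

    import pdfplumber  # fallback, also optional

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None for pages without a text layer.
        return join_pages((page.extract_text() or "") for page in pdf.pages)
```

Keeping the marker logic in a separate pure function means it can be unit-tested without either PDF library installed.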
DOCX parser: plugins/workspace/parsers/docx/
- Use python-docx library
- Extract paragraph text
- Optional dep: pip install python-docx
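A comparable sketch for the DOCX side, assuming paragraph text is all that is needed (tables, headers, and footers are ignored here). The names `join_paragraphs` and `extract_docx_text` and the error message are hypothetical; the real parser would follow whatever interface the workspace plugin contract defines.

```python
from typing import Iterable


def join_paragraphs(paragraphs: Iterable[str]) -> str:
    """Drop empty paragraphs and join the rest with blank lines."""
    return "\n\n".join(p.strip() for p in paragraphs if p.strip())


def extract_docx_text(path: str) -> str:
    """Extract paragraph text from a DOCX file using python-docx."""
    try:
        import docx  # provided by the python-docx package (optional dep)
    except ImportError as exc:
        raise RuntimeError(
            "DOCX support requires: pip install python-docx"
        ) from exc

    document = docx.Document(path)
    return join_paragraphs(p.text for p in document.paragraphs)
```

Importing `docx` inside the function keeps the dependency optional: the module loads without python-docx, and the error only appears when a DOCX file is actually parsed.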
Update BINARY_SUFFIXES so that it no longer lists .pdf and .docx (binary but parseable); otherwise these files would still be skipped as binary. Add a workspace-docs optional dep group in pyproject.toml.
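The BINARY_SUFFIXES change might look something like the sketch below. The set contents, `LEGACY_BINARY_SUFFIXES`, `PARSEABLE_SUFFIXES`, and `is_skipped_as_binary` are hypothetical stand-ins; the actual definition in the codebase may be shaped differently.

```python
from pathlib import Path

# Hypothetical contents; the real set in the codebase may differ.
LEGACY_BINARY_SUFFIXES = frozenset({".png", ".jpg", ".zip", ".pdf", ".docx"})

# Binary formats that now have a dedicated parser plugin.
PARSEABLE_SUFFIXES = frozenset({".pdf", ".docx"})

# Parseable formats are removed so the indexer no longer skips them.
BINARY_SUFFIXES = LEGACY_BINARY_SUFFIXES - PARSEABLE_SUFFIXES


def is_skipped_as_binary(path: Path) -> bool:
    """Return True if the indexer should skip this file as binary."""
    return path.suffix.lower() in BINARY_SUFFIXES
```

Lowercasing the suffix means files such as `report.PDF` are handled the same way as `report.pdf`.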
Related
Part of workspace foundation (#5840)