feat: workspace PDF and DOCX parser plugins #5850

@kshitijk4poor

Problem

The workspace knowledge base only indexes text, Markdown, and code files. PDF and DOCX are common document formats, but they are currently treated as binary and skipped, so their content never reaches the index.

Research

  • PrivateGPT: PDFReader (llama-index), DocxReader. Supports PDF, DOCX, EPUB, PPTX, images (OCR), IPython notebooks
  • Khoj: PyMuPDF (fitz) for PDF with pdfplumber fallback, python-docx for DOCX
  • LocalGPT: PDFMinerLoader or UnstructuredPDFLoader

Proposed approach

Add two new parser plugins following the existing workspace plugin contract:

PDF parser: plugins/workspace/parsers/pdf/

  • Use PyMuPDF (fitz) as primary, pdfplumber as fallback
  • Extract text per page with page markers
  • Optional dep: pip install pymupdf
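
A minimal sketch of the primary/fallback flow described above, not the final plugin code. The function names and the `--- Page N ---` marker format are assumptions for illustration; the issue does not fix a marker format. It also assumes the fallback triggers only when PyMuPDF is not installed; whether it should also trigger on extraction errors is left open here.

```python
from pathlib import Path


def format_pages(pages: list[str]) -> str:
    """Join per-page text, prefixing each page with a 1-based marker."""
    # The "--- Page N ---" marker format is an assumption, not taken from the issue.
    chunks = [
        f"--- Page {n} ---\n{text.strip()}"
        for n, text in enumerate(pages, start=1)
    ]
    return "\n\n".join(chunks)


def _pages_with_pymupdf(path: Path) -> list[str]:
    import fitz  # PyMuPDF; imported lazily because it is an optional dependency

    with fitz.open(str(path)) as doc:
        return [page.get_text() for page in doc]


def _pages_with_pdfplumber(path: Path) -> list[str]:
    import pdfplumber  # optional fallback dependency

    with pdfplumber.open(str(path)) as pdf:
        # extract_text() can return None for pages without a text layer
        return [page.extract_text() or "" for page in pdf.pages]


def extract_pdf_text(path: Path) -> str:
    """Try PyMuPDF first, then pdfplumber; raise if neither is installed."""
    for reader in (_pages_with_pymupdf, _pages_with_pdfplumber):
        try:
            return format_pages(reader(path))
        except ImportError:
            continue
    raise RuntimeError(
        "PDF parsing needs an optional dependency: pip install pymupdf"
    )
```

Keeping the marker logic in `format_pages` means it can be unit-tested without either PDF library installed.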

DOCX parser: plugins/workspace/parsers/docx/

  • Use python-docx library
  • Extract paragraph text
  • Optional dep: pip install python-docx
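
A rough sketch of the paragraph extraction, assuming python-docx is imported lazily so the plugin can report a clear error when the optional dependency is missing. The helper names and the choice to drop empty paragraphs are assumptions, not part of the plugin contract.

```python
from pathlib import Path


def join_paragraphs(texts: list[str]) -> str:
    """Drop empty paragraphs and separate the rest with blank lines."""
    kept = [t.strip() for t in texts if t.strip()]
    return "\n\n".join(kept)


def extract_docx_text(path: Path) -> str:
    """Return the body paragraph text of a DOCX file."""
    try:
        from docx import Document  # python-docx, optional dependency
    except ImportError as exc:
        raise RuntimeError(
            "DOCX parsing needs an optional dependency: pip install python-docx"
        ) from exc
    document = Document(str(path))
    return join_paragraphs([p.text for p in document.paragraphs])
```

Note that `Document.paragraphs` covers body paragraphs only; text inside tables would need separate handling if it should be indexed.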

Remove .pdf and .docx from BINARY_SUFFIXES so these files are no longer skipped as binary (they are binary but parseable). Add a workspace-docs optional dependency group in pyproject.toml.
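
One way the BINARY_SUFFIXES change could look, as a sketch only. The set contents below are hypothetical (the real set is not shown in this issue), and `PARSEABLE_BINARY_SUFFIXES` and `is_skipped_binary` are illustrative names.

```python
from pathlib import Path

# Hypothetical contents; the real BINARY_SUFFIXES set is not shown in the issue.
_ALL_BINARY = {".png", ".jpg", ".zip", ".exe", ".pdf", ".docx"}

# Binary formats that a parser plugin can turn into text.
PARSEABLE_BINARY_SUFFIXES = frozenset({".pdf", ".docx"})

# Everything still skipped by the indexer as opaque binary data.
BINARY_SUFFIXES = frozenset(_ALL_BINARY) - PARSEABLE_BINARY_SUFFIXES


def is_skipped_binary(path: Path) -> bool:
    """True if the indexer should skip this file as unparseable binary."""
    return path.suffix.lower() in BINARY_SUFFIXES
```

Lowercasing the suffix keeps files such as `report.PDF` on the parseable path.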

Related

Part of workspace foundation (#5840)

Labels: P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), type/feature (New feature or request)
