Description
Problem Statement
When parsing large PDFs (e.g., a 463-page Chinese Pharmacopoeia), the current PDFParser loses all chapter/heading structure. The local pdfplumber strategy extracts raw text per page with <!-- Page N --> HTML comment markers, but these are excluded from MarkdownParser._find_headings(). As a result:
- Chapter/TOC structure is lost: The MarkdownParser finds zero headings and falls back to paragraph-based splitting, producing 538 flat numbered files (name_1.md through name_538.md) with no semantic organization.
- Too many files in a single directory: All 538 slices land in one flat directory, making it hard to browse, search, or manage. The filesystem structure provides no information about the document's logical organization.
Proposed Solution
1. PDF Bookmark/Outline Extraction (Primary)
Extract the PDF's built-in bookmarks/outlines via pdfplumber's underlying pdfminer (pdf.doc.get_outlines()). Convert bookmark entries to markdown headings (#, ##, etc.) and inject them at the correct page positions before passing to MarkdownParser. This allows the existing heading-based splitting logic to naturally build a hierarchical directory structure.
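The injection step could look roughly like the sketch below. It assumes the outline entries have already been resolved to `(level, title, page_number)` tuples; resolving pdfminer outline destinations to page numbers is not shown. The helper name `inject_bookmark_headings` is hypothetical, and headings are placed at the top of their target page rather than at the exact position within it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (level, title, 1-based page number); assumed to be already resolved from the PDF outline.
Bookmark = Tuple[int, str, int]


def inject_bookmark_headings(
    page_texts: List[str], bookmarks: List[Bookmark], max_level: int = 6
) -> str:
    """Prepend a markdown heading for each bookmark to the text of its target page."""
    by_page: Dict[int, List[str]] = defaultdict(list)
    for level, title, page_no in bookmarks:
        level = min(max(level, 1), max_level)  # clamp to valid markdown heading levels
        by_page[page_no].append(f"{'#' * level} {title.strip()}")

    parts: List[str] = []
    for page_no, text in enumerate(page_texts, start=1):
        parts.append(f"<!-- Page {page_no} -->")  # keep the existing page marker
        parts.extend(by_page.get(page_no, []))    # headings first, then the page body
        parts.append(text)
    return "\n\n".join(parts)
```

Because the output is ordinary markdown, the existing heading-based splitting should pick up the injected headings without further changes.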
2. Font-Size Heading Detection (Fallback)
When a PDF has no bookmarks, analyze character-level font information from page.chars to detect headings:
- Identify body text size (most frequent font size)
- Classify significantly larger text as headings
- Map font size tiers to heading levels (up to 4 levels)
- Group consecutive same-sized large characters into heading text
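The four steps above can be sketched as a single function over character dicts shaped like pdfplumber's `page.chars` entries (only the `size` and `text` keys are used). The function name, the `min_delta` default of 1.5 points, and the rounding to one decimal place are assumptions made for illustration, not settled design.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def detect_headings_by_font(
    chars: List[dict], min_delta: float = 1.5, max_levels: int = 4
) -> List[Tuple[int, str]]:
    """Return (level, text) for runs of characters noticeably larger than body text."""
    if not chars:
        return []
    sizes = [round(c["size"], 1) for c in chars]
    body_size = Counter(sizes).most_common(1)[0][0]  # most frequent size = body text

    # Distinct heading sizes, largest first; the largest maps to level 1.
    heading_sizes = sorted({s for s in sizes if s >= body_size + min_delta}, reverse=True)
    level_of: Dict[float, int] = {
        s: min(i + 1, max_levels) for i, s in enumerate(heading_sizes)
    }

    headings: List[Tuple[int, str]] = []
    run_size: Optional[float] = None
    run_text = ""
    for size, ch in zip(sizes, chars):
        current = size if size in level_of else None
        if current != run_size:
            # A run ends whenever the size changes; flush it if it was a heading run.
            if run_size is not None and run_text.strip():
                headings.append((level_of[run_size], run_text.strip()))
            run_size, run_text = current, ""
        if current is not None:
            run_text += ch["text"]
    if run_size is not None and run_text.strip():
        headings.append((level_of[run_size], run_text.strip()))
    return headings
```

This sketch ignores font weight and line position, so a real implementation would likely need extra filtering for things like drop caps or large page numbers.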
3. Directory Auto-Grouping (Generic)
Add a MAX_CHILDREN_PER_DIR threshold (default 50) to MarkdownParser. When any single directory level would contain more files than the threshold:
- No-heading path: Group into numbered subdirectories (doc_001-050/, doc_051-100/)
- Heading path: Group consecutive same-level sections into subdirectories named by first/last section
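For the no-heading path, the grouping could be sketched as follows. The function name `group_into_numbered_dirs` and the `prefix` parameter are hypothetical; only the threshold default of 50 and the `doc_001-050/` naming pattern come from the proposal above. The sketch groups one level only and does not recurse if the number of subdirectories itself exceeds the threshold.

```python
from typing import Dict, List


def group_into_numbered_dirs(
    names: List[str], prefix: str = "doc", max_children: int = 50
) -> Dict[str, List[str]]:
    """Map subdirectory name -> file names; "" means files stay in the parent directory."""
    if max_children < 1:
        raise ValueError("max_children must be at least 1")
    if len(names) <= max_children:
        return {"": list(names)}  # under the threshold: no grouping needed

    width = max(3, len(str(len(names))))  # zero-pad so directories sort correctly
    groups: Dict[str, List[str]] = {}
    for start in range(0, len(names), max_children):
        end = min(start + max_children, len(names))
        dirname = f"{prefix}_{start + 1:0{width}d}-{end:0{width}d}"
        groups[dirname] = names[start:end]
    return groups
```

Applied to the 538 slices from the problem statement, this would yield 11 subdirectories, from doc_001-050 through doc_501-538.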
Alternatives Considered
- LLM-based structure inference: Too expensive for parsing phase, and the current architecture deliberately avoids LLM calls during parsing.
- Page-range based grouping only: Simple but loses semantic meaning — doesn't leverage the document's actual structure.
- Relying solely on MinerU: Not always available; the local pdfplumber path should work well independently.
Feature Area
Core (Client/Engine)
Use Case
When users ingest large structured PDFs (textbooks, standards documents, legal codes, pharmacopoeias, technical manuals), the parsed output should preserve the document's chapter/section hierarchy as a directory tree. This makes it possible to:
- Browse resources by chapter
- Search within specific sections
- Load context at the right granularity (L0/L1/L2)
- Avoid filesystem performance issues from hundreds of files in one directory
Implementation Plan
Files to modify
| File | Changes |
|---|---|
| openviking/parse/parsers/pdf.py | Add _extract_bookmarks(), _detect_headings_by_font(), modify _convert_local() to inject headings |
| openviking/parse/parsers/markdown.py | Add MAX_CHILDREN_PER_DIR, _auto_group_sections(), modify no-heading branch and _process_sections_with_merge() |
| openviking_cli/utils/config/parser_config.py | Add config fields: heading_detection, font_heading_min_delta, max_children_per_dir |
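As a rough illustration of the three new config fields, the sketch below uses a standalone dataclass. The class name, the field types, the allowed `heading_detection` values, and the `font_heading_min_delta` default are all assumptions; only the field names and the `max_children_per_dir` default of 50 come from this proposal, and the real fields would live in the existing PDFConfig and ParserConfig classes.

```python
from dataclasses import dataclass

# Hypothetical set of modes; the proposal does not specify the allowed values.
_ALLOWED_MODES = {"auto", "bookmarks", "font", "none"}


@dataclass
class HeadingConfigSketch:
    heading_detection: str = "auto"      # assumed: try bookmarks first, then font sizes
    font_heading_min_delta: float = 1.5  # assumed: points above body size to count as a heading
    max_children_per_dir: int = 50       # default taken from the proposal

    def __post_init__(self) -> None:
        if self.heading_detection not in _ALLOWED_MODES:
            raise ValueError(f"unknown heading_detection: {self.heading_detection!r}")
        if self.font_heading_min_delta <= 0:
            raise ValueError("font_heading_min_delta must be positive")
        if self.max_children_per_dir < 1:
            raise ValueError("max_children_per_dir must be at least 1")
```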
Phases
- Bookmark extraction — Extract PDF outlines, inject as markdown headings
- Font-size detection — Fallback heading detection via character font analysis
- Directory auto-grouping — Generic threshold-based subdirectory creation
- Configuration — Wire new config fields into PDFConfig and ParserConfig
Additional Context
Example of current broken output for a 463-page PDF:
data/viking/my-team/resources/tmppi5lkjtk/
├── tmppi5lkjtk_1.md # Page 2-3 raw text
├── tmppi5lkjtk_2.md # Page 4 raw text
├── ...
└── tmppi5lkjtk_538.md # Last chunk
Expected output after fix:
data/viking/my-team/resources/中华人民共和国药典/
├── 第一部_药材/
│ ├── 川木通.md
│ ├── 川贝母.md
│ └── ...
├── 第二部_化学药/
│ └── ...
└── ...