Skip to content

fix(parser): resolve PDF bookmark destinations for integer page indices#794

Merged
qin-ctx merged 2 commits intovolcengine:mainfrom
mvanhorn:osc/522-fix-pdf-chapter-structure
Mar 20, 2026
Merged

fix(parser): resolve PDF bookmark destinations for integer page indices#794
qin-ctx merged 2 commits intovolcengine:mainfrom
mvanhorn:osc/522-fix-pdf-chapter-structure

Conversation

@mvanhorn
Copy link
Copy Markdown
Contributor

Summary

Fixes #522 (related to #386). PDF files with embedded chapter/section outlines were imported as flat content instead of preserving the heading hierarchy.

Three root causes addressed:

  • Integer page indices not handled. _extract_bookmarks only handled objid and .resolve() destination formats. Many PDF producers encode outline destinations as bare 0-based integer page indices ([0, "/Fit"]). These fell through to page_num = None and were dropped.
  • Unresolvable bookmarks silently discarded. Bookmarks where destination resolution fails now fall back to page 1 instead of being excluded from the output entirely.
  • Font heading threshold too aggressive. _detect_headings_by_font filtered out heading font sizes appearing more than 30% as often as body text. For documents with short chapters (where heading sizes naturally repeat on many pages), this excluded valid headings. Threshold raised to 50%.

Changes

File Change
openviking/parse/parsers/pdf.py Add isinstance(page_ref, int) branch in _extract_bookmarks; fall back to page 1 for unresolvable bookmarks in _convert_local; relax font heading frequency threshold from 0.3 to 0.5
tests/parse/test_pdf_bookmark_extraction.py Add test_extract_bookmarks_integer_page_index covering the integer destination format

Test plan

  • New test test_extract_bookmarks_integer_page_index passes
  • All existing TestExtractBookmarks tests pass unchanged
  • No ruff/mypy lint errors introduced

This contribution was developed with AI assistance (Claude Code + Codex).

Handle three issues causing PDF chapter structure to be lost:

1. _extract_bookmarks now handles integer page indices (0-based) that
   many PDF producers use instead of object references. Previously
   these fell through to page_num=None and were silently dropped.

2. Bookmarks with unresolvable page numbers are now injected at page 1
   instead of being discarded entirely. This preserves the heading
   hierarchy even when destination resolution fails.

3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of
   body text frequency. The previous threshold was too aggressive for
   documents with short chapters where heading font sizes appear on
   many pages.

Closes volcengine#522

This contribution was developed with AI assistance (Claude Code + Codex).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bookmarks_by_page[bm["page_num"]].append(bm)
# Fall back to page 1 for unresolvable destinations
page = bm["page_num"] or 1
bookmarks_by_page[page].append(bm)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a possible index overflow issue here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Integer page_ref values from some PDF producers could reference pages beyond the actual document length. The bookmarks_by_page defaultdict wouldn't crash (it's accessed by key, not index), but those bookmarks would be silently assigned to non-existent page numbers and dropped during iteration.

Fixed in 506baa7: added a bounds check (1 <= candidate <= len(pdf.pages)) so out-of-range integer indices are treated as unresolved instead. Added test coverage for the boundary case.

Add bounds check for 0-based integer page references to prevent
out-of-range bookmarks from being assigned to non-existent pages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@qin-ctx qin-ctx assigned qin-ctx and unassigned qin-ctx Mar 20, 2026
@qin-ctx qin-ctx merged commit a1c4a05 into volcengine:main Mar 20, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 20, 2026
zeattacker pushed a commit to zeattacker/OpenViking that referenced this pull request Mar 20, 2026
…es (volcengine#794)

* fix(parser): resolve PDF bookmark destinations for integer page indices

Handle three issues causing PDF chapter structure to be lost:

1. _extract_bookmarks now handles integer page indices (0-based) that
   many PDF producers use instead of object references. Previously
   these fell through to page_num=None and were silently dropped.

2. Bookmarks with unresolvable page numbers are now injected at page 1
   instead of being discarded entirely. This preserves the heading
   hierarchy even when destination resolution fails.

3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of
   body text frequency. The previous threshold was too aggressive for
   documents with short chapters where heading font sizes appear on
   many pages.

Closes volcengine#522

This contribution was developed with AI assistance (Claude Code + Codex).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(parser): clamp integer page indices to valid range

Add bounds check for 0-based integer page references to prevent
out-of-range bookmarks from being assigned to non-existent pages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mvanhorn
Copy link
Copy Markdown
Contributor Author

Thanks for the reviews and merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: 导入部分文件时,pdf 解析存在问题,没有识别出章节导致全部都平铺

3 participants