fix(parser): resolve PDF bookmark destinations for integer page indices#794
Merged
qin-ctx merged 2 commits intovolcengine:mainfrom Mar 20, 2026
Merged
Conversation
Handle three issues causing PDF chapter structure to be lost: 1. _extract_bookmarks now handles integer page indices (0-based) that many PDF producers use instead of object references. Previously these fell through to page_num=None and were silently dropped. 2. Bookmarks with unresolvable page numbers are now injected at page 1 instead of being discarded entirely. This preserves the heading hierarchy even when destination resolution fails. 3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of body text frequency. The previous threshold was too aggressive for documents with short chapters where heading font sizes appear on many pages. Closes volcengine#522 This contribution was developed with AI assistance (Claude Code + Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ZaynJarvis
reviewed
Mar 20, 2026
| bookmarks_by_page[bm["page_num"]].append(bm) | ||
| # Fall back to page 1 for unresolvable destinations | ||
| page = bm["page_num"] or 1 | ||
| bookmarks_by_page[page].append(bm) |
Collaborator
There was a problem hiding this comment.
is there a possible index overflow issue here?
Contributor
Author
There was a problem hiding this comment.
Good catch. Integer page_ref values from some PDF producers could reference pages beyond the actual document length. The bookmarks_by_page defaultdict wouldn't crash (it's accessed by key, not index), but those bookmarks would be silently assigned to non-existent page numbers and dropped during iteration.
Fixed in 506baa7: added a bounds check (1 <= candidate <= len(pdf.pages)) so out-of-range integer indices are treated as unresolved instead. Added test coverage for the boundary case.
Add bounds check for 0-based integer page references to prevent out-of-range bookmarks from being assigned to non-existent pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
qin-ctx
approved these changes
Mar 20, 2026
zeattacker
pushed a commit
to zeattacker/OpenViking
that referenced
this pull request
Mar 20, 2026
…es (volcengine#794) * fix(parser): resolve PDF bookmark destinations for integer page indices Handle three issues causing PDF chapter structure to be lost: 1. _extract_bookmarks now handles integer page indices (0-based) that many PDF producers use instead of object references. Previously these fell through to page_num=None and were silently dropped. 2. Bookmarks with unresolvable page numbers are now injected at page 1 instead of being discarded entirely. This preserves the heading hierarchy even when destination resolution fails. 3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of body text frequency. The previous threshold was too aggressive for documents with short chapters where heading font sizes appear on many pages. Closes volcengine#522 This contribution was developed with AI assistance (Claude Code + Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(parser): clamp integer page indices to valid range Add bounds check for 0-based integer page references to prevent out-of-range bookmarks from being assigned to non-existent pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Thanks for the reviews and merge! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #522 (related to #386). PDF files with embedded chapter/section outlines were imported as flat content instead of preserving the heading hierarchy.
Three root causes addressed:
_extract_bookmarksonly handledobjidand.resolve()destination formats. Many PDF producers encode outline destinations as bare 0-based integer page indices ([0, "/Fit"]). These fell through topage_num = Noneand were dropped._detect_headings_by_fontfiltered out heading font sizes appearing more than 30% as often as body text. For documents with short chapters (where heading sizes naturally repeat on many pages), this excluded valid headings. Threshold raised to 50%.Changes
openviking/parse/parsers/pdf.pyisinstance(page_ref, int)branch in_extract_bookmarks; fall back to page 1 for unresolvable bookmarks in_convert_local; relax font heading frequency threshold from 0.3 to 0.5tests/parse/test_pdf_bookmark_extraction.pytest_extract_bookmarks_integer_page_indexcovering the integer destination formatTest plan
test_extract_bookmarks_integer_page_indexpassesTestExtractBookmarkstests pass unchangedThis contribution was developed with AI assistance (Claude Code + Codex).