fix(parser): resolve PDF bookmark destinations for integer page indices by mvanhorn · Pull Request #794 · volcengine/OpenViking

mvanhorn · 2026-03-19T23:40:51Z

Summary

Fixes #522 (related to #386). PDF files with embedded chapter/section outlines were imported as flat content instead of preserving the heading hierarchy.

Three root causes addressed:

Integer page indices not handled. _extract_bookmarks only handled objid and .resolve() destination formats. Many PDF producers encode outline destinations as bare 0-based integer page indices ([0, "/Fit"]). These fell through to page_num = None and were dropped.
Unresolvable bookmarks silently discarded. Bookmarks where destination resolution fails now fall back to page 1 instead of being excluded from the output entirely.
Font heading threshold too aggressive. _detect_headings_by_font filtered out heading font sizes appearing more than 30% as often as body text. For documents with short chapters (where heading sizes naturally repeat on many pages), this excluded valid headings. Threshold raised to 50%.

Changes

File	Change
`openviking/parse/parsers/pdf.py`	Add `isinstance(page_ref, int)` branch in `_extract_bookmarks`; fall back to page 1 for unresolvable bookmarks in `_convert_local`; relax font heading frequency threshold from 0.3 to 0.5
`tests/parse/test_pdf_bookmark_extraction.py`	Add `test_extract_bookmarks_integer_page_index` covering the integer destination format

Test plan

New test test_extract_bookmarks_integer_page_index passes
All existing TestExtractBookmarks tests pass unchanged
No ruff/mypy lint errors introduced

This contribution was developed with AI assistance (Claude Code + Codex).

Handle three issues causing PDF chapter structure to be lost: 1. _extract_bookmarks now handles integer page indices (0-based) that many PDF producers use instead of object references. Previously these fell through to page_num=None and were silently dropped. 2. Bookmarks with unresolvable page numbers are now injected at page 1 instead of being discarded entirely. This preserves the heading hierarchy even when destination resolution fails. 3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of body text frequency. The previous threshold was too aggressive for documents with short chapters where heading font sizes appear on many pages. Closes volcengine#522 This contribution was developed with AI assistance (Claude Code + Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ZaynJarvis · 2026-03-20T03:09:53Z

openviking/parse/parsers/pdf.py

-                        bookmarks_by_page[bm["page_num"]].append(bm)
+                    # Fall back to page 1 for unresolvable destinations
+                    page = bm["page_num"] or 1
+                    bookmarks_by_page[page].append(bm)


is there a possible index overflow issue here?

Good catch. Integer page_ref values from some PDF producers could reference pages beyond the actual document length. The bookmarks_by_page defaultdict wouldn't crash (it's accessed by key, not index), but those bookmarks would be silently assigned to non-existent page numbers and dropped during iteration.

Fixed in 506baa7: added a bounds check (1 <= candidate <= len(pdf.pages)) so out-of-range integer indices are treated as unresolved instead. Added test coverage for the boundary case.

Add bounds check for 0-based integer page references to prevent out-of-range bookmarks from being assigned to non-existent pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…es (volcengine#794) * fix(parser): resolve PDF bookmark destinations for integer page indices Handle three issues causing PDF chapter structure to be lost: 1. _extract_bookmarks now handles integer page indices (0-based) that many PDF producers use instead of object references. Previously these fell through to page_num=None and were silently dropped. 2. Bookmarks with unresolvable page numbers are now injected at page 1 instead of being discarded entirely. This preserves the heading hierarchy even when destination resolution fails. 3. Relaxed font-size heading detection threshold from 0.3 to 0.5 of body text frequency. The previous threshold was too aggressive for documents with short chapters where heading font sizes appear on many pages. Closes volcengine#522 This contribution was developed with AI assistance (Claude Code + Codex). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix(parser): clamp integer page indices to valid range Add bounds check for 0-based integer page references to prevent out-of-range bookmarks from being assigned to non-existent pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mvanhorn · 2026-03-23T02:57:21Z

Thanks for the reviews and merge!

github-project-automation bot added this to OpenViking project Mar 19, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 19, 2026

mvanhorn mentioned this pull request Mar 19, 2026

[Bug]: 导入部分文件时，pdf 解析存在问题，没有识别出章节导致全部都平铺 #522

Closed

ZaynJarvis reviewed Mar 20, 2026

View reviewed changes

fix(parser): clamp integer page indices to valid range

506baa7

Add bounds check for 0-based integer page references to prevent out-of-range bookmarks from being assigned to non-existent pages. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

qin-ctx assigned qin-ctx and unassigned qin-ctx Mar 20, 2026

qin-ctx approved these changes Mar 20, 2026

View reviewed changes

qin-ctx merged commit a1c4a05 into volcengine:main Mar 20, 2026
6 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(parser): resolve PDF bookmark destinations for integer page indices#794

fix(parser): resolve PDF bookmark destinations for integer page indices#794
qin-ctx merged 2 commits intovolcengine:mainfrom
mvanhorn:osc/522-fix-pdf-chapter-structure

mvanhorn commented Mar 19, 2026

Uh oh!

ZaynJarvis Mar 20, 2026

Uh oh!

mvanhorn Mar 20, 2026

Uh oh!

Uh oh!

mvanhorn commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mvanhorn commented Mar 19, 2026

Summary

Changes

Test plan

Uh oh!

ZaynJarvis Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

mvanhorn Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mvanhorn commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants