
fix(parser): MarkdownParser char limit #826

Merged

qin-ctx merged 1 commit into volcengine:main from chenxiaofei-cxf:feat/openai-rerank-provider
Mar 21, 2026

Conversation

Contributor

@chenxiaofei-cxf chenxiaofei-cxf commented Mar 20, 2026

Summary

The token estimator (_estimate_token_count) in MarkdownParser uses a simplified heuristic (0.3 tokens/non-CJK-non-space char) that can underestimate actual token counts for dense code or technical content. This caused oversized L2 sections to pass the token check and be written as single files, which then exceeded embedding API limits (e.g. text-embedding-v4's 8192-token cap).
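For context, a minimal re-creation of the kind of heuristic described here — the exact OpenViking implementation may differ; the CJK regex range and truncation are assumptions:

```python
import re

# Hypothetical sketch of the simplified heuristic described above:
# CJK characters count as ~1 token each, other non-space characters
# as ~0.3 tokens. Dense ASCII code has many short real tokens, so
# this estimate can fall well below the true tokenizer count.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def estimate_token_count(text: str) -> int:
    cjk = len(CJK_RE.findall(text))
    other = sum(
        1 for ch in text if not ch.isspace() and not CJK_RE.match(ch)
    )
    return int(cjk + 0.3 * other)
```

Under this sketch, a long stretch of dense code is credited with roughly a third of a token per character, which is exactly the underestimate the hard character limit guards against.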

Changes:

  • Add max_section_chars: int = 6000 to ParserConfig as a hard character limit per section, guarding against token estimation errors
  • _smart_split_content: enforces the char limit per chunk alongside the token estimate; falls back to token-only splitting when max_section_chars <= 0, avoiding a range(n, 0) crash
  • _save_section: requires both token AND char limits satisfied before writing a section as a single file
  • _save_merged: splits joined content when the merged result exceeds max_section_chars, preventing token-small but char-large sections from accumulating into oversized files
  • _parse_and_create_structure: small-document fast-path now also checks char count before writing as a single file
  • ParserConfig.validate() rejects max_section_chars <= 0
  • ov.conf.example updated with max_section_chars in the pdf parser block
  • Class docstring updated: add max_section_chars attribute, fix max_section_size description ("characters" → "tokens")
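The dual gate applied in _save_section (and the small-document fast path) can be sketched as follows — a hypothetical helper, not the actual OpenViking code:

```python
from typing import Callable

def fits_single_file(
    text: str,
    max_tokens: int,
    max_chars: int,
    estimate_tokens: Callable[[str], int],
) -> bool:
    """A section is written as one file only when it passes BOTH the
    token estimate and the hard character limit; either limit alone
    can force a split."""
    return estimate_tokens(text) <= max_tokens and len(text) <= max_chars
```

The character check is the safety net: even if the token estimator badly underestimates, a section longer than max_section_chars can no longer slip through as a single file.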

Test plan

  • pytest tests/parse/test_markdown_char_limit.py -v — 21 tests pass
  • ov add-resource <large-markdown-file> completes without embedding token limit errors
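The char-limit splitting with its max_section_chars <= 0 fallback might look like this sketch; the greedy paragraph-boundary strategy is an assumption, and the real _smart_split_content also weighs the token estimate:

```python
def split_by_char_limit(text: str, max_chars: int) -> list[str]:
    # Fallback described in the PR: when max_chars <= 0, character
    # splitting is skipped entirely (token-only splitting would apply
    # in the real parser), avoiding a range(n, 0) style crash.
    if max_chars <= 0:
        return [text]
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-cut any single paragraph that alone exceeds the limit.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Every returned chunk is at most max_chars long, so no downstream file can exceed the hard limit regardless of what the token estimator reported.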

🤖 Generated with Claude Code

@github-actions

Failed to generate code suggestions for PR

The token estimator (_estimate_token_count) uses a simplified heuristic
(0.3 tokens/non-CJK-non-space char) that can underestimate actual token
counts for dense code or technical content. This caused large sections to
pass the token check and be written as single L2 files, which then
exceeded embedding API limits (e.g., text-embedding-v4 8192 token cap).

Changes:
- Add `max_section_chars: int = 6000` to ParserConfig as a hard character
  limit per section, guarding against token estimation errors
- `_smart_split_content`: enforces char limit per split chunk alongside
  the existing token estimate limit; falls back to token-only splitting
  when max_section_chars <= 0 to avoid range(n, 0) crash
- `_save_section`: requires both token AND char limits satisfied before
  writing a section as a single file
- `_save_merged`: splits joined content via _smart_split_content when the
  merged result exceeds max_section_chars, preventing token-small but
  char-large sections from accumulating into oversized files
- `_parse_and_create_structure`: small-document fast-path now also checks
  char count before writing as a single file
- `ParserConfig.validate()`: rejects max_section_chars <= 0
- `ParserConfig` class docstring: add max_section_chars attribute,
  fix max_section_size description ("characters" -> "tokens")
- `ov.conf.example`: add max_section_chars to pdf parser block

Add 21 unit tests covering ParserConfig defaults and validation,
_smart_split_content char-limit enforcement, _save_section and
_save_merged char-limit gates, and _parse_and_create_structure
fast-path with and without headings (VikingFS mocked throughout).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
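The ParserConfig.validate() rule from the commit message could look like this minimal sketch (the field default matches the PR; everything else is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ParserConfig:
    # Hard per-section character limit added by this PR.
    max_section_chars: int = 6000

    def validate(self) -> None:
        # Reject non-positive limits up front, as described above.
        if self.max_section_chars <= 0:
            raise ValueError("max_section_chars must be positive")
```

Validating at config-load time means a bad value fails fast instead of surfacing later as a crash inside the splitting logic.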
@chenxiaofei-cxf chenxiaofei-cxf force-pushed the feat/openai-rerank-provider branch from ddd94fa to 3d9360d on March 20, 2026 at 13:49
@chenxiaofei-cxf chenxiaofei-cxf changed the title from "feat(rerank) + fix(parser): OpenAI-compatible rerank provider & MarkdownParser char limit" to "fix(parser): MarkdownParser char limit" on Mar 20, 2026
@qin-ctx qin-ctx self-assigned this Mar 21, 2026
@qin-ctx qin-ctx merged commit 3064725 into volcengine:main Mar 21, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 21, 2026
zeattacker pushed a commit to zeattacker/OpenViking that referenced this pull request Mar 21, 2026
…er (volcengine#826)

