
fix(parser): MarkdownParser char limit #826

Merged

qin-ctx merged 1 commit into volcengine:main from chenxiaofei-cxf:feat/openai-rerank-provider
Mar 21, 2026

Conversation

Contributor

@chenxiaofei-cxf chenxiaofei-cxf commented Mar 20, 2026

Summary

The token estimator (_estimate_token_count) in MarkdownParser uses a simplified heuristic (0.3 tokens/non-CJK-non-space char) that can underestimate actual token counts for dense code or technical content. This caused oversized L2 sections to pass the token check and be written as single files, which then exceeded embedding API limits (e.g. text-embedding-v4's 8192-token cap).
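For context, a minimal re-creation of the kind of heuristic described here — the exact OpenViking implementation may differ; the CJK regex range and truncation are assumptions:

```python
import re

# Hypothetical sketch of the simplified heuristic described above:
# CJK characters count as ~1 token each, other non-space characters
# as ~0.3 tokens. Dense ASCII code has many short real tokens, so
# this estimate can fall well below the true tokenizer count.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")

def estimate_token_count(text: str) -> int:
    cjk = len(CJK_RE.findall(text))
    other = sum(
        1 for ch in text if not ch.isspace() and not CJK_RE.match(ch)
    )
    return int(cjk + 0.3 * other)
```

Under this sketch, a long stretch of dense code is credited with roughly a third of a token per character, which is exactly the underestimate the hard character limit guards against.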

Changes:

  • Add max_section_chars: int = 6000 to ParserConfig as a hard character limit per section, guarding against token estimation errors
  • _smart_split_content: enforces the char limit per chunk alongside the token estimate; falls back to token-only splitting when max_section_chars <= 0, avoiding a range(n, 0) crash
  • _save_section: requires both token AND char limits satisfied before writing a section as a single file
  • _save_merged: splits joined content when the merged result exceeds max_section_chars, preventing token-small but char-large sections from accumulating into oversized files
  • _parse_and_create_structure: small-document fast-path now also checks char count before writing as a single file
  • ParserConfig.validate() rejects max_section_chars <= 0
  • ov.conf.example updated with max_section_chars in the pdf parser block
  • Class docstring updated: add max_section_chars attribute, fix max_section_size description ("characters" → "tokens")
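The dual gate applied in _save_section (and the small-document fast path) can be sketched as follows — a hypothetical helper, not the actual OpenViking code:

```python
from typing import Callable

def fits_single_file(
    text: str,
    max_tokens: int,
    max_chars: int,
    estimate_tokens: Callable[[str], int],
) -> bool:
    """A section is written as one file only when it passes BOTH the
    token estimate and the hard character limit; either limit alone
    can force a split."""
    return estimate_tokens(text) <= max_tokens and len(text) <= max_chars
```

The character check is the safety net: even if the token estimator badly underestimates, a section longer than max_section_chars can no longer slip through as a single file.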

Test plan

  • pytest tests/parse/test_markdown_char_limit.py -v — 21 tests pass
  • ov add-resource <large-markdown-file> completes without embedding token limit errors
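The char-limit splitting with its max_section_chars <= 0 fallback might look like this sketch; the greedy paragraph-boundary strategy is an assumption, and the real _smart_split_content also weighs the token estimate:

```python
def split_by_char_limit(text: str, max_chars: int) -> list[str]:
    # Fallback described in the PR: when max_chars <= 0, character
    # splitting is skipped entirely (token-only splitting would apply
    # in the real parser), avoiding a range(n, 0) style crash.
    if max_chars <= 0:
        return [text]
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Hard-cut any single paragraph that alone exceeds the limit.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Every returned chunk is at most max_chars long, so no downstream file can exceed the hard limit regardless of what the token estimator reported.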

🤖 Generated with Claude Code

@github-actions

Failed to generate code suggestions for PR

The token estimator (_estimate_token_count) uses a simplified heuristic
(0.3 tokens/non-CJK-non-space char) that can underestimate actual token
counts for dense code or technical content. This caused large sections to
pass the token check and be written as single L2 files, which then
exceeded embedding API limits (e.g., text-embedding-v4 8192 token cap).

Changes:
- Add `max_section_chars: int = 6000` to ParserConfig as a hard character
  limit per section, guarding against token estimation errors
- `_smart_split_content`: enforces char limit per split chunk alongside
  the existing token estimate limit; falls back to token-only splitting
  when max_section_chars <= 0 to avoid range(n, 0) crash
- `_save_section`: requires both token AND char limits satisfied before
  writing a section as a single file
- `_save_merged`: splits joined content via _smart_split_content when the
  merged result exceeds max_section_chars, preventing token-small but
  char-large sections from accumulating into oversized files
- `_parse_and_create_structure`: small-document fast-path now also checks
  char count before writing as a single file
- `ParserConfig.validate()`: rejects max_section_chars <= 0
- `ParserConfig` class docstring: add max_section_chars attribute,
  fix max_section_size description ("characters" -> "tokens")
- `ov.conf.example`: add max_section_chars to pdf parser block

Add 21 unit tests covering ParserConfig defaults and validation,
_smart_split_content char-limit enforcement, _save_section and
_save_merged char-limit gates, and _parse_and_create_structure
fast-path with and without headings (VikingFS mocked throughout).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
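The ParserConfig.validate() rule from the commit message could look like this minimal sketch (the field default matches the PR; everything else is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ParserConfig:
    # Hard per-section character limit added by this PR.
    max_section_chars: int = 6000

    def validate(self) -> None:
        # Reject non-positive limits up front, as described above.
        if self.max_section_chars <= 0:
            raise ValueError("max_section_chars must be positive")
```

Validating at config-load time means a bad value fails fast instead of surfacing later as a crash inside the splitting logic.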
@chenxiaofei-cxf chenxiaofei-cxf force-pushed the feat/openai-rerank-provider branch from ddd94fa to 3d9360d on March 20, 2026 at 13:49
@chenxiaofei-cxf chenxiaofei-cxf changed the title from "feat(rerank) + fix(parser): OpenAI-compatible rerank provider & MarkdownParser char limit" to "fix(parser): MarkdownParser char limit" on Mar 20, 2026
@qin-ctx qin-ctx self-assigned this Mar 21, 2026
@qin-ctx qin-ctx merged commit 3064725 into volcengine:main Mar 21, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 21, 2026
zeattacker pushed a commit to zeattacker/OpenViking that referenced this pull request Mar 21, 2026
…er (volcengine#826)

