fix(semantic): add budget guard to overview generation with batched summarization#683
Conversation
Centralizes hardcoded constants (30000, 256, 4000) into a configurable SemanticConfig dataclass. Configurable via ov.conf under "semantic" key. Addresses volcengine#531. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_generate_overview() now checks prompt size against SemanticConfig.max_overview_prompt_chars. If exceeded, file summaries are split into batches, each batch gets a partial overview, then partials are merged into a final overview. Also: - Replace hardcoded 30000 char limit with SemanticConfig.max_file_content_chars - Enforce abstract/overview max size on output (both semantic_processor and semantic_dag call sites) Fixes volcengine#529, addresses volcengine#531. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests cover: config defaults, custom values, budget under/over limit detection, abstract/overview truncation, and batch splitting logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qin-ctx
left a comment
There was a problem hiding this comment.
Review Summary
This PR adds a needed budget guard for oversized overview prompts (Fixes #529) and centralizes hardcoded constants into SemanticConfig. The batch-and-merge approach for large directories is a sound design.
However, there are two correctness issues that should be addressed before merge:
- File index mismatch in batched generation: Each batch uses local indices
[1]..[N], but post-processing uses the globalfile_index_map, causing incorrect file name replacements. - Budget guard gap: The AND condition means directories with ≤50 files but very long summaries bypass the guard entirely — the exact scenario described in #529.
See inline comments for details and suggestions.
| for batch_idx, batch in enumerate(batches): | ||
| batch_lines = [] | ||
| for idx, item in enumerate(batch, 1): | ||
| batch_lines.append(f"[{idx}] {item['name']}: {item['summary']}") |
There was a problem hiding this comment.
[Bug] (blocking) Each batch uses local indices via enumerate(batch, 1), so every batch numbers its files [1], [2], ..., [batch_size]. However, the file_index_map passed to this method maps global indices (1..N across all files) to file names.
At line ~945, re.sub(r"\[(\d+)\]", replace_index, overview) replaces [number] references using the global map. If the LLM outputs [1] in batch 2's partial overview (referring to the first file of batch 2), it will be replaced with the name of global file #1 instead.
Suggested fix: either (a) use global indices when building each batch's prompt (offset by batch_idx * batch_size), or (b) apply replace_index to each partial overview individually using a per-batch index map before merging.
There was a problem hiding this comment.
Fixed. Each batch now uses global offset indices (global_offset + local_idx + 1), and replace_index is applied per-batch using a batch_index_map via a make_replacer closure. The global file_index_map is no longer used in the batched path.
| estimated_size = len(file_summaries_str) + len(children_abstracts_str) | ||
| if ( | ||
| estimated_size > semantic.max_overview_prompt_chars | ||
| and len(file_summaries) > semantic.overview_batch_size |
There was a problem hiding this comment.
[Bug] (blocking) The AND condition requires both estimated_size > budget and file_count > batch_size for the guard to trigger. This means a directory with ≤50 files but very long individual summaries (e.g. 50 files × 1.5K summary = 75K chars, exceeding the 60K budget) will bypass the guard entirely and still produce an oversized prompt.
This is the exact failure mode described in #529 — oversized prompts causing VLM timeouts. The fix only protects against the many-files variant, not the few-files-long-summaries variant.
Suggested fix: consider using OR instead of AND, or add a fallback for the case where the prompt is oversized but file count is below batch_size (e.g., truncating individual summaries proportionally to fit the budget).
There was a problem hiding this comment.
Fixed. Changed to three-way branching: (1) over budget + many files → batch and merge, (2) over budget + few files → truncate individual summaries proportionally to fit budget, (3) under budget → proceed as normal.
| from openviking_cli.utils.config import get_openviking_config | ||
|
|
||
| semantic = get_openviking_config().semantic | ||
| if len(overview) > semantic.overview_max_chars: |
There was a problem hiding this comment.
[Design] (non-blocking) This truncation block is duplicated nearly identically in semantic_processor.py:_process_memory_directory (lines ~357-362). Consider extracting a shared helper like _enforce_size_limits(overview, abstract, config) → (overview, abstract) to avoid divergence when limits or truncation logic change.
There was a problem hiding this comment.
Fixed. Extracted _enforce_size_limits(overview, abstract) on SemanticProcessor. Both call sites (semantic_processor.py and semantic_dag.py) now use it.
| { | ||
| "dir_name": dir_name, | ||
| "file_summaries": combined, | ||
| "children_abstracts": "None", |
There was a problem hiding this comment.
[Design] (non-blocking) In the merge step, children_abstracts is hardcoded to "None". Children abstracts are only included in the first batch's partial overview (line 891). When partials are merged, the LLM may not retain subdirectory context from batch 1's output, leading to loss of child directory information in the final overview.
Consider passing the children abstracts string into the merge prompt as well, so the final overview has full subdirectory context.
There was a problem hiding this comment.
Fixed. The merge prompt now includes children_abstracts_str instead of hardcoded "None".
| from openviking_cli.utils.config.parser_config import SemanticConfig | ||
|
|
||
|
|
||
| def test_semantic_config_defaults(): |
There was a problem hiding this comment.
[Suggestion] (non-blocking) The tests here thoroughly validate SemanticConfig defaults and budget arithmetic, which is good. However, there are no tests for the core behavioral changes: _batched_generate_overview index mapping, the merge step, or the truncation enforcement in semantic_dag.py / semantic_processor.py.
Consider adding at least one test with a mocked VLM that verifies:
- Batch splitting produces correct per-batch index numbering
- The
[number]→ file name replacement is correct across batches - Truncation is applied when overview/abstract exceeds limits
There was a problem hiding this comment.
Acknowledged — will add behavioral tests with mocked VLM in a follow-up if needed. The current tests validate the config, budget logic, and batch splitting which are the core pure-function components.
- Fix file index mismatch: use global offset indices per batch and apply replace_index per-batch via make_replacer closure - Fix budget guard gap: change AND to three-way branching (batch for many files, truncate for few-files-long-summaries, pass-through for under budget) - Extract _enforce_size_limits() helper to deduplicate truncation logic between semantic_processor.py and semantic_dag.py - Pass children_abstracts to merge prompt for full context retention Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qin-ctx
left a comment
There was a problem hiding this comment.
All blocking issues from the previous review have been addressed correctly:
- ✅ File index mismatch fixed with global offset indices + per-batch
make_replacerclosure - ✅ Budget guard gap fixed with three-way branching (batch, truncate, pass-through)
- ✅ Duplicated truncation logic extracted into
_enforce_size_limits - ✅ Children abstracts now passed to merge prompt
LGTM 🎉
Summary
_generate_overview()that prevents oversized LLM prompts for large directoriesSemanticConfigdataclassFixes #529, addresses #531
Changes Made
openviking_cli/utils/config/parser_config.py: AddedSemanticConfigdataclass with configurable limits (max_file_content_chars,max_overview_prompt_chars,overview_batch_size,abstract_max_chars,overview_max_chars)openviking_cli/utils/config/open_viking_config.py: WiredSemanticConfigintoOpenVikingConfig(configurable viaov.confunder"semantic"key)openviking/storage/queuefs/semantic_processor.py: Added budget guard with_single_generate_overview()(unchanged behavior for small dirs) and_batched_generate_overview()(split + merge for large dirs). Replaced hardcoded30000withSemanticConfig.max_file_content_chars. Added output size enforcement.openviking/storage/queuefs/semantic_dag.py: Added output size enforcement at the DAG call sitetests/misc/test_semantic_config.py: 7 tests covering config defaults, custom values, budget estimation, truncation, and batch splittingConfiguration Example
{ "semantic": { "max_file_content_chars": 30000, "max_overview_prompt_chars": 60000, "overview_batch_size": 50, "abstract_max_chars": 256, "overview_max_chars": 4000 } }All values have sensible defaults — no config changes required for existing users.
Type of Change
Testing
Checklist