fix(gemini): scope-reduce to TextEmbedder, resolve PR #607 review comments#3
Conversation
Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
…id, rust cli version... (volcengine#577)

* fix: windows zip path norm
* fix: account id in vector db
* fix: add some log
* fix: add some log, and fixed search

---------

Co-authored-by: openviking <openviking@example.com>
…ns (volcengine#609)

agent_space_name() computed md5(user_id + agent_id) without a separator, so different (user_id, agent_id) pairs could produce the same hash when their concatenations matched (e.g. ("alice", "bot") vs ("aliceb", "ot")). Add a ":" separator between user_id and agent_id in the hash input. The ":" character is safe because the validation regex [a-zA-Z0-9_-] prevents either field from containing it.

Fix applied to all three implementations:
- openviking_cli/session/user_id.py (Python SDK)
- bot/vikingbot/openviking_mount/ov_server.py (bot server)
- examples/openclaw-memory-plugin/client.ts (TypeScript example)

Note: This is a breaking change for existing agent spaces. Existing data directories were named using the old hash and will not be found with the new hash. A migration script or fallback lookup may be needed.

Fixes volcengine#595

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
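The collision mode is easy to reproduce. A minimal sketch of the separator fix follows; the function names are illustrative stand-ins, not the actual agent_space_name() implementation:

```python
import hashlib

def space_name_unseparated(user_id: str, agent_id: str) -> str:
    # Old behavior: concatenation without a separator is ambiguous.
    return hashlib.md5(f"{user_id}{agent_id}".encode()).hexdigest()[:12]

def space_name_separated(user_id: str, agent_id: str) -> str:
    # ":" cannot appear in either field (validated by [a-zA-Z0-9_-]),
    # so the encoding of the pair is unambiguous.
    return hashlib.md5(f"{user_id}:{agent_id}".encode()).hexdigest()[:12]

# ("alice", "bot") and ("aliceb", "ot") concatenate to the same string,
# so the unseparated hashes collide while the separated ones do not.
assert space_name_unseparated("alice", "bot") == space_name_unseparated("aliceb", "ot")
assert space_name_separated("alice", "bot") != space_name_separated("aliceb", "ot")
```

This also illustrates why the change is breaking: every separated digest differs from its unseparated counterpart, so old directory names no longer match.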
…ction (volcengine#610)

commit_async() called extract_long_term_memories(messages=...) without passing user, session_id, or ctx. Because the compressor returns early when ctx is None, async commits always produced memories_extracted=0. The sync commit() path already passes all three parameters correctly; this aligns the async path to match.

Regression from the COW pattern revert (volcengine#584), which dropped these arguments from the async call.

Fixes volcengine#602

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…cengine#615)

Move the 429 re-enqueue logic directly into the exception handler instead of deferring via a flag, making the flow clearer and avoiding unnecessary DB writes on rate-limit errors. Add has_queue_manager/enqueue_embedding_msg to the base backend class. Remove the obsolete account_id param from a test and prune outdated URI deduplicator test cases.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
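The control-flow change can be sketched as follows; all names here (RateLimitError, Handler, enqueue_embedding_msg, the process callback) are illustrative stand-ins for the actual backend classes, not the real API:

```python
class RateLimitError(Exception):
    """Stand-in for a 429 response from the embedding provider."""

class Handler:
    def __init__(self):
        self.queue = []
        self.db_writes = 0

    def enqueue_embedding_msg(self, msg):
        self.queue.append(msg)

    def handle(self, msg, process):
        try:
            process(msg)
            self.db_writes += 1  # persist the result only on success
        except RateLimitError:
            # Re-enqueue directly inside the handler: no "should_requeue"
            # flag to carry through the normal path, and no DB write
            # happens for a rate-limited message.
            self.enqueue_embedding_msg(msg)

def always_429(msg):
    raise RateLimitError()

h = Handler()
h.handle({"id": 1}, always_429)
assert h.queue == [{"id": 1}]   # message went back to the queue
assert h.db_writes == 0         # no DB write on the rate-limit path
```

The flag-based version would have fallen through to the shared tail of the method, which is exactly where the unnecessary DB write occurred.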
…lcengine#611)

When multiple OpenViking processes share the same data directory (common with stdio MCP in multi-session hosts), they silently contend for AGFS and VectorDB resources. This produces misleading errors like "Collection 'context' does not exist" instead of pointing to the actual cause.

Add a PID-based advisory lock in OpenVikingService.initialize() that:
- Detects an existing live process using the same data directory
- Raises DataDirectoryLocked with a clear error message explaining the contention and suggesting HTTP mode or separate data directories
- Cleans up stale lock files from crashed processes
- Releases the lock on normal exit via atexit

The lock uses a .openviking.pid file in the data directory. HTTP mode (openviking-server) is unaffected since it handles concurrency natively.

Relates to volcengine#473

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
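A minimal sketch of such a PID-based advisory lock, assuming POSIX-style os.kill(pid, 0) liveness probing; the function name and error wording are illustrative, not the actual OpenVikingService code:

```python
import atexit
import os
from pathlib import Path

class DataDirectoryLocked(RuntimeError):
    pass

def acquire_pid_lock(data_dir: str) -> None:
    lock_file = Path(data_dir) / ".openviking.pid"
    if lock_file.exists():
        try:
            other_pid = int(lock_file.read_text().strip())
        except ValueError:
            other_pid = None  # corrupt lock file: treat as stale
        if other_pid is not None:
            try:
                os.kill(other_pid, 0)  # signal 0: liveness probe only
            except ProcessLookupError:
                other_pid = None  # stale lock from a crashed process
        if other_pid is not None:
            raise DataDirectoryLocked(
                f"data dir {data_dir} is in use by pid {other_pid}; "
                "use HTTP mode or a separate data directory"
            )
        lock_file.unlink(missing_ok=True)
    lock_file.write_text(str(os.getpid()))
    # Release on normal interpreter exit.
    atexit.register(lock_file.unlink, missing_ok=True)
```

Note this is advisory: nothing stops a process that never calls acquire_pid_lock() from touching the directory, which is the usual trade-off for lock files versus OS-level mandatory locks.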
…volcengine#617)

During session commit, messages.jsonl (the archived session transcript) was unnecessarily summarized by the semantic file summary pipeline, wasting tokens and adding latency. Add messages.jsonl to a skip list in _list_dir() so it never enters the semantic DAG.

Closes volcengine#564

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…cengine#618)

* fix(session): handle non-dict LLM responses in memory extraction

  When using local Ollama models, parse_json_from_response may return a list instead of the expected {"memories": [...]} dict, causing an AttributeError on .get(). Add type checking after parsing: wrap lists as {"memories": data}, and fall back to {} for other unexpected types.

  Closes volcengine#605

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix ruff formatting in test file

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
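The type check described above can be sketched as a small normalization helper (the helper name is hypothetical, not the actual session code):

```python
def normalize_memories_response(data):
    """Coerce an LLM JSON-parse result into {"memories": [...]} shape."""
    if isinstance(data, dict):
        return data
    if isinstance(data, list):
        # Some local models (e.g. via Ollama) return a bare list
        # instead of the expected wrapper dict.
        return {"memories": data}
    return {}  # any other shape: extract nothing rather than crash on .get()

assert normalize_memories_response({"memories": ["a"]}) == {"memories": ["a"]}
assert normalize_memories_response(["a", "b"]) == {"memories": ["a", "b"]}
assert normalize_memories_response("oops") == {}
```

Calling .get("memories", []) on the normalized value is then safe for every response shape.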
…olcengine#622)

Add retrieval quality observability following the existing observer pattern (QueueObserver, VLMObserver, VikingDBObserver).

- RetrievalStatsCollector: thread-safe singleton that accumulates per-query metrics (result counts, scores, latency, rerank usage)
- RetrievalObserver: reads accumulated stats, reports health based on zero-result rate, formats a status table with tabulate
- Instrumented HierarchicalRetriever.retrieve() to record stats
- Added the /api/v1/observer/retrieval endpoint
- Included in the system-wide observer health check
- 17 unit tests covering stats, collector, and observer

This contribution was developed with AI assistance (Claude Code).

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…rom URL (volcengine#619)

* fix(parse): preserve original filename and extension when importing from URL

  When importing resources via URL (e.g., COS/HTTP paths to code files), three bugs occurred: an extra directory was created using the temp file name, the file extension was changed from .py to .md, and URL-encoded characters in filenames were not decoded.

  Root cause: code file extensions (.py, .js, etc.) were not recognized by URLTypeDetector, causing them to be parsed as web pages through the HTML-to-markdown pipeline. This lost the original filename and always produced .md output.

  Changes:
  - Add common code/text file extensions to URLTypeDetector so they route to download instead of webpage parsing
  - Extract and URL-decode the original filename from the URL path
  - Save downloaded text/code files with their original name and extension instead of routing through MarkdownParser
  - Pass the original filename to the markdown and PDF parsers as source_path / resource_name so temp file names don't leak into the final URI

  Fixes volcengine#251

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(parse): reuse CODE_EXTENSIONS from constants module

  Import CODE_EXTENSIONS from openviking.parse.parsers.constants instead of duplicating the set in URLTypeDetector. The spread comes first in EXTENSION_MAP so explicit entries (.html, .htm) correctly override to DOWNLOAD_HTML.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
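Extracting and URL-decoding the original filename can be sketched with the standard library (a hypothetical helper, not the actual parser code):

```python
import posixpath
from urllib.parse import unquote, urlparse

def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, percent-decoded.

    urlparse() strips the query string and fragment, so signed-URL
    parameters (common on COS/S3-style links) never leak into the name.
    """
    path = urlparse(url).path
    return unquote(posixpath.basename(path))

assert filename_from_url("https://bucket.example.com/src/my%20module.py?sig=abc") == "my module.py"
assert filename_from_url("https://example.com/a/b/util.js") == "util.js"
```

Decoding after taking the basename matters: a percent-encoded "/" in the final segment stays part of the filename instead of being treated as a directory separator.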
Co-authored-by: openviking <openviking@example.com>
…context agentId (volcengine#637)

When the OpenClaw gateway serves multiple agents, each agent's before_agent_start and agent_end hooks now carry the agent's ID in the second parameter (PluginHookAgentContext). The plugin dynamically switches the client's agentId before each recall/capture operation, ensuring memories are routed to the correct agent_space (md5(user_id + agent_id)[:12]).

Changes:
- client.ts: Add setAgentId()/getAgentId() to allow dynamic agent switching. Clears cached runtimeIdentity and resolvedSpaceByScope when switching to ensure correct space derivation.
- index.ts: Extract agentId from the hook ctx (2nd param) in both before_agent_start and agent_end handlers.

This is backward compatible: if ctx.agentId is absent (single-agent setup), the plugin falls back to the static config agentId as before.

Co-authored-by: Mac <mac@MacBookPro.lan>
…ine#640)

- refactor: replace operation trace with telemetry
- fix telemetry demo skill ingestion
- simplify telemetry summary metric keys
- rename remaining trace telemetry artifacts
- feat: support configurable telemetry payloads
- docs: rewrite operation telemetry design in chinese
- fix: reject telemetry for async session commit
- refactor: isolate telemetry orchestration
- refactor: remove telemetry from find payloads
- refactor: remove telemetry event payloads
- fix(trace): keep only telemetry-related changes
- fix(trace): remove top-level usage from telemetry responses
- feat(console): default telemetry on proxied operations
…gine#607 review)

Merge 17 upstream commits (feat/fix: trace metrics, multi-agent isolation, vikingdb TUI, rate-limit simplification, session fixes, etc.) and resolve conflicts in embedding_msg, embedding_msg_converter, collection_schemas, pyproject.toml, and uv.lock.

Scope reduction (defers the multimodal pipeline to April, when the add-resource pipeline is ready):
- Remove multimodal dispatch from embedding_utils.vectorize_file; media files always fall back to text/summary (resolves r2936557627: removes the hardcoded "gemini" string check; resolves r2936556516: no PDF 6-page limit concern)
- Drop media_uri/media_mime_type from EmbeddingMsg; add telemetry_id from upstream; preserve id through the to_dict/from_dict round-trip
- Drop media fields from EmbeddingMsgConverter; pass telemetry_id
- Remove the multimodal on_dequeue path from TextEmbeddingHandler; adopt upstream's telemetry-tracked text-only path
- Add a TODO in GeminiDenseEmbedder.embed_multimodal for parts-list aggregation (resolves r2936536550; deferred to April)

GeminiDenseEmbedder retains supports_multimodal + embed_multimodal for future activation once Vectorize.parts is supported end-to-end.
…lution

r2936557627 (use supports_multimodal, not a string check): Split the IMAGE/VIDEO/AUDIO block from PDF. A TODO in the media block explicitly documents the correct future pattern: `if embedder.supports_multimodal` instead of `(provider or "").lower() == "gemini"`, noting that doubao-embedding-vision also exposes supports_multimodal=True.

r2936556516 (PDF 6-page Gemini limit): PDF now has its own explicit elif branch with a comment explaining the hard limit (6 pages/request), the required solution (≤6-page chunk splitting with one vector per chunk), and the April 2026 add-resource timeline.

r2936536550 (parts API, single-media limitation): Confirm in a code comment that the current 1-text + 1-media implementation matches the googleapis/python-genai test suite exactly. The TODO is updated with the concrete April plan: upgrade Vectorize to carry parts: List[Part] for interleaved text+media sequences from PDF chunk pipelines.
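The capability-check dispatch that the TODO describes can be sketched as follows; the classes and the choose_path helper are illustrative stand-ins, not the actual embedding_utils code:

```python
class TextOnlyEmbedder:
    supports_multimodal = False

class MultimodalEmbedder:
    # Any provider that declares the capability qualifies, e.g. both
    # Gemini and doubao-embedding-vision, with no provider-name check.
    supports_multimodal = True

def choose_path(embedder, is_media: bool) -> str:
    """Route media to the multimodal path only when the embedder
    advertises support; otherwise fall back to text/summary."""
    if is_media and getattr(embedder, "supports_multimodal", False):
        return "multimodal"
    return "text"

assert choose_path(TextOnlyEmbedder(), is_media=True) == "text"
assert choose_path(MultimodalEmbedder(), is_media=True) == "multimodal"
assert choose_path(MultimodalEmbedder(), is_media=False) == "text"
```

Compared with `(provider or "").lower() == "gemini"`, this keeps the dispatch correct automatically as new multimodal-capable embedders are added.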
📝 Walkthrough

This PR introduces comprehensive operation-level telemetry collection, vector record pagination in the CLI UI, context-aware memory deduplication, retrieval statistics tracking, process-level locking, and extensive refactoring of the vector database layer to support per-tenant account isolation.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Summary of Changes

This pull request introduces a robust, operation-level telemetry system to OpenViking, enabling detailed tracking of performance, resource consumption, and errors across various services. It refines the Gemini embedding integration by focusing on text-only capabilities and deferring complex multimodal features, while also enhancing the CLI with new debugging and observability tools. Significant improvements in system stability and cross-platform compatibility are delivered through data directory locking, improved path handling, and more resilient API interactions.
Code Review
This pull request introduces a comprehensive telemetry system to track operation-scoped metrics, including token usage, queue processing, vector search, semantic DAG node stats, and memory extraction. It refactors the VikingVectorIndexBackend for improved multi-tenancy with per-account backend instances and introduces a VikingDBManagerProxy for simplified tenant-aware operations. The TUI client is enhanced with new debug vector methods and keybindings for viewing vector records. Retrieval quality metrics are now collected via a new RetrievalStatsCollector and exposed through a RetrievalObserver.

Additionally, the PR includes fixes for URL filename preservation and handling of code file extensions during resource parsing, adds exponential backoff for Volcengine embedder API calls, and implements a PID-based advisory lock to prevent data directory contention.

A minor issue was noted: the search_global_roots_in_tenant function name is misleading now that it also fetches level 2 items; renaming it to something like search_global_candidates_in_tenant would better reflect its broader scope.
```diff
         extra_filter=extra_filter,
     ),
-    In("level", [0, 1]),
+    In("level", [0, 1, 2]),  # TODO: smj fix this
```
The function name search_global_roots_in_tenant is a bit misleading now that it also fetches level 2 items, as the name suggests it only fetches 'roots' (level 0 and 1). Consider renaming it to something like search_global_candidates_in_tenant to better reflect its behavior. This would also help resolve the associated TODO.
Actionable comments posted: 18
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
openviking/storage/local_fs.py (1)
202-226: ⚠️ Potential issue | 🔴 Critical

ZIP read will fail for Windows-created archives after path normalization.
The code normalizes `zip_path` at line 208, then uses the normalized `safe_zip_path` to call `zf.read(safe_zip_path)` at line 223. However, `zipfile.ZipFile.read()` requires the exact entry name as stored in the ZIP archive.

When a Windows-created ZIP contains entries with backslashes (e.g., `project\file.txt`), these are stored as-is in the archive. Normalizing the path to `project/file.txt` before calling `zf.read()` raises a `KeyError` because the actual ZIP entry name still contains backslashes.

To fix this, preserve the original filename for ZIP read operations:
Proposed fix
```diff
 for info in infolist:
-    zip_path = info.filename
-    if not zip_path:
+    original_zip_path = info.filename
+    if not original_zip_path:
         continue

     # Normalize path separators to handle Windows-created ZIPs
-    zip_path = zip_path.replace("\\", "/")
-    safe_zip_path = _validate_ovpack_member_path(zip_path, base_name)
+    normalized_zip_path = original_zip_path.replace("\\", "/")
+    safe_zip_path = _validate_ovpack_member_path(normalized_zip_path, base_name)

     # Handle directory entries
     if safe_zip_path.endswith("/"):
         rel_path = get_viking_rel_path_from_zip(safe_zip_path.rstrip("/"))
         target_dir_uri = f"{root_uri}/{rel_path}" if rel_path else root_uri
         await viking_fs.mkdir(target_dir_uri, exist_ok=True, ctx=ctx)
         continue

     # Handle file entries
     rel_path = get_viking_rel_path_from_zip(safe_zip_path)
     target_file_uri = f"{root_uri}/{rel_path}" if rel_path else root_uri
     try:
-        data = zf.read(safe_zip_path)
+        data = zf.read(original_zip_path)
         await viking_fs.write_file_bytes(target_file_uri, data, ctx=ctx)
     except Exception as e:
-        logger.error(f"Failed to import {zip_path} to {target_file_uri}: {e}")
+        logger.error(f"Failed to import {original_zip_path} to {target_file_uri}: {e}")
         if not force:
             # In non-force mode, stop on error
             raise e
```

bot/vikingbot/config/schema.py (1)
733-737:⚠️ Potential issue | 🟡 Minor
`from_safe_name` should validate/safely split input.
__-separated parts and can throwIndexErroron malformed values.✅ Robust parsing fix
`@staticmethod` def from_safe_name(safe_name: str): - file_name_split = safe_name.split("__") - return SessionKey( - type=file_name_split[0], channel_id=file_name_split[1], chat_id=file_name_split[2] - ) + parts = safe_name.split("__", 2) + if len(parts) != 3: + raise ValueError(f"Invalid SessionKey safe_name: {safe_name}") + return SessionKey(type=parts[0], channel_id=parts[1], chat_id=parts[2])🤖 Prompt for AI Agents
openviking/session/session.py (1)
260-276: ⚠️ Potential issue | 🟡 Minor

Set `memory.extracted` when compressor is unavailable.
self._session_compressorunset never emitmemory.extracted, creating inconsistent telemetry for successful commits.Proposed fix
# 2. Extract long-term memories if self._session_compressor: logger.info( f"Starting memory extraction from {len(messages_to_archive)} archived messages" ) memories = run_async( self._session_compressor.extract_long_term_memories( messages=messages_to_archive, user=self.user, session_id=self.session_id, ctx=self.ctx, ) ) logger.info(f"Extracted {len(memories)} memories") result["memories_extracted"] = len(memories) self._stats.memories_extracted += len(memories) get_current_telemetry().set("memory.extracted", len(memories)) + else: + get_current_telemetry().set("memory.extracted", 0) @@ # 2. Extract long-term memories if self._session_compressor: logger.info( f"Starting memory extraction from {len(messages_to_archive)} archived messages" ) memories = await self._session_compressor.extract_long_term_memories( messages=messages_to_archive, user=self.user, session_id=self.session_id, ctx=self.ctx, ) logger.info(f"Extracted {len(memories)} memories") result["memories_extracted"] = len(memories) self._stats.memories_extracted += len(memories) get_current_telemetry().set("memory.extracted", len(memories)) + else: + get_current_telemetry().set("memory.extracted", 0)Also applies to: 338-351
🤖 Prompt for AI Agents
openviking/models/embedder/volcengine_embedders.py (1)
206-227:⚠️ Potential issue | 🟠 Major
`embed_batch` lacks retry logic unlike `embed`.
embed()method usesexponential_backoff_retryfor resilience against 429 errors, butembed_batch()makes direct API calls without retry wrapping. This inconsistency means batch operations are more susceptible to transient rate limiting failures.Consider wrapping the batch API call with the same retry logic:
♻️ Proposed fix
def embed_batch(self, texts: List[str]) -> List[EmbedResult]: ... if not texts: return [] - try: + def _batch_embed_call(): if self.input_type == "multimodal": multimodal_inputs = [{"type": "text", "text": text} for text in texts] response = self.client.multimodal_embeddings.create( input=multimodal_inputs, model=self.model_name ) self._update_telemetry_token_usage(response) data = response.data else: response = self.client.embeddings.create(input=texts, model=self.model_name) self._update_telemetry_token_usage(response) data = response.data return [ EmbedResult(dense_vector=truncate_and_normalize(item.embedding, self.dimension)) for item in data ] + + try: + return exponential_backoff_retry( + _batch_embed_call, + max_wait=10.0, + base_delay=0.5, + max_delay=2.0, + jitter=True, + is_retryable=is_429_error, + logger=logger, + ) except Exception as e: ...🤖 Prompt for AI Agents
🟡 Minor comments (8)
tests/unit/test_skill_processor_none.py-47-63 (1)
47-63: ⚠️ Potential issue | 🟡 Minor

The test assertion uses a platform-specific error message that may cause CI failures on Windows.
The `match="File name too long"` assertion is specific to POSIX systems (errno 36, ENAMETOOLONG). Windows handles filesystem path length limits differently and would raise errors with different messages or codes depending on whether long path support is enabled. This test will fail on Windows or in cross-platform CI without platform-specific handling.

Consider removing the match pattern to accept any OSError, or add `pytest.mark.skipif(sys.platform == "win32", reason="POSIX-specific path length limits")` if this behavior is intentionally Unix-only:

Proposed fix
+import sys + def test_parse_skill_long_raw_content_raises_oserror(self): """Long raw SKILL.md content should still surface path probing errors.""" processor = SkillProcessor(vikingdb=None) @@ -59,5 +61,5 @@ class TestParseSkillNoneData: "Use this skill to validate telemetry ingestion.\n" ) - with pytest.raises(OSError, match="File name too long"): + with pytest.raises(OSError): processor._parse_skill(raw_skill)🤖 Prompt for AI Agents
openviking/retrieve/retrieval_stats.py-26-33 (1)
26-33: ⚠️ Potential issue | 🟡 Minor

Track scored items separately from returned items.

`avg_score`, `min_score`, and `max_score` are keyed off `total_results`, but only `scores` updates the score accumulators. If a caller records results without per-hit scores, `min_score` stays `inf`; if all scores are negative, `max_score` stays `0.0`.

📊 Suggested fix
`@dataclass` class RetrievalStats: total_queries: int = 0 total_results: int = 0 zero_result_queries: int = 0 + scored_results: int = 0 total_score_sum: float = 0.0 - max_score: float = 0.0 + max_score: float = float("-inf") min_score: float = float("inf") @@ def avg_score(self) -> float: - if self.total_results == 0: + if self.scored_results == 0: return 0.0 - return self.total_score_sum / self.total_results + return self.total_score_sum / self.scored_results @@ - "max_score": round(self.max_score, 4) if self.total_results > 0 else 0.0, - "min_score": round(self.min_score, 4) if self.total_results > 0 else 0.0, + "max_score": round(self.max_score, 4) if self.scored_results > 0 else 0.0, + "min_score": round(self.min_score, 4) if self.scored_results > 0 else 0.0, @@ for s in scores: + self._stats.scored_results += 1 self._stats.total_score_sum += sAlso applies to: 48-51, 65-69, 120-125
openviking/models/embedder/openai_embedders.py-149-152 (1)
149-152: ⚠️ Potential issue | 🟡 Minor

Same concern: telemetry recorded before validating response data.
For consistency, consider moving the telemetry update after iterating `response.data` to ensure tokens are only counted for successfully processed responses.

Proposed reordering
response = self.client.embeddings.create(**kwargs) - self._update_telemetry_token_usage(response) - return [EmbedResult(dense_vector=item.embedding) for item in response.data] + results = [EmbedResult(dense_vector=item.embedding) for item in response.data] + self._update_telemetry_token_usage(response) + return results🤖 Prompt for AI Agents
openviking/models/embedder/openai_embedders.py-119-121 (1)
119-121: ⚠️ Potential issue | 🟡 Minor

Consider updating telemetry after validating response data.
Telemetry is recorded on line 120 before accessing `response.data[0].embedding` on line 121. If `response.data` is empty or malformed, token usage will be counted for a request that ultimately fails. Consider moving the telemetry update after the data extraction, or ensure the response structure is valid first.

Proposed reordering
response = self.client.embeddings.create(**kwargs) - self._update_telemetry_token_usage(response) vector = response.data[0].embedding + self._update_telemetry_token_usage(response) return EmbedResult(dense_vector=vector)🤖 Prompt for AI Agents
crates/ov_cli/src/tui/tree.rs-69-72 (1)
69-72: ⚠️ Potential issue | 🟡 Minor

Don't overload the root URI for display.
Using `"/"` here makes `FsEntry::name()` return an empty string, so the synthetic root row renders blank. It also changes the URI propagated by `selected_uri()`. Keep a canonical root URI and derive `"/"` as a display label separately.
tests/telemetry/test_resource_summary.py-11-14 (1)
11-14: ⚠️ Potential issue | 🟡 Minor

This test doesn't actually cover `queue_status` parsing.
record_resource_wait_metrics()could stop readingqueue_statusand this test would still pass. Leave one collector unpatched, or assert a field sourced directly fromqueue_status.Also applies to: 30-58
🤖 Prompt for AI Agents
openviking/models/embedder/base.py-322-328 (1)
322-328: ⚠️ Potential issue | 🟡 Minor
Re-cap the jittered delay.
`max_delay` is enforced before jitter, so once the raw backoff reaches the cap the actual sleep can still exceed it. Clamp again after applying jitter to keep the per-retry ceiling predictable.
Suggested fix

```diff
 delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
 if jitter:
     delay = delay * (0.5 + random.random())
+    delay = min(delay, max_delay)
 remaining_time = max_wait - elapsed
 delay = min(delay, remaining_time)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/models/embedder/base.py` around lines 322 - 328, The jittered delay can exceed max_delay because max_delay was applied before jitter; update the backoff logic in the function that computes delay (the block using base_delay, attempt, max_delay, jitter, random) so that after applying jitter you clamp again: after "if jitter: delay = delay * (0.5 + random.random())" add "delay = min(delay, max_delay)" (then compute remaining_time and cap by that as before). This ensures the per-retry ceiling (max_delay) holds even when jitter is enabled.
crates/ov_cli/src/tui/ui.rs-148-161 (1)
148-161: ⚠️ Potential issue | 🟡 Minor
Title shows "1/0" when records list is empty.
When `app.vector_state.records.is_empty()` is true, the title will display `cursor + 1`, which could show "1/0" before the empty-state check on line 171 returns. While the early return prevents rendering items, the title has already been set.
Consider guarding the title construction or showing a cleaner "0/0" for the empty state:
🔧 Proposed fix

```diff
 let title = if let Some(total) = app.vector_state.total_count {
-    format!(
-        " Vector Records ({}/{}, total: {}) ",
-        app.vector_state.cursor + 1,
-        app.vector_state.records.len(),
-        total
-    )
+    if app.vector_state.records.is_empty() {
+        format!(" Vector Records (0/{}, total: {}) ", 0, total)
+    } else {
+        format!(
+            " Vector Records ({}/{}, total: {}) ",
+            app.vector_state.cursor + 1,
+            app.vector_state.records.len(),
+            total
+        )
+    }
 } else {
-    format!(
-        " Vector Records ({}/{}) ",
-        app.vector_state.cursor + 1,
-        app.vector_state.records.len()
-    )
+    if app.vector_state.records.is_empty() {
+        " Vector Records (0/0) ".to_string()
+    } else {
+        format!(
+            " Vector Records ({}/{}) ",
+            app.vector_state.cursor + 1,
+            app.vector_state.records.len()
+        )
+    }
 };
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/ov_cli/src/tui/ui.rs` around lines 148 - 161, The title construction can display "1/0" when app.vector_state.records is empty because it unconditionally uses app.vector_state.cursor + 1; update the logic that builds title (the block using app.vector_state.total_count, app.vector_state.cursor, and app.vector_state.records.len()) to first check if app.vector_state.records.is_empty() and produce " Vector Records (0/0) " (and optionally include total when present) otherwise use the existing cursor+1 and len() formatting; ensure this guard sits before the current title assignment so the empty-state title is never computed from cursor+1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bot/vikingbot/config/loader.py`:
- Line 56: The bootstrap currently persists default_config too early via
save_config(default_config, config_path, include_defaults=True), which causes
bot.agents.model to be written and prevents _merge_vlm_model_config() from
applying vlm/provider overrides; fix by not persisting defaults before merging:
either change the save_config call to use include_defaults=False (or remove it)
and ensure _merge_vlm_model_config() runs and can mutate the in-memory config
first, or postpone the save_config call until after _merge_vlm_model_config()
has run; ensure the change prevents writing bot.agents.model/default provider
before merge.
In `@examples/openclaw-memory-plugin/index.ts`:
- Around line 357-365: The hook is mutating a shared OpenVikingClient by calling
client.setAgentId(hookAgentId), causing cross-request interference; instead make
agentId request-scoped by either creating/obtaining a per-agent client or
passing agentId into the memory operations rather than mutating the shared
instance. Update the before_agent_start and the other hook (the block around the
later capture/recall code) to use a request-scoped client accessor (e.g.,
getClientForAgent(agentId)) or call methods that accept an agentId parameter
instead of calling OpenVikingClient.setAgentId on the shared client, and ensure
any resolved-space cache is keyed by agentId rather than a global field.
In `@openviking_cli/client/http.py`:
- Around line 576-577: The response envelope's telemetry is currently discarded
in the client methods that call self._handle_response_data (e.g., in find() and
search()); extract telemetry = response_data.get("telemetry") and ensure it is
passed into the FindResult creation instead of dropping it—either by calling
FindResult.from_dict(result_dict, telemetry=telemetry) (or otherwise setting a
telemetry field on the returned FindResult) or by returning the full envelope
consistently; apply the same change in both places that call
FindResult.from_dict after _handle_response_data.
- Around line 225-227: The function _validate_telemetry currently calls
normalize_telemetry_request(telemetry) but returns the original telemetry
object; change it to capture and return the normalized result—i.e., call
normalized = normalize_telemetry_request(telemetry) (or return the call
directly) and then return that normalized payload so any non-mutating
normalization is sent to the server; update references in _validate_telemetry to
use the normalized value.
In `@openviking/console/bootstrap.py`:
- Around line 41-44: The default for the "--request-timeout-sec" argument is too
high (3600.0); change its default to a much lower safe value (e.g., 30.0 or
60.0) in the argparse add_argument call that defines "--request-timeout-sec" and
update the help string to note that longer timeouts require explicit opt-in for
long-running operations. Locate the add_argument invocation for
"--request-timeout-sec" in bootstrap.py and change the default float value and
help text accordingly.
In `@openviking/parse/parsers/html.py`:
- Around line 352-368: The extracted filename from _extract_filename_from_url
may decode to path-traversal values (e.g., "..", "../", or include separators)
and is used directly when composing temp URIs; sanitize it before returning by
stripping any path separators and dot-segments, normalizing/slugifying to a safe
filename (keep only alphanumerics, hyphen, underscore, dot), trim leading dots,
and fall back to a constant safe name like "download" if the result is empty or
equals "." or ".."; apply the same sanitization where original_filename is used
to build root_dir / file_uri to ensure no traversal escapes the temp subtree.
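A minimal sketch of the sanitization described above (the function name and the `"download"` fallback are illustrative; the real code would apply this inside `_extract_filename_from_url` after URL decoding):

```python
import re

SAFE_FALLBACK = "download"  # constant fallback when nothing safe survives

def sanitize_filename(name: str) -> str:
    """Strip path separators and dot-segments from an extracted filename."""
    # Drop any directory components (handles both "/" and "\" separators).
    name = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Keep only alphanumerics, hyphen, underscore, and dot.
    name = re.sub(r"[^A-Za-z0-9._-]", "", name)
    # Trim leading dots so the result can never be "." or "..".
    name = name.lstrip(".")
    return name or SAFE_FALLBACK
```

Applying the same function where `original_filename` feeds `root_dir / file_uri` keeps every write inside the temp subtree.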
In `@openviking/retrieve/hierarchical_retriever.py`:
- Around line 355-367: The seeded level-2 hits in the initial_candidates loop
are added to collected_by_uri without applying the score gate; update the block
that iterates initial_candidates in hierarchical_retriever.py to compute
r["_final_score"] (using r.get("_score", 0.0)) and call
passes_threshold(r["_final_score"]) (or the appropriate scorer method used
elsewhere) before inserting into collected_by_uri and logging; ensure you still
only accept items where r.get("level", 2) == 2 and skip/log those that fail the
threshold so the global score_threshold is respected for seeded L2 hits.
In `@openviking/storage/collection_schemas.py`:
- Around line 165-189: The bug is that _merge_request_stats keeps appending the
same telemetry_id into cls._request_stats_order, so when the list exceeds
cls._max_cached_stats an old duplicate may be popped and its live aggregate
removed from cls._request_stats_by_telemetry_id; fix by ensuring
cls._request_stats_order contains each telemetry_id only once (or is updated to
reflect most-recent use) before eviction. Modify _merge_request_stats to, while
holding cls._request_stats_lock, remove any existing occurrence of telemetry_id
from cls._request_stats_order (or maintain a companion set) and then append
telemetry_id (or append/move-to-end) so eviction always removes a truly oldest
unused telemetry id and not a duplicate of an active one; keep
consume_request_stats behavior unchanged.
- Around line 309-312: The current logger.debug call (using record_id and
inserted_data['abstract'] and inserted_data['vector'][:5]) leaks raw content and
will KeyError on sparse-only runs; change the logging in the upsert/insert path
to avoid logging the raw abstract or vector slice and instead log safe metadata
(e.g., record_id, inserted_data.get("doc_id") or inserted_data.get("chunk_id"),
collection name or inserted_data.keys()) and a boolean indicating whether a
vector is present (vector_present = "vector" in inserted_data). Update the
logger.debug line that references record_id and inserted_data to use
inserted_data.get(...) and only include non-sensitive identifiers and vector
presence, not the actual abstract or vector contents.
- Around line 253-261: The 429 re-enqueue path currently requeues immediately
and indefinitely; modify it to add a retry counter on EmbeddingMsg (e.g.,
add/inspect a attempts or retry_count field on EmbeddingMsg), check and
increment that counter before re-enqueueing in the block guarded by is_429_error
and self._vikingdb.has_queue_manager, and apply an exponential backoff sleep
(use the same backoff logic pattern as in volcengine_vectorizer.py) before
calling self._vikingdb.enqueue_embedding_msg(embedding_msg); enforce a maximum
retry limit (drop/log and call a failure handler/report if retry_count exceeds
max) so messages are not re-enqueued forever, and ensure successful requeue
still calls report_success.
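The bounded-retry re-enqueue can be sketched as follows. `EmbeddingMsg` here is a simplified stand-in, and `retry_count`, `MAX_RETRIES`, and the 30-second cap are the suggested additions rather than existing code; the real backoff constants should mirror `volcengine_vectorizer.py`:

```python
import random
import time
from dataclasses import dataclass

MAX_RETRIES = 5        # assumed limit; a design choice, not existing code
BASE_DELAY_SEC = 1.0

@dataclass
class EmbeddingMsg:
    message: str
    retry_count: int = 0  # hypothetical new field suggested by the review

def handle_429(msg, enqueue, report_failure):
    """Re-enqueue a rate-limited message with capped, jittered backoff."""
    if msg.retry_count >= MAX_RETRIES:
        report_failure(msg)              # drop instead of retrying forever
        return
    msg.retry_count += 1
    delay = BASE_DELAY_SEC * (2 ** (msg.retry_count - 1))
    delay *= 0.5 + random.random()       # jitter in [0.5x, 1.5x)
    time.sleep(min(delay, 30.0))         # cap the per-retry sleep
    enqueue(msg)
```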
In `@openviking/storage/queuefs/semantic_processor.py`:
- Around line 76-103: _cache_dag_stats stores the same DagStats under both
cls._dag_stats_by_telemetry_id and cls._dag_stats_by_uri and appends
(telemetry_id, uri) to cls._dag_stats_order, but consume_dag_stats only removes
the entry it was queried by leaving the alias and stale order tuples behind; fix
by making caching idempotent (in _cache_dag_stats) — before appending a new
(telemetry_id, uri) remove any existing tuples for the same telemetry_id or uri
from cls._dag_stats_order and also remove their previous entries from
cls._dag_stats_by_telemetry_id/cls._dag_stats_by_uri to avoid duplicates, and
(in consume_dag_stats) when removing by telemetry_id or by uri also remove the
corresponding alias in the other dict and remove any matching tuples from
cls._dag_stats_order while holding _stats_lock so indices remain in sync and
eviction via _max_cached_stats works correctly.
- Around line 105-129: The current list-backed _request_stats_order in
_merge_request_stats lets duplicate telemetry ids accumulate and causes eviction
of live aggregates; replace the list with a keyed LRU (e.g.,
collections.OrderedDict or dict with move-to-end semantics) so each telemetry_id
is unique in the ordering. In _merge_request_stats (and related code paths),
after updating cls._request_stats_by_telemetry_id[telemetry_id], insert or move
the key to the end (OrderedDict.__setitem__ or move_to_end) and when len(order)
> cls._max_cached_stats pop the oldest via order.popitem(last=False) and also
pop that key from cls._request_stats_by_telemetry_id; keep consume_request_stats
behavior the same but it will now safely pop the stats by key without risk of
earlier duplicate eviction.
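The keyed-LRU scheme above can be sketched with `collections.OrderedDict`. The class name and the integer aggregates below are illustrative, not the real `_request_stats_*` structures; the point is that each telemetry id appears exactly once in the eviction order, so evicting the oldest entry can never remove a live aggregate through a stale duplicate:

```python
from collections import OrderedDict

class RequestStatsCache:
    def __init__(self, max_cached=3):
        self._max = max_cached
        self._stats = {}             # telemetry_id -> aggregate (int here)
        self._order = OrderedDict()  # recency order; keys are unique

    def merge(self, telemetry_id, value):
        self._stats[telemetry_id] = self._stats.get(telemetry_id, 0) + value
        self._order[telemetry_id] = None
        self._order.move_to_end(telemetry_id)  # most-recently used at end
        while len(self._order) > self._max:
            oldest, _ = self._order.popitem(last=False)
            self._stats.pop(oldest, None)      # evict the truly-oldest id

    def consume(self, telemetry_id):
        self._order.pop(telemetry_id, None)
        return self._stats.pop(telemetry_id, None)
```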
In `@openviking/storage/viking_fs.py`:
- Around line 648-650: The debug log in VikingFS.find leaks raw tenant/user
identifiers by interpolating real_ctx.account_id and real_ctx.user; change the
logger.debug calls (e.g., the one in VikingFS.find that currently references
real_ctx.account_id and real_ctx.user, and the similar calls around the 784-786
region) to redact or hash those values before logging (for example log counts,
masked values like "<redacted>" or a deterministic hash/ID instead of the raw
identifiers) so the message still conveys context without exposing PII.
In `@openviking/storage/viking_vector_index_backend.py`:
- Around line 620-637: The count() and clear() methods currently fall back to
_get_default_backend() when ctx is None which scopes operations to the "default"
account; change the no-ctx path to use _get_root_backend() instead so these
calls operate collection-wide (or alternatively make ctx mandatory).
Specifically, update both count() and clear() to call self._get_root_backend()
when ctx is falsy (keeping the existing use of _get_backend_for_context(ctx)
when ctx is provided), and ensure return await backend.count(...) and return
await backend.clear(...) remain unchanged.
- Around line 380-385: The health_check implementation always returns True if
collection_exists() doesn't raise, ignoring its boolean result; change
health_check (in VikingVectorIndexBackend) to await and capture the return value
from collection_exists() and return that boolean, and only return False on
exceptions (optionally logging the exception) so that an uninitialized backend
(collection_exists() == False) is reported unhealthy.
- Around line 137-141: The upsert currently only injects self._bound_account_id
when payload lacks "account_id", allowing callers to override and write into
other tenants; modify _SingleAccountBackend.upsert to validate that if
self._bound_account_id is set and payload.get("account_id") exists and differs
from it, the method rejects the request (raise a ValueError/ClientError) instead
of proceeding — enforce the check using self._bound_account_id and
payload["account_id"] and return/raise immediately with a clear error message
when they mismatch.
In `@openviking/utils/process_lock.py`:
- Around line 58-79: The current read -> check -> open("w") flow in acquire lock
is race-prone and returns success when writing fails; replace it with an atomic
create and fail-closed behavior: attempt to create the lock file using an atomic
open (os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)), write
my_pid bytes and close; if os.open raises FileExistsError, re-read via
_read_pid_file(lock_path) and if the existing PID is alive (use _is_pid_alive)
raise DataDirectoryLocked (including existing_pid and data_dir), otherwise
remove the stale file and retry the atomic create once; any OSError when
creating or writing the lock should raise an exception (do not return lock_path)
so callers know acquisition failed. Use the existing symbols _read_pid_file,
_is_pid_alive, DataDirectoryLocked, lock_path, my_pid, and data_dir to locate
and implement this change.
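The atomic create-or-fail flow can be sketched as below. This is a simplified, POSIX-oriented model of the suggested change (the real `acquire`, `_read_pid_file`, and `DataDirectoryLocked` live in `openviking/utils/process_lock.py` and carry more context such as `data_dir`):

```python
import os

class DataDirectoryLocked(RuntimeError):
    pass

def _is_pid_alive(pid):
    try:
        os.kill(pid, 0)        # signal 0 probes existence without killing (POSIX)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True            # process exists but is owned by another user
    return True

def acquire_lock(lock_path):
    my_pid = os.getpid()
    for _ in range(2):         # one retry after clearing a stale lock
        try:
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            try:
                with open(lock_path) as f:
                    existing_pid = int(f.read().strip())
            except (OSError, ValueError):
                existing_pid = None
            if existing_pid is not None and _is_pid_alive(existing_pid):
                raise DataDirectoryLocked(f"{lock_path} held by pid {existing_pid}")
            try:
                os.unlink(lock_path)   # stale or corrupt lock: reclaim it
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:  # fail-closed: write errors propagate
            f.write(str(my_pid))
        return lock_path
    raise DataDirectoryLocked(f"could not acquire {lock_path}")
```

`O_CREAT | O_EXCL` makes creation atomic, closing the read-check-write race, and any `OSError` during create or write propagates so callers never see a false success.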
In `@tests/misc/test_process_lock.py`:
- Around line 51-61: The test test_error_message_includes_remediation currently
will pass if acquire_data_dir_lock(tmpdir) does not raise; update it to
explicitly fail when no DataDirectoryLocked is raised by wrapping the call in a
try/except and adding a fail path (e.g., call to pytest.fail or raising
AssertionError) immediately after acquire_data_dir_lock(tmpdir) so the test only
succeeds when the except DataDirectoryLocked as exc branch runs and validates
the message; reference the test function name
test_error_message_includes_remediation and the exception class
DataDirectoryLocked as the places to change.
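The fail-closed test shape can be sketched as below. `acquire_data_dir_lock` and `DataDirectoryLocked` are stand-ins for the real symbols (here the stub always raises, so the happy path is visible); with pytest available, `with pytest.raises(DataDirectoryLocked)` achieves the same effect more idiomatically:

```python
class DataDirectoryLocked(Exception):
    def __init__(self, data_dir):
        super().__init__(
            f"{data_dir} is locked by another process; "
            "stop it or remove the lock file")

def acquire_data_dir_lock(data_dir):  # stub simulating the locked case
    raise DataDirectoryLocked(data_dir)

def test_error_message_includes_remediation(tmpdir="/tmp/data"):
    try:
        acquire_data_dir_lock(tmpdir)
    except DataDirectoryLocked as exc:
        assert "remove the lock file" in str(exc)
    else:
        # Without this branch, the test silently passes when nothing raises.
        raise AssertionError("expected DataDirectoryLocked")
```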
---
Outside diff comments:
In `@bot/vikingbot/config/schema.py`:
- Around line 733-737: The from_safe_name function currently assumes safe_name
splits into exactly three parts and can IndexError; change it to split with
maxsplit=2 (safe_name.split("__", 2)), validate the resulting parts length
equals 3, and raise a clear ValueError (or custom error) if malformed, then
construct and return SessionKey(type=parts[0], channel_id=parts[1],
chat_id=parts[2]); also consider stripping whitespace from parts before
constructing the SessionKey to harden parsing.
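A sketch of the hardened parser, assuming a minimal `SessionKey` shape (the real dataclass in `bot/vikingbot/config/schema.py` may carry more fields):

```python
from dataclasses import dataclass

@dataclass
class SessionKey:
    type: str
    channel_id: str
    chat_id: str

def from_safe_name(safe_name: str) -> SessionKey:
    # maxsplit=2 keeps any "__" inside chat_id intact and guarantees
    # at most three parts; validate before indexing to avoid IndexError.
    parts = safe_name.split("__", 2)
    if len(parts) != 3:
        raise ValueError(f"malformed safe_name: {safe_name!r}")
    return SessionKey(*(p.strip() for p in parts))
```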
In `@openviking/models/embedder/volcengine_embedders.py`:
- Around line 206-227: The embed_batch method currently calls
self.client.multimodal_embeddings.create and self.client.embeddings.create
directly and lacks the exponential_backoff_retry used by embed; update
embed_batch to wrap the API calls with the same exponential_backoff_retry helper
(the one used by embed) so the multimodal and non-multimodal branches call
exponential_backoff_retry(lambda: self.client.multimodal_embeddings.create(...))
and exponential_backoff_retry(lambda: self.client.embeddings.create(...))
respectively, preserve the existing _update_telemetry_token_usage(response) and
subsequent parsing (EmbedResult creation and truncate_and_normalize) and keep
the existing exception handling semantics.
In `@openviking/session/session.py`:
- Around line 260-276: When self._session_compressor is unset but there are
messages_to_archive, ensure telemetry and result fields still report zero
memories extracted: in the branch where self._session_compressor is false (and
in the analogous second commit path handling messages_to_archive), set
result["memories_extracted"] = 0, call
get_current_telemetry().set("memory.extracted", 0), and update
self._stats.memories_extracted (no-op or add 0) and optionally log that no
compressor was available; use the same symbols (self._session_compressor,
result["memories_extracted"], self._stats.memories_extracted,
get_current_telemetry()) so both commit paths emit consistent telemetry.
In `@openviking/storage/local_fs.py`:
- Around line 202-226: The ZIP read fails because you normalize info.filename
into safe_zip_path before calling zf.read; instead preserve the original archive
entry name (info.filename or zip_path before replacing backslashes) for zf.read
and only use safe_zip_path (after _validate_ovpack_member_path and slash
normalization) for validation and computing rel_path/target URIs used by
get_viking_rel_path_from_zip and viking_fs.write_file_bytes; update the loop to
call zf.read(original_entry_name) and keep using safe_zip_path for path checks
and directory handling (referencing info.filename, zip_path, safe_zip_path,
_validate_ovpack_member_path, get_viking_rel_path_from_zip, zf.read, and
viking_fs.write_file_bytes).
---
Minor comments:
In `@crates/ov_cli/src/tui/tree.rs`:
- Around line 69-72: The synthetic root node is using "/" as the FsEntry.uri
which makes FsEntry::name() return an empty string and leaks that display value
into selected_uri(); instead set the root TreeNode.entry.uri to the canonical
internal root identifier (do not use "/"), and change the rendering/path-display
logic to derive and display "/" as the label for the synthetic root (i.e., keep
TreeNode / FsEntry.uri canonical, but map that canonical URI to the "/" display
string when rendering and when computing row labels), ensuring selected_uri()
continues to return the canonical URI unchanged.
In `@crates/ov_cli/src/tui/ui.rs`:
- Around line 148-161: The title construction can display "1/0" when
app.vector_state.records is empty because it unconditionally uses
app.vector_state.cursor + 1; update the logic that builds title (the block using
app.vector_state.total_count, app.vector_state.cursor, and
app.vector_state.records.len()) to first check if
app.vector_state.records.is_empty() and produce " Vector Records (0/0) " (and
optionally include total when present) otherwise use the existing cursor+1 and
len() formatting; ensure this guard sits before the current title assignment so
the empty-state title is never computed from cursor+1.
In `@openviking/models/embedder/base.py`:
- Around line 322-328: The jittered delay can exceed max_delay because max_delay
was applied before jitter; update the backoff logic in the function that
computes delay (the block using base_delay, attempt, max_delay, jitter, random)
so that after applying jitter you clamp again: after "if jitter: delay = delay *
(0.5 + random.random())" add "delay = min(delay, max_delay)" (then compute
remaining_time and cap by that as before). This ensures the per-retry ceiling
(max_delay) holds even when jitter is enabled.
In `@openviking/models/embedder/openai_embedders.py`:
- Around line 149-152: Telemetry is currently recorded before validating and
consuming response data; modify the flow in the embedding call that uses
self.client.embeddings.create so you first build the list of EmbedResult objects
from response.data (e.g., iterate response.data and create
EmbedResult(dense_vector=item.embedding) for each item), validate the results as
needed, and only after successful iteration call
self._update_telemetry_token_usage(response) to record token usage; this ensures
_update_telemetry_token_usage is invoked only when response.data was
successfully processed.
- Around line 119-121: Telemetry update is happening before validating the
embedding response, so _update_telemetry_token_usage may record usage for a
failed/malformed response; modify the embedding flow in the method that calls
self.client.embeddings.create so you first validate response.data and extract
vector (e.g., ensure response.data is a non-empty list and
response.data[0].embedding exists) and only then call
self._update_telemetry_token_usage(response); alternatively, perform the
validation and raise a clear exception before any telemetry call to avoid
counting failed requests.
In `@openviking/retrieve/retrieval_stats.py`:
- Around line 26-33: The scoring stats are keyed off total_results while only
scores update total_score_sum, leaving min_score as inf and max_score at 0 for
negative scores; add a new scored_results_count (int) field and increment it
whenever scores are processed (alongside updating total_score_sum, min_score,
max_score), change any avg_score computation to divide by scored_results_count
(not total_results), and ensure final getters/serializers handle
scored_results_count == 0 by returning sensible defaults (e.g., 0.0 or None)
instead of inf/0; update references to total_score_sum, max_score, min_score,
and any avg_score logic in the RetrievalStats class to use scored_results_count.
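The `scored_results_count` change can be sketched as below. This is a reduced model of `RetrievalStats`, not the real class; initializing `max_score` to negative infinity (rather than 0) is part of making negative scores behave, and `avg_score` returns `None` when nothing has been scored:

```python
from dataclasses import dataclass

@dataclass
class RetrievalStats:
    total_results: int = 0
    scored_results_count: int = 0    # new field: count of scored items only
    total_score_sum: float = 0.0
    min_score: float = float("inf")
    max_score: float = float("-inf")

    def record_score(self, score: float) -> None:
        self.scored_results_count += 1
        self.total_score_sum += score
        self.min_score = min(self.min_score, score)
        self.max_score = max(self.max_score, score)

    @property
    def avg_score(self):
        if self.scored_results_count == 0:
            return None              # sensible default instead of inf / div-by-zero
        return self.total_score_sum / self.scored_results_count
```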
In `@tests/telemetry/test_resource_summary.py`:
- Around line 11-14: The test doesn't actually verify parsing of queue_status by
record_resource_wait_metrics because both request-stat helpers are stubbed and
assertions only check those stubs; update tests in test_resource_summary.py so
that either one of the request-stat collectors remains unpatched or add an
assertion that directly checks a value derived from queue_status (e.g.,
queue_status["Semantic"].error_count or queue_status["Embedding"].processed)
after calling record_resource_wait_metrics; ensure you reference the
queue_status fixture/variable and the record_resource_wait_metrics call and
assert that a metric or output reflects a queue_status field (for example assert
recorded metric equals queue_status["Semantic"].error_count) so the test will
fail if queue_status is not parsed.
In `@tests/unit/test_skill_processor_none.py`:
- Around line 47-63: The test test_parse_skill_long_raw_content_raises_oserror
asserts a POSIX-specific error message ("File name too long") when calling
SkillProcessor._parse_skill which will fail on Windows; change the test to
either remove the match so it simply expects any OSError (i.e., keep with
pytest.raises(OSError) without match) or add a platform guard
(pytest.mark.skipif(sys.platform == "win32", reason="POSIX-specific path length
limits")) so the assertion is only run on POSIX systems—update the test function
name test_parse_skill_long_raw_content_raises_oserror accordingly and adjust
imports if you choose the skip approach.
---
Nitpick comments:
In `@.gitignore`:
- Line 144: The .gitignore entry "docs/superpowers/" is unanchored and can match
directories anywhere; update that pattern to be rooted by changing
"docs/superpowers/" to "/docs/superpowers/" so it consistently anchors to the
repository root (matching the adjacent "/site" pattern) and prevents unintended
matches elsewhere.
In `@examples/opencode-memory-plugin/openviking-memory.ts`:
- Around line 9-10: Header sentence in openviking-memory.ts is awkward and uses
a nonstandard comma; update the top comment to a concise, grammatical
description—edit the header comment (the initial project description lines) to
use standard punctuation, consistent capitalization, and clearer phrasing (for
example: "Build an enterprise AI assistant for consumer brands with process
awareness and memory, serving product development through the pre-launch
lifecycle.") so the project metadata reads professionally and is easy to scan.
In `@openviking/console/app.py`:
- Around line 50-58: Extract the hardcoded upstream paths in
_should_default_telemetry into named constants and use them in the function to
avoid duplication; create a set like _TELEMETRY_DEFAULT_PATHS for exact-match
routes (e.g. "/api/v1/search/find", "/api/v1/resources"), and string constants
_TELEMETRY_SESSION_PREFIX and _TELEMETRY_COMMIT_SUFFIX for the
startswith/endswith checks, then update _should_default_telemetry to test
membership in _TELEMETRY_DEFAULT_PATHS and to use
startswith(_TELEMETRY_SESSION_PREFIX) and endswith(_TELEMETRY_COMMIT_SUFFIX) so
route values are defined once and can be reused elsewhere.
In `@openviking/models/embedder/volcengine_embedders.py`:
- Around line 21-34: Replace fragile string matching in is_429_error with a type
check against the SDK's ArkRateLimitError: import ArkRateLimitError from
volcenginesdkarkruntime and have is_429_error return True when
isinstance(exception, ArkRateLimitError); you can optionally keep the existing
string checks as a fallback for non-SDK exceptions, but primary detection must
use isinstance(exception, ArkRateLimitError) to reliably detect 429/rate-limit
errors.
In `@openviking/models/vlm/base.py`:
- Around line 83-90: The except block around the telemetry call in
openviking.models.vlm.base (the try that imports openviking.telemetry and calls
get_current_telemetry().add_token_usage(prompt_tokens, completion_tokens))
swallows all errors silently; update the handler to log the caught exception at
debug level so telemetry failures are recorded without impacting
inference—capture the exception and call the module/process logger (or
logging.debug) with a clear message and the exception info while preserving the
current behavior of not raising.
In `@openviking/server/routers/debug.py`:
- Around line 54-57: The parameter name `filter` in the debug_vector_count
function shadows Python's built-in filter(); rename the parameter (e.g., to
`filter_expr` or `query_filter`) and update all references inside
debug_vector_count and any callers to use the new name (`filter_expr`), ensuring
type annotation Optional[str] and the Depends(_ctx) signature remain unchanged;
search for usages of `filter` within openviking/server/routers/debug.py (and
related tests) to update variable references accordingly.
- Line 60: The local "import json" currently inside the request handler should
be moved to the module-level imports: add "import json" with the other
top-of-file imports and remove the inline import from the function body (the
function that currently contains the inline import/json usage). Update any
references to json in that handler to use the top-level import and run
lint/flake8 to ensure import ordering and no unused-import warnings.
In `@openviking/service/resource_service.py`:
- Around line 141-145: The telemetry set_error calls inside the timeout/except
handlers (calls to get_current_telemetry().set_error in the wait_complete
handler and the add_skill timeout branch, plus the other similar handlers) must
be best-effort: wrap each get_current_telemetry().set_error(...) invocation in
its own try/except that catches and swallows/logs any exceptions from telemetry
so they cannot mask the original business exception (re-raise or continue to
propagate the original timeout/processing error as before); update the three
occurrences that call set_error (the wait_complete exception handler, the
add_skill timeout branch, and the other similar handler) to use this defensive
pattern.
In `@openviking/storage/queuefs/embedding_msg.py`:
- Around line 9-24: The EmbeddingMsg dataclass currently defines a custom
__init__ that assigns self.id (so id isn't a declared dataclass field and asdict
won't include it); replace the custom initializer by declaring id as a dataclass
field with a default_factory that generates uuid4() and init=False (e.g., add an
id: str = field(default_factory=lambda: str(uuid4()), init=False)), keep
message, context_data, telemetry_id as dataclass fields, and remove the custom
__init__ so dataclass-generated __init__ and asdict(self) include all fields.
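The declared-field version can be sketched as follows (field types are assumed from the review text; the real `EmbeddingMsg` may differ):

```python
from dataclasses import asdict, dataclass, field
from typing import Optional
from uuid import uuid4

@dataclass
class EmbeddingMsg:
    message: str
    context_data: Optional[dict] = None
    telemetry_id: Optional[str] = None
    # A real dataclass field, so asdict() includes it; init=False keeps it
    # out of the generated __init__ while default_factory assigns a fresh id.
    id: str = field(default_factory=lambda: str(uuid4()), init=False)
```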
In `@openviking/storage/vikingdb_manager.py`:
- Around line 172-189: The class docstring for VikingDBManagerProxy is written
in Chinese; please translate it into English in place (maintain the same
docstring structure, examples and formatting) so it matches the rest of the
codebase; update the class-level docstring for VikingDBManagerProxy to an
English description that explains its purpose (tenant-bound proxy that injects
RequestContext into calls), how to initialize and use it (showing manager and
ctx usage), and keep API compatibility note and example unchanged aside from
language.
- Around line 402-421: The wrapper methods inconsistently pass the context
self._ctx positionally to the underlying manager methods (e.g., search_in_tenant
calling self._manager.search_in_tenant(self._ctx, ...)), which risks
argument-order bugs; change these wrappers (including search_in_tenant,
delete_uris, and any similar methods) to pass ctx as a keyword argument
(ctx=self._ctx) when calling the corresponding _manager methods (e.g.,
self._manager.search_in_tenant(ctx=self._ctx, query_vector=..., ...)) so all
calls use a consistent keyword-argument style.
In `@openviking/telemetry/operation.py`:
- Around line 218-237: The finish() method should guard against being called
multiple times: add a private boolean attribute _finished on the class, and
inside finish() acquire self._lock, check if self._finished and if so return the
previously created TelemetrySnapshot (store it on self, e.g., _snapshot) or
raise a clear exception; otherwise compute duration_ms using self._start_time,
build the summary with TelemetrySummaryBuilder (using
self._counters/_gauges/_error_*), set self._finished = True and cache the
constructed TelemetrySnapshot (with self.telemetry_id and summary) before
returning it to ensure subsequent calls return the same snapshot and do not
recompute durations.
In `@openviking/telemetry/registry.py`:
- Around line 11-33: The registry currently holds strong references in
_REGISTERED_TELEMETRY causing potential memory growth; change the implementation
so orphaned handles are automatically removed by either replacing
_REGISTERED_TELEMETRY with a weakref.WeakValueDictionary (so register_telemetry
stores the OperationTelemetry directly and entries drop when no other refs
exist) or by storing (handle, last_seen_timestamp) and adding a background
cleanup task function (e.g., cleanup_orphaned_telemetry) that periodically
removes entries older than a configurable TTL; update register_telemetry,
resolve_telemetry and unregister_telemetry to work with the chosen structure and
expose a config option or docstring describing the TTL/cleanup behavior if you
opt to only document rather than implement automatic cleanup.
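The `WeakValueDictionary` option can be sketched as below. `OperationTelemetry` here is a minimal stand-in; the key property is that entries vanish automatically once no other strong reference to the handle exists:

```python
import weakref

class OperationTelemetry:
    def __init__(self, telemetry_id):
        self.telemetry_id = telemetry_id

# Values are held weakly: an entry drops as soon as the handle is
# garbage-collected, so the registry cannot grow without bound.
_REGISTERED_TELEMETRY = weakref.WeakValueDictionary()

def register_telemetry(handle):
    _REGISTERED_TELEMETRY[handle.telemetry_id] = handle

def resolve_telemetry(telemetry_id):
    return _REGISTERED_TELEMETRY.get(telemetry_id)

def unregister_telemetry(telemetry_id):
    _REGISTERED_TELEMETRY.pop(telemetry_id, None)
```

One caveat of this choice: callers must hold a strong reference to the handle for as long as they expect `resolve_telemetry` to find it.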
In `@openviking/telemetry/resource_summary.py`:
- Around line 55-64: build_queue_status_payload currently assumes each value `s`
in `status` has attributes `processed`, `error_count`, and `errors`, which will
raise AttributeError for malformed entries; update the function to handle bad
inputs defensively by using getattr(s, "processed", 0) and getattr(s,
"error_count", 0), coercing `errors` via getattr(s, "errors", []) (or, if `s`
is a dict, s.get("errors", [])), ensuring `errors` is iterable (skipping
non-iterable values), and mapping each error to {"message": getattr(e,
"message", str(e))} so the function always returns a well-formed payload.
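One way to sketch that defensive handling — the payload shape and the attribute/dict duality are taken from the comment; the exact real signature may differ:

```python
def build_queue_status_payload(status: dict) -> dict:
    payload = {}
    for name, s in status.items():
        # Accept both attribute-style status objects and plain dicts, falling
        # back to safe defaults so malformed entries never raise AttributeError.
        if isinstance(s, dict):
            get = lambda k, d: s.get(k, d)
        else:
            get = lambda k, d: getattr(s, k, d)
        errors = get("errors", []) or []
        if not hasattr(errors, "__iter__"):
            errors = []  # skip non-iterable junk rather than crashing
        payload[name] = {
            "processed": get("processed", 0),
            "error_count": get("error_count", 0),
            "errors": [{"message": getattr(e, "message", str(e))} for e in errors],
        }
    return payload
```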
In `@openviking/telemetry/runtime.py`:
- Around line 35-40: The record_histogram method currently appends values to
self._histograms[key] indefinitely causing potential unbounded memory growth;
modify this by introducing a bounded storage policy: either cap the list size
per key (e.g., drop oldest when length exceeds a MAX_HISTOGRAM_SIZE) or
implement reservoir sampling/rolling window logic when appending in
record_histogram, and add a periodic flush/aggregate method (e.g.,
flush_histograms or aggregate_histograms) that can be called by a background
thread or external scheduler to reduce memory (ensure thread-safety by using the
existing _lock around both updates and flushes); update references to
_histograms, _key, and record_histogram to use the new bounded/flush behavior
and document/configure the MAX_HISTOGRAM_SIZE or sampling parameters.
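The simplest bounded-storage policy is a `deque(maxlen=...)` per key, which drops the oldest sample once full. The sketch below assumes the `_lock`/`_histograms` layout named in the comment; `MAX_HISTOGRAM_SIZE` and the flush aggregation are illustrative choices, not the project's settings:

```python
import threading
from collections import deque

MAX_HISTOGRAM_SIZE = 1000  # assumed cap; make configurable in the real class


class TelemetryRuntime:
    def __init__(self):
        self._lock = threading.Lock()
        self._histograms: dict = {}

    def record_histogram(self, key: str, value: float) -> None:
        with self._lock:
            # deque(maxlen=...) silently evicts the oldest value when full,
            # bounding memory per key.
            bucket = self._histograms.setdefault(key, deque(maxlen=MAX_HISTOGRAM_SIZE))
            bucket.append(value)

    def flush_histograms(self) -> dict:
        # Aggregate and clear under the same lock used by record_histogram,
        # so a background flusher cannot race concurrent writers.
        with self._lock:
            out = {
                key: {
                    "count": len(vals),
                    "min": min(vals),
                    "max": max(vals),
                    "avg": sum(vals) / len(vals),
                }
                for key, vals in self._histograms.items()
                if vals
            }
            self._histograms.clear()
            return out
```

Reservoir sampling is the alternative when you need an unbiased sample of the whole stream rather than the most recent window.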
In `@openviking/utils/resource_processor.py`:
- Around line 155-170: The parse telemetry (telemetry.set calls) must run before
the early return when parse_result.temp_dir_path is falsy: compute and emit
"resource.parse.duration_ms" and "resource.parse.warnings_count" immediately
after parse completes (using parse_start and len(parse_result.warnings or [])),
then proceed to check parse_result.temp_dir_path and add errors/warnings and
return; update the block around parse_result, parse_start, and telemetry to call
telemetry.set(...) before the if not parse_result.temp_dir_path early-return
path so the “no content generated” case still emits duration and warnings
metrics.
In `@openviking/utils/skill_processor.py`:
- Around line 69-73: The telemetry timing for each stage (_parse_skill,
_build_overview, _write_files, _index_skill) is only set on the success path;
wrap each stage execution in a try/finally: record start = time.perf_counter()
before calling the stage, execute the stage in the try block, and in the finally
block compute and emit telemetry.set("skill.<stage>.duration_ms", ...) so the
duration is recorded even when the stage throws; apply this pattern to the parse
block around _parse_skill and to the blocks covering _build_overview (overview),
file writes (write), and indexing (index) referenced in the 94-99 and 103-129
ranges.
In `@pyproject.toml`:
- Line 250: The pytest minimum version is inconsistent: update the pytest
constraint so both the dependency group entry ("pytest>=9.0.2" in
[dependency-groups]) and the test extras entry (pytest in
[project.optional-dependencies] test) use the same minimum (e.g., change the
test extras constraint from "pytest>=7.0.0" to "pytest>=9.0.2"); ensure both
occurrences of the symbol pytest have identical constraint strings to keep
environments consistent.
In `@tests/client/test_windows_path_handling.py`:
- Around line 83-127: The tests only validate manual normalization and don't
exercise the real extraction path; add an end-to-end test that creates a ZIP
with backslash-containing entries and calls import_ovpack (from local_fs.py) to
perform the import, then assert the resulting extracted structure (e.g.,
directories/files under the returned base path) matches expectations;
alternatively, if you prefer not to exercise import_ovpack here, rename the
TestZipExtractionPathHandling class to clarify it only tests normalization
helpers rather than production extraction.
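The real `import_ovpack` signature isn't shown here, so the sketch below only demonstrates the behavior such an end-to-end test would assert: build a ZIP whose entry names contain backslashes, extract it through a normalizing helper (hypothetical, standing in for the production path), and check the resulting directory tree:

```python
import os
import tempfile
import zipfile


def extract_normalized(zip_path: str, dest: str) -> list:
    """Extract a ZIP, normalizing backslash entry names to the host separator."""
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # ZIPs produced on Windows by non-conforming tools can contain
            # backslashes; normalize before joining path components.
            rel = name.replace("\\", "/")
            target = os.path.join(dest, *rel.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as f:
                f.write(zf.read(name))
            extracted.append(os.path.relpath(target, dest))
    return extracted
```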
In `@tests/misc/test_process_lock.py`:
- Around line 41-43: The test currently hard-codes PID 1 by writing "1" into
lock_path which is brittle; instead spawn a short-lived subprocess (e.g., via
subprocess.Popen or multiprocessing.Process), obtain its pid, write that pid
into lock_path, and ensure you terminate/wait for the subprocess before/after
assertions so the lock check is deterministic; update the code around lock_path
usage in tests/misc/test_process_lock.py to create the helper process, use its
pid instead of "1", and clean up the process to avoid leaks.
In `@tests/server/conftest.py`:
- Around line 57-61: The fake Embedder's embed and embed_batch currently return
the identical vector for every input which prevents meaningful ranking; update
embed(self, text: str) to deterministically derive a distinct dense_vector from
the input (e.g., use a stable hash of text to seed a simple RNG or to generate
per-index values, then map to floats and normalize to length dimension) and have
embed_batch(self, texts: list[str]) call embed for each text (already present)
to return per-input discriminative EmbedResult objects; keep return type
EmbedResult and preserve the existing dimension and determinism so tests remain
reproducible.
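A deterministic, discriminative fake along those lines — `EmbedResult` is simplified to just the `dense_vector` field the comment mentions, and the hash-seeded-RNG scheme is one of several acceptable derivations:

```python
import hashlib
import math
import random
from dataclasses import dataclass


@dataclass
class EmbedResult:
    dense_vector: list


class FakeEmbedder:
    def __init__(self, dimension: int = 8):
        self.dimension = dimension

    def embed(self, text: str) -> EmbedResult:
        # Seed a private RNG from a stable hash: the same text always maps to
        # the same vector (reproducible tests), while distinct texts get
        # distinct vectors (meaningful ranking).
        seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
        rng = random.Random(seed)
        vec = [rng.uniform(-1.0, 1.0) for _ in range(self.dimension)]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return EmbedResult([v / norm for v in vec])

    def embed_batch(self, texts) -> list:
        return [self.embed(t) for t in texts]
```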
In `@tests/session/test_memory_dedup_actions.py`:
- Around line 26-28: There are two different test-context creators in this
module (the module-level ctx = make_test_ctx() and the local helper
_make_ctx()), which can yield different user/account_id values and cause
inconsistent expectations; pick one approach and make all tests use it — either
remove the module-level ctx and replace its usages with calls to _make_ctx(), or
change _make_ctx() to call make_test_ctx() (or make_test_ctx() accept
account_id) so both produce the same account_id/user; update references to the
variable name (ctx or _make_ctx()) accordingly and ensure assertions expecting
account_id="acc1" match the chosen unified context.
In `@tests/session/test_memory_extractor_response_types.py`:
- Around line 59-64: Update the test docstring in
test_none_fallback_yields_empty to clearly state that using None or {} is
evaluated before calling _normalize_parsed_data (so the function receives an
empty dict) — e.g., change the docstring to explain we are verifying that an
empty dict (as produced by the None-or-{} fallback) is passed through unchanged;
leave the call to _normalize_parsed_data(None or {}) and assertions as-is and
only clarify the docstring text.
In `@tests/session/test_session_commit.py`:
- Around line 9-11: The module-level creation of ctx via make_test_ctx() in
tests/session/test_session_commit.py can leak mutable state between tests;
replace the module-scoped ctx with a pytest fixture (e.g., define a fixture
function named ctx with `@pytest.fixture`, using the default "function" scope,
or "module" scope if sharing is intentional) and update tests to accept ctx as
a parameter so each test gets its own context (or a controlled shared context)
instead of the global ctx variable.
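The fixture conversion looks like this; `make_test_ctx` is stubbed as a dict factory since the real one isn't shown:

```python
import pytest


def make_test_ctx():
    # Stand-in for the project's context factory; returns fresh mutable state.
    return {"account_id": "acc1", "state": {}}


@pytest.fixture
def ctx():
    # Function-scoped by default: each test gets its own context, so mutations
    # made in one test cannot leak into the next.
    return make_test_ctx()


def test_commit_sees_clean_ctx(ctx):
    ctx["state"]["dirty"] = True
    assert ctx["account_id"] == "acc1"
```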
In `@tests/telemetry/test_layering_rules.py`:
- Around line 29-30: Replace the bare assertion "assert offenders == []" with an
assertion that includes a descriptive failure message referencing the offenders
variable (e.g., use an assertion message or f-string) so test failures show
"Expected no offenders but found: {offenders}" (or similar) to make CI/debugging
clearer; update the assertion in the test where the offenders list is checked.
In `@tests/unit/retrieve/test_retrieval_stats.py`:
- Around line 115-121: The test helper _setup_collector mutates the
module-private _collector in openviking.retrieve.retrieval_stats which tightly
couples tests to internal implementation; add a public setter in the module
(e.g., set_collector_for_testing) that accepts a RetrievalStatsCollector and
assigns it to the module's collector, then update the test to call
set_collector_for_testing(collector) instead of touching _collector directly so
tests remain stable if the module internals are refactored.
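The test seam can be as small as this; the module layout and function names below mirror the suggestion but are illustrative:

```python
class RetrievalStatsCollector:
    def __init__(self):
        self.queries = 0

    def record_query(self):
        self.queries += 1


# Module-private collector, as in the current implementation.
_collector = RetrievalStatsCollector()


def get_collector() -> RetrievalStatsCollector:
    return _collector


def set_collector_for_testing(collector: RetrievalStatsCollector) -> None:
    """Explicit test seam: swap the module-level collector without tests
    reaching into the private _collector attribute."""
    global _collector
    _collector = collector
```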
In `@tests/unit/tool_skill/test_tool_skill_calibration.py`:
- Around line 66-71: The test function name
test_suffix_like_weather_usage_does_not_match_weather is inconsistent with its
assertion that expects skill_name == "weather"; update the test name (or the
assertion) so intent is clear: either rename the test to something like
test_suffix_like_weather_usage_matches_weather, or change the assertion to
reflect non-matching behavior. Locate the test function in
tests/unit/tool_skill/test_tool_skill_calibration.py and adjust the test name
(or the expected value) accordingly; the related symbols to check are
SessionCompressor, _get_tool_skill_info, tool_parts and candidate used inside
the test to ensure the new name matches the actual behavior being asserted.
  # Create default config with empty bot section
  default_config = Config()
- save_config(default_config, config_path)
+ save_config(default_config, config_path, include_defaults=True)
Persisting defaults at bootstrap can block vlm overrides.
With Line 56 saving defaults and Line 178 dumping all default fields, bot.agents.model is persisted on first run. Later, _merge_vlm_model_config() exits early when a model already exists, so the root vlm.model/provider may never apply for new installs.
💡 Minimal fix to preserve current precedence
- save_config(default_config, config_path, include_defaults=True)
+ save_config(default_config, config_path, include_defaults=False)

Also applies to: 178-178
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bot/vikingbot/config/loader.py` at line 56, The bootstrap currently persists
default_config too early via save_config(default_config, config_path,
include_defaults=True), which causes bot.agents.model to be written and prevents
_merge_vlm_model_config() from applying vlm/provider overrides; fix by not
persisting defaults before merging: either change the save_config call to use
include_defaults=False (or remove it) and ensure _merge_vlm_model_config() runs
and can mutate the in-memory config first, or postpone the save_config call
until after _merge_vlm_model_config() has run; ensure the change prevents
writing bot.agents.model/default provider before merge.
api.on("before_agent_start", async (event: { messages?: unknown[]; prompt: string }, ctx?: { agentId?: string }) => {
  // Dynamically switch agent identity for multi-agent memory isolation.
  // In multi-agent gateway deployments, the hook context carries the current
  // agent's ID so we route memory operations to the correct agent_space.
  const hookAgentId = ctx?.agentId;
  if (hookAgentId) {
    const client = await getClient();
    client.setAgentId(hookAgentId);
    api.logger.info?.(`memory-openviking: switched to agentId=${hookAgentId} for recall`);
Don't switch agentId on the shared plugin client.
getClient() returns the same OpenVikingClient for the plugin instance, so these hook-level setAgentId() calls can interleave and overwrite each other's headers and resolved-space cache. That can route recall/capture to the wrong agent space, and hooks without ctx.agentId inherit whatever agent ran last. Keep agentId request-scoped, or use distinct client/cache entries per agent instead of mutating the shared instance.
Also applies to: 486-493
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/openclaw-memory-plugin/index.ts` around lines 357 - 365, The hook is
mutating a shared OpenVikingClient by calling client.setAgentId(hookAgentId),
causing cross-request interference; instead make agentId request-scoped by
either creating/obtaining a per-agent client or passing agentId into the memory
operations rather than mutating the shared instance. Update the
before_agent_start and the other hook (the block around the later capture/recall
code) to use a request-scoped client accessor (e.g., getClientForAgent(agentId))
or call methods that accept an agentId parameter instead of calling
OpenVikingClient.setAgentId on the shared client, and ensure any resolved-space
cache is keyed by agentId rather than a global field.
def _validate_telemetry(telemetry: TelemetryRequest) -> TelemetryRequest:
    normalize_telemetry_request(telemetry)
    return telemetry
Return the normalized telemetry payload.
normalize_telemetry_request() is called here, but _validate_telemetry() always returns the original object. Any non-mutating normalization path still sends the raw request shape to the server.
Possible fix
`@staticmethod`
def _validate_telemetry(telemetry: TelemetryRequest) -> TelemetryRequest:
- normalize_telemetry_request(telemetry)
- return telemetry
+ normalized = normalize_telemetry_request(telemetry)
+ return telemetry if normalized is None else normalized
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking_cli/client/http.py` around lines 225 - 227, The function
_validate_telemetry currently calls normalize_telemetry_request(telemetry) but
returns the original telemetry object; change it to capture and return the
normalized result—i.e., call normalized = normalize_telemetry_request(telemetry)
(or return the call directly) and then return that normalized payload so any
non-mutating normalization is sent to the server; update references in
_validate_telemetry to use the normalized value.
response_data = self._handle_response_data(response)
return FindResult.from_dict(response_data.get("result") or {})
Don't drop telemetry from find() and search() responses.
Both methods build FindResult from response_data["result"] and discard the envelope's telemetry, so the new telemetry= parameter is effectively a no-op for these client APIs. Either teach FindResult to carry telemetry or return an envelope consistently here.
Also applies to: 609-610
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking_cli/client/http.py` around lines 576 - 577, The response
envelope's telemetry is currently discarded in the client methods that call
self._handle_response_data (e.g., in find() and search()); extract telemetry =
response_data.get("telemetry") and ensure it is passed into the FindResult
creation instead of dropping it—either by calling
FindResult.from_dict(result_dict, telemetry=telemetry) (or otherwise setting a
telemetry field on the returned FindResult) or by returning the full envelope
consistently; apply the same change in both places that call
FindResult.from_dict after _handle_response_data.
  "--request-timeout-sec",
  type=float,
- default=30.0,
+ default=3600.0,
  help="Upstream request timeout in seconds",
Default upstream timeout is too high for console traffic.
Line 43 sets a 1-hour default timeout, which can keep stalled upstream calls alive too long and degrade console responsiveness/resource usage under failure conditions. Prefer a lower safe default and require explicit opt-in for long-running operations.
Suggested adjustment
parser.add_argument(
"--request-timeout-sec",
type=float,
- default=3600.0,
+ default=30.0,
help="Upstream request timeout in seconds",
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/console/bootstrap.py` around lines 41 - 44, The default for the
"--request-timeout-sec" argument is too high (3600.0); change its default to a
much lower safe value (e.g., 30.0 or 60.0) in the argparse add_argument call
that defines "--request-timeout-sec" and update the help string to note that
longer timeouts require explicit opt-in for long-running operations. Locate the
add_argument invocation for "--request-timeout-sec" in bootstrap.py and change
the default float value and help text accordingly.
if self._bound_account_id and not payload.get("account_id"):
    payload["account_id"] = self._bound_account_id
logger.debug(
    f"[_SingleAccountBackend.upsert] Final payload.account_id={payload.get('account_id')}"
)
Reject cross-tenant account_id on bound upserts.
A backend bound to tenant A only sets payload["account_id"] when it is missing. If the caller supplies tenant B explicitly, this path writes into tenant B's partition.
Possible fix
- if self._bound_account_id and not payload.get("account_id"):
- payload["account_id"] = self._bound_account_id
+ if self._bound_account_id:
+ incoming_account_id = payload.get("account_id")
+ if incoming_account_id and incoming_account_id != self._bound_account_id:
+ raise PermissionError("account_id does not match bound tenant")
+ payload["account_id"] = self._bound_account_id🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/storage/viking_vector_index_backend.py` around lines 137 - 141,
The upsert currently only injects self._bound_account_id when payload lacks
"account_id", allowing callers to override and write into other tenants; modify
_SingleAccountBackend.upsert to validate that if self._bound_account_id is set
and payload.get("account_id") exists and differs from it, the method rejects the
request (raise a ValueError/ClientError) instead of proceeding — enforce the
check using self._bound_account_id and payload["account_id"] and return/raise
immediately with a clear error message when they mismatch.
async def health_check(self) -> bool:
    try:
        await self.collection_exists()
        return True
    except Exception:
        return False
Return the actual health result.
This returns True whenever collection_exists() doesn't raise, even when it returns False, so an uninitialized backend is reported as healthy.
Possible fix
async def health_check(self) -> bool:
try:
- await self.collection_exists()
- return True
+ return await self.collection_exists()
except Exception:
return False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/storage/viking_vector_index_backend.py` around lines 380 - 385,
The health_check implementation always returns True if collection_exists()
doesn't raise, ignoring its boolean result; change health_check (in
VikingVectorIndexBackend) to await and capture the return value from
collection_exists() and return that boolean, and only return False on exceptions
(optionally logging the exception) so that an uninitialized backend
(collection_exists() == False) is reported unhealthy.
async def count(
    self,
    filter: Optional[Dict[str, Any] | FilterExpr] = None,
    *,
    ctx: Optional[RequestContext] = None,
) -> int:
    if ctx:
        backend = self._get_backend_for_context(ctx)
    else:
        backend = self._get_default_backend()
    return await backend.count(filter=filter)

async def clear(self, *, ctx: Optional[RequestContext] = None) -> bool:
    if ctx:
        backend = self._get_backend_for_context(ctx)
    else:
        backend = self._get_default_backend()
    return await backend.clear()
Don't silently scope ctx-free count() and clear() to the default tenant.
When ctx is omitted, both methods fall back to _get_default_backend(), which is bound to account "default". That changes these APIs from collection-wide operations to single-tenant ones and will undercount or partially clear data. Prefer _get_root_backend() for the no-context path, or make ctx mandatory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/storage/viking_vector_index_backend.py` around lines 620 - 637,
The count() and clear() methods currently fall back to _get_default_backend()
when ctx is None which scopes operations to the "default" account; change the
no-ctx path to use _get_root_backend() instead so these calls operate
collection-wide (or alternatively make ctx mandatory). Specifically, update both
count() and clear() to call self._get_root_backend() when ctx is falsy (keeping
the existing use of _get_backend_for_context(ctx) when ctx is provided), and
ensure return await backend.count(...) and return await backend.clear(...)
remain unchanged.
existing_pid = _read_pid_file(lock_path)
if existing_pid and existing_pid != my_pid and _is_pid_alive(existing_pid):
    raise DataDirectoryLocked(
        f"Another OpenViking process (PID {existing_pid}) is already using "
        f"the data directory '{data_dir}'. Running multiple OpenViking "
        f"instances on the same data directory causes silent storage "
        f"contention and data corruption.\n\n"
        f"To fix this, use one of these approaches:\n"
        f"  1. Use HTTP mode: start a single openviking-server and connect "
        f"via --transport http (recommended for multi-session hosts)\n"
        f"  2. Use separate data directories for each instance\n"
        f"  3. Stop the other process (PID {existing_pid}) first"
    )

# Write our PID (overwrites stale lock from a dead process).
try:
    os.makedirs(data_dir, exist_ok=True)
    with open(lock_path, "w") as f:
        f.write(str(my_pid))
except OSError as exc:
    logger.warning("Could not write PID lock %s: %s", lock_path, exc)
return lock_path
Lock acquisition is race-prone and currently fails open on write errors.
The read -> check -> open("w") flow is non-atomic, so two processes can both “acquire” the lock. Also, returning success when writing the lock fails removes the safety guarantee and can reintroduce data corruption.
🔧 Proposed fix (atomic create + fail-closed)
def acquire_data_dir_lock(data_dir: str) -> str:
@@
- existing_pid = _read_pid_file(lock_path)
- if existing_pid and existing_pid != my_pid and _is_pid_alive(existing_pid):
- raise DataDirectoryLocked(
- f"Another OpenViking process (PID {existing_pid}) is already using "
- f"the data directory '{data_dir}'. Running multiple OpenViking "
- f"instances on the same data directory causes silent storage "
- f"contention and data corruption.\n\n"
- f"To fix this, use one of these approaches:\n"
- f" 1. Use HTTP mode: start a single openviking-server and connect "
- f"via --transport http (recommended for multi-session hosts)\n"
- f" 2. Use separate data directories for each instance\n"
- f" 3. Stop the other process (PID {existing_pid}) first"
- )
-
- # Write our PID (overwrites stale lock from a dead process).
- try:
- os.makedirs(data_dir, exist_ok=True)
- with open(lock_path, "w") as f:
- f.write(str(my_pid))
- except OSError as exc:
- logger.warning("Could not write PID lock %s: %s", lock_path, exc)
- return lock_path
+ os.makedirs(data_dir, exist_ok=True)
+ while True:
+ try:
+ fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+ with os.fdopen(fd, "w") as f:
+ f.write(str(my_pid))
+ break
+ except FileExistsError:
+ existing_pid = _read_pid_file(lock_path)
+ if existing_pid == my_pid:
+ break # re-entrant acquire by same process
+ if existing_pid and _is_pid_alive(existing_pid):
+ raise DataDirectoryLocked(
+ f"Another OpenViking process (PID {existing_pid}) is already using "
+ f"the data directory '{data_dir}'. Running multiple OpenViking "
+ f"instances on the same data directory causes silent storage "
+ f"contention and data corruption.\n\n"
+ f"To fix this, use one of these approaches:\n"
+ f" 1. Use HTTP mode: start a single openviking-server and connect "
+ f"via --transport http (recommended for multi-session hosts)\n"
+ f" 2. Use separate data directories for each instance\n"
+ f" 3. Stop the other process (PID {existing_pid}) first"
+ )
+ # stale/corrupt lock; try to replace
+ try:
+ os.remove(lock_path)
+ except FileNotFoundError:
+ pass
+ except OSError as exc:
+ raise RuntimeError(f"Could not acquire PID lock {lock_path}: {exc}") from exc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/utils/process_lock.py` around lines 58 - 79, The current read ->
check -> open("w") flow in acquire lock is race-prone and returns success when
writing fails; replace it with an atomic create and fail-closed behavior:
attempt to create the lock file using an atomic open (os.open(lock_path,
os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)), write my_pid bytes and close; if
os.open raises FileExistsError, re-read via _read_pid_file(lock_path) and if the
existing PID is alive (use _is_pid_alive) raise DataDirectoryLocked (including
existing_pid and data_dir), otherwise remove the stale file and retry the atomic
create once; any OSError when creating or writing the lock should raise an
exception (do not return lock_path) so callers know acquisition failed. Use the
existing symbols _read_pid_file, _is_pid_alive, DataDirectoryLocked, lock_path,
my_pid, and data_dir to locate and implement this change.
def test_error_message_includes_remediation(self):
    with tempfile.TemporaryDirectory() as tmpdir:
        lock_path = os.path.join(tmpdir, LOCK_FILENAME)
        with open(lock_path, "w") as f:
            f.write("1")
        try:
            acquire_data_dir_lock(tmpdir)
        except DataDirectoryLocked as exc:
            msg = str(exc)
            assert "openviking-server" in msg
            assert "separate data directories" in msg
This test can pass without validating the expected exception path.
If acquire_data_dir_lock() does not raise, the test currently succeeds. Add an explicit failure when no DataDirectoryLocked is raised.
✅ Minimal fix
def test_error_message_includes_remediation(self):
@@
try:
acquire_data_dir_lock(tmpdir)
+ raise AssertionError("Should have raised DataDirectoryLocked")
except DataDirectoryLocked as exc:
msg = str(exc)
assert "openviking-server" in msg
assert "separate data directories" in msg🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/misc/test_process_lock.py` around lines 51 - 61, The test
test_error_message_includes_remediation currently will pass if
acquire_data_dir_lock(tmpdir) does not raise; update it to explicitly fail when
no DataDirectoryLocked is raised by wrapping the call in a try/except and adding
a fail path (e.g., call to pytest.fail or raising AssertionError) immediately
after acquire_data_dir_lock(tmpdir) so the test only succeeds when the except
DataDirectoryLocked as exc branch runs and validates the message; reference the
test function name test_error_message_includes_remediation and the exception
class DataDirectoryLocked as the places to change.
Summary

Rebased onto volcengine/OpenViking main (17 commits) — resolved 5 merge conflicts across `collection_schemas.py`, `embedding_msg.py`, `embedding_msg_converter.py`, `pyproject.toml`, and `uv.lock` (deferred until the add-resource pipeline supports it).

PR volcengine#607 Review Comment Resolutions

r2936536550 — `embed_multimodal` parts API pattern
Confirmed via the googleapis/python-genai official test suite that the existing implementation is correct. Added TODO (April 2026) to upgrade `vectorize` to carry `parts: List[Part]` for interleaved text+media sequences from PDF chunk pipelines, with a link to the official Gemini embedding aggregation docs.

r2936556516 — PDF 6-page hard limit
Added an explicit `elif is_pdf:` block in `vectorize_file()` with a comment documenting add-resource pipeline support (planned April 2026).

r2936557627 — Remove hardcoded `== "gemini"` string check
Replaced with a TODO comment that explicitly references the `embedder.supports_multimodal` property as the correct future gate (not a string provider comparison). The comment also notes that `doubao-embedding-vision` and other providers expose the same property, making string matching fragile.

Key Changes
- `openviking/models/embedder/gemini_embedders.py`
- `openviking/utils/embedding_utils.py` — gate on `embedder.supports_multimodal`
- `openviking/storage/queuefs/embedding_msg.py` — `media_uri`/`media_mime_type`; added `telemetry_id`; fixed `id` preservation in `to_dict`/`from_dict`
- `openviking/storage/queuefs/embedding_msg_converter.py` — `telemetry_id`
- `openviking/storage/collection_schemas.py`
- `tests/unit/test_embedding_msg.py`
- `tests/storage/test_embedding_msg_converter_tenant.py`
- `tests/unit/test_embedding_utils_mime.py` — `_infer_image_mime` only
- `tests/storage/test_collection_schemas.py`

Test plan
- `test_embedding_msg_roundtrip_preserves_id` — verifies id is preserved through serialize/deserialize
- `test_embedding_msg_telemetry_id_roundtrip` — verifies telemetry_id round-trips
- `test_converter_text_only_vectorize` — verifies EmbeddingMsgConverter produces text-only messages
- `test_infer_image_mime` — verifies MIME inference for image extensions

April 2026 Follow-ups (tracked via TODO comments)
- Gate `vectorize_file` using the `embedder.supports_multimodal` property (not string match)
- Add `parts: List[Part]` to the `Vectorize` dataclass for the PDF chunk pipeline

Summary by CodeRabbit
- New Features
- Improvements
- Bug Fixes
- Documentation