
fix(gemini): scope-reduce to TextEmbedder, resolve PR #607 review comments #3

Closed

chethanuk wants to merge 19 commits into main from fix/gemini-text-embedder-scope

Conversation

@chethanuk
Owner

@chethanuk chethanuk commented Mar 15, 2026

Summary

  • Merged upstream volcengine/OpenViking main (17 commits) — resolved 5 merge conflicts across collection_schemas.py, embedding_msg.py, embedding_msg_converter.py, pyproject.toml, uv.lock
  • Scope reduction: Gemini Embedding 2 is a TextEmbedder first; multimodal capability lives on the class but full parts-aggregation dispatch is deferred to April 2026 when the add-resource pipeline supports it
  • Resolved all 3 review comments on PR volcengine/OpenViking#607 (feat(embedder): Gemini Embedding 2 multimodal support for text + image/video/audio/PDF): r2936536550, r2936556516, r2936557627

PR volcengine#607 Review Comment Resolutions

r2936536550 — embed_multimodal parts API pattern

Confirmed via googleapis/python-genai official test suite that the existing implementation is correct:

types.Content(parts=[Part.from_text(text=...), Part.from_bytes(data=..., mime_type=...)])

Added TODO(April 2026) to upgrade vectorize to carry parts: List[Part] for interleaved text+media sequences from PDF chunk pipelines, with link to official Gemini embedding aggregation docs.
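The confirmed 1-text + 1-media pattern can be illustrated with a small sketch. `Part` and `Content` below are simplified stand-ins for google-genai's `types.Part`/`types.Content`, and `build_embed_content` is a hypothetical helper, not OpenViking or SDK API:

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-ins mirroring the shape of google-genai's types.Part / types.Content;
# the real SDK builds parts via Part.from_text(...) and Part.from_bytes(...).
@dataclass
class Part:
    text: Optional[str] = None
    data: Optional[bytes] = None
    mime_type: Optional[str] = None

@dataclass
class Content:
    parts: List[Part]

def build_embed_content(
    text: str, media: Optional[bytes] = None, mime_type: Optional[str] = None
) -> Content:
    """Assemble a 1-text (+ optional 1-media) Content, matching the current
    embed_multimodal contract; interleaved parts come in April 2026."""
    parts = [Part(text=text)]
    if media is not None:
        parts.append(Part(data=media, mime_type=mime_type))
    return Content(parts=parts)
```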

r2936556516 — PDF 6-page hard limit

Added explicit elif is_pdf: block in vectorize_file() with comment documenting:

  • Gemini Embedding 2 hard-limits PDFs to 6 pages per request
  • Requires splitting into ≤6-page chunks + multi-vector storage
  • Needs add-resource pipeline support (planned April 2026)
  • Falls back to text summary until then
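The branch layout described above can be sketched as follows. This is a hypothetical outline, not the actual `vectorize_file` implementation; the embedder interface and MIME inference are illustrative stand-ins:

```python
import mimetypes

def vectorize_file(path, summary, embedder):
    """Sketch of the dispatch structure described above (illustrative only)."""
    mime, _ = mimetypes.guess_type(path)
    is_media = bool(mime) and mime.split("/")[0] in ("image", "video", "audio")
    is_pdf = mime == "application/pdf"
    if is_media:
        # TODO(April 2026): gate on embedder.supports_multimodal instead of
        # falling back (see r2936557627).
        return embedder.embed_text(summary)
    elif is_pdf:
        # Gemini Embedding 2 hard-limits PDFs to 6 pages per request; proper
        # support needs <=6-page chunk splitting plus multi-vector storage in
        # the add-resource pipeline (planned April 2026). Fall back to the
        # text summary until then.
        return embedder.embed_text(summary)
    return embedder.embed_text(summary)
```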

r2936557627 — Remove hardcoded == "gemini" string check

Replaced with a TODO comment that explicitly references embedder.supports_multimodal property as the correct future gate (not string provider comparison). The comment also notes that doubao-embedding-vision and other providers expose the same property, making string matching fragile.
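The capability-gate pattern the TODO points to can be sketched like this. Class and function names are illustrative, not the actual OpenViking classes:

```python
class TextEmbedder:
    supports_multimodal = False  # base default: text-only

class GeminiDenseEmbedder(TextEmbedder):
    supports_multimodal = True

class DoubaoVisionEmbedder(TextEmbedder):
    supports_multimodal = True  # same property, different provider

def should_dispatch_multimodal(embedder) -> bool:
    # Capability gate: works for any provider exposing the property, unlike
    # `(provider or "").lower() == "gemini"`, which silently excludes
    # doubao-embedding-vision and any future multimodal provider.
    return getattr(embedder, "supports_multimodal", False)
```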

Key Changes

File | Change
openviking/models/embedder/gemini_embedders.py | Added TODO(April 2026) referencing official parts API pattern
openviking/utils/embedding_utils.py | Split media block into IMAGE/VIDEO/AUDIO + PDF elif; TODO references embedder.supports_multimodal
openviking/storage/queuefs/embedding_msg.py | Resolved conflict: removed media_uri/media_mime_type, added telemetry_id; fixed id preservation in to_dict/from_dict
openviking/storage/queuefs/embedding_msg_converter.py | Text-only pipeline; passes telemetry_id
openviking/storage/collection_schemas.py | Resolved conflict: upstream telemetry architecture + text-only dequeue path
tests/unit/test_embedding_msg.py | Removed media field tests; added id-preservation + telemetry roundtrip tests
tests/storage/test_embedding_msg_converter_tenant.py | Removed media tests; added text-only converter tests
tests/unit/test_embedding_utils_mime.py | Reduced to _infer_image_mime only
tests/storage/test_collection_schemas.py | Removed multimodal path tests (no longer applicable)

Test plan

  • 216/220 tests pass (4 pre-existing failures: AGFS binding missing in worktree env + 2 upstream DAG test failures)
  • All modified files linted/type-clean
  • test_embedding_msg_roundtrip_preserves_id — verifies id is preserved through serialize/deserialize
  • test_embedding_msg_telemetry_id_roundtrip — verifies telemetry_id round-trips
  • test_converter_text_only_vectorize — verifies EmbeddingMsgConverter produces text-only messages
  • test_infer_image_mime — verifies MIME inference for image extensions

April 2026 Follow-ups (tracked via TODO comments)

  1. Enable multimodal dispatch in vectorize_file using embedder.supports_multimodal property (not string match)
  2. Add parts: List[Part] to Vectorize dataclass for PDF chunk pipeline
  3. Implement PDF page chunking (≤6 pages) + multi-vector storage in add-resource pipeline

Summary by CodeRabbit

  • New Features

    • Opt-in operation telemetry across APIs and SDKs; responses can include concise telemetry summaries.
    • CLI/TUI: Vector records browser with paging, counts, and keyboard navigation; new observer “retrieval” status.
    • Debug API: Vector scroll and count endpoints.
  • Improvements

    • Retrieval breadth and observability enhanced; added retrieval metrics and health checks.
    • More collision-safe agent space IDs.
    • Windows path handling and URL-download filename preservation improved.
    • Safer startup via data-directory lock; increased console request timeout.
  • Bug Fixes

    • Consistent path separators across platforms and archives.
  • Documentation

    • Added operation telemetry design guide.

zhougit86 and others added 19 commits March 14, 2026 23:24
Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com>
…id, rust cli version... (volcengine#577)

* fix: windows zip path norm

* fix: account id in vector db

* fix: add some log

* fix: add some log, and fixed search

---------

Co-authored-by: openviking <openviking@example.com>
…ns (volcengine#609)

agent_space_name() computed md5(user_id + agent_id) without a separator,
so different (user_id, agent_id) pairs could produce the same hash when
their concatenation matched (e.g. ("alice","bot") vs ("aliceb","ot")).

Add ":" separator between user_id and agent_id in the hash input. The ":"
character is safe because the validation regex [a-zA-Z0-9_-] prevents
either field from containing it.

Fix applied to all three implementations:
- openviking_cli/session/user_id.py (Python SDK)
- bot/vikingbot/openviking_mount/ov_server.py (bot server)
- examples/openclaw-memory-plugin/client.ts (TypeScript example)

Note: This is a breaking change for existing agent spaces. Existing data
directories were named using the old hash and will not be found with the
new hash. A migration script or fallback lookup may be needed.

Fixes volcengine#595

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…ction (volcengine#610)

commit_async() called extract_long_term_memories(messages=...) without
passing user, session_id, or ctx. Because the compressor returns early
when ctx is None, async commits always produced memories_extracted=0.

The sync commit() path already passes all three parameters correctly.
This aligns the async path to match.

Regression from the COW pattern revert (volcengine#584) which dropped these
arguments from the async call.

Fixes volcengine#602

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…cengine#615)

Move 429 re-enqueue logic directly into the exception handler instead of
deferring via a flag, making the flow clearer and avoiding unnecessary DB
writes on rate-limit errors. Add has_queue_manager/enqueue_embedding_msg
to base backend class. Remove obsolete account_id param from test and
prune outdated URI deduplicator test cases.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…lcengine#611)

When multiple OpenViking processes share the same data directory (common
with stdio MCP in multi-session hosts), they silently contend for AGFS
and VectorDB resources. This produces misleading errors like "Collection
'context' does not exist" instead of pointing to the actual cause.

Add a PID-based advisory lock in OpenVikingService.initialize() that:
- Detects an existing live process using the same data directory
- Raises DataDirectoryLocked with a clear error message explaining the
  contention and suggesting HTTP mode or separate data directories
- Cleans up stale lock files from crashed processes
- Releases the lock on normal exit via atexit

The lock uses a .openviking.pid file in the data directory. HTTP mode
(openviking-server) is unaffected since it handles concurrency natively.

Relates to volcengine#473
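The locking scheme described above can be sketched roughly as follows. Names approximate the PR's description (`.openviking.pid`, `DataDirectoryLocked`) but this is an illustrative outline, not the actual `process_lock.py`:

```python
import atexit
import os
from pathlib import Path

class DataDirectoryLocked(RuntimeError):
    pass

def acquire_data_dir_lock(data_dir: str) -> None:
    """Acquire a PID-based advisory lock on data_dir (sketch)."""
    lock = Path(data_dir) / ".openviking.pid"
    if lock.exists():
        alive = False
        try:
            pid = int(lock.read_text().strip())
            os.kill(pid, 0)  # signal 0: existence check, no signal is sent
            alive = True
        except (ValueError, ProcessLookupError):
            pass  # stale lock left by a crashed process
        except PermissionError:
            alive = True  # process exists but belongs to another user
        if alive:
            raise DataDirectoryLocked(
                f"data directory {data_dir} is held by PID {pid}; "
                "use HTTP mode or a separate data directory"
            )
        lock.unlink(missing_ok=True)  # clean up the stale lock
    lock.write_text(str(os.getpid()))
    atexit.register(lock.unlink, missing_ok=True)  # release on normal exit
```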

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…volcengine#617)

During session commit, messages.jsonl (the archived session transcript)
was unnecessarily summarized by the semantic file summary pipeline,
wasting tokens and adding latency. Add messages.jsonl to a skip list in
_list_dir() so it never enters the semantic DAG.

Closes volcengine#564
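The skip-list filter amounts to something like the following sketch (names are illustrative, not the actual `_list_dir` signature):

```python
# Files excluded from the semantic summary DAG; messages.jsonl is the
# archived session transcript and must never be summarized.
SEMANTIC_SKIP_FILES = {"messages.jsonl"}

def list_semantic_candidates(entries):
    """Filter out skip-listed basenames before semantic processing."""
    return [
        e for e in entries
        if e.rsplit("/", 1)[-1] not in SEMANTIC_SKIP_FILES
    ]
```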

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…cengine#618)

* fix(session): handle non-dict LLM responses in memory extraction

When using local Ollama models, parse_json_from_response may return a
list instead of the expected {"memories": [...]} dict, causing
AttributeError on .get(). Add type checking after parsing: wrap lists
as {"memories": data}, and fall back to {} for other unexpected types.

Closes volcengine#605
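The type guard described above reduces to a small normalization step; the helper name here is illustrative:

```python
def normalize_memories_payload(data):
    """Coerce a parsed LLM response into the expected {"memories": [...]} dict."""
    if isinstance(data, dict):
        return data
    if isinstance(data, list):
        # Some local models (e.g. via Ollama) return a bare list of memories.
        return {"memories": data}
    return {}  # any other unexpected shape: extract nothing, don't crash
```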

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix ruff formatting in test file

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…olcengine#622)

Add retrieval quality observability following the existing observer
pattern (QueueObserver, VLMObserver, VikingDBObserver).

- RetrievalStatsCollector: thread-safe singleton that accumulates
  per-query metrics (result counts, scores, latency, rerank usage)
- RetrievalObserver: reads accumulated stats, reports health based on
  zero-result rate, formats status table with tabulate
- Instrumented HierarchicalRetriever.retrieve() to record stats
- Added /api/v1/observer/retrieval endpoint
- Included in system-wide observer health check
- 17 unit tests covering stats, collector, and observer

This contribution was developed with AI assistance (Claude Code).
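A thread-safe singleton collector of this shape might look like the sketch below; field names and the health metric are illustrative, not the actual `RetrievalStatsCollector` API:

```python
import threading

class RetrievalStatsCollector:
    """Thread-safe singleton accumulating per-query retrieval metrics (sketch)."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._lock = threading.Lock()
        self.queries = 0
        self.zero_result_queries = 0
        self.total_latency_ms = 0.0

    @classmethod
    def instance(cls):
        # Double-checked creation under a class-level lock.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def record(self, result_count: int, latency_ms: float) -> None:
        with self._lock:
            self.queries += 1
            self.total_latency_ms += latency_ms
            if result_count == 0:
                self.zero_result_queries += 1

    def zero_result_rate(self) -> float:
        # Health basis: share of queries returning no results.
        with self._lock:
            return self.zero_result_queries / self.queries if self.queries else 0.0
```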

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
…rom URL (volcengine#619)

* fix(parse): preserve original filename and extension when importing from URL

When importing resources via URL (e.g., COS/HTTP paths to code files),
three bugs occurred: an extra directory was created using the temp file
name, the file extension was changed from .py to .md, and URL-encoded
characters in filenames were not decoded.

Root cause: code file extensions (.py, .js, etc.) were not recognized
by URLTypeDetector, causing them to be parsed as web pages through the
HTML-to-markdown pipeline. This lost the original filename and always
produced .md output.

Changes:
- Add common code/text file extensions to URLTypeDetector so they route
  to download instead of webpage parsing
- Extract and URL-decode the original filename from the URL path
- Save downloaded text/code files with their original name and extension
  instead of routing through MarkdownParser
- Pass original filename to markdown and PDF parsers as source_path /
  resource_name so temp file names don't leak into the final URI

Fixes volcengine#251
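The filename extraction and routing decision can be sketched as below. The extension set is an illustrative subset and the function names are hypothetical, not the actual `URLTypeDetector` API:

```python
from urllib.parse import unquote, urlparse

CODE_EXTENSIONS = {".py", ".js", ".ts", ".rs"}  # illustrative subset

def original_filename_from_url(url: str) -> str:
    """Extract and URL-decode the last path segment of a URL."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return unquote(name)

def routes_to_download(url: str) -> bool:
    """Route known code/text extensions to download instead of webpage parsing."""
    name = original_filename_from_url(url)
    dot = name.rfind(".")
    return dot != -1 and name[dot:].lower() in CODE_EXTENSIONS
```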

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(parse): reuse CODE_EXTENSIONS from constants module

Import CODE_EXTENSIONS from openviking.parse.parsers.constants instead of
duplicating the set in URLTypeDetector. Spread comes first in EXTENSION_MAP
so explicit entries (.html, .htm) correctly override to DOWNLOAD_HTML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: openviking <openviking@example.com>
…context agentId (volcengine#637)

When OpenClaw gateway serves multiple agents, each agent's
before_agent_start and agent_end hooks now carry the agent's ID
in the second parameter (PluginHookAgentContext). The plugin
dynamically switches the client's agentId before each recall/capture
operation, ensuring memories are routed to the correct agent_space
(md5(user_id + agent_id)[:12]).

Changes:
- client.ts: Add setAgentId()/getAgentId() to allow dynamic agent
  switching. Clears cached runtimeIdentity and resolvedSpaceByScope
  when switching to ensure correct space derivation.
- index.ts: Extract agentId from hook ctx (2nd param) in both
  before_agent_start and agent_end handlers.

This is backward compatible: if ctx.agentId is absent (single-agent
setup), the plugin falls back to the static config agentId as before.

Co-authored-by: Mac <mac@MacBookPro.lan>
…ine#640)

refactor: replace operation trace with telemetry

fix telemetry demo skill ingestion

simplify telemetry summary metric keys

rename remaining trace telemetry artifacts

feat: support configurable telemetry payloads

docs: rewrite operation telemetry design in chinese

fix: reject telemetry for async session commit

refactor: isolate telemetry orchestration

refactor: remove telemetry from find payloads

refactor: remove telemetry event payloads

fix(trace): keep only telemetry-related changes

fix(trace): remove top-level usage from telemetry responses

feat(console): default telemetry on proxied operations
…gine#607 review)

Merge 17 upstream commits (feat/fix: trace metrics, multi-agent isolation,
vikingdb TUI, rate-limit simplification, session fixes, etc.) and resolve
conflicts in embedding_msg, embedding_msg_converter, collection_schemas,
pyproject.toml, and uv.lock.

Scope reduction (defers multimodal pipeline to April when add-resource
pipeline is ready):
- Remove multimodal dispatch from embedding_utils.vectorize_file; media
  files always fall back to text/summary (resolves r2936557627: removes
  hardcoded "gemini" string check; resolves r2936556516: no PDF 6-page
  limit concern)
- Drop media_uri/media_mime_type from EmbeddingMsg; add telemetry_id
  from upstream; preserve id through to_dict/from_dict round-trip
- Drop media fields from EmbeddingMsgConverter; pass telemetry_id
- Remove multimodal on_dequeue path from TextEmbeddingHandler; adopt
  upstream's telemetry-tracked text-only path
- Add TODO in GeminiDenseEmbedder.embed_multimodal for parts-list
  aggregation (resolves r2936536550 — deferred to April)

GeminiDenseEmbedder class retains supports_multimodal + embed_multimodal
for future activation once Vectorize.parts is supported end-to-end.
…lution

r2936557627 (use supports_multimodal, not string check):
  Split IMAGE/VIDEO/AUDIO block from PDF. TODO in media block explicitly
  documents the correct future pattern: `if embedder.supports_multimodal`
  instead of `(provider or "").lower() == "gemini"`, noting doubao-embedding-
  vision also exposes supports_multimodal=True.

r2936556516 (PDF 6-page Gemini limit):
  PDF now has its own explicit elif branch with a comment explaining the
  hard limit (6 pages/request), the required solution (≤6-page chunk splitting
  with one vector per chunk), and the April 2026 add-resource timeline.

r2936536550 (parts API — 1 media limitation):
  Confirm in code comment that current 1-text + 1-media implementation matches
  googleapis/python-genai test suite exactly. TODO updated with the concrete
  April plan: upgrade Vectorize to carry parts: List[Part] for interleaved
  text+media sequences from PDF chunk pipelines.
@coderabbitai

coderabbitai bot commented Mar 15, 2026

📝 Walkthrough

Walkthrough

This PR introduces comprehensive operation-level telemetry collection, vector record pagination in the CLI UI, context-aware memory deduplication, retrieval statistics tracking, process-level locking, and extensive refactoring of the vector database layer to support per-tenant account isolation.

Changes

Cohort / File(s) Summary
Configuration & Schema Management
bot/vikingbot/config/loader.py, bot/vikingbot/config/schema.py, openviking/console/bootstrap.py
Updated config persistence to include default values, improved field definitions with Optional types and default factories, increased request timeout to 3600s.
Telemetry Core Infrastructure
openviking/telemetry/*, openviking/server/telemetry.py
New comprehensive telemetry subsystem with operation tracking, request/selection parsing, execution wrappers, registry, runtime backends, and resource/queue summary helpers for capturing metrics across the platform.
Vector Record Scrolling & Debug API
crates/ov_cli/src/client.rs, crates/ov_cli/src/commands/observer.rs, openviking/server/routers/debug.py
Added two new HTTP debug endpoints for vector pagination (scroll with limit/cursor) and counting, plus Rust client methods and observer CLI commands for retrieval status.
CLI TUI Vector Records UI
crates/ov_cli/src/tui/app.rs, crates/ov_cli/src/tui/ui.rs, crates/ov_cli/src/tui/event.rs, crates/ov_cli/src/tui/mod.rs, crates/ov_cli/src/tui/tree.rs, crates/ov_cli/src/main.rs
Introduced VectorRecordsState with cursor navigation, async loading, and pagination; new render_vector_records() function; refactored event handling for j/k/g/G keys based on display mode; restructured root loading with "/" as a single node containing scopes.
Telemetry Integration in Python Clients
openviking/async_client.py, openviking/client/session.py, openviking/sync_client.py, openviking_cli/client/base.py, openviking_cli/client/http.py, openviking_cli/client/sync_http.py
Added TelemetryRequest parameter to commit_session, add_resource, add_skill, search, and find methods; implemented telemetry validation, attachment, and response envelope handling; updated HTTP client response parsing to include telemetry payloads.
Telemetry Integration in Services
openviking/server/routers/resources.py, openviking/server/routers/search.py, openviking/server/routers/sessions.py, openviking/server/routers/observer.py
Wrapped resource add/skill operations with run_operation for telemetry capture; added CommitSessionRequest model; extended search find/grep endpoints with telemetry support and result inclusion; added observer retrieval endpoint.
Vector DB Layer Refactoring
openviking/storage/viking_vector_index_backend.py, openviking/storage/vikingdb_manager.py, openviking/storage/__init__.py
Introduced per-account backend isolation with _SingleAccountBackend; created VikingDBManagerProxy for automatic context injection; refactored all public methods to require RequestContext for tenant scoping; added root-only and proxy-based operations.
Service-Level Telemetry & Context
openviking/service/resource_service.py, openviking/service/search_service.py, openviking/service/debug_service.py, openviking/service/core.py
Added time-based telemetry instrumentation around resource/skill operations with queue metrics and error tracking; enhanced search to pass session context to VikingFS; added RetrievalObserver component; implemented data directory lock advisory.
Memory & Session Processing
openviking/session/memory_deduplicator.py, openviking/session/memory_extractor.py, openviking/session/compressor.py, openviking/session/session.py
Extended memory deduplication with RequestContext for context-aware lookups; added telemetry tracking for extracted memories; improved LLM response parsing for non-dict outputs; propagated user/session/ctx to extraction calls.
Retrieval & Storage
openviking/retrieve/hierarchical_retriever.py, openviking/retrieve/retrieval_stats.py, openviking/storage/viking_fs.py
Introduced RetrievalStatsCollector and RetrievalStats for query metrics; increased GLOBAL_SEARCH_TOPK from 3 to 5; refactored retriever to use VikingDBManagerProxy with telemetry hooks; extended VikingFS to initialize telemetry and record vector operations.
Embedder Telemetry & Retry Logic
openviking/models/embedder/base.py, openviking/models/embedder/openai_embedders.py, openviking/models/embedder/volcengine_embedders.py
Added exponential_backoff_retry utility for resilient embedding calls; integrated token usage telemetry reporting in OpenAI and Volcengine embedders; enhanced Volcengine with 429 error detection and retry logic.
Queue & Storage Processing
openviking/storage/collection_schemas.py, openviking/storage/queuefs/embedding_msg.py, openviking/storage/queuefs/embedding_msg_converter.py, openviking/storage/queuefs/semantic_msg.py, openviking/storage/queuefs/semantic_processor.py
Refactored EmbeddingMsg with telemetry_id instead of media fields; added RequestQueueStats and telemetry binding for queue processing; extended SemanticMsg and SemanticProcessor with telemetry tracking and per-telemetry statistics caching; removed media/fallback handling from embedding conversion.
Infrastructure & Utilities
openviking/utils/process_lock.py, openviking/utils/embedding_utils.py, openviking/utils/resource_processor.py, openviking/utils/skill_processor.py, openviking/utils/summarizer.py, openviking/parse/*
Implemented PID-based advisory locking for data directories; removed multimodal Gemini fallback in vectorization; added telemetry instrumentation for parse/finalize phases; normalized path separators in ZIP handling and directory scanning; extended summarizer with telemetry_id propagation.
Server Response Model & VLM
openviking/server/models.py, openviking/models/vlm/base.py
Replaced UsageInfo and time fields with unified telemetry dict in Response; added telemetry token usage reporting in VLM token tracking.
Bot Configuration
bot/vikingbot/openviking_mount/ov_server.py
Changed agent space name hash input from concatenation to colon-delimited format (user_id:agent_id).
Memory Plugin Examples
examples/openclaw-memory-plugin/client.ts, examples/openclaw-memory-plugin/index.ts, examples/opencode-memory-plugin/openviking-memory.ts
Introduced dynamic agent switching with setAgentId/getAgentId; updated space derivation to use colon separator; added context-based agent routing in memory hooks; enhanced plugin documentation.
Documentation & Dependency
docs/design/operation_telemetry.md, .gitignore, crates/ov_cli/Cargo.toml, pyproject.toml
Added comprehensive telemetry design document; excluded docs/superpowers/ from git; bumped ov_cli version to 0.2.6; added pytest>=9.0.2 dev dependency.
Test Infrastructure & Coverage
tests/server/conftest.py, tests/utils/mock_context.py, tests/telemetry/*, tests/server/test_*.py, tests/session/test_*.py, tests/storage/test_*.py, tests/unit/*, tests/parse/*, tests/cli/*, tests/misc/*
Added FakeEmbedder for server tests; created mock context helpers (make_test_user, make_test_ctx); comprehensive telemetry test suites covering execution, layering rules, resource metrics, and runtime behavior; extended API tests for telemetry request/response handling; added Windows path normalization tests; added PID lock tests; removed/updated multimodal embedding tests; added retrieval stats unit tests.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰 Hop, hop, the telemetry bells ring bright,
Vector scrolls now dance in CLI light,
Memory knows its context, stats agree,
Locks keep data safe—hop with glee!
From agent spaces to retrieval stats so clean,
This rabbit's built the finest system e'er seen! 🥕✨


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust, operation-level telemetry system to OpenViking, enabling detailed tracking of performance, resource consumption, and errors across various services. It refines the Gemini embedding integration by focusing on text-only capabilities and deferring complex multimodal features, while also enhancing the CLI with new debugging and observability tools. Significant improvements in system stability and cross-platform compatibility are delivered through data directory locking, improved path handling, and more resilient API interactions.

Highlights

  • Telemetry System Introduction: A comprehensive operation-level telemetry system has been implemented across various services (resources, search, sessions, embedding, semantic processing) to track performance, token usage, queue status, and errors. This includes new API parameters for requesting telemetry data.
  • Gemini Embedding Refocus: Gemini Embedding 2 functionality has been explicitly scoped to text-only, with full multimodal support and PDF chunking deferred to a future release (April 2026). Relevant TODOs have been added to mark future enhancements.
  • CLI Enhancements: New debug endpoints have been added to the CLI for scrolling and counting vector records. Additionally, new observer commands for transaction and retrieval metrics are now available.
  • Robustness Improvements: A PID-based advisory lock has been introduced to prevent multiple OpenViking processes from contending for the same data directory. Path normalization in ZIP operations and directory scanning has been improved to correctly handle Windows-style paths. The agent_space_name hashing now uses a separator to prevent potential hash collisions. Exponential backoff retry with rate limit detection has been implemented for Volcengine embedders.
  • Memory Management Refinements: The system now ensures that messages.jsonl files are skipped during semantic processing to avoid unnecessary summarization. Non-dictionary LLM responses during memory extraction are now handled gracefully. Multimodal media fields have been removed from EmbeddingMsg as multimodal support is deferred.
  • API Response Standardization: The time and usage fields in API responses have been replaced with a unified telemetry object for consistent data reporting.
Changelog
  • crates/ov_cli/src/client.rs
    • Added new client methods for debugging vector records, including scrolling and counting.
  • crates/ov_cli/src/tui/app.rs
    • Integrated new VectorRecordsState and related logic to enable viewing and navigating vector records within the TUI.
  • openviking/models/embedder/gemini_embedders.py
    • Added a TODO comment regarding future multimodal embedding API patterns and PDF chunking.
  • openviking/storage/collection_schemas.py
    • Updated TextEmbeddingHandler to integrate with the new telemetry system.
    • Implemented rate limit handling and request-scoped stat merging.
  • openviking/storage/queuefs/embedding_msg.py
    • Refactored EmbeddingMsg to remove multimodal media fields.
    • Introduced a telemetry_id for tracking.
  • openviking/storage/queuefs/embedding_msg_converter.py
    • Adapted to the new EmbeddingMsg structure, removing multimodal media handling.
    • Now passes telemetry_id.
  • openviking/storage/viking_vector_index_backend.py
    • Implemented a facade pattern for VikingVectorIndexBackend to manage per-account backends.
    • Enforced tenant isolation and required RequestContext for data operations.
  • openviking/telemetry/*
    • Introduced a new telemetry package with modules for context management, operation-scoped collectors, request parsing, and a registry for tracking active telemetry handles.
  • openviking/utils/embedding_utils.py
    • Simplified media type inference by removing multimodal-specific MIME maps.
    • Updated vectorize_file to reflect deferred multimodal support for Gemini and explicit PDF handling.
  • openviking/utils/process_lock.py
    • Added a new utility for advisory file locking to prevent multiple OpenViking instances from using the same data directory.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive telemetry system to track operation-scoped metrics, including token usage, queue processing, vector search, semantic DAG node stats, and memory extraction. It refactors the VikingVectorIndexBackend for improved multi-tenancy with per-account backend instances and introduces a VikingDBManagerProxy for simplified tenant-aware operations. The TUI client is enhanced with new debug vector methods and keybindings for viewing vector records. Retrieval quality metrics are now collected via a new RetrievalStatsCollector and exposed through a RetrievalObserver. Additionally, the PR includes fixes for URL filename preservation and handling of code file extensions during resource parsing, adds exponential backoff for Volcengine embedder API calls, and implements a PID-based advisory lock to prevent data directory contention. A minor issue was noted where the search_global_roots_in_tenant function name is misleading as it now includes level 2 items, suggesting a rename to better reflect its broader scope.

extra_filter=extra_filter,
),
In("level", [0, 1]),
In("level", [0, 1, 2]), # TODO: smj fix this

medium

The function name search_global_roots_in_tenant is a bit misleading now that it also fetches level 2 items, as the name suggests it only fetches 'roots' (level 0 and 1). Consider renaming it to something like search_global_candidates_in_tenant to better reflect its behavior. This would also help resolve the associated TODO.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
openviking/storage/local_fs.py (1)

202-226: ⚠️ Potential issue | 🔴 Critical

ZIP read will fail for Windows-created archives after path normalization.

The code normalizes zip_path at line 208, then uses the normalized safe_zip_path to call zf.read(safe_zip_path) at line 223. However, zipfile.ZipFile.read() requires the exact entry name as stored in the ZIP archive.

When a Windows-created ZIP contains entries with backslashes (e.g., project\file.txt), these are stored as-is in the archive. Normalizing the path to project/file.txt before calling zf.read() raises a KeyError because the actual ZIP entry name still contains backslashes.

To fix this, preserve the original filename for ZIP read operations:

Proposed fix
         for info in infolist:
-            zip_path = info.filename
-            if not zip_path:
+            original_zip_path = info.filename
+            if not original_zip_path:
                 continue

             # Normalize path separators to handle Windows-created ZIPs
-            zip_path = zip_path.replace("\\", "/")
-            safe_zip_path = _validate_ovpack_member_path(zip_path, base_name)
+            normalized_zip_path = original_zip_path.replace("\\", "/")
+            safe_zip_path = _validate_ovpack_member_path(normalized_zip_path, base_name)

             # Handle directory entries
             if safe_zip_path.endswith("/"):
                 rel_path = get_viking_rel_path_from_zip(safe_zip_path.rstrip("/"))
                 target_dir_uri = f"{root_uri}/{rel_path}" if rel_path else root_uri
                 await viking_fs.mkdir(target_dir_uri, exist_ok=True, ctx=ctx)
                 continue

             # Handle file entries
             rel_path = get_viking_rel_path_from_zip(safe_zip_path)
             target_file_uri = f"{root_uri}/{rel_path}" if rel_path else root_uri

             try:
-                data = zf.read(safe_zip_path)
+                data = zf.read(original_zip_path)
                 await viking_fs.write_file_bytes(target_file_uri, data, ctx=ctx)
             except Exception as e:
-                logger.error(f"Failed to import {zip_path} to {target_file_uri}: {e}")
+                logger.error(f"Failed to import {original_zip_path} to {target_file_uri}: {e}")
                 if not force:  # In non-force mode, stop on error
                     raise e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/local_fs.py` around lines 202 - 226, The ZIP read fails
because you normalize info.filename into safe_zip_path before calling zf.read;
instead preserve the original archive entry name (info.filename or zip_path
before replacing backslashes) for zf.read and only use safe_zip_path (after
_validate_ovpack_member_path and slash normalization) for validation and
computing rel_path/target URIs used by get_viking_rel_path_from_zip and
viking_fs.write_file_bytes; update the loop to call zf.read(original_entry_name)
and keep using safe_zip_path for path checks and directory handling (referencing
info.filename, zip_path, safe_zip_path, _validate_ovpack_member_path,
get_viking_rel_path_from_zip, zf.read, and viking_fs.write_file_bytes).
bot/vikingbot/config/schema.py (1)

733-737: ⚠️ Potential issue | 🟡 Minor

from_safe_name should validate/safely split input.

Current parsing assumes exactly three __-separated parts and can raise IndexError on malformed values.

✅ Robust parsing fix
     `@staticmethod`
     def from_safe_name(safe_name: str):
-        file_name_split = safe_name.split("__")
-        return SessionKey(
-            type=file_name_split[0], channel_id=file_name_split[1], chat_id=file_name_split[2]
-        )
+        parts = safe_name.split("__", 2)
+        if len(parts) != 3:
+            raise ValueError(f"Invalid SessionKey safe_name: {safe_name}")
+        return SessionKey(type=parts[0], channel_id=parts[1], chat_id=parts[2])
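A standalone version of the hardened parser — using a plain dict in place of SessionKey, which is not available here:

```python
def from_safe_name(safe_name: str) -> dict:
    # maxsplit=2 keeps any further "__" inside chat_id instead of discarding it
    parts = safe_name.split("__", 2)
    if len(parts) != 3:
        raise ValueError(f"Invalid SessionKey safe_name: {safe_name!r}")
    return {"type": parts[0], "channel_id": parts[1], "chat_id": parts[2]}
```

Note that `maxsplit=2` also changes behavior for chat ids that themselves contain `__`: the remainder stays in chat_id rather than being silently truncated.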
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bot/vikingbot/config/schema.py` around lines 733 - 737, The from_safe_name
function currently assumes safe_name splits into exactly three parts and can
IndexError; change it to split with maxsplit=2 (safe_name.split("__", 2)),
validate the resulting parts length equals 3, and raise a clear ValueError (or
custom error) if malformed, then construct and return SessionKey(type=parts[0],
channel_id=parts[1], chat_id=parts[2]); also consider stripping whitespace from
parts before constructing the SessionKey to harden parsing.
openviking/session/session.py (1)

260-276: ⚠️ Potential issue | 🟡 Minor

Set memory.extracted when compressor is unavailable.

In both commit paths, non-empty sessions with self._session_compressor unset never emit memory.extracted, creating inconsistent telemetry for successful commits.

Proposed fix
         # 2. Extract long-term memories
         if self._session_compressor:
             logger.info(
                 f"Starting memory extraction from {len(messages_to_archive)} archived messages"
             )
             memories = run_async(
                 self._session_compressor.extract_long_term_memories(
                     messages=messages_to_archive,
                     user=self.user,
                     session_id=self.session_id,
                     ctx=self.ctx,
                 )
             )
             logger.info(f"Extracted {len(memories)} memories")
             result["memories_extracted"] = len(memories)
             self._stats.memories_extracted += len(memories)
             get_current_telemetry().set("memory.extracted", len(memories))
+        else:
+            get_current_telemetry().set("memory.extracted", 0)
@@
         # 2. Extract long-term memories
         if self._session_compressor:
             logger.info(
                 f"Starting memory extraction from {len(messages_to_archive)} archived messages"
             )
             memories = await self._session_compressor.extract_long_term_memories(
                 messages=messages_to_archive,
                 user=self.user,
                 session_id=self.session_id,
                 ctx=self.ctx,
             )
             logger.info(f"Extracted {len(memories)} memories")
             result["memories_extracted"] = len(memories)
             self._stats.memories_extracted += len(memories)
             get_current_telemetry().set("memory.extracted", len(memories))
+        else:
+            get_current_telemetry().set("memory.extracted", 0)

Also applies to: 338-351

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/session/session.py` around lines 260 - 276, When
self._session_compressor is unset but there are messages_to_archive, ensure
telemetry and result fields still report zero memories extracted: in the branch
where self._session_compressor is false (and in the analogous second commit path
handling messages_to_archive), set result["memories_extracted"] = 0, call
get_current_telemetry().set("memory.extracted", 0), and update
self._stats.memories_extracted (no-op or add 0) and optionally log that no
compressor was available; use the same symbols (self._session_compressor,
result["memories_extracted"], self._stats.memories_extracted,
get_current_telemetry()) so both commit paths emit consistent telemetry.
openviking/models/embedder/volcengine_embedders.py (1)

206-227: ⚠️ Potential issue | 🟠 Major

embed_batch lacks retry logic unlike embed.

The embed() method uses exponential_backoff_retry for resilience against 429 errors, but embed_batch() makes direct API calls without retry wrapping. This inconsistency means batch operations are more susceptible to transient rate limiting failures.

Consider wrapping the batch API call with the same retry logic:

♻️ Proposed fix
     def embed_batch(self, texts: List[str]) -> List[EmbedResult]:
         ...
         if not texts:
             return []

-        try:
+        def _batch_embed_call():
             if self.input_type == "multimodal":
                 multimodal_inputs = [{"type": "text", "text": text} for text in texts]
                 response = self.client.multimodal_embeddings.create(
                     input=multimodal_inputs, model=self.model_name
                 )
                 self._update_telemetry_token_usage(response)
                 data = response.data
             else:
                 response = self.client.embeddings.create(input=texts, model=self.model_name)
                 self._update_telemetry_token_usage(response)
                 data = response.data

             return [
                 EmbedResult(dense_vector=truncate_and_normalize(item.embedding, self.dimension))
                 for item in data
             ]
+
+        try:
+            return exponential_backoff_retry(
+                _batch_embed_call,
+                max_wait=10.0,
+                base_delay=0.5,
+                max_delay=2.0,
+                jitter=True,
+                is_retryable=is_429_error,
+                logger=logger,
+            )
         except Exception as e:
             ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/volcengine_embedders.py` around lines 206 - 227,
The embed_batch method currently calls self.client.multimodal_embeddings.create
and self.client.embeddings.create directly and lacks the
exponential_backoff_retry used by embed; update embed_batch to wrap the API
calls with the same exponential_backoff_retry helper (the one used by embed) so
the multimodal and non-multimodal branches call
exponential_backoff_retry(lambda: self.client.multimodal_embeddings.create(...))
and exponential_backoff_retry(lambda: self.client.embeddings.create(...))
respectively, preserve the existing _update_telemetry_token_usage(response) and
subsequent parsing (EmbedResult creation and truncate_and_normalize) and keep
the existing exception handling semantics.
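The helper referenced here lives in openviking/models/embedder/base.py; a self-contained sketch with the call shape the diff assumes (parameter names taken from the diff, `logger` omitted) might look like:

```python
import random
import time

def exponential_backoff_retry(fn, *, max_wait=10.0, base_delay=0.5, max_delay=2.0,
                              jitter=True, is_retryable=lambda exc: True):
    """Retry fn() on retryable errors with capped, jittered exponential backoff."""
    start = time.monotonic()
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except Exception as exc:
            if not is_retryable(exc):
                raise
            elapsed = time.monotonic() - start
            if elapsed >= max_wait:
                raise
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            if jitter:
                delay *= 0.5 + random.random()
            # never sleep past the overall deadline
            time.sleep(min(delay, max_wait - elapsed))
```

Wrapping `_batch_embed_call` in this helper gives embed_batch the same 429 resilience as embed.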
🟡 Minor comments (8)
tests/unit/test_skill_processor_none.py-47-63 (1)

47-63: ⚠️ Potential issue | 🟡 Minor

The test assertion uses a platform-specific error message that may cause CI failures on Windows.

The match="File name too long" assertion is specific to POSIX systems (errno 36, ENAMETOOLONG). Windows handles filesystem path length limits differently and would raise errors with different messages or codes depending on whether long path support is enabled. This test will fail on Windows or in cross-platform CI without platform-specific handling.

Consider removing the match pattern to accept any OSError, or add pytest.mark.skipif(sys.platform == "win32", reason="POSIX-specific path length limits") if this behavior is intentionally Unix-only:

Proposed fix
+import sys
+
    def test_parse_skill_long_raw_content_raises_oserror(self):
         """Long raw SKILL.md content should still surface path probing errors."""
         processor = SkillProcessor(vikingdb=None)
@@ -59,5 +61,5 @@ class TestParseSkillNoneData:
             "Use this skill to validate telemetry ingestion.\n"
         )

-        with pytest.raises(OSError, match="File name too long"):
+        with pytest.raises(OSError):
             processor._parse_skill(raw_skill)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_skill_processor_none.py` around lines 47 - 63, The test
test_parse_skill_long_raw_content_raises_oserror asserts a POSIX-specific error
message ("File name too long") when calling SkillProcessor._parse_skill which
will fail on Windows; change the test to either remove the match so it simply
expects any OSError (i.e., keep with pytest.raises(OSError) without match) or
add a platform guard (pytest.mark.skipif(sys.platform == "win32",
reason="POSIX-specific path length limits")) so the assertion is only run on
POSIX systems—update the test function name
test_parse_skill_long_raw_content_raises_oserror accordingly and adjust imports
if you choose the skip approach.
openviking/retrieve/retrieval_stats.py-26-33 (1)

26-33: ⚠️ Potential issue | 🟡 Minor

Track scored items separately from returned items.

avg_score, min_score, and max_score are keyed off total_results, but only scores updates the score accumulators. If a caller records results without per-hit scores, min_score stays inf; if all scores are negative, max_score stays 0.0.

📊 Suggested fix
 `@dataclass`
 class RetrievalStats:
     total_queries: int = 0
     total_results: int = 0
     zero_result_queries: int = 0
+    scored_results: int = 0
     total_score_sum: float = 0.0
-    max_score: float = 0.0
+    max_score: float = float("-inf")
     min_score: float = float("inf")
@@
     def avg_score(self) -> float:
-        if self.total_results == 0:
+        if self.scored_results == 0:
             return 0.0
-        return self.total_score_sum / self.total_results
+        return self.total_score_sum / self.scored_results
@@
-            "max_score": round(self.max_score, 4) if self.total_results > 0 else 0.0,
-            "min_score": round(self.min_score, 4) if self.total_results > 0 else 0.0,
+            "max_score": round(self.max_score, 4) if self.scored_results > 0 else 0.0,
+            "min_score": round(self.min_score, 4) if self.scored_results > 0 else 0.0,
@@
             for s in scores:
+                self._stats.scored_results += 1
                 self._stats.total_score_sum += s

Also applies to: 48-51, 65-69, 120-125
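A minimal standalone sketch of the suggested accounting — field names follow the diff; the real class carries additional counters:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RetrievalStats:
    total_results: int = 0
    scored_results: int = 0
    total_score_sum: float = 0.0
    max_score: float = float("-inf")
    min_score: float = float("inf")

    def record(self, num_results: int, scores: List[float]) -> None:
        self.total_results += num_results
        for s in scores:
            self.scored_results += 1
            self.total_score_sum += s
            self.max_score = max(self.max_score, s)
            self.min_score = min(self.min_score, s)

    @property
    def avg_score(self) -> float:
        # Divide by scored results, not returned results
        return self.total_score_sum / self.scored_results if self.scored_results else 0.0
```

With this split, score-free result batches leave the score aggregates untouched, and all-negative scores produce a negative max_score instead of a spurious 0.0.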

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/retrieve/retrieval_stats.py` around lines 26 - 33, The scoring
stats are keyed off total_results while only scores update total_score_sum,
leaving min_score as inf and max_score at 0 for negative scores; add a new
scored_results_count (int) field and increment it whenever scores are processed
(alongside updating total_score_sum, min_score, max_score), change any avg_score
computation to divide by scored_results_count (not total_results), and ensure
final getters/serializers handle scored_results_count == 0 by returning sensible
defaults (e.g., 0.0 or None) instead of inf/0; update references to
total_score_sum, max_score, min_score, and any avg_score logic in the
RetrievalStats class to use scored_results_count.
openviking/models/embedder/openai_embedders.py-149-152 (1)

149-152: ⚠️ Potential issue | 🟡 Minor

Same concern: telemetry recorded before validating response data.

For consistency, consider moving the telemetry update after iterating response.data to ensure tokens are only counted for successfully processed responses.

Proposed reordering
             response = self.client.embeddings.create(**kwargs)
-            self._update_telemetry_token_usage(response)

-            return [EmbedResult(dense_vector=item.embedding) for item in response.data]
+            results = [EmbedResult(dense_vector=item.embedding) for item in response.data]
+            self._update_telemetry_token_usage(response)
+            return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/openai_embedders.py` around lines 149 - 152,
Telemetry is currently recorded before validating and consuming response data;
modify the flow in the embedding call that uses self.client.embeddings.create so
you first build the list of EmbedResult objects from response.data (e.g.,
iterate response.data and create EmbedResult(dense_vector=item.embedding) for
each item), validate the results as needed, and only after successful iteration
call self._update_telemetry_token_usage(response) to record token usage; this
ensures _update_telemetry_token_usage is invoked only when response.data was
successfully processed.
openviking/models/embedder/openai_embedders.py-119-121 (1)

119-121: ⚠️ Potential issue | 🟡 Minor

Consider updating telemetry after validating response data.

Telemetry is recorded on line 120 before accessing response.data[0].embedding on line 121. If response.data is empty or malformed, token usage will be counted for a request that ultimately fails. Consider moving the telemetry update after the data extraction, or ensure the response structure is valid first.

Proposed reordering
             response = self.client.embeddings.create(**kwargs)
-            self._update_telemetry_token_usage(response)
             vector = response.data[0].embedding
+            self._update_telemetry_token_usage(response)

             return EmbedResult(dense_vector=vector)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/openai_embedders.py` around lines 119 - 121,
Telemetry update is happening before validating the embedding response, so
_update_telemetry_token_usage may record usage for a failed/malformed response;
modify the embedding flow in the method that calls self.client.embeddings.create
so you first validate response.data and extract vector (e.g., ensure
response.data is a non-empty list and response.data[0].embedding exists) and
only then call self._update_telemetry_token_usage(response); alternatively,
perform the validation and raise a clear exception before any telemetry call to
avoid counting failed requests.
crates/ov_cli/src/tui/tree.rs-69-72 (1)

69-72: ⚠️ Potential issue | 🟡 Minor

Don't overload the root URI for display.

Using "/" here makes FsEntry::name() return an empty string, so the synthetic root row renders blank. It also changes the URI propagated by selected_uri(). Keep a canonical root URI and derive "/" as a display label separately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/ov_cli/src/tui/tree.rs` around lines 69 - 72, The synthetic root node
is using "/" as the FsEntry.uri which makes FsEntry::name() return an empty
string and leaks that display value into selected_uri(); instead set the root
TreeNode.entry.uri to the canonical internal root identifier (do not use "/"),
and change the rendering/path-display logic to derive and display "/" as the
label for the synthetic root (i.e., keep TreeNode / FsEntry.uri canonical, but
map that canonical URI to the "/" display string when rendering and when
computing row labels), ensuring selected_uri() continues to return the canonical
URI unchanged.
tests/telemetry/test_resource_summary.py-11-14 (1)

11-14: ⚠️ Potential issue | 🟡 Minor

This test doesn't actually cover queue_status parsing.

Both request-stat helpers are stubbed to fixed values, and the assertions only check those stubbed values. record_resource_wait_metrics() could stop reading queue_status and this test would still pass. Leave one collector unpatched, or assert a field sourced directly from queue_status.

Also applies to: 30-58

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/telemetry/test_resource_summary.py` around lines 11 - 14, The test
doesn't actually verify parsing of queue_status by record_resource_wait_metrics
because both request-stat helpers are stubbed and assertions only check those
stubs; update tests in test_resource_summary.py so that either one of the
request-stat collectors remains unpatched or add an assertion that directly
checks a value derived from queue_status (e.g.,
queue_status["Semantic"].error_count or queue_status["Embedding"].processed)
after calling record_resource_wait_metrics; ensure you reference the
queue_status fixture/variable and the record_resource_wait_metrics call and
assert that a metric or output reflects a queue_status field (for example assert
recorded metric equals queue_status["Semantic"].error_count) so the test will
fail if queue_status is not parsed.
openviking/models/embedder/base.py-322-328 (1)

322-328: ⚠️ Potential issue | 🟡 Minor

Re-cap the jittered delay.

max_delay is enforced before jitter, so once the raw backoff reaches the cap the actual sleep can still exceed it. Clamp again after applying jitter to keep the per-retry ceiling predictable.

Suggested fix
             delay = min(base_delay * (2 ** (attempt - 1)), max_delay)

             if jitter:
                 delay = delay * (0.5 + random.random())
+            delay = min(delay, max_delay)

             remaining_time = max_wait - elapsed
             delay = min(delay, remaining_time)
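The re-clamp can be isolated into a small helper to make the invariant obvious — a sketch, with the same constants as the diff:

```python
import random

def next_delay(attempt: int, base_delay=0.5, max_delay=2.0, jitter=True) -> float:
    delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
    if jitter:
        delay *= 0.5 + random.random()  # scales by a factor in [0.5, 1.5)
    return min(delay, max_delay)  # re-clamp so jitter can never exceed the cap
```

Without the final `min`, an attempt at the cap could sleep up to 1.5 × max_delay.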
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/base.py` around lines 322 - 328, The jittered
delay can exceed max_delay because max_delay was applied before jitter; update
the backoff logic in the function that computes delay (the block using
base_delay, attempt, max_delay, jitter, random) so that after applying jitter
you clamp again: after "if jitter: delay = delay * (0.5 + random.random())" add
"delay = min(delay, max_delay)" (then compute remaining_time and cap by that as
before). This ensures the per-retry ceiling (max_delay) holds even when jitter
is enabled.
crates/ov_cli/src/tui/ui.rs-148-161 (1)

148-161: ⚠️ Potential issue | 🟡 Minor

Title shows "1/0" when records list is empty.

When app.vector_state.records.is_empty() is true, the title displays cursor + 1 and can show "1/0", because it is built before the empty-state check on line 171 returns early. The early return prevents items from rendering, but the title has already been set by then.

Consider guarding the title construction or showing a cleaner "0/0" for empty state:

🔧 Proposed fix
     let title = if let Some(total) = app.vector_state.total_count {
-        format!(
-            " Vector Records ({}/{}, total: {}) ",
-            app.vector_state.cursor + 1,
-            app.vector_state.records.len(),
-            total
-        )
+        if app.vector_state.records.is_empty() {
+            format!(" Vector Records (0/{}, total: {}) ", 0, total)
+        } else {
+            format!(
+                " Vector Records ({}/{}, total: {}) ",
+                app.vector_state.cursor + 1,
+                app.vector_state.records.len(),
+                total
+            )
+        }
     } else {
-        format!(
-            " Vector Records ({}/{}) ",
-            app.vector_state.cursor + 1,
-            app.vector_state.records.len()
-        )
+        if app.vector_state.records.is_empty() {
+            " Vector Records (0/0) ".to_string()
+        } else {
+            format!(
+                " Vector Records ({}/{}) ",
+                app.vector_state.cursor + 1,
+                app.vector_state.records.len()
+            )
+        }
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/ov_cli/src/tui/ui.rs` around lines 148 - 161, The title construction
can display "1/0" when app.vector_state.records is empty because it
unconditionally uses app.vector_state.cursor + 1; update the logic that builds
title (the block using app.vector_state.total_count, app.vector_state.cursor,
and app.vector_state.records.len()) to first check if
app.vector_state.records.is_empty() and produce " Vector Records (0/0) " (and
optionally include total when present) otherwise use the existing cursor+1 and
len() formatting; ensure this guard sits before the current title assignment so
the empty-state title is never computed from cursor+1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bot/vikingbot/config/loader.py`:
- Line 56: The bootstrap currently persists default_config too early via
save_config(default_config, config_path, include_defaults=True), which causes
bot.agents.model to be written and prevents _merge_vlm_model_config() from
applying vlm/provider overrides; fix by not persisting defaults before merging:
either change the save_config call to use include_defaults=False (or remove it)
and ensure _merge_vlm_model_config() runs and can mutate the in-memory config
first, or postpone the save_config call until after _merge_vlm_model_config()
has run; ensure the change prevents writing bot.agents.model/default provider
before merge.

In `@examples/openclaw-memory-plugin/index.ts`:
- Around line 357-365: The hook is mutating a shared OpenVikingClient by calling
client.setAgentId(hookAgentId), causing cross-request interference; instead make
agentId request-scoped by either creating/obtaining a per-agent client or
passing agentId into the memory operations rather than mutating the shared
instance. Update the before_agent_start and the other hook (the block around the
later capture/recall code) to use a request-scoped client accessor (e.g.,
getClientForAgent(agentId)) or call methods that accept an agentId parameter
instead of calling OpenVikingClient.setAgentId on the shared client, and ensure
any resolved-space cache is keyed by agentId rather than a global field.

In `@openviking_cli/client/http.py`:
- Around line 576-577: The response envelope's telemetry is currently discarded
in the client methods that call self._handle_response_data (e.g., in find() and
search()); extract telemetry = response_data.get("telemetry") and ensure it is
passed into the FindResult creation instead of dropping it—either by calling
FindResult.from_dict(result_dict, telemetry=telemetry) (or otherwise setting a
telemetry field on the returned FindResult) or by returning the full envelope
consistently; apply the same change in both places that call
FindResult.from_dict after _handle_response_data.
- Around line 225-227: The function _validate_telemetry currently calls
normalize_telemetry_request(telemetry) but returns the original telemetry
object; change it to capture and return the normalized result—i.e., call
normalized = normalize_telemetry_request(telemetry) (or return the call
directly) and then return that normalized payload so any non-mutating
normalization is sent to the server; update references in _validate_telemetry to
use the normalized value.

In `@openviking/console/bootstrap.py`:
- Around line 41-44: The default for the "--request-timeout-sec" argument is too
high (3600.0); change its default to a much lower safe value (e.g., 30.0 or
60.0) in the argparse add_argument call that defines "--request-timeout-sec" and
update the help string to note that longer timeouts require explicit opt-in for
long-running operations. Locate the add_argument invocation for
"--request-timeout-sec" in bootstrap.py and change the default float value and
help text accordingly.

In `@openviking/parse/parsers/html.py`:
- Around line 352-368: The extracted filename from _extract_filename_from_url
may decode to path-traversal values (e.g., "..", "../", or include separators)
and is used directly when composing temp URIs; sanitize it before returning by
stripping any path separators and dot-segments, normalizing/slugifying to a safe
filename (keep only alphanumerics, hyphen, underscore, dot), trim leading dots,
and fall back to a constant safe name like "download" if the result is empty or
equals "." or ".."; apply the same sanitization where original_filename is used
to build root_dir / file_uri to ensure no traversal escapes the temp subtree.
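The sanitization described above can be sketched as a small helper — the function name and fallback are hypothetical; the real code would apply this inside _extract_filename_from_url:

```python
import re
from urllib.parse import unquote

def sanitize_filename(raw: str, fallback: str = "download") -> str:
    """Reduce a URL-derived name to a safe basename with no traversal segments."""
    name = unquote(raw)
    # Drop directory components (handles both separator styles)
    name = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Keep only alphanumerics, hyphen, underscore, dot
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Trim leading dots so ".." and hidden-file tricks cannot survive
    name = name.lstrip(".")
    return name or fallback
```

Percent-decoding before splitting matters: `%2e%2e%2f` decodes to `../` and would otherwise slip past a separator check.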

In `@openviking/retrieve/hierarchical_retriever.py`:
- Around line 355-367: The seeded level-2 hits in the initial_candidates loop
are added to collected_by_uri without applying the score gate; update the block
that iterates initial_candidates in hierarchical_retriever.py to compute
r["_final_score"] (using r.get("_score", 0.0)) and call
passes_threshold(r["_final_score"]) (or the appropriate scorer method used
elsewhere) before inserting into collected_by_uri and logging; ensure you still
only accept items where r.get("level", 2) == 2 and skip/log those that fail the
threshold so the global score_threshold is respected for seeded L2 hits.
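A minimal sketch of the gated insertion — `passes_threshold` semantics and the dict shapes are assumptions based on the comment:

```python
def collect_seeded_l2(initial_candidates, collected_by_uri, score_threshold=0.5):
    """Add level-2 seed hits only when they clear the global score gate."""
    for r in initial_candidates:
        if r.get("level", 2) != 2:
            continue
        r["_final_score"] = r.get("_score", 0.0)
        if r["_final_score"] < score_threshold:
            continue  # respect score_threshold for seeded L2 hits too
        collected_by_uri[r["uri"]] = r
    return collected_by_uri
```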

In `@openviking/storage/collection_schemas.py`:
- Around line 165-189: The bug is that _merge_request_stats keeps appending the
same telemetry_id into cls._request_stats_order, so when the list exceeds
cls._max_cached_stats an old duplicate may be popped and its live aggregate
removed from cls._request_stats_by_telemetry_id; fix by ensuring
cls._request_stats_order contains each telemetry_id only once (or is updated to
reflect most-recent use) before eviction. Modify _merge_request_stats to, while
holding cls._request_stats_lock, remove any existing occurrence of telemetry_id
from cls._request_stats_order (or maintain a companion set) and then append
telemetry_id (or append/move-to-end) so eviction always removes a truly oldest
unused telemetry id and not a duplicate of an active one; keep
consume_request_stats behavior unchanged.
- Around line 309-312: The current logger.debug call (using record_id and
inserted_data['abstract'] and inserted_data['vector'][:5]) leaks raw content and
will KeyError on sparse-only runs; change the logging in the upsert/insert path
to avoid logging the raw abstract or vector slice and instead log safe metadata
(e.g., record_id, inserted_data.get("doc_id") or inserted_data.get("chunk_id"),
collection name or inserted_data.keys()) and a boolean indicating whether a
vector is present (vector_present = "vector" in inserted_data). Update the
logger.debug line that references record_id and inserted_data to use
inserted_data.get(...) and only include non-sensitive identifiers and vector
presence, not the actual abstract or vector contents.
- Around line 253-261: The 429 re-enqueue path currently requeues immediately
and indefinitely; modify it to add a retry counter on EmbeddingMsg (e.g.,
add/inspect a attempts or retry_count field on EmbeddingMsg), check and
increment that counter before re-enqueueing in the block guarded by is_429_error
and self._vikingdb.has_queue_manager, and apply an exponential backoff sleep
(use the same backoff logic pattern as in volcengine_vectorizer.py) before
calling self._vikingdb.enqueue_embedding_msg(embedding_msg); enforce a maximum
retry limit (drop/log and call a failure handler/report if retry_count exceeds
max) so messages are not re-enqueued forever, and ensure successful requeue
still calls report_success.
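A sketch of the bounded re-enqueue — the retry_count field, limit, and dict-shaped message are assumptions; the real EmbeddingMsg would carry the counter as an attribute:

```python
import time

MAX_RETRIES = 5  # assumed limit; tune per deployment

def requeue_on_429(msg: dict, enqueue, base_delay=0.5, max_delay=2.0) -> bool:
    """Re-enqueue with a retry counter and backoff; return False when the msg is dropped."""
    msg["retry_count"] = msg.get("retry_count", 0) + 1
    if msg["retry_count"] > MAX_RETRIES:
        return False  # caller logs and reports failure instead of requeueing forever
    delay = min(base_delay * (2 ** (msg["retry_count"] - 1)), max_delay)
    time.sleep(delay)
    enqueue(msg)
    return True
```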

In `@openviking/storage/queuefs/semantic_processor.py`:
- Around line 76-103: _cache_dag_stats stores the same DagStats under both
cls._dag_stats_by_telemetry_id and cls._dag_stats_by_uri and appends
(telemetry_id, uri) to cls._dag_stats_order, but consume_dag_stats only removes
the entry it was queried by leaving the alias and stale order tuples behind; fix
by making caching idempotent (in _cache_dag_stats) — before appending a new
(telemetry_id, uri) remove any existing tuples for the same telemetry_id or uri
from cls._dag_stats_order and also remove their previous entries from
cls._dag_stats_by_telemetry_id/cls._dag_stats_by_uri to avoid duplicates, and
(in consume_dag_stats) when removing by telemetry_id or by uri also remove the
corresponding alias in the other dict and remove any matching tuples from
cls._dag_stats_order while holding _stats_lock so indices remain in sync and
eviction via _max_cached_stats works correctly.
- Around line 105-129: The current list-backed _request_stats_order in
_merge_request_stats lets duplicate telemetry ids accumulate and causes eviction
of live aggregates; replace the list with a keyed LRU (e.g.,
collections.OrderedDict or dict with move-to-end semantics) so each telemetry_id
is unique in the ordering. In _merge_request_stats (and related code paths),
after updating cls._request_stats_by_telemetry_id[telemetry_id], insert or move
the key to the end (OrderedDict.__setitem__ or move_to_end) and when len(order)
> cls._max_cached_stats pop the oldest via order.popitem(last=False) and also
pop that key from cls._request_stats_by_telemetry_id; keep consume_request_stats
behavior the same but it will now safely pop the stats by key without risk of
earlier duplicate eviction.
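The keyed LRU described above can be sketched with collections.OrderedDict — class and method names are illustrative, not the actual code:

```python
from collections import OrderedDict

class RequestStatsCache:
    """Keyed LRU for per-telemetry-id aggregates; each id appears once in the order."""

    def __init__(self, max_cached_stats: int = 2) -> None:
        self._max_cached_stats = max_cached_stats
        self._order = OrderedDict()  # telemetry_id -> None, oldest first
        self._by_id = {}             # telemetry_id -> aggregate

    def merge(self, telemetry_id: str, count: int) -> None:
        self._by_id[telemetry_id] = self._by_id.get(telemetry_id, 0) + count
        self._order[telemetry_id] = None
        self._order.move_to_end(telemetry_id)  # most recently used goes last
        while len(self._order) > self._max_cached_stats:
            oldest, _ = self._order.popitem(last=False)
            self._by_id.pop(oldest, None)

    def consume(self, telemetry_id: str):
        self._order.pop(telemetry_id, None)
        return self._by_id.pop(telemetry_id, None)
```

Because each telemetry_id occupies exactly one slot in the ordering, repeatedly merging into a hot id can no longer evict its own live aggregate.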

In `@openviking/storage/viking_fs.py`:
- Around line 648-650: The debug log in VikingFS.find leaks raw tenant/user
identifiers by interpolating real_ctx.account_id and real_ctx.user; change the
logger.debug calls (e.g., the one in VikingFS.find that currently references
real_ctx.account_id and real_ctx.user, and the similar calls around the 784-786
region) to redact or hash those values before logging (for example log counts,
masked values like "<redacted>" or a deterministic hash/ID instead of the raw
identifiers) so the message still conveys context without exposing PII.

In `@openviking/storage/viking_vector_index_backend.py`:
- Around line 620-637: The count() and clear() methods currently fall back to
_get_default_backend() when ctx is None which scopes operations to the "default"
account; change the no-ctx path to use _get_root_backend() instead so these
calls operate collection-wide (or alternatively make ctx mandatory).
Specifically, update both count() and clear() to call self._get_root_backend()
when ctx is falsy (keeping the existing use of _get_backend_for_context(ctx)
when ctx is provided), and ensure return await backend.count(...) and return
await backend.clear(...) remain unchanged.
- Around line 380-385: The health_check implementation always returns True if
collection_exists() doesn't raise, ignoring its boolean result; change
health_check (in VikingVectorIndexBackend) to await and capture the return value
from collection_exists() and return that boolean, and only return False on
exceptions (optionally logging the exception) so that an uninitialized backend
(collection_exists() == False) is reported unhealthy.
- Around line 137-141: The upsert currently only injects self._bound_account_id
when payload lacks "account_id", allowing callers to override and write into
other tenants; modify _SingleAccountBackend.upsert to validate that if
self._bound_account_id is set and payload.get("account_id") exists and differs
from it, the method rejects the request (raise a ValueError/ClientError) instead
of proceeding — enforce the check using self._bound_account_id and
payload["account_id"] and return/raise immediately with a clear error message
when they mismatch.
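The tenant-mismatch guard can be sketched like this; `SingleAccountBackend` and `TenantMismatchError` are illustrative stand-ins for `_SingleAccountBackend` and whatever error class the codebase prefers:

```python
from typing import Optional


class TenantMismatchError(ValueError):
    pass


class SingleAccountBackend:
    """Illustrative stand-in for _SingleAccountBackend's upsert guard."""

    def __init__(self, bound_account_id: Optional[str]) -> None:
        self._bound_account_id = bound_account_id
        self.rows = []

    def upsert(self, payload: dict) -> None:
        caller_account = payload.get("account_id")
        if (
            self._bound_account_id is not None
            and caller_account is not None
            and caller_account != self._bound_account_id
        ):
            # Reject instead of silently writing into another tenant.
            raise TenantMismatchError(
                f"payload account_id={caller_account!r} does not match "
                f"bound account {self._bound_account_id!r}"
            )
        payload.setdefault("account_id", self._bound_account_id)
        self.rows.append(payload)
```

The default-injection behavior for payloads without an `account_id` is preserved; only the cross-tenant override path is closed off.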

In `@openviking/utils/process_lock.py`:
- Around line 58-79: The current read -> check -> open("w") flow in acquire lock
is race-prone and returns success when writing fails; replace it with an atomic
create and fail-closed behavior: attempt to create the lock file using an atomic
open (os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)), write
my_pid bytes and close; if os.open raises FileExistsError, re-read via
_read_pid_file(lock_path) and if the existing PID is alive (use _is_pid_alive)
raise DataDirectoryLocked (including existing_pid and data_dir), otherwise
remove the stale file and retry the atomic create once; any OSError when
creating or writing the lock should raise an exception (do not return lock_path)
so callers know acquisition failed. Use the existing symbols _read_pid_file,
_is_pid_alive, DataDirectoryLocked, lock_path, my_pid, and data_dir to locate
and implement this change.
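The atomic create-and-fail-closed flow can be sketched as follows. `acquire_lock` is a simplified stand-in; `_is_pid_alive`, `DataDirectoryLocked`, `lock_path`, and `my_pid` mirror the symbols named above, and the stale-pid check uses `os.kill(pid, 0)` which is POSIX-oriented:

```python
import os


class DataDirectoryLocked(RuntimeError):
    def __init__(self, existing_pid: int, data_dir: str) -> None:
        super().__init__(f"data dir {data_dir!r} is locked by pid {existing_pid}")
        self.existing_pid = existing_pid


def _is_pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check only, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def acquire_lock(data_dir: str) -> str:
    lock_path = os.path.join(data_dir, ".lock")
    my_pid = os.getpid()
    for _ in range(2):  # one retry after removing a stale file
        try:
            # O_EXCL makes create-if-absent atomic: no read->check->write race.
            fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            with open(lock_path) as f:
                existing_pid = int(f.read().strip() or "0")
            if _is_pid_alive(existing_pid):
                raise DataDirectoryLocked(existing_pid, data_dir)
            os.unlink(lock_path)  # stale lock: retry the atomic create once
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(my_pid))
        return lock_path
    # Fail closed: any path that did not write the lock raises.
    raise OSError(f"could not acquire lock at {lock_path}")
```

Any `OSError` during creation or writing propagates, so callers never see a success return for a lock that was not actually written.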

In `@tests/misc/test_process_lock.py`:
- Around line 51-61: The test test_error_message_includes_remediation currently
will pass if acquire_data_dir_lock(tmpdir) does not raise; update it to
explicitly fail when no DataDirectoryLocked is raised by wrapping the call in a
try/except and adding a fail path (e.g., call to pytest.fail or raising
AssertionError) immediately after acquire_data_dir_lock(tmpdir) so the test only
succeeds when the except DataDirectoryLocked as exc branch runs and validates
the message; reference the test function name
test_error_message_includes_remediation and the exception class
DataDirectoryLocked as the places to change.

---

Outside diff comments:
In `@bot/vikingbot/config/schema.py`:
- Around line 733-737: The from_safe_name function currently assumes safe_name
splits into exactly three parts and can IndexError; change it to split with
maxsplit=2 (safe_name.split("__", 2)), validate the resulting parts length
equals 3, and raise a clear ValueError (or custom error) if malformed, then
construct and return SessionKey(type=parts[0], channel_id=parts[1],
chat_id=parts[2]); also consider stripping whitespace from parts before
constructing the SessionKey to harden parsing.
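The hardened parse described above can be sketched as follows, assuming a `SessionKey` with `type`/`channel_id`/`chat_id` fields as referenced in the review:

```python
from dataclasses import dataclass


@dataclass
class SessionKey:
    type: str
    channel_id: str
    chat_id: str


def from_safe_name(safe_name: str) -> SessionKey:
    # maxsplit=2 keeps any further "__" inside chat_id intact.
    parts = safe_name.split("__", 2)
    if len(parts) != 3 or not all(p.strip() for p in parts):
        raise ValueError(f"malformed session safe_name: {safe_name!r}")
    type_, channel_id, chat_id = (p.strip() for p in parts)
    return SessionKey(type=type_, channel_id=channel_id, chat_id=chat_id)
```

Malformed inputs now raise a clear `ValueError` instead of an `IndexError`, and a chat_id containing `__` round-trips correctly.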

In `@openviking/models/embedder/volcengine_embedders.py`:
- Around line 206-227: The embed_batch method currently calls
self.client.multimodal_embeddings.create and self.client.embeddings.create
directly and lacks the exponential_backoff_retry used by embed; update
embed_batch to wrap the API calls with the same exponential_backoff_retry helper
(the one used by embed) so the multimodal and non-multimodal branches call
exponential_backoff_retry(lambda: self.client.multimodal_embeddings.create(...))
and exponential_backoff_retry(lambda: self.client.embeddings.create(...))
respectively, preserve the existing _update_telemetry_token_usage(response) and
subsequent parsing (EmbedResult creation and truncate_and_normalize) and keep
the existing exception handling semantics.

In `@openviking/session/session.py`:
- Around line 260-276: When self._session_compressor is unset but there are
messages_to_archive, ensure telemetry and result fields still report zero
memories extracted: in the branch where self._session_compressor is false (and
in the analogous second commit path handling messages_to_archive), set
result["memories_extracted"] = 0, call
get_current_telemetry().set("memory.extracted", 0), and update
self._stats.memories_extracted (no-op or add 0) and optionally log that no
compressor was available; use the same symbols (self._session_compressor,
result["memories_extracted"], self._stats.memories_extracted,
get_current_telemetry()) so both commit paths emit consistent telemetry.

In `@openviking/storage/local_fs.py`:
- Around line 202-226: The ZIP read fails because you normalize info.filename
into safe_zip_path before calling zf.read; instead preserve the original archive
entry name (info.filename or zip_path before replacing backslashes) for zf.read
and only use safe_zip_path (after _validate_ovpack_member_path and slash
normalization) for validation and computing rel_path/target URIs used by
get_viking_rel_path_from_zip and viking_fs.write_file_bytes; update the loop to
call zf.read(original_entry_name) and keep using safe_zip_path for path checks
and directory handling (referencing info.filename, zip_path, safe_zip_path,
_validate_ovpack_member_path, get_viking_rel_path_from_zip, zf.read, and
viking_fs.write_file_bytes).
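The split between the original archive name (for `zf.read`) and the normalized safe path (for validation and target URIs) can be sketched like this; `_validate_member_path` and `extract_entries` are simplified stand-ins for `_validate_ovpack_member_path` and the import loop:

```python
import io
import posixpath
import zipfile


def _validate_member_path(path: str) -> str:
    """Hypothetical stand-in: reject absolute paths and traversal."""
    norm = posixpath.normpath(path)
    if norm.startswith(("/", "..")):
        raise ValueError(f"unsafe archive member: {path!r}")
    return norm


def extract_entries(zip_bytes: bytes) -> dict:
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            original_name = info.filename            # key zf.read() by this
            safe_path = _validate_member_path(
                original_name.replace("\\", "/")     # normalized: paths only
            )
            if info.is_dir():
                continue
            # Reading by the original name succeeds even for entries whose
            # stored name contains backslashes.
            out[safe_path] = zf.read(original_name)
    return out
```

Normalizing before the read was the bug: `zf.read("dir/file.txt")` raises `KeyError` when the archive stored `dir\file.txt`.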

---

Minor comments:
In `@crates/ov_cli/src/tui/tree.rs`:
- Around line 69-72: The synthetic root node is using "/" as the FsEntry.uri
which makes FsEntry::name() return an empty string and leaks that display value
into selected_uri(); instead set the root TreeNode.entry.uri to the canonical
internal root identifier (do not use "/"), and change the rendering/path-display
logic to derive and display "/" as the label for the synthetic root (i.e., keep
TreeNode / FsEntry.uri canonical, but map that canonical URI to the "/" display
string when rendering and when computing row labels), ensuring selected_uri()
continues to return the canonical URI unchanged.

In `@crates/ov_cli/src/tui/ui.rs`:
- Around line 148-161: The title construction can display "1/0" when
app.vector_state.records is empty because it unconditionally uses
app.vector_state.cursor + 1; update the logic that builds title (the block using
app.vector_state.total_count, app.vector_state.cursor, and
app.vector_state.records.len()) to first check if
app.vector_state.records.is_empty() and produce " Vector Records (0/0) " (and
optionally include total when present) otherwise use the existing cursor+1 and
len() formatting; ensure this guard sits before the current title assignment so
the empty-state title is never computed from cursor+1.

In `@openviking/models/embedder/base.py`:
- Around line 322-328: The jittered delay can exceed max_delay because max_delay
was applied before jitter; update the backoff logic in the function that
computes delay (the block using base_delay, attempt, max_delay, jitter, random)
so that after applying jitter you clamp again: after "if jitter: delay = delay *
(0.5 + random.random())" add "delay = min(delay, max_delay)" (then compute
remaining_time and cap by that as before). This ensures the per-retry ceiling
(max_delay) holds even when jitter is enabled.
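The double-clamp can be sketched as a small pure function; `backoff_delay` is an illustrative name, and the remaining-time cap mentioned above is omitted for brevity:

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0,
                  max_delay: float = 30.0, jitter: bool = True) -> float:
    delay = min(base_delay * (2 ** attempt), max_delay)
    if jitter:
        # Multiplier in [0.5, 1.5) can push past max_delay: clamp again.
        delay = delay * (0.5 + random.random())
    return min(delay, max_delay)
```

Without the final `min`, an attempt already at the ceiling could wait up to `1.5 * max_delay`.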

In `@openviking/models/embedder/openai_embedders.py`:
- Around line 149-152: Telemetry is currently recorded before validating and
consuming response data; modify the flow in the embedding call that uses
self.client.embeddings.create so you first build the list of EmbedResult objects
from response.data (e.g., iterate response.data and create
EmbedResult(dense_vector=item.embedding) for each item), validate the results as
needed, and only after successful iteration call
self._update_telemetry_token_usage(response) to record token usage; this ensures
_update_telemetry_token_usage is invoked only when response.data was
successfully processed.
- Around line 119-121: Telemetry update is happening before validating the
embedding response, so _update_telemetry_token_usage may record usage for a
failed/malformed response; modify the embedding flow in the method that calls
self.client.embeddings.create so you first validate response.data and extract
vector (e.g., ensure response.data is a non-empty list and
response.data[0].embedding exists) and only then call
self._update_telemetry_token_usage(response); alternatively, perform the
validation and raise a clear exception before any telemetry call to avoid
counting failed requests.

In `@openviking/retrieve/retrieval_stats.py`:
- Around line 26-33: The scoring stats are keyed off total_results while only
scores update total_score_sum, leaving min_score as inf and max_score at 0 for
negative scores; add a new scored_results_count (int) field and increment it
whenever scores are processed (alongside updating total_score_sum, min_score,
max_score), change any avg_score computation to divide by scored_results_count
(not total_results), and ensure final getters/serializers handle
scored_results_count == 0 by returning sensible defaults (e.g., 0.0 or None)
instead of inf/0; update references to total_score_sum, max_score, min_score,
and any avg_score logic in the RetrievalStats class to use scored_results_count.
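A minimal sketch of the `scored_results_count` fix; the field set is reduced for illustration and `record` is a hypothetical entry point, not the real `RetrievalStats` API:

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalStats:
    total_results: int = 0
    scored_results_count: int = 0
    total_score_sum: float = 0.0
    min_score: float = math.inf
    max_score: float = -math.inf  # -inf so negative scores register correctly

    def record(self, score: Optional[float]) -> None:
        self.total_results += 1
        if score is None:
            return
        self.scored_results_count += 1
        self.total_score_sum += score
        self.min_score = min(self.min_score, score)
        self.max_score = max(self.max_score, score)

    @property
    def avg_score(self) -> Optional[float]:
        if self.scored_results_count == 0:
            return None  # sensible default instead of inf/0 artifacts
        return self.total_score_sum / self.scored_results_count
```

Dividing by `scored_results_count` rather than `total_results` keeps the average correct when some results carry no score.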

In `@tests/telemetry/test_resource_summary.py`:
- Around line 11-14: The test doesn't actually verify parsing of queue_status by
record_resource_wait_metrics because both request-stat helpers are stubbed and
assertions only check those stubs; update tests in test_resource_summary.py so
that either one of the request-stat collectors remains unpatched or add an
assertion that directly checks a value derived from queue_status (e.g.,
queue_status["Semantic"].error_count or queue_status["Embedding"].processed)
after calling record_resource_wait_metrics; ensure you reference the
queue_status fixture/variable and the record_resource_wait_metrics call and
assert that a metric or output reflects a queue_status field (for example assert
recorded metric equals queue_status["Semantic"].error_count) so the test will
fail if queue_status is not parsed.

In `@tests/unit/test_skill_processor_none.py`:
- Around line 47-63: The test test_parse_skill_long_raw_content_raises_oserror
asserts a POSIX-specific error message ("File name too long") when calling
SkillProcessor._parse_skill which will fail on Windows; change the test to
either remove the match so it simply expects any OSError (i.e., keep with
pytest.raises(OSError) without match) or add a platform guard
(pytest.mark.skipif(sys.platform == "win32", reason="POSIX-specific path length
limits")) so the assertion is only run on POSIX systems—update the test function
name test_parse_skill_long_raw_content_raises_oserror accordingly and adjust
imports if you choose the skip approach.

---

Nitpick comments:
In @.gitignore:
- Line 144: The .gitignore entry "docs/superpowers/" is unanchored and can match
directories anywhere; update that pattern to be rooted by changing
"docs/superpowers/" to "/docs/superpowers/" so it consistently anchors to the
repository root (matching the adjacent "/site" pattern) and prevents unintended
matches elsewhere.

In `@examples/opencode-memory-plugin/openviking-memory.ts`:
- Around line 9-10: Header sentence in openviking-memory.ts is awkward and uses
a nonstandard comma; update the top comment to a concise, grammatical
description—edit the header comment (the initial project description lines) to
use standard punctuation, consistent capitalization, and clearer phrasing (for
example: "Build an enterprise AI assistant for consumer brands with process
awareness and memory, serving product development through the pre-launch
lifecycle.") so the project metadata reads professionally and is easy to scan.

In `@openviking/console/app.py`:
- Around line 50-58: Extract the hardcoded upstream paths in
_should_default_telemetry into named constants and use them in the function to
avoid duplication; create a set like _TELEMETRY_DEFAULT_PATHS for exact-match
routes (e.g. "/api/v1/search/find", "/api/v1/resources"), and string constants
_TELEMETRY_SESSION_PREFIX and _TELEMETRY_COMMIT_SUFFIX for the
startswith/endswith checks, then update _should_default_telemetry to test
membership in _TELEMETRY_DEFAULT_PATHS and to use
startswith(_TELEMETRY_SESSION_PREFIX) and endswith(_TELEMETRY_COMMIT_SUFFIX) so
route values are defined once and can be reused elsewhere.

In `@openviking/models/embedder/volcengine_embedders.py`:
- Around line 21-34: Replace fragile string matching in is_429_error with a type
check against the SDK's ArkRateLimitError: import ArkRateLimitError from
volcenginesdkarkruntime and have is_429_error return True when
isinstance(exception, ArkRateLimitError); you can optionally keep the existing
string checks as a fallback for non-SDK exceptions, but primary detection must
use isinstance(exception, ArkRateLimitError) to reliably detect 429/rate-limit
errors.

In `@openviking/models/vlm/base.py`:
- Around line 83-90: The except block around the telemetry call in
openviking.models.vlm.base (the try that imports openviking.telemetry and calls
get_current_telemetry().add_token_usage(prompt_tokens, completion_tokens))
swallows all errors silently; update the handler to log the caught exception at
debug level so telemetry failures are recorded without impacting
inference—capture the exception and call the module/process logger (or
logging.debug) with a clear message and the exception info while preserving the
current behavior of not raising.

In `@openviking/server/routers/debug.py`:
- Around line 54-57: The parameter name `filter` in the debug_vector_count
function shadows Python's built-in filter(); rename the parameter (e.g., to
`filter_expr` or `query_filter`) and update all references inside
debug_vector_count and any callers to use the new name (`filter_expr`), ensuring
type annotation Optional[str] and the Depends(_ctx) signature remain unchanged;
search for usages of `filter` within openviking/server/routers/debug.py (and
related tests) to update variable references accordingly.
- Line 60: The local "import json" currently inside the request handler should
be moved to the module-level imports: add "import json" with the other
top-of-file imports and remove the inline import from the function body (the
function that currently contains the inline import/json usage). Update any
references to json in that handler to use the top-level import and run
lint/flake8 to ensure import ordering and no unused-import warnings.

In `@openviking/service/resource_service.py`:
- Around line 141-145: The telemetry set_error calls inside the timeout/except
handlers (calls to get_current_telemetry().set_error in the wait_complete
handler and the add_skill timeout branch, plus the other similar handlers) must
be best-effort: wrap each get_current_telemetry().set_error(...) invocation in
its own try/except that catches and swallows/logs any exceptions from telemetry
so they cannot mask the original business exception (re-raise or continue to
propagate the original timeout/processing error as before); update the three
occurrences that call set_error (the wait_complete exception handler, the
add_skill timeout branch, and the other similar handler) to use this defensive
pattern.

In `@openviking/storage/queuefs/embedding_msg.py`:
- Around line 9-24: The EmbeddingMsg dataclass currently defines a custom
__init__ that assigns self.id (so id isn't a declared dataclass field and asdict
won't include it); replace the custom initializer by declaring id as a dataclass
field with a default_factory that generates uuid4() and init=False (e.g., add an
id: str = field(default_factory=lambda: str(uuid4()), init=False)), keep
message, context_data, telemetry_id as dataclass fields, and remove the custom
__init__ so dataclass-generated __init__ and asdict(self) include all fields.
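The declared-field version can be sketched as follows, assuming `message`, `context_data`, and `telemetry_id` carry the types shown (the real field types may differ):

```python
from dataclasses import asdict, dataclass, field
from uuid import uuid4


@dataclass
class EmbeddingMsg:
    message: str
    context_data: dict = field(default_factory=dict)
    telemetry_id: str = ""
    # A real dataclass field, so asdict() includes it; init=False keeps
    # callers from supplying their own id.
    id: str = field(default_factory=lambda: str(uuid4()), init=False)
```

With the custom `__init__` gone, `asdict(msg)` now round-trips every field, including `id`.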

In `@openviking/storage/vikingdb_manager.py`:
- Around line 172-189: The class docstring for VikingDBManagerProxy is written
in Chinese; please translate it into English in place (maintain the same
docstring structure, examples and formatting) so it matches the rest of the
codebase; update the class-level docstring for VikingDBManagerProxy to an
English description that explains its purpose (tenant-bound proxy that injects
RequestContext into calls), how to initialize and use it (showing manager and
ctx usage), and keep API compatibility note and example unchanged aside from
language.
- Around line 402-421: The wrapper methods inconsistently pass the context
self._ctx positionally to the underlying manager methods (e.g., search_in_tenant
calling self._manager.search_in_tenant(self._ctx, ...)), which risks
argument-order bugs; change these wrappers (including search_in_tenant,
delete_uris, and any similar methods) to pass ctx as a keyword argument
(ctx=self._ctx) when calling the corresponding _manager methods (e.g.,
self._manager.search_in_tenant(ctx=self._ctx, query_vector=..., ...)) so all
calls use a consistent keyword-argument style.

In `@openviking/telemetry/operation.py`:
- Around line 218-237: The finish() method should guard against being called
multiple times: add a private boolean attribute _finished on the class, and
inside finish() acquire self._lock, check if self._finished and if so return the
previously created TelemetrySnapshot (store it on self, e.g., _snapshot) or
raise a clear exception; otherwise compute duration_ms using self._start_time,
build the summary with TelemetrySummaryBuilder (using
self._counters/_gauges/_error_*), set self._finished = True and cache the
constructed TelemetrySnapshot (with self.telemetry_id and summary) before
returning it to ensure subsequent calls return the same snapshot and do not
recompute durations.
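The idempotent `finish()` can be sketched like this; the snapshot is reduced to `telemetry_id` plus `duration_ms`, and the summary-builder wiring is omitted:

```python
import threading
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetrySnapshot:
    telemetry_id: str
    duration_ms: float


class OperationTelemetry:
    def __init__(self, telemetry_id: str) -> None:
        self.telemetry_id = telemetry_id
        self._start_time = time.monotonic()
        self._lock = threading.Lock()
        self._finished = False
        self._snapshot = None

    def finish(self) -> TelemetrySnapshot:
        with self._lock:
            if self._finished:
                # Same snapshot every time; duration is never recomputed.
                return self._snapshot
            duration_ms = (time.monotonic() - self._start_time) * 1000.0
            self._snapshot = TelemetrySnapshot(self.telemetry_id, duration_ms)
            self._finished = True
            return self._snapshot
```

Holding `self._lock` around both the check and the assignment makes concurrent `finish()` callers agree on one snapshot.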

In `@openviking/telemetry/registry.py`:
- Around line 11-33: The registry currently holds strong references in
_REGISTERED_TELEMETRY causing potential memory growth; change the implementation
so orphaned handles are automatically removed by either replacing
_REGISTERED_TELEMETRY with a weakref.WeakValueDictionary (so register_telemetry
stores the OperationTelemetry directly and entries drop when no other refs
exist) or by storing (handle, last_seen_timestamp) and adding a background
cleanup task function (e.g., cleanup_orphaned_telemetry) that periodically
removes entries older than a configurable TTL; update register_telemetry,
resolve_telemetry and unregister_telemetry to work with the chosen structure and
expose a config option or docstring describing the TTL/cleanup behavior if you
opt to only document rather than implement automatic cleanup.
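The `WeakValueDictionary` variant can be sketched as follows; `OperationTelemetry` here is a bare stand-in for the real handle class:

```python
import weakref


class OperationTelemetry:
    def __init__(self, telemetry_id: str) -> None:
        self.telemetry_id = telemetry_id


# Entries vanish automatically once no other strong reference exists,
# so orphaned handles cannot accumulate.
_REGISTERED_TELEMETRY = weakref.WeakValueDictionary()


def register_telemetry(handle: OperationTelemetry) -> None:
    _REGISTERED_TELEMETRY[handle.telemetry_id] = handle


def resolve_telemetry(telemetry_id: str):
    return _REGISTERED_TELEMETRY.get(telemetry_id)


def unregister_telemetry(telemetry_id: str) -> None:
    _REGISTERED_TELEMETRY.pop(telemetry_id, None)
```

The trade-off versus the TTL approach: cleanup is automatic and immediate, but callers must hold their own strong reference for as long as they want the entry resolvable.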

In `@openviking/telemetry/resource_summary.py`:
- Around line 55-64: build_queue_status_payload currently assumes each value `s`
in `status` has attributes `processed`, `error_count`, and `errors`, which will
raise AttributeError for malformed entries; update the function to defensively
handle bad inputs by using getattr(s, "processed", 0) and getattr(s,
"error_count", 0), coerce `errors` via getattr(s, "errors", []) or treat `s` as
a dict (s.get("errors", [])) if necessary, ensure `errors` is iterable, and map
each error to {"message": getattr(e, "message", str(e))} (or skip non-iterable
values) so the function always returns a well-formed payload for
`build_queue_status_payload`.
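A defensive sketch of `build_queue_status_payload` along the lines above; the exact payload shape is illustrative:

```python
def build_queue_status_payload(status: dict) -> dict:
    payload = {}
    for name, s in status.items():
        if isinstance(s, dict):  # tolerate dict-shaped entries too
            processed = s.get("processed", 0)
            error_count = s.get("error_count", 0)
            errors = s.get("errors", [])
        else:
            processed = getattr(s, "processed", 0)
            error_count = getattr(s, "error_count", 0)
            errors = getattr(s, "errors", [])
        if not isinstance(errors, (list, tuple)):
            errors = []  # skip non-iterable values rather than raise
        payload[name] = {
            "processed": processed,
            "error_count": error_count,
            "errors": [{"message": getattr(e, "message", str(e))}
                       for e in errors],
        }
    return payload
```

Malformed entries degrade to zeroed counters and an empty error list instead of raising `AttributeError` mid-payload.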

In `@openviking/telemetry/runtime.py`:
- Around line 35-40: The record_histogram method currently appends values to
self._histograms[key] indefinitely causing potential unbounded memory growth;
modify this by introducing a bounded storage policy: either cap the list size
per key (e.g., drop oldest when length exceeds a MAX_HISTOGRAM_SIZE) or
implement reservoir sampling/rolling window logic when appending in
record_histogram, and add a periodic flush/aggregate method (e.g.,
flush_histograms or aggregate_histograms) that can be called by a background
thread or external scheduler to reduce memory (ensure thread-safety by using the
existing _lock around both updates and flushes); update references to
_histograms, _key, and record_histogram to use the new bounded/flush behavior
and document/configure the MAX_HISTOGRAM_SIZE or sampling parameters.
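The capped-list option can be sketched with a `deque(maxlen=...)` per key plus a flush method; `RuntimeMetrics` and `MAX_HISTOGRAM_SIZE` are illustrative names:

```python
import threading
from collections import defaultdict, deque

MAX_HISTOGRAM_SIZE = 1000  # tunable per-key cap


class RuntimeMetrics:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        # deque(maxlen=...) silently drops the oldest value on overflow.
        self._histograms = defaultdict(lambda: deque(maxlen=MAX_HISTOGRAM_SIZE))

    def record_histogram(self, key: str, value: float) -> None:
        with self._lock:
            self._histograms[key].append(value)

    def flush_histograms(self) -> dict:
        """Aggregate and clear; intended for a background scheduler."""
        with self._lock:
            summary = {
                key: {"count": len(vals), "sum": sum(vals)}
                for key, vals in self._histograms.items()
            }
            self._histograms.clear()
            return summary
```

Both the append path and the flush path take the same `_lock`, so a flush never observes a half-updated deque.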

In `@openviking/utils/resource_processor.py`:
- Around line 155-170: The parse telemetry (telemetry.set calls) must run before
the early return when parse_result.temp_dir_path is falsy: compute and emit
"resource.parse.duration_ms" and "resource.parse.warnings_count" immediately
after parse completes (using parse_start and len(parse_result.warnings or [])),
then proceed to check parse_result.temp_dir_path and add errors/warnings and
return; update the block around parse_result, parse_start, and telemetry to call
telemetry.set(...) before the if not parse_result.temp_dir_path early-return
path so the “no content generated” case still emits duration and warnings
metrics.

In `@openviking/utils/skill_processor.py`:
- Around line 69-73: The telemetry timing for each stage (_parse_skill,
_build_overview, _write_files, _index_skill) is only set on the success path;
wrap each stage execution in a try/finally: record start = time.perf_counter()
before calling the stage, execute the stage in the try block, and in the finally
block compute and emit telemetry.set("skill.<stage>.duration_ms", ...) so the
duration is recorded even when the stage throws; apply this pattern to the parse
block around _parse_skill and to the blocks covering _build_overview (overview),
file writes (write), and indexing (index) referenced in the 94-99 and 103-129
ranges.

In `@pyproject.toml`:
- Line 250: The pytest minimum version is inconsistent: update the pytest
constraint so both the dependency group entry ("pytest>=9.0.2" in
[dependency-groups]) and the test extras entry (pytest in
[project.optional-dependencies] test) use the same minimum (e.g., change the
test extras constraint from "pytest>=7.0.0" to "pytest>=9.0.2"); ensure both
occurrences of the symbol pytest have identical constraint strings to keep
environments consistent.

In `@tests/client/test_windows_path_handling.py`:
- Around line 83-127: The tests only validate manual normalization and don't
exercise the real extraction path; add an end-to-end test that creates a ZIP
with backslash-containing entries and calls import_ovpack (from local_fs.py) to
perform the import, then assert the resulting extracted structure (e.g.,
directories/files under the returned base path) matches expectations;
alternatively, if you prefer not to exercise import_ovpack here, rename the
TestZipExtractionPathHandling class to clarify it only tests normalization
helpers rather than production extraction.

In `@tests/misc/test_process_lock.py`:
- Around line 41-43: The test currently hard-codes PID 1 by writing "1" into
lock_path which is brittle; instead spawn a short-lived subprocess (e.g., via
subprocess.Popen or multiprocessing.Process), obtain its pid, write that pid
into lock_path, and ensure you terminate/wait for the subprocess before/after
assertions so the lock check is deterministic; update the code around lock_path
usage in tests/misc/test_process_lock.py to create the helper process, use its
pid instead of "1", and clean up the process to avoid leaks.

In `@tests/server/conftest.py`:
- Around line 57-61: The fake Embedder's embed and embed_batch currently return
the identical vector for every input which prevents meaningful ranking; update
embed(self, text: str) to deterministically derive a distinct dense_vector from
the input (e.g., use a stable hash of text to seed a simple RNG or to generate
per-index values, then map to floats and normalize to length dimension) and have
embed_batch(self, texts: list[str]) call embed for each text (already present)
to return per-input discriminative EmbedResult objects; keep return type
EmbedResult and preserve the existing dimension and determinism so tests remain
reproducible.
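A deterministic per-input fake embedder along those lines might look like this; `FakeEmbedder`, `DIMENSION`, and this `EmbedResult` shape are illustrative, not the fixture's real definitions:

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass
class EmbedResult:
    dense_vector: list


DIMENSION = 8  # the real fixture's dimension applies in practice


class FakeEmbedder:
    def embed(self, text: str) -> EmbedResult:
        # Stable per-input bytes -> distinct, reproducible unit vectors.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        raw = [digest[i % len(digest)] / 255.0 for i in range(DIMENSION)]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        return EmbedResult(dense_vector=[v / norm for v in raw])

    def embed_batch(self, texts: list) -> list:
        return [self.embed(t) for t in texts]
```

Different inputs now rank differently under cosine similarity, while repeated runs stay byte-for-byte reproducible with no RNG state to manage.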

In `@tests/session/test_memory_dedup_actions.py`:
- Around line 26-28: There are two different test-context creators in this
module (the module-level ctx = make_test_ctx() and the local helper
_make_ctx()), which can yield different user/account_id values and cause
inconsistent expectations; pick one approach and make all tests use it — either
remove the module-level ctx and replace its usages with calls to _make_ctx(), or
change _make_ctx() to call make_test_ctx() (or make_test_ctx() accept
account_id) so both produce the same account_id/user; update references to the
variable name (ctx or _make_ctx()) accordingly and ensure assertions expecting
account_id="acc1" match the chosen unified context.

In `@tests/session/test_memory_extractor_response_types.py`:
- Around line 59-64: Update the test docstring in
test_none_fallback_yields_empty to clearly state that using None or {} is
evaluated before calling _normalize_parsed_data (so the function receives an
empty dict) — e.g., change the docstring to explain we are verifying that an
empty dict (as produced by the None-or-{} fallback) is passed through unchanged;
leave the call to _normalize_parsed_data(None or {}) and assertions as-is and
only clarify the docstring text.

In `@tests/session/test_session_commit.py`:
- Around line 9-11: The module-level creation of ctx via make_test_ctx() in
tests/session/test_session_commit.py can leak mutable state between tests;
replace the module-scoped ctx with a pytest fixture (e.g., define a fixture
function named ctx using `@pytest.fixture` with desired scope, default "function"
or "module" if intentional) and update tests to accept ctx as a parameter so
each test gets its own context (or controlled shared context) instead of the
global ctx variable.

In `@tests/telemetry/test_layering_rules.py`:
- Around line 29-30: Replace the bare assertion "assert offenders == []" with an
assertion that includes a descriptive failure message referencing the offenders
variable (e.g., use an assertion message or f-string) so test failures show
"Expected no offenders but found: {offenders}" (or similar) to make CI/debugging
clearer; update the assertion in the test where the offenders list is checked.

In `@tests/unit/retrieve/test_retrieval_stats.py`:
- Around line 115-121: The test helper _setup_collector mutates the
module-private _collector in openviking.retrieve.retrieval_stats which tightly
couples tests to internal implementation; add a public setter in the module
(e.g., set_collector_for_testing) that accepts a RetrievalStatsCollector and
assigns it to the module's collector, then update the test to call
set_collector_for_testing(collector) instead of touching _collector directly so
tests remain stable if the module internals are refactored.

In `@tests/unit/tool_skill/test_tool_skill_calibration.py`:
- Around line 66-71: The test function name
test_suffix_like_weather_usage_does_not_match_weather is inconsistent with its
assertion that expects skill_name == "weather"; update the test name (or the
assertion) so intent is clear: either rename the test to something like
test_suffix_like_weather_usage_matches_weather, or change the assertion to
reflect non-matching behavior. Locate the test function in
tests/unit/tool_skill/test_tool_skill_calibration.py and adjust the test name
(or the expected value) accordingly; the related symbols to check are
SessionCompressor, _get_tool_skill_info, tool_parts and candidate used inside
the test to ensure the new name matches the actual behavior being asserted.

# Create default config with empty bot section
default_config = Config()
save_config(default_config, config_path)
save_config(default_config, config_path, include_defaults=True)
⚠️ Potential issue | 🟠 Major

Persisting defaults at bootstrap can block vlm overrides.

With Line 56 saving defaults and Line 178 dumping all default fields, bot.agents.model is persisted on first-run. Later, _merge_vlm_model_config() exits early when a model already exists, so root vlm.model/provider may never apply for new installs.

💡 Minimal fix to preserve current precedence
-        save_config(default_config, config_path, include_defaults=True)
+        save_config(default_config, config_path, include_defaults=False)

Also applies to: 178-178

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bot/vikingbot/config/loader.py` at line 56, The bootstrap currently persists
default_config too early via save_config(default_config, config_path,
include_defaults=True), which causes bot.agents.model to be written and prevents
_merge_vlm_model_config() from applying vlm/provider overrides; fix by not
persisting defaults before merging: either change the save_config call to use
include_defaults=False (or remove it) and ensure _merge_vlm_model_config() runs
and can mutate the in-memory config first, or postpone the save_config call
until after _merge_vlm_model_config() has run; ensure the change prevents
writing bot.agents.model/default provider before merge.

Comment on lines +357 to +365
api.on("before_agent_start", async (event: { messages?: unknown[]; prompt: string }, ctx?: { agentId?: string }) => {
// Dynamically switch agent identity for multi-agent memory isolation.
// In multi-agent gateway deployments, the hook context carries the current
// agent's ID so we route memory operations to the correct agent_space.
const hookAgentId = ctx?.agentId;
if (hookAgentId) {
const client = await getClient();
client.setAgentId(hookAgentId);
api.logger.info?.(`memory-openviking: switched to agentId=${hookAgentId} for recall`);
⚠️ Potential issue | 🔴 Critical

Don't switch agentId on the shared plugin client.

getClient() returns the same OpenVikingClient for the plugin instance, so these hook-level setAgentId() calls can interleave and overwrite each other's headers and resolved-space cache. That can route recall/capture to the wrong agent space, and hooks without ctx.agentId inherit whatever agent ran last. Keep agentId request-scoped, or use distinct client/cache entries per agent instead of mutating the shared instance.

Also applies to: 486-493

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/openclaw-memory-plugin/index.ts` around lines 357 - 365, The hook is
mutating a shared OpenVikingClient by calling client.setAgentId(hookAgentId),
causing cross-request interference; instead make agentId request-scoped by
either creating/obtaining a per-agent client or passing agentId into the memory
operations rather than mutating the shared instance. Update the
before_agent_start and the other hook (the block around the later capture/recall
code) to use a request-scoped client accessor (e.g., getClientForAgent(agentId))
or call methods that accept an agentId parameter instead of calling
OpenVikingClient.setAgentId on the shared client, and ensure any resolved-space
cache is keyed by agentId rather than a global field.

Comment on lines +225 to +227
```python
def _validate_telemetry(telemetry: TelemetryRequest) -> TelemetryRequest:
    normalize_telemetry_request(telemetry)
    return telemetry
```

⚠️ Potential issue | 🟠 Major

Return the normalized telemetry payload.

normalize_telemetry_request() is called here, but _validate_telemetry() always returns the original object. Any non-mutating normalization path still sends the raw request shape to the server.

Possible fix
```diff
 @staticmethod
 def _validate_telemetry(telemetry: TelemetryRequest) -> TelemetryRequest:
-    normalize_telemetry_request(telemetry)
-    return telemetry
+    normalized = normalize_telemetry_request(telemetry)
+    return telemetry if normalized is None else normalized
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```python
def _validate_telemetry(telemetry: TelemetryRequest) -> TelemetryRequest:
    normalized = normalize_telemetry_request(telemetry)
    return telemetry if normalized is None else normalized
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking_cli/client/http.py` around lines 225 - 227, The function
_validate_telemetry currently calls normalize_telemetry_request(telemetry) but
returns the original telemetry object; change it to capture and return the
normalized result—i.e., call normalized = normalize_telemetry_request(telemetry)
(or return the call directly) and then return that normalized payload so any
non-mutating normalization is sent to the server; update references in
_validate_telemetry to use the normalized value.

Comment on lines +576 to +577
```python
response_data = self._handle_response_data(response)
return FindResult.from_dict(response_data.get("result") or {})
```

⚠️ Potential issue | 🟠 Major

Don't drop telemetry from find() and search() responses.

Both methods build FindResult from response_data["result"] and discard the envelope's telemetry, so the new telemetry= parameter is effectively a no-op for these client APIs. Either teach FindResult to carry telemetry or return an envelope consistently here.

Also applies to: 609-610

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking_cli/client/http.py` around lines 576 - 577, The response
envelope's telemetry is currently discarded in the client methods that call
self._handle_response_data (e.g., in find() and search()); extract telemetry =
response_data.get("telemetry") and ensure it is passed into the FindResult
creation instead of dropping it—either by calling
FindResult.from_dict(result_dict, telemetry=telemetry) (or otherwise setting a
telemetry field on the returned FindResult) or by returning the full envelope
consistently; apply the same change in both places that call
FindResult.from_dict after _handle_response_data.
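The first option (teaching `FindResult` to carry telemetry) can be sketched as below. The field names and the `from_dict` signature here are illustrative assumptions, not the real `FindResult` schema:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Illustrative stand-in for FindResult with an added telemetry slot.
@dataclass
class FindResult:
    items: List[Dict[str, Any]] = field(default_factory=list)
    telemetry: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any], telemetry: Optional[Dict[str, Any]] = None) -> "FindResult":
        return cls(items=data.get("items", []), telemetry=telemetry)

# The client method would then thread the envelope's telemetry through:
response_data = {"result": {"items": [{"id": 1}]}, "telemetry": {"latency_ms": 12}}
result = FindResult.from_dict(
    response_data.get("result") or {},
    telemetry=response_data.get("telemetry"),
)
```

The same two-line change applies wherever `FindResult.from_dict` follows `_handle_response_data`.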

Comment on lines 41 to 44
```diff
         "--request-timeout-sec",
         type=float,
-        default=30.0,
+        default=3600.0,
         help="Upstream request timeout in seconds",
```

⚠️ Potential issue | 🟠 Major

Default upstream timeout is too high for console traffic.

Line 43 sets a 1-hour default timeout, which can keep stalled upstream calls alive too long and degrade console responsiveness/resource usage under failure conditions. Prefer a lower safe default and require explicit opt-in for long-running operations.

Suggested adjustment
```diff
     parser.add_argument(
         "--request-timeout-sec",
         type=float,
-        default=3600.0,
+        default=30.0,
         help="Upstream request timeout in seconds",
     )
```
📝 Committable suggestion


Suggested change
```python
"--request-timeout-sec",
type=float,
default=30.0,
help="Upstream request timeout in seconds",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/console/bootstrap.py` around lines 41 - 44, The default for the
"--request-timeout-sec" argument is too high (3600.0); change its default to a
much lower safe value (e.g., 30.0 or 60.0) in the argparse add_argument call
that defines "--request-timeout-sec" and update the help string to note that
longer timeouts require explicit opt-in for long-running operations. Locate the
add_argument invocation for "--request-timeout-sec" in bootstrap.py and change
the default float value and help text accordingly.

Comment on lines +137 to +141
```python
if self._bound_account_id and not payload.get("account_id"):
    payload["account_id"] = self._bound_account_id
logger.debug(
    f"[_SingleAccountBackend.upsert] Final payload.account_id={payload.get('account_id')}"
)
```

⚠️ Potential issue | 🔴 Critical

Reject cross-tenant account_id on bound upserts.

A backend bound to tenant A only sets payload["account_id"] when it is missing. If the caller supplies tenant B explicitly, this path writes into tenant B's partition.

Possible fix
```diff
-        if self._bound_account_id and not payload.get("account_id"):
-            payload["account_id"] = self._bound_account_id
+        if self._bound_account_id:
+            incoming_account_id = payload.get("account_id")
+            if incoming_account_id and incoming_account_id != self._bound_account_id:
+                raise PermissionError("account_id does not match bound tenant")
+            payload["account_id"] = self._bound_account_id
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/viking_vector_index_backend.py` around lines 137 - 141,
The upsert currently only injects self._bound_account_id when payload lacks
"account_id", allowing callers to override and write into other tenants; modify
_SingleAccountBackend.upsert to validate that if self._bound_account_id is set
and payload.get("account_id") exists and differs from it, the method rejects the
request (raise a ValueError/ClientError) instead of proceeding — enforce the
check using self._bound_account_id and payload["account_id"] and return/raise
immediately with a clear error message when they mismatch.

Comment on lines +380 to +385
```python
async def health_check(self) -> bool:
    try:
        await self.collection_exists()
        return True
    except Exception:
        return False
```

⚠️ Potential issue | 🟠 Major

Return the actual health result.

This returns True whenever collection_exists() doesn't raise, even when it returns False, so an uninitialized backend is reported as healthy.

Possible fix
```diff
     async def health_check(self) -> bool:
         try:
-            await self.collection_exists()
-            return True
+            return await self.collection_exists()
         except Exception:
             return False
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/viking_vector_index_backend.py` around lines 380 - 385,
The health_check implementation always returns True if collection_exists()
doesn't raise, ignoring its boolean result; change health_check (in
VikingVectorIndexBackend) to await and capture the return value from
collection_exists() and return that boolean, and only return False on exceptions
(optionally logging the exception) so that an uninitialized backend
(collection_exists() == False) is reported unhealthy.

Comment on lines +620 to +637
```python
async def count(
    self,
    filter: Optional[Dict[str, Any] | FilterExpr] = None,
    *,
    ctx: Optional[RequestContext] = None,
) -> int:
    if ctx:
        backend = self._get_backend_for_context(ctx)
    else:
        backend = self._get_default_backend()
    return await backend.count(filter=filter)

async def clear(self, *, ctx: Optional[RequestContext] = None) -> bool:
    if ctx:
        backend = self._get_backend_for_context(ctx)
    else:
        backend = self._get_default_backend()
    return await backend.clear()
```

⚠️ Potential issue | 🟠 Major

Don't silently scope ctx-free count() and clear() to the default tenant.

When ctx is omitted, both methods fall back to _get_default_backend(), which is bound to account "default". That changes these APIs from collection-wide operations to single-tenant ones and will undercount or partially clear data. Prefer _get_root_backend() for the no-context path, or make ctx mandatory.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/viking_vector_index_backend.py` around lines 620 - 637,
The count() and clear() methods currently fall back to _get_default_backend()
when ctx is None which scopes operations to the "default" account; change the
no-ctx path to use _get_root_backend() instead so these calls operate
collection-wide (or alternatively make ctx mandatory). Specifically, update both
count() and clear() to call self._get_root_backend() when ctx is falsy (keeping
the existing use of _get_backend_for_context(ctx) when ctx is provided), and
ensure return await backend.count(...) and return await backend.clear(...)
remain unchanged.

Comment on lines +58 to +79
```python
existing_pid = _read_pid_file(lock_path)
if existing_pid and existing_pid != my_pid and _is_pid_alive(existing_pid):
    raise DataDirectoryLocked(
        f"Another OpenViking process (PID {existing_pid}) is already using "
        f"the data directory '{data_dir}'. Running multiple OpenViking "
        f"instances on the same data directory causes silent storage "
        f"contention and data corruption.\n\n"
        f"To fix this, use one of these approaches:\n"
        f"  1. Use HTTP mode: start a single openviking-server and connect "
        f"via --transport http (recommended for multi-session hosts)\n"
        f"  2. Use separate data directories for each instance\n"
        f"  3. Stop the other process (PID {existing_pid}) first"
    )

# Write our PID (overwrites stale lock from a dead process).
try:
    os.makedirs(data_dir, exist_ok=True)
    with open(lock_path, "w") as f:
        f.write(str(my_pid))
except OSError as exc:
    logger.warning("Could not write PID lock %s: %s", lock_path, exc)
return lock_path
```

⚠️ Potential issue | 🔴 Critical

Lock acquisition is race-prone and currently fails open on write errors.

The read -> check -> open("w") flow is non-atomic, so two processes can both “acquire” the lock. Also, returning success when writing the lock fails removes the safety guarantee and can reintroduce data corruption.

🔧 Proposed fix (atomic create + fail-closed)
```diff
 def acquire_data_dir_lock(data_dir: str) -> str:
@@
-    existing_pid = _read_pid_file(lock_path)
-    if existing_pid and existing_pid != my_pid and _is_pid_alive(existing_pid):
-        raise DataDirectoryLocked(
-            f"Another OpenViking process (PID {existing_pid}) is already using "
-            f"the data directory '{data_dir}'. Running multiple OpenViking "
-            f"instances on the same data directory causes silent storage "
-            f"contention and data corruption.\n\n"
-            f"To fix this, use one of these approaches:\n"
-            f"  1. Use HTTP mode: start a single openviking-server and connect "
-            f"via --transport http (recommended for multi-session hosts)\n"
-            f"  2. Use separate data directories for each instance\n"
-            f"  3. Stop the other process (PID {existing_pid}) first"
-        )
-
-    # Write our PID (overwrites stale lock from a dead process).
-    try:
-        os.makedirs(data_dir, exist_ok=True)
-        with open(lock_path, "w") as f:
-            f.write(str(my_pid))
-    except OSError as exc:
-        logger.warning("Could not write PID lock %s: %s", lock_path, exc)
-    return lock_path
+    os.makedirs(data_dir, exist_ok=True)
+    while True:
+        try:
+            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
+            with os.fdopen(fd, "w") as f:
+                f.write(str(my_pid))
+            break
+        except FileExistsError:
+            existing_pid = _read_pid_file(lock_path)
+            if existing_pid == my_pid:
+                break  # re-entrant acquire by same process
+            if existing_pid and _is_pid_alive(existing_pid):
+                raise DataDirectoryLocked(
+                    f"Another OpenViking process (PID {existing_pid}) is already using "
+                    f"the data directory '{data_dir}'. Running multiple OpenViking "
+                    f"instances on the same data directory causes silent storage "
+                    f"contention and data corruption.\n\n"
+                    f"To fix this, use one of these approaches:\n"
+                    f"  1. Use HTTP mode: start a single openviking-server and connect "
+                    f"via --transport http (recommended for multi-session hosts)\n"
+                    f"  2. Use separate data directories for each instance\n"
+                    f"  3. Stop the other process (PID {existing_pid}) first"
+                )
+            # stale/corrupt lock; try to replace
+            try:
+                os.remove(lock_path)
+            except FileNotFoundError:
+                pass
+        except OSError as exc:
+            raise RuntimeError(f"Could not acquire PID lock {lock_path}: {exc}") from exc
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/utils/process_lock.py` around lines 58 - 79, The current read ->
check -> open("w") flow in acquire lock is race-prone and returns success when
writing fails; replace it with an atomic create and fail-closed behavior:
attempt to create the lock file using an atomic open (os.open(lock_path,
os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)), write my_pid bytes and close; if
os.open raises FileExistsError, re-read via _read_pid_file(lock_path) and if the
existing PID is alive (use _is_pid_alive) raise DataDirectoryLocked (including
existing_pid and data_dir), otherwise remove the stale file and retry the atomic
create once; any OSError when creating or writing the lock should raise an
exception (do not return lock_path) so callers know acquisition failed. Use the
existing symbols _read_pid_file, _is_pid_alive, DataDirectoryLocked, lock_path,
my_pid, and data_dir to locate and implement this change.
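The heart of the proposed fix is the atomic create. A minimal runnable sketch of just that step, with the retry loop, liveness check, and stale-lock cleanup omitted:

```python
import os
import tempfile

def try_acquire(lock_path: str, pid: int) -> bool:
    """Attempt an atomic, fail-closed lock acquisition.

    O_CREAT | O_EXCL makes create-if-absent a single syscall, so two
    processes racing on the same path cannot both succeed.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else holds (or held) the lock
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))
    return True

lock = os.path.join(tempfile.mkdtemp(), "openviking.pid")
first = try_acquire(lock, 111)   # file created atomically
second = try_acquire(lock, 222)  # path exists, acquisition refused
```

Unlike the original read-check-write flow, there is no window between the existence check and the write where a second process can slip in.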

Comment on lines +51 to +61
```python
def test_error_message_includes_remediation(self):
    with tempfile.TemporaryDirectory() as tmpdir:
        lock_path = os.path.join(tmpdir, LOCK_FILENAME)
        with open(lock_path, "w") as f:
            f.write("1")
        try:
            acquire_data_dir_lock(tmpdir)
        except DataDirectoryLocked as exc:
            msg = str(exc)
            assert "openviking-server" in msg
            assert "separate data directories" in msg
```

⚠️ Potential issue | 🟠 Major

This test can pass without validating the expected exception path.

If acquire_data_dir_lock() does not raise, the test currently succeeds. Add an explicit failure when no DataDirectoryLocked is raised.

✅ Minimal fix
```diff
     def test_error_message_includes_remediation(self):
@@
             try:
                 acquire_data_dir_lock(tmpdir)
+                raise AssertionError("Should have raised DataDirectoryLocked")
             except DataDirectoryLocked as exc:
                 msg = str(exc)
                 assert "openviking-server" in msg
                 assert "separate data directories" in msg
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/misc/test_process_lock.py` around lines 51 - 61, The test
test_error_message_includes_remediation currently will pass if
acquire_data_dir_lock(tmpdir) does not raise; update it to explicitly fail when
no DataDirectoryLocked is raised by wrapping the call in a try/except and adding
a fail path (e.g., call to pytest.fail or raising AssertionError) immediately
after acquire_data_dir_lock(tmpdir) so the test only succeeds when the except
DataDirectoryLocked as exc branch runs and validates the message; reference the
test function name test_error_message_includes_remediation and the exception
class DataDirectoryLocked as the places to change.
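The pattern is runnable in isolation with stand-in versions of `DataDirectoryLocked` and `acquire_data_dir_lock` (the real ones live in the process-lock module; these are simplified stubs):

```python
class DataDirectoryLocked(RuntimeError):
    pass

def acquire_data_dir_lock(data_dir: str) -> str:
    # Stand-in: behaves as if another live process already holds the lock.
    raise DataDirectoryLocked(
        f"'{data_dir}' is locked; use separate data directories or stop the other process"
    )

def test_error_message_includes_remediation():
    try:
        acquire_data_dir_lock("/tmp/demo")
        # Without this line the test passes silently when nothing is raised.
        raise AssertionError("Should have raised DataDirectoryLocked")
    except DataDirectoryLocked as exc:
        assert "separate data directories" in str(exc)

test_error_message_includes_remediation()
```

In a pytest suite, `with pytest.raises(DataDirectoryLocked) as excinfo:` expresses the same intent more idiomatically and fails automatically when no exception is raised.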

@chethanuk chethanuk closed this Mar 16, 2026