feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF) by chethanuk · Pull Request #1 · chethanuk/OpenViking

chethanuk · 2026-03-14T12:05:05Z

Summary

New GeminiDenseEmbedder: Native text + multimodal (image, video, audio, PDF) embedding via gemini-embedding-2-preview (google-genai 1.67.0). Additive provider pattern — Volcengine remains the default; Gemini is opt-in via provider: "gemini" in ov.conf.
Queue pipeline extended: EmbeddingMsg carries media_uri/media_mime_type (URI-in-queue pattern); TextEmbeddingHandler.on_dequeue() reads raw bytes from viking_fs and calls embed_multimodal(). Security: media_uri is validated to match context_data["uri"] before any file read.
Config wired end-to-end: EmbeddingConfig registers provider: "gemini" with task_type field. vectorize_file() sets Vectorize(media=ModalContent(...)) for supported image formats when Gemini is configured.

Spec

Model: gemini-embedding-2-preview
Input: text, image, video, audio, PDF (17 MIME types)
Output dimension: 128–3072 (default: 3072, recommended: 768 / 1536 / 3072)
Input token limit: 8,192 tokens
Supported MIME types: image/jpeg, image/png, image/gif, image/webp, audio/mpeg, audio/mp3, audio/wav, audio/ogg, audio/flac, video/mp4, video/mpeg, video/mov, video/avi, video/webm, video/wmv, video/3gpp, application/pdf

Security fixes (identified by Codex review)

media_uri validated against context_data["uri"] before viking_fs.read_file_bytes() — forged queue messages fall back to text embed
TODO documented: full RequestContext reconstruction at dequeue time (requires user_id not currently in EmbeddingMsg)

Test plan

28 unit tests (test_gemini_embedder.py) — all passing, zero real API calls
8 handler tests (test_collection_schemas.py) — includes security test for URI mismatch
11 queue/converter tests (test_embedding_msg_converter_tenant.py)
4 config tests (test_config_validation.py) — factory creates GeminiDenseEmbedder
10 parametrized MIME tests (test_embedding_utils_mime.py)
8 E2E integration tests (test_gemini_e2e.py) — real Gemini API, dimension=3072 default verified
Full suite: 184 passing, 6 pre-existing unrelated failures
Smoke test: text dim: 3072, multimodal dim: 768 ✓

Configuration example

{
  "embedding": {
    "dense": {
      "model": "gemini-embedding-2-preview",
      "api_key": "${GEMINI_API_KEY}",
      "dimension": 3072,
      "provider": "gemini",
      "task_type": "RETRIEVAL_DOCUMENT"
    }
  }
}

Summary by CodeRabbit

New Features
- Google Gemini dense embedder with multimodal (image/video/audio/PDF) support and task_type option
- Multimodal content support throughout indexing and messaging (media fields, MIME inference, provider-aware indexing)
- Example Gemini configuration file
Bug Fixes
- Improved fallbacks when media is missing, unreadable, or mismatched
Documentation
- EN & ZH guides updated with Gemini provider and multimodal examples
Tests
- Extensive unit and integration tests covering Gemini, multimodal, batching, and config validation
Chores
- Added Gemini client runtime dependency and test markers

coderabbitai · 2026-03-14T12:05:21Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cb0dbc3d-6b5b-41f4-b332-497ef47c208a

📥 Commits

Reviewing files that changed from the base of the PR and between 381a55a and dd574cf.

📒 Files selected for processing (3)

openviking/storage/queuefs/semantic_processor.py
openviking/utils/resource_processor.py
tests/misc/test_config_validation.py

📝 Walkthrough

Walkthrough

Adds Google Gemini multimodal dense embedder and end-to-end multimodal embedding paths: new GeminiDenseEmbedder, ModalContent support, propagation of media through embedding queue and storage, MIME inference and file vectorization updates, configuration changes for gemini, new tests, and example config.

Changes

Cohort / File(s)	Summary
Gemini embedder & base API `openviking/models/embedder/gemini_embedders.py`, `openviking/models/embedder/base.py`, `openviking/models/embedder/__init__.py`	Adds `GeminiDenseEmbedder` (dense, multimodal) with batch APIs, error handling, model-dimension mapping, transient/error handling and shutdown; `DenseEmbedderBase` gains `supports_multimodal` and `embed_multimodal`; module exports updated.
Core data model `openviking/core/context.py`	Adds `ModalContent` dataclass (mime_type, uri, optional bytes) and extends `Vectorize` to carry optional `media`.
Storage & queue handling `openviking/storage/queuefs/embedding_msg.py`, `openviking/storage/queuefs/embedding_msg_converter.py`, `openviking/storage/collection_schemas.py`	`EmbeddingMsg` gains `media_uri`, `media_mime_type`, and `id`; converter propagates media metadata and falls back to context.uri when text empty; collection writer routes to `embed_multimodal` when media present and embedder supports multimodal, reads bytes from VikingFS, and falls back to text embedding on failures.
Embedding utilities & resource processing `openviking/utils/embedding_utils.py`, `openviking/utils/resource_processor.py`, `openviking/storage/queuefs/semantic_processor.py`	Adds MIME maps and inference helpers; `vectorize_file`/`index_resource` accept `embedding_provider` and create `ModalContent` for Gemini-supported media; callers resolve and pass provider from config.
Configuration & packaging `openviking_cli/utils/config/embedding_config.py`, `examples/ov.conf.gemini.example`, `pyproject.toml`	Adds `task_type` to embedding config, validates `gemini` provider and requires API key; factory maps `("gemini","dense")` to GeminiDenseEmbedder; example config added; `google-genai>=0.8.0` added to dependencies.
Tests: unit, storage, integration `tests/unit/`, `tests/storage/`, `tests/integration/test_gemini_e2e.py`, `tests/misc/test_config_validation.py`	Extensive tests for Gemini embedder, MIME inference, ModalContent/Vectorize, EmbeddingMsg serialization, multimodal routing/fallbacks, batch behavior, normalization and semantic checks; includes integration E2E test requiring GEMINI_API_KEY and several unit/storage tests.
Docs & examples `docs/en/guides/01-configuration.md`, `docs/zh/guides/01-configuration.md`, `examples/ov.conf.gemini.example`	Configuration docs updated to list `gemini` as a dense multimodal provider with examples, supported input types, model/dimension info, and an example config file.

Sequence Diagram

sequenceDiagram
    autonumber
    participant Client
    participant QueueHandler
    participant CollectionWriter
    participant VikingFS
    participant GeminiEmbedder
    participant GeminiAPI
    participant VectorDB as Database

    Client->>QueueHandler: enqueue EmbeddingMsg (may include media_uri)
    QueueHandler->>CollectionWriter: deliver embedding task
    CollectionWriter->>GeminiEmbedder: check supports_multimodal
    alt multimodal supported and media_uri present and matches context.uri
        CollectionWriter->>VikingFS: read_file_bytes(media_uri)
        VikingFS-->>CollectionWriter: bytes
        CollectionWriter->>GeminiEmbedder: embed_multimodal(text + media bytes)
        GeminiEmbedder->>GeminiAPI: embed_content(text + media)
        GeminiAPI-->>GeminiEmbedder: embedding vector
        GeminiEmbedder-->>CollectionWriter: EmbedResult
    else mismatch or read fails or not supported
        CollectionWriter->>GeminiEmbedder: embed(text)
        GeminiEmbedder->>GeminiAPI: embed_content(text only)
        GeminiAPI-->>GeminiEmbedder: embedding vector
        GeminiEmbedder-->>CollectionWriter: EmbedResult
    end
    CollectionWriter->>VectorDB: write vector
    VectorDB-->>CollectionWriter: ack
    CollectionWriter-->>Client: done

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF) #1 — Implements Google Gemini multimodal embedding support (ModalContent, Vectorize.media, EmbeddingMsg media fields, GeminiDenseEmbedder, multimodal pipeline updates and tests).

Poem

🐇 I hopped from text to glossy frames tonight,
Gemini lent my whiskers a lens and light,
Bytes and words now twine in vector song,
I bounded through tests and danced along,
Carrots, pixels, vectors — all felt right.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding Gemini Embedding 2 with multimodal support. It is concise, specific, and directly reflects the primary objective of the changeset.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/gemini-embedding-impl

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2026-03-14T12:05:33Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the embedding capabilities by integrating Google's Gemini Embedding 2 model, enabling the system to process and embed multimodal content alongside traditional text. It introduces a robust framework for handling various media types within the embedding pipeline, ensuring secure and configurable operation. The changes allow for a more comprehensive representation of data by leveraging advanced multimodal AI models.

Highlights

Gemini Embedding 2 Multimodal Support: Introduced a new GeminiDenseEmbedder to support native text and multimodal (image, video, audio, PDF) embedding using the gemini-embedding-2-preview model. This is an opt-in provider via configuration.
Extended Queue Pipeline for Multimodal Content: The EmbeddingMsg now carries media_uri and media_mime_type to facilitate multimodal content processing. The TextEmbeddingHandler.on_dequeue() method was updated to read raw bytes from viking_fs and call embed_multimodal() when applicable.
End-to-End Configuration and Security: The EmbeddingConfig now registers the 'gemini' provider with a task_type field. A critical security validation was added to ensure media_uri matches context_data['uri'] before file reads, preventing forged queue messages from accessing arbitrary files. If validation fails or multimodal embedding fails, it falls back to text embedding.
Multimodal Content Representation: A new ModalContent dataclass was introduced to represent media references, including MIME type, URI, and optional raw data, enabling the Vectorize object to encapsulate both text and media for embedding.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

examples/ov.conf.gemini.example
- Added a new example configuration file for Gemini Embedding 2.
openviking/core/context.py
- Added ModalContent dataclass for media references.
- Extended Vectorize class to include an optional media field for multimodal content.
openviking/models/embedder/init.py
- Added Google Gemini: Dense only (multimodal) to the embedder list.
- Imported and exposed GeminiDenseEmbedder.
openviking/models/embedder/base.py
- Added supports_multimodal property to EmbedderBase to indicate multimodal capability.
- Implemented a default embed_multimodal method that falls back to text embedding.
openviking/models/embedder/gemini_embedders.py
- Added GeminiDenseEmbedder class for Google Gemini Embedding 2.
- Defined supported multimodal MIME types.
- Implemented embed, embed_multimodal, and embed_batch methods for Gemini API interaction.
- Included error handling and fallback logic for multimodal embedding failures.
openviking/storage/collection_schemas.py
- Modified on_dequeue to conditionally call embed_multimodal based on media_uri and media_mime_type.
- Implemented security validation to match media_uri with context_data['uri'].
- Added fallback to text embedding if multimodal processing fails or security validation is not met.
- Imported ModalContent and Vectorize for multimodal processing.
openviking/storage/queuefs/embedding_msg.py
- Added optional media_uri and media_mime_type fields to EmbeddingMsg.
- Updated __init__ and from_dict methods to handle new media fields.
openviking/storage/queuefs/embedding_msg_converter.py
- Updated from_context to populate media_uri and media_mime_type in EmbeddingMsg if Vectorize contains media.
- Added logic to use URI as fallback text if text is empty but media is present, preventing message drops.
openviking/utils/embedding_utils.py
- Added _IMAGE_MIME_MAP and _infer_image_mime for image MIME type detection.
- Integrated _GEMINI_MULTIMODAL_MIMES for supported image types.
- Modified vectorize_file to create ModalContent for image files when Gemini is configured and supported.
openviking_cli/utils/config/embedding_config.py
- Added task_type field to EmbeddingModelConfig for Gemini-specific configuration.
- Updated validate_config to include 'gemini' as a valid provider and enforce api_key for Gemini.
- Extended _create_embedder factory to instantiate GeminiDenseEmbedder.
pyproject.toml
- Added google-genai as a new dependency.
- Added integration marker for pytest to categorize tests that call external APIs.
tests/integration/test_gemini_e2e.py
- Added new end-to-end integration tests for GeminiDenseEmbedder covering text and multimodal embedding, batch processing, and task types.
tests/misc/test_config_validation.py
- Added new tests for Gemini provider configuration, including API key requirement and task type validation.
- Included a test to verify the factory correctly creates GeminiDenseEmbedder.
tests/storage/test_collection_schemas.py
- Added new multimodal path tests for TextEmbeddingHandler to verify correct embed_multimodal calls, fallback behavior, and security validation.
tests/storage/test_embedding_msg_converter_tenant.py
- Added tests for EmbeddingMsg to ensure media_uri and media_mime_type fields are handled correctly during serialization and deserialization.
- Included tests for EmbeddingMsgConverter to verify proper handling of media content in context conversion.
tests/unit/test_context_vectorize.py
- Added new unit tests for ModalContent and Vectorize classes, verifying field initialization and media handling.
tests/unit/test_embedder_base_multimodal.py
- Added new unit tests for the EmbedderBase multimodal protocol, confirming default supports_multimodal behavior and embed_multimodal fallback.
tests/unit/test_embedding_utils_mime.py
- Added new unit tests for _infer_image_mime helper function, covering various image file extensions.
tests/unit/test_gemini_embedder.py
- Added new unit tests for GeminiDenseEmbedder, covering initialization, text and multimodal embedding, batch processing, and error handling.

Activity

The author, chethanuk, implemented a new GeminiDenseEmbedder for multimodal embedding.
The queue pipeline was extended to support media_uri and media_mime_type in EmbeddingMsg.
End-to-end configuration for Gemini, including task_type, was wired.
Security validation for media_uri against context_data['uri'] was added to prevent unauthorized file access.
A TODO was documented regarding full RequestContext reconstruction at dequeue time, awaiting user_id propagation.
Comprehensive testing was performed, including 28 unit tests for test_gemini_embedder.py, 8 handler tests for test_collection_schemas.py, 11 queue/converter tests for test_embedding_msg_converter_tenant.py, 4 config tests for test_config_validation.py, 10 parametrized MIME tests for test_embedding_utils_mime.py, and 8 E2E integration tests for test_gemini_e2e.py.
The full test suite reported 184 passing tests, with 6 pre-existing unrelated failures.
Smoke tests verified text embedding dimension of 3072 and multimodal embedding dimension of 768.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for Gemini multimodal embeddings, which is a significant feature enhancement. The implementation is comprehensive, covering the embedder logic, queue pipeline extension, configuration, and security considerations. The addition of a new GeminiDenseEmbedder is well-designed, with robust error handling and fallback mechanisms. The test coverage is excellent, with extensive unit, integration, and end-to-end tests that provide high confidence in the changes. I have a few suggestions for improvement related to documentation clarity, code style, robustness, and a potential circular dependency. Overall, this is a high-quality contribution.

gemini-code-assist · 2026-03-14T12:07:32Z

openviking/utils/embedding_utils.py

                    return
+        elif content_type == ResourceContentType.IMAGE:
+            try:
+                from openviking_cli.utils.config import get_openviking_config


This local import from openviking_cli.utils.config import get_openviking_config introduces a dependency from the core openviking library to the openviking_cli application. This creates a circular dependency at the package level (openviking -> openviking_cli -> openviking) and violates the principle of separation of concerns. The core library should not depend on the CLI application.

A better approach would be to pass the configuration as a parameter to vectorize_file and its callers, rather than fetching it globally from a CLI-specific utility. This would require refactoring vectorize_file and index_resource to accept a configuration object.

gemini-code-assist · 2026-03-14T12:07:32Z

examples/ov.conf.gemini.example

+    "dimension MUST be 3072 — this is the native output size of gemini-embedding-2-preview.",
+    "Reducing dimension requires explicit truncation; the server will reject other values."


The comment regarding the dimension is slightly misleading. While 3072 is the native dimension for gemini-embedding-2-preview, the API supports other dimensions (e.g., 768, 1536). Stating that other values "will be rejected" could confuse users, as the code and API seem to handle them. Please consider clarifying this to reflect the model's capabilities better.

"The native output dimension is 3072. Other values (e.g., 768, 1536) are supported by the API.", "For best results, using one of the recommended dimensions (768, 1536, 3072) is advised."

gemini-code-assist · 2026-03-14T12:07:32Z

openviking/storage/collection_schemas.py

+                            from openviking.core.context import ModalContent, Vectorize
+                            from openviking.storage.viking_fs import get_viking_fs


For better readability and adherence to standard Python conventions (PEP 8), it's recommended to place imports at the top of the file rather than inside a method. This makes dependencies clearer and avoids repeated import lookups. Please move these imports to the top of the file.

gemini-code-assist · 2026-03-14T12:07:32Z

openviking/utils/embedding_utils.py

+                provider = getattr(dense_cfg, "provider", "").lower() if dense_cfg else ""
+                is_multimodal_provider = provider == "gemini"
+            except Exception:
+                is_multimodal_provider = False


The broad except Exception: clause here silently catches any error that might occur while checking the embedding provider configuration. This could hide potential configuration issues or bugs. It would be better to log the exception to aid in debugging.

Suggested change

is_multimodal_provider = False

logger.warning("Failed to check for multimodal provider, disabling feature.", exc_info=True)

is_multimodal_provider = False

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

openviking/utils/embedding_utils.py (1)

258-295: ⚠️ Potential issue | 🟠 Major

Gemini direct indexing still misses audio/video/PDF inputs.

Lines 269-289 only attach Vectorize.media for ResourceContentType.IMAGE. index_resource() calls vectorize_file() without summaries, so VIDEO/AUDIO files fall through to the summary-only branch at Lines 290-295, and PDFs never reach the multimodal path here at all. With Gemini configured, those files will still be skipped or text-only embedded instead of going through embed_multimodal().
openviking/storage/queuefs/embedding_msg.py (1)
9-31: ⚠️ Potential issue | 🟠 Major

id field is not serialized by asdict() — round-trip will lose message IDs.

The id attribute is assigned in __init__ (line 23) but is not declared as a dataclass field. Since to_dict() uses asdict(), which only serializes declared fields, the id will be missing from the serialized output. This breaks round-trip serialization despite from_dict() attempting to restore it at line 46.
🐛 Proposed fix: declare `id` as a dataclass field with a default factory
 `@dataclass`
 class EmbeddingMsg:
+    id: str = field(default_factory=lambda: str(uuid4()))
     message: Union[str, List[Dict[str, Any]]]
     context_data: Dict[str, Any]
     media_uri: Optional[str] = None
     media_mime_type: Optional[str] = None

-    def __init__(
-        self,
-        message: Union[str, List[Dict[str, Any]]],
-        context_data: Dict[str, Any],
-        media_uri: Optional[str] = None,
-        media_mime_type: Optional[str] = None,
-    ):
-        self.id = str(uuid4())
-        self.message = message
-        self.context_data = context_data
-        self.media_uri = media_uri
-        self.media_mime_type = media_mime_type
Update the import:
-from dataclasses import asdict, dataclass
+from dataclasses import asdict, dataclass, field
And update from_dict to accept id as constructor parameter:
     `@classmethod`
     def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingMsg":
         """Create an embedding message object from dictionary."""
         obj = EmbeddingMsg(
+            id=data.get("id", str(uuid4())),
             message=data["message"],
             context_data=data["context_data"],
             media_uri=data.get("media_uri"),
             media_mime_type=data.get("media_mime_type"),
         )
-        obj.id = data.get("id", obj.id)
         return obj
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/queuefs/embedding_msg.py` around lines 9 - 31, The class
EmbeddingMsg currently assigns self.id in __init__ but id is not declared as a
dataclass field so to_dict() via asdict() omits it; declare id as a dataclass
field (e.g., id: str = field(default_factory=lambda: str(uuid4()))) and add the
necessary imports (field, uuid4), then update from_dict to accept/pass id when
constructing the EmbeddingMsg (or construct via EmbeddingMsg(**data)) so
round-trip serialization preserves the id; remove the manual self.id assignment
in __init__ if you rely on the dataclass default factory.

🧹 Nitpick comments (8)

tests/unit/test_context_vectorize.py (1)

4-4: Unused pytest import.

The pytest import on line 4 is not used in any of the tests. Consider removing it.

🧹 Proposed fix

 # SPDX-License-Identifier: Apache-2.0
 """Tests for ModalContent and Vectorize media extension."""
-import pytest
 from openviking.core.context import ModalContent, Vectorize

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_context_vectorize.py` at line 4, Remove the unused top-level
import of pytest (the "import pytest" statement) since it isn't referenced
anywhere in the test file; simply delete that import line to eliminate the
unused import warning and keep the test module clean.

examples/ov.conf.gemini.example (1)

5-6: Clarify dimension flexibility in comments.

The comment states dimension "MUST be 3072" and the server will reject other values, but according to the GeminiDenseEmbedder docstring, Gemini supports output dimensions 128–3072 (recommended: 768, 1536, 3072). Consider updating the comment to reflect that 3072 is the default and other values within the supported range are valid.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/ov.conf.gemini.example` around lines 5 - 6, Update the comment to
clarify that 3072 is the default/native output size for
gemini-embedding-2-preview but not strictly required; mention that Gemini
supports output dimensions in the range 128–3072 (with recommended values 768,
1536, 3072) and that reducing the dimension requires explicit truncation (the
server will reject dimensions outside that supported range). Reference the
gemini-embedding-2-preview model and the GeminiDenseEmbedder docstring so
readers know where the supported range is defined.

openviking/models/embedder/gemini_embedders.py (2)

145-151: Batch fallback may propagate RuntimeError from individual embeds.

The batch fallback catches APIError and falls back to individual embed() calls. However, embed() wraps APIError in RuntimeError (line 97). If a single text in the batch fails, the RuntimeError will propagate and the remaining texts in that batch won't be processed.

Consider wrapping the individual fallback calls to ensure all texts are attempted:

🔧 Proposed fix for robust batch fallback

             except APIError as e:
                 logger.warning(
                     f"Gemini batch embed failed (code={e.code}) for batch of {len(batch)}, "
                     "falling back to individual calls"
                 )
                 for text in batch:
-                    results.append(self.embed(text))
+                    try:
+                        results.append(self.embed(text))
+                    except RuntimeError as re:
+                        logger.error(f"Individual embed failed: {re}")
+                        # Return empty vector or re-raise depending on desired behavior
+                        raise

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/gemini_embedders.py` around lines 145 - 151, The
batch fallback currently calls self.embed(text) for each text inside the except
APIError block, but self.embed() wraps APIError into RuntimeError so a failure
on one text will abort the loop; update the fallback to call self.embed(text)
inside a per-item try/except (catch RuntimeError and APIError) so failures are
logged (use logger.warning/error with the text index or snippet and exception)
and the loop continues, ensuring you still append successful embeddings to
results and handle/append a placeholder (e.g., None) or skip entries
consistently for failed items; adjust code in the except APIError block around
variables batch, results, and self.embed to implement this per-item error
handling.

99-128: Consider handling empty text in multimodal fallback.

When media is None, has no data, or has an unsupported MIME type, the code falls back to self.embed(vectorize.text). If vectorize.text is empty or None, this will embed an empty string, which may produce unexpected results.

Consider adding a guard or logging when falling back with empty text:

🛡️ Proposed defensive check

         if (
             media is None
             or media.data is None
             or media.mime_type not in _SUPPORTED_MULTIMODAL_MIMES
         ):
+            if not vectorize.text:
+                logger.warning("embed_multimodal called with no usable media and empty text")
             return self.embed(vectorize.text)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@openviking/models/embedder/gemini_embedders.py` around lines 99 - 128, The
fallback in embed_multimodal currently calls self.embed(vectorize.text) even
when vectorize.text is empty/None; add a defensive check before any fallback
(both the early-return branch and the exception fallback) to detect falsy/blank
vectorize.text, log a warning via logger.warning mentioning embed_multimodal and
media.uri, and return a safe empty EmbedResult (e.g.,
EmbedResult(dense_vector=[])) instead of calling self.embed when text is empty;
update the logic inside embed_multimodal (references: embed_multimodal, embed,
EmbedResult, vectorize.text, media.uri) to perform this check in both fallback
paths.

openviking/utils/embedding_utils.py (1)

32-34: Move the shared MIME allowlist out of the Gemini private module.

embedding_utils is generic ingestion code, but this now imports _SUPPORTED_MULTIMODAL_MIMES, a private implementation detail of openviking.models.embedder.gemini_embedders, at module import time. That makes unrelated callers sensitive to provider-internal refactors. Prefer exposing the allowlist from a neutral/public module instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@openviking/utils/embedding_utils.py` around lines 32 - 34, The module
embedding_utils imports the private symbol _SUPPORTED_MULTIMODAL_MIMES from
openviking.models.embedder.gemini_embedders and derives
_GEMINI_SUPPORTED_IMAGE_MIMES from it, coupling generic ingestion code to a
provider-internal detail; instead move or expose the MIME allowlist as a
neutral/public constant (e.g. SUPPORTED_MULTIMODAL_MIMES or
SUPPORTED_IMAGE_MIMES) in a shared module (for example a new
openviking.models.embedder.mime or openviking.utils.mime) and update
embedding_utils to import that public symbol and derive the frozenset there
(replace references to _SUPPORTED_MULTIMODAL_MIMES and
_GEMINI_SUPPORTED_IMAGE_MIMES with the new public names).

openviking/storage/queuefs/embedding_msg_converter.py (1)

26-29: Keep media-only messages textless instead of injecting the URI.

The dequeue path later passes embedding_msg.message into Vectorize(text=...) for embed_multimodal(), so this fallback becomes part of the model input, not just queue bookkeeping. For media-only items that means embedding viking://.../file.ext alongside the bytes and also using the URI as the text-only fallback payload.

Possible direction

         vectorization_text = context.get_vectorization_text()
-        # If text is empty but media is present, use URI as fallback so the message isn't dropped
-        if not vectorization_text and getattr(context.vectorize, "media", None) is not None:
-            vectorization_text = context.uri or ""
-        if not vectorization_text:
+        has_media = getattr(context.vectorize, "media", None) is not None
+        if not vectorization_text and not has_media:
             return None

-            message=vectorization_text,
+            message=vectorization_text or "",

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@openviking/storage/queuefs/embedding_msg_converter.py` around lines 26 - 29,
Remove the fallback that injects context.uri into vectorization_text for
media-only items so media-only messages remain textless; specifically, in
embedding_msg_converter.py do not set vectorization_text = context.uri when
getattr(context.vectorize, "media", None) is not None — leave vectorization_text
empty for media-only cases so embedding_msg.message (and subsequently
Vectorize(text=...) used by embed_multimodal()) does not receive the URI as
model input; if bookkeeping needs a non-empty identifier, handle that separately
(e.g., store URI in metadata) rather than in vectorization_text.

tests/storage/test_embedding_msg_converter_tenant.py (1)

43-45: Consolidate imports at the top of the file.

The imports at lines 45 and 75-76 are placed mid-file. Additionally, EmbeddingMsgConverter at line 76 duplicates the import at line 9. Moving all imports to the top improves readability and follows PEP 8 conventions.
♻️ Proposed refactor

Move these imports to the top of the file with the other imports:
 from openviking.core.context import Context
+from openviking.core.context import ModalContent, Vectorize
 from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter
+from openviking.storage.queuefs.embedding_msg import EmbeddingMsg
 from openviking_cli.session.user_id import UserIdentifier
Then remove the mid-file imports at lines 45 and 75-76.
Also applies to: 75-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/storage/test_embedding_msg_converter_tenant.py` around lines 43 - 45,
Imports for EmbeddingMsg and EmbeddingMsgConverter are currently placed
mid-file; move the import statements for EmbeddingMsg (from
openviking.storage.queuefs.embedding_msg) and EmbeddingMsgConverter to the top
alongside the other imports, remove the duplicate mid-file import(s), and ensure
there's only one canonical import per symbol (EmbeddingMsg,
EmbeddingMsgConverter) at the module-level to satisfy PEP8 and avoid
duplication.

tests/storage/test_collection_schemas.py (1)

220-250: Consider patching VikingFS in the security test for robustness.

The test relies on the URI mismatch check happening before the file read. While this matches the current implementation, adding a _FakeVikingFS that raises an error would make the test more defensive against future code changes that might reorder the checks.

♻️ Suggested defensive patch

+    class _FailingVikingFS:
+        async def read_file_bytes(self, uri, ctx=None):
+            raise AssertionError("Should not attempt to read mismatched URI")
+
-    handler = TextEmbeddingHandler(_FakeVikingDB())
-    handler.set_callbacks(on_success=lambda: None, on_error=lambda *_: None)
-    await handler.on_dequeue(payload)
+    with patch("openviking.storage.viking_fs.get_viking_fs", return_value=_FailingVikingFS()):
+        handler = TextEmbeddingHandler(_FakeVikingDB())
+        handler.set_callbacks(on_success=lambda: None, on_error=lambda *_: None)
+        await handler.on_dequeue(payload)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/storage/test_collection_schemas.py` around lines 220 - 250, The test
test_handler_rejects_media_uri_not_matching_context_uri should also patch
VikingFS to ensure no file read occurs if the media_uri mismatches; add a
_FakeVikingFS class that raises an exception on read/open and monkeypatch it
into the code path the TextEmbeddingHandler uses (so that any accidental reorder
of checks that tries to access the file will fail the test), keep the existing
_FakeVikingDB and assertions that embedder.embed_multimodal_calls == 0 and
embedder.embed_calls == 1 to verify fallback-to-text behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openviking_cli/utils/config/embedding_config.py`:
- Around line 92-95: The Gemini branch raises if api_key is missing but doesn't
ensure embedding dimensionality matches GeminiDenseEmbedder's default (3072),
causing init_context_collection and TextEmbeddingHandler to provision 2048-dim
collections while Gemini produces 3072-dim vectors; fix by either requiring an
explicit dimension when provider == "gemini" in the same branch (raise
ValueError if config.embedding.dimension is unset) or update get_dimension() to
detect provider == "gemini" and return the Gemini default based on model
(aligning GeminiDenseEmbedder and config.embedding.dimension); update references
in GeminiDenseEmbedder, get_dimension(), init_context_collection, and
TextEmbeddingHandler so the dimension used for provisioning and validation is
consistent.

In `@openviking/models/embedder/gemini_embedders.py`:
- Around line 157-161: The close() method currently accesses the private
attribute self.client._http_client; change it to use the public API by calling
the client's close method (e.g., call self.client.close() or check
hasattr(self.client, "close") and call it), removing direct access to
_http_client and optionally wrapping the call in a try/except to log or ignore
cleanup errors; update the close method in gemini_embedders.py (the def
close(self) implementation) to use genai.Client.close instead of
self.client._http_client.

In `@openviking/storage/collection_schemas.py`:
- Around line 200-223: The code currently calls
get_viking_fs().read_file_bytes(embedding_msg.media_uri, ctx=None) which permits
cross-tenant reads because ctx=None; replace this with a tenant-scoped
RequestContext constructed from embedding_msg.context_data (e.g.
account_id/tenant id) and pass it as the ctx argument to read_file_bytes (or, if
a valid tenant context cannot be reconstructed, skip the file branch and fall
back to text embedding to fail closed). Update the branch around embedding_msg,
expected_uri, get_viking_fs, and Vectorize to require a reconstructed
RequestContext (or bail out) before calling read_file_bytes.
- Around line 216-241: The current broad except around calling
self._embedder.embed_multimodal and viking_fs.read_file_bytes swallows transient
embedder failures (e.g., HTTP 429/5xx) and falls back to text-only embedding;
change the error handling in the try/except in collection_schemas.py so that
embed_multimodal errors (from GeminiDenseEmbedder via
self._embedder.embed_multimodal and self._embedder.embed) are not caught and
suppressed: only catch and handle IO/validation errors coming from
get_viking_fs() / viking_fs.read_file_bytes or construction of
Vectorize/ModalContent and, for other exceptions originating from the embedder,
re-raise them (or log as error and raise) so callers can retry; identify this
logic around the try block using the symbols Vectorize, ModalContent,
get_viking_fs, viking_fs.read_file_bytes, self._embedder.embed_multimodal, and
self._embedder.embed, and implement selective exception handling (or exception
filtering) to reserve text fallback solely for non-transient read/validation
failures.

In `@pyproject.toml`:
- Around line 237-241: The pytest-asyncio version ranges are inconsistent
between [dependency-groups].dev (pytest-asyncio>=1.3.0) and
[project.optional-dependencies].test (pytest-asyncio>=0.21.0); pick a single
range (e.g., pytest-asyncio>=1.3.0) and update the other entry so both
[dependency-groups].dev and [project.optional-dependencies].test reference the
same pytest-asyncio version constraint to avoid conflicting installs.

---

Outside diff comments:
In `@openviking/storage/queuefs/embedding_msg.py`:
- Around line 9-31: The class EmbeddingMsg currently assigns self.id in __init__
but id is not declared as a dataclass field so to_dict() via asdict() omits it;
declare id as a dataclass field (e.g., id: str = field(default_factory=lambda:
str(uuid4()))) and add the necessary imports (field, uuid4), then update
from_dict to accept/pass id when constructing the EmbeddingMsg (or construct via
EmbeddingMsg(**data)) so round-trip serialization preserves the id; remove the
manual self.id assignment in __init__ if you rely on the dataclass default
factory.

---

Nitpick comments:
In `@examples/ov.conf.gemini.example`:
- Around line 5-6: Update the comment to clarify that 3072 is the default/native
output size for gemini-embedding-2-preview but not strictly required; mention
that Gemini supports output dimensions in the range 128–3072 (with recommended
values 768, 1536, 3072) and that reducing the dimension requires explicit
truncation (the server will reject dimensions outside that supported range).
Reference the gemini-embedding-2-preview model and the GeminiDenseEmbedder
docstring so readers know where the supported range is defined.

In `@openviking/models/embedder/gemini_embedders.py`:
- Around line 145-151: The batch fallback currently calls self.embed(text) for
each text inside the except APIError block, but self.embed() wraps APIError into
RuntimeError so a failure on one text will abort the loop; update the fallback
to call self.embed(text) inside a per-item try/except (catch RuntimeError and
APIError) so failures are logged (use logger.warning/error with the text index
or snippet and exception) and the loop continues, ensuring you still append
successful embeddings to results and handle/append a placeholder (e.g., None) or
skip entries consistently for failed items; adjust code in the except APIError
block around variables batch, results, and self.embed to implement this per-item
error handling.
- Around line 99-128: The fallback in embed_multimodal currently calls
self.embed(vectorize.text) even when vectorize.text is empty/None; add a
defensive check before any fallback (both the early-return branch and the
exception fallback) to detect falsy/blank vectorize.text, log a warning via
logger.warning mentioning embed_multimodal and media.uri, and return a safe
empty EmbedResult (e.g., EmbedResult(dense_vector=[])) instead of calling
self.embed when text is empty; update the logic inside embed_multimodal
(references: embed_multimodal, embed, EmbedResult, vectorize.text, media.uri) to
perform this check in both fallback paths.

In `@openviking/storage/queuefs/embedding_msg_converter.py`:
- Around line 26-29: Remove the fallback that injects context.uri into
vectorization_text for media-only items so media-only messages remain textless;
specifically, in embedding_msg_converter.py do not set vectorization_text =
context.uri when getattr(context.vectorize, "media", None) is not None — leave
vectorization_text empty for media-only cases so embedding_msg.message (and
subsequently Vectorize(text=...) used by embed_multimodal()) does not receive
the URI as model input; if bookkeeping needs a non-empty identifier, handle that
separately (e.g., store URI in metadata) rather than in vectorization_text.

In `@openviking/utils/embedding_utils.py`:
- Around line 32-34: The module embedding_utils imports the private symbol
_SUPPORTED_MULTIMODAL_MIMES from openviking.models.embedder.gemini_embedders and
derives _GEMINI_SUPPORTED_IMAGE_MIMES from it, coupling generic ingestion code
to a provider-internal detail; instead move or expose the MIME allowlist as a
neutral/public constant (e.g. SUPPORTED_MULTIMODAL_MIMES or
SUPPORTED_IMAGE_MIMES) in a shared module (for example a new
openviking.models.embedder.mime or openviking.utils.mime) and update
embedding_utils to import that public symbol and derive the frozenset there
(replace references to _SUPPORTED_MULTIMODAL_MIMES and
_GEMINI_SUPPORTED_IMAGE_MIMES with the new public names).

In `@tests/storage/test_collection_schemas.py`:
- Around line 220-250: The test
test_handler_rejects_media_uri_not_matching_context_uri should also patch
VikingFS to ensure no file read occurs if the media_uri mismatches; add a
_FakeVikingFS class that raises an exception on read/open and monkeypatch it
into the code path the TextEmbeddingHandler uses (so that any accidental reorder
of checks that tries to access the file will fail the test), keep the existing
_FakeVikingDB and assertions that embedder.embed_multimodal_calls == 0 and
embedder.embed_calls == 1 to verify fallback-to-text behavior.

In `@tests/storage/test_embedding_msg_converter_tenant.py`:
- Around line 43-45: Imports for EmbeddingMsg and EmbeddingMsgConverter are
currently placed mid-file; move the import statements for EmbeddingMsg (from
openviking.storage.queuefs.embedding_msg) and EmbeddingMsgConverter to the top
alongside the other imports, remove the duplicate mid-file import(s), and ensure
there's only one canonical import per symbol (EmbeddingMsg,
EmbeddingMsgConverter) at the module-level to satisfy PEP8 and avoid
duplication.

In `@tests/unit/test_context_vectorize.py`:
- Line 4: Remove the unused top-level import of pytest (the "import pytest"
statement) since it isn't referenced anywhere in the test file; simply delete
that import line to eliminate the unused import warning and keep the test module
clean.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f99c6b3c-2b76-4661-9332-5c7a982d166e

📥 Commits

Reviewing files that changed from the base of the PR and between 73c7072 and 1de958f.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (19)

examples/ov.conf.gemini.example
openviking/core/context.py
openviking/models/embedder/__init__.py
openviking/models/embedder/base.py
openviking/models/embedder/gemini_embedders.py
openviking/storage/collection_schemas.py
openviking/storage/queuefs/embedding_msg.py
openviking/storage/queuefs/embedding_msg_converter.py
openviking/utils/embedding_utils.py
openviking_cli/utils/config/embedding_config.py
pyproject.toml
tests/integration/test_gemini_e2e.py
tests/misc/test_config_validation.py
tests/storage/test_collection_schemas.py
tests/storage/test_embedding_msg_converter_tenant.py
tests/unit/test_context_vectorize.py
tests/unit/test_embedder_base_multimodal.py
tests/unit/test_embedding_utils_mime.py
tests/unit/test_gemini_embedder.py

coderabbitai · 2026-03-14T12:17:05Z

openviking_cli/utils/config/embedding_config.py

+        elif self.provider == "gemini":
+            if not self.api_key:
+                raise ValueError("Gemini provider requires 'api_key' to be set")
+


⚠️ Potential issue | 🔴 Critical

Gemini without an explicit dimension will provision the wrong vector size.

GeminiDenseEmbedder defaults to 3072 when dimension is unset, but init_context_collection() and TextEmbeddingHandler both consume config.embedding.dimension, which still falls back to 2048. A config created through this new Gemini path can therefore provision the collection/validator for 2048 and then reject 3072-dim vectors at ingest time. Either require dimension for Gemini here or make get_dimension() resolve the Gemini default from model.

Also applies to: 241-249

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openviking_cli/utils/config/embedding_config.py` around lines 92 - 95, The Gemini branch raises if api_key is missing but doesn't ensure embedding dimensionality matches GeminiDenseEmbedder's default (3072), causing init_context_collection and TextEmbeddingHandler to provision 2048-dim collections while Gemini produces 3072-dim vectors; fix by either requiring an explicit dimension when provider == "gemini" in the same branch (raise ValueError if config.embedding.dimension is unset) or update get_dimension() to detect provider == "gemini" and return the Gemini default based on model (aligning GeminiDenseEmbedder and config.embedding.dimension); update references in GeminiDenseEmbedder, get_dimension(), init_context_collection, and TextEmbeddingHandler so the dimension used for provisioning and validation is consistent.

coderabbitai · 2026-03-14T12:17:05Z

openviking/models/embedder/gemini_embedders.py

+    def close(self):
+        if hasattr(self.client, "_http_client"):
+            try:
+                self.client._http_client.close()
+            except Exception:


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

google-genai python Client close method

💡 Result:

In the google-genai Python SDK (google.genai), the client does have a close method:

Sync client: call client.close() to release underlying HTTP resources. [1][2]

Async client: use Client(...).aio and call await aclient.aclose(). [1]

Example:

from google.genai import Client client = Client() # ... make requests ... client.close()

Async:

from google.genai import Client aclient = Client().aio # ... await requests ... await aclient.aclose()

Alternatively, use context managers to auto-close:

from google.genai import Client with Client() as client: ...

from google.genai import Client async with Client().aio as aclient: ...

[1] (github.com) [2] (googleapis.github.io)

Citations:

1: https://github.com/googleapis/python-genai?utm_source=openai

2: https://googleapis.github.io/python-genai/?utm_source=openai

🏁 Script executed:

cat -n openviking/models/embedder/gemini_embedders.py | head -170 | tail -40

Repository: chethanuk/OpenViking

Length of output: 1977

🏁 Script executed:

head -80 openviking/models/embedder/gemini_embedders.py

Repository: chethanuk/OpenViking

Length of output: 2245

Use the public close() method instead of accessing the private _http_client attribute.

The genai.Client from the google-genai SDK provides a public close() method. Replace the private attribute access with the public API:

def close(self): if hasattr(self.client, 'close'): self.client.close()

Or simply call it directly if close() is always available:

def close(self): self.client.close()

Alternatively, use a context manager pattern when creating the client to avoid manual cleanup altogether.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openviking/models/embedder/gemini_embedders.py` around lines 157 - 161, The close() method currently accesses the private attribute self.client._http_client; change it to use the public API by calling the client's close method (e.g., call self.client.close() or check hasattr(self.client, "close") and call it), removing direct access to _http_client and optionally wrapping the call in a try/except to log or ignore cleanup errors; update the close method in gemini_embedders.py (the def close(self) implementation) to use genai.Client.close instead of self.client._http_client.

coderabbitai · 2026-03-14T12:17:05Z

openviking/storage/collection_schemas.py

+                    # Security: validate media_uri matches the record's own URI to prevent
+                    # forged queue messages from reading arbitrary files.
+                    expected_uri = embedding_msg.context_data.get("uri", "")
+                    if embedding_msg.media_uri != expected_uri:
+                        logger.warning(
+                            f"media_uri {embedding_msg.media_uri!r} does not match context uri "
+                            f"{expected_uri!r}, falling back to text embedding"
+                        )
+                        result: EmbedResult = await asyncio.to_thread(
+                            self._embedder.embed, embedding_msg.message
+                        )
+                    else:
+                        # TODO(security): reconstruct a tenant-scoped RequestContext from
+                        # context_data["account_id"] to prevent ROOT-context file reads.
+                        # Blocked on UserIdentifier requiring user_id/agent_id fields that
+                        # are not currently propagated through EmbeddingMsg.context_data.
+                        try:
+                            from openviking.core.context import ModalContent, Vectorize
+                            from openviking.storage.viking_fs import get_viking_fs
+
+                            viking_fs = get_viking_fs()
+                            # read_file_bytes is async — await directly (not asyncio.to_thread)
+                            raw_bytes = await viking_fs.read_file_bytes(embedding_msg.media_uri, ctx=None)
+                            vectorize = Vectorize(


⚠️ Potential issue | 🔴 Critical

ctx=None still allows cross-tenant media reads.

The URI equality check only proves the two fields inside the same queue message agree; it does not authorize that URI. Because read_file_bytes(..., ctx=None) runs outside tenant context, a crafted message that sets both context_data["uri"] and media_uri to another tenant's file will still read that file. This branch needs a tenant-scoped RequestContext, or it should fail closed until one can be reconstructed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openviking/storage/collection_schemas.py` around lines 200 - 223, The code currently calls get_viking_fs().read_file_bytes(embedding_msg.media_uri, ctx=None) which permits cross-tenant reads because ctx=None; replace this with a tenant-scoped RequestContext constructed from embedding_msg.context_data (e.g. account_id/tenant id) and pass it as the ctx argument to read_file_bytes (or, if a valid tenant context cannot be reconstructed, skip the file branch and fall back to text embedding to fail closed). Update the branch around embedding_msg, expected_uri, get_viking_fs, and Vectorize to require a reconstructed RequestContext (or bail out) before calling read_file_bytes.

coderabbitai · 2026-03-14T12:17:05Z

openviking/storage/collection_schemas.py

+                        try:
+                            from openviking.core.context import ModalContent, Vectorize
+                            from openviking.storage.viking_fs import get_viking_fs
+
+                            viking_fs = get_viking_fs()
+                            # read_file_bytes is async — await directly (not asyncio.to_thread)
+                            raw_bytes = await viking_fs.read_file_bytes(embedding_msg.media_uri, ctx=None)
+                            vectorize = Vectorize(
+                                text=embedding_msg.message,
+                                media=ModalContent(
+                                    mime_type=embedding_msg.media_mime_type,
+                                    uri=embedding_msg.media_uri,
+                                    data=raw_bytes,
+                                ),
+                            )
+                            result: EmbedResult = await asyncio.to_thread(
+                                self._embedder.embed_multimodal, vectorize
+                            )
+                        except Exception as e:
+                            logger.warning(
+                                f"Multimodal embed failed for {embedding_msg.media_uri!r}: {e}, "
+                                "falling back to text embedding"
+                            )
+                            result: EmbedResult = await asyncio.to_thread(
+                                self._embedder.embed, embedding_msg.message
+                            )


⚠️ Potential issue | 🟠 Major

Propagate transient multimodal failures instead of downgrading them to text.

GeminiDenseEmbedder.embed_multimodal() already raises on 429/5xx so the caller can retry, but the broad except Exception here catches that and permanently stores a text-only vector instead. Re-raise transient embedder failures, or report them as errors, and reserve text fallback for non-transient read/validation problems.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openviking/storage/collection_schemas.py` around lines 216 - 241, The current broad except around calling self._embedder.embed_multimodal and viking_fs.read_file_bytes swallows transient embedder failures (e.g., HTTP 429/5xx) and falls back to text-only embedding; change the error handling in the try/except in collection_schemas.py so that embed_multimodal errors (from GeminiDenseEmbedder via self._embedder.embed_multimodal and self._embedder.embed) are not caught and suppressed: only catch and handle IO/validation errors coming from get_viking_fs() / viking_fs.read_file_bytes or construction of Vectorize/ModalContent and, for other exceptions originating from the embedder, re-raise them (or log as error and raise) so callers can retry; identify this logic around the try block using the symbols Vectorize, ModalContent, get_viking_fs, viking_fs.read_file_bytes, self._embedder.embed_multimodal, and self._embedder.embed, and implement selective exception handling (or exception filtering) to reserve text fallback solely for non-transient read/validation failures.

pyproject.toml

…() default fallback to EmbedderBase

… for multimodal pipeline

… from Vectorize.media

…dding support

…ype field

…a URIs with bytes from viking_fs

…configured; add _infer_image_mime helper

…I calls 7 E2E tests covering text embedding dimension/normalization, batch consistency, semantic similarity, multimodal (xfail→xpass), fallback, and task-type differentiation. Adds pytest integration marker to pyproject.toml.

…e_bytes context (security) - Reject multimodal embeds where media_uri != context_data["uri"] (forged queue message attack) - Fix read_file_bytes call: was missing ctx= kwarg and incorrectly wrapped in asyncio.to_thread (it is async) - Add TODO noting ROOT-context leak pending UserIdentifier fields in EmbeddingMsg

…ly messages

…or test

…o, PDF); add token limit constant; fix E2E default dimension to 3072

- Remove circular dep: drop get_openviking_config from vectorize_file() body; add embedding_provider param, callers pass config.dense.provider - Fix dead PDF multimodal path: check is_pdf before None branch so PDF files reach the Gemini media branch in vectorize_file() - Extend multimodal MIME coverage: add _MIME_MAP_VIDEO/_AUDIO/_PDF and _infer_media_mime() helper; VIDEO/AUDIO/PDF now set Vectorize.media - Fix EmbeddingMsg.id: declare as dataclass field(default_factory=uuid4) so asdict()/to_dict()/from_dict() round-trip preserves message ID - Add exc_info=True to broad except in embedding_utils file-read path - Fix misleading dimension comment in ov.conf.gemini.example - Move inline imports (get_openviking_config, traceback) to top-level in collection_schemas.py - Add 'gemini' to all provider descriptions/docstrings/error messages in embedding_config.py - Remove unused pytest import from test_context_vectorize.py - Add Gemini provider section (details block + inline example + table) to docs/en and docs/zh configuration guides - Add tests: EmbeddingMsg id roundtrip, _infer_media_mime for all types, parameterised vectorize_file multimodal coverage (png/mp4/mp3/pdf/openai)

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

docs/zh/guides/01-configuration.md (1)

102-106: Consider documenting task_type in Gemini examples.

task_type is now supported and useful for retrieval tuning; adding it to both Gemini snippets will reduce under-configuration by users.

📝 Suggested doc tweak

 {
   "embedding": {
     "dense": {
       "provider": "gemini",
       "api_key": "your-google-api-key",
       "model": "gemini-embedding-2-preview",
-      "dimension": 3072
+      "dimension": 3072,
+      "task_type": "RETRIEVAL_DOCUMENT"
     }
   }
 }

Also applies to: 214-218

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/zh/guides/01-configuration.md` around lines 102 - 106, The Gemini
provider examples currently lack the optional task_type setting; update both
Gemini configuration snippets that include "provider": "gemini" (the JSON blocks
with "api_key" and "model" such as "gemini-embedding-2-preview") to add a
"task_type" field (e.g., "task_type": "text-embedding" or the appropriate Gemini
task like "embedding") so users can tune retrieval behavior; ensure the new key
appears alongside "model" and "dimension" in each Gemini snippet and use
consistent example values across both occurrences.

tests/storage/test_embedding_msg_converter_tenant.py (1)

43-46: Consolidate duplicate imports at file top.

Context and EmbeddingMsgConverter are imported twice (lines 8-9 and lines 75-76). Consider moving all imports to the top of the file for cleaner organization.

 from openviking.core.context import Context
+from openviking.core.context import ModalContent, Vectorize
 from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter
+from openviking.storage.queuefs.embedding_msg import EmbeddingMsg
 from openviking_cli.session.user_id import UserIdentifier
 
-...
-
-from openviking.storage.queuefs.embedding_msg import EmbeddingMsg
-
-...
-
-from openviking.core.context import Context, ModalContent, Vectorize
-from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter

Also applies to: 75-77

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/storage/test_embedding_msg_converter_tenant.py` around lines 43 - 46,
The file has duplicate imports of Context and EmbeddingMsgConverter (and
EmbeddingMsg) scattered later in the test; consolidate by moving all imports for
Context, EmbeddingMsgConverter, and EmbeddingMsg to the top import block and
remove the redundant import statements (references: Context,
EmbeddingMsgConverter, EmbeddingMsg) so each symbol is imported once at file
head for cleaner organization.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@openviking/storage/queuefs/semantic_processor.py`:
- Around line 571-576: The try/except around the embedding provider lookup is
swallowing all exceptions and silently disabling Gemini vectorization; remove
the broad try/except and either (a) perform the safe nested getattr without
catching exceptions: embedding_provider =
getattr(get_openviking_config().embedding.dense, "provider", None) (or chained
getattr calls as currently used) so attribute errors surface, or (b) if you must
handle missing attributes, catch only AttributeError and log the error before
defaulting to None; update the lookup near embedding_provider and use
get_openviking_config(), embedding, and dense identifiers to locate the code.

In `@openviking/utils/resource_processor.py`:
- Around line 85-92: The current broad try/except around get_openviking_config()
swallows all errors and sets embedding_provider to None; change this to catch
only expected exceptions (e.g., AttributeError, ImportError, or config-specific
errors), log the failure with context using the module logger (include the
exception message/stack), and preserve raising unexpected exceptions. Refactor
the nested attribute lookup into a shared helper function (e.g.,
resolve_embedding_provider(config) or get_embedding_provider_from_config) and
call it from both resource_processor.py (where embedding_provider is set) and
openviking/storage/queuefs/semantic_processor.py so both use the same logic and
error handling.

In `@tests/misc/test_config_validation.py`:
- Line 215: Replace the scanner-triggering literal api_key="key" used in the
test tuple (model="gemini-embedding-2-preview", provider="gemini",
api_key="key", dimension=3072) with a clearly non-secret, scanner-safe test
literal such as api_key="DUMMY_API_KEY" or "test-api-key" so the api_key test
value is explicitly a dummy; update the api_key value wherever that exact tuple
or the api_key parameter is used in tests to avoid secret-scanning noise.

---

Nitpick comments:
In `@docs/zh/guides/01-configuration.md`:
- Around line 102-106: The Gemini provider examples currently lack the optional
task_type setting; update both Gemini configuration snippets that include
"provider": "gemini" (the JSON blocks with "api_key" and "model" such as
"gemini-embedding-2-preview") to add a "task_type" field (e.g., "task_type":
"text-embedding" or the appropriate Gemini task like "embedding") so users can
tune retrieval behavior; ensure the new key appears alongside "model" and
"dimension" in each Gemini snippet and use consistent example values across both
occurrences.

In `@tests/storage/test_embedding_msg_converter_tenant.py`:
- Around line 43-46: The file has duplicate imports of Context and
EmbeddingMsgConverter (and EmbeddingMsg) scattered later in the test;
consolidate by moving all imports for Context, EmbeddingMsgConverter, and
EmbeddingMsg to the top import block and remove the redundant import statements
(references: Context, EmbeddingMsgConverter, EmbeddingMsg) so each symbol is
imported once at file head for cleaner organization.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0b621305-a2df-4f1a-8314-b89fc935014d

📥 Commits

Reviewing files that changed from the base of the PR and between 1de958f and 23a4adb.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (24)

docs/en/guides/01-configuration.md
docs/zh/guides/01-configuration.md
examples/ov.conf.gemini.example
openviking/core/context.py
openviking/models/embedder/__init__.py
openviking/models/embedder/base.py
openviking/models/embedder/gemini_embedders.py
openviking/storage/collection_schemas.py
openviking/storage/queuefs/embedding_msg.py
openviking/storage/queuefs/embedding_msg_converter.py
openviking/storage/queuefs/semantic_processor.py
openviking/utils/embedding_utils.py
openviking/utils/resource_processor.py
openviking_cli/utils/config/embedding_config.py
pyproject.toml
tests/integration/test_gemini_e2e.py
tests/misc/test_config_validation.py
tests/storage/test_collection_schemas.py
tests/storage/test_embedding_msg_converter_tenant.py
tests/unit/test_context_vectorize.py
tests/unit/test_embedder_base_multimodal.py
tests/unit/test_embedding_msg.py
tests/unit/test_embedding_utils_mime.py
tests/unit/test_gemini_embedder.py

🚧 Files skipped from review as they are similar to previous changes (8)

tests/unit/test_embedder_base_multimodal.py
pyproject.toml
openviking_cli/utils/config/embedding_config.py
openviking/storage/queuefs/embedding_msg.py
tests/unit/test_context_vectorize.py
tests/unit/test_embedding_utils_mime.py
tests/storage/test_collection_schemas.py
openviking/models/embedder/base.py

openviking/storage/queuefs/semantic_processor.py

coderabbitai · 2026-03-14T13:38:05Z

openviking/utils/resource_processor.py

+        from openviking_cli.utils.config import get_openviking_config
+        try:
+            ov_config = get_openviking_config()
+            embedding_provider = (
+                getattr(getattr(getattr(ov_config, "embedding", None), "dense", None), "provider", None)
+            )
+        except Exception:
+            embedding_provider = None


⚠️ Potential issue | 🟡 Minor

Don’t swallow all provider-resolution errors silently.

This fallback can mask invalid config/state and silently disable provider-specific indexing behavior. Please at least log the failure and narrow exception scope; ideally extract this lookup into one shared helper used here and in openviking/storage/queuefs/semantic_processor.py.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@openviking/utils/resource_processor.py` around lines 85 - 92, The current broad try/except around get_openviking_config() swallows all errors and sets embedding_provider to None; change this to catch only expected exceptions (e.g., AttributeError, ImportError, or config-specific errors), log the failure with context using the module logger (include the exception message/stack), and preserve raising unexpected exceptions. Refactor the nested attribute lookup into a shared helper function (e.g., resolve_embedding_provider(config) or get_embedding_provider_from_config) and call it from both resource_processor.py (where embedding_provider is set) and openviking/storage/queuefs/semantic_processor.py so both use the same logic and error handling.

tests/misc/test_config_validation.py

- Add task_type to all 4 Gemini JSON snippets (details block + inline example in both en and zh docs) per CodeRabbit review - Consolidate duplicate imports in test_embedding_msg_converter_tenant.py: move EmbeddingMsg, ModalContent, Vectorize, and second EmbeddingMsgConverter import to top-level block, remove 3 mid-file inline import statements

…i_key literals - semantic_processor.py + resource_processor.py: change bare `except Exception:` to `except Exception as e:` with logger.warning(..., exc_info=True) so config load failures are visible instead of silently disabling multimodal path - test_config_validation.py: replace api_key="key" (2 occurrences) with api_key="test-key" to avoid secret-scanner noise

- Add embed_query() default to DenseEmbedderBase (delegates to embed()) - GeminiDenseEmbedder overrides embed_query() using _query_config (RETRIEVAL_QUERY); embed() uses _index_config (RETRIEVAL_DOCUMENT) - Update hierarchical_retriever + memory_deduplicator to call embed_query() - Deprecate task_type config field (still accepted, no validation error) - Add enable_multimodal: bool = False flag; supports_multimodal reflects it - Add embed_multimodal_batch / async_embed_multimodal_batch to base class - Add Gemini async_embed_multimodal_batch override (anyio semaphore) - Rewrite embed_multimodal: parts API + pdfminer PDF guard (gated by flag) - Fix PR volcengine#607 issues #1, #2, #4, #6

… embedding - Add ContentPart = Union[str, ModalContent] type alias - Add parts: Optional[List[ContentPart]] field to Vectorize - Add get_parts(): returns parts if set, else builds [text, media] from legacy fields - Add multi-part integration tests (TestGeminiE2EMultipartEmbedding) - Fixes PR volcengine#607 issue #1 (multi-part sequences)

gemini-code-assist bot reviewed Mar 14, 2026

View reviewed changes

coderabbitai bot reviewed Mar 14, 2026

View reviewed changes

chethanuk added 18 commits March 14, 2026 14:17

feat(context): add ModalContent dataclass and media field to Vectorize

a0d5000

feat(embedder): add supports_multimodal property and embed_multimodal…

a01dcee

…() default fallback to EmbedderBase

deps: add google-genai>=0.8.0 for Gemini Embedding 2 support

b84ece4

docs(examples): add Gemini Embedding 2 config example

db18240

feat(queue): add media_uri and media_mime_type fields to EmbeddingMsg…

1c8fa87

… for multimodal pipeline

feat(queue): EmbeddingMsgConverter forwards media_uri/media_mime_type…

a86d019

… from Vectorize.media

feat(embedder): add GeminiDenseEmbedder with text and multimodal embe…

5eadead

…dding support

feat(embedder): export GeminiDenseEmbedder from embedder package

44f265e

feat(config): register gemini provider in EmbeddingConfig with task_t…

718e22f

…ype field

feat(handler): TextEmbeddingHandler calls embed_multimodal() for medi…

fad66dc

…a URIs with bytes from viking_fs

feat(ingestion): set Vectorize.media for images when gemini provider …

580d660

…configured; add _infer_image_mime helper

fix(queue): EmbeddingMsgConverter uses URI fallback text for media-on…

999c4e0

…ly messages

fix(embedder): improve multimodal error logging; add transient APIErr…

d5156f6

…or test

refactor: centralize MIME allowlist to single source in gemini_embedders

2aee6b7

fix(embedder): expand multimodal MIME types to full Gemini spec (vide…

295f125

…o, PDF); add token limit constant; fix E2E default dimension to 3072

chethanuk force-pushed the feat/gemini-embedding-impl branch from 1de958f to 23a4adb Compare March 14, 2026 13:30

coderabbitai bot reviewed Mar 14, 2026

View reviewed changes

chethanuk added 2 commits March 14, 2026 20:07

chethanuk merged commit d565945 into main Mar 14, 2026

chethanuk deleted the feat/gemini-embedding-impl branch March 14, 2026 19:11

		"dimension MUST be 3072 — this is the native output size of gemini-embedding-2-preview.",
		"Reducing dimension requires explicit truncation; the server will reject other values."

		from openviking.core.context import ModalContent, Vectorize
		from openviking.storage.viking_fs import get_viking_fs

	is_multimodal_provider = False
	logger.warning("Failed to check for multimodal provider, disabling feature.", exc_info=True)
	is_multimodal_provider = False

Conversation

chethanuk commented Mar 14, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Spec

Security fixes (identified by Codex review)

Test plan

Configuration example

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist bot commented Mar 14, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Mar 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

chethanuk commented Mar 14, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 14, 2026 •

edited

Loading