Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds Google Gemini multimodal dense embedder and end-to-end multimodal embedding paths: new GeminiDenseEmbedder, ModalContent support, propagation of media through embedding queue and storage, MIME inference and file vectorization updates, configuration changes for gemini, new tests, and example config. Changes
Sequence DiagramsequenceDiagram
autonumber
participant Client
participant QueueHandler
participant CollectionWriter
participant VikingFS
participant GeminiEmbedder
participant GeminiAPI
participant VectorDB as Database
Client->>QueueHandler: enqueue EmbeddingMsg (may include media_uri)
QueueHandler->>CollectionWriter: deliver embedding task
CollectionWriter->>GeminiEmbedder: check supports_multimodal
alt multimodal supported and media_uri present and matches context.uri
CollectionWriter->>VikingFS: read_file_bytes(media_uri)
VikingFS-->>CollectionWriter: bytes
CollectionWriter->>GeminiEmbedder: embed_multimodal(text + media bytes)
GeminiEmbedder->>GeminiAPI: embed_content(text + media)
GeminiAPI-->>GeminiEmbedder: embedding vector
GeminiEmbedder-->>CollectionWriter: EmbedResult
else mismatch or read fails or not supported
CollectionWriter->>GeminiEmbedder: embed(text)
GeminiEmbedder->>GeminiAPI: embed_content(text only)
GeminiAPI-->>GeminiEmbedder: embedding vector
GeminiEmbedder-->>CollectionWriter: EmbedResult
end
CollectionWriter->>VectorDB: write vector
VectorDB-->>CollectionWriter: ack
CollectionWriter-->>Client: done
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the embedding capabilities by integrating Google's Gemini Embedding 2 model, enabling the system to process and embed multimodal content alongside traditional text. It introduces a robust framework for handling various media types within the embedding pipeline, ensuring secure and configurable operation. The changes allow for a more comprehensive representation of data by leveraging advanced multimodal AI models. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for Gemini multimodal embeddings, which is a significant feature enhancement. The implementation is comprehensive, covering the embedder logic, queue pipeline extension, configuration, and security considerations. The addition of a new GeminiDenseEmbedder is well-designed, with robust error handling and fallback mechanisms. The test coverage is excellent, with extensive unit, integration, and end-to-end tests that provide high confidence in the changes. I have a few suggestions for improvement related to documentation clarity, code style, robustness, and a potential circular dependency. Overall, this is a high-quality contribution.
openviking/utils/embedding_utils.py
Outdated
| return | ||
| elif content_type == ResourceContentType.IMAGE: | ||
| try: | ||
| from openviking_cli.utils.config import get_openviking_config |
There was a problem hiding this comment.
This local import from openviking_cli.utils.config import get_openviking_config introduces a dependency from the core openviking library to the openviking_cli application. This creates a circular dependency at the package level (openviking -> openviking_cli -> openviking) and violates the principle of separation of concerns. The core library should not depend on the CLI application.
A better approach would be to pass the configuration as a parameter to vectorize_file and its callers, rather than fetching it globally from a CLI-specific utility. This would require refactoring vectorize_file and index_resource to accept a configuration object.
examples/ov.conf.gemini.example
Outdated
| "dimension MUST be 3072 — this is the native output size of gemini-embedding-2-preview.", | ||
| "Reducing dimension requires explicit truncation; the server will reject other values." |
There was a problem hiding this comment.
The comment regarding the dimension is slightly misleading. While 3072 is the native dimension for gemini-embedding-2-preview, the API supports other dimensions (e.g., 768, 1536). Stating that other values "will be rejected" could confuse users, as the code and API seem to handle them. Please consider clarifying this to reflect the model's capabilities better.
"The native output dimension is 3072. Other values (e.g., 768, 1536) are supported by the API.",
"For best results, using one of the recommended dimensions (768, 1536, 3072) is advised."
| from openviking.core.context import ModalContent, Vectorize | ||
| from openviking.storage.viking_fs import get_viking_fs |
There was a problem hiding this comment.
openviking/utils/embedding_utils.py
Outdated
| provider = getattr(dense_cfg, "provider", "").lower() if dense_cfg else "" | ||
| is_multimodal_provider = provider == "gemini" | ||
| except Exception: | ||
| is_multimodal_provider = False |
There was a problem hiding this comment.
The broad except Exception: clause here silently catches any error that might occur while checking the embedding provider configuration. This could hide potential configuration issues or bugs. It would be better to log the exception to aid in debugging.
| is_multimodal_provider = False | |
| logger.warning("Failed to check for multimodal provider, disabling feature.", exc_info=True) | |
| is_multimodal_provider = False |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
openviking/utils/embedding_utils.py (1)
258-295:⚠️ Potential issue | 🟠 MajorGemini direct indexing still misses audio/video/PDF inputs.
Lines 269-289 only attach
Vectorize.mediaforResourceContentType.IMAGE.index_resource()callsvectorize_file()without summaries, so VIDEO/AUDIO files fall through to the summary-only branch at Lines 290-295, and PDFs never reach the multimodal path here at all. With Gemini configured, those files will still be skipped or text-only embedded instead of going throughembed_multimodal().openviking/storage/queuefs/embedding_msg.py (1)
9-31:⚠️ Potential issue | 🟠 Major
idfield is not serialized byasdict()— round-trip will lose message IDs.The
idattribute is assigned in__init__(line 23) but is not declared as a dataclass field. Sinceto_dict()usesasdict(), which only serializes declared fields, theidwill be missing from the serialized output. This breaks round-trip serialization despitefrom_dict()attempting to restore it at line 46.🐛 Proposed fix: declare `id` as a dataclass field with a default factory
`@dataclass` class EmbeddingMsg: + id: str = field(default_factory=lambda: str(uuid4())) message: Union[str, List[Dict[str, Any]]] context_data: Dict[str, Any] media_uri: Optional[str] = None media_mime_type: Optional[str] = None - def __init__( - self, - message: Union[str, List[Dict[str, Any]]], - context_data: Dict[str, Any], - media_uri: Optional[str] = None, - media_mime_type: Optional[str] = None, - ): - self.id = str(uuid4()) - self.message = message - self.context_data = context_data - self.media_uri = media_uri - self.media_mime_type = media_mime_typeUpdate the import:
-from dataclasses import asdict, dataclass +from dataclasses import asdict, dataclass, fieldAnd update
from_dictto accept id as constructor parameter:`@classmethod` def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingMsg": """Create an embedding message object from dictionary.""" obj = EmbeddingMsg( + id=data.get("id", str(uuid4())), message=data["message"], context_data=data["context_data"], media_uri=data.get("media_uri"), media_mime_type=data.get("media_mime_type"), ) - obj.id = data.get("id", obj.id) return obj🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/storage/queuefs/embedding_msg.py` around lines 9 - 31, The class EmbeddingMsg currently assigns self.id in __init__ but id is not declared as a dataclass field so to_dict() via asdict() omits it; declare id as a dataclass field (e.g., id: str = field(default_factory=lambda: str(uuid4()))) and add the necessary imports (field, uuid4), then update from_dict to accept/pass id when constructing the EmbeddingMsg (or construct via EmbeddingMsg(**data)) so round-trip serialization preserves the id; remove the manual self.id assignment in __init__ if you rely on the dataclass default factory.
🧹 Nitpick comments (8)
tests/unit/test_context_vectorize.py (1)
4-4: Unusedpytestimport.The
pytestimport on line 4 is not used in any of the tests. Consider removing it.🧹 Proposed fix
# SPDX-License-Identifier: Apache-2.0 """Tests for ModalContent and Vectorize media extension.""" -import pytest from openviking.core.context import ModalContent, Vectorize🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_context_vectorize.py` at line 4, Remove the unused top-level import of pytest (the "import pytest" statement) since it isn't referenced anywhere in the test file; simply delete that import line to eliminate the unused import warning and keep the test module clean.examples/ov.conf.gemini.example (1)
5-6: Clarify dimension flexibility in comments.The comment states dimension "MUST be 3072" and the server will reject other values, but according to the
GeminiDenseEmbedderdocstring, Gemini supports output dimensions 128–3072 (recommended: 768, 1536, 3072). Consider updating the comment to reflect that 3072 is the default and other values within the supported range are valid.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/ov.conf.gemini.example` around lines 5 - 6, Update the comment to clarify that 3072 is the default/native output size for gemini-embedding-2-preview but not strictly required; mention that Gemini supports output dimensions in the range 128–3072 (with recommended values 768, 1536, 3072) and that reducing the dimension requires explicit truncation (the server will reject dimensions outside that supported range). Reference the gemini-embedding-2-preview model and the GeminiDenseEmbedder docstring so readers know where the supported range is defined.openviking/models/embedder/gemini_embedders.py (2)
145-151: Batch fallback may propagateRuntimeErrorfrom individual embeds.The batch fallback catches
APIErrorand falls back to individualembed()calls. However,embed()wrapsAPIErrorinRuntimeError(line 97). If a single text in the batch fails, theRuntimeErrorwill propagate and the remaining texts in that batch won't be processed.Consider wrapping the individual fallback calls to ensure all texts are attempted:
🔧 Proposed fix for robust batch fallback
except APIError as e: logger.warning( f"Gemini batch embed failed (code={e.code}) for batch of {len(batch)}, " "falling back to individual calls" ) for text in batch: - results.append(self.embed(text)) + try: + results.append(self.embed(text)) + except RuntimeError as re: + logger.error(f"Individual embed failed: {re}") + # Return empty vector or re-raise depending on desired behavior + raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/models/embedder/gemini_embedders.py` around lines 145 - 151, The batch fallback currently calls self.embed(text) for each text inside the except APIError block, but self.embed() wraps APIError into RuntimeError so a failure on one text will abort the loop; update the fallback to call self.embed(text) inside a per-item try/except (catch RuntimeError and APIError) so failures are logged (use logger.warning/error with the text index or snippet and exception) and the loop continues, ensuring you still append successful embeddings to results and handle/append a placeholder (e.g., None) or skip entries consistently for failed items; adjust code in the except APIError block around variables batch, results, and self.embed to implement this per-item error handling.
99-128: Consider handling empty text in multimodal fallback.When
mediaisNone, has nodata, or has an unsupported MIME type, the code falls back toself.embed(vectorize.text). Ifvectorize.textis empty orNone, this will embed an empty string, which may produce unexpected results.Consider adding a guard or logging when falling back with empty text:
🛡️ Proposed defensive check
if ( media is None or media.data is None or media.mime_type not in _SUPPORTED_MULTIMODAL_MIMES ): + if not vectorize.text: + logger.warning("embed_multimodal called with no usable media and empty text") return self.embed(vectorize.text)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/models/embedder/gemini_embedders.py` around lines 99 - 128, The fallback in embed_multimodal currently calls self.embed(vectorize.text) even when vectorize.text is empty/None; add a defensive check before any fallback (both the early-return branch and the exception fallback) to detect falsy/blank vectorize.text, log a warning via logger.warning mentioning embed_multimodal and media.uri, and return a safe empty EmbedResult (e.g., EmbedResult(dense_vector=[])) instead of calling self.embed when text is empty; update the logic inside embed_multimodal (references: embed_multimodal, embed, EmbedResult, vectorize.text, media.uri) to perform this check in both fallback paths.openviking/utils/embedding_utils.py (1)
32-34: Move the shared MIME allowlist out of the Gemini private module.
embedding_utilsis generic ingestion code, but this now imports_SUPPORTED_MULTIMODAL_MIMES, a private implementation detail ofopenviking.models.embedder.gemini_embedders, at module import time. That makes unrelated callers sensitive to provider-internal refactors. Prefer exposing the allowlist from a neutral/public module instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/utils/embedding_utils.py` around lines 32 - 34, The module embedding_utils imports the private symbol _SUPPORTED_MULTIMODAL_MIMES from openviking.models.embedder.gemini_embedders and derives _GEMINI_SUPPORTED_IMAGE_MIMES from it, coupling generic ingestion code to a provider-internal detail; instead move or expose the MIME allowlist as a neutral/public constant (e.g. SUPPORTED_MULTIMODAL_MIMES or SUPPORTED_IMAGE_MIMES) in a shared module (for example a new openviking.models.embedder.mime or openviking.utils.mime) and update embedding_utils to import that public symbol and derive the frozenset there (replace references to _SUPPORTED_MULTIMODAL_MIMES and _GEMINI_SUPPORTED_IMAGE_MIMES with the new public names).openviking/storage/queuefs/embedding_msg_converter.py (1)
26-29: Keep media-only messages textless instead of injecting the URI.The dequeue path later passes
embedding_msg.messageintoVectorize(text=...)forembed_multimodal(), so this fallback becomes part of the model input, not just queue bookkeeping. For media-only items that means embeddingviking://.../file.extalongside the bytes and also using the URI as the text-only fallback payload.Possible direction
vectorization_text = context.get_vectorization_text() - # If text is empty but media is present, use URI as fallback so the message isn't dropped - if not vectorization_text and getattr(context.vectorize, "media", None) is not None: - vectorization_text = context.uri or "" - if not vectorization_text: + has_media = getattr(context.vectorize, "media", None) is not None + if not vectorization_text and not has_media: return None- message=vectorization_text, + message=vectorization_text or "",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openviking/storage/queuefs/embedding_msg_converter.py` around lines 26 - 29, Remove the fallback that injects context.uri into vectorization_text for media-only items so media-only messages remain textless; specifically, in embedding_msg_converter.py do not set vectorization_text = context.uri when getattr(context.vectorize, "media", None) is not None — leave vectorization_text empty for media-only cases so embedding_msg.message (and subsequently Vectorize(text=...) used by embed_multimodal()) does not receive the URI as model input; if bookkeeping needs a non-empty identifier, handle that separately (e.g., store URI in metadata) rather than in vectorization_text.tests/storage/test_embedding_msg_converter_tenant.py (1)
43-45: Consolidate imports at the top of the file.The imports at lines 45 and 75-76 are placed mid-file. Additionally,
EmbeddingMsgConverterat line 76 duplicates the import at line 9. Moving all imports to the top improves readability and follows PEP 8 conventions.♻️ Proposed refactor
Move these imports to the top of the file with the other imports:
from openviking.core.context import Context +from openviking.core.context import ModalContent, Vectorize from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter +from openviking.storage.queuefs.embedding_msg import EmbeddingMsg from openviking_cli.session.user_id import UserIdentifierThen remove the mid-file imports at lines 45 and 75-76.
Also applies to: 75-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/storage/test_embedding_msg_converter_tenant.py` around lines 43 - 45, Imports for EmbeddingMsg and EmbeddingMsgConverter are currently placed mid-file; move the import statements for EmbeddingMsg (from openviking.storage.queuefs.embedding_msg) and EmbeddingMsgConverter to the top alongside the other imports, remove the duplicate mid-file import(s), and ensure there's only one canonical import per symbol (EmbeddingMsg, EmbeddingMsgConverter) at the module-level to satisfy PEP8 and avoid duplication.tests/storage/test_collection_schemas.py (1)
220-250: Consider patching VikingFS in the security test for robustness.The test relies on the URI mismatch check happening before the file read. While this matches the current implementation, adding a
_FakeVikingFSthat raises an error would make the test more defensive against future code changes that might reorder the checks.♻️ Suggested defensive patch
+ class _FailingVikingFS: + async def read_file_bytes(self, uri, ctx=None): + raise AssertionError("Should not attempt to read mismatched URI") + - handler = TextEmbeddingHandler(_FakeVikingDB()) - handler.set_callbacks(on_success=lambda: None, on_error=lambda *_: None) - await handler.on_dequeue(payload) + with patch("openviking.storage.viking_fs.get_viking_fs", return_value=_FailingVikingFS()): + handler = TextEmbeddingHandler(_FakeVikingDB()) + handler.set_callbacks(on_success=lambda: None, on_error=lambda *_: None) + await handler.on_dequeue(payload)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/storage/test_collection_schemas.py` around lines 220 - 250, The test test_handler_rejects_media_uri_not_matching_context_uri should also patch VikingFS to ensure no file read occurs if the media_uri mismatches; add a _FakeVikingFS class that raises an exception on read/open and monkeypatch it into the code path the TextEmbeddingHandler uses (so that any accidental reorder of checks that tries to access the file will fail the test), keep the existing _FakeVikingDB and assertions that embedder.embed_multimodal_calls == 0 and embedder.embed_calls == 1 to verify fallback-to-text behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openviking_cli/utils/config/embedding_config.py`:
- Around line 92-95: The Gemini branch raises if api_key is missing but doesn't
ensure embedding dimensionality matches GeminiDenseEmbedder's default (3072),
causing init_context_collection and TextEmbeddingHandler to provision 2048-dim
collections while Gemini produces 3072-dim vectors; fix by either requiring an
explicit dimension when provider == "gemini" in the same branch (raise
ValueError if config.embedding.dimension is unset) or update get_dimension() to
detect provider == "gemini" and return the Gemini default based on model
(aligning GeminiDenseEmbedder and config.embedding.dimension); update references
in GeminiDenseEmbedder, get_dimension(), init_context_collection, and
TextEmbeddingHandler so the dimension used for provisioning and validation is
consistent.
In `@openviking/models/embedder/gemini_embedders.py`:
- Around line 157-161: The close() method currently accesses the private
attribute self.client._http_client; change it to use the public API by calling
the client's close method (e.g., call self.client.close() or check
hasattr(self.client, "close") and call it), removing direct access to
_http_client and optionally wrapping the call in a try/except to log or ignore
cleanup errors; update the close method in gemini_embedders.py (the def
close(self) implementation) to use genai.Client.close instead of
self.client._http_client.
In `@openviking/storage/collection_schemas.py`:
- Around line 200-223: The code currently calls
get_viking_fs().read_file_bytes(embedding_msg.media_uri, ctx=None) which permits
cross-tenant reads because ctx=None; replace this with a tenant-scoped
RequestContext constructed from embedding_msg.context_data (e.g.
account_id/tenant id) and pass it as the ctx argument to read_file_bytes (or, if
a valid tenant context cannot be reconstructed, skip the file branch and fall
back to text embedding to fail closed). Update the branch around embedding_msg,
expected_uri, get_viking_fs, and Vectorize to require a reconstructed
RequestContext (or bail out) before calling read_file_bytes.
- Around line 216-241: The current broad except around calling
self._embedder.embed_multimodal and viking_fs.read_file_bytes swallows transient
embedder failures (e.g., HTTP 429/5xx) and falls back to text-only embedding;
change the error handling in the try/except in collection_schemas.py so that
embed_multimodal errors (from GeminiDenseEmbedder via
self._embedder.embed_multimodal and self._embedder.embed) are not caught and
suppressed: only catch and handle IO/validation errors coming from
get_viking_fs() / viking_fs.read_file_bytes or construction of
Vectorize/ModalContent and, for other exceptions originating from the embedder,
re-raise them (or log as error and raise) so callers can retry; identify this
logic around the try block using the symbols Vectorize, ModalContent,
get_viking_fs, viking_fs.read_file_bytes, self._embedder.embed_multimodal, and
self._embedder.embed, and implement selective exception handling (or exception
filtering) to reserve text fallback solely for non-transient read/validation
failures.
In `@pyproject.toml`:
- Around line 237-241: The pytest-asyncio version ranges are inconsistent
between [dependency-groups].dev (pytest-asyncio>=1.3.0) and
[project.optional-dependencies].test (pytest-asyncio>=0.21.0); pick a single
range (e.g., pytest-asyncio>=1.3.0) and update the other entry so both
[dependency-groups].dev and [project.optional-dependencies].test reference the
same pytest-asyncio version constraint to avoid conflicting installs.
---
Outside diff comments:
In `@openviking/storage/queuefs/embedding_msg.py`:
- Around line 9-31: The class EmbeddingMsg currently assigns self.id in __init__
but id is not declared as a dataclass field so to_dict() via asdict() omits it;
declare id as a dataclass field (e.g., id: str = field(default_factory=lambda:
str(uuid4()))) and add the necessary imports (field, uuid4), then update
from_dict to accept/pass id when constructing the EmbeddingMsg (or construct via
EmbeddingMsg(**data)) so round-trip serialization preserves the id; remove the
manual self.id assignment in __init__ if you rely on the dataclass default
factory.
---
Nitpick comments:
In `@examples/ov.conf.gemini.example`:
- Around line 5-6: Update the comment to clarify that 3072 is the default/native
output size for gemini-embedding-2-preview but not strictly required; mention
that Gemini supports output dimensions in the range 128–3072 (with recommended
values 768, 1536, 3072) and that reducing the dimension requires explicit
truncation (the server will reject dimensions outside that supported range).
Reference the gemini-embedding-2-preview model and the GeminiDenseEmbedder
docstring so readers know where the supported range is defined.
In `@openviking/models/embedder/gemini_embedders.py`:
- Around line 145-151: The batch fallback currently calls self.embed(text) for
each text inside the except APIError block, but self.embed() wraps APIError into
RuntimeError so a failure on one text will abort the loop; update the fallback
to call self.embed(text) inside a per-item try/except (catch RuntimeError and
APIError) so failures are logged (use logger.warning/error with the text index
or snippet and exception) and the loop continues, ensuring you still append
successful embeddings to results and handle/append a placeholder (e.g., None) or
skip entries consistently for failed items; adjust code in the except APIError
block around variables batch, results, and self.embed to implement this per-item
error handling.
- Around line 99-128: The fallback in embed_multimodal currently calls
self.embed(vectorize.text) even when vectorize.text is empty/None; add a
defensive check before any fallback (both the early-return branch and the
exception fallback) to detect falsy/blank vectorize.text, log a warning via
logger.warning mentioning embed_multimodal and media.uri, and return a safe
empty EmbedResult (e.g., EmbedResult(dense_vector=[])) instead of calling
self.embed when text is empty; update the logic inside embed_multimodal
(references: embed_multimodal, embed, EmbedResult, vectorize.text, media.uri) to
perform this check in both fallback paths.
In `@openviking/storage/queuefs/embedding_msg_converter.py`:
- Around line 26-29: Remove the fallback that injects context.uri into
vectorization_text for media-only items so media-only messages remain textless;
specifically, in embedding_msg_converter.py do not set vectorization_text =
context.uri when getattr(context.vectorize, "media", None) is not None — leave
vectorization_text empty for media-only cases so embedding_msg.message (and
subsequently Vectorize(text=...) used by embed_multimodal()) does not receive
the URI as model input; if bookkeeping needs a non-empty identifier, handle that
separately (e.g., store URI in metadata) rather than in vectorization_text.
In `@openviking/utils/embedding_utils.py`:
- Around line 32-34: The module embedding_utils imports the private symbol
_SUPPORTED_MULTIMODAL_MIMES from openviking.models.embedder.gemini_embedders and
derives _GEMINI_SUPPORTED_IMAGE_MIMES from it, coupling generic ingestion code
to a provider-internal detail; instead move or expose the MIME allowlist as a
neutral/public constant (e.g. SUPPORTED_MULTIMODAL_MIMES or
SUPPORTED_IMAGE_MIMES) in a shared module (for example a new
openviking.models.embedder.mime or openviking.utils.mime) and update
embedding_utils to import that public symbol and derive the frozenset there
(replace references to _SUPPORTED_MULTIMODAL_MIMES and
_GEMINI_SUPPORTED_IMAGE_MIMES with the new public names).
In `@tests/storage/test_collection_schemas.py`:
- Around line 220-250: The test
test_handler_rejects_media_uri_not_matching_context_uri should also patch
VikingFS to ensure no file read occurs if the media_uri mismatches; add a
_FakeVikingFS class that raises an exception on read/open and monkeypatch it
into the code path the TextEmbeddingHandler uses (so that any accidental reorder
of checks that tries to access the file will fail the test), keep the existing
_FakeVikingDB and assertions that embedder.embed_multimodal_calls == 0 and
embedder.embed_calls == 1 to verify fallback-to-text behavior.
In `@tests/storage/test_embedding_msg_converter_tenant.py`:
- Around line 43-45: Imports for EmbeddingMsg and EmbeddingMsgConverter are
currently placed mid-file; move the import statements for EmbeddingMsg (from
openviking.storage.queuefs.embedding_msg) and EmbeddingMsgConverter to the top
alongside the other imports, remove the duplicate mid-file import(s), and ensure
there's only one canonical import per symbol (EmbeddingMsg,
EmbeddingMsgConverter) at the module-level to satisfy PEP8 and avoid
duplication.
In `@tests/unit/test_context_vectorize.py`:
- Line 4: Remove the unused top-level import of pytest (the "import pytest"
statement) since it isn't referenced anywhere in the test file; simply delete
that import line to eliminate the unused import warning and keep the test module
clean.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f99c6b3c-2b76-4661-9332-5c7a982d166e
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (19)
examples/ov.conf.gemini.exampleopenviking/core/context.pyopenviking/models/embedder/__init__.pyopenviking/models/embedder/base.pyopenviking/models/embedder/gemini_embedders.pyopenviking/storage/collection_schemas.pyopenviking/storage/queuefs/embedding_msg.pyopenviking/storage/queuefs/embedding_msg_converter.pyopenviking/utils/embedding_utils.pyopenviking_cli/utils/config/embedding_config.pypyproject.tomltests/integration/test_gemini_e2e.pytests/misc/test_config_validation.pytests/storage/test_collection_schemas.pytests/storage/test_embedding_msg_converter_tenant.pytests/unit/test_context_vectorize.pytests/unit/test_embedder_base_multimodal.pytests/unit/test_embedding_utils_mime.pytests/unit/test_gemini_embedder.py
| elif self.provider == "gemini": | ||
| if not self.api_key: | ||
| raise ValueError("Gemini provider requires 'api_key' to be set") | ||
|
|
There was a problem hiding this comment.
Gemini without an explicit dimension will provision the wrong vector size.
GeminiDenseEmbedder defaults to 3072 when dimension is unset, but init_context_collection() and TextEmbeddingHandler both consume config.embedding.dimension, which still falls back to 2048. A config created through this new Gemini path can therefore provision the collection/validator for 2048 and then reject 3072-dim vectors at ingest time. Either require dimension for Gemini here or make get_dimension() resolve the Gemini default from model.
Also applies to: 241-249
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking_cli/utils/config/embedding_config.py` around lines 92 - 95, The
Gemini branch raises if api_key is missing but doesn't ensure embedding
dimensionality matches GeminiDenseEmbedder's default (3072), causing
init_context_collection and TextEmbeddingHandler to provision 2048-dim
collections while Gemini produces 3072-dim vectors; fix by either requiring an
explicit dimension when provider == "gemini" in the same branch (raise
ValueError if config.embedding.dimension is unset) or update get_dimension() to
detect provider == "gemini" and return the Gemini default based on model
(aligning GeminiDenseEmbedder and config.embedding.dimension); update references
in GeminiDenseEmbedder, get_dimension(), init_context_collection, and
TextEmbeddingHandler so the dimension used for provisioning and validation is
consistent.
| def close(self): | ||
| if hasattr(self.client, "_http_client"): | ||
| try: | ||
| self.client._http_client.close() | ||
| except Exception: |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
google-genai python Client close method
💡 Result:
In the google-genai Python SDK (google.genai), the client does have a close method:
- Sync client: call
client.close()to release underlying HTTP resources. [1][2] - Async client: use
Client(...).aioand callawait aclient.aclose(). [1]
Example:
from google.genai import Client
client = Client()
# ... make requests ...
client.close()Async:
from google.genai import Client
aclient = Client().aio
# ... await requests ...
await aclient.aclose()Alternatively, use context managers to auto-close:
from google.genai import Client
with Client() as client:
...from google.genai import Client
async with Client().aio as aclient:
...[1] (github.com) [2] (googleapis.github.io)
Citations:
- 1: https://github.com/googleapis/python-genai?utm_source=openai
- 2: https://googleapis.github.io/python-genai/?utm_source=openai
🏁 Script executed:
cat -n openviking/models/embedder/gemini_embedders.py | head -170 | tail -40Repository: chethanuk/OpenViking
Length of output: 1977
🏁 Script executed:
head -80 openviking/models/embedder/gemini_embedders.pyRepository: chethanuk/OpenViking
Length of output: 2245
Use the public close() method instead of accessing the private _http_client attribute.
The genai.Client from the google-genai SDK provides a public close() method. Replace the private attribute access with the public API:
def close(self):
if hasattr(self.client, 'close'):
self.client.close()Or simply call it directly if close() is always available:
def close(self):
self.client.close()Alternatively, use a context manager pattern when creating the client to avoid manual cleanup altogether.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/models/embedder/gemini_embedders.py` around lines 157 - 161, The
close() method currently accesses the private attribute
self.client._http_client; change it to use the public API by calling the
client's close method (e.g., call self.client.close() or check
hasattr(self.client, "close") and call it), removing direct access to
_http_client and optionally wrapping the call in a try/except to log or ignore
cleanup errors; update the close method in gemini_embedders.py (the def
close(self) implementation) to use genai.Client.close instead of
self.client._http_client.
| # Security: validate media_uri matches the record's own URI to prevent | ||
| # forged queue messages from reading arbitrary files. | ||
| expected_uri = embedding_msg.context_data.get("uri", "") | ||
| if embedding_msg.media_uri != expected_uri: | ||
| logger.warning( | ||
| f"media_uri {embedding_msg.media_uri!r} does not match context uri " | ||
| f"{expected_uri!r}, falling back to text embedding" | ||
| ) | ||
| result: EmbedResult = await asyncio.to_thread( | ||
| self._embedder.embed, embedding_msg.message | ||
| ) | ||
| else: | ||
| # TODO(security): reconstruct a tenant-scoped RequestContext from | ||
| # context_data["account_id"] to prevent ROOT-context file reads. | ||
| # Blocked on UserIdentifier requiring user_id/agent_id fields that | ||
| # are not currently propagated through EmbeddingMsg.context_data. | ||
| try: | ||
| from openviking.core.context import ModalContent, Vectorize | ||
| from openviking.storage.viking_fs import get_viking_fs | ||
|
|
||
| viking_fs = get_viking_fs() | ||
| # read_file_bytes is async — await directly (not asyncio.to_thread) | ||
| raw_bytes = await viking_fs.read_file_bytes(embedding_msg.media_uri, ctx=None) | ||
| vectorize = Vectorize( |
There was a problem hiding this comment.
ctx=None still allows cross-tenant media reads.
The URI equality check only proves the two fields inside the same queue message agree; it does not authorize that URI. Because read_file_bytes(..., ctx=None) runs outside tenant context, a crafted message that sets both context_data["uri"] and media_uri to another tenant's file will still read that file. This branch needs a tenant-scoped RequestContext, or it should fail closed until one can be reconstructed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/storage/collection_schemas.py` around lines 200 - 223, The code
currently calls get_viking_fs().read_file_bytes(embedding_msg.media_uri,
ctx=None) which permits cross-tenant reads because ctx=None; replace this with a
tenant-scoped RequestContext constructed from embedding_msg.context_data (e.g.
account_id/tenant id) and pass it as the ctx argument to read_file_bytes (or, if
a valid tenant context cannot be reconstructed, skip the file branch and fall
back to text embedding to fail closed). Update the branch around embedding_msg,
expected_uri, get_viking_fs, and Vectorize to require a reconstructed
RequestContext (or bail out) before calling read_file_bytes.
| try: | ||
| from openviking.core.context import ModalContent, Vectorize | ||
| from openviking.storage.viking_fs import get_viking_fs | ||
|
|
||
| viking_fs = get_viking_fs() | ||
| # read_file_bytes is async — await directly (not asyncio.to_thread) | ||
| raw_bytes = await viking_fs.read_file_bytes(embedding_msg.media_uri, ctx=None) | ||
| vectorize = Vectorize( | ||
| text=embedding_msg.message, | ||
| media=ModalContent( | ||
| mime_type=embedding_msg.media_mime_type, | ||
| uri=embedding_msg.media_uri, | ||
| data=raw_bytes, | ||
| ), | ||
| ) | ||
| result: EmbedResult = await asyncio.to_thread( | ||
| self._embedder.embed_multimodal, vectorize | ||
| ) | ||
| except Exception as e: | ||
| logger.warning( | ||
| f"Multimodal embed failed for {embedding_msg.media_uri!r}: {e}, " | ||
| "falling back to text embedding" | ||
| ) | ||
| result: EmbedResult = await asyncio.to_thread( | ||
| self._embedder.embed, embedding_msg.message | ||
| ) |
There was a problem hiding this comment.
Propagate transient multimodal failures instead of downgrading them to text.
GeminiDenseEmbedder.embed_multimodal() already raises on 429/5xx so the caller can retry, but the broad except Exception here catches that and permanently stores a text-only vector instead. Re-raise transient embedder failures, or report them as errors, and reserve text fallback for non-transient read/validation problems.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/storage/collection_schemas.py` around lines 216 - 241, The current
broad except around calling self._embedder.embed_multimodal and
viking_fs.read_file_bytes swallows transient embedder failures (e.g., HTTP
429/5xx) and falls back to text-only embedding; change the error handling in the
try/except in collection_schemas.py so that embed_multimodal errors (from
GeminiDenseEmbedder via self._embedder.embed_multimodal and
self._embedder.embed) are not caught and suppressed: only catch and handle
IO/validation errors coming from get_viking_fs() / viking_fs.read_file_bytes or
construction of Vectorize/ModalContent and, for other exceptions originating
from the embedder, re-raise them (or log as error and raise) so callers can
retry; identify this logic around the try block using the symbols Vectorize,
ModalContent, get_viking_fs, viking_fs.read_file_bytes,
self._embedder.embed_multimodal, and self._embedder.embed, and implement
selective exception handling (or exception filtering) to reserve text fallback
solely for non-transient read/validation failures.
…() default fallback to EmbedderBase
… for multimodal pipeline
… from Vectorize.media
…a URIs with bytes from viking_fs
…configured; add _infer_image_mime helper
…I calls 7 E2E tests covering text embedding dimension/normalization, batch consistency, semantic similarity, multimodal (xfail→xpass), fallback, and task-type differentiation. Adds pytest integration marker to pyproject.toml.
…e_bytes context (security) - Reject multimodal embeds where media_uri != context_data["uri"] (forged queue message attack) - Fix read_file_bytes call: was missing ctx= kwarg and incorrectly wrapped in asyncio.to_thread (it is async) - Add TODO noting ROOT-context leak pending UserIdentifier fields in EmbeddingMsg
…o, PDF); add token limit constant; fix E2E default dimension to 3072
- Remove circular dep: drop get_openviking_config from vectorize_file() body; add embedding_provider param, callers pass config.dense.provider - Fix dead PDF multimodal path: check is_pdf before None branch so PDF files reach the Gemini media branch in vectorize_file() - Extend multimodal MIME coverage: add _MIME_MAP_VIDEO/_AUDIO/_PDF and _infer_media_mime() helper; VIDEO/AUDIO/PDF now set Vectorize.media - Fix EmbeddingMsg.id: declare as dataclass field(default_factory=uuid4) so asdict()/to_dict()/from_dict() round-trip preserves message ID - Add exc_info=True to broad except in embedding_utils file-read path - Fix misleading dimension comment in ov.conf.gemini.example - Move inline imports (get_openviking_config, traceback) to top-level in collection_schemas.py - Add 'gemini' to all provider descriptions/docstrings/error messages in embedding_config.py - Remove unused pytest import from test_context_vectorize.py - Add Gemini provider section (details block + inline example + table) to docs/en and docs/zh configuration guides - Add tests: EmbeddingMsg id roundtrip, _infer_media_mime for all types, parameterised vectorize_file multimodal coverage (png/mp4/mp3/pdf/openai)
1de958f to
23a4adb
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
docs/zh/guides/01-configuration.md (1)
102-106: Consider documentingtask_typein Gemini examples.
task_typeis now supported and useful for retrieval tuning; adding it to both Gemini snippets will reduce under-configuration by users.📝 Suggested doc tweak
{ "embedding": { "dense": { "provider": "gemini", "api_key": "your-google-api-key", "model": "gemini-embedding-2-preview", - "dimension": 3072 + "dimension": 3072, + "task_type": "RETRIEVAL_DOCUMENT" } } }Also applies to: 214-218
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/zh/guides/01-configuration.md` around lines 102 - 106, The Gemini provider examples currently lack the optional task_type setting; update both Gemini configuration snippets that include "provider": "gemini" (the JSON blocks with "api_key" and "model" such as "gemini-embedding-2-preview") to add a "task_type" field (e.g., "task_type": "text-embedding" or the appropriate Gemini task like "embedding") so users can tune retrieval behavior; ensure the new key appears alongside "model" and "dimension" in each Gemini snippet and use consistent example values across both occurrences.tests/storage/test_embedding_msg_converter_tenant.py (1)
43-46: Consolidate duplicate imports at file top.
ContextandEmbeddingMsgConverterare imported twice (lines 8-9 and lines 75-76). Consider moving all imports to the top of the file for cleaner organization.from openviking.core.context import Context +from openviking.core.context import ModalContent, Vectorize from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverter +from openviking.storage.queuefs.embedding_msg import EmbeddingMsg from openviking_cli.session.user_id import UserIdentifier -... - -from openviking.storage.queuefs.embedding_msg import EmbeddingMsg - -... - -from openviking.core.context import Context, ModalContent, Vectorize -from openviking.storage.queuefs.embedding_msg_converter import EmbeddingMsgConverterAlso applies to: 75-77
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/storage/test_embedding_msg_converter_tenant.py` around lines 43 - 46, The file has duplicate imports of Context and EmbeddingMsgConverter (and EmbeddingMsg) scattered later in the test; consolidate by moving all imports for Context, EmbeddingMsgConverter, and EmbeddingMsg to the top import block and remove the redundant import statements (references: Context, EmbeddingMsgConverter, EmbeddingMsg) so each symbol is imported once at file head for cleaner organization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openviking/storage/queuefs/semantic_processor.py`:
- Around line 571-576: The try/except around the embedding provider lookup is
swallowing all exceptions and silently disabling Gemini vectorization; remove
the broad try/except and either (a) perform the safe nested getattr without
catching exceptions: embedding_provider =
getattr(get_openviking_config().embedding.dense, "provider", None) (or chained
getattr calls as currently used) so attribute errors surface, or (b) if you must
handle missing attributes, catch only AttributeError and log the error before
defaulting to None; update the lookup near embedding_provider and use
get_openviking_config(), embedding, and dense identifiers to locate the code.
In `@openviking/utils/resource_processor.py`:
- Around line 85-92: The current broad try/except around get_openviking_config()
swallows all errors and sets embedding_provider to None; change this to catch
only expected exceptions (e.g., AttributeError, ImportError, or config-specific
errors), log the failure with context using the module logger (include the
exception message/stack), and preserve raising unexpected exceptions. Refactor
the nested attribute lookup into a shared helper function (e.g.,
resolve_embedding_provider(config) or get_embedding_provider_from_config) and
call it from both resource_processor.py (where embedding_provider is set) and
openviking/storage/queuefs/semantic_processor.py so both use the same logic and
error handling.
In `@tests/misc/test_config_validation.py`:
- Line 215: Replace the scanner-triggering literal api_key="key" used in the
test tuple (model="gemini-embedding-2-preview", provider="gemini",
api_key="key", dimension=3072) with a clearly non-secret, scanner-safe test
literal such as api_key="DUMMY_API_KEY" or "test-api-key" so the api_key test
value is explicitly a dummy; update the api_key value wherever that exact tuple
or the api_key parameter is used in tests to avoid secret-scanning noise.
---
Nitpick comments:
In `@docs/zh/guides/01-configuration.md`:
- Around line 102-106: The Gemini provider examples currently lack the optional
task_type setting; update both Gemini configuration snippets that include
"provider": "gemini" (the JSON blocks with "api_key" and "model" such as
"gemini-embedding-2-preview") to add a "task_type" field (e.g., "task_type":
"text-embedding" or the appropriate Gemini task like "embedding") so users can
tune retrieval behavior; ensure the new key appears alongside "model" and
"dimension" in each Gemini snippet and use consistent example values across both
occurrences.
In `@tests/storage/test_embedding_msg_converter_tenant.py`:
- Around line 43-46: The file has duplicate imports of Context and
EmbeddingMsgConverter (and EmbeddingMsg) scattered later in the test;
consolidate by moving all imports for Context, EmbeddingMsgConverter, and
EmbeddingMsg to the top import block and remove the redundant import statements
(references: Context, EmbeddingMsgConverter, EmbeddingMsg) so each symbol is
imported once at file head for cleaner organization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0b621305-a2df-4f1a-8314-b89fc935014d
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (24)
docs/en/guides/01-configuration.mddocs/zh/guides/01-configuration.mdexamples/ov.conf.gemini.exampleopenviking/core/context.pyopenviking/models/embedder/__init__.pyopenviking/models/embedder/base.pyopenviking/models/embedder/gemini_embedders.pyopenviking/storage/collection_schemas.pyopenviking/storage/queuefs/embedding_msg.pyopenviking/storage/queuefs/embedding_msg_converter.pyopenviking/storage/queuefs/semantic_processor.pyopenviking/utils/embedding_utils.pyopenviking/utils/resource_processor.pyopenviking_cli/utils/config/embedding_config.pypyproject.tomltests/integration/test_gemini_e2e.pytests/misc/test_config_validation.pytests/storage/test_collection_schemas.pytests/storage/test_embedding_msg_converter_tenant.pytests/unit/test_context_vectorize.pytests/unit/test_embedder_base_multimodal.pytests/unit/test_embedding_msg.pytests/unit/test_embedding_utils_mime.pytests/unit/test_gemini_embedder.py
🚧 Files skipped from review as they are similar to previous changes (8)
- tests/unit/test_embedder_base_multimodal.py
- pyproject.toml
- openviking_cli/utils/config/embedding_config.py
- openviking/storage/queuefs/embedding_msg.py
- tests/unit/test_context_vectorize.py
- tests/unit/test_embedding_utils_mime.py
- tests/storage/test_collection_schemas.py
- openviking/models/embedder/base.py
| from openviking_cli.utils.config import get_openviking_config | ||
| try: | ||
| ov_config = get_openviking_config() | ||
| embedding_provider = ( | ||
| getattr(getattr(getattr(ov_config, "embedding", None), "dense", None), "provider", None) | ||
| ) | ||
| except Exception: | ||
| embedding_provider = None |
There was a problem hiding this comment.
Don’t swallow all provider-resolution errors silently.
This fallback can mask invalid config/state and silently disable provider-specific indexing behavior. Please at least log the failure and narrow exception scope; ideally extract this lookup into one shared helper used here and in openviking/storage/queuefs/semantic_processor.py.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@openviking/utils/resource_processor.py` around lines 85 - 92, The current
broad try/except around get_openviking_config() swallows all errors and sets
embedding_provider to None; change this to catch only expected exceptions (e.g.,
AttributeError, ImportError, or config-specific errors), log the failure with
context using the module logger (include the exception message/stack), and
preserve raising unexpected exceptions. Refactor the nested attribute lookup
into a shared helper function (e.g., resolve_embedding_provider(config) or
get_embedding_provider_from_config) and call it from both resource_processor.py
(where embedding_provider is set) and
openviking/storage/queuefs/semantic_processor.py so both use the same logic and
error handling.
- Add task_type to all 4 Gemini JSON snippets (details block + inline example in both en and zh docs) per CodeRabbit review - Consolidate duplicate imports in test_embedding_msg_converter_tenant.py: move EmbeddingMsg, ModalContent, Vectorize, and second EmbeddingMsgConverter import to top-level block, remove 3 mid-file inline import statements
…i_key literals - semantic_processor.py + resource_processor.py: change bare `except Exception:` to `except Exception as e:` with logger.warning(..., exc_info=True) so config load failures are visible instead of silently disabling multimodal path - test_config_validation.py: replace api_key="key" (2 occurrences) with api_key="test-key" to avoid secret-scanner noise
- Add embed_query() default to DenseEmbedderBase (delegates to embed()) - GeminiDenseEmbedder overrides embed_query() using _query_config (RETRIEVAL_QUERY); embed() uses _index_config (RETRIEVAL_DOCUMENT) - Update hierarchical_retriever + memory_deduplicator to call embed_query() - Deprecate task_type config field (still accepted, no validation error) - Add enable_multimodal: bool = False flag; supports_multimodal reflects it - Add embed_multimodal_batch / async_embed_multimodal_batch to base class - Add Gemini async_embed_multimodal_batch override (anyio semaphore) - Rewrite embed_multimodal: parts API + pdfminer PDF guard (gated by flag) - Fix PR volcengine#607 issues #1, #2, #4, #6
… embedding - Add ContentPart = Union[str, ModalContent] type alias - Add parts: Optional[List[ContentPart]] field to Vectorize - Add get_parts(): returns parts if set, else builds [text, media] from legacy fields - Add multi-part integration tests (TestGeminiE2EMultipartEmbedding) - Fixes PR volcengine#607 issue #1 (multi-part sequences)
Summary
GeminiDenseEmbedder: Native text + multimodal (image, video, audio, PDF) embedding viagemini-embedding-2-preview(google-genai 1.67.0). Additive provider pattern — Volcengine remains the default; Gemini is opt-in viaprovider: "gemini"inov.conf.EmbeddingMsgcarriesmedia_uri/media_mime_type(URI-in-queue pattern);TextEmbeddingHandler.on_dequeue()reads raw bytes fromviking_fsand callsembed_multimodal(). Security:media_uriis validated to matchcontext_data["uri"]before any file read.EmbeddingConfigregistersprovider: "gemini"withtask_typefield.vectorize_file()setsVectorize(media=ModalContent(...))for supported image formats when Gemini is configured.Spec
gemini-embedding-2-previewimage/jpeg,image/png,image/gif,image/webp,audio/mpeg,audio/mp3,audio/wav,audio/ogg,audio/flac,video/mp4,video/mpeg,video/mov,video/avi,video/webm,video/wmv,video/3gpp,application/pdfSecurity fixes (identified by Codex review)
media_urivalidated againstcontext_data["uri"]beforeviking_fs.read_file_bytes()— forged queue messages fall back to text embedRequestContextreconstruction at dequeue time (requiresuser_idnot currently inEmbeddingMsg)Test plan
test_gemini_embedder.py) — all passing, zero real API callstest_collection_schemas.py) — includes security test for URI mismatchtest_embedding_msg_converter_tenant.py)test_config_validation.py) — factory createsGeminiDenseEmbeddertest_embedding_utils_mime.py)test_gemini_e2e.py) — real Gemini API,dimension=3072default verifiedtext dim: 3072,multimodal dim: 768✓Configuration example
{ "embedding": { "dense": { "model": "gemini-embedding-2-preview", "api_key": "${GEMINI_API_KEY}", "dimension": 3072, "provider": "gemini", "task_type": "RETRIEVAL_DOCUMENT" } } }Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
Chores