feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF) by chethanuk · Pull Request #607 · volcengine/OpenViking

chethanuk · 2026-03-14T19:15:45Z

Description

Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)

Native text + multimodal (image, video, audio, PDF) embedding via `gemini-embedding-2-preview` (google-genai 1.67.0). Additive provider pattern: Volcengine remains the default; Gemini is opt-in via `provider: "gemini"` in `ov.conf`.

- **Model**: `gemini-embedding-2-preview`
- **Input**: text, image, video, audio, PDF (17 MIME types)
- **Output dimension**: 128–3072 (default: **3072**, recommended: 768 / 1536 / 3072)
- **Input token limit**: **8,192 tokens**
- **Supported MIME types**: `image/jpeg`, `image/png`, `image/gif`, `image/webp`, `audio/mpeg`, `audio/mp3`, `audio/wav`, `audio/ogg`, `audio/flac`, `video/mp4`, `video/mpeg`, `video/mov`, `video/avi`, `video/webm`, `video/wmv`, `video/3gpp`, `application/pdf`

- **Gemini Embedding 2 Multimodal Support**: Introduced a new GeminiDenseEmbedder to support native text and multimodal (image, video, audio, PDF) embedding using the gemini-embedding-2-preview model. This is an opt-in provider via configuration.
- **Extended Queue Pipeline for Multimodal Content**: The EmbeddingMsg now carries media_uri and media_mime_type to facilitate multimodal content processing. The TextEmbeddingHandler.on_dequeue() method was updated to read raw bytes from viking_fs and call embed_multimodal() when applicable.
- **End-to-End Configuration and Security**: The EmbeddingConfig now registers the 'gemini' provider with a task_type field. A critical security validation was added to ensure media_uri matches context_data['uri'] before file reads, preventing forged queue messages from accessing arbitrary files. If validation fails or multimodal embedding fails, it falls back to text embedding.
- **Multimodal Content Representation**: A new ModalContent dataclass was introduced to represent media references, including MIME type, URI, and optional raw data, enabling the Vectorize object to encapsulate both text and media for embedding.
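
The validation-then-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `QueueMessage`, `embed_message`, and the injected `embed_text` / `embed_media` / `read_bytes` callables are hypothetical stand-ins for EmbeddingMsg, the handler, the embedder, and viking_fs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueueMessage:
    """Hypothetical stand-in for the queue message described above."""
    text: str
    context_data: dict
    media_uri: Optional[str] = None
    media_mime_type: Optional[str] = None


def embed_message(
    msg: QueueMessage,
    embed_text: Callable[[str], List[float]],
    embed_media: Callable[[bytes, str, str], List[float]],
    read_bytes: Callable[[str], bytes],
) -> List[float]:
    """Take the multimodal path only when media_uri matches the context URI."""
    uri_ok = msg.media_uri is not None and msg.media_uri == msg.context_data.get("uri")
    if uri_ok and msg.media_mime_type:
        try:
            # The file is read only after the URI check has passed.
            data = read_bytes(msg.media_uri)
            return embed_media(data, msg.media_mime_type, msg.text)
        except Exception:
            # Any multimodal failure falls through to the text path below.
            pass
    return embed_text(msg.text)
```

A forged message whose media_uri differs from context_data['uri'] never reaches `read_bytes` in this sketch; it is embedded as plain text instead.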

Related Issue

Closes: #566

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Test update

Changes Made

Testing

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have tested this on the following platforms:
- Linux
- macOS
- Windows

Checklist

My code follows the project's coding style
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

chethanuk · 2026-03-14T19:19:44Z

@MaojiaSheng @yangxinxin-7 @chenjw @hanxiao Please review and merge thank you :)

Would love to use OpenViking with gemini 2 thank you :)

lazmo88 · 2026-03-14T22:15:11Z

Subscribing — we're using Gemini Embedding 2 (3072d) via OpenAI-compatible proxy and would benefit from native multimodal embedding support. Looking forward to this landing!

MaojiaSheng · 2026-03-15T01:41:12Z

> @MaojiaSheng @yangxinxin-7 @chenjw @hanxiao Please review and merge thank you :)
>
> Would love to use OpenViking with gemini 2 thank you :)

@chethanuk thank you, this is a great contribution. Have you finished testing it in different scenarios?

ZaynJarvis · 2026-03-15T10:03:56Z

openviking/models/embedder/gemini_embedders.py

+            raise RuntimeError(f"Gemini embedding failed (code={e.code}): {e}") from e
+
+    def embed_multimodal(self, vectorize: "Vectorize") -> EmbedResult:  # type: ignore[name-defined]
+        media = getattr(vectorize, "media", None)


According to the Gemini Embedding 2 docs, the parts API supports a sequence of content parts, e.g.

text..
<media 1>
text..
<image 2>

This often happens in PDFs, when the chunking strategy produces chunks that mix text and media.

As such, the current implementation with one media field and one text field is limited. When working with multimodal content, Vectorize may need a parts field to make full use of the multimodal embedding's parts-aggregation feature, rather than a single extra media field.

ZaynJarvis · 2026-03-15T10:24:17Z

openviking/utils/embedding_utils.py

+                            text=fallback_text,
+                            media=ModalContent(uri=file_path, mime_type=mime_type),
+                        )
+                    )


Gemini limits documents (PDF) to a maximum of 6 pages. To make this work reliably, chunking is required before embedding.

I suggest not depending on direct PDF embedding; it is not a general approach.
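
To illustrate the chunking requirement, here is a small sketch that splits a document's pages into ranges no longer than the limit quoted in this comment. The helper name `page_ranges` and the constant `MAX_PDF_PAGES` are hypothetical, and the 6-page figure is taken from the comment rather than verified against current Gemini documentation.

```python
from typing import List, Tuple

# Limit quoted in the review comment; treat it as an assumption to re-check.
MAX_PDF_PAGES = 6


def page_ranges(total_pages: int, max_pages: int = MAX_PDF_PAGES) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (start, end) ranges."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(page_ranges(14))  # [(1, 6), (7, 12), (13, 14)]
```

Each range could then be extracted and embedded as its own chunk, so no single request exceeds the page limit.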

ZaynJarvis · 2026-03-15T10:25:28Z

openviking/utils/embedding_utils.py

+            ResourceContentType.VIDEO,
+            ResourceContentType.AUDIO,
+        ) or is_pdf:
+            is_multimodal_provider = (embedding_provider or "").lower() == "gemini"


Use the supports_multimodal function instead of the hardcoded provider check. (doubao-embedding-vision is also a multimodal embedding model that supports aggregation.)
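
A capability check of this kind might look like the sketch below. Only the name supports_multimodal comes from the thread; the class bodies, `MultimodalEmbedder`, `TextOnlyEmbedder`, and `should_use_multimodal` are hypothetical and much smaller than the real classes.

```python
from typing import List


class DenseEmbedderBase:
    """Hypothetical minimal base class; text-only unless a subclass says otherwise."""

    @property
    def supports_multimodal(self) -> bool:
        return False

    def embed(self, text: str) -> List[float]:
        raise NotImplementedError


class TextOnlyEmbedder(DenseEmbedderBase):
    def embed(self, text: str) -> List[float]:
        return [float(len(text))]


class MultimodalEmbedder(DenseEmbedderBase):
    """Hypothetical provider that opts in to multimodal via a flag."""

    def __init__(self, enable_multimodal: bool = False) -> None:
        self._enable_multimodal = enable_multimodal

    @property
    def supports_multimodal(self) -> bool:
        return self._enable_multimodal

    def embed(self, text: str) -> List[float]:
        return [float(len(text))]


def should_use_multimodal(embedder: DenseEmbedderBase, is_media: bool) -> bool:
    """Decide by capability, not by comparing the provider name to a string."""
    return is_media and bool(getattr(embedder, "supports_multimodal", False))
```

With this shape, any provider that exposes the capability is picked up without the dispatch code knowing its name.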

ZaynJarvis · 2026-03-15T10:59:11Z

The main limitation here is that OpenViking doesn't support multimodal embedding for now. Taking PDF as an example, the best practice is:

1. Chunk the PDF into a series of text and images.
2. Use multimodal embedding to aggregate each chunk into one embedding.

The current implementation supports multimodal embedding with 1 text and 1 media (PDF/image), with three possible issues:

- The text is a summary generated by the VLM, so it doesn't provide new context (which is what multimodal aggregation is actually for).
- There is no safeguard against PDFs exceeding the page limit.
- This feature works on TextEmbeddingHandler (used here, defined here), which is a possible architecture violation. A MultiModalEmbeddingHandler should be defined (or the class should no longer be named TextEmbeddingHandler).

My suggestion is: add Gemini Embedding 2 as a text embedder, then include its multimodal and parts-aggregation capabilities once OpenViking fully embraces multimodal embedding as add-resource is improved. This is on our timeline for April.
I would love to merge this if it's updated to be a new TextEmbedder for simplicity.

@MaojiaSheng @chethanuk looking forward to hearing from you.

Other things to consider:

- Some tests are redundant and need cleanup.
- task_type should not be set in the config file; it may be best to use different types for the indexing and querying scenarios. This PR is a reference for setting the task type.

  > Embeddings optimized for general search queries. Use RETRIEVAL_QUERY for queries; RETRIEVAL_DOCUMENT for documents to be retrieved.

- Possible misuse: embed_multimodal doesn't support embed_batch and async_embed_batch.
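
One way a batch variant could be layered over a single-item call is a semaphore-bounded gather, sketched below. A later commit in this thread mentions an anyio semaphore; this sketch swaps in the standard library's `asyncio.Semaphore` to stay dependency-free, and `embed_batch_bounded` and `embed_one` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def embed_batch_bounded(
    items: Sequence[T],
    embed_one: Callable[[T], Awaitable[R]],
    max_concurrency: int = 4,
) -> List[R]:
    """Embed every item, running at most max_concurrency calls at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await embed_one(item)

    # asyncio.gather returns results in the same order as its inputs.
    return list(await asyncio.gather(*(guarded(item) for item in items)))
```

Because results come back in input order, callers can zip them against the original items without extra bookkeeping.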

- Add embed_query() default to DenseEmbedderBase (delegates to embed())
- GeminiDenseEmbedder overrides embed_query() using _query_config (RETRIEVAL_QUERY); embed() uses _index_config (RETRIEVAL_DOCUMENT)
- Update hierarchical_retriever + memory_deduplicator to call embed_query()
- Deprecate task_type config field (still accepted, no validation error)
- Add enable_multimodal: bool = False flag; supports_multimodal reflects it
- Add embed_multimodal_batch / async_embed_multimodal_batch to base class
- Add Gemini async_embed_multimodal_batch override (anyio semaphore)
- Rewrite embed_multimodal: parts API + pdfminer PDF guard (gated by flag)
- Fix PR volcengine#607 issues #1, #2, #4, #6
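
The per-operation task type described in this commit message can be pictured with the sketch below. The method names embed() and embed_query() and the two task-type strings come from the thread; `TaskTypedEmbedder` and the injected `call_api` callable are hypothetical and stand in for the real Gemini client call.

```python
from typing import Callable, List


class DenseEmbedderBase:
    """Hypothetical minimal base: embed_query() defaults to embed()."""

    def embed(self, text: str) -> List[float]:
        raise NotImplementedError

    def embed_query(self, text: str) -> List[float]:
        return self.embed(text)


class TaskTypedEmbedder(DenseEmbedderBase):
    """Hypothetical embedder that tags each request with a task type."""

    def __init__(self, call_api: Callable[[str, str], List[float]]) -> None:
        # call_api(text, task_type) -> vector; a stand-in for the provider SDK.
        self._call_api = call_api

    def embed(self, text: str) -> List[float]:
        # Indexing path: documents that will be retrieved later.
        return self._call_api(text, "RETRIEVAL_DOCUMENT")

    def embed_query(self, text: str) -> List[float]:
        # Query path: search queries issued at retrieval time.
        return self._call_api(text, "RETRIEVAL_QUERY")
```

Providers without task types inherit the default and behave exactly as before, which is why retrievers can call embed_query() unconditionally.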

…n queuefs/

- Create openviking/storage/queuefs/embedding_handler.py with EmbeddingHandler (same logic, corrected class name + docstring)
- Replace TextEmbeddingHandler class body in collection_schemas.py with import + backward-compat alias TextEmbeddingHandler = EmbeddingHandler
- Update queue_manager.py to import EmbeddingHandler directly from queuefs
- Fixes PR volcengine#607 issue #3
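
The backward-compatible rename mentioned in this commit message follows a common pattern, shown here in a single-file sketch. The `on_dequeue` body is a placeholder and not the project's handler logic.

```python
class EmbeddingHandler:
    """Handler under its corrected name (placeholder body for illustration)."""

    def on_dequeue(self, data: dict) -> dict:
        return data


# Old name kept as an alias so existing imports keep working.
TextEmbeddingHandler = EmbeddingHandler
```

Both names refer to the same class object, so isinstance checks and subclasses written against the old name continue to behave identically.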

… embedding

- Add ContentPart = Union[str, ModalContent] type alias
- Add parts: Optional[List[ContentPart]] field to Vectorize
- Add get_parts(): returns parts if set, else builds [text, media] from legacy fields
- Add multi-part integration tests (TestGeminiE2EMultipartEmbedding)
- Fixes PR volcengine#607 issue #1 (multi-part sequences)
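
A rough sketch of the data model this commit message describes is given below. The names ContentPart, ModalContent, Vectorize, parts, and get_parts() come from the message; the field defaults and the exact fallback rules (skipping empty text or missing media) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ModalContent:
    """Media reference: MIME type, URI, and optional raw bytes."""
    mime_type: str
    uri: str
    data: Optional[bytes] = None


ContentPart = Union[str, ModalContent]


@dataclass
class Vectorize:
    text: str = ""
    media: Optional[ModalContent] = None
    parts: Optional[List[ContentPart]] = None

    def get_parts(self) -> List[ContentPart]:
        """Return explicit parts if set, else build them from the legacy fields."""
        if self.parts is not None:
            return list(self.parts)
        legacy: List[ContentPart] = []
        if self.text:
            legacy.append(self.text)
        if self.media is not None:
            legacy.append(self.media)
        return legacy
```

An interleaved chunk such as text, image, text can then be expressed as one `parts` list, while callers that only set `text` and `media` keep working through the fallback.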

…gine#607 review)

Merge 17 upstream commits (feat/fix: trace metrics, multi-agent isolation, vikingdb TUI, rate-limit simplification, session fixes, etc.) and resolve conflicts in embedding_msg, embedding_msg_converter, collection_schemas, pyproject.toml, and uv.lock.

Scope reduction (defers multimodal pipeline to April when add-resource pipeline is ready):

- Remove multimodal dispatch from embedding_utils.vectorize_file; media files always fall back to text/summary (resolves r2936557627: removes hardcoded "gemini" string check; resolves r2936556516: no PDF 6-page limit concern)
- Drop media_uri/media_mime_type from EmbeddingMsg; add telemetry_id from upstream; preserve id through to_dict/from_dict round-trip
- Drop media fields from EmbeddingMsgConverter; pass telemetry_id
- Remove multimodal on_dequeue path from TextEmbeddingHandler; adopt upstream's telemetry-tracked text-only path
- Add TODO in GeminiDenseEmbedder.embed_multimodal for parts-list aggregation (resolves r2936536550, deferred to April)

GeminiDenseEmbedder class retains supports_multimodal + embed_multimodal for future activation once Vectorize.parts is supported end-to-end.

ZaynJarvis · 2026-03-16T07:57:59Z

@chethanuk the two PRs have been submitted to your fork. When this PR is ready, you can mark it as "ready for review".

Please help resolve the conflicts as well.


chethanuk · 2026-03-16T10:55:22Z

@ZaynJarvis Please review now

qin-ctx · 2026-03-17T08:37:49Z

Hi @chethanuk, thanks for the update! It looks like there are merge conflicts again — likely caused by another recently merged provider PR. Could you rebase on the latest main and resolve the conflicts when you get a chance? Thanks!

chethanuk · 2026-03-17T11:51:55Z

Please review now

… image/video/audio/PDF) (#607)" This reverts commit 95bd197.

qin-ctx · 2026-03-17T13:27:10Z

Hi @chethanuk, we had to revert this PR (#703) after a code review found several critical issues. Here is a summary:

Critical

embedding_config.py has syntax errors that prevent Python from parsing the file. When 'gemini' was inserted into the provider/backend description strings, it broke the string literals in multiple places (lines ~44, 51, 109). Also the task_type Field() on line ~62 is missing a closing parenthesis. This breaks all embedding functionality for every provider, not just Gemini.
EmbeddingMsg.__init__ does not accept the new media_uri/media_mime_type/id parameters. The custom __init__ overrides the dataclass-generated one, but was not updated to include the new fields. Both from_dict() and EmbeddingMsgConverter.from_context() pass these kwargs, causing TypeError at runtime.
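
The second critical issue comes from a hand-written `__init__` on a dataclass: new fields are not accepted unless the constructor is updated too. The sketch below shows a consistent version; the field set mirrors the names in this review, but the class body is hypothetical and much smaller than the real EmbeddingMsg.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional
from uuid import uuid4


@dataclass(init=False)  # init=False: the hand-written __init__ below is the only one
class EmbeddingMsg:
    message: str
    context_data: Dict[str, Any]
    id: str
    media_uri: Optional[str]
    media_mime_type: Optional[str]

    def __init__(
        self,
        message: str,
        context_data: Dict[str, Any],
        id: Optional[str] = None,
        media_uri: Optional[str] = None,
        media_mime_type: Optional[str] = None,
    ) -> None:
        # Every declared field must be accepted here, or from_dict() will
        # raise TypeError when it passes the field as a keyword argument.
        self.message = message
        self.context_data = context_data
        self.id = id or str(uuid4())
        self.media_uri = media_uri
        self.media_mime_type = media_mime_type

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingMsg":
        return cls(**data)
```

A round-trip test (`from_dict(msg.to_dict()) == msg`) is a cheap way to catch this class of mismatch before merge.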

High

GeminiDenseEmbedder never overrides supports_multimodal (inherits False from base class), and does not implement embed_multimodal(). The multimodal code path in collection_schemas.py is gated by supports_multimodal, so the entire multimodal feature is effectively dead code.
Security: viking_fs.read_file_bytes is called with ctx=None, bypassing tenant-scoped access control. In multi-tenant environments this could allow cross-tenant file reads.

Medium

google-genai is added as a core dependency instead of an optional one — all users must install the Google GenAI SDK even if they only use Volcengine/OpenAI.
.mpeg extension appears in both _MIME_MAP_VIDEO and _MIME_MAP_AUDIO — the video map always wins due to check order.
_infer_image_mime / _infer_media_mime are defined but never called from production code.
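
The duplicated `.mpeg` entry can be avoided by keeping one extension table instead of separate per-modality maps, as in the sketch below. The MIME types are taken from the list in the PR description; which extension maps to which type, and the names `_EXT_TO_MIME` and `infer_mime`, are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# One table, so each extension is listed once and resolves to one MIME type.
_EXT_TO_MIME = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
    ".webp": "image/webp",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
    ".mp4": "video/mp4",
    ".mpeg": "video/mpeg",
    ".mov": "video/mov",
    ".avi": "video/avi",
    ".webm": "video/webm",
    ".wmv": "video/wmv",
    ".3gp": "video/3gpp",
    ".pdf": "application/pdf",
}


def infer_mime(path: str) -> Optional[str]:
    """Return the MIME type for a path's extension, or None if unsupported."""
    return _EXT_TO_MIME.get(PurePosixPath(path).suffix.lower())
```

Returning None for unknown extensions lets the caller fall back to text embedding rather than guessing a type.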

We appreciate the effort on this PR! Please fix the issues above and feel free to resubmit. Happy to help review again once it is ready.


chethanuk · 2026-03-17T14:56:12Z

> Please fix the issues above and feel free to resubmit. Happy to help review again once it is ready.

Okay, will fix and raise a PR.


github-project-automation bot added this to OpenViking project Mar 14, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 14, 2026

chethanuk force-pushed the main branch from d565945 to aa991c1 Compare March 14, 2026 19:33

MaojiaSheng mentioned this pull request Mar 14, 2026

feat: add Google/Gemini embedding provider support #589

Closed

This comment was marked as spam.

ZaynJarvis reviewed Mar 15, 2026

chethanuk mentioned this pull request Mar 15, 2026

fix(embedder): resolve PR #607 review — task_type per-op, EmbeddingHandler relocation, multimodal feature flag chethanuk/OpenViking#2

Closed

6 tasks

chethanuk mentioned this pull request Mar 15, 2026

fix(gemini): scope-reduce to TextEmbedder, resolve PR #607 review comments chethanuk/OpenViking#3

Closed

6 tasks

ZaynJarvis marked this pull request as draft March 16, 2026 07:58

chethanuk added 2 commits March 16, 2026 09:57

feat: Add asynchronous batch embedding with concurrency control to Ge…

a212fa7

…mini embedder.

chethanuk force-pushed the main branch from e775c6c to dcc6138 Compare March 16, 2026 08:57

Reduce scope to use GeminiDenseEmbedder as only text embed

e2bb85b

chethanuk force-pushed the main branch from dcc6138 to e2bb85b Compare March 16, 2026 09:00

chethanuk marked this pull request as ready for review March 16, 2026 10:53

Merge branch 'main' into main

fdb951b

Merge branch 'main' into main

33eba96

qin-ctx mentioned this pull request Mar 17, 2026

[Feature]: 支持gemini embedding 2这种多模态向量模型 #695

Open

1 task

Merge branch 'main' into main

9376991

chethanuk requested a review from MaojiaSheng March 17, 2026 11:51

qin-ctx approved these changes Mar 17, 2026

qin-ctx merged commit 95bd197 into volcengine:main Mar 17, 2026
1 check passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 17, 2026

qin-ctx added a commit that referenced this pull request Mar 17, 2026

Revert "feat(embedder): Gemini Embedding 2 multimodal support (text +…

8294ef5

… image/video/audio/PDF) (#607)" This reverts commit 95bd197.

qin-ctx mentioned this pull request Mar 17, 2026

Revert: feat(embedder): Gemini Embedding 2 multimodal support (#607) #703

Merged

qin-ctx added a commit that referenced this pull request Mar 17, 2026

Revert "feat(embedder): Gemini Embedding 2 multimodal support (text +…

ae35f46

… image/video/audio/PDF) (#607)" (#703) This reverts commit 95bd197.

ZaynJarvis pushed a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026

Revert "feat(embedder): Gemini Embedding 2 multimodal support (text +…

eb5182e

… image/video/audio/PDF) (volcengine#607)" (volcengine#703) This reverts commit 95bd197.


6 participants
