
feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF) #607

Merged
qin-ctx merged 6 commits into volcengine:main from chethanuk:main
Mar 17, 2026

Conversation

@chethanuk
Contributor

@chethanuk chethanuk commented Mar 14, 2026

Description

Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)

Native text + multimodal (image, video, audio, PDF) embedding via gemini-embedding-2-preview (google-genai 1.67.0). Additive provider pattern — Volcengine remains the default; Gemini is opt-in via provider: "gemini" in ov.conf

- **Model**: `gemini-embedding-2-preview`
- **Input**: text, image, video, audio, PDF (17 MIME types)
- **Output dimension**: 128–3072 (default: **3072**, recommended: 768 / 1536 / 3072)
- **Input token limit**: **8,192 tokens**
- **Supported MIME types**: `image/jpeg`, `image/png`, `image/gif`, `image/webp`, `audio/mpeg`, `audio/mp3`, `audio/wav`, `audio/ogg`, `audio/flac`, `video/mp4`, `video/mpeg`, `video/mov`, `video/avi`, `video/webm`, `video/wmv`, `video/3gpp`, `application/pdf`
  • Gemini Embedding 2 Multimodal Support: Introduced a new GeminiDenseEmbedder to support native text and multimodal (image, video, audio, PDF) embedding using the gemini-embedding-2-preview model. This is an opt-in provider via configuration.
  • Extended Queue Pipeline for Multimodal Content: The EmbeddingMsg now carries media_uri and media_mime_type to facilitate multimodal content processing. The TextEmbeddingHandler.on_dequeue() method was updated to read raw bytes from viking_fs and call embed_multimodal() when applicable.
  • End-to-End Configuration and Security: The EmbeddingConfig now registers the 'gemini' provider with a task_type field. A critical security validation was added to ensure media_uri matches context_data['uri'] before file reads, preventing forged queue messages from accessing arbitrary files. If validation fails or multimodal embedding fails, it falls back to text embedding.
  • Multimodal Content Representation: A new ModalContent dataclass was introduced to represent media references, including MIME type, URI, and optional raw data, enabling the Vectorize object to encapsulate both text and media for embedding
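The validation-and-fallback behaviour described in the list above could be sketched roughly as follows. This is a minimal illustration and not the PR's code: the helper names `select_embedding_path` and `embed_with_fallback` and their return values are assumptions; only `media_uri`, `media_mime_type` and `context_data['uri']` come from the description.

```python
from typing import Callable, List, Optional


def select_embedding_path(
    media_uri: Optional[str],
    media_mime_type: Optional[str],
    context_data: dict,
) -> str:
    """Return "multimodal" only when the queued media reference can be trusted.

    Hypothetical helper: the name and the string return values are illustrative.
    """
    # No media attached: plain text embedding.
    if not media_uri or not media_mime_type:
        return "text"
    # The message must point at the same resource as its own context;
    # otherwise a forged queue message could request an arbitrary file read.
    if media_uri != context_data.get("uri"):
        return "text"
    return "multimodal"


def embed_with_fallback(
    path: str,
    embed_multimodal: Callable[[], List[float]],
    embed_text: Callable[[], List[float]],
) -> List[float]:
    """Try the multimodal embedder first and degrade to text on any failure."""
    if path == "multimodal":
        try:
            return embed_multimodal()
        except Exception:
            # A multimodal failure falls through to the text path.
            pass
    return embed_text()
```

The point of the design is that the untrusted field (`media_uri`) is compared against a value the pipeline already owns before any bytes are read, and that every failure mode ends in a text embedding rather than a dropped message.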

Related Issue

Closes: #566

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

@chethanuk
Contributor Author

@MaojiaSheng @yangxinxin-7 @chenjw @hanxiao Please review and merge thank you :)

Would love to use OpenViking with gemini 2 thank you :)

@lazmo88

lazmo88 commented Mar 14, 2026

Subscribing — we're using Gemini Embedding 2 (3072d) via OpenAI-compatible proxy and would benefit from native multimodal embedding support. Looking forward to this landing!

@MaojiaSheng
Collaborator

> @MaojiaSheng @yangxinxin-7 @chenjw @hanxiao Please review and merge thank you :)
>
> Would love to use OpenViking with gemini 2 thank you :)

@chethanuk thank you, and congratulations on this great work. Have you finished testing it in different situations?


        raise RuntimeError(f"Gemini embedding failed (code={e.code}): {e}") from e

    def embed_multimodal(self, vectorize: "Vectorize") -> EmbedResult:  # type: ignore[name-defined]
        media = getattr(vectorize, "media", None)
Collaborator


According to the Gemini Embedding 2 docs, the parts API supports a sequence of content parts.

e.g.

text..
<media 1>
text..
<image 2>

This often happens in PDFs, when the chunking strategy produces chunks that mix text and media.

As such, the current implementation with one media field and one text field is limited. When working with multimodal content, Vectorize might need a parts field to make full use of the multimodal embedding's parts-aggregation feature, rather than a single extra media field.

            text=fallback_text,
            media=ModalContent(uri=file_path, mime_type=mime_type),
        )
    )
Collaborator


Gemini limits documents (PDF) to a maximum of 6 pages. To make this work reliably, PDFs must be chunked before embedding.
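One way to picture the chunking step is to split a document into page windows that each stay within the limit. The sketch below is only an illustration: `page_windows` and `MAX_PDF_PAGES` are hypothetical names, and the 6-page figure is taken from the comment above rather than checked against the current Gemini documentation.

```python
from typing import List, Tuple

# Limit quoted in the review comment; treat it as an assumption to verify.
MAX_PDF_PAGES = 6


def page_windows(total_pages: int, max_pages: int = MAX_PDF_PAGES) -> List[Tuple[int, int]]:
    """Split a document into 1-based, inclusive page ranges of at most max_pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 14-page PDF becomes three windows, each small enough to embed on its own.
print(page_windows(14))  # [(1, 6), (7, 12), (13, 14)]
```

Each window would then be extracted and embedded separately; how the per-window embeddings are stored or aggregated is outside this sketch.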

Collaborator

@MaojiaSheng MaojiaSheng Mar 16, 2026


I suggest not depending on PDF embedding directly; it is not a general approach.

    ResourceContentType.VIDEO,
    ResourceContentType.AUDIO,
) or is_pdf:
    is_multimodal_provider = (embedding_provider or "").lower() == "gemini"
Collaborator


Use the supports_multimodal function here. (doubao-embedding-vision is also a multimodal embedding model that supports aggregation.)
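A capability check of this kind might look like the sketch below. It assumes `supports_multimodal` is exposed as a property that defaults to `False` on the base class; the classes here are stand-ins, and `should_use_multimodal` is a hypothetical helper, not code from the PR.

```python
class DenseEmbedderBase:
    """Stand-in for the real base class, which has more methods."""

    @property
    def supports_multimodal(self) -> bool:
        # Text-only unless a subclass says otherwise.
        return False


class TextOnlyEmbedder(DenseEmbedderBase):
    pass


class MultimodalEmbedder(DenseEmbedderBase):
    @property
    def supports_multimodal(self) -> bool:
        return True


def should_use_multimodal(embedder: DenseEmbedderBase, is_media: bool) -> bool:
    # Ask the embedder what it can do instead of comparing provider names,
    # so any multimodal provider qualifies without another string check.
    return is_media and embedder.supports_multimodal
```

With this shape, adding another multimodal provider only requires overriding the property; the dispatch code does not change.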

@ZaynJarvis
Collaborator

ZaynJarvis commented Mar 15, 2026

The main limitation here is that OpenViking doesn't support multimodal embedding for now. Taking PDF as an example, the best practice is:

  1. chunk the PDF into a series of text and images
  2. use multimodal embedding to aggregate each chunk into one embedding

The current implementation supports multimodal embedding with one text and one media item (PDF/image), with three possible issues:

  1. the text is a summary generated by a VLM, so it doesn't provide new context (which is what multimodal aggregation is actually for)
  2. there is no safeguard against a PDF exceeding the page limit
  3. this feature works on TextEmbeddingHandler (used here, defined here), which is a possible architecture violation. A MultiModalEmbeddingHandler should be defined (or the class should no longer be named TextEmbeddingHandler)

My suggestion: add Gemini Embedding 2 as a text embedder, then include its multimodal and parts-aggregation capability once OpenViking fully embraces multimodal embedding as add-resource is improved. This is on our timeline for April.
I would love to merge this if it's updated to be a new TextEmbedder for simplicity.

@MaojiaSheng @chethanuk looking forward to hearing from you.


Other things to consider:

  1. some tests are redundant and need cleanup.
  2. task_type should not be set in the config file; it is probably best to use different types for the indexing and querying scenarios. This PR is a reference for setting the task type:

     > Embeddings optimized for general search queries. Use RETRIEVAL_QUERY for queries; RETRIEVAL_DOCUMENT for documents to be retrieved.

  3. possible misuse: embed_multimodal doesn't support embed_batch and async_embed_batch
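The task-type point can be illustrated with two entry points on the embedder, one for indexing and one for querying, so that callers never pass a task type themselves. This is a sketch under assumptions: `TaskAwareEmbedder` and `_call_backend` are hypothetical, and the backend call is replaced by a dummy that only records which task type would be sent.

```python
from typing import List


class DenseEmbedderBase:
    def embed(self, text: str) -> List[float]:
        raise NotImplementedError

    def embed_query(self, text: str) -> List[float]:
        # Default: providers without task types embed queries like documents.
        return self.embed(text)


class TaskAwareEmbedder(DenseEmbedderBase):
    """Hypothetical embedder that records the task type of each call."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def _call_backend(self, text: str, task_type: str) -> List[float]:
        # Placeholder for a real API request; returns a dummy vector.
        self.calls.append(task_type)
        return [float(len(text))]

    def embed(self, text: str) -> List[float]:
        # Indexing path: content that will be retrieved later.
        return self._call_backend(text, "RETRIEVAL_DOCUMENT")

    def embed_query(self, text: str) -> List[float]:
        # Query path: the search string typed by the user.
        return self._call_backend(text, "RETRIEVAL_QUERY")
```

Retrieval code would call `embed_query()` and ingestion code would call `embed()`, which keeps the choice out of the config file.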

chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 15, 2026
- Add embed_query() default to DenseEmbedderBase (delegates to embed())
- GeminiDenseEmbedder overrides embed_query() using _query_config
  (RETRIEVAL_QUERY); embed() uses _index_config (RETRIEVAL_DOCUMENT)
- Update hierarchical_retriever + memory_deduplicator to call embed_query()
- Deprecate task_type config field (still accepted, no validation error)
- Add enable_multimodal: bool = False flag; supports_multimodal reflects it
- Add embed_multimodal_batch / async_embed_multimodal_batch to base class
- Add Gemini async_embed_multimodal_batch override (anyio semaphore)
- Rewrite embed_multimodal: parts API + pdfminer PDF guard (gated by flag)
- Fix PR volcengine#607 issues #1, #2, #4, #6
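The semaphore-bounded async batch mentioned in this commit follows a common pattern. The sketch below swaps in the standard library's `asyncio.Semaphore` in place of anyio so that it has no third-party dependency; `embed_batch_bounded` and its parameters are hypothetical names, not the PR's API.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def embed_batch_bounded(
    items: Sequence[str],
    embed_one: Callable[[str], Awaitable[List[float]]],
    max_concurrency: int = 4,
) -> List[List[float]]:
    """Embed items concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> List[float]:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await embed_one(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Bounding concurrency this way keeps a large batch from issuing every request at once, which matters for rate-limited embedding APIs.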
chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 15, 2026
…n queuefs/

- Create openviking/storage/queuefs/embedding_handler.py with EmbeddingHandler
  (same logic, corrected class name + docstring)
- Replace TextEmbeddingHandler class body in collection_schemas.py with
  import + backward-compat alias TextEmbeddingHandler = EmbeddingHandler
- Update queue_manager.py to import EmbeddingHandler directly from queuefs
- Fixes PR volcengine#607 issue #3
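The backward-compatible alias described in this commit is a plain name rebinding. A minimal sketch, with the handler body reduced to a stub (the real `on_dequeue` logic is not shown in this thread):

```python
class EmbeddingHandler:
    """Handles dequeued embedding messages (body omitted in this sketch)."""

    def on_dequeue(self, message: dict) -> dict:
        return message


# Existing imports of the old name keep working because both names
# refer to the same class object.
TextEmbeddingHandler = EmbeddingHandler
```

Because the alias is the same object, `isinstance` checks and subclasses written against the old name continue to behave identically.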
chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 15, 2026
… embedding

- Add ContentPart = Union[str, ModalContent] type alias
- Add parts: Optional[List[ContentPart]] field to Vectorize
- Add get_parts(): returns parts if set, else builds [text, media] from legacy fields
- Add multi-part integration tests (TestGeminiE2EMultipartEmbedding)
- Fixes PR volcengine#607 issue #1 (multi-part sequences)
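The `parts` field and `get_parts()` fallback listed in this commit might be shaped roughly like this. The field names follow the commit message and the PR description; the defaults, and the choice to skip empty legacy fields, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ModalContent:
    mime_type: str
    uri: str
    data: Optional[bytes] = None  # optional raw bytes


ContentPart = Union[str, ModalContent]


@dataclass
class Vectorize:
    text: str = ""
    media: Optional[ModalContent] = None
    parts: Optional[List[ContentPart]] = None

    def get_parts(self) -> List[ContentPart]:
        # An explicit parts sequence takes precedence.
        if self.parts is not None:
            return list(self.parts)
        # Otherwise rebuild [text, media] from the legacy fields,
        # leaving out whichever of the two is unset (an assumption).
        legacy: List[ContentPart] = []
        if self.text:
            legacy.append(self.text)
        if self.media is not None:
            legacy.append(self.media)
        return legacy
```

An interleaved chunk such as text, image, text is then expressed as `Vectorize(parts=["intro", image, "outro"])`, while older callers that only set `text` and `media` get the same two-element sequence as before.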
chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 15, 2026
…gine#607 review)

Merge 17 upstream commits (feat/fix: trace metrics, multi-agent isolation,
vikingdb TUI, rate-limit simplification, session fixes, etc.) and resolve
conflicts in embedding_msg, embedding_msg_converter, collection_schemas,
pyproject.toml, and uv.lock.

Scope reduction (defers multimodal pipeline to April when add-resource
pipeline is ready):
- Remove multimodal dispatch from embedding_utils.vectorize_file; media
  files always fall back to text/summary (resolves r2936557627: removes
  hardcoded "gemini" string check; resolves r2936556516: no PDF 6-page
  limit concern)
- Drop media_uri/media_mime_type from EmbeddingMsg; add telemetry_id
  from upstream; preserve id through to_dict/from_dict round-trip
- Drop media fields from EmbeddingMsgConverter; pass telemetry_id
- Remove multimodal on_dequeue path from TextEmbeddingHandler; adopt
  upstream's telemetry-tracked text-only path
- Add TODO in GeminiDenseEmbedder.embed_multimodal for parts-list
  aggregation (resolves r2936536550 — deferred to April)

GeminiDenseEmbedder class retains supports_multimodal + embed_multimodal
for future activation once Vectorize.parts is supported end-to-end.
@ZaynJarvis
Collaborator

ZaynJarvis commented Mar 16, 2026

@chethanuk the two PRs have been submitted to your fork. When this PR is ready, you can mark it as "ready for review".

Please help resolve the conflicts as well.

@ZaynJarvis ZaynJarvis marked this pull request as draft March 16, 2026 07:58
@chethanuk chethanuk marked this pull request as ready for review March 16, 2026 10:53
@chethanuk
Contributor Author

@ZaynJarvis Please review now

@qin-ctx
Collaborator

qin-ctx commented Mar 17, 2026

Hi @chethanuk, thanks for the update! It looks like there are merge conflicts again — likely caused by another recently merged provider PR. Could you rebase on the latest main and resolve the conflicts when you get a chance? Thanks!

@chethanuk chethanuk requested a review from MaojiaSheng March 17, 2026 11:51
@chethanuk
Contributor Author

Please review now

@qin-ctx qin-ctx merged commit 95bd197 into volcengine:main Mar 17, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 17, 2026
qin-ctx added a commit that referenced this pull request Mar 17, 2026
@qin-ctx
Collaborator

qin-ctx commented Mar 17, 2026

Hi @chethanuk, we had to revert this PR (#703) after a code review found several critical issues. Here is a summary:

Critical

  1. embedding_config.py has syntax errors that prevent Python from parsing the file. When 'gemini' was inserted into the provider/backend description strings, it broke the string literals in multiple places (lines ~44, 51, 109). Also the task_type Field() on line ~62 is missing a closing parenthesis. This breaks all embedding functionality for every provider, not just Gemini.

  2. EmbeddingMsg.__init__ does not accept the new media_uri/media_mime_type/id parameters. The custom __init__ overrides the dataclass-generated one, but was not updated to include the new fields. Both from_dict() and EmbeddingMsgConverter.from_context() pass these kwargs, causing TypeError at runtime.
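The second failure mode is easy to reproduce in isolation: when a class decorated with `@dataclass` already defines `__init__`, the decorator leaves that method in place, so newly declared fields are not accepted as arguments. The classes below are simplified stand-ins, not the real `EmbeddingMsg`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BrokenMsg:
    message: str
    media_uri: Optional[str] = None  # newly added field

    # The hand-written __init__ is kept by @dataclass, so the new field
    # is rejected unless this signature is updated as well.
    def __init__(self, message: str):
        self.message = message


@dataclass
class FixedMsg:
    message: str
    media_uri: Optional[str] = None

    def __init__(self, message: str, media_uri: Optional[str] = None):
        self.message = message
        self.media_uri = media_uri


def accepts_media_uri(cls) -> bool:
    """Report whether cls can be constructed with the new keyword."""
    try:
        cls(message="hi", media_uri="a.png")
        return True
    except TypeError:
        return False
```

A small construction test like `accepts_media_uri` in the unit suite would have caught the mismatch before `from_dict()` hit it at runtime.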

High

  1. GeminiDenseEmbedder never overrides supports_multimodal (inherits False from base class), and does not implement embed_multimodal(). The multimodal code path in collection_schemas.py is gated by supports_multimodal, so the entire multimodal feature is effectively dead code.

  2. Security: viking_fs.read_file_bytes is called with ctx=None, bypassing tenant-scoped access control. In multi-tenant environments this could allow cross-tenant file reads.
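The tenant-scoping concern can be illustrated with a toy store that fails closed when no context is supplied. Everything here is hypothetical: `RequestContext`, `ScopedFileStore` and the in-memory layout are stand-ins, and the real `viking_fs` may treat `ctx=None` differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RequestContext:
    tenant_id: str


class ScopedFileStore:
    """Toy in-memory store keyed by (tenant_id, uri); not the real viking_fs."""

    def __init__(self) -> None:
        self._files: Dict[Tuple[str, str], bytes] = {}

    def write(self, ctx: RequestContext, uri: str, data: bytes) -> None:
        self._files[(ctx.tenant_id, uri)] = data

    def read_file_bytes(self, uri: str, ctx: Optional[RequestContext]) -> bytes:
        # Fail closed: a missing context must not mean "read as anyone".
        if ctx is None:
            raise PermissionError("a request context is required for file reads")
        try:
            return self._files[(ctx.tenant_id, uri)]
        except KeyError:
            raise FileNotFoundError(uri) from None
```

In a queue handler, this means the tenant context has to travel with the message so that the read on dequeue is scoped the same way as the original request.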

Medium

  1. google-genai is added as a core dependency instead of an optional one — all users must install the Google GenAI SDK even if they only use Volcengine/OpenAI.
  2. .mpeg extension appears in both _MIME_MAP_VIDEO and _MIME_MAP_AUDIO — the video map always wins due to check order.
  3. _infer_image_mime / _infer_media_mime are defined but never called from production code.
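The `.mpeg` collision is a consequence of checking one map before the other. The sketch below uses small illustrative maps (the real `_MIME_MAP_VIDEO` and `_MIME_MAP_AUDIO` contents are not shown in this thread) and a hypothetical `shadowed_extensions` helper that a test could use to detect such overlaps.

```python
from typing import Optional, Set

# Illustrative maps only; the real ones contain more entries.
MIME_MAP_VIDEO = {".mp4": "video/mp4", ".mpeg": "video/mpeg"}
MIME_MAP_AUDIO = {".mp3": "audio/mp3", ".mpeg": "audio/mpeg"}


def infer_mime(extension: str) -> Optional[str]:
    ext = extension.lower()
    # Video is checked first, so an extension present in both maps
    # can never resolve to its audio MIME type.
    if ext in MIME_MAP_VIDEO:
        return MIME_MAP_VIDEO[ext]
    return MIME_MAP_AUDIO.get(ext)


def shadowed_extensions() -> Set[str]:
    """Extensions whose audio mapping is unreachable because video wins."""
    return set(MIME_MAP_VIDEO) & set(MIME_MAP_AUDIO)
```

Asserting that `shadowed_extensions()` is empty in a unit test is one cheap way to keep the two maps disjoint.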

We appreciate the effort on this PR! Please fix the issues above and feel free to resubmit. Happy to help review again once it is ready.

qin-ctx added a commit that referenced this pull request Mar 17, 2026
@chethanuk
Contributor Author

> Please fix the issues above and feel free to resubmit. Happy to help review again once it is ready.

Okay, will fix and raise a PR

ZaynJarvis pushed a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026
…ideo/audio/PDF) (volcengine#607)

* feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)

* feat: Add asynchronous batch embedding with concurrency control to Gemini embedder.

* Reduce scope to use GeminiDenseEmbedder as only text embed
ZaynJarvis pushed a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval

6 participants