feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)#607
qin-ctx merged 6 commits into volcengine:main
Conversation
@MaojiaSheng @yangxinxin-7 @chenjw @hanxiao Please review and merge, thank you :) Would love to use OpenViking with Gemini Embedding 2.
Subscribing. We're using Gemini Embedding 2 (3072d) via an OpenAI-compatible proxy and would benefit from native multimodal embedding support. Looking forward to this landing!
@chethanuk thank you, this is a great contribution. Have you finished testing it in different scenarios?
            raise RuntimeError(f"Gemini embedding failed (code={e.code}): {e}") from e

    def embed_multimodal(self, vectorize: "Vectorize") -> EmbedResult:  # type: ignore[name-defined]
        media = getattr(vectorize, "media", None)
According to the Gemini Embedding 2 docs, the parts API supports a sequence of content parts, e.g.
text..
<media 1>
text..
<image 2>
This often happens in PDFs, when the chunking strategy produces chunks that mix modalities.
As such, the current implementation, with one media field and one text field, is limited. When working with multimodal content, Vectorize might need a parts field to make full use of the multimodal embedding's parts aggregation feature, rather than only an extra media field.
openviking/utils/embedding_utils.py
Outdated
            text=fallback_text,
            media=ModalContent(uri=file_path, mime_type=mime_type),
        )
    )
Gemini limits documents (PDF) to a maximum of 6 pages. To make this work reliably, the PDF must be chunked before embedding.
I suggest not depending on direct PDF embedding; it is not a general approach.
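To make the chunking point concrete, here is a minimal sketch of splitting a document into page ranges that each respect the 6-page limit quoted in the comment above. The function name `page_batches` and the constant `MAX_PDF_PAGES` are hypothetical and not part of this PR; extracting the pages themselves (for example with a PDF library) is left out.

```python
from typing import List, Tuple

# Assumed limit, taken from the review comment above (6 pages per PDF input).
MAX_PDF_PAGES = 6


def page_batches(total_pages: int, max_pages: int = MAX_PDF_PAGES) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges of at most max_pages pages."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]
```

Each range could then be embedded as its own chunk, so a long PDF never reaches the embedder as a single oversized input.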
openviking/utils/embedding_utils.py
Outdated
        ResourceContentType.VIDEO,
        ResourceContentType.AUDIO,
    ) or is_pdf:
        is_multimodal_provider = (embedding_provider or "").lower() == "gemini"
Use the supports_multimodal function instead of the hardcoded provider string. (doubao-embedding-vision is also a multimodal embedding model that supports aggregation.)
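A rough sketch of what a capability check could look like in place of the provider-name comparison. The classes below are hypothetical stand-ins, not the project's real embedders, and modelling `supports_multimodal` as a property is an assumption; the helper also accepts a plain method.

```python
class DenseEmbedderBase:
    """Hypothetical stand-in for the base embedder: text-only by default."""

    @property
    def supports_multimodal(self) -> bool:
        return False


class MultimodalCapableEmbedder(DenseEmbedderBase):
    """Hypothetical embedder whose capability is switched on by a flag."""

    def __init__(self, enable_multimodal: bool = False) -> None:
        self._enable_multimodal = enable_multimodal

    @property
    def supports_multimodal(self) -> bool:
        return self._enable_multimodal


def should_embed_media(embedder: object, is_media_file: bool) -> bool:
    """Decide on the multimodal path from the embedder's capability, not its name."""
    flag = getattr(embedder, "supports_multimodal", False)
    # Accept either a property/attribute or a zero-argument method.
    capable = bool(flag() if callable(flag) else flag)
    return is_media_file and capable
```

With a check like this, any provider that reports the capability would take the multimodal path without another string comparison being added.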
The main limitation here is that OpenViking doesn't support multimodal embedding for now. Taking PDF as an example, the best practice is:
1. Chunk the PDF into a series of texts and images.
2. Use multimodal embedding to aggregate each chunk into one embedding.
The current implementation supports multimodal embedding with one text and one media item (PDF/image), with three possible issues
My suggestion is: add Gemini Embedding 2 as a text embedder first, then include its multimodal and parts aggregation capability once OpenViking fully embraces multimodal embedding as add-resource is improved. This is on our timeline for April. @MaojiaSheng @chethanuk looking forward to hearing from you. Other things to consider
- Add embed_query() default to DenseEmbedderBase (delegates to embed())
- GeminiDenseEmbedder overrides embed_query() using _query_config (RETRIEVAL_QUERY); embed() uses _index_config (RETRIEVAL_DOCUMENT)
- Update hierarchical_retriever + memory_deduplicator to call embed_query()
- Deprecate task_type config field (still accepted, no validation error)
- Add enable_multimodal: bool = False flag; supports_multimodal reflects it
- Add embed_multimodal_batch / async_embed_multimodal_batch to base class
- Add Gemini async_embed_multimodal_batch override (anyio semaphore)
- Rewrite embed_multimodal: parts API + pdfminer PDF guard (gated by flag)
- Fix PR volcengine#607 issues #1, #2, #4, #6
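The embed()/embed_query() split described in this commit could look roughly like the sketch below. Only the method names, `_index_config`, `_query_config`, and the two task types come from the commit message; the `EmbedResult` fields, the class `GeminiLikeEmbedder`, and the `_call_model` helper (which returns a fake vector instead of calling any API) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EmbedResult:
    # Hypothetical result shape; task_type is recorded only so the demo can inspect it.
    dense_vector: List[float]
    task_type: str = ""


class DenseEmbedderBase:
    def embed(self, text: str) -> EmbedResult:
        raise NotImplementedError

    def embed_query(self, text: str) -> EmbedResult:
        # Default: providers without a query/document distinction reuse embed().
        return self.embed(text)


class GeminiLikeEmbedder(DenseEmbedderBase):
    _index_config: Dict[str, str] = {"task_type": "RETRIEVAL_DOCUMENT"}
    _query_config: Dict[str, str] = {"task_type": "RETRIEVAL_QUERY"}

    def _call_model(self, text: str, config: Dict[str, str]) -> EmbedResult:
        # Placeholder for the real API call.
        return EmbedResult(dense_vector=[float(len(text))], task_type=config["task_type"])

    def embed(self, text: str) -> EmbedResult:
        return self._call_model(text, self._index_config)

    def embed_query(self, text: str) -> EmbedResult:
        return self._call_model(text, self._query_config)
```

Callers on the retrieval side would then use embed_query(), while indexing code keeps calling embed().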
…n queuefs/
- Create openviking/storage/queuefs/embedding_handler.py with EmbeddingHandler (same logic, corrected class name + docstring)
- Replace TextEmbeddingHandler class body in collection_schemas.py with import + backward-compat alias TextEmbeddingHandler = EmbeddingHandler
- Update queue_manager.py to import EmbeddingHandler directly from queuefs
- Fixes PR volcengine#607 issue #3
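As a small illustration of the backward-compat alias mentioned above: in the real change the alias sits next to an import in collection_schemas.py, whereas this sketch uses a hypothetical stand-in class in a single file.

```python
class EmbeddingHandler:
    """Hypothetical stand-in for the handler moved into queuefs/."""

    def on_dequeue(self, data: dict) -> dict:
        # The real handler embeds the message; this stub just passes it through.
        return data


# Old name kept so existing imports of TextEmbeddingHandler keep working.
TextEmbeddingHandler = EmbeddingHandler
```

Because the alias is the same class object, isinstance checks and subclasses written against the old name behave as before.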
… embedding
- Add ContentPart = Union[str, ModalContent] type alias
- Add parts: Optional[List[ContentPart]] field to Vectorize
- Add get_parts(): returns parts if set, else builds [text, media] from legacy fields
- Add multi-part integration tests (TestGeminiE2EMultipartEmbedding)
- Fixes PR volcengine#607 issue #1 (multi-part sequences)
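A minimal sketch of the parts fallback described in this commit. The field names follow the commit message and the PR description (uri, MIME type, optional raw data); the default values, and treating "set" as "not None", are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ModalContent:
    uri: str
    mime_type: str
    data: Optional[bytes] = None  # optional raw bytes


ContentPart = Union[str, ModalContent]


@dataclass
class Vectorize:
    text: str = ""
    media: Optional[ModalContent] = None
    parts: Optional[List[ContentPart]] = None

    def get_parts(self) -> List[ContentPart]:
        """Return the explicit parts sequence, or build one from the legacy fields."""
        if self.parts is not None:
            return list(self.parts)
        legacy: List[ContentPart] = []
        if self.text:
            legacy.append(self.text)
        if self.media is not None:
            legacy.append(self.media)
        return legacy
```

An interleaved chunk such as text, image, text, image then keeps its order, while older callers that only set text and media still get a two-element sequence.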
…gine#607 review)
Merge 17 upstream commits (feat/fix: trace metrics, multi-agent isolation, vikingdb TUI, rate-limit simplification, session fixes, etc.) and resolve conflicts in embedding_msg, embedding_msg_converter, collection_schemas, pyproject.toml, and uv.lock.
Scope reduction (defers multimodal pipeline to April when add-resource pipeline is ready):
- Remove multimodal dispatch from embedding_utils.vectorize_file; media files always fall back to text/summary (resolves r2936557627: removes hardcoded "gemini" string check; resolves r2936556516: no PDF 6-page limit concern)
- Drop media_uri/media_mime_type from EmbeddingMsg; add telemetry_id from upstream; preserve id through to_dict/from_dict round-trip
- Drop media fields from EmbeddingMsgConverter; pass telemetry_id
- Remove multimodal on_dequeue path from TextEmbeddingHandler; adopt upstream's telemetry-tracked text-only path
- Add TODO in GeminiDenseEmbedder.embed_multimodal for parts-list aggregation (resolves r2936536550, deferred to April)
GeminiDenseEmbedder class retains supports_multimodal + embed_multimodal for future activation once Vectorize.parts is supported end-to-end.
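The "preserve id through to_dict/from_dict round-trip" item can be illustrated with a sketch like the one below. Only `id`, `telemetry_id`, `to_dict`, and `from_dict` are named in the commit message; the `message` and `context_data` fields and the UUID default are assumptions about the message shape.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class EmbeddingMsg:
    message: str
    context_data: Dict[str, Any] = field(default_factory=dict)
    telemetry_id: str = ""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_dict(self) -> Dict[str, Any]:
        return {
            "id": self.id,
            "message": self.message,
            "context_data": dict(self.context_data),
            "telemetry_id": self.telemetry_id,
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingMsg":
        msg = cls(
            message=data["message"],
            context_data=dict(data.get("context_data") or {}),
            telemetry_id=data.get("telemetry_id", ""),
        )
        # Reuse the serialized id instead of the freshly generated one.
        if data.get("id"):
            msg.id = data["id"]
        return msg
```

Without the explicit id assignment, every deserialization would mint a new id and the message could no longer be traced across the queue.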
@chethanuk the two PRs have been submitted to your fork. When this PR is ready, you can mark it as "ready for review". Please help resolve the conflicts as well.
…ideo/audio/PDF)
Native text + multimodal (image, video, audio, PDF) embedding via `gemini-embedding-2-preview` (google-genai 1.67.0). Additive provider pattern — Volcengine remains the default; Gemini is opt-in via `provider: "gemini"` in `ov.conf`.
- **Model**: `gemini-embedding-2-preview`
- **Input**: text, image, video, audio, PDF (17 MIME types)
- **Output dimension**: 128–3072 (default: **3072**, recommended: 768 / 1536 / 3072)
- **Input token limit**: **8,192 tokens**
- **Supported MIME types**: `image/jpeg`, `image/png`, `image/gif`, `image/webp`, `audio/mpeg`, `audio/mp3`, `audio/wav`, `audio/ogg`, `audio/flac`, `video/mp4`, `video/mpeg`, `video/mov`, `video/avi`, `video/webm`, `video/wmv`, `video/3gpp`, `application/pdf`
- Gemini Embedding 2 Multimodal Support: Introduced a new GeminiDenseEmbedder to support native text and multimodal (image, video, audio, PDF) embedding using the gemini-embedding-2-preview model. This is an opt-in provider via configuration.
- Extended Queue Pipeline for Multimodal Content: The EmbeddingMsg now carries media_uri and media_mime_type to facilitate multimodal content processing. The TextEmbeddingHandler.on_dequeue() method was updated to read raw bytes from viking_fs and call embed_multimodal() when applicable.
- End-to-End Configuration and Security: The EmbeddingConfig now registers the 'gemini' provider with a task_type field. A critical security validation was added to ensure media_uri matches context_data['uri'] before file reads, preventing forged queue messages from accessing arbitrary files. If validation fails or multimodal embedding fails, it falls back to text embedding.
- Multimodal Content Representation: A new ModalContent dataclass was introduced to represent media references, including MIME type, URI, and optional raw data, enabling the Vectorize object to encapsulate both text and media for embedding.
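The validation described in the security bullet above can be sketched as a pure function that either returns the URI to read or signals a fallback to text embedding. The function name `resolve_media_uri` is hypothetical; the MIME set is copied from the list in this description, and the rule that `media_uri` must equal `context_data['uri']` comes from the bullet.

```python
from typing import Any, Dict, Optional

# The 17 MIME types listed in the description above.
SUPPORTED_MIME_TYPES = frozenset({
    "image/jpeg", "image/png", "image/gif", "image/webp",
    "audio/mpeg", "audio/mp3", "audio/wav", "audio/ogg", "audio/flac",
    "video/mp4", "video/mpeg", "video/mov", "video/avi",
    "video/webm", "video/wmv", "video/3gpp",
    "application/pdf",
})


def resolve_media_uri(
    media_uri: Optional[str],
    media_mime_type: Optional[str],
    context_data: Dict[str, Any],
) -> Optional[str]:
    """Return the URI that is safe to read, or None to fall back to text embedding."""
    if not media_uri or not media_mime_type:
        return None
    if media_mime_type not in SUPPORTED_MIME_TYPES:
        return None
    # A forged queue message could point at an arbitrary file; only accept
    # the URI that the message's own context already refers to.
    if media_uri != context_data.get("uri"):
        return None
    return media_uri
```

A handler would read bytes only when this returns a URI, and embed the text field otherwise.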
@ZaynJarvis Please review now |
Hi @chethanuk, thanks for the update! It looks like there are merge conflicts again, likely caused by another recently merged provider PR. Could you rebase on the latest
Please review now
Hi @chethanuk, we had to revert this PR (#703) after a code review found several critical issues. Here is a summary:
Critical
High
Medium
We appreciate the effort on this PR! Please fix the issues above and feel free to resubmit. Happy to help review again once it is ready.
Okay, will fix and raise a PR
…ideo/audio/PDF) (volcengine#607)
* feat(embedder): Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)
* feat: Add asynchronous batch embedding with concurrency control to Gemini embedder.
* Reduce scope to use GeminiDenseEmbedder as only text embed
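For the asynchronous batch embedding with concurrency control mentioned in this commit list, a minimal sketch is shown below. An earlier commit in this thread mentions an anyio semaphore; this sketch swaps in the standard library's `asyncio.Semaphore` to stay dependency-free. The function name `embed_batch_concurrently` and the `embed_one` callback are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def embed_batch_concurrently(
    items: Sequence[str],
    embed_one: Callable[[str], Awaitable[List[float]]],
    max_concurrency: int = 4,
) -> List[List[float]]:
    """Embed all items, with at most max_concurrency calls in flight at once."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def _guarded(item: str) -> List[float]:
        async with semaphore:
            return await embed_one(item)

    # gather() returns results in input order, regardless of completion order.
    return list(await asyncio.gather(*(_guarded(item) for item in items)))
```

Bounding the number of in-flight calls is one way to stay under a provider's rate limit while still overlapping network latency.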
… image/video/audio/PDF) (volcengine#607)" (volcengine#703) This reverts commit 95bd197.
Description
Gemini Embedding 2 multimodal support (text + image/video/audio/PDF)
Native text + multimodal (image, video, audio, PDF) embedding via `gemini-embedding-2-preview` (google-genai 1.67.0). Additive provider pattern: Volcengine remains the default; Gemini is opt-in via `provider: "gemini"` in `ov.conf`.
Related Issue
Closes: #566
Type of Change
Changes Made
Testing
Checklist
Screenshots (if applicable)
Additional Notes