
feat(gemini): add GeminiDenseEmbedder text embedding provider #751

Merged
ZaynJarvis merged 2 commits into volcengine:main from chethanuk:feat/gemini-landing on Mar 20, 2026

Conversation

@chethanuk (Contributor) commented Mar 18, 2026

Description

Phase 1 of #566: Adds GeminiDenseEmbedder — a production-ready, opt-in text-only dense embedding provider backed by Google's google-genai SDK. Volcengine remains the default; Gemini is enabled via provider: "gemini" in ov.conf. No existing behaviour changes.
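
A sketch of the opt-in configuration, based on the keys shown in the PR's flowchart; the exact ov.conf structure and the `embedding.dense` nesting shown here are assumptions, not the PR's actual file format:

```yaml
# Hypothetical ov.conf fragment; key names taken from the PR's flowchart,
# the nesting is an assumption.
embedding:
  dense:
    provider: "gemini"
    api_key: "sk-…"
    dimension: 1536
```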

This PR implements Phase 1 (text-only). Phase 2 (multimodal embedding) is tracked in #566 — the issue remains open.

Phased Implementation

| Phase | Scope | Status |
| --- | --- | --- |
| Phase 1 (this PR) | Text-only dense embedding: 8 task types, MRL 1-3072, SDK retry, async batching, non-symmetric routing via #702 pattern | |
| Phase 2 (future) | Multimodal embedding: image/audio/video via Parts API, cross-modal retrieval, align with #718 approach | Planned |

Key capabilities

  • Text-only — supports_multimodal = False; multimodal is Phase 2
  • Non-symmetric routing — get_query_embedder() / get_document_embedder() route query_param/document_param to Gemini task types via factory (follows the #702 pattern)
  • Per-call task_type + title — override at call time; all 8 Gemini task types
  • Case-insensitive config — task_type, query_param, document_param auto-uppercased in config normalization
  • SDK-native retry — HttpRetryOptions(attempts=3, exp_base=2, max_delay=30s) with import guard for SDK < 0.8
  • Async concurrent batching — async_embed_batch() dispatches 100-text chunks in parallel via anyio
  • Empty-text guard — returns zero-vectors with warning log; preserves index alignment
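
The async batching bullet above can be sketched as follows. This is illustrative only: `embed_chunk` stands in for the real SDK call, and `asyncio.gather` stands in for the anyio task group the PR actually uses.

```python
import asyncio

# Stand-in for the real SDK call; the PR dispatches 100-text chunks to
# Gemini's embed_content. Here each text maps to a dummy 1-dim vector.
async def embed_chunk(chunk):
    await asyncio.sleep(0)  # simulate I/O
    return [[float(len(t))] for t in chunk]

async def async_embed_batch(texts, chunk_size=100):
    # Split into chunks and run them concurrently. The PR uses an anyio
    # task group; asyncio.gather is used here as a stand-in.
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    per_chunk = await asyncio.gather(*(embed_chunk(c) for c in chunks))
    # Flatten in order so index alignment with the input is preserved.
    return [vec for result in per_chunk for vec in result]

vectors = asyncio.run(async_embed_batch(["a", "bb", "ccc"], chunk_size=2))
print(len(vectors))  # 3
```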
```mermaid
flowchart LR
    subgraph Config["ov.conf / EmbeddingModelConfig"]
        CFG["provider: gemini\napi_key: sk-…\ntask_type: RETRIEVAL_DOCUMENT\ndimension: 1536"]
    end

    subgraph Embedder["GeminiDenseEmbedder(DenseEmbedderBase)"]
        direction TB
        BC["_build_config(task_type?, title?)"]
        E["embed(text, task_type?, title?)"]
        EB["embed_batch(texts, titles?)"]
        AEB["async_embed_batch(texts)"]
        E --> BC
        EB --> BC
        AEB --> BC
    end

    subgraph SDK["google-genai SDK"]
        direction TB
        HTTP["HttpOptions\n(retry: 3× exp backoff)"]
        SYNC["models.embed_content()"]
        ASYNC["aio.models.embed_content()\n(semaphore-bounded)"]
        HTTP --> SYNC
        HTTP --> ASYNC
    end

    Config --> Embedder
    BC --> SYNC
    AEB --> ASYNC
```

Related Issue

Partial implementation of #566 (Phase 1: text-only dense embedding; Phase 2: multimodal — issue stays open)

Type of Change

  • [ ] Bug fix
  • [x] New feature (non-breaking, opt-in via config)
  • [ ] Breaking change
  • [ ] Documentation update
  • [ ] Refactoring
  • [ ] Performance improvement
  • [ ] Test update

Changes Made

  • openviking/models/embedder/gemini_embedders.py — GeminiDenseEmbedder(DenseEmbedderBase): _build_config(), per-call task_type/title, SDK retry, __repr__, empty-text guard with warning log
  • openviking_cli/utils/config/embedding_config.py — registers "gemini" provider; validates + auto-uppercases task_type/query_param/document_param; factory routes context for non-symmetric mode
  • openviking/models/embedder/__init__.py — exports GeminiDenseEmbedder; annotation: "text-only; multimodal planned"
  • examples/ov.conf.example — Gemini config snippet
  • pyproject.toml — google-genai>=0.8 dep; optional gemini-async extra for anyio
  • tests/unit/test_gemini_embedder.py — 41 unit tests (mock-only, no API key needed)
  • tests/unit/test_embedding_config_gemini.py — 18 config validation tests (incl. case-insensitive task_type)
  • tests/integration/ — 49 integration tests (auto-skip without GOOGLE_API_KEY)

Testing

| Suite | Tests | Live key required |
| --- | --- | --- |
| test_gemini_embedder.py | 41 | No |
| test_embedding_config_gemini.py | 18 | No |
| test_gemini_e2e.py | 7 | Yes |
| test_gemini_embedding_it.py | 22 | Yes |
| test_gemini_openviking_it.py | 20 | Yes |

108 passed, 0 skipped  ← with GOOGLE_API_KEY
 59 passed, 49 skipped ← without GOOGLE_API_KEY (CI)

Checklist

  • Code follows project coding style (ruff + black formatted)
  • Self-review completed
  • Tests added and passing
  • Tested on Linux and macOS

@chethanuk (Author)

@qin-ctx Please review and merge this


@ZaynJarvis (Collaborator)

#702 is merged for query and document embedding support.

get_query_embedder() is not the expected usage.

btw check if you can improve based on #718

@qin-ptr (Contributor) left a comment

Review Summary

This PR adds Google Gemini as a text embedding provider with excellent test coverage (107 tests) and solid error handling. However, there are three blocking issues that need to be addressed:

  1. Design mismatch: Issue #566 explicitly requests multimodal retrieval support, but this PR only implements text-only embedding (supports_multimodal = False). The PR description should clarify this is a phased implementation with a roadmap for multimodal support, or change the issue reference to indicate partial completion.

  2. Documentation inconsistency: Code comment claims "multimodal" but implementation is text-only.

  3. CI failure: lint job failed due to formatting issues in test_gemini_embedder.py.

Once these are resolved, the PR will be ready to merge. The implementation quality is strong, with comprehensive testing, graceful error handling, and proper SDK retry integration.

🤖 I am a bot owned by @qin-ctx.

- OpenAI: Dense only
- Volcengine: Dense, Sparse, Hybrid
- Jina AI: Dense only
- Google Gemini: Dense only (multimodal)
[Bug] (blocking)

Comment says "Google Gemini: Dense only (multimodal)" but the actual implementation has GeminiDenseEmbedder.supports_multimodal = False. This is contradictory.

Should either:

  • Change to "Google Gemini: Dense only (text-only; multimodal planned)"
  • Or remove the "(multimodal)" annotation entirely until multimodal support is actually implemented

```python
# text-embedding-004: 768 fixed-dim legacy model, does not support MRL truncation
# Future gemini-embedding-*: default 3072 via _default_dimension() fallback
# Future text-embedding-*: default 768 via _default_dimension() prefix rule
supports_multimodal: bool = False  # text-only; multimodal planned separately
```

[Design] (blocking)

This PR claims to close issue #566, which explicitly requests "first-class Gemini Embedding 2 support for multimodal retrieval". The issue emphasizes:

  • "it should not reduce everything to plain text before embedding"
  • "multimodal inputs handled as multimodal, not flattened by default"

However, this implementation sets supports_multimodal = False and only handles text. This is a fundamental mismatch between the issue requirement and the PR implementation.

Recommendation: Either:

  1. Update the PR description to clarify this is phase 1 (text-only) with a roadmap for phase 2 (multimodal), and change the issue reference to "Partial implementation of [Feature]: Add first-class Gemini Embedding 2 support for multimodal retrieval #566", keeping the issue open
  2. Or implement multimodal support in this PR as originally requested by the issue

The current "Closes: #566" statement is misleading and will cause users to expect multimodal functionality that doesn't exist.

@@ -0,0 +1,479 @@
# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.

[Bug] (blocking)

CI lint job failed because this file needs formatting:

Would reformat: tests/unit/test_gemini_embedder.py

Please run black tests/unit/test_gemini_embedder.py and commit the formatted version.

if backend is not None and provider is None:
data["provider"] = backend
for key in ("query_param", "document_param"):
for key in ("input_type",):

[Suggestion] (non-blocking)

The for key in ("input_type",): loop normalizes input_type to lowercase, but the new Gemini provider uses task_type instead of input_type. Consider also normalizing task_type here, or document that task_type must be uppercase (which is currently enforced in the Gemini validator at line 176).

task_type: Optional[str] = None,
title: Optional[str] = None,
) -> EmbedResult:
if not text or not text.strip():

[Suggestion] (non-blocking)

Returning a zero vector for empty text is a reasonable fallback, but it happens silently. Consider logging a warning so callers know their empty input was handled specially:

```python
if not text or not text.strip():
    logger.warning("Empty text passed to embed(), returning zero vector")
    return EmbedResult(dense_vector=[0.0] * self._dimension)
```

This helps with debugging when unexpected empty strings appear in production.

chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 19, 2026
…fig normalization, empty-text warning

- Fix "(multimodal)" annotation to "(text-only; multimodal planned)" to match supports_multimodal=False (B1)
- Run black formatter on test_gemini_embedder.py (B3)
- Add auto-uppercase normalization for task_type/query_param/document_param in sync_provider_backend so lowercase config values pass validation (NB1)
- Add logger.warning on empty text in embed() for production debuggability (NB2)
- Add test_gemini_task_type_case_insensitive test (NB1)
- Fix ruff import sorting in __init__.py
@chethanuk (Author)

Thanks for the thorough review! All issues addressed in commit bcfd612:

@qin-ptr — blocking fixes

@qin-ptr — non-blocking (adopted both)

  • NB1 task_type normalization: Added auto-uppercase in sync_provider_backend for task_type, query_param, document_param. Users can now write task_type: "retrieval_document" (any case) in config. Added test test_gemini_task_type_case_insensitive.
  • NB2 Empty text warning: Added logger.warning("Empty text passed to embed(), returning zero vector")
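
The NB1 normalization described above might look like this sketch; the field names match the PR (task_type/query_param/document_param), but the helper name and dict-based shape are hypothetical, since the PR does this inside sync_provider_backend:

```python
# Illustrative sketch of the auto-uppercase normalization described above.
# The helper name normalize_gemini_fields is hypothetical.
def normalize_gemini_fields(data: dict) -> dict:
    for key in ("task_type", "query_param", "document_param"):
        value = data.get(key)
        if isinstance(value, str):
            # Users may write any case in config; validation expects uppercase.
            data[key] = value.upper()
    return data

cfg = normalize_gemini_fields({"provider": "gemini", "task_type": "retrieval_document"})
print(cfg["task_type"])  # RETRIEVAL_DOCUMENT
```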

@ZaynJarvis — #702 pattern + #718

Verification

ruff check: All checks passed!
pytest: 59 passed (41 embedder + 18 config, incl. 1 new normalization test)
black --check: 1 file would be left unchanged

@chethanuk force-pushed the feat/gemini-landing branch 2 times, most recently from 1586f65 to 893a1da on March 19, 2026 13:00

chethanuk commented Mar 19, 2026

@qin-ctx Can we merge? Main keeps changing, and tests are working now.


ZaynJarvis commented Mar 20, 2026

for query and document embedding

```python
    def get_query_embedder(self):
        """Get embedder instance for query embeddings."""
        return self._get_contextual_embedder("query")

    def get_document_embedder(self):
        """Get embedder instance for document/passage embeddings."""
        return self._get_contextual_embedder("document")

    def _get_contextual_embedder(self, context: str):
        if not self.dense:
            return self.get_embedder()

        provider = (self.dense.provider or "").lower()
        if provider == "openai":
            non_symmetric = (
                self.dense.query_param is not None or self.dense.document_param is not None
            )
            effective_context = context if non_symmetric else None
            return self._create_embedder(provider, "dense", self.dense, context=effective_context)

        if provider == "jina":
            return self._create_embedder(provider, "dense", self.dense, context=context)

        if provider == "gemini":
            non_symmetric = (
                self.dense.query_param is not None or self.dense.document_param is not None
            )
            effective_context = context if non_symmetric else None
            return self._create_embedder(provider, "dense", self.dense, context=effective_context)

        return self.get_embedder()
```

this was removed on purpose in #702 and replaced by

```python
@abstractmethod
def embed(self, text: str, is_query: bool = False)
```

please remove the _get_contextual_embedder etc. and use is_query

  1. [note, no need to change] After investigating: Gemini Embedding 1 supports taskType, while Gemini Embedding 2 ignores taskType in the request.

For Gemini Embedding 2, the recommended way to convey task type is through a prompt instruction, e.g. embedding prompt: "taskType: {taskType}, content: {content}"


Others LGTM. If this PR makes change 1, I will merge and close #718.
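
The prompt-instruction approach suggested in the note above could be sketched as follows; the wrapper function is hypothetical, only the prompt template comes from the comment:

```python
# Hypothetical wrapper implementing the suggested prompt instruction for
# Gemini Embedding 2, which ignores the taskType request field. The template
# is the one quoted above: "taskType: {taskType}, content: {content}"
def with_task_instruction(task_type: str, content: str) -> str:
    return f"taskType: {task_type}, content: {content}"

print(with_task_instruction("RETRIEVAL_QUERY", "what is MRL truncation?"))
# taskType: RETRIEVAL_QUERY, content: what is MRL truncation?
```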

…ngine#702 pattern

- GeminiDenseEmbedder: accept query_param/document_param, use is_query
  in embed() and embed_batch() to select task_type at call time
- EmbeddingConfig: add Gemini provider, factory, validation, dimension
- No get_query_embedder/get_document_embedder/_get_contextual_embedder
  (removed in volcengine#702; embed(is_query=True/False) is the pattern)
- Tests use embed(text, is_query=True/False) pattern throughout
- Rebased onto current upstream/main
@chethanuk force-pushed the feat/gemini-landing branch from 98f7ae4 to fba32cf on March 20, 2026 09:12
@chethanuk (Author)

@ZaynJarvis Fixed per #702:

  • Removed get_query_embedder() / get_document_embedder() / _get_contextual_embedder() + context param — none of these exist in this PR
  • GeminiDenseEmbedder now accepts query_param / document_param in constructor
  • embed(text, is_query=True) selects query_param; is_query=False selects document_param — same pattern as OpenAI._build_extra_body(is_query) and Jina._build_extra_body(is_query)
  • Rebased onto current main (clean single commit on top of upstream/main)
  • Tests: 58 unit tests pass, integration tests skip cleanly without API key
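
The is_query routing described above might look like this minimal sketch; the class name follows the PR, but the body is illustrative, not the actual implementation:

```python
# Minimal sketch of the embed(text, is_query=...) routing described above.
# Field defaults are assumed; the real values come from config.
class GeminiDenseEmbedderSketch:
    def __init__(self, query_param: str = "RETRIEVAL_QUERY",
                 document_param: str = "RETRIEVAL_DOCUMENT"):
        self.query_param = query_param
        self.document_param = document_param

    def _task_type(self, is_query: bool) -> str:
        # is_query=True selects query_param; False selects document_param,
        # mirroring the OpenAI/Jina _build_extra_body(is_query) pattern.
        return self.query_param if is_query else self.document_param

emb = GeminiDenseEmbedderSketch()
print(emb._task_type(is_query=True), emb._task_type(is_query=False))
```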


ZaynJarvis commented Mar 20, 2026

Please fix the workflow.

"QUESTION_ANSWERING, FACT_VERIFICATION, CODE_RETRIEVAL_QUERY. "
"For non-symmetric mode set query_param/document_param instead."
),
)

I don't think task_type is needed; query_param and document_param are supposed to contain the task type.


Others LGTM.

@chethanuk (Author)

Done — removed task_type field from EmbeddingModelConfig. query_param/document_param handle task type routing.

…r CI

- Remove task_type from EmbeddingModelConfig (query_param/document_param suffice)
- Wrap GeminiDenseEmbedder import in try/except (google-genai is optional)
- Update tests for removed field
@chethanuk (Author)

Addressed round 2 feedback:

  1. Removed task_type from EmbeddingModelConfig — query_param/document_param already contain the task type.
  2. Fixed CI — wrapped GeminiDenseEmbedder import in try/except ImportError since google-genai is an optional dependency.
  3. Updated tests accordingly (removed 4 task_type-specific config tests).

All unit tests pass locally (54/54 Gemini tests green).

@ZaynJarvis (Collaborator) left a comment
lgtm

@ZaynJarvis ZaynJarvis merged commit 33f1a8f into volcengine:main Mar 20, 2026
31 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 20, 2026
@ZaynJarvis (Collaborator)

Docs might need to be updated for Gemini usage; the google-genai package needs to be installed. Consider a follow-up PR for this. Thanks.

chethanuk added a commit to chethanuk/OpenViking that referenced this pull request Mar 20, 2026
Documents GeminiDenseEmbedder (shipped in volcengine#751) in user-facing docs:
- docs/en/guides/01-configuration.md: add gemini to supported providers
  list, parameter table, and a full provider example with models, MRL
  dimensions, task types, and non-symmetric query/document routing
- README.md: update provider note to include gemini and all other
  supported providers; add Example 3 config block for Gemini embedding

Closes follow-up requested in volcengine#751 review (comment 4097971909).