Support more embedding function platforms by hnwyllmm · Pull Request #132 · oceanbase/pyseekdb

hnwyllmm · 2026-01-22T08:12:07Z

Summary

close #127

Pull request overview

This pull request adds support for 8 new embedding function platforms to pyseekdb: Ollama, Jina AI, Voyage AI, Tencent Hunyuan, SiliconFlow, Google Vertex AI, Cohere, and Amazon Bedrock. The changes enable users to leverage a wider variety of embedding models through a consistent interface.

Changes:

Added 8 new embedding function implementations with comprehensive documentation and examples
Renamed LiteLLMEmbeddingFunction to LiteLLMBaseEmbeddingFunction to better reflect its role as a base class
Updated the embedding function registry to automatically register all new platforms
Added test files for Ollama and Jina embedding functions

Summary by CodeRabbit

New Features
- Added support for many new embedding providers (Amazon Bedrock, Cohere, Google Vertex, Jina, Ollama, SiliconFlow, Tencent Hunyuan, Voyage AI) and exposed them through the embedding registry.
Tests
- Large suite of new unit tests covering each new provider, configuration persistence, and round‑trip scenarios.
Chores
- CI now runs integration tests alongside unit tests.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2026-01-22T08:12:17Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds eight new embedding-provider implementations, exposes them via the embedding-function registry, renames the LiteLLM base class, updates registry auto-registration to log ImportError warnings, expands CI to install additional test deps and run integration tests, and adds many environment-gated unit tests plus an integration-fixture refactor.

Changes

Cohort / File(s)	Summary
CI & Test Orchestration `.github/workflows/ci.yml`	Adds `boto3`, `litellm`, `voyageai` to unit-test install and adds an integration-test step (no test mode) that runs tests with `-k` filtering.
Embedding registry & client API `src/pyseekdb/client/embedding_function.py`	Changes `DefaultEmbeddingFunction.build_from_config` signature to `(_config: dict[str, Any]) -> Self`; expands auto-registration to attempt importing many optional providers and logs warnings on ImportError.
Library exports `src/pyseekdb/utils/embedding_functions/__init__.py`	Replaces `LiteLLMEmbeddingFunction` export with `LiteLLMBaseEmbeddingFunction`; adds exports for AmazonBedrock, Cohere, GoogleVertex, Jina, Ollama, Siliconflow, TencentHunyuan, Voyageai, and related base classes.
LiteLLM base refactor `src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py`	Renames `LiteLLMEmbeddingFunction` → `LiteLLMBaseEmbeddingFunction` and updates docstrings and a length-check to compare embeddings vs. `documents`.
New embedding providers `src/pyseekdb/utils/embedding_functions/*.py` `amazon_bedrock_embedding_function.py`, `cohere_embedding_function.py`, `google_vertex_embedding_function.py`, `jina_embedding_function.py`, `ollama_embedding_function.py`, `siliconflow_embedding_function.py`, `tencent_hunyuan_embedding_function.py`, `voyageai_embedding_function.py`	Adds eight provider-specific embedding classes implementing constructor/model/api-key handling, lazy/cached `dimension`, `__call__`, `name()`, `get_config()`, and `build_from_config()` plus provider-specific dimension maps and validation.
Docstring cleanups `src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py`, `openai_embedding_function.py`, `qwen_embedding_function.py`, `sentence_transformer_embedding_function.py`	Removes or tightens several method docstrings and examples; no behavioral/API signature changes.
Integration test fixture refactor `tests/integration_tests/test_detect_db_type_and_version.py`	Replaces `server_client`/`oceanbase_client` fixtures with indirect-parametrized `db_client` and updates test function signatures and assertions.
New unit tests `tests/unit_tests/*.py` `test_jina_embedding_function.py`, `test_ollama_embedding_function.py`, `test_amazon_bedrock_embedding_function.py`, `test_cohere_embedding_function.py`, `test_google_vertex_embedding_function.py`, `test_siliconflow_embedding_function.py`, `test_tencent_hunyuan_embedding_function.py`, `test_voyageai_embedding_function.py`	Adds comprehensive, environment-gated unit test suites for each new provider: initialization variants, dimension probing, single/multi/empty embedding calls, error paths, and config persistence round-trips.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Registry as Registry
  participant EF as EmbeddingFunction
  participant API as ExternalAPI

  Client->>Registry: request embedding function by name
  Registry-->>Client: returns EmbeddingFunction instance
  Client->>EF: __call__(documents)
  EF->>API: call provider API (model, api_key, params)
  API-->>EF: embeddings response
  EF->>EF: validate and cache dimension
  EF-->>Client: return embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

support embedding function metadata persistence #119: Overlaps on adding and registering many embedding-function implementations and registry initialization changes.
support embedding function persistence #114: Related to adding persistence APIs (name(), get_config(), build_from_config) used by provider implementations and tests.
Support more embedding function platform #110: Touches the embedding-function client and registry changes similar to this PR.

Poem

🐰 I hopped through code to add new friends,
Bedrock, Cohere, Vertex—hops without end.
Jina, Ollama, Silicon in tow,
Hunyuan and Voyage joining the row.
Tests and registry sing—hop, hop, let's go!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Support more embedding function platforms' clearly describes the main change: adding support for multiple new embedding function platforms (Ollama, Jina, Voyage AI, Tencent Hunyuan, SiliconFlow, Google Vertex, Cohere, Amazon Bedrock).
Linked Issues check	✅ Passed	The PR fully implements the requirement from issue `#127` to support SiliconFlow, Tencent Hunyuan, and Cohere embedding, and exceeds expectations by also adding Ollama, Jina AI, Voyage AI, Google Vertex AI, and Amazon Bedrock implementations with proper registry integration and tests.
Out of Scope Changes check	✅ Passed	All changes are directly related to adding embedding function platform support: new embedding function implementations, registry updates, tests, and refactoring LiteLLMEmbeddingFunction to LiteLLMBaseEmbeddingFunction for architectural clarity.
Docstring Coverage	✅ Passed	Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This pull request adds support for 8 new embedding function platforms to pyseekdb: Ollama, Jina AI, Voyage AI, Tencent Hunyuan, SiliconFlow, Google Vertex AI, Cohere, and Amazon Bedrock. The changes enable users to leverage a wider variety of embedding models through a consistent interface.

Changes:

Added 8 new embedding function implementations with comprehensive documentation and examples
Renamed LiteLLMEmbeddingFunction to LiteLLMBaseEmbeddingFunction to better reflect its role as a base class
Updated the embedding function registry to automatically register all new platforms
Added test files for Ollama and Jina embedding functions

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 16 comments.

Show a summary per file

File	Description
src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py	New embedding function for local Ollama models using OpenAI-compatible API
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py	New embedding function for Jina AI models using LiteLLM base
src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py	New embedding function for Voyage AI models with native voyageai package
src/pyseekdb/utils/embedding_functions/tengxun_hunyuan_embedding_function.py	New embedding function for Tencent Hunyuan models using OpenAI-compatible API
src/pyseekdb/utils/embedding_functions/siliconflow_embedding_function.py	New embedding function for SiliconFlow models using OpenAI-compatible API
src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py	New embedding function for Google Vertex AI models using LiteLLM base
src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py	New embedding function for Cohere models using LiteLLM base
src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py	New embedding function for Amazon Bedrock models using boto3
src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py	Renamed class from LiteLLMEmbeddingFunction to LiteLLMBaseEmbeddingFunction for clarity
src/pyseekdb/utils/embedding_functions/init.py	Updated exports to include all new embedding functions
src/pyseekdb/client/embedding_function.py	Updated registry initialization and improved type hints
tests/unit_tests/test_ollama_embedding_function.py	Comprehensive test suite for Ollama embedding function
tests/unit_tests/test_jina_embedding_function.py	Comprehensive test suite for Jina embedding function

Comments suppressed due to low confidence (1)

src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py:60

The docstring on line 60 says "Initialize." but should be more descriptive, like "Initialize LiteLLMBaseEmbeddingFunction." to match the naming convention used in other embedding functions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tests/integration_tests/test_detect_db_type_and_version.py (1)
45-61: Scope db_client fixture per test to avoid cross‑mode failures.

db_client is parameterized for embedded/server/oceanbase, but these tests assert a single db_type. They’ll fail in mismatched modes (e.g., seekdb test running on oceanbase). Scope each test to the intended mode(s).
✅ Suggested fix (scope fixture per test)
+    `@pytest.mark.parametrize`("db_client", ["server"], indirect=True)
     def test_seekdb_type_detection(self, db_client):
         """Test: detect seekdb Server type and version"""
         # Verify client type
         assert db_client is not None
         assert hasattr(db_client, "_server")
         assert isinstance(db_client._server, pyseekdb.RemoteServerClient)
@@
+    `@pytest.mark.parametrize`("db_client", ["oceanbase"], indirect=True)
     def test_ob_type_detection(self, db_client):
         """Test: detect OceanBase Server type and version"""
         # Verify client type
         assert db_client is not None
         assert hasattr(db_client, "_server")
         assert isinstance(db_client._server, pyseekdb.RemoteServerClient)
Also applies to: 68-85
src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py (3)
171-172: Critical bug: input is undefined; should be documents.

The validation at lines 171-172 references input, which is not defined in this scope. The local variable holding the documents is documents.
🐛 Proposed fix
         # Validate that we got the expected number of embeddings
-        if len(embeddings) != len(input):
-            raise ValueError(f"Expected {len(input)} embeddings but got {len(embeddings)} from LiteLLM")
+        if len(embeddings) != len(documents):
+            raise ValueError(f"Expected {len(documents)} embeddings but got {len(embeddings)} from LiteLLM")
19-19: Docstring references outdated class name LiteLLMEmbeddingFunction.

The class was renamed to LiteLLMBaseEmbeddingFunction, but the example in the docstring still imports the old name.
📝 Proposed fix
-        from pyseekdb.utils.embedding_functions import LiteLLMEmbeddingFunction
+        from pyseekdb.utils.embedding_functions import LiteLLMBaseEmbeddingFunction
Also update subsequent usages in the docstring examples (lines 23, 30, 36) to use LiteLLMBaseEmbeddingFunction.
123-125: Error message references outdated class name.

The error message at line 124 still refers to LiteLLMEmbeddingFunction instead of the renamed LiteLLMBaseEmbeddingFunction.
📝 Proposed fix
             raise ValueError(
                 f"API key environment variable '{self.api_key_env}' is not set. "
-                f"Please set it before using LiteLLMEmbeddingFunction."
+                f"Please set it before using {self.__class__.__name__}."
             )
Using self.__class__.__name__ makes the error message accurate for both the base class and any subclasses.

🤖 Fix all issues with AI agents

In `@src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py`:
- Around line 270-271: The code passes session_args (from
config.get("session_args")) directly into boto3.Session which can include keys
with None values and cause unexpected behavior; filter out any entries where the
value is None before constructing the session (e.g., build a cleaned dict from
session_args removing None values) and then call
boto3.Session(**cleaned_session_args) if cleaned_session_args is non-empty,
otherwise call boto3.Session() — update the logic around the
session_args/session creation in amazon_bedrock_embedding_function.py to use the
cleaned dictionary.

In `@tests/unit_tests/test_jina_embedding_function.py`:
- Around line 83-95: The test test_initialization_with_custom_api_key_env leaks
the CUSTOM_JINA_KEY env var; wrap setting of the env var in an env guard (or use
pytest's monkeypatch) so the environment is restored after the test and no
global state persists; specifically, when creating the
JinaEmbeddingFunction(model_name="jina-embeddings-v3",
api_key_env=custom_key_env) ensure you set the env var via an
env_guard/monkeypatch helper and remove/restore it in teardown so ef.api_key_env
can be asserted without leaving CUSTOM_JINA_KEY in os.environ.

In `@tests/unit_tests/test_ollama_embedding_function.py`:
- Around line 429-440: The test currently asserts that
OllamaEmbeddingFunction.build_from_config raises ValueError for a non-dict
client_kwargs, but the implementation raises TypeError; update the test to
expect TypeError instead of ValueError (change the pytest.raises call in
test_build_from_config_invalid_client_kwargs to match "TypeError" or the string
message from the raised TypeError) so it aligns with the behavior of
OllamaEmbeddingFunction.build_from_config.
- Around line 108-121: The test_initialization_with_custom_api_key_env sets
CUSTOM_OLLAMA_KEY but doesn't restore or remove it, risking cross-test
pollution; update this test to mirror test_initialization_without_api_key by
using env_guard (or a try/finally) around setting custom_key_env so the original
environment value is restored or the variable removed after the test, ensuring
test_ollama_env and the OllamaEmbeddingFunction(model_name="nomic-embed-text",
api_key_env=custom_key_env) assertion remain unchanged.

♻️ Duplicate comments (8)

src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (2)
110-112: kwargs validation may be too restrictive for boto3 Config objects.

This validation rejects non-primitive types, but boto3 users may want to pass botocore.config.Config objects via kwargs. This was previously flagged.

255-257: Error message references wrong parameter name.

The error message says "kwargs must be a dictionary" but the config key is "client_kwargs".
📝 Proposed fix
         kwargs = config.get("client_kwargs", {})
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (2)
84-85: Docstring default value is inconsistent with actual default.

The docstring states the default is "jina-embeddings-v2-base-en" but the actual default parameter on line 77 is "jina-embeddings-v3".
📝 Proposed fix
         Args:
             model_name (str, optional): Name of the Jina AI embedding model.
-                Defaults to "jina-embeddings-v2-base-en".
+                Defaults to "jina-embeddings-v3".
212-214: Error message references wrong parameter name.

The error message says "kwargs must be a dictionary" but the config key being validated is "client_kwargs".
📝 Proposed fix
         kwargs = config.get("client_kwargs", {})
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/tengxun_hunyuan_embedding_function.py (1)
81-89: Redundant assignment of dimensions variable.

The dimensions = None assignment on line 89 is never used since line 104 always passes dimensions=None explicitly. While harmless, it can be removed for clarity.
Proposed fix
         # Warn if dimensions parameter is provided since it's not supported
         if dimensions is not None:
             warnings.warn(
                 "The dimensions parameter is not supported by Tencent Hunyuan API. "
                 "Dimensions are fixed at 1024. The provided dimensions parameter will be ignored.",
                 UserWarning,
                 stacklevel=2,
             )
-            dimensions = None
src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (1)
242-243: Error message should reference client_kwargs for consistency.

The error message says "kwargs" but the config parameter is named "client_kwargs".
Proposed fix
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
216-217: Error message should reference client_kwargs for consistency.

Same issue as in other embedding functions - the error message says "kwargs" but the config parameter is "client_kwargs".
Proposed fix
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
275-276: Error message should reference client_kwargs for consistency.

Same issue as in other embedding functions.
Proposed fix
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")

🧹 Nitpick comments (6)

.github/workflows/ci.yml (1)
69-70: Fragile failure detection pattern.

The tail -n 1 | grep -q 'failed' approach can produce false negatives if:

pytest crashes before producing summary output

The summary format changes or doesn't end with "failed" on the last line

Tests error (vs fail) may not contain "failed" in the last line

A more robust approach is to capture pytest's exit code directly:
♻️ Suggested improvement
-          uv run pytest tests/integration_tests/ -v --log-cli-level=${log_level} -k "not server and not embedded and not oceanbase" | tee pytest.log
-          tail -n 1 pytest.log | grep -q 'failed' && exit 1 || exit 0
+          uv run pytest tests/integration_tests/ -v --log-cli-level=${log_level} -k "not server and not embedded and not oceanbase" 2>&1 | tee pytest.log
+          exit ${PIPESTATUS[0]}
Note: This is an existing pattern throughout the workflow, so this is a low-priority suggestion that could be addressed separately across all steps.
src/pyseekdb/client/embedding_function.py (1)
517-519: Self return type on static method is unconventional.

Using Self as the return type for a @staticmethod is semantically incorrect. Self is designed for instance methods and @classmethod where it refers to the type of the current class instance. For a static method, it doesn't have access to the class context to determine Self.

Since DefaultEmbeddingFunction.build_from_config always returns DefaultEmbeddingFunction(), the return type should be the concrete class.
📝 Suggested fix
     `@staticmethod`
-    def build_from_config(_config: dict[str, Any]) -> Self:
+    def build_from_config(_config: dict[str, Any]) -> "DefaultEmbeddingFunction":
         return DefaultEmbeddingFunction()
src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (1)

186-209: Sequential API calls may be slow for large document batches.

Each document triggers a separate API call. While this matches Bedrock's Titan embedding API constraints (single input per call), consider adding a note in the docstring about potential latency for large batches, or consider batching where supported by newer models.

tests/unit_tests/test_jina_embedding_function.py (1)

107-119: Consider using env_guard for consistency.

This test correctly uses try/finally for cleanup, but using env_guard(JINA_AI_API_KEY=None) would be more consistent with the persistence tests and cleaner.
tests/unit_tests/test_ollama_embedding_function.py (2)
54-55: Consider being more explicit about caught exceptions.

While catching all exceptions is acceptable for availability checks, consider catching more specific exceptions (e.g., openai.APIConnectionError, ConnectionError) or adding a brief comment explaining the intent. This helps distinguish intentional broad catching from accidental oversight.
💡 Optional: Add comment or narrow exception types
-    except Exception:
+    except Exception:  # Any failure means Ollama is not available
         return False
Or more specific:
-    except Exception:
+    except (ConnectionError, openai.APIConnectionError, openai.APIError):
         return False
67-77: Redundant environment checks in each test method.

The class-level @pytest.mark.skipif(not is_ollama_available(), ...) decorator already ensures tests only run when Ollama is available. Calling self.test_ollama_env() at the start of each test method is redundant and adds unnecessary overhead.

Consider removing these calls from individual test methods, keeping only test_ollama_env as a standalone sanity check if desired.

coderabbitai

Actionable comments posted: 12

🤖 Fix all issues with AI agents

In `@tests/unit_tests/test_amazon_bedrock_embedding_function.py`:
- Around line 220-236: The test test_dimension_property_unknown_model
incorrectly uses a known model ("amazon.titan-embed-text-v2") that's present in
the _AMAZON_BEDROCK_MODEL_DIMENSIONS map so
AmazonBedrockEmbeddingFunction.dimension returns from the known-map path instead
of exercising the fallback probe; either rename the test to
test_dimension_property_known_model to reflect current behavior or change the
model_name in that test to a truly unknown model ID (one not present in
_AMAZON_BEDROCK_MODEL_DIMENSIONS) so the
AmazonBedrockEmbeddingFunction.dimension property executes the probing/API call
path.
- Around line 41-52: The code currently catches a broad Exception in
are_aws_credentials_available() (and the profile-name test) which can hide real
failures; narrow the except blocks to handle specific failures by importing
botocore.exceptions and catching ImportError for missing boto3,
botocore.exceptions.NoCredentialsError for missing credentials,
botocore.exceptions.ProfileNotFound for invalid profiles, and
botocore.exceptions.BotoCoreError as a final botocore-specific fallback, leaving
other exceptions to propagate; update the exception handlers in the
are_aws_credentials_available() function and the profile-name test to use these
specific exception types and ensure boto3 import errors are handled with
ImportError.

In `@tests/unit_tests/test_cohere_embedding_function.py`:
- Around line 134-146: The test restores COHERE_API_KEY only when original_key
is truthy, which fails to restore an original empty-string value; update the
teardown in test_initialization_missing_api_key to restore the environment
variable unconditionally by checking for presence with "is not None" and setting
os.environ["COHERE_API_KEY"] = original_key when original_key is not None so the
environment is exactly as it was before the test (reference:
CohereEmbeddingFunction and the test_initialization_missing_api_key function).
- Around line 212-224: The test test_dimension_property_unknown_model is using a
model present in the cached known-dimension map so the probe logic is bypassed;
update the test to remove the cached entry for "embed-english-v3.0" from the
CohereEmbeddingFunction known-dimensions cache (e.g., pop or delete the map
entry on CohereEmbeddingFunction) before creating the instance so ef.dimension
triggers the fallback API probe path; reference the CohereEmbeddingFunction
class and the cache (the known-dimension map used by that class) to locate where
to remove the cached model entry.

In `@tests/unit_tests/test_google_vertex_embedding_function.py`:
- Around line 255-364: Tests expect a "region" key and AssertionError on missing
fields, but the implementation of GoogleVertexEmbeddingFunction uses
get_config() returning "location" and build_from_config() only requires
"project_id" while defaulting other fields; update the tests to match the
implementation by (1) asserting config["location"] (not "region") in
test_get_config_with_defaults and test_get_config_with_custom_values and
asserting restored_ef.location (or check region mapping if present) in
test_build_from_config_with_defaults/custom, and (2) remove or change the tests
test_build_from_config_missing_api_key_env,
test_build_from_config_missing_model_name,
test_build_from_config_missing_project_id, and
test_build_from_config_missing_region so they no longer expect AssertionError
but instead verify the actual behavior of build_from_config() (e.g., that
missing optional fields are defaulted and only missing project_id raises if
that's the real required field); update references to get_config and
build_from_config in those tests accordingly.
- Around line 50-138: Tests instantiate GoogleVertexEmbeddingFunction() with
wrong expectations: the constructor requires project_id, uses "location" not
"region", and model_name is prefixed with "vertex_ai/". Update the tests
(functions like test_initialization_with_defaults,
test_initialization_with_custom_region, test_initialization_with_all_parameters,
etc.) to pass a valid project_id when constructing
GoogleVertexEmbeddingFunction, assert on ef.location instead of ef.region, and
either assert the prefixed model_name ("vertex_ai/textembedding-gecko" or the
exact provided variant) or call ef.get_config() to compare the base model name;
also update any api_key_env expectations if the implementation uses a different
default name.
- Around line 34-48: The tests call GoogleVertexEmbeddingFunction() without the
required project_id and also assume an API key env var is mandatory; update the
tests to pass a valid project_id to every GoogleVertexEmbeddingFunction(...)
instantiation and change the pytest.mark.skipif guard (which currently checks
CHROMA_GOOGLE_VERTEX_API_KEY and is_vertexai_available()) so the API key env var
is optional when ADC is available (i.e., only skip when is_vertexai_available()
is false and no ADC can be detected), and if you intended API key validation
enforce it in GoogleVertexEmbeddingFunction.__init__ (or adjust
LiteLLMBaseEmbeddingFunction.__init__ behavior) so that missing api_key_env
raises the expected error rather than causing a TypeError from missing
project_id.

In `@tests/unit_tests/test_siliconflow_embedding_function.py`:
- Around line 199-211: The test test_dimension_property_unknown_model currently
uses a known model ("BAAI/bge-large-zh-v1.5") so
SiliconflowEmbeddingFunction.dimension reads from _SILICONFLOW_MODEL_DIMENSIONS
instead of exercising the API fallback; update the test to use a model name not
present in _SILICONFLOW_MODEL_DIMENSIONS (e.g., "unknown-model-xyz") or mock the
dimensions lookup so that SiliconflowEmbeddingFunction (class) performs the API
call path, ensuring the assertion still checks dim > 0 and the printed message
reflects an API-derived dimension.
- Around line 104-114: The test test_initialization_with_custom_api_base
incorrectly uses the default URL as the "custom" API base, so it won't catch an
implementation that ignores the api_base argument; update the test to pass a
clearly different URL (e.g., "https://custom.example/v1") to
SiliconflowEmbeddingFunction (constructed as ef =
SiliconflowEmbeddingFunction(..., api_base=custom_base)) and assert ef.api_base
equals that distinct custom_base string to ensure the api_base parameter is
honored by the constructor.

In `@tests/unit_tests/test_tencent_hunyuan_embedding_function.py`:
- Around line 67-79: The test_initialization_with_custom_api_key_env sets an
environment variable custom_key_env ("CUSTOM_HUNYUAN_KEY") without cleaning it
up, which can leak into other tests; modify the test to use the existing
env_guard fixture/context to set CUSTOM_HUNYUAN_KEY (or wrap the os.environ
assignment) so the variable is automatically restored/cleared after the test,
and retain the rest of the assertions that check ef.api_key_env.

In `@tests/unit_tests/test_voyageai_embedding_function.py`:
- Around line 190-198: The test_initialization_without_voyageai method is never
executed due to a class-level skipif that skips the whole test class when
voyageai is unavailable while the test itself skips when voyageai is present;
extract test_initialization_without_voyageai out of the skipped test class and
place it at module level with an inverse skip decorator (e.g.,
skipif(is_voyageai_available(), reason="voyageai is available")) so it only runs
when voyageai is not installed, remove the in-method pytest.skip check, and
ensure it uses the existing is_voyageai_available() helper and does not depend
on VOYAGE_API_KEY.
- Around line 241-253: The test test_dimension_property_unknown_model currently
uses "voyage-4-large" which is present in the known-dimensions map so the
API-probing branch isn't exercised; modify the test to accept the pytest
monkeypatch fixture, remove the entry for "voyage-4-large" from
VoyageaiEmbeddingFunction.known_dimensions (or the specific cache dict) using
monkeypatch.delitem or monkeypatch.setitem to ensure the lookup falls through to
the API-probing path, then instantiate
VoyageaiEmbeddingFunction(model_name="voyage-4-large") and access ef.dimension
to trigger the fallback API call and assert dim > 0.

🧹 Nitpick comments (5)

tests/unit_tests/test_tencent_hunyuan_embedding_function.py (2)
49-66: Redundant environment checks inside test methods.

Since the class is decorated with @pytest.mark.skipif that already checks both conditions, the self.test_hunyuan_env() calls inside each test method (lines 54, 71, 85, 97, etc.) are redundant. The class will only run if conditions are already met.
♻️ Suggested simplification

Remove the self.test_hunyuan_env() calls from individual test methods. The class-level skipif decorator already guarantees the prerequisites:
     def test_initialization_with_defaults(self):
         """Test TencentHunyuanEmbeddingFunction initialization with default values"""
         print("\n✅ Testing TencentHunyuanEmbeddingFunction initialization with defaults")
 
-        # Check if openai is available and env vars are set
-        self.test_hunyuan_env()
-
         ef = TencentHunyuanEmbeddingFunction()
Apply similar changes to other test methods that call self.test_hunyuan_env().
169-173: Consider suppressing expected warning.

Line 170 creates an instance with dimensions=512, which triggers a warning (as verified in test_initialization_with_dimensions_warning). Suppress it here to reduce test noise, consistent with test_get_config_with_dimensions (lines 316-318).
♻️ Suggested fix
         # Test that dimension is always 1024 regardless of initialization
-        ef2 = TencentHunyuanEmbeddingFunction(dimensions=512)  # Should warn but dimension still 1024
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            ef2 = TencentHunyuanEmbeddingFunction(dimensions=512)  # Should warn but dimension still 1024
         dim2 = ef2.dimension
tests/unit_tests/test_siliconflow_embedding_function.py (2)
42-46: Redundant environment checks within test methods.

Since the class is decorated with @pytest.mark.skipif that already verifies SILICONFLOW_API_KEY and is_openai_available(), the repeated calls to self.test_siliconflow_env() at the start of each test method (lines 53, 70, 94, 108, 120, 131, 155, 203, etc.) are redundant. These conditions are guaranteed to be true when any test in this class executes.

Consider removing the redundant self.test_siliconflow_env() calls from individual test methods to reduce test execution overhead.

90-102: Environment variable not cleaned up after test.

The custom environment variable set at line 98 is never cleaned up. If this test or subsequent tests fail, CUSTOM_SILICONFLOW_KEY remains in the environment, potentially affecting test isolation.

Consider using the env_guard context manager (already imported from test_utils) for consistency with the persistence tests.
♻️ Proposed fix
     def test_initialization_with_custom_api_key_env(self):
         """Test SiliconflowEmbeddingFunction initialization with custom API key env"""
         print("\n✅ Testing SiliconflowEmbeddingFunction initialization with custom API key env")

-        self.test_siliconflow_env()
-
         custom_key_env = "CUSTOM_SILICONFLOW_KEY"
-        if not os.environ.get(custom_key_env):
-            os.environ[custom_key_env] = os.environ.get("SILICONFLOW_API_KEY", "your-custom-key")
-
-        ef = SiliconflowEmbeddingFunction(model_name="BAAI/bge-large-zh-v1.5", api_key_env=custom_key_env)
-        assert ef.api_key_env == custom_key_env
-        print(f"   Custom API key env: {ef.api_key_env}")
+        with env_guard(**{custom_key_env: os.environ.get("SILICONFLOW_API_KEY", "test-key")}):
+            ef = SiliconflowEmbeddingFunction(model_name="BAAI/bge-large-zh-v1.5", api_key_env=custom_key_env)
+            assert ef.api_key_env == custom_key_env
+            print(f"   Custom API key env: {ef.api_key_env}")
tests/unit_tests/test_voyageai_embedding_function.py (1)
102-108: Avoid leaking CUSTOM_VOYAGE_KEY into the process environment.
Use env_guard (or restore the original value) so this test doesn’t affect later tests.
♻️ Proposed refactor
-        custom_key_env = "CUSTOM_VOYAGE_KEY"
-        if not os.environ.get(custom_key_env):
-            os.environ[custom_key_env] = os.environ.get("VOYAGE_API_KEY", "your-custom-key")
-
-        ef = VoyageaiEmbeddingFunction(model_name="voyage-4-large", api_key_env=custom_key_env)
-        assert ef.api_key_env == custom_key_env
-        print(f"   Custom API key env: {ef.api_key_env}")
+        custom_key_env = "CUSTOM_VOYAGE_KEY"
+        with env_guard(**{custom_key_env: os.environ.get("VOYAGE_API_KEY", "your-custom-key")}):
+            ef = VoyageaiEmbeddingFunction(model_name="voyage-4-large", api_key_env=custom_key_env)
+            assert ef.api_key_env == custom_key_env
+            print(f"   Custom API key env: {ef.api_key_env}")

Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated 9 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py (1)
17-19: Docstring example still references old class name.

The example code in the docstring imports LiteLLMEmbeddingFunction, but the class has been renamed to LiteLLMBaseEmbeddingFunction.
📝 Suggested fix
-        from pyseekdb.utils.embedding_functions import LiteLLMEmbeddingFunction
+        from pyseekdb.utils.embedding_functions import LiteLLMBaseEmbeddingFunction
Also update the instantiation examples below (lines 23, 30, 36) to use LiteLLMBaseEmbeddingFunction.

🤖 Fix all issues with AI agents

In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py`:
- Line 45: The default environment variable name for the API key is incorrectly
set to "CHROMA_GOOGLE_VERTEX_API_KEY"; update the default value of the
api_key_env parameter in the GoogleVertexEmbeddingFunction (the api_key_env
parameter in google_vertex_embedding_function.py) to a project-appropriate name
such as "GOOGLE_VERTEX_API_KEY" or "PYSEEKDB_GOOGLE_VERTEX_API_KEY" and ensure
any related documentation or default usage in that module is adjusted
accordingly.
- Around line 107-122: The build_from_config method in
GoogleVertexEmbeddingFunction is stricter than the constructor: instead of
raising ValueError when api_key_env, model_name, project_id, or region are
missing, use the constructor defaults for any missing values; update
build_from_config to extract those keys from config and pass them (even if None)
into GoogleVertexEmbeddingFunction(api_key_env=..., model_name=...,
project_id=..., region=...) so it behaves consistently with the __init__
defaults and can reconstruct configs that rely on defaults.

♻️ Duplicate comments (4)

src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (1)
265-266: Passing None values in session_args to boto3.Session() may cause issues.

When session_args contains None values (e.g., {"region_name": None, "profile_name": None}), passing them to boto3.Session() may cause unexpected behavior compared to not passing them at all.
📝 Suggested fix (from past review)
         session_args = config.get("session_args")
-        session = boto3.Session(**session_args) if session_args else boto3.Session()
+        if session_args:
+            # Filter out None values to avoid passing them to boto3.Session
+            filtered_args = {k: v for k, v in session_args.items() if v is not None}
+            session = boto3.Session(**filtered_args) if filtered_args else boto3.Session()
+        else:
+            session = boto3.Session()
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (1)
196-199: Error message references "kwargs" but config key is "client_kwargs".

For consistency with the config structure and better debugging experience, the error message should reference the actual config key name.
📝 Suggested fix
         kwargs = config.get("client_kwargs", {})
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
206-219: Error message references wrong parameter name.

The error message on line 212 says "kwargs must be a dictionary" but the config key being validated is client_kwargs. This inconsistency could confuse users debugging configuration issues.
🐛 Proposed fix
         kwargs = config.get("client_kwargs", {})
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
259-278: Error message references wrong parameter name.

Similar to the Cohere implementation, the error message on line 269 says "kwargs must be a dictionary" but the config key being validated is client_kwargs.
🐛 Proposed fix
         kwargs = config.get("client_kwargs", {})
         if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+            raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")

🧹 Nitpick comments (2)

src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (1)
9-101: Missing dimension property for consistency.

Other embedding functions in this PR (Jina, Siliconflow, AmazonBedrock, Voyageai) implement a dimension property that returns the embedding vector size. This is missing from GoogleVertexEmbeddingFunction, which may cause issues if callers expect this property.
♻️ Suggested addition
`@property`
def dimension(self) -> int:
    """Get the dimension of embeddings produced by this function."""
    # textembedding-gecko produces 768-dimensional embeddings
    # Add known dimensions mapping or probe via API
    known_dims = {
        "textembedding-gecko": 768,
        "textembedding-gecko@003": 768,
        "text-embedding-004": 768,
    }
    if self.model_name in known_dims:
        return known_dims[self.model_name]
    
    # Fallback: probe via API
    test_embeddings = self(["dimension probing"])
    return len(test_embeddings[0])
src/pyseekdb/client/embedding_function.py (1)
631-660: Consider partial registration for optional embedding functions.

The current implementation uses a single try/except block for all imports. If any one optional dependency is missing, none of the embedding functions will be registered. This could be problematic if a user has only some dependencies installed (e.g., has voyageai but not boto3).

Consider importing and registering each embedding function individually to allow partial availability:
♻️ Suggested approach for partial registration
# Register each optional embedding function individually
optional_functions = [
    ("sentence_transformer", "SentenceTransformerEmbeddingFunction"),
    ("openai", "OpenAIEmbeddingFunction"),
    ("qwen", "QwenEmbeddingFunction"),
    ("siliconflow", "SiliconflowEmbeddingFunction"),
    ("tencent_hunyuan", "TencentHunyuanEmbeddingFunction"),
    ("ollama", "OllamaEmbeddingFunction"),
    ("voyageai", "VoyageaiEmbeddingFunction"),
    ("google_vertex", "GoogleVertexEmbeddingFunction"),
    ("cohere", "CohereEmbeddingFunction"),
    ("jina", "JinaEmbeddingFunction"),
    ("amazon_bedrock", "AmazonBedrockEmbeddingFunction"),
]

for registry_name, class_name in optional_functions:
    try:
        module = __import__(
            "pyseekdb.utils.embedding_functions",
            fromlist=[class_name]
        )
        cls = getattr(module, class_name)
        cls._registry[registry_name] = cls
    except (ImportError, AttributeError) as e:
        logger.debug(f"Optional embedding function {class_name} not available: {e}")

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py`:
- Around line 68-78: The constructor currently enforces an API key via
self.api_key_env/self.api_key and raises if missing even though
TextEmbeddingModel.from_pretrained() uses ADC/OAuth2; remove the API key
validation and storage so ADC can be used: delete the lines that set
self.api_key_env, self.api_key and the if-not-self.api_key ValueError, and
update the __init__ signature or docs to no longer require api_key_env (or make
it optional and only store it without raising); keep
vertexai.init(project=project_id, location=region) and the call to
TextEmbeddingModel.from_pretrained(model_name) unchanged so authentication is
handled by ADC.

In `@src/pyseekdb/utils/embedding_functions/jina_embedding_function.py`:
- Line 49: Update the docstring in
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py so the
environment variable name matches the implementation: change the example text
that currently says "Set JINA_API_KEY environment variable first" to reference
the actual env var used by the code, JINA_AI_API_KEY (used in the module when
reading the API key). Ensure any docstring examples or comments in the
JinaEmbeddingFunction/Jina embedding helper mention JINA_AI_API_KEY
consistently.

In `@src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py`:
- Around line 96-101: The code in OllamaEmbeddingFunction currently mutates the
process-wide environment variable (api_key_env) by setting
os.environ[api_key_env] = "ollama"; remove that global mutation and instead
accept an api_key parameter (or read it from kwargs) in
OllamaEmbeddingFunction.__init__ (or the factory that constructs the
OpenAI/Ollama client) and pass that api_key directly into the client creation
path (where the Ollama/OpenAI client is instantiated) so no global env is
changed; if the base class does not accept api_key yet, add an optional api_key
kwarg to the base constructor and thread it through to OllamaEmbeddingFunction
so the fallback default "ollama" is applied locally rather than written to
os.environ, or alternatively add clear documentation in OllamaEmbeddingFunction
explaining why the env is intentionally left unmodified.

♻️ Duplicate comments (6)

src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (2)
243-267: Align client_kwargs naming and filter session_args None values.

Two small robustness fixes:

The error message should reference client_kwargs for consistency.

Filter None values before building a boto3.Session.
🛠️ Proposed fix
-        kwargs = config.get("client_kwargs", {})
-        if not isinstance(kwargs, dict):
-            raise TypeError(f"kwargs must be a dictionary, but got {kwargs}")
+        client_kwargs = config.get("client_kwargs", {})
+        if not isinstance(client_kwargs, dict):
+            raise TypeError(f"client_kwargs must be a dictionary, but got {client_kwargs}")
@@
-        session_args = config.get("session_args")
-        session = boto3.Session(**session_args) if session_args else boto3.Session()
+        session_args = config.get("session_args") or {}
+        if session_args:
+            session_args = {k: v for k, v in session_args.items() if v is not None}
+        session = boto3.Session(**session_args) if session_args else boto3.Session()
@@
-            **kwargs,
+            **client_kwargs,
Verification (boto3 Session args handling):
boto3 Session parameters None values behavior region_name profile_name
95-115: Config object is documented but will be rejected by validation, blocking a documented feature.

The docstring at line 92 explicitly documents the config: boto3 Config object as a valid kwarg, but the validation at lines 98–100 rejects it because botocore.config.Config is not a primitive type. Additionally, even if validation allowed it, storing Config in self._client_kwargs (line 111) would break serialization in get_config() since Config objects are not JSON-serializable.

To fix this, separate runtime kwargs (which may include Config) from serializable kwargs (which must remain primitive):
🐛 Suggested fix (allow runtime Config, keep persisted kwargs serializable)
-        for key, value in kwargs.items():
-            if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
-                raise TypeError(f"Keyword argument {key} is not a primitive type: {type(value)!s}")
+        runtime_kwargs = kwargs.copy()
+        serializable_kwargs: dict[str, Any] = {}
+        try:
+            from botocore.config import Config as BotoConfig
+        except Exception:
+            BotoConfig = None
+        for key, value in kwargs.items():
+            if key == "config" and BotoConfig is not None and isinstance(value, BotoConfig):
+                # Allow at runtime, but don't persist
+                continue
+            if not isinstance(value, (str, int, float, bool, list, dict, tuple, type(None))):
+                raise TypeError(f"Keyword argument {key} is not a primitive type: {type(value)!s}")
+            serializable_kwargs[key] = value
@@
-        self._client_kwargs = kwargs
+        self._client_kwargs = serializable_kwargs
@@
-        self._client = session.client("bedrock-runtime", **kwargs)
+        self._client = session.client("bedrock-runtime", **runtime_kwargs)
src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (2)

45-45: Environment variable name uses "CHROMA" prefix.

The default api_key_env is "CHROMA_GOOGLE_VERTEX_API_KEY" which appears to be copied from Chroma's codebase. Consider using a more project-appropriate name like "GOOGLE_VERTEX_API_KEY" or "PYSEEKDB_GOOGLE_VERTEX_API_KEY".

107-122: build_from_config is stricter than the constructor.

The constructor has defaults for all parameters, but build_from_config raises an error if any are missing. This means configs saved with defaults cannot be restored if any field is None. Consider using defaults similar to the constructor.

src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (1)

193-195: Error message references wrong parameter name.

The error message says "kwargs must be a dictionary" but the config parameter is client_kwargs. Update for consistency.

src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)

204-206: Error message references wrong parameter name.

The error message says "kwargs must be a dictionary" but the config parameter is client_kwargs. Update for consistency with the parameter name.

🧹 Nitpick comments (2)

src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
163-199: Optional: normalize embeddings to plain lists for defensive robustness.

The Voyage AI API with output_dtype="float" (already set explicitly) returns List[List[float]] per official documentation, which matches the Embeddings type requirement. However, defensive normalization could guard against unexpected API behavior changes:
♻️ Proposed refinement
-        # Convert to list of lists (in case voyageai returns numpy arrays or other formats)
-        return embeddings
+        # Convert to list of lists (in case voyageai returns numpy arrays or other formats)
+        normalized = []
+        for embedding in embeddings:
+            if hasattr(embedding, "tolist"):
+                normalized.append(embedding.tolist())
+            else:
+                normalized.append(list(embedding))
+        return normalized
src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
200-200: Inconsistent handling of missing model_name in build_from_config.

Unlike other embedding functions (e.g., JinaEmbeddingFunction, OpenAIEmbeddingFunction) that raise ValueError when model_name is missing, this implementation uses a default value. This inconsistency could lead to unexpected behavior if a config is partially corrupted. Consider aligning with other implementations by requiring model_name.
📝 Suggested fix for consistency
-        model_name = config.get("model_name", _DEFAULT_MODEL_NAME)
+        model_name = config.get("model_name")
+        if model_name is None:
+            raise ValueError("Missing required field 'model_name' in configuration")

coderabbitai · 2026-01-23T03:51:39Z

+        self.api_key_env = api_key_env
+        self.api_key = os.getenv(api_key_env)
+        if not self.api_key:
+            raise ValueError(f"The {api_key_env} environment variable is not set.")
+
+        self.model_name = model_name
+        self.project_id = project_id
+        self.region = region
+
+        vertexai.init(project=project_id, location=region)
+        self._model = TextEmbeddingModel.from_pretrained(model_name)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Google Vertex AI Python SDK authentication API key vs Application Default Credentials

💡 Result:

Vertex AI Python: API key vs Application Default Credentials (ADC)

What ADC is (recommended for most Vertex AI usage)

ADC is OAuth/IAM-based authentication: the SDK obtains short‑lived access tokens tied to a user or service account identity, and Vertex AI authorizes requests via IAM roles. (cloud.google.com)

The Vertex AI Python SDK setup docs explicitly guide you to authenticate locally with ADC via gcloud auth application-default login. (docs.cloud.google.com)

Use ADC for production and for essentially all non-trivial Vertex AI access (training, pipelines, endpoints, model management, etc.). (cloud.google.com)

What an API key is (more limited; often “testing”)

There are two “API key” ideas people conflate:

Plain API keys (project/quota identifier, not an IAM identity)

API keys generally don’t identify a caller principal for IAM authorization; they mainly associate calls with a project for billing/quota and only work where the API method explicitly allows API keys. (cloud.google.com)

Example: Vertex AI Search supports API keys only for specific methods (not broadly for everything). (cloud.google.com)

API keys bound to a service account (special case)

Google documents a special feature: API keys bound to a service account (so requests are authorized as that service account). It notes the only API that supports bound API keys is aiplatform.googleapis.com and warns not to use them in production. (us.tlz.app)

Generative AI on Vertex AI (Gemini): both are commonly mentioned

For Gemini on Vertex AI, Google’s docs say you can authenticate with a Google Cloud API key or ADC, and they recommend API keys for testing and ADC for production. (cloud.google.com)

The newer vertexai Python client reference shows a Client(..., api_key=..., credentials=...), i.e., it can be configured either way depending on your use case/library. (docs.cloud.google.com)

Practical guidance

Use ADC when you need: IAM enforcement, service accounts, least privilege, production deployments, or broad Vertex AI feature coverage. (cloud.google.com)

Use an API key only when the specific Vertex AI surface you’re calling supports it and you’re doing quick testing/prototyping (and restrict the key). (cloud.google.com)

If you tell me which Python package you’re using (google-cloud-aiplatform, vertexai, or google-genai) and which feature (Gemini, endpoints, pipelines, etc.), I can give the exact auth snippet for that SDK.

Citations:

1: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai

2: https://docs.cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk?utm_source=openai

3: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai

4: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai

5: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai

6: https://us.tlz.app/default/https/docs.cloud.google.com/docs/authentication/api-keys?utm_source=openai

7: https://cloud.google.com/vertex-ai/generative-ai/docs/start/gcp-auth?utm_source=openai

8: https://docs.cloud.google.com/python/docs/reference/vertexai/latest/vertexai.Client?utm_source=openai

9: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai

10: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai

🌐 Web query:

vertexai.init() function parameters credentials api_key Google Cloud Python SDK

💡 Result:

vertexai.init() (Vertex AI SDK for Python / google-cloud-aiplatform) signature includes both credentials and api_key:

vertexai.init( *, project: Optional[str] = None, location: Optional[str] = None, experiment: Optional[str] = None, experiment_description: Optional[str] = None, experiment_tensorboard: Optional[Union[str, Tensorboard, bool]] = None, staging_bucket: Optional[str] = None, credentials: Optional[google.auth.credentials.Credentials] = None, encryption_spec_key_name: Optional[str] = None, network: Optional[str] = None, service_account: Optional[str] = None, api_endpoint: Optional[str] = None, api_key: Optional[str] = None, api_transport: Optional[str] = None, # 'grpc' or 'rest' request_metadata: Optional[Sequence[Tuple[str, str]]] = None, )

Key auth-related parameters:

credentials: a google.auth.credentials.Credentials object (commonly from Application Default Credentials); if omitted, the SDK auto-detects credentials from the environment. [1], [3]

api_key: an API key string for service calls (not all Vertex AI services support API keys). [1]
If you need to create/get an API key in Google Cloud, follow the “Get a Google Cloud API key” guide. [2]

Sources: [1] Vertex AI Python reference (vertexai.init parameters), [2] Google Cloud “Get a Google Cloud API key”, [3] Google Cloud “Authenticate to Vertex AI” (ADC).

🌐 Web query:

TextEmbeddingModel from_pretrained Vertex AI authentication API key

💡 Result:

TextEmbeddingModel.from_pretrained(...) (from the Vertex AI Python SDK, vertexai.language_models) authenticates with Google Cloud credentials (OAuth2 / Application Default Credentials), not an “API key” parameter. The client libraries look for ADC automatically. (docs.cloud.google.com)

Use ADC (recommended)

Local dev:

gcloud auth application-default login gcloud config set project YOUR_PROJECT_ID

Then:

import vertexai from vertexai.language_models import TextEmbeddingModel vertexai.init(project="YOUR_PROJECT_ID", location="us-central1") model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001") emb = model.get_embeddings(["hello"])[0].values

(from_pretrained usage example is in the official reference.) (cloud.google.com)

Or use a Service Account key (non-interactive)

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

Then run the same Python code (ADC will pick it up). (docs.cloud.google.com)

About “API keys”

Google Cloud docs mention API keys for “Gemini on Vertex AI,” but many Vertex AI REST endpoints reject API keys and require OAuth2/ADC; if you try ?key=... you can get API keys are not supported by this API. (cloud.google.com)

If you tell me which library you’re using (vertexai / google-cloud-aiplatform vs the newer google-genai SDK) and your runtime (local, GCE, Cloud Run, etc.), I can give the exact minimal setup for that environment.

Citations:

1: https://docs.cloud.google.com/vertex-ai/docs/authentication?utm_source=openai

2: https://cloud.google.com/python/docs/reference/aiplatform/1.28.0/vertexai.language_models.TextEmbeddingModel?utm_source=openai

3: https://docs.cloud.google.com/vertex-ai/docs/authentication?utm_source=openai

4: https://cloud.google.com/vertex-ai/generative-ai/docs/start/api-keys?utm_source=openai

API key validation blocks legitimate usage but is never passed to TextEmbeddingModel.

The code validates an API key from an environment variable and raises an error if missing, but TextEmbeddingModel.from_pretrained() authenticates using Application Default Credentials (ADC) / OAuth2, not API keys. This unnecessary validation will prevent the code from running when ADC is properly configured but the environment variable is not set.

Remove the API key requirement since TextEmbeddingModel does not support API key authentication:

Current code (lines 68-71)

self.api_key_env = api_key_env self.api_key = os.getenv(api_key_env) if not self.api_key: raise ValueError(f"The {api_key_env} environment variable is not set.")

Either remove the api_key_env parameter and related validation entirely, or document why it's stored if not used.

🧰 Tools

🪛 Ruff (0.14.13)

71-71: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py` around lines 68 - 78, The constructor currently enforces an API key via self.api_key_env/self.api_key and raises if missing even though TextEmbeddingModel.from_pretrained() uses ADC/OAuth2; remove the API key validation and storage so ADC can be used: delete the lines that set self.api_key_env, self.api_key and the if-not-self.api_key ValueError, and update the __init__ signature or docs to no longer require api_key_env (or make it optional and only store it without raising); keep vertexai.init(project=project_id, location=region) and the call to TextEmbeddingModel.from_pretrained(model_name) unchanged so authentication is handled by ADC.

coderabbitai · 2026-01-23T03:51:39Z

+        from pyseekdb.utils.embedding_functions import JinaEmbeddingFunction
+
+        # Using Jina AI embedding model
+        # Set JINA_API_KEY environment variable first


⚠️ Potential issue | 🟡 Minor

Docstring references incorrect environment variable name.

The docstring example says Set JINA_API_KEY environment variable first, but the actual default env var name in the code is JINA_AI_API_KEY (line 98). Update the docstring to match the actual implementation.

📝 Suggested fix

- # Set JINA_API_KEY environment variable first + # Set JINA_AI_API_KEY environment variable first

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Set JINA_API_KEY environment variable first

# Set JINA_AI_API_KEY environment variable first

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/jina_embedding_function.py` at line 49, Update the docstring in src/pyseekdb/utils/embedding_functions/jina_embedding_function.py so the environment variable name matches the implementation: change the example text that currently says "Set JINA_API_KEY environment variable first" to reference the actual env var used by the code, JINA_AI_API_KEY (used in the module when reading the API key). Ensure any docstring examples or comments in the JinaEmbeddingFunction/Jina embedding helper mention JINA_AI_API_KEY consistently.

coderabbitai · 2026-01-23T03:51:39Z

+        # Handle the case where API key env var is not set (Ollama ignores the key anyway)
+        # We'll set a default value of "ollama" if not found
+        api_key = os.environ.get(api_key_env)
+        if api_key is None:
+            # Set a default API key since Ollama requires it but ignores it
+            os.environ[api_key_env] = "ollama"


⚠️ Potential issue | 🟡 Minor

Global environment modification has side effects.

Setting os.environ[api_key_env] = "ollama" persists globally for the process lifetime. This could affect other code that checks for the presence of this environment variable, or cause issues in concurrent scenarios where multiple embedding functions are instantiated.

Consider passing the API key directly to the OpenAI client via kwargs instead of modifying the global environment:

📝 Suggested alternative approach

# Handle the case where API key env var is not set (Ollama ignores the key anyway) - # We'll set a default value of "ollama" if not found api_key = os.environ.get(api_key_env) if api_key is None: - # Set a default API key since Ollama requires it but ignores it - os.environ[api_key_env] = "ollama" + # Ollama requires an API key but ignores it - pass directly + kwargs["api_key"] = "ollama" + else: + kwargs["api_key"] = api_key

Note: This would require the base class to accept api_key in kwargs. Alternatively, document this behavior clearly if the current approach is intentional.

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py` around lines 96 - 101, The code in OllamaEmbeddingFunction currently mutates the process-wide environment variable (api_key_env) by setting os.environ[api_key_env] = "ollama"; remove that global mutation and instead accept an api_key parameter (or read it from kwargs) in OllamaEmbeddingFunction.__init__ (or the factory that constructs the OpenAI/Ollama client) and pass that api_key directly into the client creation path (where the Ollama/OpenAI client is instantiated) so no global env is changed; if the base class does not accept api_key yet, add an optional api_key kwarg to the base constructor and thread it through to OllamaEmbeddingFunction so the fallback default "ollama" is applied locally rather than written to os.environ, or alternatively add clear documentation in OllamaEmbeddingFunction explaining why the env is intentionally left unmodified.

hnwyllmm added 2 commits January 22, 2026 16:09

support more embedding function platforms

ebe7186

refine embedding functions

9c49f6a

hnwyllmm requested a review from Copilot January 22, 2026 08:12

Copilot started reviewing on behalf of hnwyllmm January 22, 2026 08:12 View session

Copilot AI reviewed Jan 22, 2026

View reviewed changes

ruff check

23a8ce3

coderabbitai Bot reviewed Jan 22, 2026

View reviewed changes

Comment thread src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py

Comment thread tests/unit_tests/test_jina_embedding_function.py

Comment thread tests/unit_tests/test_ollama_embedding_function.py

Comment thread tests/unit_tests/test_ollama_embedding_function.py

hnwyllmm added 2 commits January 22, 2026 21:26

add unittest for embedding function classes

aac5e22

fix version detect issue

fd9f3f6

hnwyllmm requested a review from Copilot January 22, 2026 13:33

Copilot started reviewing on behalf of hnwyllmm January 22, 2026 13:33 View session

coderabbitai Bot reviewed Jan 22, 2026

View reviewed changes

Copilot AI reviewed Jan 22, 2026

View reviewed changes

rename tecent to tencent

d8758bd

coderabbitai Bot reviewed Jan 23, 2026

View reviewed changes

Comment thread src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py

Comment thread src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py

hnwyllmm added 2 commits January 23, 2026 09:45

fix test issue from pilot

c5af3e1

refine comments

006e138

hnwyllmm merged commit 7f467b2 into oceanbase:develop Jan 23, 2026
6 of 7 checks passed

hnwyllmm deleted the embedding-function branch January 23, 2026 03:48

coderabbitai Bot reviewed Jan 23, 2026

View reviewed changes

This was referenced Jan 26, 2026

feat: add Morph embedding function support #143

Merged

feat: add mistral embedding function support #153

Merged

fix ruff check #162

Merged

support python3.14 by using sentence-transformers #163

Merged

	# Set JINA_API_KEY environment variable first
	# Set JINA_AI_API_KEY environment variable first

Conversation

hnwyllmm commented Jan 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Pull request overview

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 23, 2026

Choose a reason for hiding this comment

Vertex AI Python: API key vs Application Default Credentials (ADC)

What ADC is (recommended for most Vertex AI usage)

What an API key is (more limited; often “testing”)

Generative AI on Vertex AI (Gemini): both are commonly mentioned

hnwyllmm commented Jan 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jan 22, 2026 •

edited

Loading