Support more embedding function platforms#132
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds eight new embedding-provider implementations, exposes them via the embedding-function registry, renames the LiteLLM base class, updates registry auto-registration to log ImportError warnings, expands CI to install additional test deps and run integration tests, and adds many environment-gated unit tests plus an integration-fixture refactor. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant Registry as Registry
participant EF as EmbeddingFunction
participant API as ExternalAPI
Client->>Registry: request embedding function by name
Registry-->>Client: returns EmbeddingFunction instance
Client->>EF: __call__(documents)
EF->>API: call provider API (model, api_key, params)
API-->>EF: embeddings response
EF->>EF: validate and cache dimension
EF-->>Client: return embeddings
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This pull request adds support for 8 new embedding function platforms to pyseekdb: Ollama, Jina AI, Voyage AI, Tencent Hunyuan, SiliconFlow, Google Vertex AI, Cohere, and Amazon Bedrock. The changes enable users to leverage a wider variety of embedding models through a consistent interface.
Changes:
- Added 8 new embedding function implementations with comprehensive documentation and examples
- Renamed
LiteLLMEmbeddingFunctiontoLiteLLMBaseEmbeddingFunctionto better reflect its role as a base class - Updated the embedding function registry to automatically register all new platforms
- Added test files for Ollama and Jina embedding functions
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 16 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py | New embedding function for local Ollama models using OpenAI-compatible API |
| src/pyseekdb/utils/embedding_functions/jina_embedding_function.py | New embedding function for Jina AI models using LiteLLM base |
| src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py | New embedding function for Voyage AI models with native voyageai package |
| src/pyseekdb/utils/embedding_functions/tengxun_hunyuan_embedding_function.py | New embedding function for Tencent Hunyuan models using OpenAI-compatible API |
| src/pyseekdb/utils/embedding_functions/siliconflow_embedding_function.py | New embedding function for SiliconFlow models using OpenAI-compatible API |
| src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py | New embedding function for Google Vertex AI models using LiteLLM base |
| src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py | New embedding function for Cohere models using LiteLLM base |
| src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py | New embedding function for Amazon Bedrock models using boto3 |
| src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py | Renamed class from LiteLLMEmbeddingFunction to LiteLLMBaseEmbeddingFunction for clarity |
| src/pyseekdb/utils/embedding_functions/init.py | Updated exports to include all new embedding functions |
| src/pyseekdb/client/embedding_function.py | Updated registry initialization and improved type hints |
| tests/unit_tests/test_ollama_embedding_function.py | Comprehensive test suite for Ollama embedding function |
| tests/unit_tests/test_jina_embedding_function.py | Comprehensive test suite for Jina embedding function |
Comments suppressed due to low confidence (1)
src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py:60
- The docstring on line 60 says "Initialize." but should be more descriptive, like "Initialize LiteLLMBaseEmbeddingFunction." to match the naming convention used in other embedding functions.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tests/integration_tests/test_detect_db_type_and_version.py (1)
45-61: Scope db_client fixture per test to avoid cross‑mode failures.
db_clientis parameterized for embedded/server/oceanbase, but these tests assert a single db_type. They’ll fail in mismatched modes (e.g., seekdb test running on oceanbase). Scope each test to the intended mode(s).✅ Suggested fix (scope fixture per test)
+ `@pytest.mark.parametrize`("db_client", ["server"], indirect=True) def test_seekdb_type_detection(self, db_client): """Test: detect seekdb Server type and version""" # Verify client type assert db_client is not None assert hasattr(db_client, "_server") assert isinstance(db_client._server, pyseekdb.RemoteServerClient) @@ + `@pytest.mark.parametrize`("db_client", ["oceanbase"], indirect=True) def test_ob_type_detection(self, db_client): """Test: detect OceanBase Server type and version""" # Verify client type assert db_client is not None assert hasattr(db_client, "_server") assert isinstance(db_client._server, pyseekdb.RemoteServerClient)Also applies to: 68-85
src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py (3)
171-172: Critical bug:inputis undefined; should bedocuments.The validation at lines 171-172 references
input, which is not defined in this scope. The local variable holding the documents isdocuments.🐛 Proposed fix
# Validate that we got the expected number of embeddings - if len(embeddings) != len(input): - raise ValueError(f"Expected {len(input)} embeddings but got {len(embeddings)} from LiteLLM") + if len(embeddings) != len(documents): + raise ValueError(f"Expected {len(documents)} embeddings but got {len(embeddings)} from LiteLLM")
19-19: Docstring references outdated class nameLiteLLMEmbeddingFunction.The class was renamed to
LiteLLMBaseEmbeddingFunction, but the example in the docstring still imports the old name.📝 Proposed fix
- from pyseekdb.utils.embedding_functions import LiteLLMEmbeddingFunction + from pyseekdb.utils.embedding_functions import LiteLLMBaseEmbeddingFunctionAlso update subsequent usages in the docstring examples (lines 23, 30, 36) to use
LiteLLMBaseEmbeddingFunction.
123-125: Error message references outdated class name.The error message at line 124 still refers to
LiteLLMEmbeddingFunctioninstead of the renamedLiteLLMBaseEmbeddingFunction.📝 Proposed fix
raise ValueError( f"API key environment variable '{self.api_key_env}' is not set. " - f"Please set it before using LiteLLMEmbeddingFunction." + f"Please set it before using {self.__class__.__name__}." )Using
self.__class__.__name__makes the error message accurate for both the base class and any subclasses.
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py`:
- Around line 270-271: The code passes session_args (from
config.get("session_args")) directly into boto3.Session which can include keys
with None values and cause unexpected behavior; filter out any entries where the
value is None before constructing the session (e.g., build a cleaned dict from
session_args removing None values) and then call
boto3.Session(**cleaned_session_args) if cleaned_session_args is non-empty,
otherwise call boto3.Session() — update the logic around the
session_args/session creation in amazon_bedrock_embedding_function.py to use the
cleaned dictionary.
In `@tests/unit_tests/test_jina_embedding_function.py`:
- Around line 83-95: The test test_initialization_with_custom_api_key_env leaks
the CUSTOM_JINA_KEY env var; wrap setting of the env var in an env guard (or use
pytest's monkeypatch) so the environment is restored after the test and no
global state persists; specifically, when creating the
JinaEmbeddingFunction(model_name="jina-embeddings-v3",
api_key_env=custom_key_env) ensure you set the env var via an
env_guard/monkeypatch helper and remove/restore it in teardown so ef.api_key_env
can be asserted without leaving CUSTOM_JINA_KEY in os.environ.
In `@tests/unit_tests/test_ollama_embedding_function.py`:
- Around line 429-440: The test currently asserts that
OllamaEmbeddingFunction.build_from_config raises ValueError for a non-dict
client_kwargs, but the implementation raises TypeError; update the test to
expect TypeError instead of ValueError (change the pytest.raises call in
test_build_from_config_invalid_client_kwargs to match "TypeError" or the string
message from the raised TypeError) so it aligns with the behavior of
OllamaEmbeddingFunction.build_from_config.
- Around line 108-121: The test_initialization_with_custom_api_key_env sets
CUSTOM_OLLAMA_KEY but doesn't restore or remove it, risking cross-test
pollution; update this test to mirror test_initialization_without_api_key by
using env_guard (or a try/finally) around setting custom_key_env so the original
environment value is restored or the variable removed after the test, ensuring
test_ollama_env and the OllamaEmbeddingFunction(model_name="nomic-embed-text",
api_key_env=custom_key_env) assertion remain unchanged.
♻️ Duplicate comments (8)
src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (2)
110-112: kwargs validation may be too restrictive for boto3 Config objects.This validation rejects non-primitive types, but boto3 users may want to pass
botocore.config.Configobjects via kwargs. This was previously flagged.
255-257: Error message references wrong parameter name.The error message says
"kwargs must be a dictionary"but the config key is"client_kwargs".📝 Proposed fix
kwargs = config.get("client_kwargs", {}) if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (2)
84-85: Docstring default value is inconsistent with actual default.The docstring states the default is
"jina-embeddings-v2-base-en"but the actual default parameter on line 77 is"jina-embeddings-v3".📝 Proposed fix
Args: model_name (str, optional): Name of the Jina AI embedding model. - Defaults to "jina-embeddings-v2-base-en". + Defaults to "jina-embeddings-v3".
212-214: Error message references wrong parameter name.The error message says
"kwargs must be a dictionary"but the config key being validated is"client_kwargs".📝 Proposed fix
kwargs = config.get("client_kwargs", {}) if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/tengxun_hunyuan_embedding_function.py (1)
81-89: Redundant assignment ofdimensionsvariable.The
dimensions = Noneassignment on line 89 is never used since line 104 always passesdimensions=Noneexplicitly. While harmless, it can be removed for clarity.Proposed fix
# Warn if dimensions parameter is provided since it's not supported if dimensions is not None: warnings.warn( "The dimensions parameter is not supported by Tencent Hunyuan API. " "Dimensions are fixed at 1024. The provided dimensions parameter will be ignored.", UserWarning, stacklevel=2, ) - dimensions = Nonesrc/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (1)
242-243: Error message should referenceclient_kwargsfor consistency.The error message says "kwargs" but the config parameter is named "client_kwargs".
Proposed fix
if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
216-217: Error message should referenceclient_kwargsfor consistency.Same issue as in other embedding functions - the error message says "kwargs" but the config parameter is "client_kwargs".
Proposed fix
if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
275-276: Error message should referenceclient_kwargsfor consistency.Same issue as in other embedding functions.
Proposed fix
if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
🧹 Nitpick comments (6)
.github/workflows/ci.yml (1)
69-70: Fragile failure detection pattern.The
tail -n 1 | grep -q 'failed'approach can produce false negatives if:
- pytest crashes before producing summary output
- The summary format changes or doesn't end with "failed" on the last line
- Tests error (vs fail) may not contain "failed" in the last line
A more robust approach is to capture pytest's exit code directly:
♻️ Suggested improvement
- uv run pytest tests/integration_tests/ -v --log-cli-level=${log_level} -k "not server and not embedded and not oceanbase" | tee pytest.log - tail -n 1 pytest.log | grep -q 'failed' && exit 1 || exit 0 + uv run pytest tests/integration_tests/ -v --log-cli-level=${log_level} -k "not server and not embedded and not oceanbase" 2>&1 | tee pytest.log + exit ${PIPESTATUS[0]}Note: This is an existing pattern throughout the workflow, so this is a low-priority suggestion that could be addressed separately across all steps.
src/pyseekdb/client/embedding_function.py (1)
517-519:Selfreturn type on static method is unconventional.Using
Selfas the return type for a@staticmethodis semantically incorrect.Selfis designed for instance methods and@classmethodwhere it refers to the type of the current class instance. For a static method, it doesn't have access to the class context to determineSelf.Since
DefaultEmbeddingFunction.build_from_configalways returnsDefaultEmbeddingFunction(), the return type should be the concrete class.📝 Suggested fix
`@staticmethod` - def build_from_config(_config: dict[str, Any]) -> Self: + def build_from_config(_config: dict[str, Any]) -> "DefaultEmbeddingFunction": return DefaultEmbeddingFunction()src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (1)
186-209: Sequential API calls may be slow for large document batches.Each document triggers a separate API call. While this matches Bedrock's Titan embedding API constraints (single input per call), consider adding a note in the docstring about potential latency for large batches, or consider batching where supported by newer models.
tests/unit_tests/test_jina_embedding_function.py (1)
107-119: Consider usingenv_guardfor consistency.This test correctly uses try/finally for cleanup, but using
env_guard(JINA_AI_API_KEY=None)would be more consistent with the persistence tests and cleaner.tests/unit_tests/test_ollama_embedding_function.py (2)
54-55: Consider being more explicit about caught exceptions.While catching all exceptions is acceptable for availability checks, consider catching more specific exceptions (e.g.,
openai.APIConnectionError,ConnectionError) or adding a brief comment explaining the intent. This helps distinguish intentional broad catching from accidental oversight.💡 Optional: Add comment or narrow exception types
- except Exception: + except Exception: # Any failure means Ollama is not available return FalseOr more specific:
- except Exception: + except (ConnectionError, openai.APIConnectionError, openai.APIError): return False
67-77: Redundant environment checks in each test method.The class-level
@pytest.mark.skipif(not is_ollama_available(), ...)decorator already ensures tests only run when Ollama is available. Callingself.test_ollama_env()at the start of each test method is redundant and adds unnecessary overhead.Consider removing these calls from individual test methods, keeping only
test_ollama_envas a standalone sanity check if desired.
There was a problem hiding this comment.
Actionable comments posted: 12
🤖 Fix all issues with AI agents
In `@tests/unit_tests/test_amazon_bedrock_embedding_function.py`:
- Around line 220-236: The test test_dimension_property_unknown_model
incorrectly uses a known model ("amazon.titan-embed-text-v2") that's present in
the _AMAZON_BEDROCK_MODEL_DIMENSIONS map so
AmazonBedrockEmbeddingFunction.dimension returns from the known-map path instead
of exercising the fallback probe; either rename the test to
test_dimension_property_known_model to reflect current behavior or change the
model_name in that test to a truly unknown model ID (one not present in
_AMAZON_BEDROCK_MODEL_DIMENSIONS) so the
AmazonBedrockEmbeddingFunction.dimension property executes the probing/API call
path.
- Around line 41-52: The code currently catches a broad Exception in
are_aws_credentials_available() (and the profile-name test) which can hide real
failures; narrow the except blocks to handle specific failures by importing
botocore.exceptions and catching ImportError for missing boto3,
botocore.exceptions.NoCredentialsError for missing credentials,
botocore.exceptions.ProfileNotFound for invalid profiles, and
botocore.exceptions.BotoCoreError as a final botocore-specific fallback, leaving
other exceptions to propagate; update the exception handlers in the
are_aws_credentials_available() function and the profile-name test to use these
specific exception types and ensure boto3 import errors are handled with
ImportError.
In `@tests/unit_tests/test_cohere_embedding_function.py`:
- Around line 134-146: The test restores COHERE_API_KEY only when original_key
is truthy, which fails to restore an original empty-string value; update the
teardown in test_initialization_missing_api_key to restore the environment
variable unconditionally by checking for presence with "is not None" and setting
os.environ["COHERE_API_KEY"] = original_key when original_key is not None so the
environment is exactly as it was before the test (reference:
CohereEmbeddingFunction and the test_initialization_missing_api_key function).
- Around line 212-224: The test test_dimension_property_unknown_model is using a
model present in the cached known-dimension map so the probe logic is bypassed;
update the test to remove the cached entry for "embed-english-v3.0" from the
CohereEmbeddingFunction known-dimensions cache (e.g., pop or delete the map
entry on CohereEmbeddingFunction) before creating the instance so ef.dimension
triggers the fallback API probe path; reference the CohereEmbeddingFunction
class and the cache (the known-dimension map used by that class) to locate where
to remove the cached model entry.
In `@tests/unit_tests/test_google_vertex_embedding_function.py`:
- Around line 255-364: Tests expect a "region" key and AssertionError on missing
fields, but the implementation of GoogleVertexEmbeddingFunction uses
get_config() returning "location" and build_from_config() only requires
"project_id" while defaulting other fields; update the tests to match the
implementation by (1) asserting config["location"] (not "region") in
test_get_config_with_defaults and test_get_config_with_custom_values and
asserting restored_ef.location (or check region mapping if present) in
test_build_from_config_with_defaults/custom, and (2) remove or change the tests
test_build_from_config_missing_api_key_env,
test_build_from_config_missing_model_name,
test_build_from_config_missing_project_id, and
test_build_from_config_missing_region so they no longer expect AssertionError
but instead verify the actual behavior of build_from_config() (e.g., that
missing optional fields are defaulted and only missing project_id raises if
that's the real required field); update references to get_config and
build_from_config in those tests accordingly.
- Around line 50-138: Tests instantiate GoogleVertexEmbeddingFunction() with
wrong expectations: the constructor requires project_id, uses "location" not
"region", and model_name is prefixed with "vertex_ai/". Update the tests
(functions like test_initialization_with_defaults,
test_initialization_with_custom_region, test_initialization_with_all_parameters,
etc.) to pass a valid project_id when constructing
GoogleVertexEmbeddingFunction, assert on ef.location instead of ef.region, and
either assert the prefixed model_name ("vertex_ai/textembedding-gecko" or the
exact provided variant) or call ef.get_config() to compare the base model name;
also update any api_key_env expectations if the implementation uses a different
default name.
- Around line 34-48: The tests call GoogleVertexEmbeddingFunction() without the
required project_id and also assume an API key env var is mandatory; update the
tests to pass a valid project_id to every GoogleVertexEmbeddingFunction(...)
instantiation and change the pytest.mark.skipif guard (which currently checks
CHROMA_GOOGLE_VERTEX_API_KEY and is_vertexai_available()) so the API key env var
is optional when ADC is available (i.e., only skip when is_vertexai_available()
is false and no ADC can be detected), and if you intended API key validation
enforce it in GoogleVertexEmbeddingFunction.__init__ (or adjust
LiteLLMBaseEmbeddingFunction.__init__ behavior) so that missing api_key_env
raises the expected error rather than causing a TypeError from missing
project_id.
In `@tests/unit_tests/test_siliconflow_embedding_function.py`:
- Around line 199-211: The test test_dimension_property_unknown_model currently
uses a known model ("BAAI/bge-large-zh-v1.5") so
SiliconflowEmbeddingFunction.dimension reads from _SILICONFLOW_MODEL_DIMENSIONS
instead of exercising the API fallback; update the test to use a model name not
present in _SILICONFLOW_MODEL_DIMENSIONS (e.g., "unknown-model-xyz") or mock the
dimensions lookup so that SiliconflowEmbeddingFunction (class) performs the API
call path, ensuring the assertion still checks dim > 0 and the printed message
reflects an API-derived dimension.
- Around line 104-114: The test test_initialization_with_custom_api_base
incorrectly uses the default URL as the "custom" API base, so it won't catch an
implementation that ignores the api_base argument; update the test to pass a
clearly different URL (e.g., "https://custom.example/v1") to
SiliconflowEmbeddingFunction (constructed as ef =
SiliconflowEmbeddingFunction(..., api_base=custom_base)) and assert ef.api_base
equals that distinct custom_base string to ensure the api_base parameter is
honored by the constructor.
In `@tests/unit_tests/test_tencent_hunyuan_embedding_function.py`:
- Around line 67-79: The test_initialization_with_custom_api_key_env sets an
environment variable custom_key_env ("CUSTOM_HUNYUAN_KEY") without cleaning it
up, which can leak into other tests; modify the test to use the existing
env_guard fixture/context to set CUSTOM_HUNYUAN_KEY (or wrap the os.environ
assignment) so the variable is automatically restored/cleared after the test,
and retain the rest of the assertions that check ef.api_key_env.
In `@tests/unit_tests/test_voyageai_embedding_function.py`:
- Around line 190-198: The test_initialization_without_voyageai method is never
executed due to a class-level skipif that skips the whole test class when
voyageai is unavailable while the test itself skips when voyageai is present;
extract test_initialization_without_voyageai out of the skipped test class and
place it at module level with an inverse skip decorator (e.g.,
skipif(is_voyageai_available(), reason="voyageai is available")) so it only runs
when voyageai is not installed, remove the in-method pytest.skip check, and
ensure it uses the existing is_voyageai_available() helper and does not depend
on VOYAGE_API_KEY.
- Around line 241-253: The test test_dimension_property_unknown_model currently
uses "voyage-4-large" which is present in the known-dimensions map so the
API-probing branch isn't exercised; modify the test to accept the pytest
monkeypatch fixture, remove the entry for "voyage-4-large" from
VoyageaiEmbeddingFunction.known_dimensions (or the specific cache dict) using
monkeypatch.delitem or monkeypatch.setitem to ensure the lookup falls through to
the API-probing path, then instantiate
VoyageaiEmbeddingFunction(model_name="voyage-4-large") and access ef.dimension
to trigger the fallback API call and assert dim > 0.
🧹 Nitpick comments (5)
tests/unit_tests/test_tencent_hunyuan_embedding_function.py (2)
49-66: Redundant environment checks inside test methods.Since the class is decorated with
@pytest.mark.skipifthat already checks both conditions, theself.test_hunyuan_env()calls inside each test method (lines 54, 71, 85, 97, etc.) are redundant. The class will only run if conditions are already met.♻️ Suggested simplification
Remove the
self.test_hunyuan_env()calls from individual test methods. The class-level skipif decorator already guarantees the prerequisites:def test_initialization_with_defaults(self): """Test TencentHunyuanEmbeddingFunction initialization with default values""" print("\n✅ Testing TencentHunyuanEmbeddingFunction initialization with defaults") - # Check if openai is available and env vars are set - self.test_hunyuan_env() - ef = TencentHunyuanEmbeddingFunction()Apply similar changes to other test methods that call
self.test_hunyuan_env().
169-173: Consider suppressing expected warning.Line 170 creates an instance with
dimensions=512, which triggers a warning (as verified intest_initialization_with_dimensions_warning). Suppress it here to reduce test noise, consistent withtest_get_config_with_dimensions(lines 316-318).♻️ Suggested fix
# Test that dimension is always 1024 regardless of initialization - ef2 = TencentHunyuanEmbeddingFunction(dimensions=512) # Should warn but dimension still 1024 + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + ef2 = TencentHunyuanEmbeddingFunction(dimensions=512) # Should warn but dimension still 1024 dim2 = ef2.dimensiontests/unit_tests/test_siliconflow_embedding_function.py (2)
42-46: Redundant environment checks within test methods.Since the class is decorated with
@pytest.mark.skipifthat already verifiesSILICONFLOW_API_KEYandis_openai_available(), the repeated calls toself.test_siliconflow_env()at the start of each test method (lines 53, 70, 94, 108, 120, 131, 155, 203, etc.) are redundant. These conditions are guaranteed to be true when any test in this class executes.Consider removing the redundant
self.test_siliconflow_env()calls from individual test methods to reduce test execution overhead.
90-102: Environment variable not cleaned up after test.The custom environment variable set at line 98 is never cleaned up. If this test or subsequent tests fail,
CUSTOM_SILICONFLOW_KEYremains in the environment, potentially affecting test isolation.Consider using the
env_guardcontext manager (already imported fromtest_utils) for consistency with the persistence tests.♻️ Proposed fix
def test_initialization_with_custom_api_key_env(self): """Test SiliconflowEmbeddingFunction initialization with custom API key env""" print("\n✅ Testing SiliconflowEmbeddingFunction initialization with custom API key env") - self.test_siliconflow_env() - custom_key_env = "CUSTOM_SILICONFLOW_KEY" - if not os.environ.get(custom_key_env): - os.environ[custom_key_env] = os.environ.get("SILICONFLOW_API_KEY", "your-custom-key") - - ef = SiliconflowEmbeddingFunction(model_name="BAAI/bge-large-zh-v1.5", api_key_env=custom_key_env) - assert ef.api_key_env == custom_key_env - print(f" Custom API key env: {ef.api_key_env}") + with env_guard(**{custom_key_env: os.environ.get("SILICONFLOW_API_KEY", "test-key")}): + ef = SiliconflowEmbeddingFunction(model_name="BAAI/bge-large-zh-v1.5", api_key_env=custom_key_env) + assert ef.api_key_env == custom_key_env + print(f" Custom API key env: {ef.api_key_env}")tests/unit_tests/test_voyageai_embedding_function.py (1)
102-108: Avoid leaking CUSTOM_VOYAGE_KEY into the process environment.
Useenv_guard(or restore the original value) so this test doesn’t affect later tests.♻️ Proposed refactor
- custom_key_env = "CUSTOM_VOYAGE_KEY" - if not os.environ.get(custom_key_env): - os.environ[custom_key_env] = os.environ.get("VOYAGE_API_KEY", "your-custom-key") - - ef = VoyageaiEmbeddingFunction(model_name="voyage-4-large", api_key_env=custom_key_env) - assert ef.api_key_env == custom_key_env - print(f" Custom API key env: {ef.api_key_env}") + custom_key_env = "CUSTOM_VOYAGE_KEY" + with env_guard(**{custom_key_env: os.environ.get("VOYAGE_API_KEY", "your-custom-key")}): + ef = VoyageaiEmbeddingFunction(model_name="voyage-4-large", api_key_env=custom_key_env) + assert ef.api_key_env == custom_key_env + print(f" Custom API key env: {ef.api_key_env}")
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pyseekdb/utils/embedding_functions/litellm_base_embedding_function.py (1)
17-19: Docstring example still references old class name.The example code in the docstring imports
LiteLLMEmbeddingFunction, but the class has been renamed toLiteLLMBaseEmbeddingFunction.📝 Suggested fix
- from pyseekdb.utils.embedding_functions import LiteLLMEmbeddingFunction + from pyseekdb.utils.embedding_functions import LiteLLMBaseEmbeddingFunctionAlso update the instantiation examples below (lines 23, 30, 36) to use
LiteLLMBaseEmbeddingFunction.
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py`:
- Line 45: The default environment variable name for the API key is incorrectly
set to "CHROMA_GOOGLE_VERTEX_API_KEY"; update the default value of the
api_key_env parameter in the GoogleVertexEmbeddingFunction (the api_key_env
parameter in google_vertex_embedding_function.py) to a project-appropriate name
such as "GOOGLE_VERTEX_API_KEY" or "PYSEEKDB_GOOGLE_VERTEX_API_KEY" and ensure
any related documentation or default usage in that module is adjusted
accordingly.
- Around line 107-122: The build_from_config method in
GoogleVertexEmbeddingFunction is stricter than the constructor: instead of
raising ValueError when api_key_env, model_name, project_id, or region are
missing, use the constructor defaults for any missing values; update
build_from_config to extract those keys from config and pass them (even if None)
into GoogleVertexEmbeddingFunction(api_key_env=..., model_name=...,
project_id=..., region=...) so it behaves consistently with the __init__
defaults and can reconstruct configs that rely on defaults.
♻️ Duplicate comments (4)
src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (1)
265-266: PassingNonevalues insession_argstoboto3.Session()may cause issues.When
session_argscontainsNonevalues (e.g.,{"region_name": None, "profile_name": None}), passing them toboto3.Session()may cause unexpected behavior compared to not passing them at all.📝 Suggested fix (from past review)
session_args = config.get("session_args") - session = boto3.Session(**session_args) if session_args else boto3.Session() + if session_args: + # Filter out None values to avoid passing them to boto3.Session + filtered_args = {k: v for k, v in session_args.items() if v is not None} + session = boto3.Session(**filtered_args) if filtered_args else boto3.Session() + else: + session = boto3.Session()src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (1)
196-199: Error message references "kwargs" but config key is "client_kwargs".For consistency with the config structure and better debugging experience, the error message should reference the actual config key name.
📝 Suggested fix
kwargs = config.get("client_kwargs", {}) if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
206-219: Error message references wrong parameter name.The error message on line 212 says "kwargs must be a dictionary" but the config key being validated is
client_kwargs. This inconsistency could confuse users debugging configuration issues.🐛 Proposed fix
kwargs = config.get("client_kwargs", {}) if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
259-278: Error message references wrong parameter name.Similar to the Cohere implementation, the error message on line 269 says "kwargs must be a dictionary" but the config key being validated is
client_kwargs.🐛 Proposed fix
kwargs = config.get("client_kwargs", {}) if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + raise TypeError(f"client_kwargs must be a dictionary, but got {kwargs}")
🧹 Nitpick comments (2)
src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (1)
9-101: Missingdimensionproperty for consistency.Other embedding functions in this PR (Jina, Siliconflow, AmazonBedrock, Voyageai) implement a
dimensionproperty that returns the embedding vector size. This is missing fromGoogleVertexEmbeddingFunction, which may cause issues if callers expect this property.♻️ Suggested addition
`@property` def dimension(self) -> int: """Get the dimension of embeddings produced by this function.""" # textembedding-gecko produces 768-dimensional embeddings # Add known dimensions mapping or probe via API known_dims = { "textembedding-gecko": 768, "textembedding-gecko@003": 768, "text-embedding-004": 768, } if self.model_name in known_dims: return known_dims[self.model_name] # Fallback: probe via API test_embeddings = self(["dimension probing"]) return len(test_embeddings[0])src/pyseekdb/client/embedding_function.py (1)
631-660: Consider partial registration for optional embedding functions.The current implementation uses a single try/except block for all imports. If any one optional dependency is missing, none of the embedding functions will be registered. This could be problematic if a user has only some dependencies installed (e.g., has
voyageaibut notboto3).Consider importing and registering each embedding function individually to allow partial availability:
♻️ Suggested approach for partial registration
# Register each optional embedding function individually optional_functions = [ ("sentence_transformer", "SentenceTransformerEmbeddingFunction"), ("openai", "OpenAIEmbeddingFunction"), ("qwen", "QwenEmbeddingFunction"), ("siliconflow", "SiliconflowEmbeddingFunction"), ("tencent_hunyuan", "TencentHunyuanEmbeddingFunction"), ("ollama", "OllamaEmbeddingFunction"), ("voyageai", "VoyageaiEmbeddingFunction"), ("google_vertex", "GoogleVertexEmbeddingFunction"), ("cohere", "CohereEmbeddingFunction"), ("jina", "JinaEmbeddingFunction"), ("amazon_bedrock", "AmazonBedrockEmbeddingFunction"), ] for registry_name, class_name in optional_functions: try: module = __import__( "pyseekdb.utils.embedding_functions", fromlist=[class_name] ) cls = getattr(module, class_name) cls._registry[registry_name] = cls except (ImportError, AttributeError) as e: logger.debug(f"Optional embedding function {class_name} not available: {e}")
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py`:
- Around line 68-78: The constructor currently enforces an API key via
self.api_key_env/self.api_key and raises if missing even though
TextEmbeddingModel.from_pretrained() uses ADC/OAuth2; remove the API key
validation and storage so ADC can be used: delete the lines that set
self.api_key_env, self.api_key and the if-not-self.api_key ValueError, and
update the __init__ signature or docs to no longer require api_key_env (or make
it optional and only store it without raising); keep
vertexai.init(project=project_id, location=region) and the call to
TextEmbeddingModel.from_pretrained(model_name) unchanged so authentication is
handled by ADC.
In `@src/pyseekdb/utils/embedding_functions/jina_embedding_function.py`:
- Line 49: Update the docstring in
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py so the
environment variable name matches the implementation: change the example text
that currently says "Set JINA_API_KEY environment variable first" to reference
the actual env var used by the code, JINA_AI_API_KEY (used in the module when
reading the API key). Ensure any docstring examples or comments in the
JinaEmbeddingFunction/Jina embedding helper mention JINA_AI_API_KEY
consistently.
In `@src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py`:
- Around line 96-101: The code in OllamaEmbeddingFunction currently mutates the
process-wide environment variable (api_key_env) by setting
os.environ[api_key_env] = "ollama"; remove that global mutation and instead
accept an api_key parameter (or read it from kwargs) in
OllamaEmbeddingFunction.__init__ (or the factory that constructs the
OpenAI/Ollama client) and pass that api_key directly into the client creation
path (where the Ollama/OpenAI client is instantiated) so no global env is
changed; if the base class does not accept api_key yet, add an optional api_key
kwarg to the base constructor and thread it through to OllamaEmbeddingFunction
so the fallback default "ollama" is applied locally rather than written to
os.environ, or alternatively add clear documentation in OllamaEmbeddingFunction
explaining why the env is intentionally left unmodified.
♻️ Duplicate comments (6)
src/pyseekdb/utils/embedding_functions/amazon_bedrock_embedding_function.py (2)
243-267: Alignclient_kwargsnaming and filtersession_argsNone values.Two small robustness fixes:
- The error message should reference
client_kwargsfor consistency.- Filter
Nonevalues before building aboto3.Session.🛠️ Proposed fix
- kwargs = config.get("client_kwargs", {}) - if not isinstance(kwargs, dict): - raise TypeError(f"kwargs must be a dictionary, but got {kwargs}") + client_kwargs = config.get("client_kwargs", {}) + if not isinstance(client_kwargs, dict): + raise TypeError(f"client_kwargs must be a dictionary, but got {client_kwargs}") @@ - session_args = config.get("session_args") - session = boto3.Session(**session_args) if session_args else boto3.Session() + session_args = config.get("session_args") or {} + if session_args: + session_args = {k: v for k, v in session_args.items() if v is not None} + session = boto3.Session(**session_args) if session_args else boto3.Session() @@ - **kwargs, + **client_kwargs,Verification (boto3
Sessionargs handling):boto3 Session parameters None values behavior region_name profile_name
95-115: Config object is documented but will be rejected by validation, blocking a documented feature.The docstring at line 92 explicitly documents the
config: boto3 Config objectas a valid kwarg, but the validation at lines 98–100 rejects it becausebotocore.config.Configis not a primitive type. Additionally, even if validation allowed it, storingConfiginself._client_kwargs(line 111) would break serialization inget_config()sinceConfigobjects are not JSON-serializable.To fix this, separate runtime kwargs (which may include
Config) from serializable kwargs (which must remain primitive):🐛 Suggested fix (allow runtime Config, keep persisted kwargs serializable)
- for key, value in kwargs.items(): - if not isinstance(value, (str, int, float, bool, list, dict, tuple)): - raise TypeError(f"Keyword argument {key} is not a primitive type: {type(value)!s}") + runtime_kwargs = kwargs.copy() + serializable_kwargs: dict[str, Any] = {} + try: + from botocore.config import Config as BotoConfig + except Exception: + BotoConfig = None + for key, value in kwargs.items(): + if key == "config" and BotoConfig is not None and isinstance(value, BotoConfig): + # Allow at runtime, but don't persist + continue + if not isinstance(value, (str, int, float, bool, list, dict, tuple, type(None))): + raise TypeError(f"Keyword argument {key} is not a primitive type: {type(value)!s}") + serializable_kwargs[key] = value @@ - self._client_kwargs = kwargs + self._client_kwargs = serializable_kwargs @@ - self._client = session.client("bedrock-runtime", **kwargs) + self._client = session.client("bedrock-runtime", **runtime_kwargs)src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py (2)
45-45: Environment variable name uses "CHROMA" prefix.The default
api_key_envis"CHROMA_GOOGLE_VERTEX_API_KEY"which appears to be copied from Chroma's codebase. Consider using a more project-appropriate name like"GOOGLE_VERTEX_API_KEY"or"PYSEEKDB_GOOGLE_VERTEX_API_KEY".
107-122:build_from_configis stricter than the constructor.The constructor has defaults for all parameters, but
build_from_configraises an error if any are missing. This means configs saved with defaults cannot be restored if any field isNone. Consider using defaults similar to the constructor.src/pyseekdb/utils/embedding_functions/jina_embedding_function.py (1)
193-195: Error message references wrong parameter name.The error message says "kwargs must be a dictionary" but the config parameter is
client_kwargs. Update for consistency.src/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
204-206: Error message references wrong parameter name.The error message says "kwargs must be a dictionary" but the config parameter is
client_kwargs. Update for consistency with the parameter name.
🧹 Nitpick comments (2)
src/pyseekdb/utils/embedding_functions/voyageai_embedding_function.py (1)
163-199: Optional: normalize embeddings to plain lists for defensive robustness.The Voyage AI API with
output_dtype="float"(already set explicitly) returnsList[List[float]]per official documentation, which matches theEmbeddingstype requirement. However, defensive normalization could guard against unexpected API behavior changes:♻️ Proposed refinement
- # Convert to list of lists (in case voyageai returns numpy arrays or other formats) - return embeddings + # Convert to list of lists (in case voyageai returns numpy arrays or other formats) + normalized = [] + for embedding in embeddings: + if hasattr(embedding, "tolist"): + normalized.append(embedding.tolist()) + else: + normalized.append(list(embedding)) + return normalizedsrc/pyseekdb/utils/embedding_functions/cohere_embedding_function.py (1)
200-200: Inconsistent handling of missingmodel_nameinbuild_from_config.Unlike other embedding functions (e.g.,
JinaEmbeddingFunction,OpenAIEmbeddingFunction) that raiseValueErrorwhenmodel_nameis missing, this implementation uses a default value. This inconsistency could lead to unexpected behavior if a config is partially corrupted. Consider aligning with other implementations by requiringmodel_name.📝 Suggested fix for consistency
- model_name = config.get("model_name", _DEFAULT_MODEL_NAME) + model_name = config.get("model_name") + if model_name is None: + raise ValueError("Missing required field 'model_name' in configuration")
| self.api_key_env = api_key_env | ||
| self.api_key = os.getenv(api_key_env) | ||
| if not self.api_key: | ||
| raise ValueError(f"The {api_key_env} environment variable is not set.") | ||
|
|
||
| self.model_name = model_name | ||
| self.project_id = project_id | ||
| self.region = region | ||
|
|
||
| vertexai.init(project=project_id, location=region) | ||
| self._model = TextEmbeddingModel.from_pretrained(model_name) |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Google Vertex AI Python SDK authentication API key vs Application Default Credentials
💡 Result:
Vertex AI Python: API key vs Application Default Credentials (ADC)
What ADC is (recommended for most Vertex AI usage)
- ADC is OAuth/IAM-based authentication: the SDK obtains short‑lived access tokens tied to a user or service account identity, and Vertex AI authorizes requests via IAM roles. (cloud.google.com)
- The Vertex AI Python SDK setup docs explicitly guide you to authenticate locally with ADC via
gcloud auth application-default login. (docs.cloud.google.com) - Use ADC for production and for essentially all non-trivial Vertex AI access (training, pipelines, endpoints, model management, etc.). (cloud.google.com)
What an API key is (more limited; often “testing”)
There are two “API key” ideas people conflate:
- Plain API keys (project/quota identifier, not an IAM identity)
- API keys generally don’t identify a caller principal for IAM authorization; they mainly associate calls with a project for billing/quota and only work where the API method explicitly allows API keys. (cloud.google.com)
- Example: Vertex AI Search supports API keys only for specific methods (not broadly for everything). (cloud.google.com)
- API keys bound to a service account (special case)
- Google documents a special feature: API keys bound to a service account (so requests are authorized as that service account). It notes the only API that supports bound API keys is
aiplatform.googleapis.comand warns not to use them in production. (us.tlz.app)
Generative AI on Vertex AI (Gemini): both are commonly mentioned
- For Gemini on Vertex AI, Google’s docs say you can authenticate with a Google Cloud API key or ADC, and they recommend API keys for testing and ADC for production. (cloud.google.com)
- The newer
vertexaiPython client reference shows aClient(..., api_key=..., credentials=...), i.e., it can be configured either way depending on your use case/library. (docs.cloud.google.com)
Practical guidance
- Use ADC when you need: IAM enforcement, service accounts, least privilege, production deployments, or broad Vertex AI feature coverage. (cloud.google.com)
- Use an API key only when the specific Vertex AI surface you’re calling supports it and you’re doing quick testing/prototyping (and restrict the key). (cloud.google.com)
If you tell me which Python package you’re using (google-cloud-aiplatform, vertexai, or google-genai) and which feature (Gemini, endpoints, pipelines, etc.), I can give the exact auth snippet for that SDK.
Citations:
- 1: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai
- 2: https://docs.cloud.google.com/vertex-ai/docs/python-sdk/use-vertex-ai-python-sdk?utm_source=openai
- 3: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai
- 4: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai
- 5: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai
- 6: https://us.tlz.app/default/https/docs.cloud.google.com/docs/authentication/api-keys?utm_source=openai
- 7: https://cloud.google.com/vertex-ai/generative-ai/docs/start/gcp-auth?utm_source=openai
- 8: https://docs.cloud.google.com/python/docs/reference/vertexai/latest/vertexai.Client?utm_source=openai
- 9: https://cloud.google.com/vertex-ai/docs/authentication?utm_source=openai
- 10: https://cloud.google.com/generative-ai-app-builder/docs/authentication?utm_source=openai
🌐 Web query:
vertexai.init() function parameters credentials api_key Google Cloud Python SDK
💡 Result:
vertexai.init() (Vertex AI SDK for Python / google-cloud-aiplatform) signature includes both credentials and api_key:
vertexai.init(
*,
project: Optional[str] = None,
location: Optional[str] = None,
experiment: Optional[str] = None,
experiment_description: Optional[str] = None,
experiment_tensorboard: Optional[Union[str, Tensorboard, bool]] = None,
staging_bucket: Optional[str] = None,
credentials: Optional[google.auth.credentials.Credentials] = None,
encryption_spec_key_name: Optional[str] = None,
network: Optional[str] = None,
service_account: Optional[str] = None,
api_endpoint: Optional[str] = None,
api_key: Optional[str] = None,
api_transport: Optional[str] = None, # 'grpc' or 'rest'
request_metadata: Optional[Sequence[Tuple[str, str]]] = None,
)Key auth-related parameters:
credentials: agoogle.auth.credentials.Credentialsobject (commonly from Application Default Credentials); if omitted, the SDK auto-detects credentials from the environment. [1], [3]api_key: an API key string for service calls (not all Vertex AI services support API keys). [1]
If you need to create/get an API key in Google Cloud, follow the “Get a Google Cloud API key” guide. [2]
Sources: [1] Vertex AI Python reference (vertexai.init parameters), [2] Google Cloud “Get a Google Cloud API key”, [3] Google Cloud “Authenticate to Vertex AI” (ADC).
🌐 Web query:
TextEmbeddingModel from_pretrained Vertex AI authentication API key
💡 Result:
TextEmbeddingModel.from_pretrained(...) (from the Vertex AI Python SDK, vertexai.language_models) authenticates with Google Cloud credentials (OAuth2 / Application Default Credentials), not an “API key” parameter. The client libraries look for ADC automatically. (docs.cloud.google.com)
Use ADC (recommended)
Local dev:
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_IDThen:
import vertexai
from vertexai.language_models import TextEmbeddingModel
vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")
model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
emb = model.get_embeddings(["hello"])[0].values(from_pretrained usage example is in the official reference.) (cloud.google.com)
Or use a Service Account key (non-interactive)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"Then run the same Python code (ADC will pick it up). (docs.cloud.google.com)
About “API keys”
Google Cloud docs mention API keys for “Gemini on Vertex AI,” but many Vertex AI REST endpoints reject API keys and require OAuth2/ADC; if you try ?key=... you can get API keys are not supported by this API. (cloud.google.com)
If you tell me which library you’re using (vertexai / google-cloud-aiplatform vs the newer google-genai SDK) and your runtime (local, GCE, Cloud Run, etc.), I can give the exact minimal setup for that environment.
Citations:
- 1: https://docs.cloud.google.com/vertex-ai/docs/authentication?utm_source=openai
- 2: https://cloud.google.com/python/docs/reference/aiplatform/1.28.0/vertexai.language_models.TextEmbeddingModel?utm_source=openai
- 3: https://docs.cloud.google.com/vertex-ai/docs/authentication?utm_source=openai
- 4: https://cloud.google.com/vertex-ai/generative-ai/docs/start/api-keys?utm_source=openai
API key validation blocks legitimate usage but is never passed to TextEmbeddingModel.
The code validates an API key from an environment variable and raises an error if missing, but TextEmbeddingModel.from_pretrained() authenticates using Application Default Credentials (ADC) / OAuth2, not API keys. This unnecessary validation will prevent the code from running when ADC is properly configured but the environment variable is not set.
Remove the API key requirement since TextEmbeddingModel does not support API key authentication:
Current code (lines 68-71)
self.api_key_env = api_key_env
self.api_key = os.getenv(api_key_env)
if not self.api_key:
raise ValueError(f"The {api_key_env} environment variable is not set.")
Either remove the api_key_env parameter and related validation entirely, or document why it's stored if not used.
🧰 Tools
🪛 Ruff (0.14.13)
71-71: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/google_vertex_embedding_function.py`
around lines 68 - 78, The constructor currently enforces an API key via
self.api_key_env/self.api_key and raises if missing even though
TextEmbeddingModel.from_pretrained() uses ADC/OAuth2; remove the API key
validation and storage so ADC can be used: delete the lines that set
self.api_key_env, self.api_key and the if-not-self.api_key ValueError, and
update the __init__ signature or docs to no longer require api_key_env (or make
it optional and only store it without raising); keep
vertexai.init(project=project_id, location=region) and the call to
TextEmbeddingModel.from_pretrained(model_name) unchanged so authentication is
handled by ADC.
| from pyseekdb.utils.embedding_functions import JinaEmbeddingFunction | ||
|
|
||
| # Using Jina AI embedding model | ||
| # Set JINA_API_KEY environment variable first |
There was a problem hiding this comment.
Docstring references incorrect environment variable name.
The docstring example says Set JINA_API_KEY environment variable first, but the actual default env var name in the code is JINA_AI_API_KEY (line 98). Update the docstring to match the actual implementation.
📝 Suggested fix
- # Set JINA_API_KEY environment variable first
+ # Set JINA_AI_API_KEY environment variable first📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Set JINA_API_KEY environment variable first | |
| # Set JINA_AI_API_KEY environment variable first |
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/jina_embedding_function.py` at line
49, Update the docstring in
src/pyseekdb/utils/embedding_functions/jina_embedding_function.py so the
environment variable name matches the implementation: change the example text
that currently says "Set JINA_API_KEY environment variable first" to reference
the actual env var used by the code, JINA_AI_API_KEY (used in the module when
reading the API key). Ensure any docstring examples or comments in the
JinaEmbeddingFunction/Jina embedding helper mention JINA_AI_API_KEY
consistently.
| # Handle the case where API key env var is not set (Ollama ignores the key anyway) | ||
| # We'll set a default value of "ollama" if not found | ||
| api_key = os.environ.get(api_key_env) | ||
| if api_key is None: | ||
| # Set a default API key since Ollama requires it but ignores it | ||
| os.environ[api_key_env] = "ollama" |
There was a problem hiding this comment.
Global environment modification has side effects.
Setting os.environ[api_key_env] = "ollama" persists globally for the process lifetime. This could affect other code that checks for the presence of this environment variable, or cause issues in concurrent scenarios where multiple embedding functions are instantiated.
Consider passing the API key directly to the OpenAI client via kwargs instead of modifying the global environment:
📝 Suggested alternative approach
# Handle the case where API key env var is not set (Ollama ignores the key anyway)
- # We'll set a default value of "ollama" if not found
api_key = os.environ.get(api_key_env)
if api_key is None:
- # Set a default API key since Ollama requires it but ignores it
- os.environ[api_key_env] = "ollama"
+ # Ollama requires an API key but ignores it - pass directly
+ kwargs["api_key"] = "ollama"
+ else:
+ kwargs["api_key"] = api_keyNote: This would require the base class to accept api_key in kwargs. Alternatively, document this behavior clearly if the current approach is intentional.
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/ollama_embedding_function.py` around
lines 96 - 101, The code in OllamaEmbeddingFunction currently mutates the
process-wide environment variable (api_key_env) by setting
os.environ[api_key_env] = "ollama"; remove that global mutation and instead
accept an api_key parameter (or read it from kwargs) in
OllamaEmbeddingFunction.__init__ (or the factory that constructs the
OpenAI/Ollama client) and pass that api_key directly into the client creation
path (where the Ollama/OpenAI client is instantiated) so no global env is
changed; if the base class does not accept api_key yet, add an optional api_key
kwarg to the base constructor and thread it through to OllamaEmbeddingFunction
so the fallback default "ollama" is applied locally rather than written to
os.environ, or alternatively add clear documentation in OllamaEmbeddingFunction
explaining why the env is intentionally left unmodified.
Summary
close #127
Pull request overview
This pull request adds support for 8 new embedding function platforms to pyseekdb: Ollama, Jina AI, Voyage AI, Tencent Hunyuan, SiliconFlow, Google Vertex AI, Cohere, and Amazon Bedrock. The changes enable users to leverage a wider variety of embedding models through a consistent interface.
Changes:
LiteLLMEmbeddingFunctiontoLiteLLMBaseEmbeddingFunctionto better reflect its role as a base classSummary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.