support embedding function persistence#114
Conversation
📝 WalkthroughWalkthroughAdds embedding-function serialization APIs ( Changes
Sequence DiagramsequenceDiagram
participant Client
participant Registry as EmbeddingFunction<br/>(Protocol)
participant Impl as Concrete<br/>(OpenAI/Qwen/SentenceTransformer)
participant Store as Configuration<br/>Storage
rect rgba(100, 150, 255, 0.5)
Note over Client,Impl: Serialization (save)
Client->>Impl: instance = Impl(...)
Client->>Impl: name = instance.name()
Impl-->>Client: "openai"/"qwen"/"sentence_transformer"
Client->>Impl: config = instance.get_config()
Impl-->>Client: {model_name, api_key_env, ...}
Client->>Store: save(name, config)
end
rect rgba(255, 150, 100, 0.5)
Note over Client,Registry: Deserialization (restore)
Client->>Store: (name, config) = load()
Store-->>Client: ("openai", {...})
Client->>Registry: choose implementation by name
Registry->>Impl: Impl.build_from_config(config)
Impl-->>Client: instance
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Line 8: The requires-python constraint currently uses ">=3.11,<=3.13" which
wrongly excludes 3.13.x patch releases; update the requires-python entry (the
requires-python key in pyproject.toml) to allow all 3.13 patch versions (for
example use a range like ">=3.11,<3.14") so 3.13.1+ are accepted while still
capping before 3.14.
In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 138-166: The build_from_config in OpenAIEmbeddingFunction should
validate and normalize client_kwargs before unpacking; update build_from_config
to retrieve client_kwargs = config.get("client_kwargs", {}) then check its type
(isinstance(client_kwargs, dict)), if not, either coerce to {} or raise a
ValueError with a clear message, and only then pass **client_kwargs into the
OpenAIEmbeddingFunction constructor (reference:
OpenAIEmbeddingFunction.build_from_config and the client_kwargs variable).
🧹 Nitpick comments (3)
src/pyseekdb/client/embedding_function.py (2)
46-96: Addname()andbuild_from_config()to the Protocol for type-safety.The docstring documents these methods, but the Protocol doesn’t declare them, so type checking won’t enforce the persistence API. Consider adding them as abstract static methods.
♻️ Suggested interface addition
class EmbeddingFunction(Protocol[D]): @@ + `@staticmethod` + `@abstractmethod` + def name() -> str: + ... + @@ `@abstractmethod` def get_config(self) -> Dict[str, Any]: ... + + `@staticmethod` + `@abstractmethod` + def build_from_config(config: Dict[str, Any]) -> "EmbeddingFunction": + ...
506-508: Silence unused-argument lint inbuild_from_config.
configis intentionally unused here, but lint will flag it. Rename or explicitly ignore to avoid noise.✅ Minimal lint fix
- def build_from_config(config: Dict[str, Any]) -> "DefaultEmbeddingFunction": + def build_from_config(_config: Dict[str, Any]) -> "DefaultEmbeddingFunction": return DefaultEmbeddingFunction()tests/unit_tests/test_openai_embedding_function.py (1)
17-31: Centralize theis_openai_available()helper intest_utils.py.This helper is duplicated identically in both
test_qwen_embedding_function.pyandtest_openai_embedding_function.py. Moving it totest_utils.pywhere other test utilities likeEnvGuardalready live will eliminate duplication and prevent future drift.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.github/workflows/ci.ymlpyproject.tomlsrc/pyseekdb/client/embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_base_embedding_function.pysrc/pyseekdb/utils/embedding_functions/openai_embedding_function.pysrc/pyseekdb/utils/embedding_functions/qwen_embedding_function.pysrc/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.pytests/unit_tests/test_default_embedding_function.pytests/unit_tests/test_env_guard.pytests/unit_tests/test_openai_embedding_function.pytests/unit_tests/test_qwen_embedding_function.pytests/unit_tests/test_sentence_transformer_embedding_function.pytests/unit_tests/test_utils.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
name(120-126)get_config(128-134)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
name(86-92)get_config(94-105)build_from_config(108-130)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
tests/unit_tests/test_sentence_transformer_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (4)
SentenceTransformerEmbeddingFunction(5-130)name(86-92)get_config(94-105)build_from_config(108-130)
tests/unit_tests/test_openai_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
env_guard(134-152)tests/unit_tests/test_qwen_embedding_function.py (1)
is_openai_available(20-27)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
OpenAIEmbeddingFunction(13-166)name(122-128)get_config(130-136)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
tests/unit_tests/test_env_guard.py (1)
tests/unit_tests/test_utils.py (4)
EnvGuard(11-130)env_guard(134-152)save(51-67)restore(96-117)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
src/pyseekdb/client/embedding_function.py (2)
get_config(85-96)get_config(503-504)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
get_config(130-136)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
get_config(128-134)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
get_config(94-105)
tests/unit_tests/test_qwen_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
env_guard(134-152)tests/unit_tests/test_openai_embedding_function.py (7)
is_openai_available(20-27)test_name(370-372)test_get_config_with_defaults(374-387)test_get_config_with_dimensions(409-418)test_build_from_config_with_defaults(420-437)test_build_from_config_with_custom_values(439-458)test_persistence_roundtrip(460-475)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
QwenEmbeddingFunction(14-164)name(120-126)get_config(128-134)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
src/pyseekdb/client/embedding_function.py (4)
name(500-501)get_config(85-96)get_config(503-504)build_from_config(507-508)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
name(122-128)get_config(130-136)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
name(120-126)get_config(128-134)build_from_config(137-164)
tests/unit_tests/test_default_embedding_function.py (1)
src/pyseekdb/client/embedding_function.py (5)
DefaultEmbeddingFunction(115-511)name(500-501)get_config(85-96)get_config(503-504)build_from_config(507-508)
src/pyseekdb/client/embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
get_config(203-218)src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
get_config(130-136)name(122-128)build_from_config(139-166)src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
get_config(128-134)name(120-126)build_from_config(137-164)src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
get_config(94-105)name(86-92)build_from_config(108-130)
🪛 Ruff (0.14.11)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
153-153: Avoid specifying long messages outside the exception class
(TRY003)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
151-151: Avoid specifying long messages outside the exception class
(TRY003)
tests/unit_tests/test_sentence_transformer_embedding_function.py
93-93: Possible hardcoded password assigned to argument: "use_auth_token"
(S106)
98-98: Possible hardcoded password assigned to: "use_auth_token"
(S105)
src/pyseekdb/client/embedding_function.py
507-507: Unused static method argument: config
(ARG004)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: integration-test (server)
- GitHub Check: integration-test (oceanbase)
- GitHub Check: integration-test (embedded)
- GitHub Check: unit-test
🔇 Additional comments (16)
.github/workflows/ci.yml (2)
28-30: No additional concerns for this change.
90-90: No additional concerns for this change.src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
119-163: LGTM — clean persistence hooks added.
name(),get_config(), andbuild_from_config()align with the new embedding function protocol and keep configuration restoration consistent.src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
85-130: LGTM — consistent serialization/restore API.Defaults and config round‑trip look correct and match the new protocol.
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
203-218: Client_kwargs may contain non-serializable objects; however, no JSON serialization of configs occurs in the current implementation.The current
get_config()implementation returns_client_kwargsdirectly, which could theoretically contain non-serializable objects likehttpx.Timeout(the OpenAI client does accept these). However, based on a comprehensive review of the codebase, embedding function configs are not persisted as JSON. Configs are only used in the in-memory roundtrip pattern:get_config()→build_from_config(). Both operations work with Python dictionaries and do not require JSON serialization. JSON serialization in the client code is limited to metadata and document fields, not embedding function configurations. The suggested guard is not needed for the current architecture.Likely an incorrect or invalid review comment.
tests/unit_tests/test_default_embedding_function.py (1)
18-58: Good persistence coverage.Covers identifier, config shape, build_from_config, and roundtrip, which matches the new contract.
tests/unit_tests/test_env_guard.py (1)
15-112: Nice coverage of EnvGuard usage patterns.The scenarios exercise restore semantics, nesting, and convenience helpers clearly.
tests/unit_tests/test_utils.py (1)
39-152: EnvGuard utility looks solid.Clear save/set/restore semantics and a lightweight contextmanager wrapper.
tests/unit_tests/test_openai_embedding_function.py (1)
363-475: Nice coverage for config persistence and round‑trip behavior.
The tests cover name/get_config/build_from_config plus round‑trip. Please verify thatOpenAIEmbeddingFunctioninitialization here remains side‑effect free (no network calls) when using dummy keys.tests/unit_tests/test_qwen_embedding_function.py (2)
17-31: Guarding Qwen tests with the OpenAI client check is sensible.
This avoids import errors when the openai-compatible client isn’t installed. Please confirm the guard matches the actual import path used byQwenEmbeddingFunction.
340-449: Persistence test coverage for Qwen looks solid.
Defaults, custom values, build_from_config, and round‑trip are well covered. Please verify these remain API‑call free withenv_guardso they can run offline.tests/unit_tests/test_sentence_transformer_embedding_function.py (5)
1-40: Module setup and availability helpers are clear and robust.
The guards for sentence‑transformers and CUDA keep the suite resilient in limited environments. Please confirm CI images that should run these tests include the required deps.
42-85: get_config defaults and custom CUDA values are well exercised.
Good coverage for name/defaults and CUDA-path config serialization.
100-186: build_from_config defaults, partial configs, and round‑trip are nicely covered.
These tests make the persistence contract explicit.
187-302: CUDA‑specific persistence and embedding checks are well gated.
Skip conditions keep these tests safe on non‑CUDA systems while still validating behavior when available.
86-99: [Your rewritten review comment text here]
[Exactly ONE classification tag]
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Summary
close #50
Solution Description
Summary by CodeRabbit
New Features
Chores
Tests
✏️ Tip: You can customize this high-level summary in your review settings.