Skip to content

support embedding function persistence#114

Merged
hnwyllmm merged 3 commits into
oceanbase:developfrom
hnwyllmm:embedding-persistent
Jan 19, 2026
Merged

support embedding function persistence#114
hnwyllmm merged 3 commits into
oceanbase:developfrom
hnwyllmm:embedding-persistent

Conversation

@hnwyllmm

@hnwyllmm hnwyllmm commented Jan 16, 2026

Copy link
Copy Markdown
Member

Summary

close #50

Solution Description

Summary by CodeRabbit

  • New Features

    • Embedding functions can be identified, export their configuration, and be reconstructed from saved configs.
  • Chores

    • Tightened Python compatibility upper bound to allow up to 3.13.
    • CI adds a pre-test step to install ML/embedding test dependencies.
  • Tests

    • Many new unit tests for embedding persistence and reconstruction.
    • New test utilities for safely managing environment variables during tests.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Jan 16, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Adds embedding-function serialization APIs (name(), get_config(), build_from_config()), implements them across default and concrete embedding functions, adds EnvGuard test utilities and many persistence tests, tightens Python upper bound, and adds CI test-dependency installation steps.

Changes

Cohort / File(s) Summary
CI & Project Config
​.github/workflows/ci.yml, pyproject.toml
Adds a CI step to install test deps (openai, sentence-transformers, torch, torchvision, torchaudio) and narrows project.requires-python from >=3.11,<4.0 to >=3.11,<3.14.
Embedding Function Protocol & Default
src/pyseekdb/client/embedding_function.py
Adds protocol methods name(), get_config(), build_from_config() and implements them in DefaultEmbeddingFunction.
OpenAI base + OpenAI impl
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py, .../openai_embedding_function.py
Adds get_config() to base; openai impl gains name(), get_config(), build_from_config() with validation and config handling.
Qwen embedding
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
Adds name(), get_config(), and build_from_config() with validation/default extraction.
SentenceTransformer embedding
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
Adds name(), get_config(), and build_from_config() supporting model/device/kwargs serialization and reconstruction.
Test utilities
tests/unit_tests/test_utils.py
Adds EnvGuard class and env_guard context manager to save/set/restore environment variables during tests.
EnvGuard tests
tests/unit_tests/test_env_guard.py
Adds tests for EnvGuard usage patterns: context manager, save/restore, nesting, unsetting, and chaining.
Default embedding tests
tests/unit_tests/test_default_embedding_function.py
Adds tests for name(), get_config(), build_from_config(), and a persistence roundtrip.
OpenAI embedding tests
tests/unit_tests/test_openai_embedding_function.py
Adds persistence-focused tests with an availability guard for the openai package (note: duplicated test class observed).
Qwen embedding tests
tests/unit_tests/test_qwen_embedding_function.py
Adds persistence-focused tests and an openai availability check to skip logic (note: duplicated test class observed).
SentenceTransformer tests
tests/unit_tests/test_sentence_transformer_embedding_function_persistence.py
Adds extensive persistence tests for defaults, custom kwargs, device/CUDA paths, and roundtrip verification.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Registry as EmbeddingFunction<br/>(Protocol)
    participant Impl as Concrete<br/>(OpenAI/Qwen/SentenceTransformer)
    participant Store as Configuration<br/>Storage

    rect rgba(100, 150, 255, 0.5)
    Note over Client,Impl: Serialization (save)
    Client->>Impl: instance = Impl(...)
    Client->>Impl: name = instance.name()
    Impl-->>Client: "openai"/"qwen"/"sentence_transformer"
    Client->>Impl: config = instance.get_config()
    Impl-->>Client: {model_name, api_key_env, ...}
    Client->>Store: save(name, config)
    end

    rect rgba(255, 150, 100, 0.5)
    Note over Client,Registry: Deserialization (restore)
    Client->>Store: (name, config) = load()
    Store-->>Client: ("openai", {...})
    Client->>Registry: choose implementation by name
    Registry->>Impl: Impl.build_from_config(config)
    Impl-->>Client: instance
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Poem

🐰 I nibble configs, soft and light,

name and config set just right.
Save my settings, build anew,
Roundtrip tested, through and through.
Hooray — embeddings wake and hop! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning PR adds name(), get_config(), and build_from_config() methods to embedding functions enabling persistence, but does not implement the complete feature: no kv_attributes field in table metadata, no retrieval logic in get_collection, and no embedding function reconstruction in the API. Implement storage of embedding function name in table metadata, add retrieval logic in get_collection API, and instantiate the embedding function object from stored metadata when calling get_collection.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'support embedding function persistence' is concise and directly describes the main change: adding serialization/deserialization capabilities to embedding functions.
Out of Scope Changes check ✅ Passed All changes directly support embedding function persistence infrastructure: protocol extensions, concrete implementations of serialization methods, unit tests, test utilities (EnvGuard), and CI/dependencies. No unrelated changes detected.
Docstring Coverage ✅ Passed Docstring coverage is 94.87% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@pyproject.toml`:
- Line 8: The requires-python constraint currently uses ">=3.11,<=3.13" which
wrongly excludes 3.13.x patch releases; update the requires-python entry (the
requires-python key in pyproject.toml) to allow all 3.13 patch versions (for
example use a range like ">=3.11,<3.14") so 3.13.1+ are accepted while still
capping before 3.14.

In `@src/pyseekdb/utils/embedding_functions/openai_embedding_function.py`:
- Around line 138-166: The build_from_config in OpenAIEmbeddingFunction should
validate and normalize client_kwargs before unpacking; update build_from_config
to retrieve client_kwargs = config.get("client_kwargs", {}) then check its type
(isinstance(client_kwargs, dict)), if not, either coerce to {} or raise a
ValueError with a clear message, and only then pass **client_kwargs into the
OpenAIEmbeddingFunction constructor (reference:
OpenAIEmbeddingFunction.build_from_config and the client_kwargs variable).
🧹 Nitpick comments (3)
src/pyseekdb/client/embedding_function.py (2)

46-96: Add name() and build_from_config() to the Protocol for type-safety.

The docstring documents these methods, but the Protocol doesn’t declare them, so type checking won’t enforce the persistence API. Consider adding them as abstract static methods.

♻️ Suggested interface addition
 class EmbeddingFunction(Protocol[D]):
@@
+    `@staticmethod`
+    `@abstractmethod`
+    def name() -> str:
+        ...
+
@@
     `@abstractmethod`
     def get_config(self) -> Dict[str, Any]:
         ...
+
+    `@staticmethod`
+    `@abstractmethod`
+    def build_from_config(config: Dict[str, Any]) -> "EmbeddingFunction":
+        ...

506-508: Silence unused-argument lint in build_from_config.

config is intentionally unused here, but lint will flag it. Rename or explicitly ignore to avoid noise.

✅ Minimal lint fix
-    def build_from_config(config: Dict[str, Any]) -> "DefaultEmbeddingFunction":
+    def build_from_config(_config: Dict[str, Any]) -> "DefaultEmbeddingFunction":
         return DefaultEmbeddingFunction()
tests/unit_tests/test_openai_embedding_function.py (1)

17-31: Centralize the is_openai_available() helper in test_utils.py.

This helper is duplicated identically in both test_qwen_embedding_function.py and test_openai_embedding_function.py. Moving it to test_utils.py where other test utilities like EnvGuard already live will eliminate duplication and prevent future drift.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 702f93c and 5c483b9.

📒 Files selected for processing (13)
  • .github/workflows/ci.yml
  • pyproject.toml
  • src/pyseekdb/client/embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py
  • src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py
  • tests/unit_tests/test_default_embedding_function.py
  • tests/unit_tests/test_env_guard.py
  • tests/unit_tests/test_openai_embedding_function.py
  • tests/unit_tests/test_qwen_embedding_function.py
  • tests/unit_tests/test_sentence_transformer_embedding_function.py
  • tests/unit_tests/test_utils.py
🧰 Additional context used
🧬 Code graph analysis (9)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
  • name (86-92)
  • get_config (94-105)
  • build_from_config (108-130)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
tests/unit_tests/test_sentence_transformer_embedding_function.py (1)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (4)
  • SentenceTransformerEmbeddingFunction (5-130)
  • name (86-92)
  • get_config (94-105)
  • build_from_config (108-130)
tests/unit_tests/test_openai_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
  • env_guard (134-152)
tests/unit_tests/test_qwen_embedding_function.py (1)
  • is_openai_available (20-27)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (4)
  • OpenAIEmbeddingFunction (13-166)
  • name (122-128)
  • get_config (130-136)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
tests/unit_tests/test_env_guard.py (1)
tests/unit_tests/test_utils.py (4)
  • EnvGuard (11-130)
  • env_guard (134-152)
  • save (51-67)
  • restore (96-117)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (4)
src/pyseekdb/client/embedding_function.py (2)
  • get_config (85-96)
  • get_config (503-504)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (1)
  • get_config (130-136)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)
  • get_config (128-134)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)
  • get_config (94-105)
tests/unit_tests/test_qwen_embedding_function.py (4)
tests/unit_tests/test_utils.py (1)
  • env_guard (134-152)
tests/unit_tests/test_openai_embedding_function.py (7)
  • is_openai_available (20-27)
  • test_name (370-372)
  • test_get_config_with_defaults (374-387)
  • test_get_config_with_dimensions (409-418)
  • test_build_from_config_with_defaults (420-437)
  • test_build_from_config_with_custom_values (439-458)
  • test_persistence_roundtrip (460-475)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (4)
  • QwenEmbeddingFunction (14-164)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
src/pyseekdb/client/embedding_function.py (4)
  • name (500-501)
  • get_config (85-96)
  • get_config (503-504)
  • build_from_config (507-508)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • name (122-128)
  • get_config (130-136)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • name (120-126)
  • get_config (128-134)
  • build_from_config (137-164)
tests/unit_tests/test_default_embedding_function.py (1)
src/pyseekdb/client/embedding_function.py (5)
  • DefaultEmbeddingFunction (115-511)
  • name (500-501)
  • get_config (85-96)
  • get_config (503-504)
  • build_from_config (507-508)
src/pyseekdb/client/embedding_function.py (4)
src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)
  • get_config (203-218)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py (3)
  • get_config (130-136)
  • name (122-128)
  • build_from_config (139-166)
src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (3)
  • get_config (128-134)
  • name (120-126)
  • build_from_config (137-164)
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (3)
  • get_config (94-105)
  • name (86-92)
  • build_from_config (108-130)
🪛 Ruff (0.14.11)
src/pyseekdb/utils/embedding_functions/openai_embedding_function.py

153-153: Avoid specifying long messages outside the exception class

(TRY003)

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py

151-151: Avoid specifying long messages outside the exception class

(TRY003)

tests/unit_tests/test_sentence_transformer_embedding_function.py

93-93: Possible hardcoded password assigned to argument: "use_auth_token"

(S106)


98-98: Possible hardcoded password assigned to: "use_auth_token"

(S105)

src/pyseekdb/client/embedding_function.py

507-507: Unused static method argument: config

(ARG004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: integration-test (server)
  • GitHub Check: integration-test (oceanbase)
  • GitHub Check: integration-test (embedded)
  • GitHub Check: unit-test
🔇 Additional comments (16)
.github/workflows/ci.yml (2)

28-30: No additional concerns for this change.


90-90: No additional concerns for this change.

src/pyseekdb/utils/embedding_functions/qwen_embedding_function.py (1)

119-163: LGTM — clean persistence hooks added.

name(), get_config(), and build_from_config() align with the new embedding function protocol and keep configuration restoration consistent.

src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py (1)

85-130: LGTM — consistent serialization/restore API.

Defaults and config round‑trip look correct and match the new protocol.

src/pyseekdb/utils/embedding_functions/openai_base_embedding_function.py (1)

203-218: Client_kwargs may contain non-serializable objects; however, no JSON serialization of configs occurs in the current implementation.

The current get_config() implementation returns _client_kwargs directly, which could theoretically contain non-serializable objects like httpx.Timeout (the OpenAI client does accept these). However, based on a comprehensive review of the codebase, embedding function configs are not persisted as JSON. Configs are only used in the in-memory roundtrip pattern: get_config()build_from_config(). Both operations work with Python dictionaries and do not require JSON serialization. JSON serialization in the client code is limited to metadata and document fields, not embedding function configurations. The suggested guard is not needed for the current architecture.

Likely an incorrect or invalid review comment.

tests/unit_tests/test_default_embedding_function.py (1)

18-58: Good persistence coverage.

Covers identifier, config shape, build_from_config, and roundtrip, which matches the new contract.

tests/unit_tests/test_env_guard.py (1)

15-112: Nice coverage of EnvGuard usage patterns.

The scenarios exercise restore semantics, nesting, and convenience helpers clearly.

tests/unit_tests/test_utils.py (1)

39-152: EnvGuard utility looks solid.

Clear save/set/restore semantics and a lightweight contextmanager wrapper.

tests/unit_tests/test_openai_embedding_function.py (1)

363-475: Nice coverage for config persistence and round‑trip behavior.
The tests cover name/get_config/build_from_config plus round‑trip. Please verify that OpenAIEmbeddingFunction initialization here remains side‑effect free (no network calls) when using dummy keys.

tests/unit_tests/test_qwen_embedding_function.py (2)

17-31: Guarding Qwen tests with the OpenAI client check is sensible.
This avoids import errors when the openai-compatible client isn’t installed. Please confirm the guard matches the actual import path used by QwenEmbeddingFunction.


340-449: Persistence test coverage for Qwen looks solid.
Defaults, custom values, build_from_config, and round‑trip are well covered. Please verify these remain API‑call free with env_guard so they can run offline.

tests/unit_tests/test_sentence_transformer_embedding_function.py (5)

1-40: Module setup and availability helpers are clear and robust.
The guards for sentence‑transformers and CUDA keep the suite resilient in limited environments. Please confirm CI images that should run these tests include the required deps.


42-85: get_config defaults and custom CUDA values are well exercised.
Good coverage for name/defaults and CUDA-path config serialization.


100-186: build_from_config defaults, partial configs, and round‑trip are nicely covered.
These tests make the persistence contract explicit.


187-302: CUDA‑specific persistence and embedding checks are well gated.
Skip conditions keep these tests safe on non‑CUDA systems while still validating behavior when available.


86-99: [Your rewritten review comment text here]
[Exactly ONE classification tag]

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread pyproject.toml Outdated
Comment thread src/pyseekdb/utils/embedding_functions/openai_embedding_function.py
@hnwyllmm hnwyllmm marked this pull request as draft January 16, 2026 05:29
@hnwyllmm hnwyllmm marked this pull request as ready for review January 19, 2026 01:13
@hnwyllmm hnwyllmm changed the title [Draft] support embedding function persistence support embedding function persistence Jan 19, 2026
@hnwyllmm hnwyllmm merged commit 1d284dc into oceanbase:develop Jan 19, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Remove the embedding_function argument in get_collection API

1 participant