Skip to content

support embedding function metadata persistence#119

Merged
hnwyllmm merged 6 commits into
oceanbase:developfrom
hnwyllmm:persistent-collection
Jan 20, 2026
Merged

support embedding function metadata persistence#119
hnwyllmm merged 6 commits into
oceanbase:developfrom
hnwyllmm:persistent-collection

Conversation

@hnwyllmm

@hnwyllmm hnwyllmm commented Jan 20, 2026

Copy link
Copy Markdown
Member

Summary

close #50
For details of solution, refer to doc.

Summary by CodeRabbit

  • New Features

    • Public registration API to persist and restore custom embedding functions (decorator + registry).
    • Dual-schema collection support with explicit collection identity, get_or_create/get_collection APIs, and collection_id in constructor.
    • Batch insertion via collection.add and unified handling so v1 and v2 collections can coexist.
  • Tests

    • New/expanded tests for embedding-function persistence, registry behavior, and v1 compatibility.
    • Integration tests updated to use higher-level collection APIs.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai

coderabbitai Bot commented Jan 20, 2026

Copy link
Copy Markdown

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds an embedding-function registry and decorator exported via package, persists/restores embedding functions in collection metadata, implements dual v1/v2 collection metadata flows in the client, and updates tests to use a batch collection.add API instead of raw SQL inserts.

Changes

Cohort / File(s) Summary
Public API Exports
src/pyseekdb/__init__.py, src/pyseekdb/client/__init__.py
Re-export register_embedding_function in package and client __all__ to expose the new decorator.
Embedding Function Registry & Decorator
src/pyseekdb/client/embedding_function.py
Add EmbeddingFunctionRegistry, lazy initialization and auto-registration, register_embedding_function decorator, EmbeddingFunction.support_persistence(), and change base get_config() to return NotImplemented.
Client — Dual v1/v2 Collection Flows
src/pyseekdb/client/client_base.py
Implement dual-path metadata (v1/v2) for create/get/delete/list/has; wire collection_id for v2; persist/restore embedding functions via registry/config; add get_or_create_collection(); adapt DML/query paths to resolve table names for both schemas.
Metadata / Naming Helpers
src/pyseekdb/client/meta_info.py
Add v2 prefix constant, table_name_v2(collection_id) and sdk_collections_table_name() helpers.
Tests — Embedding Function Coverage
tests/integration_tests/test_collection_embedding_function.py, tests/unit_tests/test_embedding_function_registry.py
Add integration and unit tests for registry behavior, decorator registration, persistence/restoration, error cases, and decorator idempotency.
Tests — V1 Compatibility Suite
tests/integration_tests/test_collection_v1_compatibility.py
New comprehensive integration tests verifying v1 collection behavior across creation, CRUD, queries, counts, embedding-function usage, and coexistence with v2.
Tests — Replace SQL Inserts with Collection API
tests/integration_tests/test_collection_get.py, tests/integration_tests/test_collection_hybrid_search.py, tests/integration_tests/test_collection_hybrid_search_builder_integration.py, tests/integration_tests/test_collection_query.py, tests/integration_tests/test_fulltext_parser_config.py
Replace manual SQL per-row inserts with collection.add(ids=..., embeddings=..., documents=..., metadatas=...); prefer v2 table naming when collection.id present; tests now pass explicit dimension where needed.
Tests — Client Creation Cleanup
tests/integration_tests/test_client_creation.py
Remove low-level table description/validation; rely on collection-level APIs.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DB as SDK_Metadata/Storage
    participant Registry as EmbeddingFunctionRegistry
    participant Collection

    Client->>DB: Query collection metadata by name (v2 sdk_collections or v1)
    DB-->>Client: Return metadata (may include embedding_function_name + config)
    alt embedding_function_name present
        Client->>Registry: get_class(embedding_function_name)
        Registry-->>Client: embedding class or None
        alt class found
            Client->>Client: instantiate via build_from_config(config)
        else not found
            Client-->>Client: raise error or require embedding_function param
        end
    else embedding_function provided on call
        Client->>Client: use provided embedding_function
    end
    Client->>Collection: construct Collection(name, collection_id?, embedding_function, metadata)
    Collection-->>Client: Collection instance ready for add/query/update/etc.
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I hopped through tables old and new,
I learned each function's name and view,
V1 tiptoes while V2 strides,
Registries tuck configs inside,
Carrots and configs — persistence anew!

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.96% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'support embedding function metadata persistence' directly and clearly describes the main change: implementing persistence of embedding function metadata to avoid requiring embedding_function arguments in get_collection calls.
Linked Issues check ✅ Passed The PR successfully implements all requirements from issue #50: adds embedding function registry for persistence [embedding_function.py], stores embedding function metadata in collections [client_base.py], retrieves and instantiates embedding functions on get_collection [client_base.py], supports dual-schema operations with v1/v2 metadata [client_base.py, meta_info.py], and provides public APIs (register_embedding_function decorator, EmbeddingFunctionRegistry, get_collection/get_or_create_collection with embedding_function parameter).
Out of Scope Changes check ✅ Passed All code changes are directly related to embedding function persistence: core infrastructure (embedding_function.py, client_base.py, meta_info.py), public API exports (init.py files), dual-schema support for v1/v2 compatibility, and comprehensive tests for persistence/restoration workflows, registry functionality, and v1/v2 coexistence.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for embedding function metadata persistence in the pyseekdb library. It introduces a v2 collection format that stores collection metadata (including embedding function configuration) in a dedicated sdk_collections table, enabling automatic restoration of embedding functions when collections are retrieved.

Changes:

  • Implemented EmbeddingFunctionRegistry and @register_embedding_function decorator for managing custom embedding functions
  • Introduced v2 collection format with metadata persistence in sdk_collections table
  • Maintained backward compatibility with v1 collections through dual-version support in collection operations

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 23 comments.

Show a summary per file
File Description
src/pyseekdb/client/embedding_function.py Added registry system and decorator for embedding function persistence
src/pyseekdb/client/client_base.py Implemented v2 collection metadata storage and dual-version collection management
src/pyseekdb/client/meta_info.py Added v2 table naming convention and sdk_collections table reference
src/pyseekdb/client/init.py Exported new register_embedding_function decorator
src/pyseekdb/init.py Exported new register_embedding_function decorator at package level
tests/unit_tests/test_embedding_function_registry.py Comprehensive unit tests for registry and decorator functionality
tests/integration_tests/test_collection_embedding_function.py Integration tests for embedding function persistence
tests/integration_tests/test_collection_v1_compatibility.py Tests verifying v1 collection backward compatibility
tests/integration_tests/test_fulltext_parser_config.py Updated to support both v1 and v2 table naming
tests/integration_tests/test_collection_query.py Refactored to use collection.add() instead of raw SQL
tests/integration_tests/test_collection_hybrid_search_builder_integration.py Refactored test data insertion to use collection API
tests/integration_tests/test_collection_hybrid_search.py Refactored test data insertion to use collection API
tests/integration_tests/test_collection_get.py Refactored test data insertion to use collection API
tests/integration_tests/test_client_creation.py Removed manual table verification code
Comments suppressed due to low confidence (1)

src/pyseekdb/client/client_base.py:1185

  • This SQL statement is vulnerable to SQL injection. The name parameter is directly interpolated into the SQL string without proper escaping or parameterization.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/pyseekdb/client/client_base.py
Comment thread src/pyseekdb/client/client_base.py
Comment thread src/pyseekdb/client/embedding_function.py
Comment thread src/pyseekdb/client/client_base.py Outdated
Comment thread src/pyseekdb/client/client_base.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/pyseekdb/client/client_base.py (1)

870-887: Normalize ip distance into metadata.

You normalize ipinner_product, but metadata["distance"] stays as ip, so query distance mapping falls back to l2. Update the stored metadata with the normalized value.

🛠️ Suggested fix
-                if distance_match:
-                    distance = metadata["distance"] = distance_match.group(2).lower()
-                    # Normalize distance values
-                    if distance == "ip":
-                        distance = "inner_product"
-                    elif distance in ["l2", "cosine", "inner_product"]:
-                        pass
-                    else:
-                        # Unknown distance, default to None
-                        logger.warning(
-                            f"Unknown distance value '{distance}' in CREATE TABLE statement, defaulting to None"
-                        )
-                        metadata["distance"] = None
+                if distance_match:
+                    distance = distance_match.group(2).lower()
+                    # Normalize distance values
+                    if distance == "ip":
+                        distance = "inner_product"
+                    if distance in ["l2", "cosine", "inner_product"]:
+                        metadata["distance"] = distance
+                    else:
+                        # Unknown distance, default to None
+                        logger.warning(
+                            f"Unknown distance value '{distance}' in CREATE TABLE statement, defaulting to None"
+                        )
+                        metadata["distance"] = None
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Around line 921-926: The current logic replaces a persisted None with the
default EF and also blocks explicit overrides; change it so that when
embedding_function is _NOT_PROVIDED you return embedding_function_persistence
as-is (including None) rather than substituting
get_default_embedding_function(), and only call get_default_embedding_function()
if both embedding_function and embedding_function_persistence are _NOT_PROVIDED;
keep the existing conflict check between embedding_function and
embedding_function_persistence (use the symbols embedding_function_persistence,
embedding_function, _NOT_PROVIDED, and get_default_embedding_function to locate
the code).
- Around line 742-767: The method _create_collection_meta_v2 is vulnerable
because it interpolates collection_name and settings_str into SQL (via
CollectionNames.sdk_collections_table_name() and _execute) causing SQL injection
and will raise an undefined variable e when creation fails; change these queries
to use parameterized SQL execution (or properly escape/quote identifiers and use
bind parameters for values) instead of f-strings, pass collection_name and
settings_str as parameters to _execute (or your DB wrapper), and in the error
path replace the undefined e with a caught exception message (catch Exception as
e and raise or log with str(e)); apply the same parameterization/escaping fixes
to _get_collection_v2, _has_collection_v2, and drop_collection wherever
collection names or JSON settings are used in SQL.

In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 624-629: Remove the unused variable assignments for
created_collection, coll1, and coll2 (or replace them with a single underscore
_) so the calls to db_client.create_collection(...) remain for their side
effects but do not assign unused names; specifically update the calls that
currently assign to created_collection, coll1, and coll2 to either drop the
assignment or use "_" to satisfy Ruff while keeping the same arguments and
embedding_function usage.
- Line 405: Several print statements use redundant f-strings (no placeholders)
causing Ruff F541; remove the unnecessary `f` prefix so they are plain string
literals. Locate the print calls like print(f"\n✅ Testing custom embedding
function persistence and restoration") and the similar prints at the other noted
locations (lines showing print(f"...") around 500, 567, 587, 653, 673, 761, 781,
833) and change them to print("\n✅ Testing custom embedding function persistence
and restoration") (and likewise for each identical occurrence) to eliminate the
unused f-strings.
- Around line 770-833: The test currently registers
UnregisteredEmbeddingFunction (via register_embedding_function) then retrieves
the collection, so it never exercises the unregistered restoration error path;
after creating the collection with
db_client.create_collection(name=collection_name, configuration=config,
embedding_function=ef), clear the registry entry (e.g.,
EmbeddingFunctionRegistry._registry.clear() or delete the specific key) before
calling db_client.get_collection(name=collection_name) and assert that calling
get_collection raises a ValueError whose message contains "Embedding function
class 'unregistered_embedding' not found" to validate proper error handling
during restoration.

In `@tests/integration_tests/test_collection_v1_compatibility.py`:
- Around line 423-431: The test_v1_collection_table_name_format assigns the
result of db_client.create_collection to an unused variable named collection
which triggers a lint F841; remove the unused assignment and call
db_client.create_collection(...) without assigning it (or assign to _ if you
prefer a convention), keeping the call parameters name=collection_name and
_collection_version=1 so the side-effect still occurs; update the line in
test_v1_collection_table_name_format accordingly.
🧹 Nitpick comments (4)
tests/integration_tests/test_fulltext_parser_config.py (1)

49-52: Consider extracting a helper for v1/v2 table name resolution.

Same conditional appears in three methods; a small helper would reduce duplication and keep naming logic consistent.

Also applies to: 131-134, 180-183

tests/unit_tests/test_embedding_function_registry.py (1)

95-104: Silence unused-parameter lint in stub embedding functions.

Several stubs ignore input/config, which typically triggers unused-arg lint. Consider prefixing with _ (or adding a noqa) across these stubs.

♻️ Suggested tweak (apply similarly to other stubs)
-            def __call__(self, input: Documents) -> Embeddings:
+            def __call__(self, _input: Documents) -> Embeddings:
                 return [[0.1, 0.2, 0.3]]

-            def build_from_config(config: Dict[str, Any]) -> "InvalidEmbeddingFunction":
+            def build_from_config(_config: Dict[str, Any]) -> "InvalidEmbeddingFunction":
                 return InvalidEmbeddingFunction()
src/pyseekdb/client/embedding_function.py (1)

731-736: Validate get_config during registry registration.

Docs require get_config, but register only checks name/build_from_config. Adding the check fails fast for non‑persistable classes and aligns with the contract.

♻️ Suggested fix
-        if (not hasattr(embedding_function_class, "name")
-            or not hasattr(embedding_function_class, "build_from_config")):
+        if (not hasattr(embedding_function_class, "name")
+            or not hasattr(embedding_function_class, "build_from_config")
+            or not hasattr(embedding_function_class, "get_config")):
tests/integration_tests/test_collection_embedding_function.py (1)

407-442: Ensure embedding-function registration is idempotent across parametrized runs.
These decorators execute on every parametrized run. If the registry rejects duplicate names, this can flake. Consider registering once at module scope or guarding registration with a registry lookup/try-if-absent.

Also applies to: 589-618, 675-724

Comment thread src/pyseekdb/client/client_base.py Outdated
Comment thread src/pyseekdb/client/client_base.py Outdated
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py Outdated
Comment thread tests/integration_tests/test_collection_v1_compatibility.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 110-125: The initial falsy check in support_persistence
incorrectly treats objects with custom __bool__/__len__ as missing; change the
guard from "if not embedding_function" to an explicit None check ("if
embedding_function is None") in support_persistence(embedding_function: Any),
keep the subsequent attribute checks (name, build_from_config, get_config) and
the get_config() comparison to NotImplemented unchanged so the function still
returns False for None or non-conforming objects and True only for valid,
persistent-capable embedding_function instances.

In `@tests/unit_tests/test_embedding_function_registry.py`:
- Around line 66-188: Tests define several embedding function classes (e.g.,
TestEmbeddingFunction.__call__, InvalidEmbeddingFunction.__call__/get_config,
FirstEmbeddingFunction, SecondEmbeddingFunction.build_from_config, and other
build_from_config/get_config methods) that declare parameters they never use,
causing Ruff ARG002/ARG004; fix by renaming unused parameters to begin with an
underscore (e.g., input -> _input, config -> _config) or add a per-parameter
noqa comment, ensuring you update all relevant methods (__call__, get_config,
build_from_config) across the test classes so the linter no longer flags
unused-argument warnings.
♻️ Duplicate comments (5)
tests/integration_tests/test_collection_v1_compatibility.py (1)

428-431: Drop unused collection assignment
This variable is unused and triggers Ruff F841.

🧹 Suggested fix
-        collection = db_client.create_collection(
+        db_client.create_collection(
             name=collection_name,
             _collection_version=1,
         )
src/pyseekdb/client/client_base.py (2)

756-767: Escape collection_name/settings_str in v2 metadata SQL (and fix undefined e)
Unescaped values allow SQL injection and break JSON quoting. Line 766 also references undefined e. Apply escaping (or parameterization) here and in the other v2 SQL statements listed.

🛠️ Suggested fix (core pattern)
-            settings_str = json.dumps(settings)
+            settings_str = json.dumps(settings, ensure_ascii=False)
+            collection_name_escaped = escape_string(collection_name)
+            settings_str_escaped = escape_string(settings_str)
@@
-            delete_sql = f"DELETE FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name}'"
+            delete_sql = f"DELETE FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name_escaped}'"
             self._execute(delete_sql)
-            insert_sql = f"INSERT INTO {CollectionNames.sdk_collections_table_name()} (COLLECTION_NAME, SETTINGS) VALUES ('{collection_name}', '{settings_str}')"
+            insert_sql = f"INSERT INTO {CollectionNames.sdk_collections_table_name()} (COLLECTION_NAME, SETTINGS) VALUES ('{collection_name_escaped}', '{settings_str_escaped}')"
             self._execute(insert_sql)
-            query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name}'"
+            query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name_escaped}'"
             rows = self._execute(query_sql)
-            if not rows or len(rows) == 0:
-                raise ValueError(f"Failed to create collection metadata: {e}")
+            if not rows:
+                raise ValueError("Failed to create collection metadata: no COLLECTION_ID returned")

Also applies to: 948-949, 1042-1043, 1204-1205


928-940: Persisted None embedding functions should not be replaced by default
When no persistence exists and the caller doesn’t pass an embedding function, this returns DefaultEmbeddingFunction, which can silently change behavior and cause dimension mismatches for collections created with embedding_function=None.

🛠️ Suggested fix
-        if embedding_function is _NOT_PROVIDED:
-            return (
-                embedding_function_persistence
-                if embedding_function_persistence is not None
-                else get_default_embedding_function()
-            )
-        else:
-            return embedding_function
+        if embedding_function is _NOT_PROVIDED:
+            return embedding_function_persistence
+        return embedding_function
tests/integration_tests/test_collection_embedding_function.py (2)

641-645: Remove unused collection assignments in tests
created_collection, coll1, and coll2 are assigned but unused (Ruff F841). Drop the assignment or use _.

🧹 Example fix (apply similarly)
-        created_collection = db_client.create_collection(
+        db_client.create_collection(
             name=collection_name,
             configuration=config,
             embedding_function=custom_ef,
         )

Also applies to: 746-758, 844-848


13-18: Exercise the “unregistered embedding function” error path
The test registers the class and then retrieves successfully, so it never validates the failure case described in the docstring. Remove the registry entry before get_collection() and assert the error.

🛠️ Suggested fix
-from pyseekdb.client.embedding_function import (
+from pyseekdb.client.embedding_function import (
     EmbeddingFunction,
     Documents,
     Embeddings,
     register_embedding_function,
+    EmbeddingFunctionRegistry,
 )
@@
-            # Actually, since we registered it, getting should work
-            retrieved_collection = db_client.get_collection(name=collection_name)
-            assert retrieved_collection.embedding_function is not None
+            # Remove from registry to simulate "unregistered" at restore time
+            EmbeddingFunctionRegistry._registry.pop("unregistered_embedding", None)
+
+            with pytest.raises(ValueError, match="not found"):
+                db_client.get_collection(name=collection_name)

Also applies to: 838-856

🧹 Nitpick comments (1)
src/pyseekdb/client/embedding_function.py (1)

15-26: Annotate registry state as ClassVar
This is class-level shared state; adding ClassVar avoids Ruff RUF012 and clarifies intent.

♻️ Suggested fix
-from typing import (
+from typing import (
     List,
     Protocol,
     Union,
     runtime_checkable,
     Optional,
     TypeVar,
     cast,
     Any,
     Dict,
     Type,
+    ClassVar,
 )
@@
-    _registry: Dict[str, type] = {}
-    _initialized: bool = False
+    _registry: ClassVar[Dict[str, type]] = {}
+    _initialized: ClassVar[bool] = False

Also applies to: 663-664

Comment thread tests/unit_tests/test_embedding_function_registry.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 110-125: The support_persistence function currently calls
embedding_function.get_config() directly which can raise and abort collection
creation; modify support_persistence to call get_config() inside a try/except
(catch Exception) and treat any exception as "not persistable" (return False),
and only proceed to compare the returned value to NotImplemented if no exception
occurred; reference: support_persistence(embedding_function) and
embedding_function.get_config().

In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 409-445: The test defines TestPersistentEmbeddingFunction and
decorates it with register_embedding_function each parametrized run, causing
duplicate registration on subsequent runs; fix by resetting the
embedding-function registry between runs—add a setup_method in the test class
that calls the registry reset/clear API used by register_embedding_function (or
import the embedding_function_registry and call its clear/reset method) so
TestPersistentEmbeddingFunction can re-register, or alternatively move the class
to module scope or make TestPersistentEmbeddingFunction.name() return a unique
name per run (e.g., append a random/suffix).

In `@tests/unit_tests/test_embedding_function_registry.py`:
- Around line 325-329: The build_from_config static method in
TypedEmbeddingFunction incorrectly references an undefined name config; update
TypedEmbeddingFunction.build_from_config to use the provided parameter _config
(e.g., _config.get("model_name", "test")) when constructing and returning the
TypedEmbeddingFunction so the lookup uses the actual argument instead of the
undefined variable.
♻️ Duplicate comments (3)
src/pyseekdb/client/client_base.py (3)

931-945: Preserve persisted “no embedding function” instead of defaulting.

If a collection was created with embedding_function=None, there’s no persisted EF and _validate_embedding_function() currently substitutes DefaultEmbeddingFunction, which can cause dimension mismatches and unexpected embedding behavior.

🛠️ Suggested fix
-        if embedding_function is _NOT_PROVIDED:
-            return (
-                embedding_function_persistence
-                if embedding_function_persistence is not None
-                else get_default_embedding_function()
-            )
-        else:
-            return embedding_function
+        if embedding_function is _NOT_PROVIDED:
+            return embedding_function_persistence
+        return embedding_function

1045-1047: Escape collection name in _delete_collection_v2 to prevent SQL injection.

collection_name is user input and is interpolated directly into SQL.

🔒 Suggested fix
-        query_sql = f"DELETE FROM `{CollectionNames.sdk_collections_table_name()}` WHERE COLLECTION_NAME = '{name}'"
+        name_escaped = escape_string(name)
+        query_sql = (
+            f"DELETE FROM `{CollectionNames.sdk_collections_table_name()}` "
+            f"WHERE COLLECTION_NAME = '{name_escaped}'"
+        )

1208-1209: Escape collection name in _has_collection_v2 queries.

This query uses raw user input; it should be escaped or parameterized.

🔒 Suggested fix
-            query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{name}'"
+            name_escaped = escape_string(name)
+            query_sql = (
+                f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} "
+                f"WHERE COLLECTION_NAME = '{name_escaped}'"
+            )
🧹 Nitpick comments (1)
src/pyseekdb/client/embedding_function.py (1)

663-665: Annotate registry class attributes with ClassVar.

These are class-level singletons; annotating them as ClassVar improves typing clarity and matches Ruff’s guidance.

♻️ Suggested refactor
-from typing import (
+from typing import (
     List,
     Protocol,
     Union,
     runtime_checkable,
     Optional,
     TypeVar,
     cast,
     Any,
     Dict,
     Type,
+    ClassVar,
 )
@@
-    _registry: Dict[str, type] = {}
-    _initialized: bool = False
+    _registry: ClassVar[Dict[str, type]] = {}
+    _initialized: ClassVar[bool] = False

Comment thread src/pyseekdb/client/embedding_function.py
Comment thread tests/integration_tests/test_collection_embedding_function.py
Comment thread tests/unit_tests/test_embedding_function_registry.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 894-931: The test creates two collections with the same
timestamp-based collection_name and never deletes them; change to generate
unique names (e.g., append a UUID or distinct suffix) for each create_collection
call and wrap the operations in a try/finally so you call
db_client.delete_collection(...) for each created collection (references:
collection_name variable, db_client.create_collection, db_client.get_collection,
db_client.get_or_create_collection, Simple3DEmbeddingFunction,
HNSWConfiguration) to ensure cleanup and avoid name collisions/leaks.
♻️ Duplicate comments (1)
tests/integration_tests/test_collection_embedding_function.py (1)

429-466: Reset the embedding-function registry between parametrized runs.
These test-scoped classes register fixed name() values and can collide across parametrized db_client runs.

♻️ Suggested fix
 class TestCollectionEmbeddingFunction:
     """Test collection creation with embedding function handling using parameterized db_client fixture"""
+    def setup_method(self):
+        EmbeddingFunctionRegistry._registry.clear()
+        EmbeddingFunctionRegistry._initialized = False

Comment thread tests/integration_tests/test_collection_embedding_function.py
@hnwyllmm hnwyllmm merged commit 8650ab7 into oceanbase:develop Jan 20, 2026
7 checks passed
@hnwyllmm hnwyllmm deleted the persistent-collection branch January 20, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Remove the embedding_function argument in get_collection API

2 participants