support embedding function metadata persistence#119
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds an embedding-function registry and decorator exported via package, persists/restores embedding functions in collection metadata, implements dual v1/v2 collection metadata flows in the client, and updates tests to use a batch collection.add API instead of raw SQL inserts. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant DB as SDK_Metadata/Storage
participant Registry as EmbeddingFunctionRegistry
participant Collection
Client->>DB: Query collection metadata by name (v2 sdk_collections or v1)
DB-->>Client: Return metadata (may include embedding_function_name + config)
alt embedding_function_name present
Client->>Registry: get_class(embedding_function_name)
Registry-->>Client: embedding class or None
alt class found
Client->>Client: instantiate via build_from_config(config)
else not found
Client-->>Client: raise error or require embedding_function param
end
else embedding_function provided on call
Client->>Client: use provided embedding_function
end
Client->>Collection: construct Collection(name, collection_id?, embedding_function, metadata)
Collection-->>Client: Collection instance ready for add/query/update/etc.
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This pull request adds support for embedding function metadata persistence in the pyseekdb library. It introduces a v2 collection format that stores collection metadata (including embedding function configuration) in a dedicated sdk_collections table, enabling automatic restoration of embedding functions when collections are retrieved.
Changes:
- Implemented
EmbeddingFunctionRegistryand@register_embedding_functiondecorator for managing custom embedding functions - Introduced v2 collection format with metadata persistence in
sdk_collectionstable - Maintained backward compatibility with v1 collections through dual-version support in collection operations
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 23 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pyseekdb/client/embedding_function.py | Added registry system and decorator for embedding function persistence |
| src/pyseekdb/client/client_base.py | Implemented v2 collection metadata storage and dual-version collection management |
| src/pyseekdb/client/meta_info.py | Added v2 table naming convention and sdk_collections table reference |
| src/pyseekdb/client/init.py | Exported new register_embedding_function decorator |
| src/pyseekdb/init.py | Exported new register_embedding_function decorator at package level |
| tests/unit_tests/test_embedding_function_registry.py | Comprehensive unit tests for registry and decorator functionality |
| tests/integration_tests/test_collection_embedding_function.py | Integration tests for embedding function persistence |
| tests/integration_tests/test_collection_v1_compatibility.py | Tests verifying v1 collection backward compatibility |
| tests/integration_tests/test_fulltext_parser_config.py | Updated to support both v1 and v2 table naming |
| tests/integration_tests/test_collection_query.py | Refactored to use collection.add() instead of raw SQL |
| tests/integration_tests/test_collection_hybrid_search_builder_integration.py | Refactored test data insertion to use collection API |
| tests/integration_tests/test_collection_hybrid_search.py | Refactored test data insertion to use collection API |
| tests/integration_tests/test_collection_get.py | Refactored test data insertion to use collection API |
| tests/integration_tests/test_client_creation.py | Removed manual table verification code |
Comments suppressed due to low confidence (1)
src/pyseekdb/client/client_base.py:1185
- This SQL statement is vulnerable to SQL injection. The name parameter is directly interpolated into the SQL string without proper escaping or parameterization.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pyseekdb/client/client_base.py (1)
870-887: Normalizeipdistance into metadata.You normalize
ip→inner_product, butmetadata["distance"]stays asip, so query distance mapping falls back tol2. Update the stored metadata with the normalized value.🛠️ Suggested fix
- if distance_match: - distance = metadata["distance"] = distance_match.group(2).lower() - # Normalize distance values - if distance == "ip": - distance = "inner_product" - elif distance in ["l2", "cosine", "inner_product"]: - pass - else: - # Unknown distance, default to None - logger.warning( - f"Unknown distance value '{distance}' in CREATE TABLE statement, defaulting to None" - ) - metadata["distance"] = None + if distance_match: + distance = distance_match.group(2).lower() + # Normalize distance values + if distance == "ip": + distance = "inner_product" + if distance in ["l2", "cosine", "inner_product"]: + metadata["distance"] = distance + else: + # Unknown distance, default to None + logger.warning( + f"Unknown distance value '{distance}' in CREATE TABLE statement, defaulting to None" + ) + metadata["distance"] = None
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Around line 921-926: The current logic replaces a persisted None with the
default EF and also blocks explicit overrides; change it so that when
embedding_function is _NOT_PROVIDED you return embedding_function_persistence
as-is (including None) rather than substituting
get_default_embedding_function(), and only call get_default_embedding_function()
if both embedding_function and embedding_function_persistence are _NOT_PROVIDED;
keep the existing conflict check between embedding_function and
embedding_function_persistence (use the symbols embedding_function_persistence,
embedding_function, _NOT_PROVIDED, and get_default_embedding_function to locate
the code).
- Around line 742-767: The method _create_collection_meta_v2 is vulnerable
because it interpolates collection_name and settings_str into SQL (via
CollectionNames.sdk_collections_table_name() and _execute) causing SQL injection
and will raise an undefined variable e when creation fails; change these queries
to use parameterized SQL execution (or properly escape/quote identifiers and use
bind parameters for values) instead of f-strings, pass collection_name and
settings_str as parameters to _execute (or your DB wrapper), and in the error
path replace the undefined e with a caught exception message (catch Exception as
e and raise or log with str(e)); apply the same parameterization/escaping fixes
to _get_collection_v2, _has_collection_v2, and drop_collection wherever
collection names or JSON settings are used in SQL.
In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 624-629: Remove the unused variable assignments for
created_collection, coll1, and coll2 (or replace them with a single underscore
_) so the calls to db_client.create_collection(...) remain for their side
effects but do not assign unused names; specifically update the calls that
currently assign to created_collection, coll1, and coll2 to either drop the
assignment or use "_" to satisfy Ruff while keeping the same arguments and
embedding_function usage.
- Line 405: Several print statements use redundant f-strings (no placeholders)
causing Ruff F541; remove the unnecessary `f` prefix so they are plain string
literals. Locate the print calls like print(f"\n✅ Testing custom embedding
function persistence and restoration") and the similar prints at the other noted
locations (lines showing print(f"...") around 500, 567, 587, 653, 673, 761, 781,
833) and change them to print("\n✅ Testing custom embedding function persistence
and restoration") (and likewise for each identical occurrence) to eliminate the
unused f-strings.
- Around line 770-833: The test currently registers
UnregisteredEmbeddingFunction (via register_embedding_function) then retrieves
the collection, so it never exercises the unregistered restoration error path;
after creating the collection with
db_client.create_collection(name=collection_name, configuration=config,
embedding_function=ef), clear the registry entry (e.g.,
EmbeddingFunctionRegistry._registry.clear() or delete the specific key) before
calling db_client.get_collection(name=collection_name) and assert that calling
get_collection raises a ValueError whose message contains "Embedding function
class 'unregistered_embedding' not found" to validate proper error handling
during restoration.
In `@tests/integration_tests/test_collection_v1_compatibility.py`:
- Around line 423-431: The test_v1_collection_table_name_format assigns the
result of db_client.create_collection to an unused variable named collection
which triggers a lint F841; remove the unused assignment and call
db_client.create_collection(...) without assigning it (or assign to _ if you
prefer a convention), keeping the call parameters name=collection_name and
_collection_version=1 so the side-effect still occurs; update the line in
test_v1_collection_table_name_format accordingly.
🧹 Nitpick comments (4)
tests/integration_tests/test_fulltext_parser_config.py (1)
49-52: Consider extracting a helper for v1/v2 table name resolution.Same conditional appears in three methods; a small helper would reduce duplication and keep naming logic consistent.
Also applies to: 131-134, 180-183
tests/unit_tests/test_embedding_function_registry.py (1)
95-104: Silence unused-parameter lint in stub embedding functions.Several stubs ignore
input/config, which typically triggers unused-arg lint. Consider prefixing with_(or adding a noqa) across these stubs.♻️ Suggested tweak (apply similarly to other stubs)
- def __call__(self, input: Documents) -> Embeddings: + def __call__(self, _input: Documents) -> Embeddings: return [[0.1, 0.2, 0.3]] - def build_from_config(config: Dict[str, Any]) -> "InvalidEmbeddingFunction": + def build_from_config(_config: Dict[str, Any]) -> "InvalidEmbeddingFunction": return InvalidEmbeddingFunction()src/pyseekdb/client/embedding_function.py (1)
731-736: Validateget_configduring registry registration.Docs require
get_config, butregisteronly checksname/build_from_config. Adding the check fails fast for non‑persistable classes and aligns with the contract.♻️ Suggested fix
- if (not hasattr(embedding_function_class, "name") - or not hasattr(embedding_function_class, "build_from_config")): + if (not hasattr(embedding_function_class, "name") + or not hasattr(embedding_function_class, "build_from_config") + or not hasattr(embedding_function_class, "get_config")):tests/integration_tests/test_collection_embedding_function.py (1)
407-442: Ensure embedding-function registration is idempotent across parametrized runs.
These decorators execute on every parametrized run. If the registry rejects duplicate names, this can flake. Consider registering once at module scope or guarding registration with a registry lookup/try-if-absent.Also applies to: 589-618, 675-724
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 110-125: The initial falsy check in support_persistence
incorrectly treats objects with custom __bool__/__len__ as missing; change the
guard from "if not embedding_function" to an explicit None check ("if
embedding_function is None") in support_persistence(embedding_function: Any),
keep the subsequent attribute checks (name, build_from_config, get_config) and
the get_config() comparison to NotImplemented unchanged so the function still
returns False for None or non-conforming objects and True only for valid,
persistent-capable embedding_function instances.
In `@tests/unit_tests/test_embedding_function_registry.py`:
- Around line 66-188: Tests define several embedding function classes (e.g.,
TestEmbeddingFunction.__call__, InvalidEmbeddingFunction.__call__/get_config,
FirstEmbeddingFunction, SecondEmbeddingFunction.build_from_config, and other
build_from_config/get_config methods) that declare parameters they never use,
causing Ruff ARG002/ARG004; fix by renaming unused parameters to begin with an
underscore (e.g., input -> _input, config -> _config) or add a per-parameter
noqa comment, ensuring you update all relevant methods (__call__, get_config,
build_from_config) across the test classes so the linter no longer flags
unused-argument warnings.
♻️ Duplicate comments (5)
tests/integration_tests/test_collection_v1_compatibility.py (1)
428-431: Drop unusedcollectionassignment
This variable is unused and triggers Ruff F841.🧹 Suggested fix
- collection = db_client.create_collection( + db_client.create_collection( name=collection_name, _collection_version=1, )src/pyseekdb/client/client_base.py (2)
756-767: Escape collection_name/settings_str in v2 metadata SQL (and fix undefinede)
Unescaped values allow SQL injection and break JSON quoting. Line 766 also references undefinede. Apply escaping (or parameterization) here and in the other v2 SQL statements listed.🛠️ Suggested fix (core pattern)
- settings_str = json.dumps(settings) + settings_str = json.dumps(settings, ensure_ascii=False) + collection_name_escaped = escape_string(collection_name) + settings_str_escaped = escape_string(settings_str) @@ - delete_sql = f"DELETE FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name}'" + delete_sql = f"DELETE FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name_escaped}'" self._execute(delete_sql) - insert_sql = f"INSERT INTO {CollectionNames.sdk_collections_table_name()} (COLLECTION_NAME, SETTINGS) VALUES ('{collection_name}', '{settings_str}')" + insert_sql = f"INSERT INTO {CollectionNames.sdk_collections_table_name()} (COLLECTION_NAME, SETTINGS) VALUES ('{collection_name_escaped}', '{settings_str_escaped}')" self._execute(insert_sql) - query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name}'" + query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{collection_name_escaped}'" rows = self._execute(query_sql) - if not rows or len(rows) == 0: - raise ValueError(f"Failed to create collection metadata: {e}") + if not rows: + raise ValueError("Failed to create collection metadata: no COLLECTION_ID returned")Also applies to: 948-949, 1042-1043, 1204-1205
928-940: PersistedNoneembedding functions should not be replaced by default
When no persistence exists and the caller doesn’t pass an embedding function, this returns DefaultEmbeddingFunction, which can silently change behavior and cause dimension mismatches for collections created withembedding_function=None.🛠️ Suggested fix
- if embedding_function is _NOT_PROVIDED: - return ( - embedding_function_persistence - if embedding_function_persistence is not None - else get_default_embedding_function() - ) - else: - return embedding_function + if embedding_function is _NOT_PROVIDED: + return embedding_function_persistence + return embedding_functiontests/integration_tests/test_collection_embedding_function.py (2)
641-645: Remove unused collection assignments in tests
created_collection,coll1, andcoll2are assigned but unused (Ruff F841). Drop the assignment or use_.🧹 Example fix (apply similarly)
- created_collection = db_client.create_collection( + db_client.create_collection( name=collection_name, configuration=config, embedding_function=custom_ef, )Also applies to: 746-758, 844-848
13-18: Exercise the “unregistered embedding function” error path
The test registers the class and then retrieves successfully, so it never validates the failure case described in the docstring. Remove the registry entry beforeget_collection()and assert the error.🛠️ Suggested fix
-from pyseekdb.client.embedding_function import ( +from pyseekdb.client.embedding_function import ( EmbeddingFunction, Documents, Embeddings, register_embedding_function, + EmbeddingFunctionRegistry, ) @@ - # Actually, since we registered it, getting should work - retrieved_collection = db_client.get_collection(name=collection_name) - assert retrieved_collection.embedding_function is not None + # Remove from registry to simulate "unregistered" at restore time + EmbeddingFunctionRegistry._registry.pop("unregistered_embedding", None) + + with pytest.raises(ValueError, match="not found"): + db_client.get_collection(name=collection_name)Also applies to: 838-856
🧹 Nitpick comments (1)
src/pyseekdb/client/embedding_function.py (1)
15-26: Annotate registry state as ClassVar
This is class-level shared state; addingClassVaravoids Ruff RUF012 and clarifies intent.♻️ Suggested fix
-from typing import ( +from typing import ( List, Protocol, Union, runtime_checkable, Optional, TypeVar, cast, Any, Dict, Type, + ClassVar, ) @@ - _registry: Dict[str, type] = {} - _initialized: bool = False + _registry: ClassVar[Dict[str, type]] = {} + _initialized: ClassVar[bool] = FalseAlso applies to: 663-664
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 110-125: The support_persistence function currently calls
embedding_function.get_config() directly which can raise and abort collection
creation; modify support_persistence to call get_config() inside a try/except
(catch Exception) and treat any exception as "not persistable" (return False),
and only proceed to compare the returned value to NotImplemented if no exception
occurred; reference: support_persistence(embedding_function) and
embedding_function.get_config().
In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 409-445: The test defines TestPersistentEmbeddingFunction and
decorates it with register_embedding_function each parametrized run, causing
duplicate registration on subsequent runs; fix by resetting the
embedding-function registry between runs—add a setup_method in the test class
that calls the registry reset/clear API used by register_embedding_function (or
import the embedding_function_registry and call its clear/reset method) so
TestPersistentEmbeddingFunction can re-register, or alternatively move the class
to module scope or make TestPersistentEmbeddingFunction.name() return a unique
name per run (e.g., append a random/suffix).
In `@tests/unit_tests/test_embedding_function_registry.py`:
- Around line 325-329: The build_from_config static method in
TypedEmbeddingFunction incorrectly references an undefined name config; update
TypedEmbeddingFunction.build_from_config to use the provided parameter _config
(e.g., _config.get("model_name", "test")) when constructing and returning the
TypedEmbeddingFunction so the lookup uses the actual argument instead of the
undefined variable.
♻️ Duplicate comments (3)
src/pyseekdb/client/client_base.py (3)
931-945: Preserve persisted “no embedding function” instead of defaulting.If a collection was created with
embedding_function=None, there’s no persisted EF and_validate_embedding_function()currently substitutesDefaultEmbeddingFunction, which can cause dimension mismatches and unexpected embedding behavior.🛠️ Suggested fix
- if embedding_function is _NOT_PROVIDED: - return ( - embedding_function_persistence - if embedding_function_persistence is not None - else get_default_embedding_function() - ) - else: - return embedding_function + if embedding_function is _NOT_PROVIDED: + return embedding_function_persistence + return embedding_function
1045-1047: Escape collection name in_delete_collection_v2to prevent SQL injection.
collection_nameis user input and is interpolated directly into SQL.🔒 Suggested fix
- query_sql = f"DELETE FROM `{CollectionNames.sdk_collections_table_name()}` WHERE COLLECTION_NAME = '{name}'" + name_escaped = escape_string(name) + query_sql = ( + f"DELETE FROM `{CollectionNames.sdk_collections_table_name()}` " + f"WHERE COLLECTION_NAME = '{name_escaped}'" + )
1208-1209: Escape collection name in_has_collection_v2queries.This query uses raw user input; it should be escaped or parameterized.
🔒 Suggested fix
- query_sql = f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} WHERE COLLECTION_NAME = '{name}'" + name_escaped = escape_string(name) + query_sql = ( + f"SELECT COLLECTION_ID FROM {CollectionNames.sdk_collections_table_name()} " + f"WHERE COLLECTION_NAME = '{name_escaped}'" + )
🧹 Nitpick comments (1)
src/pyseekdb/client/embedding_function.py (1)
663-665: Annotate registry class attributes withClassVar.These are class-level singletons; annotating them as
ClassVarimproves typing clarity and matches Ruff’s guidance.♻️ Suggested refactor
-from typing import ( +from typing import ( List, Protocol, Union, runtime_checkable, Optional, TypeVar, cast, Any, Dict, Type, + ClassVar, ) @@ - _registry: Dict[str, type] = {} - _initialized: bool = False + _registry: ClassVar[Dict[str, type]] = {} + _initialized: ClassVar[bool] = False
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/integration_tests/test_collection_embedding_function.py`:
- Around line 894-931: The test creates two collections with the same
timestamp-based collection_name and never deletes them; change to generate
unique names (e.g., append a UUID or distinct suffix) for each create_collection
call and wrap the operations in a try/finally so you call
db_client.delete_collection(...) for each created collection (references:
collection_name variable, db_client.create_collection, db_client.get_collection,
db_client.get_or_create_collection, Simple3DEmbeddingFunction,
HNSWConfiguration) to ensure cleanup and avoid name collisions/leaks.
♻️ Duplicate comments (1)
tests/integration_tests/test_collection_embedding_function.py (1)
429-466: Reset the embedding-function registry between parametrized runs.
These test-scoped classes register fixedname()values and can collide across parametrizeddb_clientruns.♻️ Suggested fix
class TestCollectionEmbeddingFunction: """Test collection creation with embedding function handling using parameterized db_client fixture""" + def setup_method(self): + EmbeddingFunctionRegistry._registry.clear() + EmbeddingFunctionRegistry._initialized = False
Summary
close #50
For details of solution, refer to doc.
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.