feat: add Text2Vec embedding function support#142
Conversation
Add Text2VecEmbeddingFunction class using text2vec library. - Lazy model loading (only imports text2vec when needed) - Config serialization (get_config, build_from_config) - Dimension property with fallback - Supports device selection and normalization - Issue: oceanbase#134
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughAdds a new Text2VecEmbeddingFunction class, exposes it in the embedding-functions public API, registers it in the client embedding-function registry, and adds unit tests that mock the text2vec dependency to verify initialization, encoding, dimension, config round-trip, and model caching. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Registry
participant EmbFn as Text2VecEmbeddingFunction
participant Lib as text2vec.SentenceModel
Client->>Registry: request embedding function "text2vec"
Registry->>EmbFn: instantiate or return factory
Client->>EmbFn: call(documents)
EmbFn->>Lib: _get_model(model_name, device, kwargs) (lazy cached)
Lib-->>EmbFn: SentenceModel instance / embeddings
EmbFn-->>Client: list[list[float]] embeddings
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 45-62: The cache currently keys models only by model_name causing
collisions when device or kwargs differ; change the cache key in the
lazy-import/initialization block (where self.models, model_name, device, kwargs
and self._model are used) to include device and a stable, hashable
representation of kwargs (e.g., a tuple of sorted (k, v_repr) pairs or frozenset
of items where non-hashable values are converted via repr) so you construct
something like key = (model_name, device, kwargs_key) and use that key instead
of model_name when reading/writing self.models; ensure subsequent lookup for
self._model uses the same composite key.
🧹 Nitpick comments (4)
src/pyseekdb/utils/embedding_functions/__init__.py (1)
38-40: Import placement and__all__ordering inconsistency.The import is placed after
__all__, unlike all other imports which are at the top. Additionally, the__all__list appears to follow alphabetical order, butText2VecEmbeddingFunctionis appended at the end instead of betweenTencentHunyuanEmbeddingFunctionandVoyageaiEmbeddingFunction.♻️ Suggested fix
Move the import to the top with other imports:
from .tencent_hunyuan_embedding_function import TencentHunyuanEmbeddingFunction +from .text2vec_embedding_function import Text2VecEmbeddingFunction from .voyageai_embedding_function import VoyageaiEmbeddingFunctionAnd update
__all__ordering:"TencentHunyuanEmbeddingFunction", + "Text2VecEmbeddingFunction", "VoyageaiEmbeddingFunction", - "Text2VecEmbeddingFunction", ] -from .text2vec_embedding_function import Text2VecEmbeddingFunctiontests/unit_tests/test_text2vec_embedding.py (1)
98-110: Silence unused variable warnings with_prefix.The
efandef2variables are assigned to trigger model loading side effects but are never read. Prefix with_to indicate intentional non-use and silence the linter warnings (F841).♻️ Suggested fix
# First init should load model - ef = Text2VecEmbeddingFunction(model_name="new-model") + _ef = Text2VecEmbeddingFunction(model_name="new-model") mock_sentence_model.assert_called_once() # Second init with same model should not call constructor again - ef2 = Text2VecEmbeddingFunction(model_name="new-model") + _ef2 = Text2VecEmbeddingFunction(model_name="new-model") mock_sentence_model.assert_called_once() # Call count remains 1src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py (2)
64-72: Cache thedimensionproperty to avoid repeated model inference.Every access to
dimensiontriggers anencode()call, which is expensive. Consider caching the result after the first computation.♻️ Suggested fix
+ _cached_dimension: int | None = None + `@property` def dimension(self) -> int: """Get the dimension of embeddings produced by this function.""" + if self._cached_dimension is not None: + return self._cached_dimension # Get dimension from the model's encoding directly if possible # Or try encoding a dummy string sample = self._model.encode("test", normalize_embeddings=self.normalize_embeddings) if hasattr(sample, 'shape'): - return sample.shape[0] if len(sample.shape) == 1 else sample.shape[1] - return len(sample) + self._cached_dimension = sample.shape[0] if len(sample.shape) == 1 else sample.shape[1] + else: + self._cached_dimension = len(sample) + return self._cached_dimension
35-38: Consider recursive validation for nested structures (optional).The validation accepts
list,dict, andtuplebut doesn't recursively validate their contents. A nested structure containing a non-serializable object (e.g.,{"key": some_callable}) would pass validation but fail serialization. This is likely acceptable for typical usage, but worth noting.
| ) from exc | ||
|
|
||
| # Get the actual model instance | ||
| self._model = self.models[model_name] |
|
Hi @NTLx please fix the CI/quality. You can execute |
|
You also should register it to embedding function registry. You can refer to #143 for example. |
…y loading This update improves the Text2VecEmbeddingFunction implementation based on maintainer feedback: - Implemented dimension caching for O(1) access after first computation. - Enhanced model cache keys using (model_name, device, kwargs) for better isolation. - Implemented true lazy loading to defer heavy model initialization until first usage. - Registered Text2Vec in EmbeddingFunctionRegistry for global discovery. - Standardized export order in __init__.py. - Updated unit tests to align with lazy loading behavior and fixed lint warnings. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Hello @oceanbase maintainers, 根据之前的反馈和代码规范,我已经对 PR #142 进行了深度优化,主要改动如下:
代码已推送至当前分支。期待您的再次评审! 🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 36-44: The constructor validation for Text2VecEmbeddingFunction
currently allows unhashable types (list, dict) in kwargs but _get_model builds a
cache key using frozenset(kwargs.items()), causing TypeError at runtime; fix by
either restricting kwargs to only hashable types (e.g., remove list and dict
from allowed types in the constructor validation so only
str/int/float/bool/tuple remain) or change the cache-key generation in
_get_model to use a safe serialization (e.g., JSON-serialize kwargs with sorted
keys or convert mutable values to tuples) so frozenset is not required—apply the
chosen approach consistently to the self.kwargs usage and _get_model cache
logic.
| # Validate kwargs - only allow primitive types | ||
| for key, value in kwargs.items(): | ||
| if not isinstance(value, (str, int, float, bool, list, dict, tuple)): | ||
| raise TypeError(f"Keyword argument {key} is not a primitive type") | ||
|
|
||
| self.model_name = model_name | ||
| self.device = device | ||
| self.normalize_embeddings = normalize_embeddings | ||
| self.kwargs = kwargs |
There was a problem hiding this comment.
Potential runtime error: frozenset(kwargs.items()) fails for unhashable nested values.
The validation at lines 37-39 allows list, dict, and tuple in kwargs values, but frozenset(kwargs.items()) at line 53 will raise TypeError if any value is a mutable type (list or dict). For example:
ef = Text2VecEmbeddingFunction(some_kwarg=[1, 2, 3]) # Passes validation
ef._get_model() # TypeError: unhashable type: 'list'Either restrict kwargs to hashable primitives only, or use a different approach for cache key generation.
🔧 Proposed fix: restrict to hashable types or use JSON serialization
Option 1: Restrict to hashable types only
# Validate kwargs - only allow primitive types
for key, value in kwargs.items():
- if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
+ if not isinstance(value, (str, int, float, bool, type(None))):
raise TypeError(f"Keyword argument {key} is not a primitive type")Option 2: Use JSON for cache key (keeps current validation)
+ `@staticmethod`
+ def _make_kwargs_key(kwargs: dict[str, Any]) -> str:
+ """Create a hashable key from kwargs."""
+ import json
+ return json.dumps(kwargs, sort_keys=True)
+
def _get_model(self) -> Any:
"""Get or initialize the text2vec model instance."""
if self._model_instance is not None:
return self._model_instance
- cache_key = (self.model_name, self.device, frozenset(self.kwargs.items()))
+ cache_key = (self.model_name, self.device, self._make_kwargs_key(self.kwargs))🧰 Tools
🪛 Ruff (0.14.14)
[warning] 39-39: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py` around
lines 36 - 44, The constructor validation for Text2VecEmbeddingFunction
currently allows unhashable types (list, dict) in kwargs but _get_model builds a
cache key using frozenset(kwargs.items()), causing TypeError at runtime; fix by
either restricting kwargs to only hashable types (e.g., remove list and dict
from allowed types in the constructor validation so only
str/int/float/bool/tuple remain) or change the cache-key generation in
_get_model to use a safe serialization (e.g., JSON-serialize kwargs with sorted
keys or convert mutable values to tuples) so frozenset is not required—apply the
chosen approach consistently to the self.kwargs usage and _get_model cache
logic.
|
Hi @hnwyllmm! This PR is now MERGEABLE with all CI checks passing. Summary:
Ready for merge! |
Add Text2VecEmbeddingFunction class using text2vec library.
Lazy model loading (only imports text2vec when needed)
Config serialization (get_config, build_from_config)
Dimension property with fallback
Supports device selection and normalization
Issue: [Enhancement]: Support Text2Vec embedding function #134
🎯 任务认领与 PR 提交
关联 Issue: #134 - Support Text2Vec embedding function
PR: [系统自动生成的 PR 链接]
状态: Open, Awaiting Review
贡献者: @NTLx
🤖 AI-Native 开发流程
本次任务由 Claude Code (Anthropic AI 助手) 全程驱动,实现了高度自动化的端到端工作流:
uv创建虚拟环境,安装text2vec及其依赖(PyTorch, numpy<2)SentenceTransformerEmbeddingFunction的架构,实现Text2VecEmbeddingFunction类unittest.mock进行深度 Mock 验证,绕过环境依赖问题(_lzma模块缺失),确保测试在任何 CI 环境下都能运行✅ 技术亮点
1. 环境隔离与代理加速
uv(Rust 编写的极速包管理器)创建虚拟环境proxy命令加速 PyTorch 等大型依赖的下载2. 深度 Mocking 技术
sys.modules注入 Mock 对象(Pre-Mocking),绕过 Python 编译环境缺失_lzma模块的问题setup_method机制,确保每个测试前重置 Mock 状态3. 架构设计
SentenceTransformerEmbeddingFunction的模式实现get_config,build_from_config)4. 单元测试覆盖
期待 Review!如有任何反馈或需要调整,随时响应。
Summary
Solution Description
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.