feat: add Text2Vec embedding function support by NTLx · Pull Request #142 · oceanbase/pyseekdb

NTLx · 2026-01-26T14:34:19Z

Add Text2VecEmbeddingFunction class using text2vec library.

- Lazy model loading (only imports text2vec when needed)
- Config serialization (get_config, build_from_config)
- Dimension property with fallback
- Supports device selection and normalization
- Issue: [Enhancement]: Support Text2Vec embedding function #134

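🎯 Task Claim and PR Submission

Related Issue: #134 - Support Text2Vec embedding function
PR: [auto-generated PR link]
Status: Open, Awaiting Review
Contributor: @NTLx

🤖 AI-Native Development Workflow

This task was driven end to end by Claude Code (Anthropic's AI assistant), in a highly automated workflow:

Automatic claim: claimed the task by commenting on the Issue page via BrowserOS
Environment setup: created a virtual environment with uv and installed text2vec and its dependencies (PyTorch, numpy<2)
Core implementation: implemented the Text2VecEmbeddingFunction class, following the architecture of SentenceTransformerEmbeddingFunction
Unit tests: used unittest.mock for deep mocking to work around an environment dependency problem (missing _lzma module), so the tests can run in any CI environment
Commit: local commit done, ready to push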

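✅ Technical Highlights

1. Environment isolation and proxy acceleration

Used uv (a fast package manager written in Rust) to create the virtual environment
Used the proxy command to speed up downloads of large dependencies such as PyTorch

2. Deep mocking

Injected mock objects into sys.modules (pre-mocking) to work around the missing _lzma module in the local Python build
Implemented a setup_method hook that resets mock state before each test
This keeps the tests stable in restricted CI environments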
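
The pre-mocking approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test file: the `embed` function, the `TestEmbed` class, the model name `"demo-model"`, and the fixed embedding values are all hypothetical. Only `sys.modules`, `unittest.mock.MagicMock`, and pytest's `setup_method` hook are real.

```python
import sys
from unittest.mock import MagicMock

# Build a fake `text2vec` module; SentenceModel(...) returns a mock model
# whose encode() yields a fixed, made-up embedding.
fake_text2vec = MagicMock(name="text2vec")
fake_model = fake_text2vec.SentenceModel.return_value
fake_model.encode.return_value = [[0.1, 0.2, 0.3]]

# Pre-mock: register the fake before anything executes `import text2vec`.
sys.modules["text2vec"] = fake_text2vec


def embed(documents):
    """Hypothetical code under test: imports text2vec lazily, then encodes."""
    from text2vec import SentenceModel  # resolved from sys.modules, not from disk

    model = SentenceModel("demo-model")
    return model.encode(documents)


class TestEmbed:
    def setup_method(self):
        # pytest calls this before each test; it clears recorded calls but
        # keeps the configured return values.
        fake_text2vec.reset_mock()

    def test_encode_uses_mocked_model(self):
        assert embed(["hello"]) == [[0.1, 0.2, 0.3]]
        fake_text2vec.SentenceModel.assert_called_once_with("demo-model")

    def test_call_history_starts_empty(self):
        fake_text2vec.SentenceModel.assert_not_called()
```

Because the fake is placed in `sys.modules` before the first import, the real package (and anything it pulls in, such as a module that needs `_lzma`) is never loaded.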

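3. Architecture

Follows the pattern of the existing SentenceTransformerEmbeddingFunction
Supports config serialization (get_config, build_from_config)
Implements dynamic lookup of the dimension property, with a fallback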
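
A config round trip of the kind named above might look like the sketch below. It is an assumption-laden stand-in rather than the pyseekdb class: the class name `ConfigurableEmbeddingSketch`, the default `"demo-model"`, the config keys, and the `batch_size` keyword are hypothetical; only the method names `get_config` and `build_from_config` come from the PR.

```python
from __future__ import annotations

from typing import Any


class ConfigurableEmbeddingSketch:
    """Hypothetical stand-in for an embedding function; loads no model."""

    def __init__(self, model_name: str = "demo-model", device: str | None = None,
                 normalize_embeddings: bool = False, **kwargs: Any) -> None:
        self.model_name = model_name
        self.device = device
        self.normalize_embeddings = normalize_embeddings
        self.kwargs = kwargs

    def get_config(self) -> dict[str, Any]:
        """Return a plain dict that can be stored and later rebuilt."""
        return {
            "model_name": self.model_name,
            "device": self.device,
            "normalize_embeddings": self.normalize_embeddings,
            "kwargs": dict(self.kwargs),
        }

    @classmethod
    def build_from_config(cls, config: dict[str, Any]) -> ConfigurableEmbeddingSketch:
        """Rebuild an instance from the dict produced by get_config()."""
        return cls(
            model_name=config["model_name"],
            device=config.get("device"),
            normalize_embeddings=config.get("normalize_embeddings", False),
            **config.get("kwargs", {}),
        )


# Round trip: serialize, rebuild, and compare the two configurations.
original = ConfigurableEmbeddingSketch("demo-model", device="cpu",
                                       normalize_embeddings=True, batch_size=8)
restored = ConfigurableEmbeddingSketch.build_from_config(original.get_config())
```

Keeping the config to plain, JSON-friendly values is what makes the round trip safe to persist alongside a collection.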

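4. Unit test coverage

Tests cover initialization, embedding generation, dimension calculation, and config serialization
Tests also cover lazy loading and caching of the model
Test result: 5 passed in 0.09s

Looking forward to your review! I will respond promptly to any feedback or requested changes.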

Summary

Solution Description

Summary by CodeRabbit

New Features
- Added a Text2Vec embedding function for generating document embeddings with configurable model, device, optional normalization, lazy loading and per-model caching for improved performance and memory use.
- Exposes name, config serialization, and reconstruction for easy integration.
Tests
- Added unit tests covering initialization, embedding generation, dimension reporting, config round-trip, and lazy-loading/caching behavior.

CLAassistant · 2026-01-26T14:34:29Z

All committers have signed the CLA.

coderabbitai · 2026-01-26T14:34:46Z

Warning

Rate limit exceeded

@NTLx has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 6 seconds before requesting another review.

📝 Walkthrough

Adds a new Text2VecEmbeddingFunction class, exposes it in the embedding-functions public API, registers it in the client embedding-function registry, and adds unit tests that mock the text2vec dependency to verify initialization, encoding, dimension, config round-trip, and model caching.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Public API Export: `src/pyseekdb/utils/embedding_functions/__init__.py` | Import and add `Text2VecEmbeddingFunction` to `__all__` to expose the new embedding function. |
| Embedding Implementation: `src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py` | New `Text2VecEmbeddingFunction` class with lazy per-model caching, kwargs validation, safe import handling, `dimension` property, `__call__` encoding, `get_config` / `build_from_config`. |
| Client Registry: `src/pyseekdb/client/embedding_function.py` | Include `Text2VecEmbeddingFunction` in optional imports and register it under the `"text2vec"` key in the embedding-function registry. |
| Unit Tests: `tests/unit_tests/test_text2vec_embedding.py` | New tests (mocked `text2vec`) covering defaults, encoding path, dimension extraction, config round-trip, and model caching/lazy-load behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Registry
  participant EmbFn as Text2VecEmbeddingFunction
  participant Lib as text2vec.SentenceModel

  Client->>Registry: request embedding function "text2vec"
  Registry->>EmbFn: instantiate or return factory
  Client->>EmbFn: call(documents)
  EmbFn->>Lib: _get_model(model_name, device, kwargs) (lazy cached)
  Lib-->>EmbFn: SentenceModel instance / embeddings
  EmbFn-->>Client: list[list[float]] embeddings

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat: add Morph embedding function support #143 — Adds a text2vec embedding-function implementation and registers/exports it similarly.

Suggested reviewers

hnwyllmm

Poem

🐰 I sniffed a model in the glade so green,
Text2Vec hummed vectors sleek and keen,
Cached in a burrow, tidy on demand,
Tests laid carrots in a careful hand,
A rabbit applauds this clever new scheme. 🎋

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding Text2Vec embedding function support to the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 45-62: The cache currently keys models only by model_name causing
collisions when device or kwargs differ; change the cache key in the
lazy-import/initialization block (where self.models, model_name, device, kwargs
and self._model are used) to include device and a stable, hashable
representation of kwargs (e.g., a tuple of sorted (k, v_repr) pairs or frozenset
of items where non-hashable values are converted via repr) so you construct
something like key = (model_name, device, kwargs_key) and use that key instead
of model_name when reading/writing self.models; ensure subsequent lookup for
self._model uses the same composite key.
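
The composite key suggested here could be built along these lines. This is a sketch under stated assumptions: the helper name `make_cache_key` and the module-level `model_cache` dict are hypothetical, and `repr` is used for values exactly as the comment proposes, which means two dicts with the same items in a different insertion order would produce different keys.

```python
from __future__ import annotations

from typing import Any


def make_cache_key(model_name: str, device: str | None,
                   kwargs: dict[str, Any]) -> tuple:
    """Build a hashable key from model name, device, and kwargs.

    Values are converted with repr() so that unhashable values such as
    lists do not break hashing; sorting makes the key independent of the
    order in which keyword arguments were passed.
    """
    kwargs_key = tuple(sorted((name, repr(value)) for name, value in kwargs.items()))
    return (model_name, device, kwargs_key)


# Hypothetical cache: a string stands in for a loaded model object.
model_cache: dict[tuple, str] = {}
key_cpu = make_cache_key("demo-model", "cpu", {"a": 1, "b": [1, 2]})
key_cpu_reordered = make_cache_key("demo-model", "cpu", {"b": [1, 2], "a": 1})
key_gpu = make_cache_key("demo-model", "cuda", {"a": 1, "b": [1, 2]})

model_cache[key_cpu] = "model loaded for cpu"
model_cache[key_gpu] = "model loaded for cuda"
```

With this key, the same model name on two devices occupies two cache slots instead of silently sharing one.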

🧹 Nitpick comments (4)

src/pyseekdb/utils/embedding_functions/__init__.py (1)
38-40: Import placement and __all__ ordering inconsistency.

The import is placed after __all__, unlike all other imports which are at the top. Additionally, the __all__ list appears to follow alphabetical order, but Text2VecEmbeddingFunction is appended at the end instead of between TencentHunyuanEmbeddingFunction and VoyageaiEmbeddingFunction.
♻️ Suggested fix

Move the import to the top with other imports:
 from .tencent_hunyuan_embedding_function import TencentHunyuanEmbeddingFunction
+from .text2vec_embedding_function import Text2VecEmbeddingFunction
 from .voyageai_embedding_function import VoyageaiEmbeddingFunction
And update __all__ ordering:
     "TencentHunyuanEmbeddingFunction",
+    "Text2VecEmbeddingFunction",
     "VoyageaiEmbeddingFunction",
-    "Text2VecEmbeddingFunction",
 ]
-from .text2vec_embedding_function import Text2VecEmbeddingFunction
tests/unit_tests/test_text2vec_embedding.py (1)
98-110: Silence unused variable warnings with _ prefix.

The ef and ef2 variables are assigned to trigger model loading side effects but are never read. Prefix with _ to indicate intentional non-use and silence the linter warnings (F841).
♻️ Suggested fix
         # First init should load model
-        ef = Text2VecEmbeddingFunction(model_name="new-model")
+        _ef = Text2VecEmbeddingFunction(model_name="new-model")
         mock_sentence_model.assert_called_once()

         # Second init with same model should not call constructor again
-        ef2 = Text2VecEmbeddingFunction(model_name="new-model")
+        _ef2 = Text2VecEmbeddingFunction(model_name="new-model")
         mock_sentence_model.assert_called_once()  # Call count remains 1
src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py (2)
64-72: Cache the dimension property to avoid repeated model inference.

Every access to dimension triggers an encode() call, which is expensive. Consider caching the result after the first computation.
♻️ Suggested fix
+    _cached_dimension: int | None = None
+
     @property
     def dimension(self) -> int:
         """Get the dimension of embeddings produced by this function."""
+        if self._cached_dimension is not None:
+            return self._cached_dimension
         # Get dimension from the model's encoding directly if possible
         # Or try encoding a dummy string
         sample = self._model.encode("test", normalize_embeddings=self.normalize_embeddings)
         if hasattr(sample, 'shape'):
-            return sample.shape[0] if len(sample.shape) == 1 else sample.shape[1]
-        return len(sample)
+            self._cached_dimension = sample.shape[0] if len(sample.shape) == 1 else sample.shape[1]
+        else:
+            self._cached_dimension = len(sample)
+        return self._cached_dimension
35-38: Consider recursive validation for nested structures (optional).

The validation accepts list, dict, and tuple but doesn't recursively validate their contents. A nested structure containing a non-serializable object (e.g., {"key": some_callable}) would pass validation but fail serialization. This is likely acceptable for typical usage, but worth noting.
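
If recursive validation were wanted, one possible shape is sketched below. The function name `validate_serializable` and the error wording are hypothetical, and accepting `None` and requiring string dict keys are assumptions that go beyond the check in the PR.

```python
from typing import Any

# Leaf types treated as serializable in this sketch (None is an assumption).
_SCALARS = (str, int, float, bool, type(None))


def validate_serializable(value: Any, path: str = "kwargs") -> None:
    """Raise TypeError if value, or anything nested inside it, is unsupported."""
    if isinstance(value, _SCALARS):
        return
    if isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            validate_serializable(item, f"{path}[{index}]")
        return
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                raise TypeError(f"{path} has a non-string key {key!r}")
            validate_serializable(item, f"{path}[{key!r}]")
        return
    raise TypeError(f"{path} has unsupported type {type(value).__name__}")


# A nested but valid structure passes silently.
validate_serializable({"a": [1, {"b": (2.0, None)}]})

# A callable hidden inside a dict is rejected, with the path to the offender.
try:
    validate_serializable({"key": len})
    error_message = ""
except TypeError as exc:
    error_message = str(exc)
```

Reporting the path to the offending value keeps the error actionable when the bad object sits several levels deep.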

hnwyllmm · 2026-01-31T06:20:34Z

+                ) from exc
+
+        # Get the actual model instance
+        self._model = self.models[model_name]


fix me later.

hnwyllmm

LGTM

hnwyllmm · 2026-01-31T06:21:27Z

Hi @NTLx, please fix the CI/quality check. You can run `make pre-commit` to fix the issue.

hnwyllmm · 2026-01-31T06:23:46Z

You should also register it in the embedding function registry. See #143 for an example.

…y loading

This update improves the Text2VecEmbeddingFunction implementation based on maintainer feedback:
- Implemented dimension caching for O(1) access after first computation.
- Enhanced model cache keys using (model_name, device, kwargs) for better isolation.
- Implemented true lazy loading to defer heavy model initialization until first usage.
- Registered Text2Vec in EmbeddingFunctionRegistry for global discovery.
- Standardized export order in __init__.py.
- Updated unit tests to align with lazy loading behavior and fixed lint warnings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

NTLx · 2026-02-01T14:04:43Z

Hello @oceanbase maintainers,

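Based on the earlier feedback and the code standards, I have reworked PR #142. The main changes are:

Performance (Text2VecEmbeddingFunction):
- Dimension caching: added a _cached_dimension attribute. The dimension is cached after the first computation, so later accesses are O(1).
- True lazy loading: model initialization is deferred entirely to _get_model, so weights are loaded only when embeddings are generated or the dimension is queried, which significantly improves the library's initial import performance.
- Composite cache key: the model cache now uses (model_name, device, kwargs) as the key, so instances with different configurations stay isolated and identical ones are reused safely.
System integration:
- Global registration: registered the text2vec identifier in EmbeddingFunctionRegistry.
- Export ordering: re-sorted the export list in embedding_functions/__init__.py alphabetically.
Quality assurance:
- Test updates: updated test_text2vec_embedding.py to match lazy loading by checking the "zero calls" state after initialization and the "loaded" state after first use.
- Lint fixes: resolved all F841 unused-variable warnings. All tests pass.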
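
The lazy-loading and caching behavior listed above, together with the "zero calls, then loaded" test, can be illustrated with the sketch below. It is not the merged implementation: `LazyEmbeddingSketch` and its injected `loader` argument are hypothetical (the loader stands in for the text2vec model constructor), and the cache key omits kwargs for brevity.

```python
from __future__ import annotations

from typing import Any, Callable, ClassVar
from unittest.mock import MagicMock


class LazyEmbeddingSketch:
    """Hypothetical sketch: defers model loading and caches the dimension."""

    # Shared across instances so identical configurations reuse one model.
    _models: ClassVar[dict[tuple, Any]] = {}

    def __init__(self, loader: Callable[[str], Any], model_name: str,
                 device: str | None = None) -> None:
        self._loader = loader  # injected stand-in for the model constructor
        self.model_name = model_name
        self.device = device
        self._model_instance: Any = None
        self._cached_dimension: int | None = None

    def _get_model(self) -> Any:
        # Load on first use only; later calls reuse the cached instance.
        if self._model_instance is None:
            key = (self.model_name, self.device)
            if key not in self._models:
                self._models[key] = self._loader(self.model_name)
            self._model_instance = self._models[key]
        return self._model_instance

    @property
    def dimension(self) -> int:
        # Encode a sample once, then answer from the cached value.
        if self._cached_dimension is None:
            sample = self._get_model().encode("test")
            self._cached_dimension = len(sample)
        return self._cached_dimension


loader = MagicMock()
loader.return_value.encode.return_value = [0.0, 0.0, 0.0, 0.0]

first = LazyEmbeddingSketch(loader, "demo-model")
calls_after_init = loader.call_count   # constructing does not load the model
dim = first.dimension                  # first access triggers loading
dim_again = first.dimension            # second access uses the cached value
second = LazyEmbeddingSketch(loader, "demo-model")
second_dim = second.dimension          # reuses the shared model instance
calls_after_use = loader.call_count
```

Injecting the loader keeps the sketch runnable without text2vec installed. In the PR's tests the same effect is obtained by mocking the `text2vec` module.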

The code has been pushed to the current branch. Looking forward to your re-review!

🤖 Generated with Claude Code

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 36-44: The constructor validation for Text2VecEmbeddingFunction
currently allows unhashable types (list, dict) in kwargs but _get_model builds a
cache key using frozenset(kwargs.items()), causing TypeError at runtime; fix by
either restricting kwargs to only hashable types (e.g., remove list and dict
from allowed types in the constructor validation so only
str/int/float/bool/tuple remain) or change the cache-key generation in
_get_model to use a safe serialization (e.g., JSON-serialize kwargs with sorted
keys or convert mutable values to tuples) so frozenset is not required—apply the
chosen approach consistently to the self.kwargs usage and _get_model cache
logic.

coderabbitai · 2026-02-01T14:04:48Z

+        # Validate kwargs - only allow primitive types
+        for key, value in kwargs.items():
+            if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
+                raise TypeError(f"Keyword argument {key} is not a primitive type")
+
+        self.model_name = model_name
+        self.device = device
+        self.normalize_embeddings = normalize_embeddings
+        self.kwargs = kwargs


⚠️ Potential issue | 🟡 Minor

Potential runtime error: frozenset(kwargs.items()) fails for unhashable nested values.

The validation at lines 37-39 allows list, dict, and tuple in kwargs values, but frozenset(kwargs.items()) at line 53 will raise TypeError if any value is a mutable type (list or dict). For example:

ef = Text2VecEmbeddingFunction(some_kwarg=[1, 2, 3])  # Passes validation
ef._get_model()  # TypeError: unhashable type: 'list'

Either restrict kwargs to hashable primitives only, or use a different approach for cache key generation.

🔧 Proposed fix: restrict to hashable types or use JSON serialization

Option 1: Restrict to hashable types only

         # Validate kwargs - only allow primitive types
         for key, value in kwargs.items():
-            if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
+            if not isinstance(value, (str, int, float, bool, type(None))):
                 raise TypeError(f"Keyword argument {key} is not a primitive type")

Option 2: Use JSON for cache key (keeps current validation)

+    @staticmethod
+    def _make_kwargs_key(kwargs: dict[str, Any]) -> str:
+        """Create a hashable key from kwargs."""
+        import json
+        return json.dumps(kwargs, sort_keys=True)
+
     def _get_model(self) -> Any:
         """Get or initialize the text2vec model instance."""
         if self._model_instance is not None:
             return self._model_instance

-        cache_key = (self.model_name, self.device, frozenset(self.kwargs.items()))
+        cache_key = (self.model_name, self.device, self._make_kwargs_key(self.kwargs))
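
The difference between the two key strategies can be checked in isolation with a short script. The keyword names `some_kwarg` and `other_kwarg` are placeholders, not real text2vec parameters.

```python
import json

kwargs = {"some_kwarg": [1, 2, 3], "other_kwarg": 128}

# frozenset() must hash every (key, value) pair, so a list value fails.
try:
    frozenset(kwargs.items())
    frozenset_ok = True
except TypeError:
    frozenset_ok = False

# json.dumps with sort_keys=True yields a string that is always hashable
# and does not depend on the order the keyword arguments were given in.
key_a = json.dumps(kwargs, sort_keys=True)
key_b = json.dumps({"other_kwarg": 128, "some_kwarg": [1, 2, 3]}, sort_keys=True)

cache_key = ("demo-model", "cpu", key_a)
cache = {cache_key: "loaded model placeholder"}
```

One trade-off of the JSON approach: tuples serialize as JSON arrays, so `(1, 2)` and `[1, 2]` map to the same key.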

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 39-39: Avoid specifying long messages outside the exception class

(TRY003)

NTLx · 2026-02-10T00:09:53Z

Hi @hnwyllmm! This PR is now MERGEABLE with all CI checks passing.

Summary:

Added Text2Vec embedding function support
Enables use of Text2Vec models for embedding generation

Ready for merge!

coderabbitai Bot reviewed Jan 26, 2026

zhanghuidinah mentioned this pull request Jan 30, 2026

Developer Activities: Call for Participation! oceanbase/seekdb#123

Closed

3 tasks

hnwyllmm reviewed Jan 31, 2026

hnwyllmm approved these changes Jan 31, 2026

coderabbitai Bot reviewed Feb 1, 2026

NTLx added 4 commits February 1, 2026 22:09

chore: fix ruff lint errors (E402, S108) and format tests

decdaf0

chore: fix remaining ruff lint errors and apply auto-formatting

3760708

chore: fix MD5 security warning and ClassVar annotation for CI compliance

0e7096a

style: apply ruff formatting fixes for CI compliance

ca076c4

hnwyllmm approved these changes Feb 11, 2026

hnwyllmm merged commit b804ee9 into oceanbase:develop Feb 11, 2026
8 checks passed

This was referenced Feb 11, 2026

Test collection name more than 64 characters #141

Closed

[Enhancement]: Support Text2Vec embedding function #134

Closed

coderabbitai Bot mentioned this pull request Feb 13, 2026

refactor: extract ONNXEmbeddingFunction from DefaultEmbeddingFunction #156

Closed

hnwyllmm mentioned this pull request Mar 12, 2026

Developer Activities: Call for Participation! oceanbase/seekdb#252

Closed

3 tasks
