feat: add Text2Vec embedding function support #142

Merged
hnwyllmm merged 6 commits into oceanbase:develop from NTLx:feature/support-text2vec
Feb 11, 2026
@NTLx NTLx commented Jan 26, 2026

Contributor

Add Text2VecEmbeddingFunction class using text2vec library.

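🎯 Task Claim and PR Submission

Related issue: #134 - Support Text2Vec embedding function
PR: [system-generated PR link]
Status: Open, Awaiting Review
Contributor: @NTLx


🤖 AI-Native Development Workflow

This task was driven end to end by Claude Code (Anthropic's AI assistant) as a highly automated workflow:

  1. Automatic claim: claimed the task by commenting on the issue page via BrowserOS
  2. Environment setup: created a virtual environment with uv and installed text2vec and its dependencies (PyTorch, numpy<2)
  3. Core implementation: implemented Text2VecEmbeddingFunction following the architecture of SentenceTransformerEmbeddingFunction
  4. Unit tests: mocked the dependency in depth with unittest.mock to work around an environment problem (missing _lzma module), so that the tests can run in any CI environment
  5. Commit: committed locally, ready to push

✅ Technical Highlights

1. Environment isolation and proxy acceleration

  • Created the virtual environment with uv (a fast package manager written in Rust)
  • Used the proxy command to speed up downloads of large dependencies such as PyTorch

2. Deep mocking

  • Injected mock objects into sys.modules (pre-mocking) to work around the missing _lzma module in the Python build
  • Implemented a setup_method hook so that mock state is reset before each test
  • This keeps the tests stable even in restricted CI environments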
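
The pre-mocking approach described above can be sketched roughly as follows. This is an illustration only, not the PR's actual test file: `load_sentence_model` and `TestWithPreMock` are hypothetical names, and the sketch assumes only that text2vec exposes a `SentenceModel` class, as the walkthrough's sequence diagram indicates.

```python
import sys
from unittest import mock

# Inject a fake "text2vec" module before anything imports the real one,
# so the real package and its heavy dependencies are never loaded.
fake_text2vec = mock.MagicMock(name="text2vec")
sys.modules["text2vec"] = fake_text2vec


def load_sentence_model(model_name: str):
    """Hypothetical helper standing in for code that imports text2vec lazily."""
    from text2vec import SentenceModel  # resolves to the mock injected above

    return SentenceModel(model_name)


class TestWithPreMock:
    def setup_method(self):
        # Reset recorded calls so each test starts from a clean state.
        fake_text2vec.reset_mock()

    def test_constructor_called_once(self):
        load_sentence_model("some-model")
        fake_text2vec.SentenceModel.assert_called_once_with("some-model")
```

Because the mock sits in `sys.modules`, the `from text2vec import ...` statement never touches the real package, which is why a missing `_lzma` module does not matter for these tests.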

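3. Architecture

  • Follows the pattern of the existing SentenceTransformerEmbeddingFunction
  • Supports config serialization (get_config, build_from_config)
  • Implements a dimension property that is computed dynamically, with a fallback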
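
A config round-trip of this kind might look like the sketch below. `ToyEmbeddingFunction` is a hypothetical stand-in, not the pyseekdb class; the attribute names are taken from the constructor snippet quoted later in the review, and making `build_from_config` a classmethod is an assumption.

```python
from typing import Any


class ToyEmbeddingFunction:
    """Hypothetical stand-in used only to illustrate the round-trip."""

    def __init__(self, model_name: str = "toy-model", device: str = "cpu",
                 normalize_embeddings: bool = False, **kwargs: Any) -> None:
        self.model_name = model_name
        self.device = device
        self.normalize_embeddings = normalize_embeddings
        self.kwargs = kwargs

    def get_config(self) -> dict[str, Any]:
        # Only plain, JSON-friendly values go into the config.
        return {
            "model_name": self.model_name,
            "device": self.device,
            "normalize_embeddings": self.normalize_embeddings,
            "kwargs": dict(self.kwargs),
        }

    @classmethod
    def build_from_config(cls, config: dict[str, Any]) -> "ToyEmbeddingFunction":
        # Assumption: a classmethod; the real signature may differ.
        return cls(
            model_name=config["model_name"],
            device=config["device"],
            normalize_embeddings=config["normalize_embeddings"],
            **config.get("kwargs", {}),
        )


original = ToyEmbeddingFunction(model_name="m", device="cpu", max_seq_length=128)
restored = ToyEmbeddingFunction.build_from_config(original.get_config())
```

The property worth testing is that `restored.get_config()` equals `original.get_config()`, which is what a "config round-trip" test checks.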

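4. Unit test coverage

  • Tests initialization, embedding generation, dimension calculation, and config serialization
  • Tests lazy loading and caching of the model
  • Test result: 5 passed in 0.09s

Looking forward to your review! I will respond promptly to any feedback or requested changes.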

Summary

Solution Description

Summary by CodeRabbit

  • New Features

    • Added a Text2Vec embedding function for generating document embeddings with configurable model, device, optional normalization, lazy loading and per-model caching for improved performance and memory use.
    • Exposes name, config serialization, and reconstruction for easy integration.
  • Tests

    • Added unit tests covering initialization, embedding generation, dimension reporting, config round-trip, and lazy-loading/caching behavior.

✏️ Tip: You can customize this high-level summary in your review settings.

Add Text2VecEmbeddingFunction class using text2vec library.
- Lazy model loading (only imports text2vec when needed)
- Config serialization (get_config, build_from_config)
- Dimension property with fallback
- Supports device selection and normalization

- Issue: oceanbase#134

CLAassistant commented Jan 26, 2026


CLA assistant check
All committers have signed the CLA.

coderabbitai Bot commented Jan 26, 2026


Warning

Rate limit exceeded

@NTLx has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 15 minutes and 6 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📝 Walkthrough

Adds a new Text2VecEmbeddingFunction class, exposes it in the embedding-functions public API, registers it in the client embedding-function registry, and adds unit tests that mock the text2vec dependency to verify initialization, encoding, dimension, config round-trip, and model caching.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Public API Export: `src/pyseekdb/utils/embedding_functions/__init__.py` | Import and add Text2VecEmbeddingFunction to `__all__` to expose the new embedding function. |
| Embedding Implementation: `src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py` | New Text2VecEmbeddingFunction class with lazy per-model caching, kwargs validation, safe import handling, dimension property, `__call__` encoding, get_config / build_from_config. |
| Client Registry: `src/pyseekdb/client/embedding_function.py` | Include Text2VecEmbeddingFunction in optional imports and register it under the "text2vec" key in the embedding-function registry. |
| Unit Tests: `tests/unit_tests/test_text2vec_embedding.py` | New tests (mocked text2vec) covering defaults, encoding path, dimension extraction, config round-trip, and model caching/lazy-load behavior. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Registry
  participant EmbFn as Text2VecEmbeddingFunction
  participant Lib as text2vec.SentenceModel

  Client->>Registry: request embedding function "text2vec"
  Registry->>EmbFn: instantiate or return factory
  Client->>EmbFn: call(documents)
  EmbFn->>Lib: _get_model(model_name, device, kwargs) (lazy cached)
  Lib-->>EmbFn: SentenceModel instance / embeddings
  EmbFn-->>Client: list[list[float]] embeddings
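
The lookup-then-call flow in the diagram can be approximated with a small name-to-factory registry. This is a sketch under assumptions: `register`, `create`, and `FakeText2Vec` are hypothetical names and do not reflect the actual pyseekdb registry API.

```python
from typing import Callable

# Hypothetical registry: maps a string key to a factory callable.
_REGISTRY: dict[str, Callable[..., object]] = {}


def register(name: str, factory: Callable[..., object]) -> None:
    """Associate a string key with a factory."""
    _REGISTRY[name] = factory


def create(name: str, **config: object) -> object:
    """Look up a factory by key and instantiate it with the given config."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown embedding function: {name!r}") from None
    return factory(**config)


class FakeText2Vec:
    """Stand-in that returns fixed-size vectors without loading a model."""

    def __init__(self, model_name: str = "toy-model") -> None:
        self.model_name = model_name

    def __call__(self, documents: list[str]) -> list[list[float]]:
        # One two-element vector per document; values are arbitrary.
        return [[float(len(doc)), 0.0] for doc in documents]


register("text2vec", FakeText2Vec)
ef = create("text2vec", model_name="toy-model")
vectors = ef(["hello", "hi"])
```

Registering under a string key is what lets a client ask for "text2vec" by name without importing the implementation module directly.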

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • hnwyllmm

Poem

🐰 I sniffed a model in the glade so green,
Text2Vec hummed vectors sleek and keen,
Cached in a burrow, tidy on demand,
Tests laid carrots in a careful hand,
A rabbit applauds this clever new scheme. 🎋

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding Text2Vec embedding function support to the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 45-62: The cache currently keys models only by model_name causing
collisions when device or kwargs differ; change the cache key in the
lazy-import/initialization block (where self.models, model_name, device, kwargs
and self._model are used) to include device and a stable, hashable
representation of kwargs (e.g., a tuple of sorted (k, v_repr) pairs or frozenset
of items where non-hashable values are converted via repr) so you construct
something like key = (model_name, device, kwargs_key) and use that key instead
of model_name when reading/writing self.models; ensure subsequent lookup for
self._model uses the same composite key.
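
One way to build such a composite key is sketched below. `make_cache_key`, `get_model`, and the module-level `_models` dict are hypothetical names for illustration; the placeholder `object()` stands in for the expensive model constructor.

```python
from typing import Any, Hashable, Optional


def make_cache_key(model_name: str, device: Optional[str],
                   kwargs: dict[str, Any]) -> Hashable:
    """Build a hashable key from model name, device, and kwargs.

    Values go through repr() so lists and dicts do not break hashing;
    sorting by key makes the result independent of argument order.
    """
    kwargs_key = tuple(sorted((k, repr(v)) for k, v in kwargs.items()))
    return (model_name, device, kwargs_key)


_models: dict[Hashable, object] = {}


def get_model(model_name: str, device: Optional[str] = None, **kwargs: Any) -> object:
    key = make_cache_key(model_name, device, kwargs)
    if key not in _models:
        # Placeholder for the expensive constructor call.
        _models[key] = object()
    return _models[key]
```

One caveat of the `repr()` approach: two dict values with the same items in a different insertion order produce different reprs, so they would be cached separately. That costs an extra load but never returns the wrong model.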
🧹 Nitpick comments (4)
src/pyseekdb/utils/embedding_functions/__init__.py (1)

38-40: Import placement and __all__ ordering inconsistency.

The import is placed after __all__, unlike all other imports which are at the top. Additionally, the __all__ list appears to follow alphabetical order, but Text2VecEmbeddingFunction is appended at the end instead of between TencentHunyuanEmbeddingFunction and VoyageaiEmbeddingFunction.

♻️ Suggested fix

Move the import to the top with other imports:

 from .tencent_hunyuan_embedding_function import TencentHunyuanEmbeddingFunction
+from .text2vec_embedding_function import Text2VecEmbeddingFunction
 from .voyageai_embedding_function import VoyageaiEmbeddingFunction

And update __all__ ordering:

     "TencentHunyuanEmbeddingFunction",
+    "Text2VecEmbeddingFunction",
     "VoyageaiEmbeddingFunction",
-    "Text2VecEmbeddingFunction",
 ]
-from .text2vec_embedding_function import Text2VecEmbeddingFunction
tests/unit_tests/test_text2vec_embedding.py (1)

98-110: Silence unused variable warnings with _ prefix.

The ef and ef2 variables are assigned to trigger model loading side effects but are never read. Prefix with _ to indicate intentional non-use and silence the linter warnings (F841).

♻️ Suggested fix
         # First init should load model
-        ef = Text2VecEmbeddingFunction(model_name="new-model")
+        _ef = Text2VecEmbeddingFunction(model_name="new-model")
         mock_sentence_model.assert_called_once()

         # Second init with same model should not call constructor again
-        ef2 = Text2VecEmbeddingFunction(model_name="new-model")
+        _ef2 = Text2VecEmbeddingFunction(model_name="new-model")
         mock_sentence_model.assert_called_once()  # Call count remains 1
src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py (2)

64-72: Cache the dimension property to avoid repeated model inference.

Every access to dimension triggers an encode() call, which is expensive. Consider caching the result after the first computation.

♻️ Suggested fix
+    _cached_dimension: int | None = None
+
     @property
     def dimension(self) -> int:
         """Get the dimension of embeddings produced by this function."""
+        if self._cached_dimension is not None:
+            return self._cached_dimension
         # Get dimension from the model's encoding directly if possible
         # Or try encoding a dummy string
         sample = self._model.encode("test", normalize_embeddings=self.normalize_embeddings)
         if hasattr(sample, 'shape'):
-            return sample.shape[0] if len(sample.shape) == 1 else sample.shape[1]
-        return len(sample)
+            self._cached_dimension = sample.shape[0] if len(sample.shape) == 1 else sample.shape[1]
+        else:
+            self._cached_dimension = len(sample)
+        return self._cached_dimension
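
To see the effect of the suggested caching in isolation, here is a self-contained sketch with a fake model that counts `encode()` calls. `CountingModel` and `ToyEmbedding` are hypothetical names; the shape handling mirrors the suggestion above.

```python
class CountingModel:
    """Fake model that records how many times encode() is called."""

    def __init__(self) -> None:
        self.calls = 0

    def encode(self, text: str) -> list[float]:
        self.calls += 1
        return [0.0] * 8


class ToyEmbedding:
    def __init__(self, model: CountingModel) -> None:
        self._model = model
        self._cached_dimension = None

    @property
    def dimension(self) -> int:
        if self._cached_dimension is None:
            sample = self._model.encode("test")
            shape = getattr(sample, "shape", None)
            if shape is not None:
                # 1-D array: (dim,); 2-D array: (batch, dim)
                self._cached_dimension = shape[0] if len(shape) == 1 else shape[1]
            else:
                # Fallback for plain sequences without a shape attribute.
                self._cached_dimension = len(sample)
        return self._cached_dimension


model = CountingModel()
ef = ToyEmbedding(model)
first, second = ef.dimension, ef.dimension
```

After two accesses, `model.calls` is 1: the second read is served from the cached value rather than a new inference.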

35-38: Consider recursive validation for nested structures (optional).

The validation accepts list, dict, and tuple but doesn't recursively validate their contents. A nested structure containing a non-serializable object (e.g., {"key": some_callable}) would pass validation but fail serialization. This is likely acceptable for typical usage, but worth noting.
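
If recursive validation were wanted, it could look roughly like this. `is_serializable` and `validate_kwargs` are hypothetical helpers, not part of the PR, and the set of accepted scalar types is an assumption.

```python
from typing import Any

_SCALARS = (str, int, float, bool, type(None))


def is_serializable(value: Any) -> bool:
    """True if value is built only from scalars, lists, tuples, and str-keyed dicts."""
    if isinstance(value, _SCALARS):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_serializable(item) for item in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_serializable(v) for k, v in value.items())
    return False


def validate_kwargs(kwargs: dict[str, Any]) -> None:
    """Raise TypeError for the first kwarg whose value cannot be serialized."""
    for key, value in kwargs.items():
        if not is_serializable(value):
            raise TypeError(f"Keyword argument {key!r} is not serializable")
```

With this check, `validate_kwargs({"key": {"fn": print}})` raises `TypeError` at construction time instead of failing later during serialization.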

Comment thread src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py Outdated
) from exc

# Get the actual model instance
self._model = self.models[model_name]

Member

fix me later.

@hnwyllmm hnwyllmm left a comment

Member

LGTM

@hnwyllmm

Member

Hi @NTLx please fix the CI/quality. You can execute make pre-commit to fix the issue.

@hnwyllmm

Member

You also should register it to embedding function registry. You can refer to #143 for example.

…y loading

This update improves the Text2VecEmbeddingFunction implementation based on maintainer feedback:
- Implemented dimension caching for O(1) access after first computation.
- Enhanced model cache keys using (model_name, device, kwargs) for better isolation.
- Implemented true lazy loading to defer heavy model initialization until first usage.
- Registered Text2Vec in EmbeddingFunctionRegistry for global discovery.
- Standardized export order in __init__.py.
- Updated unit tests to align with lazy loading behavior and fixed lint warnings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
NTLx commented Feb 1, 2026

Contributor Author

Hello @oceanbase maintainers,

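Following the earlier feedback and the code conventions, I have reworked PR #142. The main changes are:

  1. Performance (Text2VecEmbeddingFunction):

    • Dimension caching: added a _cached_dimension attribute. The dimension is cached after it is first computed, so later accesses are O(1).
    • True lazy loading: model initialization is deferred entirely to _get_model, so weights are loaded only when embeddings are actually generated or the dimension is queried. This significantly improves the library's initial import time.
    • Composite cache key: the model cache is now keyed by (model_name, device, kwargs), which isolates instances across configurations while still allowing safe reuse.
  2. Integration:

    • Global registration: registered the text2vec identifier in EmbeddingFunctionRegistry.
    • Export cleanup: reordered the export list in embedding_functions/__init__.py alphabetically.
  3. Quality assurance:

    • Test updates: updated test_text2vec_embedding.py to check the "zero calls" state after initialization and the "loaded" state after first use, so the tests match the lazy-loading behavior.
    • Lint fixes: resolved all F841 unused-variable warnings; all tests pass.

The code has been pushed to this branch. Looking forward to your re-review!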
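
For readers unfamiliar with the pattern, the lazy loading described in this comment can be sketched as below. `LazyEmbedding`, `_FakeModel`, and `fake_loader` are hypothetical names; only `_get_model` and `_model_instance` echo identifiers that appear in the review snippets.

```python
from typing import Any, Callable, Optional


class LazyEmbedding:
    """Defers model construction until the model is first needed."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader  # cheap: nothing is loaded here
        self._model_instance: Optional[Any] = None

    def _get_model(self) -> Any:
        if self._model_instance is None:
            self._model_instance = self._loader()  # expensive, runs once
        return self._model_instance

    def __call__(self, documents: list[str]) -> list[list[float]]:
        model = self._get_model()
        return [list(model.encode(doc)) for doc in documents]


class _FakeModel:
    def encode(self, text: str) -> list[float]:
        return [float(len(text))]


load_calls: list[int] = []


def fake_loader() -> _FakeModel:
    load_calls.append(1)  # record each (simulated) expensive load
    return _FakeModel()


ef = LazyEmbedding(fake_loader)
calls_after_init = len(load_calls)  # constructor did not load anything
ef(["a"])
ef(["bb"])
```

This is the "zero calls after initialization, one load after first use" behavior that the updated tests are described as checking.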

🤖 Generated with Claude Code

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/pyseekdb/utils/embedding_functions/text2vec_embedding_function.py`:
- Around line 36-44: The constructor validation for Text2VecEmbeddingFunction
currently allows unhashable types (list, dict) in kwargs but _get_model builds a
cache key using frozenset(kwargs.items()), causing TypeError at runtime; fix by
either restricting kwargs to only hashable types (e.g., remove list and dict
from allowed types in the constructor validation so only
str/int/float/bool/tuple remain) or change the cache-key generation in
_get_model to use a safe serialization (e.g., JSON-serialize kwargs with sorted
keys or convert mutable values to tuples) so frozenset is not required—apply the
chosen approach consistently to the self.kwargs usage and _get_model cache
logic.

Comment on lines +36 to +44
        # Validate kwargs - only allow primitive types
        for key, value in kwargs.items():
            if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
                raise TypeError(f"Keyword argument {key} is not a primitive type")

        self.model_name = model_name
        self.device = device
        self.normalize_embeddings = normalize_embeddings
        self.kwargs = kwargs

⚠️ Potential issue | 🟡 Minor

Potential runtime error: frozenset(kwargs.items()) fails for unhashable nested values.

The validation at lines 37-39 allows list, dict, and tuple in kwargs values, but frozenset(kwargs.items()) at line 53 will raise TypeError if any value is a mutable type (list or dict). For example:

ef = Text2VecEmbeddingFunction(some_kwarg=[1, 2, 3])  # Passes validation
ef._get_model()  # TypeError: unhashable type: 'list'

Either restrict kwargs to hashable primitives only, or use a different approach for cache key generation.

🔧 Proposed fix: restrict to hashable types or use JSON serialization

Option 1: Restrict to hashable types only

         # Validate kwargs - only allow primitive types
         for key, value in kwargs.items():
-            if not isinstance(value, (str, int, float, bool, list, dict, tuple)):
+            if not isinstance(value, (str, int, float, bool, type(None))):
                 raise TypeError(f"Keyword argument {key} is not a primitive type")

Option 2: Use JSON for cache key (keeps current validation)

+    @staticmethod
+    def _make_kwargs_key(kwargs: dict[str, Any]) -> str:
+        """Create a hashable key from kwargs."""
+        import json
+        return json.dumps(kwargs, sort_keys=True)
+
     def _get_model(self) -> Any:
         """Get or initialize the text2vec model instance."""
         if self._model_instance is not None:
             return self._model_instance

-        cache_key = (self.model_name, self.device, frozenset(self.kwargs.items()))
+        cache_key = (self.model_name, self.device, self._make_kwargs_key(self.kwargs))
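
The difference between the two key strategies can be checked with a few lines of standard-library Python. The kwarg names `pooling` and `layers` are made up for the example.

```python
import json

# Hypothetical kwargs containing an unhashable (list) value.
kwargs = {"pooling": "mean", "layers": [1, 2, 3]}

# frozenset needs every (key, value) pair to be hashable; a list value is not.
try:
    frozenset(kwargs.items())
    frozenset_ok = True
except TypeError:
    frozenset_ok = False

# JSON with sorted keys gives a stable string regardless of insertion order.
key_a = json.dumps(kwargs, sort_keys=True)
key_b = json.dumps({"layers": [1, 2, 3], "pooling": "mean"}, sort_keys=True)
```

Here `frozenset_ok` ends up `False` while `key_a == key_b` holds. One trade-off of the JSON key: tuples are serialized as JSON arrays, so `(1, 2)` and `[1, 2]` map to the same cache entry.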
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 39-39: Avoid specifying long messages outside the exception class

(TRY003)

NTLx commented Feb 10, 2026

Contributor Author

Hi @hnwyllmm! This PR is now MERGEABLE with all CI checks passing.

Summary:

  • Added Text2Vec embedding function support
  • Enables use of Text2Vec models for embedding generation

Ready for merge!

@hnwyllmm hnwyllmm merged commit b804ee9 into oceanbase:develop Feb 11, 2026
8 checks passed