Skip to content

fix ruff check#162

Merged
hnwyllmm merged 3 commits into
oceanbase:developfrom
hnwyllmm:refactor/default-embedding
Feb 2, 2026
Merged

fix ruff check#162
hnwyllmm merged 3 commits into
oceanbase:developfrom
hnwyllmm:refactor/default-embedding

Conversation

@hnwyllmm

@hnwyllmm hnwyllmm commented Feb 2, 2026

Copy link
Copy Markdown
Member

Summary

close #137
refactor DefaultEmbeddingFunction.
Split it into two classes: OnnxEmbeddingFunction and DefaultEmbeddingFunction.
DefaultEmbeddingFunction worked based on OnnxEmbeddingFunction. OnnxEmbeddingFunction can execute model based on ONNX.

Summary by CodeRabbit

  • New Features

    • Added a standalone ONNX-based embedding implementation and exposed it as an embedding option.
  • Refactor

    • Default embedding now delegates embedding work to the new ONNX embedding path for simpler, modular behavior.
  • Breaking Change

    • Removed several public embedding-related attributes and updated the default embedding constructor signature.
  • Tests

    • Added unit tests for the ONNX embedding and adjusted test startup path handling.

@hnwyllmm hnwyllmm requested a review from Copilot February 2, 2026 06:36
@coderabbitai

coderabbitai Bot commented Feb 2, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

Extracted ONNX download/tokenizer/runtime logic into a new OnnxEmbeddingFunction; DefaultEmbeddingFunction now initializes and delegates to an internal OnnxEmbeddingFunction instance and simplified its public surface and attributes.

Changes

Cohort / File(s) Summary
Refactored Default Embedding Function
src/pyseekdb/client/embedding_function.py
Removed in-file ONNX download/tokenizer/runtime setup and public constants; updated constructor to accept model_name and preferred_providers; internalized model IDs (_MODEL_NAME, _HF_MODEL_ID) and delegates embedding calls to self._onnx(...).
New ONNX Embedding Implementation
src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py, src/pyseekdb/utils/embedding_functions/__init__.py
Added OnnxEmbeddingFunction with lazy imports, Hugging Face download (HF_ENDPOINT support), tokenizer & ONNX session caching, provider handling, input preprocessing, mean-pooling inference, robust error/log handling; re-exported in package __all__.
Tests and Test Setup
tests/unit_tests/test_onnx_embedding_function.py, tests/integration_tests/conftest.py
Added unit tests for OnnxEmbeddingFunction (validation, dimension, output shape); test conftest now inserts repo_root/src into sys.path and sets default SEEKDB_PATH to repo_root / "seekdb.db".

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DefaultEmbeddingFunction
    participant OnnxEmbeddingFunction
    participant Tokenizer
    participant ONNXRuntime
    participant HuggingFace

    Client->>DefaultEmbeddingFunction: request embeddings(documents)
    DefaultEmbeddingFunction->>OnnxEmbeddingFunction: __call__(documents)
    OnnxEmbeddingFunction->>OnnxEmbeddingFunction: ensure model/tokenizer downloaded
    OnnxEmbeddingFunction->>Tokenizer: tokenize(documents) (cached)
    OnnxEmbeddingFunction->>ONNXRuntime: run model(inputs) (cached session)
    ONNXRuntime-->>OnnxEmbeddingFunction: model outputs
    OnnxEmbeddingFunction-->>OnnxEmbeddingFunction: mean-pool using attention_mask
    OnnxEmbeddingFunction-->>DefaultEmbeddingFunction: embeddings
    DefaultEmbeddingFunction-->>Client: return embeddings
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 I tunneled through code, found a neat seam,

Split model and duties into a tidy team.
ONNX hums, tokenizer sings in line,
Default just calls — and embeddings shine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'fix ruff check' is vague and generic, not describing the main change of refactoring DefaultEmbeddingFunction into separate classes. Consider a more descriptive title like 'Refactor DefaultEmbeddingFunction to separate ONNX concerns' to clearly communicate the primary change.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed The PR successfully implements all objectives from issue #137: introduces OnnxEmbeddingFunction to encapsulate ONNX model execution, refactors DefaultEmbeddingFunction to delegate to it, and maintains support for all-MiniLM-L6-v2 model.
Out of Scope Changes check ✅ Passed Changes are focused on refactoring DefaultEmbeddingFunction and introducing OnnxEmbeddingFunction as required. Test files and configuration updates support the refactoring; all changes align with the stated objectives.
Docstring Coverage ✅ Passed Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the DefaultEmbeddingFunction class by extracting the ONNX-based embedding logic into a new generic OnnxEmbeddingFunction class. The refactoring aims to separate concerns and make the ONNX runtime functionality reusable for other embedding models beyond the default all-MiniLM-L6-v2.

Changes:

  • Created a new OnnxEmbeddingFunction class that handles ONNX model downloading, loading, and inference
  • Refactored DefaultEmbeddingFunction to delegate to OnnxEmbeddingFunction while maintaining backward compatibility
  • Added deprecation warnings for the preferred_providers parameter in DefaultEmbeddingFunction
  • Added unit tests for the new OnnxEmbeddingFunction class

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File Description
src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py New generic ONNX embedding function that handles model download, tokenization, and inference for ONNX-based models
src/pyseekdb/client/embedding_function.py Refactored DefaultEmbeddingFunction to use composition, delegating to OnnxEmbeddingFunction while maintaining backward compatibility
tests/unit_tests/test_onnx_embedding_function.py New unit tests covering initialization validation and basic functionality of OnnxEmbeddingFunction

Comment on lines +321 to +322
# Force CPU execution provider to avoid provider issues
providers=["CPUExecutionProvider"],

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preferred_providers parameter is validated and stored, but then ignored when creating the InferenceSession. Line 322 hardcodes providers to ["CPUExecutionProvider"], overriding any user-specified providers. Either the preferred_providers functionality should be removed if it's not supported, or the hardcoded providers should be replaced with self._preferred_providers. The current implementation is misleading to users who might expect their provider preferences to be respected.

Suggested change
# Force CPU execution provider to avoid provider issues
providers=["CPUExecutionProvider"],
providers=self._preferred_providers,

Copilot uses AI. Check for mistakes.
Comment on lines +36 to +45
def test_init_validates_preferred_providers() -> None:
with pytest.raises(ValueError, match="Preferred providers must be a list of strings"):
OnnxEmbeddingFunction(model_name="test", hf_model_id="org/test", dimension=3, preferred_providers=[1]) # type: ignore[list-item]
with pytest.raises(ValueError, match="Preferred providers must be unique"):
OnnxEmbeddingFunction(
model_name="test",
hf_model_id="org/test",
dimension=3,
preferred_providers=["CPUExecutionProvider", "CPUExecutionProvider"],
)

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test validates the preferred_providers parameter but doesn't test the actual functionality of using different providers. Given that the model() method hardcodes providers to ["CPUExecutionProvider"] (line 322 in onnx_embedding_function.py), these validation tests are misleading as they suggest the feature works when it actually doesn't. Consider either removing these tests or fixing the implementation to actually use the preferred providers.

Copilot uses AI. Check for mistakes.
Comment on lines +158 to +160
model_name: str = "all-MiniLM-L6-v2", # Deprecated. Will be removed in a future version.
preferred_providers: list[str] | None = None, # Deprecated. Will be removed in a future version.
# The preferred ONNX runtime providers. Defaults to None (uses available providers).

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Args section of the docstring has incorrect formatting. The default values and comments are placed inline with the parameter names instead of in the description. The docstring should follow standard format with parameter names on one line and descriptions on the following lines. For example: 'model_name: Deprecated parameter. Will be removed in a future version. Currently only "all-MiniLM-L6-v2" is supported.'

Suggested change
model_name: str = "all-MiniLM-L6-v2", # Deprecated. Will be removed in a future version.
preferred_providers: list[str] | None = None, # Deprecated. Will be removed in a future version.
# The preferred ONNX runtime providers. Defaults to None (uses available providers).
model_name:
Deprecated parameter. Will be removed in a future version.
Currently only "all-MiniLM-L6-v2" is supported.
preferred_providers:
Deprecated parameter. Will be removed in a future version.
Preferred ONNX runtime providers. Defaults to None (uses available providers).
Use the preferred_providers argument of OnnxEmbeddingFunction instead.

Copilot uses AI. Check for mistakes.
"""

EXTRACTED_FOLDER_NAME = "onnx"
ARCHIVE_FILENAME = "onnx.tar.gz"

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ARCHIVE_FILENAME class constant is defined but never used anywhere in the class. This appears to be leftover from a previous implementation that used tar.gz archives instead of direct file downloads. Consider removing this unused constant to avoid confusion.

Suggested change
ARCHIVE_FILENAME = "onnx.tar.gz"

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +61
pytest.importorskip("onnxruntime")
pytest.importorskip("tokenizers")
pytest.importorskip("tqdm")

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses pytest.importorskip but these imports are mandatory dependencies in pyproject.toml (onnxruntime>=1.19.0, tokenizers>=0.15.0, tqdm). Using importorskip is appropriate for optional dependencies, but for required dependencies, the test should fail hard if they're missing rather than skip. This inconsistency could mask installation issues.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +67
download_path
if download_path is not None
else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The download_path is stored as a Path object but then used with os.path.join throughout the code (lines 147, 278, 320, 338). While os.path.join does accept Path objects in Python 3.6+, it's inconsistent to mix Path and os.path APIs. Consider either using Path methods consistently (download_path / EXTRACTED_FOLDER_NAME) or converting download_path to a string. The current approach reduces code clarity and could cause confusion.

Suggested change
download_path
if download_path is not None
else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name
str(download_path)
if download_path is not None
else str(Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name)

Copilot uses AI. Check for mistakes.
Comment on lines +25 to +31
class OnnxEmbeddingFunction:
"""
Generic ONNX runtime embedding function.

This class handles model download, tokenizer/model loading, and embedding
generation using onnxruntime.
"""

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OnnxEmbeddingFunction class does not implement the persistence methods required by the EmbeddingFunction protocol (name(), get_config(), and build_from_config()). If this class is intended to be used directly by users as suggested in the deprecation warning in DefaultEmbeddingFunction, it should implement these methods to support collection persistence. Without these methods, collections using OnnxEmbeddingFunction directly cannot be persisted and restored.

Copilot uses AI. Check for mistakes.
Comment on lines +315 to +317
if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers:
# remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU.
self._preferred_providers.remove("CoreMLExecutionProvider")

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code modifies the _preferred_providers list in place by calling .remove("CoreMLExecutionProvider"). If this method is called multiple times (which could happen with cached_property in some edge cases), it will attempt to remove the same item twice and raise a ValueError. Consider using a list comprehension to create a new list without the unwanted provider instead: providers = [p for p in self._preferred_providers if p != "CoreMLExecutionProvider"]

Copilot uses AI. Check for mistakes.
self._onnx = OnnxEmbeddingFunction(
model_name=self.MODEL_NAME,
hf_model_id=self.HF_MODEL_ID,
dimension=self._DIMENSION,

Copilot AI Feb 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The preferred_providers parameter is deprecated but still accepted in the init method. However, it's never passed to the underlying OnnxEmbeddingFunction. This means that even during the deprecation period, the parameter has no effect. If backward compatibility is desired during the deprecation period, the parameter should be passed through to OnnxEmbeddingFunction, or alternatively, the ValueError for unsupported values should be raised before the deprecation warning.

Suggested change
dimension=self._DIMENSION,
dimension=self._DIMENSION,
preferred_providers=preferred_providers,

Copilot uses AI. Check for mistakes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 164-176: The constructor currently warns that preferred_providers
is deprecated but does not forward that parameter to the OnnxEmbeddingFunction,
dropping caller selection; update the OnnxEmbeddingFunction instantiation in the
embedding function (the block that creates self._onnx) to pass the
preferred_providers argument through (i.e., include
preferred_providers=preferred_providers when calling OnnxEmbeddingFunction) so
existing callers retain provider selection behavior; ensure the variable name
matches the parameter on the parent constructor and OnnxEmbeddingFunction
signature.

In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py`:
- Around line 294-323: The InferenceSession call in onnx_embedding_function
currently hardcodes providers=["CPUExecutionProvider"] and thus ignores the
validated self._preferred_providers; update the providers argument in the
self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it
is a list/sequence) so the session respects user-selected/available providers
(note the code already removes "CoreMLExecutionProvider" from
self._preferred_providers). Use self._preferred_providers as the providers
parameter instead of the hardcoded CPUExecutionProvider.
- Around line 173-186: The broad "except Exception" should be narrowed to the
expected failure types so unexpected bugs bubble up; replace the generic except
block in the download flow (the block after the existing except
httpx.HTTPStatusError) with an explicit except (httpx.RequestError, OSError) as
e (or include PermissionError if desired), keep the same logger.warning and
local_path cleanup logic, and ensure httpx is imported so httpx.RequestError is
available; leave the existing httpx.HTTPStatusError handler unchanged.
- Around line 54-74: The constructor and other raise sites in
ONNXEmbeddingFunction currently raise ValueError with inline messages; create
specific exception classes (e.g., InvalidModelNameError, InvalidHFModelIDError,
InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the
appropriate default message, then replace raises that pass message strings (the
checks referencing model_name, hf_model_id, dimension, preferred_providers and
the other flagged raise sites later in the file) with raises of those exception
classes without arguments; ensure all other flagged raise sites in this module
use their own dedicated exception classes with messages defined in the exception
__init__ rather than inline messages at the raise call.

In `@tests/unit_tests/test_onnx_embedding_function.py`:
- Around line 12-18: The helper _make_onnx() should guard optional dependencies
via pytest.importorskip so tests that call it don't fail when
onnxruntime/tokenizers/tqdm are absent; update _make_onnx() to call
pytest.importorskip for the packages imported in OnnxEmbeddingFunction.__init__
(onnxruntime, tokenizers, tqdm) before constructing OnnxEmbeddingFunction and
then remove redundant per-test importorskip calls in test_dimension_property and
test_call_empty_returns_empty (leave test_call_generates_embeddings as-is or
remove its duplicate skips), referencing the helper _make_onnx and the tests
test_dimension_property and test_call_empty_returns_empty to locate changes.

Comment on lines +164 to +176
if preferred_providers:
warnings.warn(
"preferred_providers is deprecated and will be removed in a future version. "
"Use the preferred_providers argument of OnnxEmbeddingFunction instead.",
DeprecationWarning,
stacklevel=2,
)
self.model_name = self.MODEL_NAME
self._onnx = OnnxEmbeddingFunction(
model_name=self.MODEL_NAME,
hf_model_id=self.HF_MODEL_ID,
dimension=self._DIMENSION,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/client/embedding_function.py | sed -n '150,185p'

Repository: oceanbase/pyseekdb

Length of output: 1786


🏁 Script executed:

# Find the OnnxEmbeddingFunction class definition to check its constructor
rg "class OnnxEmbeddingFunction" -A 30

Repository: oceanbase/pyseekdb

Length of output: 3078


🏁 Script executed:

# Check the constructor signature more carefully
rg "def __init__" src/pyseekdb/client/embedding_function.py -A 15 -B 2

Repository: oceanbase/pyseekdb

Length of output: 2511


Pass preferred_providers through to OnnxEmbeddingFunction.

The deprecation warning directs users to use OnnxEmbeddingFunction's preferred_providers argument, but the parameter is never passed through, effectively dropping it and breaking backward compatibility for existing callers who rely on provider selection.

Proposed fix
         self._onnx = OnnxEmbeddingFunction(
             model_name=self.MODEL_NAME,
             hf_model_id=self.HF_MODEL_ID,
             dimension=self._DIMENSION,
+            preferred_providers=preferred_providers,
         )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if preferred_providers:
warnings.warn(
"preferred_providers is deprecated and will be removed in a future version. "
"Use the preferred_providers argument of OnnxEmbeddingFunction instead.",
DeprecationWarning,
stacklevel=2,
)
self.model_name = self.MODEL_NAME
self._onnx = OnnxEmbeddingFunction(
model_name=self.MODEL_NAME,
hf_model_id=self.HF_MODEL_ID,
dimension=self._DIMENSION,
)
if preferred_providers:
warnings.warn(
"preferred_providers is deprecated and will be removed in a future version. "
"Use the preferred_providers argument of OnnxEmbeddingFunction instead.",
DeprecationWarning,
stacklevel=2,
)
self.model_name = self.MODEL_NAME
self._onnx = OnnxEmbeddingFunction(
model_name=self.MODEL_NAME,
hf_model_id=self.HF_MODEL_ID,
dimension=self._DIMENSION,
preferred_providers=preferred_providers,
)
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/embedding_function.py` around lines 164 - 176, The
constructor currently warns that preferred_providers is deprecated but does not
forward that parameter to the OnnxEmbeddingFunction, dropping caller selection;
update the OnnxEmbeddingFunction instantiation in the embedding function (the
block that creates self._onnx) to pass the preferred_providers argument through
(i.e., include preferred_providers=preferred_providers when calling
OnnxEmbeddingFunction) so existing callers retain provider selection behavior;
ensure the variable name matches the parameter on the parent constructor and
OnnxEmbeddingFunction signature.

Comment on lines +54 to +74
if not model_name:
raise ValueError("model_name must be a non-empty string")
if not hf_model_id:
raise ValueError("hf_model_id must be a non-empty string")
if dimension <= 0:
raise ValueError("dimension must be a positive integer")

self.model_name = model_name
self.hf_model_id = hf_model_id
self._dimension = dimension
self.download_path = (
download_path
if download_path is not None
else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name
)

# Validate preferred_providers
if preferred_providers and not all(isinstance(i, str) for i in preferred_providers):
raise ValueError("Preferred providers must be a list of strings")
if preferred_providers and len(preferred_providers) != len(set(preferred_providers)):
raise ValueError("Preferred providers must be unique")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the file and check its size
find . -name "onnx_embedding_function.py" -type f

Repository: oceanbase/pyseekdb

Length of output: 129


🏁 Script executed:

# Read the relevant sections of the file
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -100

Repository: oceanbase/pyseekdb

Length of output: 3914


🏁 Script executed:

# Also check lines 220-230, 300-310, 350-360 to see other violations mentioned
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '220,230p;300,310p;350,360p'

Repository: oceanbase/pyseekdb

Length of output: 2056


🌐 Web query:

Ruff TRY003 rule error inline exception message

💡 Result:

TRY003 in Ruff is raise-vanilla-args: it flags raise MyError("...long message...") / raise MyError(f"...") when the message is defined at the raise site instead of inside the exception class. The intent is to keep exception messages consistent and make the exception reusable. [1][2]

Typical triggering example (bad):

class CantBeNegative(Exception): ...

def foo(x):
    if x < 0:
        raise CantBeNegative(f"{x} is negative")  # TRY003

Fix (recommended): define the message in the exception class and pass args:

class CantBeNegative(Exception):
    def __init__(self, number: int) -> None:
        super().__init__(f"{number} is negative")

def foo(x):
    if x < 0:
        raise CantBeNegative(x)

[1]

If you intentionally want an inline message: suppress it on that line:

raise CantBeNegative(f"{x} is negative")  # noqa: TRY003

Ruff supports inline # noqa: <codes> comments. [3]


🏁 Script executed:

# Look for ruff configuration
find . -name "pyproject.toml" -o -name "ruff.toml" | head -5

Repository: oceanbase/pyseekdb

Length of output: 116


🏁 Script executed:

# Check if ruff config exists and look for TRY003
if [ -f pyproject.toml ]; then
  echo "=== pyproject.toml ===" 
  grep -A 20 "\[tool.ruff" pyproject.toml || echo "No [tool.ruff] section"
fi
if [ -f ruff.toml ]; then
  echo "=== ruff.toml ===" 
  cat ruff.toml
fi

Repository: oceanbase/pyseekdb

Length of output: 1604


🏁 Script executed:

# Verify that TRY003 is included in the TRY ruleset
cat -n ruff.toml | grep -A 2 -B 2 "TRY"

Repository: oceanbase/pyseekdb

Length of output: 218


🌐 Web query:

Ruff TRY003 raise-vanilla-args ruleset tryceratops included rules

💡 Result:

In Ruff, TRY003 is the tryceratops rule raise-vanilla-args (“Avoid specifying long messages outside the exception class”). It flags cases where you build/format a long error message at the raise site instead of defining the message inside the exception class (to keep exceptions reusable and messages consistent). [2]

The tryceratops ruleset (TRY) included in Ruff currently consists of these rule codes: TRY002, TRY003, TRY004, TRY200, TRY201, TRY203, TRY300, TRY301, TRY400, TRY401. [1]


TRY003 violations must be fixed by moving exception messages into exception classes.

Ruff's TRY003 rule (raise-vanilla-args) is enabled via the TRY ruleset in ruff.toml and flags inline messages at lines 55, 57, 59, 72, 74, 224-226, 302-304, and 352-357. Move these messages into custom exception class __init__ methods rather than specifying them at raise sites.

Example pattern (apply similarly to the other raise sites)
+class InvalidModelNameError(ValueError):
+    def __init__(self) -> None:
+        super().__init__("model_name must be a non-empty string")
@@
-        if not model_name:
-            raise ValueError("model_name must be a non-empty string")
+        if not model_name:
+            raise InvalidModelNameError()

Also applies to: 224-226, 302-304, 352-357

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 55-55: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 57-57: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 59-59: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 72-72: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 74-74: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 54 - 74, The constructor and other raise sites in ONNXEmbeddingFunction
currently raise ValueError with inline messages; create specific exception
classes (e.g., InvalidModelNameError, InvalidHFModelIDError,
InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the
appropriate default message, then replace raises that pass message strings (the
checks referencing model_name, hf_model_id, dimension, preferred_providers and
the other flagged raise sites later in the file) with raises of those exception
classes without arguments; ensure all other flagged raise sites in this module
use their own dedicated exception classes with messages defined in the exception
__init__ rather than inline messages at the raise call.

Comment on lines +173 to +186
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")
return False
logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")
if os.path.exists(local_path):
os.remove(local_path)
return False
except Exception as e:
logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
# If download fails, try to delete partially downloaded file
if os.path.exists(local_path):
os.remove(local_path)
return False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '150,200p'

Repository: oceanbase/pyseekdb

Length of output: 3006


🏁 Script executed:

# Get more context around the download function
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '130,220p'

Repository: oceanbase/pyseekdb

Length of output: 4803


🏁 Script executed:

# Check what exceptions httpx can raise
rg "import httpx|from httpx" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 2

Repository: oceanbase/pyseekdb

Length of output: 120


🏁 Script executed:

grep -n "_download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -20

Repository: oceanbase/pyseekdb

Length of output: 564


🏁 Script executed:

# Find the _download method definition
ast-grep --pattern 'def _download($_) { $$$ }'

Repository: oceanbase/pyseekdb

Length of output: 44


🏁 Script executed:

# Search for _download method in the file
rg "def _download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 20

Repository: oceanbase/pyseekdb

Length of output: 2625


🏁 Script executed:

python3 << 'EOF'
try:
    import httpx
    # Check exception hierarchy
    print("httpx.HTTPError exists:", hasattr(httpx, 'HTTPError'))
    print("httpx.HTTPStatusError exists:", hasattr(httpx, 'HTTPStatusError'))
    print("httpx.RequestError exists:", hasattr(httpx, 'RequestError'))
    
    # Check MRO if HTTPStatusError exists
    if hasattr(httpx, 'HTTPStatusError'):
        print("\nhttpx.HTTPStatusError MRO:")
        for cls in httpx.HTTPStatusError.__mro__:
            print(f"  - {cls}")
    
    if hasattr(httpx, 'RequestError'):
        print("\nhttpx.RequestError MRO:")
        for cls in httpx.RequestError.__mro__:
            print(f"  - {cls}")
            
except ImportError:
    print("httpx not installed in this environment")
EOF

Repository: oceanbase/pyseekdb

Length of output: 485


Narrow the broad Exception catch to specific expected exceptions.

The blanket except Exception at line 181 can hide unexpected bugs and violates the BLE001 linting rule. Since _download() can raise httpx.HTTPStatusError (already caught at line 173), httpx.RequestError, and OSError from file operations, catch only those specific exceptions and let unexpected errors surface for debugging.

Proposed fix
-                except Exception as e:
+                except (httpx.HTTPError, OSError) as e:
                     logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
                     # If download fails, try to delete partially downloaded file
                     if os.path.exists(local_path):
                         os.remove(local_path)
                     return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")
return False
logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")
if os.path.exists(local_path):
os.remove(local_path)
return False
except Exception as e:
logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
# If download fails, try to delete partially downloaded file
if os.path.exists(local_path):
os.remove(local_path)
return False
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")
return False
logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")
if os.path.exists(local_path):
os.remove(local_path)
return False
except (httpx.HTTPError, OSError) as e:
logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
# If download fails, try to delete partially downloaded file
if os.path.exists(local_path):
os.remove(local_path)
return False
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 181-181: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 173 - 186, The broad "except Exception" should be narrowed to the expected
failure types so unexpected bugs bubble up; replace the generic except block in
the download flow (the block after the existing except httpx.HTTPStatusError)
with an explicit except (httpx.RequestError, OSError) as e (or include
PermissionError if desired), keep the same logger.warning and local_path cleanup
logic, and ensure httpx is imported so httpx.RequestError is available; leave
the existing httpx.HTTPStatusError handler unchanged.

Comment on lines +294 to +323
if self._preferred_providers is None or len(self._preferred_providers) == 0:
if len(self.ort.get_available_providers()) > 0:
logger.debug(
f"WARNING: No ONNX providers provided, defaulting to available providers: "
f"{self.ort.get_available_providers()}"
)
self._preferred_providers = self.ort.get_available_providers()
elif not set(self._preferred_providers).issubset(set(self.ort.get_available_providers())):
raise ValueError(
f"Preferred providers must be subset of available providers: {self.ort.get_available_providers()}"
)

# Create minimal session options to avoid issues
so = self.ort.SessionOptions()
so.log_severity_level = 3
# Disable all optimizations that might cause issues
so.graph_optimization_level = self.ort.GraphOptimizationLevel.ORT_DISABLE_ALL
so.execution_mode = self.ort.ExecutionMode.ORT_SEQUENTIAL
so.inter_op_num_threads = 1
so.intra_op_num_threads = 1

if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers:
# remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU.
self._preferred_providers.remove("CoreMLExecutionProvider")

return self.ort.InferenceSession(
os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"),
# Force CPU execution provider to avoid provider issues
providers=["CPUExecutionProvider"],
sess_options=so,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -n 330 | tail -n +270

Repository: oceanbase/pyseekdb

Length of output: 3075


🏁 Script executed:

head -n 100 src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 3214


🏁 Script executed:

# Search for other references to _preferred_providers or preferred_providers in the file
rg "preferred_providers" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -n

Repository: oceanbase/pyseekdb

Length of output: 930


🏁 Script executed:

# Search for any TODO comments or issues related to this in the entire file
cat src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | grep -i -E "todo|fixme|issue|bug|workaround" -C 2

Repository: oceanbase/pyseekdb

Length of output: 1030


🏁 Script executed:

# Search for any related issues or discussions in code comments
rg "provider" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -B 2 -A 2

Repository: oceanbase/pyseekdb

Length of output: 2256


Use self._preferred_providers instead of hardcoding CPU provider.

The preferred_providers parameter is validated and defaults to available providers, but is completely ignored when creating the InferenceSession. Pass self._preferred_providers to the providers argument so users can leverage GPU or other accelerators as specified.

Proposed fix
         return self.ort.InferenceSession(
             os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"),
-            # Force CPU execution provider to avoid provider issues
-            providers=["CPUExecutionProvider"],
+            providers=self._preferred_providers,
             sess_options=so,
         )
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 302-304: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 294 - 323, The InferenceSession call in onnx_embedding_function currently
hardcodes providers=["CPUExecutionProvider"] and thus ignores the validated
self._preferred_providers; update the providers argument in the
self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it
is a list/sequence) so the session respects user-selected/available providers
(note the code already removes "CoreMLExecutionProvider" from
self._preferred_providers). Use self._preferred_providers as the providers
parameter instead of the hardcoded CPUExecutionProvider.

Comment on lines +12 to +18
def _make_onnx() -> OnnxEmbeddingFunction:
return OnnxEmbeddingFunction(
model_name="all-MiniLM-L6-v2",
hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
dimension=384,
preferred_providers=None,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and read the test file
find . -name "test_onnx_embedding_function.py" -type f

Repository: oceanbase/pyseekdb

Length of output: 112


🏁 Script executed:

# Look for the OnnxEmbeddingFunction class to understand what it imports
find . -name "*.py" -type f | xargs grep -l "class OnnxEmbeddingFunction"

Repository: oceanbase/pyseekdb

Length of output: 129


🏁 Script executed:

# Read the test file to see the current state
cat -n ./tests/unit_tests/test_onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 2809


🏁 Script executed:

# Read the OnnxEmbeddingFunction implementation
cat -n ./src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 18457


Guard _make_onnx() with importorskip for optional deps.

OnnxEmbeddingFunction.__init__ imports optional packages (onnxruntime, tokenizers, tqdm) at lines 79-81. The tests test_dimension_property and test_call_empty_returns_empty will fail when these dependencies are absent because they call _make_onnx() without any skip guards, while test_call_generates_embeddings has them. Centralize the skips inside _make_onnx() and remove the redundant per-test skips.

Proposed fix
 def _make_onnx() -> OnnxEmbeddingFunction:
+    pytest.importorskip("onnxruntime")
+    pytest.importorskip("tokenizers")
+    pytest.importorskip("tqdm")
     return OnnxEmbeddingFunction(
         model_name="all-MiniLM-L6-v2",
         hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
         dimension=384,
         preferred_providers=None,
     )
@@
 def test_call_generates_embeddings() -> None:
-    pytest.importorskip("onnxruntime")
-    pytest.importorskip("tokenizers")
-    pytest.importorskip("tqdm")
-
     ef = _make_onnx()
     embeddings = ef("hello world")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _make_onnx() -> OnnxEmbeddingFunction:
return OnnxEmbeddingFunction(
model_name="all-MiniLM-L6-v2",
hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
dimension=384,
preferred_providers=None,
)
def _make_onnx() -> OnnxEmbeddingFunction:
pytest.importorskip("onnxruntime")
pytest.importorskip("tokenizers")
pytest.importorskip("tqdm")
return OnnxEmbeddingFunction(
model_name="all-MiniLM-L6-v2",
hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
dimension=384,
preferred_providers=None,
)
🤖 Prompt for AI Agents
In `@tests/unit_tests/test_onnx_embedding_function.py` around lines 12 - 18, The
helper _make_onnx() should guard optional dependencies via pytest.importorskip
so tests that call it don't fail when onnxruntime/tokenizers/tqdm are absent;
update _make_onnx() to call pytest.importorskip for the packages imported in
OnnxEmbeddingFunction.__init__ (onnxruntime, tokenizers, tqdm) before
constructing OnnxEmbeddingFunction and then remove redundant per-test
importorskip calls in test_dimension_property and test_call_empty_returns_empty
(leave test_call_generates_embeddings as-is or remove its duplicate skips),
referencing the helper _make_onnx and the tests test_dimension_property and
test_call_empty_returns_empty to locate changes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@tests/integration_tests/conftest.py`:
- Around line 13-18: The import line "import pyseekdb  # noqa: E402" contains an
unused noqa directive; remove the " # noqa: E402" suffix so the line becomes a
normal import (keep the existing sys.path manipulation with repo_root, src_root,
and sys.path.insert as-is) to satisfy RUF100 and eliminate the unused noqa.

Comment on lines +13 to 18
# Add project path (repo root + src)
repo_root = Path(__file__).resolve().parents[2]
src_root = repo_root / "src"
sys.path.insert(0, str(src_root))

import pyseekdb # noqa: E402

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused noqa to satisfy RUF100.

Ruff reports the # noqa: E402 is unused (E402 not enabled), so the directive should be removed.

🔧 Suggested fix
-import pyseekdb  # noqa: E402
+import pyseekdb
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Add project path (repo root + src)
repo_root = Path(__file__).resolve().parents[2]
src_root = repo_root / "src"
sys.path.insert(0, str(src_root))
import pyseekdb # noqa: E402
# Add project path (repo root + src)
repo_root = Path(__file__).resolve().parents[2]
src_root = repo_root / "src"
sys.path.insert(0, str(src_root))
import pyseekdb
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 18-18: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In `@tests/integration_tests/conftest.py` around lines 13 - 18, The import line
"import pyseekdb  # noqa: E402" contains an unused noqa directive; remove the "
# noqa: E402" suffix so the line becomes a normal import (keep the existing
sys.path manipulation with repo_root, src_root, and sys.path.insert as-is) to
satisfy RUF100 and eliminate the unused noqa.

@hnwyllmm hnwyllmm merged commit 38e58bf into oceanbase:develop Feb 2, 2026
8 checks passed
@hnwyllmm hnwyllmm deleted the refactor/default-embedding branch February 2, 2026 08:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement]: refactor DefaultEmbeddingFunction class

2 participants