fix ruff check by hnwyllmm · Pull Request #162 · oceanbase/pyseekdb

hnwyllmm · 2026-02-02T06:36:16Z

Summary

close #137
refactor DefaultEmbeddingFunction.
Split it into two classes: OnnxEmbeddingFunction and DefaultEmbeddingFunction.
DefaultEmbeddingFunction worked based on OnnxEmbeddingFunction. OnnxEmbeddingFunction can execute model based on ONNX.

Summary by CodeRabbit

New Features
- Added a standalone ONNX-based embedding implementation and exposed it as an embedding option.
Refactor
- Default embedding now delegates embedding work to the new ONNX embedding path for simpler, modular behavior.
Breaking Change
- Removed several public embedding-related attributes and updated the default embedding constructor signature.
Tests
- Added unit tests for the ONNX embedding and adjusted test startup path handling.

coderabbitai · 2026-02-02T06:36:31Z

📝 Walkthrough

Walkthrough

Extracted ONNX download/tokenizer/runtime logic into a new OnnxEmbeddingFunction; DefaultEmbeddingFunction now initializes and delegates to an internal OnnxEmbeddingFunction instance and simplified its public surface and attributes.

Changes

Cohort / File(s)	Summary
Refactored Default Embedding Function `src/pyseekdb/client/embedding_function.py`	Removed in-file ONNX download/tokenizer/runtime setup and public constants; updated constructor to accept `model_name` and `preferred_providers`; internalized model IDs (`_MODEL_NAME`, `_HF_MODEL_ID`) and delegates embedding calls to `self._onnx(...)`.
New ONNX Embedding Implementation `src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py`, `src/pyseekdb/utils/embedding_functions/__init__.py`	Added `OnnxEmbeddingFunction` with lazy imports, Hugging Face download (HF_ENDPOINT support), tokenizer & ONNX session caching, provider handling, input preprocessing, mean-pooling inference, robust error/log handling; re-exported in package `__all__`.
Tests and Test Setup `tests/unit_tests/test_onnx_embedding_function.py`, `tests/integration_tests/conftest.py`	Added unit tests for `OnnxEmbeddingFunction` (validation, dimension, output shape); test conftest now inserts `repo_root/src` into `sys.path` and sets default SEEKDB_PATH to `repo_root / "seekdb.db"`.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DefaultEmbeddingFunction
    participant OnnxEmbeddingFunction
    participant Tokenizer
    participant ONNXRuntime
    participant HuggingFace

    Client->>DefaultEmbeddingFunction: request embeddings(documents)
    DefaultEmbeddingFunction->>OnnxEmbeddingFunction: __call__(documents)
    OnnxEmbeddingFunction->>OnnxEmbeddingFunction: ensure model/tokenizer downloaded
    OnnxEmbeddingFunction->>Tokenizer: tokenize(documents) (cached)
    OnnxEmbeddingFunction->>ONNXRuntime: run model(inputs) (cached session)
    ONNXRuntime-->>OnnxEmbeddingFunction: model outputs
    OnnxEmbeddingFunction-->>OnnxEmbeddingFunction: mean-pool using attention_mask
    OnnxEmbeddingFunction-->>DefaultEmbeddingFunction: embeddings
    DefaultEmbeddingFunction-->>Client: return embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Support more embedding function platforms #132: Modifies embedding-functions and client embedding wiring; likely overlaps with this refactor and Onnx delegation.

Poem

🐇 I tunneled through code, found a neat seam,

Split model and duties into a tidy team.
ONNX hums, tokenizer sings in line,
Default just calls — and embeddings shine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'fix ruff check' is vague and generic, not describing the main change of refactoring DefaultEmbeddingFunction into separate classes.	Consider a more descriptive title like 'Refactor DefaultEmbeddingFunction to separate ONNX concerns' to clearly communicate the primary change.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	The PR successfully implements all objectives from issue `#137`: introduces OnnxEmbeddingFunction to encapsulate ONNX model execution, refactors DefaultEmbeddingFunction to delegate to it, and maintains support for all-MiniLM-L6-v2 model.
Out of Scope Changes check	✅ Passed	Changes are focused on refactoring DefaultEmbeddingFunction and introducing OnnxEmbeddingFunction as required. Test files and configuration updates support the refactoring; all changes align with the stated objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR refactors the DefaultEmbeddingFunction class by extracting the ONNX-based embedding logic into a new generic OnnxEmbeddingFunction class. The refactoring aims to separate concerns and make the ONNX runtime functionality reusable for other embedding models beyond the default all-MiniLM-L6-v2.

Changes:

Created a new OnnxEmbeddingFunction class that handles ONNX model downloading, loading, and inference
Refactored DefaultEmbeddingFunction to delegate to OnnxEmbeddingFunction while maintaining backward compatibility
Added deprecation warnings for the preferred_providers parameter in DefaultEmbeddingFunction
Added unit tests for the new OnnxEmbeddingFunction class

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
`src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py`	New generic ONNX embedding function that handles model download, tokenization, and inference for ONNX-based models
`src/pyseekdb/client/embedding_function.py`	Refactored `DefaultEmbeddingFunction` to use composition, delegating to `OnnxEmbeddingFunction` while maintaining backward compatibility
`tests/unit_tests/test_onnx_embedding_function.py`	New unit tests covering initialization validation and basic functionality of `OnnxEmbeddingFunction`

Copilot · 2026-02-02T06:41:40Z

+            # Force CPU execution provider to avoid provider issues
+            providers=["CPUExecutionProvider"],


The preferred_providers parameter is validated and stored, but then ignored when creating the InferenceSession. Line 322 hardcodes providers to ["CPUExecutionProvider"], overriding any user-specified providers. Either the preferred_providers functionality should be removed if it's not supported, or the hardcoded providers should be replaced with self._preferred_providers. The current implementation is misleading to users who might expect their provider preferences to be respected.

Suggested change

# Force CPU execution provider to avoid provider issues

providers=["CPUExecutionProvider"],

providers=self._preferred_providers,

Copilot · 2026-02-02T06:41:41Z

+def test_init_validates_preferred_providers() -> None:
+    with pytest.raises(ValueError, match="Preferred providers must be a list of strings"):
+        OnnxEmbeddingFunction(model_name="test", hf_model_id="org/test", dimension=3, preferred_providers=[1])  # type: ignore[list-item]
+    with pytest.raises(ValueError, match="Preferred providers must be unique"):
+        OnnxEmbeddingFunction(
+            model_name="test",
+            hf_model_id="org/test",
+            dimension=3,
+            preferred_providers=["CPUExecutionProvider", "CPUExecutionProvider"],
+        )


The test validates the preferred_providers parameter but doesn't test the actual functionality of using different providers. Given that the model() method hardcodes providers to ["CPUExecutionProvider"] (line 322 in onnx_embedding_function.py), these validation tests are misleading as they suggest the feature works when it actually doesn't. Consider either removing these tests or fixing the implementation to actually use the preferred providers.

Copilot · 2026-02-02T06:41:41Z

+            model_name: str = "all-MiniLM-L6-v2",  # Deprecated. Will be removed in a future version.
+            preferred_providers: list[str] | None = None,  # Deprecated. Will be removed in a future version.
+                                # The preferred ONNX runtime providers. Defaults to None (uses available providers).


The Args section of the docstring has incorrect formatting. The default values and comments are placed inline with the parameter names instead of in the description. The docstring should follow standard format with parameter names on one line and descriptions on the following lines. For example: 'model_name: Deprecated parameter. Will be removed in a future version. Currently only "all-MiniLM-L6-v2" is supported.'

Suggested change

model_name: str = "all-MiniLM-L6-v2", # Deprecated. Will be removed in a future version.

preferred_providers: list[str] | None = None, # Deprecated. Will be removed in a future version.

# The preferred ONNX runtime providers. Defaults to None (uses available providers).

model_name:

Deprecated parameter. Will be removed in a future version.

Currently only "all-MiniLM-L6-v2" is supported.

preferred_providers:

Deprecated parameter. Will be removed in a future version.

Preferred ONNX runtime providers. Defaults to None (uses available providers).

Use the preferred_providers argument of OnnxEmbeddingFunction instead.

Copilot · 2026-02-02T06:41:41Z

+    """
+
+    EXTRACTED_FOLDER_NAME = "onnx"
+    ARCHIVE_FILENAME = "onnx.tar.gz"


The ARCHIVE_FILENAME class constant is defined but never used anywhere in the class. This appears to be leftover from a previous implementation that used tar.gz archives instead of direct file downloads. Consider removing this unused constant to avoid confusion.

Suggested change

ARCHIVE_FILENAME = "onnx.tar.gz"

Copilot · 2026-02-02T06:41:42Z

+    pytest.importorskip("onnxruntime")
+    pytest.importorskip("tokenizers")
+    pytest.importorskip("tqdm")


The test uses pytest.importorskip but these imports are mandatory dependencies in pyproject.toml (onnxruntime>=1.19.0, tokenizers>=0.15.0, tqdm). Using importorskip is appropriate for optional dependencies, but for required dependencies, the test should fail hard if they're missing rather than skip. This inconsistency could mask installation issues.

Copilot · 2026-02-02T06:41:42Z

+            download_path
+            if download_path is not None
+            else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name


The download_path is stored as a Path object but then used with os.path.join throughout the code (lines 147, 278, 320, 338). While os.path.join does accept Path objects in Python 3.6+, it's inconsistent to mix Path and os.path APIs. Consider either using Path methods consistently (download_path / EXTRACTED_FOLDER_NAME) or converting download_path to a string. The current approach reduces code clarity and could cause confusion.

Suggested change

download_path

if download_path is not None

else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name

str(download_path)

if download_path is not None

else str(Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name)

Copilot · 2026-02-02T06:41:42Z

+class OnnxEmbeddingFunction:
+    """
+    Generic ONNX runtime embedding function.
+
+    This class handles model download, tokenizer/model loading, and embedding
+    generation using onnxruntime.
+    """


The OnnxEmbeddingFunction class does not implement the persistence methods required by the EmbeddingFunction protocol (name(), get_config(), and build_from_config()). If this class is intended to be used directly by users as suggested in the deprecation warning in DefaultEmbeddingFunction, it should implement these methods to support collection persistence. Without these methods, collections using OnnxEmbeddingFunction directly cannot be persisted and restored.

Copilot · 2026-02-02T06:41:42Z

+        if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers:
+            # remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU.
+            self._preferred_providers.remove("CoreMLExecutionProvider")


The code modifies the _preferred_providers list in place by calling .remove("CoreMLExecutionProvider"). If this method is called multiple times (which could happen with cached_property in some edge cases), it will attempt to remove the same item twice and raise a ValueError. Consider using a list comprehension to create a new list without the unwanted provider instead: providers = [p for p in self._preferred_providers if p != "CoreMLExecutionProvider"]

Copilot · 2026-02-02T06:41:42Z

+        self._onnx = OnnxEmbeddingFunction(
+            model_name=self.MODEL_NAME,
+            hf_model_id=self.HF_MODEL_ID,
+            dimension=self._DIMENSION,


The preferred_providers parameter is deprecated but still accepted in the init method. However, it's never passed to the underlying OnnxEmbeddingFunction. This means that even during the deprecation period, the parameter has no effect. If backward compatibility is desired during the deprecation period, the parameter should be passed through to OnnxEmbeddingFunction, or alternatively, the ValueError for unsupported values should be raised before the deprecation warning.

Suggested change

dimension=self._DIMENSION,

dimension=self._DIMENSION,

preferred_providers=preferred_providers,

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@src/pyseekdb/client/embedding_function.py`:
- Around line 164-176: The constructor currently warns that preferred_providers
is deprecated but does not forward that parameter to the OnnxEmbeddingFunction,
dropping caller selection; update the OnnxEmbeddingFunction instantiation in the
embedding function (the block that creates self._onnx) to pass the
preferred_providers argument through (i.e., include
preferred_providers=preferred_providers when calling OnnxEmbeddingFunction) so
existing callers retain provider selection behavior; ensure the variable name
matches the parameter on the parent constructor and OnnxEmbeddingFunction
signature.

In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py`:
- Around line 294-323: The InferenceSession call in onnx_embedding_function
currently hardcodes providers=["CPUExecutionProvider"] and thus ignores the
validated self._preferred_providers; update the providers argument in the
self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it
is a list/sequence) so the session respects user-selected/available providers
(note the code already removes "CoreMLExecutionProvider" from
self._preferred_providers). Use self._preferred_providers as the providers
parameter instead of the hardcoded CPUExecutionProvider.
- Around line 173-186: The broad "except Exception" should be narrowed to the
expected failure types so unexpected bugs bubble up; replace the generic except
block in the download flow (the block after the existing except
httpx.HTTPStatusError) with an explicit except (httpx.RequestError, OSError) as
e (or include PermissionError if desired), keep the same logger.warning and
local_path cleanup logic, and ensure httpx is imported so httpx.RequestError is
available; leave the existing httpx.HTTPStatusError handler unchanged.
- Around line 54-74: The constructor and other raise sites in
ONNXEmbeddingFunction currently raise ValueError with inline messages; create
specific exception classes (e.g., InvalidModelNameError, InvalidHFModelIDError,
InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the
appropriate default message, then replace raises that pass message strings (the
checks referencing model_name, hf_model_id, dimension, preferred_providers and
the other flagged raise sites later in the file) with raises of those exception
classes without arguments; ensure all other flagged raise sites in this module
use their own dedicated exception classes with messages defined in the exception
__init__ rather than inline messages at the raise call.

In `@tests/unit_tests/test_onnx_embedding_function.py`:
- Around line 12-18: The helper _make_onnx() should guard optional dependencies
via pytest.importorskip so tests that call it don't fail when
onnxruntime/tokenizers/tqdm are absent; update _make_onnx() to call
pytest.importorskip for the packages imported in OnnxEmbeddingFunction.__init__
(onnxruntime, tokenizers, tqdm) before constructing OnnxEmbeddingFunction and
then remove redundant per-test importorskip calls in test_dimension_property and
test_call_empty_returns_empty (leave test_call_generates_embeddings as-is or
remove its duplicate skips), referencing the helper _make_onnx and the tests
test_dimension_property and test_call_empty_returns_empty to locate changes.

coderabbitai · 2026-02-02T06:47:11Z

+        if preferred_providers:
+            warnings.warn(
+                "preferred_providers is deprecated and will be removed in a future version. "
+                "Use the preferred_providers argument of OnnxEmbeddingFunction instead.",
+                DeprecationWarning,
+                stacklevel=2,
+            )
+        self.model_name = self.MODEL_NAME
+        self._onnx = OnnxEmbeddingFunction(
+            model_name=self.MODEL_NAME,
+            hf_model_id=self.HF_MODEL_ID,
+            dimension=self._DIMENSION,
+        )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/client/embedding_function.py | sed -n '150,185p'

Repository: oceanbase/pyseekdb

Length of output: 1786

🏁 Script executed:

# Find the OnnxEmbeddingFunction class definition to check its constructor rg "class OnnxEmbeddingFunction" -A 30

Repository: oceanbase/pyseekdb

Length of output: 3078

🏁 Script executed:

# Check the constructor signature more carefully rg "def __init__" src/pyseekdb/client/embedding_function.py -A 15 -B 2

Repository: oceanbase/pyseekdb

Length of output: 2511

Pass preferred_providers through to OnnxEmbeddingFunction.

The deprecation warning directs users to use OnnxEmbeddingFunction's preferred_providers argument, but the parameter is never passed through, effectively dropping it and breaking backward compatibility for existing callers who rely on provider selection.

Proposed fix

self._onnx = OnnxEmbeddingFunction( model_name=self.MODEL_NAME, hf_model_id=self.HF_MODEL_ID, dimension=self._DIMENSION, + preferred_providers=preferred_providers, )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if preferred_providers:

warnings.warn(

"preferred_providers is deprecated and will be removed in a future version. "

"Use the preferred_providers argument of OnnxEmbeddingFunction instead.",

DeprecationWarning,

stacklevel=2,

)

self.model_name = self.MODEL_NAME

self._onnx = OnnxEmbeddingFunction(

model_name=self.MODEL_NAME,

hf_model_id=self.HF_MODEL_ID,

dimension=self._DIMENSION,

)

if preferred_providers:

warnings.warn(

"preferred_providers is deprecated and will be removed in a future version. "

"Use the preferred_providers argument of OnnxEmbeddingFunction instead.",

DeprecationWarning,

stacklevel=2,

)

self.model_name = self.MODEL_NAME

self._onnx = OnnxEmbeddingFunction(

model_name=self.MODEL_NAME,

hf_model_id=self.HF_MODEL_ID,

dimension=self._DIMENSION,

preferred_providers=preferred_providers,

)

🤖 Prompt for AI Agents

In `@src/pyseekdb/client/embedding_function.py` around lines 164 - 176, The constructor currently warns that preferred_providers is deprecated but does not forward that parameter to the OnnxEmbeddingFunction, dropping caller selection; update the OnnxEmbeddingFunction instantiation in the embedding function (the block that creates self._onnx) to pass the preferred_providers argument through (i.e., include preferred_providers=preferred_providers when calling OnnxEmbeddingFunction) so existing callers retain provider selection behavior; ensure the variable name matches the parameter on the parent constructor and OnnxEmbeddingFunction signature.

coderabbitai · 2026-02-02T06:47:11Z

+        if not model_name:
+            raise ValueError("model_name must be a non-empty string")
+        if not hf_model_id:
+            raise ValueError("hf_model_id must be a non-empty string")
+        if dimension <= 0:
+            raise ValueError("dimension must be a positive integer")
+
+        self.model_name = model_name
+        self.hf_model_id = hf_model_id
+        self._dimension = dimension
+        self.download_path = (
+            download_path
+            if download_path is not None
+            else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name
+        )
+
+        # Validate preferred_providers
+        if preferred_providers and not all(isinstance(i, str) for i in preferred_providers):
+            raise ValueError("Preferred providers must be a list of strings")
+        if preferred_providers and len(preferred_providers) != len(set(preferred_providers)):
+            raise ValueError("Preferred providers must be unique")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find the file and check its size find . -name "onnx_embedding_function.py" -type f

Repository: oceanbase/pyseekdb

Length of output: 129

🏁 Script executed:

# Read the relevant sections of the file cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -100

Repository: oceanbase/pyseekdb

Length of output: 3914

🏁 Script executed:

# Also check lines 220-230, 300-310, 350-360 to see other violations mentioned cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '220,230p;300,310p;350,360p'

Repository: oceanbase/pyseekdb

Length of output: 2056

🌐 Web query:

Ruff TRY003 rule error inline exception message

💡 Result:

TRY003 in Ruff is raise-vanilla-args: it flags raise MyError("...long message...") / raise MyError(f"...") when the message is defined at the raise site instead of inside the exception class. The intent is to keep exception messages consistent and make the exception reusable. [1][2]

Typical triggering example (bad):

class CantBeNegative(Exception): ... def foo(x): if x < 0: raise CantBeNegative(f"{x} is negative") # TRY003

Fix (recommended): define the message in the exception class and pass args:

class CantBeNegative(Exception): def __init__(self, number: int) -> None: super().__init__(f"{number} is negative") def foo(x): if x < 0: raise CantBeNegative(x)

[1]

If you intentionally want an inline message: suppress it on that line:

raise CantBeNegative(f"{x} is negative") # noqa: TRY003

Ruff supports inline # noqa: <codes> comments. [3]

🏁 Script executed:

# Look for ruff configuration find . -name "pyproject.toml" -o -name "ruff.toml" | head -5

Repository: oceanbase/pyseekdb

Length of output: 116

🏁 Script executed:

# Check if ruff config exists and look for TRY003 if [ -f pyproject.toml ]; then echo "=== pyproject.toml ===" grep -A 20 "\[tool.ruff" pyproject.toml || echo "No [tool.ruff] section" fi if [ -f ruff.toml ]; then echo "=== ruff.toml ===" cat ruff.toml fi

Repository: oceanbase/pyseekdb

Length of output: 1604

🏁 Script executed:

# Verify that TRY003 is included in the TRY ruleset cat -n ruff.toml | grep -A 2 -B 2 "TRY"

Repository: oceanbase/pyseekdb

Length of output: 218

🌐 Web query:

Ruff TRY003 raise-vanilla-args ruleset tryceratops included rules

💡 Result:

In Ruff, TRY003 is the tryceratops rule raise-vanilla-args (“Avoid specifying long messages outside the exception class”). It flags cases where you build/format a long error message at the raise site instead of defining the message inside the exception class (to keep exceptions reusable and messages consistent). [2]

The tryceratops ruleset (TRY) included in Ruff currently consists of these rule codes: TRY002, TRY003, TRY004, TRY200, TRY201, TRY203, TRY300, TRY301, TRY400, TRY401. [1]

TRY003 violations must be fixed by moving exception messages into exception classes.

Ruff's TRY003 rule (raise-vanilla-args) is enabled via the TRY ruleset in ruff.toml and flags inline messages at lines 55, 57, 59, 72, 74, 224-226, 302-304, and 352-357. Move these messages into custom exception class __init__ methods rather than specifying them at raise sites.

Example pattern (apply similarly to the other raise sites)

+class InvalidModelNameError(ValueError): + def __init__(self) -> None: + super().__init__("model_name must be a non-empty string") @@ - if not model_name: - raise ValueError("model_name must be a non-empty string") + if not model_name: + raise InvalidModelNameError()

Also applies to: 224-226, 302-304, 352-357

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 55-55: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 57-57: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 59-59: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 72-72: Avoid specifying long messages outside the exception class

(TRY003)

[warning] 74-74: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around lines 54 - 74, The constructor and other raise sites in ONNXEmbeddingFunction currently raise ValueError with inline messages; create specific exception classes (e.g., InvalidModelNameError, InvalidHFModelIDError, InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the appropriate default message, then replace raises that pass message strings (the checks referencing model_name, hf_model_id, dimension, preferred_providers and the other flagged raise sites later in the file) with raises of those exception classes without arguments; ensure all other flagged raise sites in this module use their own dedicated exception classes with messages defined in the exception __init__ rather than inline messages at the raise call.

coderabbitai · 2026-02-02T06:47:11Z

+                except httpx.HTTPStatusError as e:
+                    if e.response.status_code == 404:
+                        logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")
+                        return False
+                    logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")
+                    if os.path.exists(local_path):
+                        os.remove(local_path)
+                    return False
+                except Exception as e:
+                    logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
+                    # If download fails, try to delete partially downloaded file
+                    if os.path.exists(local_path):
+                        os.remove(local_path)
+                    return False


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '150,200p'

Repository: oceanbase/pyseekdb

Length of output: 3006

🏁 Script executed:

# Get more context around the download function cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '130,220p'

Repository: oceanbase/pyseekdb

Length of output: 4803

🏁 Script executed:

# Check what exceptions httpx can raise rg "import httpx|from httpx" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 2

Repository: oceanbase/pyseekdb

Length of output: 120

🏁 Script executed:

grep -n "_download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -20

Repository: oceanbase/pyseekdb

Length of output: 564

🏁 Script executed:

# Find the _download method definition ast-grep --pattern 'def _download($_) { $$$ }'

Repository: oceanbase/pyseekdb

Length of output: 44

🏁 Script executed:

# Search for _download method in the file rg "def _download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 20

Repository: oceanbase/pyseekdb

Length of output: 2625

🏁 Script executed:

python3 << 'EOF' try: import httpx # Check exception hierarchy print("httpx.HTTPError exists:", hasattr(httpx, 'HTTPError')) print("httpx.HTTPStatusError exists:", hasattr(httpx, 'HTTPStatusError')) print("httpx.RequestError exists:", hasattr(httpx, 'RequestError')) # Check MRO if HTTPStatusError exists if hasattr(httpx, 'HTTPStatusError'): print("\nhttpx.HTTPStatusError MRO:") for cls in httpx.HTTPStatusError.__mro__: print(f" - {cls}") if hasattr(httpx, 'RequestError'): print("\nhttpx.RequestError MRO:") for cls in httpx.RequestError.__mro__: print(f" - {cls}") except ImportError: print("httpx not installed in this environment") EOF

Repository: oceanbase/pyseekdb

Length of output: 485

Narrow the broad Exception catch to specific expected exceptions.

The blanket except Exception at line 181 can hide unexpected bugs and violates the BLE001 linting rule. Since _download() can raise httpx.HTTPStatusError (already caught at line 173), httpx.RequestError, and OSError from file operations, catch only those specific exceptions and let unexpected errors surface for debugging.

Proposed fix

- except Exception as e: + except (httpx.HTTPError, OSError) as e: logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}") # If download fails, try to delete partially downloaded file if os.path.exists(local_path): os.remove(local_path) return False

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

except httpx.HTTPStatusError as e:

if e.response.status_code == 404:

logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")

return False

logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")

if os.path.exists(local_path):

os.remove(local_path)

return False

except Exception as e:

logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")

# If download fails, try to delete partially downloaded file

if os.path.exists(local_path):

os.remove(local_path)

return False

except httpx.HTTPStatusError as e:

if e.response.status_code == 404:

logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback")

return False

logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}")

if os.path.exists(local_path):

os.remove(local_path)

return False

except (httpx.HTTPError, OSError) as e:

logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")

# If download fails, try to delete partially downloaded file

if os.path.exists(local_path):

os.remove(local_path)

return False

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 181-181: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around lines 173 - 186, The broad "except Exception" should be narrowed to the expected failure types so unexpected bugs bubble up; replace the generic except block in the download flow (the block after the existing except httpx.HTTPStatusError) with an explicit except (httpx.RequestError, OSError) as e (or include PermissionError if desired), keep the same logger.warning and local_path cleanup logic, and ensure httpx is imported so httpx.RequestError is available; leave the existing httpx.HTTPStatusError handler unchanged.

coderabbitai · 2026-02-02T06:47:11Z

+        if self._preferred_providers is None or len(self._preferred_providers) == 0:
+            if len(self.ort.get_available_providers()) > 0:
+                logger.debug(
+                    f"WARNING: No ONNX providers provided, defaulting to available providers: "
+                    f"{self.ort.get_available_providers()}"
+                )
+            self._preferred_providers = self.ort.get_available_providers()
+        elif not set(self._preferred_providers).issubset(set(self.ort.get_available_providers())):
+            raise ValueError(
+                f"Preferred providers must be subset of available providers: {self.ort.get_available_providers()}"
+            )
+
+        # Create minimal session options to avoid issues
+        so = self.ort.SessionOptions()
+        so.log_severity_level = 3
+        # Disable all optimizations that might cause issues
+        so.graph_optimization_level = self.ort.GraphOptimizationLevel.ORT_DISABLE_ALL
+        so.execution_mode = self.ort.ExecutionMode.ORT_SEQUENTIAL
+        so.inter_op_num_threads = 1
+        so.intra_op_num_threads = 1
+
+        if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers:
+            # remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU.
+            self._preferred_providers.remove("CoreMLExecutionProvider")
+
+        return self.ort.InferenceSession(
+            os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"),
+            # Force CPU execution provider to avoid provider issues
+            providers=["CPUExecutionProvider"],
+            sess_options=so,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -n 330 | tail -n +270

Repository: oceanbase/pyseekdb

Length of output: 3075

🏁 Script executed:

head -n 100 src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 3214

🏁 Script executed:

# Search for other references to _preferred_providers or preferred_providers in the file rg "preferred_providers" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -n

Repository: oceanbase/pyseekdb

Length of output: 930

🏁 Script executed:

# Search for any TODO comments or issues related to this in the entire file cat src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | grep -i -E "todo|fixme|issue|bug|workaround" -C 2

Repository: oceanbase/pyseekdb

Length of output: 1030

🏁 Script executed:

# Search for any related issues or discussions in code comments rg "provider" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -B 2 -A 2

Repository: oceanbase/pyseekdb

Length of output: 2256

Use self._preferred_providers instead of hardcoding CPU provider.

The preferred_providers parameter is validated and defaults to available providers, but is completely ignored when creating the InferenceSession. Pass self._preferred_providers to the providers argument so users can leverage GPU or other accelerators as specified.

Proposed fix

return self.ort.InferenceSession( os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"), - # Force CPU execution provider to avoid provider issues - providers=["CPUExecutionProvider"], + providers=self._preferred_providers, sess_options=so, )

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 302-304: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around lines 294 - 323, The InferenceSession call in onnx_embedding_function currently hardcodes providers=["CPUExecutionProvider"] and thus ignores the validated self._preferred_providers; update the providers argument in the self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it is a list/sequence) so the session respects user-selected/available providers (note the code already removes "CoreMLExecutionProvider" from self._preferred_providers). Use self._preferred_providers as the providers parameter instead of the hardcoded CPUExecutionProvider.

coderabbitai · 2026-02-02T06:47:12Z

+def _make_onnx() -> OnnxEmbeddingFunction:
+    return OnnxEmbeddingFunction(
+        model_name="all-MiniLM-L6-v2",
+        hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
+        dimension=384,
+        preferred_providers=None,
+    )


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, find and read the test file find . -name "test_onnx_embedding_function.py" -type f

Repository: oceanbase/pyseekdb

Length of output: 112

🏁 Script executed:

# Look for the OnnxEmbeddingFunction class to understand what it imports find . -name "*.py" -type f | xargs grep -l "class OnnxEmbeddingFunction"

Repository: oceanbase/pyseekdb

Length of output: 129

🏁 Script executed:

# Read the test file to see the current state cat -n ./tests/unit_tests/test_onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 2809

🏁 Script executed:

# Read the OnnxEmbeddingFunction implementation cat -n ./src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py

Repository: oceanbase/pyseekdb

Length of output: 18457

Guard _make_onnx() with importorskip for optional deps.

OnnxEmbeddingFunction.__init__ imports optional packages (onnxruntime, tokenizers, tqdm) at lines 79-81. The tests test_dimension_property and test_call_empty_returns_empty will fail when these dependencies are absent because they call _make_onnx() without any skip guards, while test_call_generates_embeddings has them. Centralize the skips inside _make_onnx() and remove the redundant per-test skips.

Proposed fix

def _make_onnx() -> OnnxEmbeddingFunction: + pytest.importorskip("onnxruntime") + pytest.importorskip("tokenizers") + pytest.importorskip("tqdm") return OnnxEmbeddingFunction( model_name="all-MiniLM-L6-v2", hf_model_id="sentence-transformers/all-MiniLM-L6-v2", dimension=384, preferred_providers=None, ) @@ def test_call_generates_embeddings() -> None: - pytest.importorskip("onnxruntime") - pytest.importorskip("tokenizers") - pytest.importorskip("tqdm") - ef = _make_onnx() embeddings = ef("hello world")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _make_onnx() -> OnnxEmbeddingFunction:

return OnnxEmbeddingFunction(

model_name="all-MiniLM-L6-v2",

hf_model_id="sentence-transformers/all-MiniLM-L6-v2",

dimension=384,

preferred_providers=None,

)

def _make_onnx() -> OnnxEmbeddingFunction:

pytest.importorskip("onnxruntime")

pytest.importorskip("tokenizers")

pytest.importorskip("tqdm")

return OnnxEmbeddingFunction(

model_name="all-MiniLM-L6-v2",

hf_model_id="sentence-transformers/all-MiniLM-L6-v2",

dimension=384,

preferred_providers=None,

)

🤖 Prompt for AI Agents

In `@tests/unit_tests/test_onnx_embedding_function.py` around lines 12 - 18, The helper _make_onnx() should guard optional dependencies via pytest.importorskip so tests that call it don't fail when onnxruntime/tokenizers/tqdm are absent; update _make_onnx() to call pytest.importorskip for the packages imported in OnnxEmbeddingFunction.__init__ (onnxruntime, tokenizers, tqdm) before constructing OnnxEmbeddingFunction and then remove redundant per-test importorskip calls in test_dimension_property and test_call_empty_returns_empty (leave test_call_generates_embeddings as-is or remove its duplicate skips), referencing the helper _make_onnx and the tests test_dimension_property and test_call_empty_returns_empty to locate changes.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@tests/integration_tests/conftest.py`:
- Around line 13-18: The import line "import pyseekdb  # noqa: E402" contains an
unused noqa directive; remove the " # noqa: E402" suffix so the line becomes a
normal import (keep the existing sys.path manipulation with repo_root, src_root,
and sys.path.insert as-is) to satisfy RUF100 and eliminate the unused noqa.

coderabbitai · 2026-02-02T06:57:53Z

+# Add project path (repo root + src)
+repo_root = Path(__file__).resolve().parents[2]
+src_root = repo_root / "src"
+sys.path.insert(0, str(src_root))

 import pyseekdb  # noqa: E402


⚠️ Potential issue | 🟡 Minor

Remove unused noqa to satisfy RUF100.

Ruff reports the # noqa: E402 is unused (E402 not enabled), so the directive should be removed.

🔧 Suggested fix

-import pyseekdb # noqa: E402 +import pyseekdb

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Add project path (repo root + src)

repo_root = Path(__file__).resolve().parents[2]

src_root = repo_root / "src"

sys.path.insert(0, str(src_root))

import pyseekdb # noqa: E402

# Add project path (repo root + src)

repo_root = Path(__file__).resolve().parents[2]

src_root = repo_root / "src"

sys.path.insert(0, str(src_root))

import pyseekdb

🧰 Tools

🪛 Ruff (0.14.14)

[warning] 18-18: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents

In `@tests/integration_tests/conftest.py` around lines 13 - 18, The import line "import pyseekdb # noqa: E402" contains an unused noqa directive; remove the " # noqa: E402" suffix so the line becomes a normal import (keep the existing sys.path manipulation with repo_root, src_root, and sys.path.insert as-is) to satisfy RUF100 and eliminate the unused noqa.

fix ruff check

24995e5

hnwyllmm requested a review from Copilot February 2, 2026 06:36

Copilot started reviewing on behalf of hnwyllmm February 2, 2026 06:36 View session

Copilot AI reviewed Feb 2, 2026

View reviewed changes

coderabbitai Bot reviewed Feb 2, 2026

View reviewed changes

fix collect tests fail

d38072e

coderabbitai Bot reviewed Feb 2, 2026

View reviewed changes

fix imporing onnxembedding function error

2352e63

hnwyllmm merged commit 38e58bf into oceanbase:develop Feb 2, 2026
8 checks passed

hnwyllmm deleted the refactor/default-embedding branch February 2, 2026 08:23

coderabbitai Bot mentioned this pull request Feb 2, 2026

support python3.14 by using sentence-transformers #163

Merged

coderabbitai Bot mentioned this pull request Feb 13, 2026

refactor: extract ONNXEmbeddingFunction from DefaultEmbeddingFunction #156

Closed

coderabbitai Bot mentioned this pull request Feb 26, 2026

feat: support sparse vector index for keyword-based retrieval #172

Merged

5 tasks

		# Force CPU execution provider to avoid provider issues
		providers=["CPUExecutionProvider"],

	# Force CPU execution provider to avoid provider issues
	providers=["CPUExecutionProvider"],
	providers=self._preferred_providers,

	dimension=self._DIMENSION,
	dimension=self._DIMENSION,
	preferred_providers=preferred_providers,

Conversation

hnwyllmm commented Feb 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hnwyllmm commented Feb 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 2, 2026 •

edited

Loading