fix ruff check#162
Conversation
📝 WalkthroughWalkthroughExtracted ONNX download/tokenizer/runtime logic into a new OnnxEmbeddingFunction; DefaultEmbeddingFunction now initializes and delegates to an internal OnnxEmbeddingFunction instance and simplified its public surface and attributes. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant DefaultEmbeddingFunction
participant OnnxEmbeddingFunction
participant Tokenizer
participant ONNXRuntime
participant HuggingFace
Client->>DefaultEmbeddingFunction: request embeddings(documents)
DefaultEmbeddingFunction->>OnnxEmbeddingFunction: __call__(documents)
OnnxEmbeddingFunction->>OnnxEmbeddingFunction: ensure model/tokenizer downloaded
OnnxEmbeddingFunction->>Tokenizer: tokenize(documents) (cached)
OnnxEmbeddingFunction->>ONNXRuntime: run model(inputs) (cached session)
ONNXRuntime-->>OnnxEmbeddingFunction: model outputs
OnnxEmbeddingFunction-->>OnnxEmbeddingFunction: mean-pool using attention_mask
OnnxEmbeddingFunction-->>DefaultEmbeddingFunction: embeddings
DefaultEmbeddingFunction-->>Client: return embeddings
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR refactors the DefaultEmbeddingFunction class by extracting the ONNX-based embedding logic into a new generic OnnxEmbeddingFunction class. The refactoring aims to separate concerns and make the ONNX runtime functionality reusable for other embedding models beyond the default all-MiniLM-L6-v2.
Changes:
- Created a new
OnnxEmbeddingFunctionclass that handles ONNX model downloading, loading, and inference - Refactored
DefaultEmbeddingFunctionto delegate toOnnxEmbeddingFunctionwhile maintaining backward compatibility - Added deprecation warnings for the
preferred_providersparameter inDefaultEmbeddingFunction - Added unit tests for the new
OnnxEmbeddingFunctionclass
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py |
New generic ONNX embedding function that handles model download, tokenization, and inference for ONNX-based models |
src/pyseekdb/client/embedding_function.py |
Refactored DefaultEmbeddingFunction to use composition, delegating to OnnxEmbeddingFunction while maintaining backward compatibility |
tests/unit_tests/test_onnx_embedding_function.py |
New unit tests covering initialization validation and basic functionality of OnnxEmbeddingFunction |
| # Force CPU execution provider to avoid provider issues | ||
| providers=["CPUExecutionProvider"], |
There was a problem hiding this comment.
The preferred_providers parameter is validated and stored, but then ignored when creating the InferenceSession. Line 322 hardcodes providers to ["CPUExecutionProvider"], overriding any user-specified providers. Either the preferred_providers functionality should be removed if it's not supported, or the hardcoded providers should be replaced with self._preferred_providers. The current implementation is misleading to users who might expect their provider preferences to be respected.
| # Force CPU execution provider to avoid provider issues | |
| providers=["CPUExecutionProvider"], | |
| providers=self._preferred_providers, |
| def test_init_validates_preferred_providers() -> None: | ||
| with pytest.raises(ValueError, match="Preferred providers must be a list of strings"): | ||
| OnnxEmbeddingFunction(model_name="test", hf_model_id="org/test", dimension=3, preferred_providers=[1]) # type: ignore[list-item] | ||
| with pytest.raises(ValueError, match="Preferred providers must be unique"): | ||
| OnnxEmbeddingFunction( | ||
| model_name="test", | ||
| hf_model_id="org/test", | ||
| dimension=3, | ||
| preferred_providers=["CPUExecutionProvider", "CPUExecutionProvider"], | ||
| ) |
There was a problem hiding this comment.
The test validates the preferred_providers parameter but doesn't test the actual functionality of using different providers. Given that the model() method hardcodes providers to ["CPUExecutionProvider"] (line 322 in onnx_embedding_function.py), these validation tests are misleading as they suggest the feature works when it actually doesn't. Consider either removing these tests or fixing the implementation to actually use the preferred providers.
| model_name: str = "all-MiniLM-L6-v2", # Deprecated. Will be removed in a future version. | ||
| preferred_providers: list[str] | None = None, # Deprecated. Will be removed in a future version. | ||
| # The preferred ONNX runtime providers. Defaults to None (uses available providers). |
There was a problem hiding this comment.
The Args section of the docstring has incorrect formatting. The default values and comments are placed inline with the parameter names instead of in the description. The docstring should follow standard format with parameter names on one line and descriptions on the following lines. For example: 'model_name: Deprecated parameter. Will be removed in a future version. Currently only "all-MiniLM-L6-v2" is supported.'
| model_name: str = "all-MiniLM-L6-v2", # Deprecated. Will be removed in a future version. | |
| preferred_providers: list[str] | None = None, # Deprecated. Will be removed in a future version. | |
| # The preferred ONNX runtime providers. Defaults to None (uses available providers). | |
| model_name: | |
| Deprecated parameter. Will be removed in a future version. | |
| Currently only "all-MiniLM-L6-v2" is supported. | |
| preferred_providers: | |
| Deprecated parameter. Will be removed in a future version. | |
| Preferred ONNX runtime providers. Defaults to None (uses available providers). | |
| Use the preferred_providers argument of OnnxEmbeddingFunction instead. |
| """ | ||
|
|
||
| EXTRACTED_FOLDER_NAME = "onnx" | ||
| ARCHIVE_FILENAME = "onnx.tar.gz" |
There was a problem hiding this comment.
The ARCHIVE_FILENAME class constant is defined but never used anywhere in the class. This appears to be leftover from a previous implementation that used tar.gz archives instead of direct file downloads. Consider removing this unused constant to avoid confusion.
| ARCHIVE_FILENAME = "onnx.tar.gz" |
| pytest.importorskip("onnxruntime") | ||
| pytest.importorskip("tokenizers") | ||
| pytest.importorskip("tqdm") |
There was a problem hiding this comment.
The test uses pytest.importorskip but these imports are mandatory dependencies in pyproject.toml (onnxruntime>=1.19.0, tokenizers>=0.15.0, tqdm). Using importorskip is appropriate for optional dependencies, but for required dependencies, the test should fail hard if they're missing rather than skip. This inconsistency could mask installation issues.
| download_path | ||
| if download_path is not None | ||
| else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name |
There was a problem hiding this comment.
The download_path is stored as a Path object but then used with os.path.join throughout the code (lines 147, 278, 320, 338). While os.path.join does accept Path objects in Python 3.6+, it's inconsistent to mix Path and os.path APIs. Consider either using Path methods consistently (download_path / EXTRACTED_FOLDER_NAME) or converting download_path to a string. The current approach reduces code clarity and could cause confusion.
| download_path | |
| if download_path is not None | |
| else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name | |
| str(download_path) | |
| if download_path is not None | |
| else str(Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name) |
| class OnnxEmbeddingFunction: | ||
| """ | ||
| Generic ONNX runtime embedding function. | ||
|
|
||
| This class handles model download, tokenizer/model loading, and embedding | ||
| generation using onnxruntime. | ||
| """ |
There was a problem hiding this comment.
The OnnxEmbeddingFunction class does not implement the persistence methods required by the EmbeddingFunction protocol (name(), get_config(), and build_from_config()). If this class is intended to be used directly by users as suggested in the deprecation warning in DefaultEmbeddingFunction, it should implement these methods to support collection persistence. Without these methods, collections using OnnxEmbeddingFunction directly cannot be persisted and restored.
| if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers: | ||
| # remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU. | ||
| self._preferred_providers.remove("CoreMLExecutionProvider") |
There was a problem hiding this comment.
The code modifies the _preferred_providers list in place by calling .remove("CoreMLExecutionProvider"). If this method is called multiple times (which could happen with cached_property in some edge cases), it will attempt to remove the same item twice and raise a ValueError. Consider using a list comprehension to create a new list without the unwanted provider instead: providers = [p for p in self._preferred_providers if p != "CoreMLExecutionProvider"]
| self._onnx = OnnxEmbeddingFunction( | ||
| model_name=self.MODEL_NAME, | ||
| hf_model_id=self.HF_MODEL_ID, | ||
| dimension=self._DIMENSION, |
There was a problem hiding this comment.
The preferred_providers parameter is deprecated but still accepted in the init method. However, it's never passed to the underlying OnnxEmbeddingFunction. This means that even during the deprecation period, the parameter has no effect. If backward compatibility is desired during the deprecation period, the parameter should be passed through to OnnxEmbeddingFunction, or alternatively, the ValueError for unsupported values should be raised before the deprecation warning.
| dimension=self._DIMENSION, | |
| dimension=self._DIMENSION, | |
| preferred_providers=preferred_providers, |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/embedding_function.py`:
- Around line 164-176: The constructor currently warns that preferred_providers
is deprecated but does not forward that parameter to the OnnxEmbeddingFunction,
dropping caller selection; update the OnnxEmbeddingFunction instantiation in the
embedding function (the block that creates self._onnx) to pass the
preferred_providers argument through (i.e., include
preferred_providers=preferred_providers when calling OnnxEmbeddingFunction) so
existing callers retain provider selection behavior; ensure the variable name
matches the parameter on the parent constructor and OnnxEmbeddingFunction
signature.
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py`:
- Around line 294-323: The InferenceSession call in onnx_embedding_function
currently hardcodes providers=["CPUExecutionProvider"] and thus ignores the
validated self._preferred_providers; update the providers argument in the
self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it
is a list/sequence) so the session respects user-selected/available providers
(note the code already removes "CoreMLExecutionProvider" from
self._preferred_providers). Use self._preferred_providers as the providers
parameter instead of the hardcoded CPUExecutionProvider.
- Around line 173-186: The broad "except Exception" should be narrowed to the
expected failure types so unexpected bugs bubble up; replace the generic except
block in the download flow (the block after the existing except
httpx.HTTPStatusError) with an explicit except (httpx.RequestError, OSError) as
e (or include PermissionError if desired), keep the same logger.warning and
local_path cleanup logic, and ensure httpx is imported so httpx.RequestError is
available; leave the existing httpx.HTTPStatusError handler unchanged.
- Around line 54-74: The constructor and other raise sites in
ONNXEmbeddingFunction currently raise ValueError with inline messages; create
specific exception classes (e.g., InvalidModelNameError, InvalidHFModelIDError,
InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the
appropriate default message, then replace raises that pass message strings (the
checks referencing model_name, hf_model_id, dimension, preferred_providers and
the other flagged raise sites later in the file) with raises of those exception
classes without arguments; ensure all other flagged raise sites in this module
use their own dedicated exception classes with messages defined in the exception
__init__ rather than inline messages at the raise call.
In `@tests/unit_tests/test_onnx_embedding_function.py`:
- Around line 12-18: The helper _make_onnx() should guard optional dependencies
via pytest.importorskip so tests that call it don't fail when
onnxruntime/tokenizers/tqdm are absent; update _make_onnx() to call
pytest.importorskip for the packages imported in OnnxEmbeddingFunction.__init__
(onnxruntime, tokenizers, tqdm) before constructing OnnxEmbeddingFunction and
then remove redundant per-test importorskip calls in test_dimension_property and
test_call_empty_returns_empty (leave test_call_generates_embeddings as-is or
remove its duplicate skips), referencing the helper _make_onnx and the tests
test_dimension_property and test_call_empty_returns_empty to locate changes.
| if preferred_providers: | ||
| warnings.warn( | ||
| "preferred_providers is deprecated and will be removed in a future version. " | ||
| "Use the preferred_providers argument of OnnxEmbeddingFunction instead.", | ||
| DeprecationWarning, | ||
| stacklevel=2, | ||
| ) | ||
| self.model_name = self.MODEL_NAME | ||
| self._onnx = OnnxEmbeddingFunction( | ||
| model_name=self.MODEL_NAME, | ||
| hf_model_id=self.HF_MODEL_ID, | ||
| dimension=self._DIMENSION, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n src/pyseekdb/client/embedding_function.py | sed -n '150,185p'Repository: oceanbase/pyseekdb
Length of output: 1786
🏁 Script executed:
# Find the OnnxEmbeddingFunction class definition to check its constructor
rg "class OnnxEmbeddingFunction" -A 30Repository: oceanbase/pyseekdb
Length of output: 3078
🏁 Script executed:
# Check the constructor signature more carefully
rg "def __init__" src/pyseekdb/client/embedding_function.py -A 15 -B 2Repository: oceanbase/pyseekdb
Length of output: 2511
Pass preferred_providers through to OnnxEmbeddingFunction.
The deprecation warning directs users to use OnnxEmbeddingFunction's preferred_providers argument, but the parameter is never passed through, effectively dropping it and breaking backward compatibility for existing callers who rely on provider selection.
Proposed fix
self._onnx = OnnxEmbeddingFunction(
model_name=self.MODEL_NAME,
hf_model_id=self.HF_MODEL_ID,
dimension=self._DIMENSION,
+ preferred_providers=preferred_providers,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if preferred_providers: | |
| warnings.warn( | |
| "preferred_providers is deprecated and will be removed in a future version. " | |
| "Use the preferred_providers argument of OnnxEmbeddingFunction instead.", | |
| DeprecationWarning, | |
| stacklevel=2, | |
| ) | |
| self.model_name = self.MODEL_NAME | |
| self._onnx = OnnxEmbeddingFunction( | |
| model_name=self.MODEL_NAME, | |
| hf_model_id=self.HF_MODEL_ID, | |
| dimension=self._DIMENSION, | |
| ) | |
| if preferred_providers: | |
| warnings.warn( | |
| "preferred_providers is deprecated and will be removed in a future version. " | |
| "Use the preferred_providers argument of OnnxEmbeddingFunction instead.", | |
| DeprecationWarning, | |
| stacklevel=2, | |
| ) | |
| self.model_name = self.MODEL_NAME | |
| self._onnx = OnnxEmbeddingFunction( | |
| model_name=self.MODEL_NAME, | |
| hf_model_id=self.HF_MODEL_ID, | |
| dimension=self._DIMENSION, | |
| preferred_providers=preferred_providers, | |
| ) |
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/embedding_function.py` around lines 164 - 176, The
constructor currently warns that preferred_providers is deprecated but does not
forward that parameter to the OnnxEmbeddingFunction, dropping caller selection;
update the OnnxEmbeddingFunction instantiation in the embedding function (the
block that creates self._onnx) to pass the preferred_providers argument through
(i.e., include preferred_providers=preferred_providers when calling
OnnxEmbeddingFunction) so existing callers retain provider selection behavior;
ensure the variable name matches the parameter on the parent constructor and
OnnxEmbeddingFunction signature.
| if not model_name: | ||
| raise ValueError("model_name must be a non-empty string") | ||
| if not hf_model_id: | ||
| raise ValueError("hf_model_id must be a non-empty string") | ||
| if dimension <= 0: | ||
| raise ValueError("dimension must be a positive integer") | ||
|
|
||
| self.model_name = model_name | ||
| self.hf_model_id = hf_model_id | ||
| self._dimension = dimension | ||
| self.download_path = ( | ||
| download_path | ||
| if download_path is not None | ||
| else Path.home() / ".cache" / "pyseekdb" / "onnx_models" / model_name | ||
| ) | ||
|
|
||
| # Validate preferred_providers | ||
| if preferred_providers and not all(isinstance(i, str) for i in preferred_providers): | ||
| raise ValueError("Preferred providers must be a list of strings") | ||
| if preferred_providers and len(preferred_providers) != len(set(preferred_providers)): | ||
| raise ValueError("Preferred providers must be unique") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find the file and check its size
find . -name "onnx_embedding_function.py" -type fRepository: oceanbase/pyseekdb
Length of output: 129
🏁 Script executed:
# Read the relevant sections of the file
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -100Repository: oceanbase/pyseekdb
Length of output: 3914
🏁 Script executed:
# Also check lines 220-230, 300-310, 350-360 to see other violations mentioned
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '220,230p;300,310p;350,360p'Repository: oceanbase/pyseekdb
Length of output: 2056
🌐 Web query:
Ruff TRY003 rule error inline exception message
💡 Result:
TRY003 in Ruff is raise-vanilla-args: it flags raise MyError("...long message...") / raise MyError(f"...") when the message is defined at the raise site instead of inside the exception class. The intent is to keep exception messages consistent and make the exception reusable. [1][2]
Typical triggering example (bad):
class CantBeNegative(Exception): ...
def foo(x):
if x < 0:
raise CantBeNegative(f"{x} is negative") # TRY003Fix (recommended): define the message in the exception class and pass args:
class CantBeNegative(Exception):
def __init__(self, number: int) -> None:
super().__init__(f"{number} is negative")
def foo(x):
if x < 0:
raise CantBeNegative(x)[1]
If you intentionally want an inline message: suppress it on that line:
raise CantBeNegative(f"{x} is negative") # noqa: TRY003Ruff supports inline # noqa: <codes> comments. [3]
🏁 Script executed:
# Look for ruff configuration
find . -name "pyproject.toml" -o -name "ruff.toml" | head -5Repository: oceanbase/pyseekdb
Length of output: 116
🏁 Script executed:
# Check if ruff config exists and look for TRY003
if [ -f pyproject.toml ]; then
echo "=== pyproject.toml ==="
grep -A 20 "\[tool.ruff" pyproject.toml || echo "No [tool.ruff] section"
fi
if [ -f ruff.toml ]; then
echo "=== ruff.toml ==="
cat ruff.toml
fiRepository: oceanbase/pyseekdb
Length of output: 1604
🏁 Script executed:
# Verify that TRY003 is included in the TRY ruleset
cat -n ruff.toml | grep -A 2 -B 2 "TRY"Repository: oceanbase/pyseekdb
Length of output: 218
🌐 Web query:
Ruff TRY003 raise-vanilla-args ruleset tryceratops included rules
💡 Result:
In Ruff, TRY003 is the tryceratops rule raise-vanilla-args (“Avoid specifying long messages outside the exception class”). It flags cases where you build/format a long error message at the raise site instead of defining the message inside the exception class (to keep exceptions reusable and messages consistent). [2]
The tryceratops ruleset (TRY) included in Ruff currently consists of these rule codes: TRY002, TRY003, TRY004, TRY200, TRY201, TRY203, TRY300, TRY301, TRY400, TRY401. [1]
TRY003 violations must be fixed by moving exception messages into exception classes.
Ruff's TRY003 rule (raise-vanilla-args) is enabled via the TRY ruleset in ruff.toml and flags inline messages at lines 55, 57, 59, 72, 74, 224-226, 302-304, and 352-357. Move these messages into custom exception class __init__ methods rather than specifying them at raise sites.
Example pattern (apply similarly to the other raise sites)
+class InvalidModelNameError(ValueError):
+ def __init__(self) -> None:
+ super().__init__("model_name must be a non-empty string")
@@
- if not model_name:
- raise ValueError("model_name must be a non-empty string")
+ if not model_name:
+ raise InvalidModelNameError()Also applies to: 224-226, 302-304, 352-357
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 55-55: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 57-57: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 59-59: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 72-72: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 74-74: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 54 - 74, The constructor and other raise sites in ONNXEmbeddingFunction
currently raise ValueError with inline messages; create specific exception
classes (e.g., InvalidModelNameError, InvalidHFModelIDError,
InvalidDimensionError, InvalidPreferredProvidersError) whose __init__ set the
appropriate default message, then replace raises that pass message strings (the
checks referencing model_name, hf_model_id, dimension, preferred_providers and
the other flagged raise sites later in the file) with raises of those exception
classes without arguments; ensure all other flagged raise sites in this module
use their own dedicated exception classes with messages defined in the exception
__init__ rather than inline messages at the raise call.
| except httpx.HTTPStatusError as e: | ||
| if e.response.status_code == 404: | ||
| logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback") | ||
| return False | ||
| logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}") | ||
| if os.path.exists(local_path): | ||
| os.remove(local_path) | ||
| return False | ||
| except Exception as e: | ||
| logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}") | ||
| # If download fails, try to delete partially downloaded file | ||
| if os.path.exists(local_path): | ||
| os.remove(local_path) | ||
| return False |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '150,200p'Repository: oceanbase/pyseekdb
Length of output: 3006
🏁 Script executed:
# Get more context around the download function
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | sed -n '130,220p'Repository: oceanbase/pyseekdb
Length of output: 4803
🏁 Script executed:
# Check what exceptions httpx can raise
rg "import httpx|from httpx" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 2Repository: oceanbase/pyseekdb
Length of output: 120
🏁 Script executed:
grep -n "_download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -20Repository: oceanbase/pyseekdb
Length of output: 564
🏁 Script executed:
# Find the _download method definition
ast-grep --pattern 'def _download($_) { $$$ }'Repository: oceanbase/pyseekdb
Length of output: 44
🏁 Script executed:
# Search for _download method in the file
rg "def _download" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -A 20Repository: oceanbase/pyseekdb
Length of output: 2625
🏁 Script executed:
python3 << 'EOF'
try:
import httpx
# Check exception hierarchy
print("httpx.HTTPError exists:", hasattr(httpx, 'HTTPError'))
print("httpx.HTTPStatusError exists:", hasattr(httpx, 'HTTPStatusError'))
print("httpx.RequestError exists:", hasattr(httpx, 'RequestError'))
# Check MRO if HTTPStatusError exists
if hasattr(httpx, 'HTTPStatusError'):
print("\nhttpx.HTTPStatusError MRO:")
for cls in httpx.HTTPStatusError.__mro__:
print(f" - {cls}")
if hasattr(httpx, 'RequestError'):
print("\nhttpx.RequestError MRO:")
for cls in httpx.RequestError.__mro__:
print(f" - {cls}")
except ImportError:
print("httpx not installed in this environment")
EOFRepository: oceanbase/pyseekdb
Length of output: 485
Narrow the broad Exception catch to specific expected exceptions.
The blanket except Exception at line 181 can hide unexpected bugs and violates the BLE001 linting rule. Since _download() can raise httpx.HTTPStatusError (already caught at line 173), httpx.RequestError, and OSError from file operations, catch only those specific exceptions and let unexpected errors surface for debugging.
Proposed fix
- except Exception as e:
+ except (httpx.HTTPError, OSError) as e:
logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}")
# If download fails, try to delete partially downloaded file
if os.path.exists(local_path):
os.remove(local_path)
return False📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except httpx.HTTPStatusError as e: | |
| if e.response.status_code == 404: | |
| logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback") | |
| return False | |
| logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}") | |
| if os.path.exists(local_path): | |
| os.remove(local_path) | |
| return False | |
| except Exception as e: | |
| logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}") | |
| # If download fails, try to delete partially downloaded file | |
| if os.path.exists(local_path): | |
| os.remove(local_path) | |
| return False | |
| except httpx.HTTPStatusError as e: | |
| if e.response.status_code == 404: | |
| logger.warning(f"File {hf_filename} not found on Hugging Face (404), will try fallback") | |
| return False | |
| logger.warning(f"HTTP error downloading {hf_filename} from Hugging Face: {e}") | |
| if os.path.exists(local_path): | |
| os.remove(local_path) | |
| return False | |
| except (httpx.HTTPError, OSError) as e: | |
| logger.warning(f"Failed to download {hf_filename} from Hugging Face: {e}") | |
| # If download fails, try to delete partially downloaded file | |
| if os.path.exists(local_path): | |
| os.remove(local_path) | |
| return False |
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 181-181: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 173 - 186, The broad "except Exception" should be narrowed to the expected
failure types so unexpected bugs bubble up; replace the generic except block in
the download flow (the block after the existing except httpx.HTTPStatusError)
with an explicit except (httpx.RequestError, OSError) as e (or include
PermissionError if desired), keep the same logger.warning and local_path cleanup
logic, and ensure httpx is imported so httpx.RequestError is available; leave
the existing httpx.HTTPStatusError handler unchanged.
| if self._preferred_providers is None or len(self._preferred_providers) == 0: | ||
| if len(self.ort.get_available_providers()) > 0: | ||
| logger.debug( | ||
| f"WARNING: No ONNX providers provided, defaulting to available providers: " | ||
| f"{self.ort.get_available_providers()}" | ||
| ) | ||
| self._preferred_providers = self.ort.get_available_providers() | ||
| elif not set(self._preferred_providers).issubset(set(self.ort.get_available_providers())): | ||
| raise ValueError( | ||
| f"Preferred providers must be subset of available providers: {self.ort.get_available_providers()}" | ||
| ) | ||
|
|
||
| # Create minimal session options to avoid issues | ||
| so = self.ort.SessionOptions() | ||
| so.log_severity_level = 3 | ||
| # Disable all optimizations that might cause issues | ||
| so.graph_optimization_level = self.ort.GraphOptimizationLevel.ORT_DISABLE_ALL | ||
| so.execution_mode = self.ort.ExecutionMode.ORT_SEQUENTIAL | ||
| so.inter_op_num_threads = 1 | ||
| so.intra_op_num_threads = 1 | ||
|
|
||
| if self._preferred_providers and "CoreMLExecutionProvider" in self._preferred_providers: | ||
| # remove CoreMLExecutionProvider from the list, it is not as well optimized as CPU. | ||
| self._preferred_providers.remove("CoreMLExecutionProvider") | ||
|
|
||
| return self.ort.InferenceSession( | ||
| os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"), | ||
| # Force CPU execution provider to avoid provider issues | ||
| providers=["CPUExecutionProvider"], | ||
| sess_options=so, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | head -n 330 | tail -n +270Repository: oceanbase/pyseekdb
Length of output: 3075
🏁 Script executed:
head -n 100 src/pyseekdb/utils/embedding_functions/onnx_embedding_function.pyRepository: oceanbase/pyseekdb
Length of output: 3214
🏁 Script executed:
# Search for other references to _preferred_providers or preferred_providers in the file
rg "preferred_providers" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -nRepository: oceanbase/pyseekdb
Length of output: 930
🏁 Script executed:
# Search for any TODO comments or issues related to this in the entire file
cat src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py | grep -i -E "todo|fixme|issue|bug|workaround" -C 2Repository: oceanbase/pyseekdb
Length of output: 1030
🏁 Script executed:
# Search for any related issues or discussions in code comments
rg "provider" src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py -B 2 -A 2Repository: oceanbase/pyseekdb
Length of output: 2256
Use self._preferred_providers instead of hardcoding CPU provider.
The preferred_providers parameter is validated and defaults to available providers, but is completely ignored when creating the InferenceSession. Pass self._preferred_providers to the providers argument so users can leverage GPU or other accelerators as specified.
Proposed fix
return self.ort.InferenceSession(
os.path.join(self.download_path, self.EXTRACTED_FOLDER_NAME, "model.onnx"),
- # Force CPU execution provider to avoid provider issues
- providers=["CPUExecutionProvider"],
+ providers=self._preferred_providers,
sess_options=so,
)🧰 Tools
🪛 Ruff (0.14.14)
[warning] 302-304: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/pyseekdb/utils/embedding_functions/onnx_embedding_function.py` around
lines 294 - 323, The InferenceSession call in onnx_embedding_function currently
hardcodes providers=["CPUExecutionProvider"] and thus ignores the validated
self._preferred_providers; update the providers argument in the
self.ort.InferenceSession(...) call to pass self._preferred_providers (ensure it
is a list/sequence) so the session respects user-selected/available providers
(note the code already removes "CoreMLExecutionProvider" from
self._preferred_providers). Use self._preferred_providers as the providers
parameter instead of the hardcoded CPUExecutionProvider.
| def _make_onnx() -> OnnxEmbeddingFunction: | ||
| return OnnxEmbeddingFunction( | ||
| model_name="all-MiniLM-L6-v2", | ||
| hf_model_id="sentence-transformers/all-MiniLM-L6-v2", | ||
| dimension=384, | ||
| preferred_providers=None, | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find and read the test file
find . -name "test_onnx_embedding_function.py" -type fRepository: oceanbase/pyseekdb
Length of output: 112
🏁 Script executed:
# Look for the OnnxEmbeddingFunction class to understand what it imports
find . -name "*.py" -type f | xargs grep -l "class OnnxEmbeddingFunction"Repository: oceanbase/pyseekdb
Length of output: 129
🏁 Script executed:
# Read the test file to see the current state
cat -n ./tests/unit_tests/test_onnx_embedding_function.pyRepository: oceanbase/pyseekdb
Length of output: 2809
🏁 Script executed:
# Read the OnnxEmbeddingFunction implementation
cat -n ./src/pyseekdb/utils/embedding_functions/onnx_embedding_function.pyRepository: oceanbase/pyseekdb
Length of output: 18457
Guard _make_onnx() with importorskip for optional deps.
OnnxEmbeddingFunction.__init__ imports optional packages (onnxruntime, tokenizers, tqdm) at lines 79-81. The tests test_dimension_property and test_call_empty_returns_empty will fail when these dependencies are absent because they call _make_onnx() without any skip guards, while test_call_generates_embeddings has them. Centralize the skips inside _make_onnx() and remove the redundant per-test skips.
Proposed fix
def _make_onnx() -> OnnxEmbeddingFunction:
+ pytest.importorskip("onnxruntime")
+ pytest.importorskip("tokenizers")
+ pytest.importorskip("tqdm")
return OnnxEmbeddingFunction(
model_name="all-MiniLM-L6-v2",
hf_model_id="sentence-transformers/all-MiniLM-L6-v2",
dimension=384,
preferred_providers=None,
)
@@
def test_call_generates_embeddings() -> None:
- pytest.importorskip("onnxruntime")
- pytest.importorskip("tokenizers")
- pytest.importorskip("tqdm")
-
ef = _make_onnx()
embeddings = ef("hello world")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _make_onnx() -> OnnxEmbeddingFunction: | |
| return OnnxEmbeddingFunction( | |
| model_name="all-MiniLM-L6-v2", | |
| hf_model_id="sentence-transformers/all-MiniLM-L6-v2", | |
| dimension=384, | |
| preferred_providers=None, | |
| ) | |
| def _make_onnx() -> OnnxEmbeddingFunction: | |
| pytest.importorskip("onnxruntime") | |
| pytest.importorskip("tokenizers") | |
| pytest.importorskip("tqdm") | |
| return OnnxEmbeddingFunction( | |
| model_name="all-MiniLM-L6-v2", | |
| hf_model_id="sentence-transformers/all-MiniLM-L6-v2", | |
| dimension=384, | |
| preferred_providers=None, | |
| ) |
🤖 Prompt for AI Agents
In `@tests/unit_tests/test_onnx_embedding_function.py` around lines 12 - 18, The
helper _make_onnx() should guard optional dependencies via pytest.importorskip
so tests that call it don't fail when onnxruntime/tokenizers/tqdm are absent;
update _make_onnx() to call pytest.importorskip for the packages imported in
OnnxEmbeddingFunction.__init__ (onnxruntime, tokenizers, tqdm) before
constructing OnnxEmbeddingFunction and then remove redundant per-test
importorskip calls in test_dimension_property and test_call_empty_returns_empty
(leave test_call_generates_embeddings as-is or remove its duplicate skips),
referencing the helper _make_onnx and the tests test_dimension_property and
test_call_empty_returns_empty to locate changes.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tests/integration_tests/conftest.py`:
- Around line 13-18: The import line "import pyseekdb # noqa: E402" contains an
unused noqa directive; remove the " # noqa: E402" suffix so the line becomes a
normal import (keep the existing sys.path manipulation with repo_root, src_root,
and sys.path.insert as-is) to satisfy RUF100 and eliminate the unused noqa.
| # Add project path (repo root + src) | ||
| repo_root = Path(__file__).resolve().parents[2] | ||
| src_root = repo_root / "src" | ||
| sys.path.insert(0, str(src_root)) | ||
|
|
||
| import pyseekdb # noqa: E402 |
There was a problem hiding this comment.
Remove unused noqa to satisfy RUF100.
Ruff reports the # noqa: E402 is unused (E402 not enabled), so the directive should be removed.
🔧 Suggested fix
-import pyseekdb # noqa: E402
+import pyseekdb📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Add project path (repo root + src) | |
| repo_root = Path(__file__).resolve().parents[2] | |
| src_root = repo_root / "src" | |
| sys.path.insert(0, str(src_root)) | |
| import pyseekdb # noqa: E402 | |
| # Add project path (repo root + src) | |
| repo_root = Path(__file__).resolve().parents[2] | |
| src_root = repo_root / "src" | |
| sys.path.insert(0, str(src_root)) | |
| import pyseekdb |
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 18-18: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In `@tests/integration_tests/conftest.py` around lines 13 - 18, The import line
"import pyseekdb # noqa: E402" contains an unused noqa directive; remove the "
# noqa: E402" suffix so the line becomes a normal import (keep the existing
sys.path manipulation with repo_root, src_root, and sys.path.insert as-is) to
satisfy RUF100 and eliminate the unused noqa.
Summary
close #137
refactor DefaultEmbeddingFunction.
Split it into two classes: OnnxEmbeddingFunction and DefaultEmbeddingFunction.
DefaultEmbeddingFunction worked based on OnnxEmbeddingFunction. OnnxEmbeddingFunction can execute model based on ONNX.
Summary by CodeRabbit
New Features
Refactor
Breaking Change
Tests