Skip to content

refactor: Use pymysql to automatically escape#77

Merged
hnwyllmm merged 16 commits into
oceanbase:developfrom
longbingljw:escape
Dec 19, 2025
Merged

refactor: Use pymysql to automatically escape#77
hnwyllmm merged 16 commits into
oceanbase:developfrom
longbingljw:escape

Conversation

@longbingljw

@longbingljw longbingljw commented Dec 16, 2025

Copy link
Copy Markdown
Member

Summary

Use parameterized query and pymysql will automatically escape the parameters
related issue #62
fix #75

Solution Description

Summary by CodeRabbit

Release Notes

  • Documentation

    • Updated README to describe hybrid search capabilities and RAG interface features
    • Expanded documentation for embedding modes (default, local, API) in RAG demo
  • Bug Fixes

    • Enhanced string handling across database operations for improved data safety
    • Corrected test assertions for stricter data validation
  • Tests

    • Added comprehensive security validation test suite
    • Added empty value handling tests across all deployment modes

✏️ Tip: You can customize this high-level summary in your review settings.

longbingljw and others added 9 commits December 8, 2025 09:17
<!--
Thank you for contributing to OceanBase! 
Please feel free to ping the maintainers for the review!
-->

## Summary
<!-- 
Please clearly and concisely describe the purpose of this pull request.
If this pull request resolves an issue, please link it via "close #xxx"
or "fix #xxx".
-->
The rag demo has supported hybrid search.The readme need to be updated.
close oceanbase#60


## Solution Description
<!-- Please clearly and concisely describe your solution. -->
Added comprehensive PR review report and design document template.
This reverts commit e9921b5.
@coderabbitai

coderabbitai Bot commented Dec 16, 2025

Copy link
Copy Markdown

Walkthrough

This PR standardizes SQL escaping across the pyseekdb client using PyMySQL's escape_string function to address SQL injection vulnerabilities, introduces comprehensive security and empty value handling test suites, updates documentation to reflect hybrid search capabilities, and adds a structured PR review template.

Changes

Cohort / File(s) Summary
Documentation & Templates
.github/instructions/*.instructions.md, README.md, demo/rag/README.md, demo/rag/README_CN.md
Added new PR review template structure and updated RAG demo descriptions from "vector search knowledge base" to "hybrid search knowledge base" with embedded mode documentation.
SQL Security Fixes
src/pyseekdb/client/client_base.py, src/pyseekdb/client/client_seekdb_embedded.py, src/pyseekdb/client/sql_utils.py
Replaced ad-hoc quote escaping with centralized PyMySQL escape_string calls across SQL-building paths (add, update, upsert, query, collection operations) to standardize escaping for documents, metadata, IDs, and query parameters.
Security & Validation Tests
tests/test_security_sql_injection.py, tests/test_empty_value_handling.py
Added new TestSecuritySQLInjection class with comprehensive per-operation and workflow tests for SQL injection prevention across embedded, server, and OceanBase deployments. Added new TestEmptyValueHandling class validating empty/falsy value handling in upsert operations across three client modes.
Test Assertion Fix
tests/test_offical_case.py
Removed default fallback in dict.get("test", 1) assertion to require the "test" key presence and value validation on both server and OceanBase test cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Areas requiring extra attention:

  • SQL escaping logic in src/pyseekdb/client/client_base.py, client_seekdb_embedded.py, and sql_utils.py — verify escape_string is applied correctly at all SQL injection points, especially around JSON metadata, document strings, and query operators
  • New test suite coverage in test_security_sql_injection.py — ensure test cases comprehensively cover the SQL injection vectors fixed and that per-operation and workflow tests properly validate data integrity
  • test_empty_value_handling.py — confirm edge cases for empty strings, whitespace, and falsy values in upsert operations are adequately tested across all deployment modes
  • Assertion change in test_offical_case.py — validate that removing the default value intentionally tightens the validation requirement

Poem

🐰 With escape_string now our shield so strong,
SQL injection villains don't belong!
Security tests guard both night and day,
Empty values handled every way.
Safe hybrid searches leap and bound,
Our database fortress, safe and sound! 🛡️

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR addresses issue #75 (single quote escaping in hybrid search) by standardizing SQL escaping with pymysql's escape_string across all code paths. However, issue #83 ($in operator for JSON arrays) is not addressed; the code changes do not implement JSON_OVERLAPS for array comparisons. Implement JSON_OVERLAPS logic in query handling for $in/$nin operators when comparing JSON array fields, as specified in issue #83.
Out of Scope Changes check ⚠️ Warning The PR contains changes that are within scope (escape_string implementation for issue #75) and also includes new test files (test_security_sql_injection.py, test_empty_value_handling.py) and test assertions (test_offical_case.py) that go beyond the escaping refactor mentioned in objectives. Clarify the scope: either focus solely on escaping refactor for issue #75, or document why comprehensive security/empty-value tests and assertion changes are necessary for this PR.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'refactor: Use pymysql to automatically escape' clearly describes the main change in the PR, which replaces ad-hoc escaping with pymysql's escape_string across multiple SQL construction paths.
Docstring Coverage ✅ Passed Docstring coverage is 95.74% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/test_security_sql_injection.py (2)

94-114: Consider logging exceptions for debugging.

The bare except on line 113 is acceptable for test verification logic, but consider logging the exception to help diagnose test failures during development:

         except Exception:
+            import logging
+            logging.debug(f"Data integrity verification failed", exc_info=True)
             return False

116-136: Remove unused parameters.

The admin and temp_dir parameters are not used in this method and should be removed for clarity.

Apply this diff:

-    def run_security_tests(self, client, admin, temp_dir=None):
+    def run_security_tests(self, client):
         """Run all security tests for a given client"""

And update the calls at lines 387, 420, and 448:

-            self.run_security_tests(client, admin, temp_dir)
+            self.run_security_tests(client)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ec851c and 257084e.

⛔ Files ignored due to path filters (1)
  • demo/rag/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .github/instructions/*.instructions.md (1 hunks)
  • README.md (1 hunks)
  • demo/rag/README.md (1 hunks)
  • demo/rag/README_CN.md (1 hunks)
  • src/pyseekdb/client/client_base.py (10 hunks)
  • src/pyseekdb/client/client_seekdb_embedded.py (2 hunks)
  • src/pyseekdb/client/sql_utils.py (2 hunks)
  • tests/test_security_sql_injection.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
  • CollectionFieldNames (4-10)
tests/test_security_sql_injection.py (3)
src/pyseekdb/client/client_base.py (2)
  • delete_collection (115-117)
  • delete_collection (410-428)
src/pyseekdb/client/__init__.py (2)
  • AdminClient (180-282)
  • Client (60-177)
src/pyseekdb/client/client_seekdb_embedded.py (1)
  • create_database (269-280)
🪛 Ruff (0.14.8)
tests/test_security_sql_injection.py

111-111: Consider moving this statement to an else block

(TRY300)


113-113: Do not catch blind exception: Exception

(BLE001)


116-116: Unused method argument: admin

(ARG002)


116-116: Unused method argument: temp_dir

(ARG002)


145-145: Do not use bare except

(E722)


145-146: try-except-pass detected, consider logging the exception

(S110)


158-158: Do not use bare except

(E722)


158-159: try-except-pass detected, consider logging the exception

(S110)


281-281: Do not catch blind exception: Exception

(BLE001)


314-314: Do not catch blind exception: Exception

(BLE001)


381-381: Do not use bare except

(E722)


381-382: try-except-pass detected, consider logging the exception

(S110)


393-393: Do not use bare except

(E722)


393-394: try-except-pass detected, consider logging the exception

(S110)


408-408: Do not use bare except

(E722)


408-409: try-except-pass detected, consider logging the exception

(S110)


435-435: Do not use bare except

(E722)


435-436: try-except-pass detected, consider logging the exception

(S110)

🔇 Additional comments (19)
demo/rag/README_CN.md (1)

3-3: LGTM! Documentation terminology updated correctly.

The change from "向量检索知识库" (vector search knowledge base) to "混合检索知识库" (hybrid search knowledge base) accurately reflects the enhanced hybrid search capabilities introduced in this PR.

demo/rag/README.md (1)

3-3: LGTM! Documentation terminology updated correctly.

The terminology change from "vector search knowledge base" to "hybrid search knowledge base" is consistent with the Chinese README and accurately reflects the system's hybrid search capabilities.

README.md (1)

1225-1225: LGTM! Documentation terminology updated correctly.

The change to "hybrid search knowledge base" is consistent with the demo README files and accurately reflects the system's capabilities.

.github/instructions/*.instructions.md (1)

1-108: LGTM! Comprehensive PR review template.

This template provides excellent structure for PR documentation, covering overview statistics, design documentation, architecture changes, detailed change analysis, compatibility warnings, and usage examples. This will help ensure consistent and thorough PR documentation.

tests/test_security_sql_injection.py (7)

53-92: LGTM! Comprehensive SQL injection test cases.

The test cases cover a good range of attack vectors including:

  • Single and double quotes
  • Backslashes
  • Classic SQL injection patterns (DROP TABLE, condition bypass)
  • Mixed special characters
  • Unicode and emoji characters

This provides solid coverage for validating the escape_string implementation.


138-159: LGTM! Appropriate use of exception handling for test infrastructure.

The bare except blocks at lines 145 and 158 are appropriate for test cleanup logic, ensuring collections are deleted even if they don't exist or if deletion fails. This provides robustness in the test infrastructure.


161-282: LGTM! Comprehensive operation-level security testing.

The tests systematically cover:

  • ADD: Basic insertion with attack vectors
  • UPDATE: Modification with security payloads
  • UPSERT: Insert and update scenarios
  • QUERY: Query execution with special characters

Each test properly validates that operations complete without SQL errors and data integrity is preserved. The exception handling at line 281 is appropriate for test failure reporting.


284-315: LGTM! Thorough testing of GET operations with special character IDs.

The test validates that GET operations safely handle IDs containing:

  • Single quotes
  • Backslashes
  • Double quotes
  • SQL injection attempts in the ID field

This is critical for preventing injection through the ID parameter. The exception handling is appropriate for test failure reporting.


317-366: LGTM! Excellent comprehensive workflow test.

This test validates the complete workflow (ADD → UPDATE → QUERY → GET) using a complex payload containing:

  • Multiple SQL injection attempts
  • Mixed quotes and backslashes
  • Nested attack vectors in metadata

This is an excellent real-world scenario that validates the escaping works correctly across operation boundaries.


370-448: LGTM! Comprehensive mode-specific test coverage.

The tests properly cover all three deployment modes:

  • Embedded: Uses temporary directory with proper cleanup
  • Server: Uses environment variables for connection
  • OceanBase: Uses tenant-aware configuration

The bare except blocks at lines 381, 393, 408, and 435 are appropriate for handling cases where databases already exist during test setup.


453-466: LGTM! Convenient standalone test functions for CI.

These wrapper functions enable selective test execution (e.g., pytest -k test_embedded) which is valuable for CI/CD pipelines that may only have certain deployment modes available.

src/pyseekdb/client/client_seekdb_embedded.py (1)

140-174: LGTM! Critical security improvement for embedded mode.

The change to use escape_string in embedded mode is essential because pylibseekdb doesn't support parameterized queries. The implementation correctly:

  1. Handles None as NULL without escaping (line 165)
  2. Handles numbers without escaping (line 167)
  3. Uses escape_string for string parameters (line 169)
  4. Uses escape_string for other types after string conversion (line 173)

This properly prevents SQL injection in embedded mode where parameters must be embedded directly into SQL.

src/pyseekdb/client/client_base.py (6)

689-701: LGTM! Proper escaping in ADD operation.

The ADD operation correctly escapes:

  • Line 690: Document strings using escape_string
  • Lines 698-701: JSON-serialized metadata using escape_string

Both are critical injection points since they contain user-provided data. The escaping is applied after JSON serialization, which is correct.


832-840: LGTM! Consistent escaping in UPDATE operation.

The UPDATE operation follows the same secure pattern as ADD, properly escaping:

  • Line 832: Document strings
  • Lines 838-840: JSON-serialized metadata

This ensures UPDATE operations are protected against SQL injection.


989-1015: LGTM! Proper escaping in UPSERT operation.

The UPSERT operation correctly handles both code paths:

  • Update path (lines 989, 994): Escapes documents and metadata when updating existing records
  • Insert path (lines 1008, 1015): Escapes documents and metadata when inserting new records

This ensures SQL injection protection regardless of whether the record exists.


1341-1344: LGTM! Critical escaping for ID parameters.

The _convert_id_to_sql method is used throughout for ID-based operations (GET, DELETE, UPDATE, etc.). Escaping IDs at line 1342 is critical because:

  • IDs are user-provided strings
  • IDs are used in WHERE clauses
  • Without escaping, IDs could be an injection vector

The implementation correctly escapes before using CAST to convert to BINARY.


1840-1842: LGTM! Proper escaping for hybrid search parameters.

The hybrid search parameters are correctly escaped at line 1841 before being embedded in the SET @search_parm SQL statement. This is critical because search_parm contains user-provided:

  • Query text
  • Filter conditions
  • Metadata filters

1994-2036: LGTM! Comprehensive escaping for document query operators.

The _build_document_query method properly escapes user-provided query content for all operators:

  • Line 1996: $contains operator
  • Lines 2013-2014: $and operator with multiple conditions
  • Lines 2031-2032: $or operator with multiple conditions

This prevents SQL injection through full-text search queries, which was specifically mentioned in issue #75 (single quote not properly escaped in hybrid search).

src/pyseekdb/client/sql_utils.py (1)

8-8: LGTM! Critical security improvement using pymysql's escape_string.

The change from manual escaping to pymysql.converters.escape_string is a legitimate security enhancement. pymysql's escape_string properly handles backslash, single/double quotes, NUL bytes (ASCII 0), newlines, carriage returns, and Control+Z — the standard special characters that require escaping for SQL safety. pymysql version 1.1.1 supports this function, and the existing logic for bytes/hex and numeric types remains intact.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/pyseekdb/client/client_base.py (2)

2129-2140: Missing escaping for $not_contains query content.

The $contains operator (line 2121) correctly escapes the query content, but $not_contains (line 2136) uses the value directly without escaping. This inconsistency could lead to SQL injection or syntax errors when the query contains special characters.

         # Handle $not_contains - wrap query_string in must_not bool
         if "$not_contains" in where_document:
+            not_contains_content = where_document["$not_contains"]
+            escaped_not_contains = escape_string(not_contains_content)
             return _with_boost({
                 "bool": {
                     "must_not": [{
                         "query_string": {
                             "fields": ["document"],
-                            "query": where_document["$not_contains"]
+                            "query": escaped_not_contains
                         }
                     }]
                 }
             })

2178-2186: Missing escaping for default string case in document query.

When where_document is a plain string (line 2179), it's used directly as the query without escaping, unlike the $contains case which properly escapes the content.

         # Default: if it's a string, treat as $contains
         if isinstance(where_document, str):
+            escaped_query = escape_string(where_document)
             return _with_boost({
                 "query_string": {
                     "fields": ["document"],
-                    "query": where_document
+                    "query": escaped_query
                 }
             })
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 257084e and 927329d.

📒 Files selected for processing (3)
  • README.md (1 hunks)
  • src/pyseekdb/client/client_base.py (10 hunks)
  • src/pyseekdb/client/client_seekdb_embedded.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
  • CollectionFieldNames (4-10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (8)
README.md (1)

1226-1238: Good alignment with PR objectives; documentation is clear and well-structured.

The changes appropriately document the hybrid search capability and three embedding modes (default/local/api). The reworded description from "vector search" to "hybrid search" accurately reflects the enhancement scope. The documentation is accessible and clearly explains the no-key-required default mode alongside the optional alternatives.

src/pyseekdb/client/client_base.py (5)

10-10: LGTM! Correct import for SQL escaping.

The import of escape_string from pymysql.converters is the appropriate function for escaping string literals in MySQL/MariaDB compatible databases.


770-786: LGTM! Document and metadata escaping in add operation.

The escaping pattern is correct: user-provided doc_val and serialized meta_json are properly escaped before being embedded into SQL string literals.


912-923: LGTM! Document and metadata escaping in update operation.

Consistent escaping pattern with the add operation.


1424-1427: LGTM! ID escaping in SQL conversion.

IDs are properly escaped before being embedded in the CAST expression, preventing SQL injection through malicious ID values.


1930-1932: LGTM! Search parameter escaping in hybrid search.

This is the key fix for issue #75. The search_parm_json (which may contain user queries with single quotes) is now properly escaped before being embedded in the SET SQL statement.

src/pyseekdb/client/client_seekdb_embedded.py (2)

8-8: LGTM! Import for embedded mode escaping.

Correct import of escape_string for the embedded client that needs to manually embed parameters.


163-174: LGTM! Parameter escaping in embedded mode.

The escaping logic correctly handles different parameter types:

  • None → SQL NULL
  • Numeric types → direct string conversion (safe)
  • Strings → properly escaped with escape_string
  • Other types → converted to string and escaped

This ensures all user-provided parameters are safely embedded into SQL.

Comment thread src/pyseekdb/client/client_base.py
Comment thread src/pyseekdb/client/client_base.py
@longbingljw

Copy link
Copy Markdown
Member Author

The past pr:#76

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/pyseekdb/client/client_base.py (2)

2136-2146: Missing escaping in $not_contains operator.

While $contains, $and, and $or operators properly use escape_string, the $not_contains operator at line 2142 does not escape the query content before embedding it in the SQL query_string.

Apply this diff to add consistent escaping:

         # Handle $not_contains - wrap query_string in must_not bool
         if "$not_contains" in where_document:
+            not_contains_content = where_document["$not_contains"]
+            escaped_not_contains = escape_string(not_contains_content)
             return _with_boost({
                 "bool": {
                     "must_not": [{
                         "query_string": {
                             "fields": ["document"],
-                            "query": where_document["$not_contains"]
+                            "query": escaped_not_contains
                         }
                     }]
                 }
             })

2184-2192: Missing escaping in default string case.

When where_document is a plain string (line 2185-2189), it's embedded directly in the query_string without escaping. This could be a SQL injection vector if user-provided strings reach this code path.

Apply this diff to add escaping:

         # Default: if it's a string, treat as $contains
         if isinstance(where_document, str):
+            escaped_query = escape_string(where_document)
             return _with_boost({
                 "query_string": {
                     "fields": ["document"],
-                    "query": where_document
+                    "query": escaped_query
                 }
             })
🧹 Nitpick comments (1)
src/pyseekdb/client/client_base.py (1)

770-805: Consider true parameterized queries for DML operations.

While the current use of escape_string is safe and addresses the SQL injection concerns, the PR title mentions "parameterized queries" which typically refers to using placeholder syntax (e.g., %s) with separate parameter binding via cursor.execute(sql, params). This approach is demonstrated in _collection_delete (line 1168-1171) but not consistently used across other DML operations.

True parameterized queries offer additional benefits:

  • Database driver handles escaping automatically
  • Clearer separation between SQL structure and data
  • Slightly better performance in some cases (query plan caching)

Consider refactoring DML operations to use parameterized queries in the future, though the current implementation is secure and functional.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 927329d and 8959fbf.

📒 Files selected for processing (1)
  • src/pyseekdb/client/client_base.py (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (4)
src/pyseekdb/client/meta_info.py (2)
  • CollectionFieldNames (4-10)
  • table_name (14-15)
src/pyseekdb/client/client_seekdb_embedded.py (1)
  • _execute (107-123)
src/pyseekdb/client/base_connection.py (1)
  • _execute (31-33)
src/pyseekdb/client/client_seekdb_server.py (1)
  • _execute (98-111)
🪛 Ruff (0.14.8)
src/pyseekdb/client/client_base.py

1091-1091: Possible SQL injection vector through string-based query construction

(S608)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (6)
src/pyseekdb/client/client_base.py (6)

10-10: LGTM: Import of pymysql escape utility.

The import of escape_string from pymysql.converters is appropriate for preventing SQL injection when building SQL strings manually.


770-797: LGTM: Proper escaping in _collection_add.

The code correctly uses escape_string for document values and metadata JSON before embedding them in SQL strings. The None checks ensure that NULL is used when values are not provided.


912-930: LGTM: Consistent escaping in _collection_update.

The update method correctly applies escape_string to document and metadata values, maintaining consistency with the add operation.


1071-1105: LGTM: Empty string handling fixed in upsert.

The code now correctly uses is not None checks instead of truthiness checks, which addresses the past review comments. This ensures that empty strings are properly escaped and preserved rather than being converted to NULL.


1430-1433: LGTM: ID escaping prevents injection.

The use of escape_string for ID values before casting to BINARY is appropriate, as IDs are user-provided strings that could contain malicious content.


1936-1938: LGTM: Hybrid search escaping addresses issue #75.

The escaping of search_parm JSON before embedding in the SET statement directly addresses the bug reported in issue #75 where single quotes in hybrid search queries caused SQL syntax errors.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (8)
tests/test_empty_value_handling.py (3)

13-15: Path manipulation may cause import issues in CI environments.

Directly manipulating sys.path can lead to fragile imports. Consider using pytest's pythonpath configuration or installing the package in editable mode (pip install -e .) instead.


61-66: Remove unused noqa directive.

The # noqa: F401 directive on line 64 is flagged as unused by static analysis (Ruff RUF100). Since the pylibseekdb import is used only for existence checking, the directive can be removed.

         try:
-            import pylibseekdb  # noqa: F401
+            import pylibseekdb
         except ImportError:
             pytest.fail("seekdb embedded package is not installed")

338-344: Add strict=True to zip() for defensive coding.

The zip() call on line 338 doesn't include the strict= parameter. While both lists currently have 4 elements, using strict=True would catch mismatches if the test data is modified later.

-        for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas)):
+        for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas, strict=True)):
tests/test_security_sql_injection.py (5)

94-114: Consider more specific exception handling.

The verify_data_integrity method catches all exceptions and returns False. While acceptable for test helpers, catching a more specific exception type would help distinguish between "data not found" and "unexpected errors" during debugging.


116-137: Remove unused parameters admin and temp_dir.

The run_security_tests method accepts admin and temp_dir parameters but never uses them. Remove these to avoid confusion about the method's interface.

-    def run_security_tests(self, client, admin, temp_dir=None):
+    def run_security_tests(self, client):
         """Run all security tests for a given client"""

Then update the call sites at lines 390, 431, and 466:

-            self.run_security_tests(client, admin, temp_dir)
+            self.run_security_tests(client)

382-397: Use specific exceptions instead of bare except.

Bare except: clauses can mask unexpected errors. For database creation, catch the specific exception type for "database already exists". For cleanup, at minimum catch Exception.

             try:
                 admin.create_database(SEEKDB_DATABASE)
-            except:
-                pass  # Database might already exist
+            except Exception:
+                pass  # Database might already exist
                 
             client = pyseekdb.Client(path=temp_dir, database=SEEKDB_DATABASE)
             
             # Run all security tests
-            self.run_security_tests(client, admin, temp_dir)
+            self.run_security_tests(client)
             
         finally:
             # Cleanup
             try:
                 shutil.rmtree(temp_dir)
-            except:
+            except OSError:
                 pass

408-412: Replace bare except with specific exception type.

The database creation try-except at lines 409-412 should use a specific exception type instead of bare except:.

         try:
             admin.create_database(SERVER_DATABASE)
-        except:
+        except Exception:
             pass  # Database might already exist

443-447: Replace bare except with specific exception type.

Same pattern fix needed for OceanBase database creation.

         try:
             admin.create_database(OB_DATABASE)
-        except:
+        except Exception:
             pass  # Database might already exist
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8959fbf and f9153e0.

📒 Files selected for processing (2)
  • tests/test_empty_value_handling.py (1 hunks)
  • tests/test_security_sql_injection.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_empty_value_handling.py (5)
src/pyseekdb/client/collection.py (6)
  • client (78-80)
  • get (364-428)
  • name (63-65)
  • embedding_function (88-90)
  • add (103-149)
  • upsert (194-234)
src/pyseekdb/client/meta_info.py (2)
  • CollectionNames (12-15)
  • CollectionFieldNames (4-10)
tests/test_security_sql_injection.py (1)
  • _cleanup_collection (359-364)
src/pyseekdb/client/embedding_function.py (1)
  • DefaultEmbeddingFunction (84-469)
src/pyseekdb/client/__init__.py (1)
  • Client (84-201)
tests/test_security_sql_injection.py (1)
src/pyseekdb/client/__init__.py (2)
  • AdminClient (204-306)
  • Client (84-201)
🪛 Ruff (0.14.8)
tests/test_empty_value_handling.py

49-49: Do not catch blind exception: Exception

(BLE001)


64-64: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


91-91: Do not catch blind exception: Exception

(BLE001)


116-116: Do not catch blind exception: Exception

(BLE001)


165-165: Use explicit conversion flag

Replace with conversion flag

(RUF010)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


184-184: Use explicit conversion flag

Replace with conversion flag

(RUF010)


200-200: Use explicit conversion flag

Replace with conversion flag

(RUF010)


214-214: Use explicit conversion flag

Replace with conversion flag

(RUF010)


214-214: Use explicit conversion flag

Replace with conversion flag

(RUF010)


219-219: Use explicit conversion flag

Replace with conversion flag

(RUF010)


276-276: Use explicit conversion flag

Replace with conversion flag

(RUF010)


290-290: Use explicit conversion flag

Replace with conversion flag

(RUF010)


290-290: Use explicit conversion flag

Replace with conversion flag

(RUF010)


292-292: Use explicit conversion flag

Replace with conversion flag

(RUF010)


297-297: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


338-338: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


342-342: Use explicit conversion flag

Replace with conversion flag

(RUF010)


342-342: Use explicit conversion flag

Replace with conversion flag

(RUF010)

tests/test_security_sql_injection.py

111-111: Consider moving this statement to an else block

(TRY300)


113-113: Do not catch blind exception: Exception

(BLE001)


116-116: Unused method argument: admin

(ARG002)


116-116: Unused method argument: temp_dir

(ARG002)


272-272: Do not catch blind exception: Exception

(BLE001)


305-305: Do not catch blind exception: Exception

(BLE001)


363-363: Do not catch blind exception: Exception

(BLE001)


371-371: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


384-384: Do not use bare except

(E722)


384-385: try-except-pass detected, consider logging the exception

(S110)


396-396: Do not use bare except

(E722)


396-397: try-except-pass detected, consider logging the exception

(S110)


411-411: Do not use bare except

(E722)


411-412: try-except-pass detected, consider logging the exception

(S110)


427-427: Do not catch blind exception: Exception

(BLE001)


446-446: Do not use bare except

(E722)


446-447: try-except-pass detected, consider logging the exception

(S110)


462-462: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (5)
tests/test_empty_value_handling.py (2)

234-242: Verify None value handling in metadata storage.

The test case {"key": None} at line 239 validates that None values in metadata are preserved. Ensure the underlying storage (SQL/JSON) correctly handles None vs. JSON null vs. missing keys, as some backends may serialize None differently.


144-184: Good coverage of falsy value edge cases.

The test cases for empty strings, whitespace, "0", and "false" provide comprehensive coverage for the falsy value handling bug fix. The assertions correctly verify that values are stored exactly as provided without being converted to "NULL".

tests/test_security_sql_injection.py (3)

53-92: Excellent security test case coverage.

The test cases cover important SQL injection attack vectors including single/double quotes, backslashes, classic injection patterns ('; DROP TABLE), and Unicode with special characters. This provides good coverage for the escape_string fixes.


308-357: Comprehensive security workflow test provides excellent end-to-end coverage.

The _test_comprehensive_security_workflow method effectively combines all operations (ADD, UPDATE, QUERY, GET) with a complex payload containing multiple attack vectors. This provides good regression protection for the SQL injection fix.


275-306: Good coverage for special character IDs.

Testing GET operations with IDs containing quotes, backslashes, and SQL injection payloads ensures the escaping logic handles all parameter positions correctly, not just document/metadata fields.

Comment thread tests/test_empty_value_handling.py
Comment thread tests/test_empty_value_handling.py
Comment thread tests/test_security_sql_injection.py
Comment thread tests/test_security_sql_injection.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/test_empty_value_handling.py (2)

10-10: Unused import.

uuid is imported but never used in this file.

-import uuid

338-338: Consider adding strict=True to zip() for defensive validation.

While both lists are guaranteed to have 4 elements here, using strict=True would catch any future mismatches.

-        for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas)):
+        for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas, strict=True)):
tests/test_security_sql_injection.py (3)

116-136: Unused parameters admin and temp_dir.

These parameters are passed but never used within the method. Consider removing them or documenting why they're reserved for future use.

-    def run_security_tests(self, client, admin, temp_dir=None):
+    def run_security_tests(self, client):
         """Run all security tests for a given client"""

Note: If you remove these parameters, update the callers at lines 390, 431, and 466.


382-386: Bare except clauses may mask unexpected errors.

While ignoring "database already exists" errors is intentional, bare except: catches all exceptions including unexpected ones. Consider catching a specific exception type or at least logging.

         # Create database
         try:
             admin.create_database(SEEKDB_DATABASE)
-        except:
-            pass  # Database might already exist
+        except Exception:
+            pass  # Database might already exist (ignore duplicate errors)

Apply similar changes to lines 396, 411, and 446.


359-364: Consider extracting shared test utilities.

The _cleanup_collection method is duplicated across test_offical_case.py, test_empty_value_handling.py, and this file. Consider extracting to a shared conftest.py fixture or test utility module.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f9153e0 and 1b11421.

📒 Files selected for processing (3)
  • tests/test_empty_value_handling.py (1 hunks)
  • tests/test_offical_case.py (2 hunks)
  • tests/test_security_sql_injection.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_security_sql_injection.py (2)
tests/test_empty_value_handling.py (1)
  • _cleanup_collection (45-50)
tests/test_offical_case.py (1)
  • _cleanup_collection (103-107)
🪛 Ruff (0.14.8)
tests/test_security_sql_injection.py

111-111: Consider moving this statement to an else block

(TRY300)


113-113: Do not catch blind exception: Exception

(BLE001)


116-116: Unused method argument: admin

(ARG002)


116-116: Unused method argument: temp_dir

(ARG002)


272-272: Do not catch blind exception: Exception

(BLE001)


305-305: Do not catch blind exception: Exception

(BLE001)


363-363: Do not catch blind exception: Exception

(BLE001)


371-371: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


384-384: Do not use bare except

(E722)


384-385: try-except-pass detected, consider logging the exception

(S110)


396-396: Do not use bare except

(E722)


396-397: try-except-pass detected, consider logging the exception

(S110)


411-411: Do not use bare except

(E722)


411-412: try-except-pass detected, consider logging the exception

(S110)


427-427: Do not catch blind exception: Exception

(BLE001)


446-446: Do not use bare except

(E722)


446-447: try-except-pass detected, consider logging the exception

(S110)


462-462: Do not catch blind exception: Exception

(BLE001)

tests/test_empty_value_handling.py

49-49: Do not catch blind exception: Exception

(BLE001)


64-64: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


91-91: Do not catch blind exception: Exception

(BLE001)


116-116: Do not catch blind exception: Exception

(BLE001)


165-165: Use explicit conversion flag

Replace with conversion flag

(RUF010)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


179-179: Use explicit conversion flag

Replace with conversion flag

(RUF010)


184-184: Use explicit conversion flag

Replace with conversion flag

(RUF010)


200-200: Use explicit conversion flag

Replace with conversion flag

(RUF010)


214-214: Use explicit conversion flag

Replace with conversion flag

(RUF010)


214-214: Use explicit conversion flag

Replace with conversion flag

(RUF010)


219-219: Use explicit conversion flag

Replace with conversion flag

(RUF010)


276-276: Use explicit conversion flag

Replace with conversion flag

(RUF010)


290-290: Use explicit conversion flag

Replace with conversion flag

(RUF010)


290-290: Use explicit conversion flag

Replace with conversion flag

(RUF010)


292-292: Use explicit conversion flag

Replace with conversion flag

(RUF010)


297-297: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


338-338: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


342-342: Use explicit conversion flag

Replace with conversion flag

(RUF010)


342-342: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (7)
tests/test_offical_case.py (2)

140-144: LGTM! Assertion now correctly validates the key presence.

Removing the default value 1 from get("test", 1) ensures the test fails when the key is missing rather than silently passing.


164-168: LGTM! Consistent fix applied to OceanBase connection test.

Same improvement as the server test - the assertion now properly validates the response.

tests/test_empty_value_handling.py (2)

87-92: Connection assertion is now correct.

The fix suggested in past reviews has been applied - using get("test") without a default properly validates the key exists.


42-59: Well-structured test class with proper cleanup.

The helper methods for collection management and the consistent use of try/finally for cleanup are good practices.

tests/test_security_sql_injection.py (3)

53-92: Comprehensive security test cases.

The test vectors cover a good range of SQL injection patterns including quote escaping, backslashes, classic injection attempts, and unicode characters.


423-428: Connection assertion is now correct.

The suggested fix from past reviews has been applied - the assertion properly validates the key exists and has value 1.


308-357: Good comprehensive workflow test.

The _test_comprehensive_security_workflow method effectively combines all operations with security payloads and validates data integrity at each step.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_collection_get.py (2)

557-578: Good addition: scalar field backward compatibility.

These tests ensure that the new JSON_OVERLAPS implementation works correctly with scalar metadata fields, not just arrays. This is important for backward compatibility with existing code.

Optional: Remove extraneous f-string prefixes

Static analysis notes that lines 558 and 569 use f-strings without placeholders:

-            print(f"✅ Testing get with scalar $in operator")
+            print("✅ Testing get with scalar $in operator")

-            print(f"✅ Testing get with scalar $nin operator")
+            print("✅ Testing get with scalar $nin operator")

900-921: Consistent scalar field testing for OceanBase.

The scalar $in and $nin tests provide the same backward compatibility validation for OceanBase as the embedded tests. This ensures consistent behavior across all client types.

Optional: Remove extraneous f-string prefixes

Lines 901 and 912 use f-strings without placeholders:

-            print(f"✅ Testing get with scalar $in operator")
+            print("✅ Testing get with scalar $in operator")

-            print(f"✅ Testing get with scalar $nin operator")
+            print("✅ Testing get with scalar $nin operator")
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b11421 and bb9b63d.

📒 Files selected for processing (2)
  • src/pyseekdb/client/filters.py (2 hunks)
  • tests/test_collection_get.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (6)
src/pyseekdb/client/collection.py (5)
  • client (78-80)
  • name (63-65)
  • embedding_function (88-90)
  • add (103-149)
  • get (364-428)
src/pyseekdb/client/__init__.py (1)
  • Client (84-201)
src/pyseekdb/client/admin_client.py (1)
  • get_or_create_collection (214-233)
src/pyseekdb/client/client_seekdb_embedded.py (1)
  • _execute (107-123)
src/pyseekdb/client/base_connection.py (1)
  • _execute (31-33)
src/pyseekdb/client/client_seekdb_server.py (1)
  • _execute (98-111)
🪛 Ruff (0.14.8)
tests/test_collection_get.py

141-141: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


176-176: Do not catch blind exception: Exception

(BLE001)


197-197: Do not catch blind exception: Exception

(BLE001)


226-226: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


276-276: Do not catch blind exception: Exception

(BLE001)


285-285: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)


326-326: Do not catch blind exception: Exception

(BLE001)


347-347: Do not catch blind exception: Exception

(BLE001)


382-382: Do not catch blind exception: Exception

(BLE001)


403-403: Do not catch blind exception: Exception

(BLE001)


438-438: Do not catch blind exception: Exception

(BLE001)


558-558: f-string without any placeholders

Remove extraneous f prefix

(F541)


569-569: f-string without any placeholders

Remove extraneous f prefix

(F541)


901-901: f-string without any placeholders

Remove extraneous f prefix

(F541)


912-912: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (3)
src/pyseekdb/client/filters.py (1)

142-154: LGTM! Consistent negation logic.

The $nin implementation correctly mirrors the $in operator with negation. The parameterized approach prevents SQL injection, and empty list handling is appropriate.

tests/test_collection_get.py (2)

135-278: Excellent test coverage for array overlap with $in operator.

The three test variants (embedded, server, OceanBase) provide comprehensive coverage for the new JSON_OVERLAPS functionality. The test logic correctly validates that tags=["ml","ai"] matches $in ["ml","python"] due to the "ml" overlap.

Good practices:

  • Connection validation for remote clients
  • Cleanup in finally blocks
  • Clear test documentation

279-440: LGTM! Comprehensive $nin exclusion logic validation.

The $nin tests effectively validate the exclusion behavior by inserting two records—one with overlap and one without. Correctly expects only the non-overlapping record (id2) to be returned. The three client variants provide thorough coverage.

Comment thread src/pyseekdb/client/filters.py Outdated
Comment on lines +128 to +140
elif op == "$in":
placeholders = ", ".join(["%s"] * len(op_value))
clauses.append(f"JSON_EXTRACT({metadata_column}, '$.{key}') IN ({placeholders})")
params.extend(op_value)
# Parameter validation
if not isinstance(op_value, (list, tuple)):
op_value = [op_value]
if len(op_value) == 0:
continue

# Use single JSON_OVERLAPS (supports both scalar and array fields)
json_array = json.dumps(list(op_value))
clauses.append(
f"JSON_OVERLAPS(JSON_EXTRACT({metadata_column}, '$.{key}'), CAST(%s AS JSON))"
)
params.append(json_array)

@coderabbitai coderabbitai Bot Dec 18, 2025

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for version requirements and JSON_OVERLAPS usage in the repository

echo "=== Searching for version requirements in documentation and config files ==="
find . -name "README*" -o -name "*.md" -o -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" 2>/dev/null | head -20 | xargs grep -l "version\|mysql\|oceanbase" 2>/dev/null | head -10

echo -e "\n=== Checking setup.py or pyproject.toml for version constraints ==="
find . -name "setup.py" -o -name "pyproject.toml" 2>/dev/null | xargs cat 2>/dev/null | grep -i "python\|mysql\|oceanbase" | head -20

echo -e "\n=== Searching for JSON_OVERLAPS usage in Python code ==="
rg -n "JSON_OVERLAPS" --type=py

echo -e "\n=== Searching for other JSON functions in filters.py context ==="
rg -n "JSON_EXTRACT|JSON_CONTAINS|JSON_OVERLAPS" --type=py -B2 -A2 | head -50

Repository: oceanbase/pyseekdb

Length of output: 5492


🏁 Script executed:

#!/bin/bash
# Check for version requirements documentation and runtime version checks

echo "=== Checking README for version requirements ==="
head -100 README.md | grep -i "mysql\|oceanbase\|version\|requirement" || echo "No version info in README header"

echo -e "\n=== Checking for runtime version checks in client code ==="
rg -n "version|VERSION" src/pyseekdb/client/ --type=py | head -20

echo -e "\n=== Checking connection setup code for version validation ==="
rg -n "execute|connect|version" src/pyseekdb/client/client.py --type=py -A2 -B2 | head -40

Repository: oceanbase/pyseekdb

Length of output: 1016


🏁 Script executed:

#!/bin/bash
# Find the actual client code and check for database connection patterns

echo "=== Finding client-related Python files ==="
find . -name "*.py" -path "*/client/*" | head -20

echo -e "\n=== Checking for connection/execute patterns in filters.py ==="
cat src/pyseekdb/client/filters.py | head -50

echo -e "\n=== Looking for version checks or configuration documentation ==="
find . -name "*.md" -o -name "*.txt" | xargs grep -l "8.0.17\|MySQL.*version\|minimum.*version" 2>/dev/null

Repository: oceanbase/pyseekdb

Length of output: 2156


The parameterized JSON_OVERLAPS implementation is correct, but consider documenting MySQL version requirements.

JSON_OVERLAPS was introduced in MySQL 8.0.17, and your use of parameterized queries with CAST(%s AS JSON) correctly prevents SQL injection. The implementation is validated by comprehensive tests across embedded, server, and OceanBase deployment modes.

However, no explicit MySQL version constraints are documented in the project configuration. For users deploying against standalone MySQL instances (rather than embedded SeekDB), consider adding a note in documentation or code comments that MySQL 8.0.17+ is required for this metadata filter to function.

🤖 Prompt for AI Agents
In src/pyseekdb/client/filters.py around lines 128 to 140, the JSON_OVERLAPS
usage requires MySQL 8.0.17+ but no version requirement is documented; add a
brief inline comment above this block (or update project docs/config) stating
that JSON_OVERLAPS requires MySQL 8.0.17 or later and that deployments using
standalone MySQL must meet this minimum version, ensuring users are aware of the
compatibility constraint.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc plz@hnwyllmm

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@longbingljw, I see you've tagged @hnwyllmm for input on the MySQL version requirement concern. I'll wait for their feedback on whether documenting the MySQL 8.0.17+ requirement for JSON_OVERLAPS would be valuable for the project.

@hnwyllmm

Copy link
Copy Markdown
Member

Please don't do too many things in on pull request, it's difficult to revert and review.
Please create a new pull request to fix the $in operator issue.
Anyway, thanks a lot for your contribution.

@longbingljw

Copy link
Copy Markdown
Member Author

Please don't do too many things in on pull request, it's difficult to revert and review. Please create a new pull request to fix the $in operator issue. Anyway, thanks a lot for your contribution.

Make sense to me.I will create a new pull request

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
src/pyseekdb/client/client_base.py (2)

2164-2169: CRITICAL: Double escaping in $and conditions.

Similar to the $contains case, lines 2165 and 2169 apply SQL escaping to query strings before JSON encoding, causing double escaping. The query strings should be added directly to the dictionary without SQL escaping.

🔎 Proposed fix
         if contains_queries:
-            # Combine multiple $contains with AND (escape each query)
-            escaped_queries = [escape_string(q) for q in contains_queries]
             return _with_boost({
                 "query_string": {
                     "fields": ["document"],
-                    "query": " ".join(escaped_queries)
+                    "query": " ".join(contains_queries)
                 }
             })

2182-2187: CRITICAL: Double escaping in $or conditions.

Similar to the $contains and $and cases, lines 2183 and 2187 apply SQL escaping to query strings before JSON encoding, causing double escaping. The query strings should be added directly to the dictionary without SQL escaping.

🔎 Proposed fix
         if contains_queries:
-            # Combine multiple $contains with OR (escape each query)
-            escaped_queries = [escape_string(q) for q in contains_queries]
             return _with_boost({
                 "query_string": {
                     "fields": ["document"],
-                    "query": " OR ".join(escaped_queries)
+                    "query": " OR ".join(contains_queries)
                 }
             })
🧹 Nitpick comments (1)
src/pyseekdb/client/client_base.py (1)

2295-2302: Missing JSON_OVERLAPS implementation for JSON array fields.

The PR objectives mention fixing issue #83 (handling $in operator for JSON array values), which proposed using JSON_OVERLAPS to test overlap between JSON arrays. However, the current implementation at lines 2295-2302 still uses terms queries, which won't correctly match when the metadata field contains a JSON array like ["ml", "ai"].

Note: Based on PR comments, the reviewer (hnwyllmm) requested this fix be moved to a separate PR. This comment is for tracking purposes to ensure the issue is addressed separately.

For reference, issue #83 describes the problem:

# Query: {"tags": {"$in": ["ml", "python"]}}
# Metadata: {"tags": ["ml", "ai"]}
# Current behavior: No match (comparing JSON array to scalar)
# Expected: Should match (array overlap detection)

The proposed solution would use JSON_OVERLAPS(JSON_EXTRACT(metadata, '$.tags'), JSON_ARRAY('ml','python')) for array membership queries.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb9b63d and a2d8c6f.

📒 Files selected for processing (1)
  • src/pyseekdb/client/client_base.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
  • CollectionFieldNames (4-10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: test (server, ubuntu-latest)
  • GitHub Check: test (embedded, ubuntu-latest)
  • GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (1)
src/pyseekdb/client/client_base.py (1)

1086-1090: LGTM: Empty string handling correctly fixed.

The code now uses is not None checks instead of truthiness checks, which correctly distinguishes between None (→ SQL NULL) and empty strings (→ escaped as ''). This properly addresses the concerns from previous reviews.

Also applies to: 1107-1109

Comment thread src/pyseekdb/client/client_base.py
Comment thread src/pyseekdb/client/client_base.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Single quote not properly escaped in hybrid search

2 participants