refactor: Use pymysql to automatically escape#77
Conversation
<!-- Thank you for contributing to OceanBase! Please feel free to ping the maintainers for the review! --> ## Summary <!-- Please clearly and concisely describe the purpose of this pull request. If this pull request resolves an issue, please link it via "close #xxx" or "fix #xxx". --> The rag demo has supported hybrid search.The readme need to be updated. close oceanbase#60 ## Solution Description <!-- Please clearly and concisely describe your solution. -->
Added comprehensive PR review report and design document template.
This reverts commit e9921b5.
WalkthroughThis PR standardizes SQL escaping across the pyseekdb client using PyMySQL's Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/test_security_sql_injection.py (2)
94-114: Consider logging exceptions for debugging.The bare
excepton line 113 is acceptable for test verification logic, but consider logging the exception to help diagnose test failures during development:except Exception: + import logging + logging.debug(f"Data integrity verification failed", exc_info=True) return False
116-136: Remove unused parameters.The
adminandtemp_dirparameters are not used in this method and should be removed for clarity.Apply this diff:
- def run_security_tests(self, client, admin, temp_dir=None): + def run_security_tests(self, client): """Run all security tests for a given client"""And update the calls at lines 387, 420, and 448:
- self.run_security_tests(client, admin, temp_dir) + self.run_security_tests(client)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
demo/rag/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (8)
.github/instructions/*.instructions.md(1 hunks)README.md(1 hunks)demo/rag/README.md(1 hunks)demo/rag/README_CN.md(1 hunks)src/pyseekdb/client/client_base.py(10 hunks)src/pyseekdb/client/client_seekdb_embedded.py(2 hunks)src/pyseekdb/client/sql_utils.py(2 hunks)tests/test_security_sql_injection.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
CollectionFieldNames(4-10)
tests/test_security_sql_injection.py (3)
src/pyseekdb/client/client_base.py (2)
delete_collection(115-117)delete_collection(410-428)src/pyseekdb/client/__init__.py (2)
AdminClient(180-282)Client(60-177)src/pyseekdb/client/client_seekdb_embedded.py (1)
create_database(269-280)
🪛 Ruff (0.14.8)
tests/test_security_sql_injection.py
111-111: Consider moving this statement to an else block
(TRY300)
113-113: Do not catch blind exception: Exception
(BLE001)
116-116: Unused method argument: admin
(ARG002)
116-116: Unused method argument: temp_dir
(ARG002)
145-145: Do not use bare except
(E722)
145-146: try-except-pass detected, consider logging the exception
(S110)
158-158: Do not use bare except
(E722)
158-159: try-except-pass detected, consider logging the exception
(S110)
281-281: Do not catch blind exception: Exception
(BLE001)
314-314: Do not catch blind exception: Exception
(BLE001)
381-381: Do not use bare except
(E722)
381-382: try-except-pass detected, consider logging the exception
(S110)
393-393: Do not use bare except
(E722)
393-394: try-except-pass detected, consider logging the exception
(S110)
408-408: Do not use bare except
(E722)
408-409: try-except-pass detected, consider logging the exception
(S110)
435-435: Do not use bare except
(E722)
435-436: try-except-pass detected, consider logging the exception
(S110)
🔇 Additional comments (19)
demo/rag/README_CN.md (1)
3-3: LGTM! Documentation terminology updated correctly.The change from "向量检索知识库" (vector search knowledge base) to "混合检索知识库" (hybrid search knowledge base) accurately reflects the enhanced hybrid search capabilities introduced in this PR.
demo/rag/README.md (1)
3-3: LGTM! Documentation terminology updated correctly.The terminology change from "vector search knowledge base" to "hybrid search knowledge base" is consistent with the Chinese README and accurately reflects the system's hybrid search capabilities.
README.md (1)
1225-1225: LGTM! Documentation terminology updated correctly.The change to "hybrid search knowledge base" is consistent with the demo README files and accurately reflects the system's capabilities.
.github/instructions/*.instructions.md (1)
1-108: LGTM! Comprehensive PR review template.This template provides excellent structure for PR documentation, covering overview statistics, design documentation, architecture changes, detailed change analysis, compatibility warnings, and usage examples. This will help ensure consistent and thorough PR documentation.
tests/test_security_sql_injection.py (7)
53-92: LGTM! Comprehensive SQL injection test cases.The test cases cover a good range of attack vectors including:
- Single and double quotes
- Backslashes
- Classic SQL injection patterns (DROP TABLE, condition bypass)
- Mixed special characters
- Unicode and emoji characters
This provides solid coverage for validating the escape_string implementation.
138-159: LGTM! Appropriate use of exception handling for test infrastructure.The bare
exceptblocks at lines 145 and 158 are appropriate for test cleanup logic, ensuring collections are deleted even if they don't exist or if deletion fails. This provides robustness in the test infrastructure.
161-282: LGTM! Comprehensive operation-level security testing.The tests systematically cover:
- ADD: Basic insertion with attack vectors
- UPDATE: Modification with security payloads
- UPSERT: Insert and update scenarios
- QUERY: Query execution with special characters
Each test properly validates that operations complete without SQL errors and data integrity is preserved. The exception handling at line 281 is appropriate for test failure reporting.
284-315: LGTM! Thorough testing of GET operations with special character IDs.The test validates that GET operations safely handle IDs containing:
- Single quotes
- Backslashes
- Double quotes
- SQL injection attempts in the ID field
This is critical for preventing injection through the ID parameter. The exception handling is appropriate for test failure reporting.
317-366: LGTM! Excellent comprehensive workflow test.This test validates the complete workflow (ADD → UPDATE → QUERY → GET) using a complex payload containing:
- Multiple SQL injection attempts
- Mixed quotes and backslashes
- Nested attack vectors in metadata
This is an excellent real-world scenario that validates the escaping works correctly across operation boundaries.
370-448: LGTM! Comprehensive mode-specific test coverage.The tests properly cover all three deployment modes:
- Embedded: Uses temporary directory with proper cleanup
- Server: Uses environment variables for connection
- OceanBase: Uses tenant-aware configuration
The bare
exceptblocks at lines 381, 393, 408, and 435 are appropriate for handling cases where databases already exist during test setup.
453-466: LGTM! Convenient standalone test functions for CI.These wrapper functions enable selective test execution (e.g.,
pytest -k test_embedded) which is valuable for CI/CD pipelines that may only have certain deployment modes available.src/pyseekdb/client/client_seekdb_embedded.py (1)
140-174: LGTM! Critical security improvement for embedded mode.The change to use
escape_stringin embedded mode is essential because pylibseekdb doesn't support parameterized queries. The implementation correctly:
- Handles
Noneas NULL without escaping (line 165)- Handles numbers without escaping (line 167)
- Uses
escape_stringfor string parameters (line 169)- Uses
escape_stringfor other types after string conversion (line 173)This properly prevents SQL injection in embedded mode where parameters must be embedded directly into SQL.
src/pyseekdb/client/client_base.py (6)
689-701: LGTM! Proper escaping in ADD operation.The ADD operation correctly escapes:
- Line 690: Document strings using
escape_string- Lines 698-701: JSON-serialized metadata using
escape_stringBoth are critical injection points since they contain user-provided data. The escaping is applied after JSON serialization, which is correct.
832-840: LGTM! Consistent escaping in UPDATE operation.The UPDATE operation follows the same secure pattern as ADD, properly escaping:
- Line 832: Document strings
- Lines 838-840: JSON-serialized metadata
This ensures UPDATE operations are protected against SQL injection.
989-1015: LGTM! Proper escaping in UPSERT operation.The UPSERT operation correctly handles both code paths:
- Update path (lines 989, 994): Escapes documents and metadata when updating existing records
- Insert path (lines 1008, 1015): Escapes documents and metadata when inserting new records
This ensures SQL injection protection regardless of whether the record exists.
1341-1344: LGTM! Critical escaping for ID parameters.The
_convert_id_to_sqlmethod is used throughout for ID-based operations (GET, DELETE, UPDATE, etc.). Escaping IDs at line 1342 is critical because:
- IDs are user-provided strings
- IDs are used in WHERE clauses
- Without escaping, IDs could be an injection vector
The implementation correctly escapes before using
CASTto convert to BINARY.
1840-1842: LGTM! Proper escaping for hybrid search parameters.The hybrid search parameters are correctly escaped at line 1841 before being embedded in the
SET @search_parmSQL statement. This is critical becausesearch_parmcontains user-provided:
- Query text
- Filter conditions
- Metadata filters
1994-2036: LGTM! Comprehensive escaping for document query operators.The
_build_document_querymethod properly escapes user-provided query content for all operators:
- Line 1996:
$containsoperator- Lines 2013-2014:
$andoperator with multiple conditions- Lines 2031-2032:
$oroperator with multiple conditionsThis prevents SQL injection through full-text search queries, which was specifically mentioned in issue #75 (single quote not properly escaped in hybrid search).
src/pyseekdb/client/sql_utils.py (1)
8-8: LGTM! Critical security improvement using pymysql's escape_string.The change from manual escaping to
pymysql.converters.escape_stringis a legitimate security enhancement. pymysql's escape_string properly handles backslash, single/double quotes, NUL bytes (ASCII 0), newlines, carriage returns, and Control+Z — the standard special characters that require escaping for SQL safety. pymysql version 1.1.1 supports this function, and the existing logic for bytes/hex and numeric types remains intact.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/pyseekdb/client/client_base.py (2)
2129-2140: Missing escaping for$not_containsquery content.The
$containsoperator (line 2121) correctly escapes the query content, but$not_contains(line 2136) uses the value directly without escaping. This inconsistency could lead to SQL injection or syntax errors when the query contains special characters.# Handle $not_contains - wrap query_string in must_not bool if "$not_contains" in where_document: + not_contains_content = where_document["$not_contains"] + escaped_not_contains = escape_string(not_contains_content) return _with_boost({ "bool": { "must_not": [{ "query_string": { "fields": ["document"], - "query": where_document["$not_contains"] + "query": escaped_not_contains } }] } })
2178-2186: Missing escaping for default string case in document query.When
where_documentis a plain string (line 2179), it's used directly as the query without escaping, unlike the$containscase which properly escapes the content.# Default: if it's a string, treat as $contains if isinstance(where_document, str): + escaped_query = escape_string(where_document) return _with_boost({ "query_string": { "fields": ["document"], - "query": where_document + "query": escaped_query } })
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
README.md(1 hunks)src/pyseekdb/client/client_base.py(10 hunks)src/pyseekdb/client/client_seekdb_embedded.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
CollectionFieldNames(4-10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (8)
README.md (1)
1226-1238: Good alignment with PR objectives; documentation is clear and well-structured.The changes appropriately document the hybrid search capability and three embedding modes (default/local/api). The reworded description from "vector search" to "hybrid search" accurately reflects the enhancement scope. The documentation is accessible and clearly explains the no-key-required default mode alongside the optional alternatives.
src/pyseekdb/client/client_base.py (5)
10-10: LGTM! Correct import for SQL escaping.The import of
escape_stringfrompymysql.convertersis the appropriate function for escaping string literals in MySQL/MariaDB compatible databases.
770-786: LGTM! Document and metadata escaping in add operation.The escaping pattern is correct: user-provided
doc_valand serializedmeta_jsonare properly escaped before being embedded into SQL string literals.
912-923: LGTM! Document and metadata escaping in update operation.Consistent escaping pattern with the add operation.
1424-1427: LGTM! ID escaping in SQL conversion.IDs are properly escaped before being embedded in the CAST expression, preventing SQL injection through malicious ID values.
1930-1932: LGTM! Search parameter escaping in hybrid search.This is the key fix for issue #75. The
search_parm_json(which may contain user queries with single quotes) is now properly escaped before being embedded in the SET SQL statement.src/pyseekdb/client/client_seekdb_embedded.py (2)
8-8: LGTM! Import for embedded mode escaping.Correct import of
escape_stringfor the embedded client that needs to manually embed parameters.
163-174: LGTM! Parameter escaping in embedded mode.The escaping logic correctly handles different parameter types:
None→ SQLNULL- Numeric types → direct string conversion (safe)
- Strings → properly escaped with
escape_string- Other types → converted to string and escaped
This ensures all user-provided parameters are safely embedded into SQL.
|
The past pr:#76 |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/pyseekdb/client/client_base.py (2)
2136-2146: Missing escaping in $not_contains operator.While
$contains,$and, and$oroperators properly useescape_string, the$not_containsoperator at line 2142 does not escape the query content before embedding it in the SQL query_string.Apply this diff to add consistent escaping:
# Handle $not_contains - wrap query_string in must_not bool if "$not_contains" in where_document: + not_contains_content = where_document["$not_contains"] + escaped_not_contains = escape_string(not_contains_content) return _with_boost({ "bool": { "must_not": [{ "query_string": { "fields": ["document"], - "query": where_document["$not_contains"] + "query": escaped_not_contains } }] } })
2184-2192: Missing escaping in default string case.When
where_documentis a plain string (line 2185-2189), it's embedded directly in the query_string without escaping. This could be a SQL injection vector if user-provided strings reach this code path.Apply this diff to add escaping:
# Default: if it's a string, treat as $contains if isinstance(where_document, str): + escaped_query = escape_string(where_document) return _with_boost({ "query_string": { "fields": ["document"], - "query": where_document + "query": escaped_query } })
🧹 Nitpick comments (1)
src/pyseekdb/client/client_base.py (1)
770-805: Consider true parameterized queries for DML operations.While the current use of
escape_stringis safe and addresses the SQL injection concerns, the PR title mentions "parameterized queries" which typically refers to using placeholder syntax (e.g.,%s) with separate parameter binding viacursor.execute(sql, params). This approach is demonstrated in_collection_delete(line 1168-1171) but not consistently used across other DML operations.True parameterized queries offer additional benefits:
- Database driver handles escaping automatically
- Clearer separation between SQL structure and data
- Slightly better performance in some cases (query plan caching)
Consider refactoring DML operations to use parameterized queries in the future, though the current implementation is secure and functional.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/pyseekdb/client/client_base.py(9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (4)
src/pyseekdb/client/meta_info.py (2)
CollectionFieldNames(4-10)table_name(14-15)src/pyseekdb/client/client_seekdb_embedded.py (1)
_execute(107-123)src/pyseekdb/client/base_connection.py (1)
_execute(31-33)src/pyseekdb/client/client_seekdb_server.py (1)
_execute(98-111)
🪛 Ruff (0.14.8)
src/pyseekdb/client/client_base.py
1091-1091: Possible SQL injection vector through string-based query construction
(S608)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (6)
src/pyseekdb/client/client_base.py (6)
10-10: LGTM: Import of pymysql escape utility.The import of
escape_stringfrompymysql.convertersis appropriate for preventing SQL injection when building SQL strings manually.
770-797: LGTM: Proper escaping in _collection_add.The code correctly uses
escape_stringfor document values and metadata JSON before embedding them in SQL strings. The None checks ensure that NULL is used when values are not provided.
912-930: LGTM: Consistent escaping in _collection_update.The update method correctly applies
escape_stringto document and metadata values, maintaining consistency with the add operation.
1071-1105: LGTM: Empty string handling fixed in upsert.The code now correctly uses
is not Nonechecks instead of truthiness checks, which addresses the past review comments. This ensures that empty strings are properly escaped and preserved rather than being converted to NULL.
1430-1433: LGTM: ID escaping prevents injection.The use of
escape_stringfor ID values before casting to BINARY is appropriate, as IDs are user-provided strings that could contain malicious content.
1936-1938: LGTM: Hybrid search escaping addresses issue #75.The escaping of
search_parmJSON before embedding in the SET statement directly addresses the bug reported in issue #75 where single quotes in hybrid search queries caused SQL syntax errors.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (8)
tests/test_empty_value_handling.py (3)
13-15: Path manipulation may cause import issues in CI environments.Directly manipulating
sys.pathcan lead to fragile imports. Consider using pytest'spythonpathconfiguration or installing the package in editable mode (pip install -e .) instead.
61-66: Remove unused noqa directive.The
# noqa: F401directive on line 64 is flagged as unused by static analysis (Ruff RUF100). Since thepylibseekdbimport is used only for existence checking, the directive can be removed.try: - import pylibseekdb # noqa: F401 + import pylibseekdb except ImportError: pytest.fail("seekdb embedded package is not installed")
338-344: Addstrict=Truetozip()for defensive coding.The
zip()call on line 338 doesn't include thestrict=parameter. While both lists currently have 4 elements, usingstrict=Truewould catch mismatches if the test data is modified later.- for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas)): + for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas, strict=True)):tests/test_security_sql_injection.py (5)
94-114: Consider more specific exception handling.The
verify_data_integritymethod catches all exceptions and returnsFalse. While acceptable for test helpers, catching a more specific exception type would help distinguish between "data not found" and "unexpected errors" during debugging.
116-137: Remove unused parametersadminandtemp_dir.The
run_security_testsmethod acceptsadminandtemp_dirparameters but never uses them. Remove these to avoid confusion about the method's interface.- def run_security_tests(self, client, admin, temp_dir=None): + def run_security_tests(self, client): """Run all security tests for a given client"""Then update the call sites at lines 390, 431, and 466:
- self.run_security_tests(client, admin, temp_dir) + self.run_security_tests(client)
382-397: Use specific exceptions instead of bareexcept.Bare
except:clauses can mask unexpected errors. For database creation, catch the specific exception type for "database already exists". For cleanup, at minimum catchException.try: admin.create_database(SEEKDB_DATABASE) - except: - pass # Database might already exist + except Exception: + pass # Database might already exist client = pyseekdb.Client(path=temp_dir, database=SEEKDB_DATABASE) # Run all security tests - self.run_security_tests(client, admin, temp_dir) + self.run_security_tests(client) finally: # Cleanup try: shutil.rmtree(temp_dir) - except: + except OSError: pass
408-412: Replace bareexceptwith specific exception type.The database creation try-except at lines 409-412 should use a specific exception type instead of bare
except:.try: admin.create_database(SERVER_DATABASE) - except: + except Exception: pass # Database might already exist
443-447: Replace bareexceptwith specific exception type.Same pattern fix needed for OceanBase database creation.
try: admin.create_database(OB_DATABASE) - except: + except Exception: pass # Database might already exist
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/test_empty_value_handling.py(1 hunks)tests/test_security_sql_injection.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_empty_value_handling.py (5)
src/pyseekdb/client/collection.py (6)
client(78-80)get(364-428)name(63-65)embedding_function(88-90)add(103-149)upsert(194-234)src/pyseekdb/client/meta_info.py (2)
CollectionNames(12-15)CollectionFieldNames(4-10)tests/test_security_sql_injection.py (1)
_cleanup_collection(359-364)src/pyseekdb/client/embedding_function.py (1)
DefaultEmbeddingFunction(84-469)src/pyseekdb/client/__init__.py (1)
Client(84-201)
tests/test_security_sql_injection.py (1)
src/pyseekdb/client/__init__.py (2)
AdminClient(204-306)Client(84-201)
🪛 Ruff (0.14.8)
tests/test_empty_value_handling.py
49-49: Do not catch blind exception: Exception
(BLE001)
64-64: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
91-91: Do not catch blind exception: Exception
(BLE001)
116-116: Do not catch blind exception: Exception
(BLE001)
165-165: Use explicit conversion flag
Replace with conversion flag
(RUF010)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
184-184: Use explicit conversion flag
Replace with conversion flag
(RUF010)
200-200: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Use explicit conversion flag
Replace with conversion flag
(RUF010)
219-219: Use explicit conversion flag
Replace with conversion flag
(RUF010)
276-276: Use explicit conversion flag
Replace with conversion flag
(RUF010)
290-290: Use explicit conversion flag
Replace with conversion flag
(RUF010)
290-290: Use explicit conversion flag
Replace with conversion flag
(RUF010)
292-292: Use explicit conversion flag
Replace with conversion flag
(RUF010)
297-297: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
338-338: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
342-342: Use explicit conversion flag
Replace with conversion flag
(RUF010)
342-342: Use explicit conversion flag
Replace with conversion flag
(RUF010)
tests/test_security_sql_injection.py
111-111: Consider moving this statement to an else block
(TRY300)
113-113: Do not catch blind exception: Exception
(BLE001)
116-116: Unused method argument: admin
(ARG002)
116-116: Unused method argument: temp_dir
(ARG002)
272-272: Do not catch blind exception: Exception
(BLE001)
305-305: Do not catch blind exception: Exception
(BLE001)
363-363: Do not catch blind exception: Exception
(BLE001)
371-371: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
384-384: Do not use bare except
(E722)
384-385: try-except-pass detected, consider logging the exception
(S110)
396-396: Do not use bare except
(E722)
396-397: try-except-pass detected, consider logging the exception
(S110)
411-411: Do not use bare except
(E722)
411-412: try-except-pass detected, consider logging the exception
(S110)
427-427: Do not catch blind exception: Exception
(BLE001)
446-446: Do not use bare except
(E722)
446-447: try-except-pass detected, consider logging the exception
(S110)
462-462: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
🔇 Additional comments (5)
tests/test_empty_value_handling.py (2)
234-242: Verify None value handling in metadata storage.The test case
{"key": None}at line 239 validates thatNonevalues in metadata are preserved. Ensure the underlying storage (SQL/JSON) correctly handlesNonevs. JSONnullvs. missing keys, as some backends may serializeNonedifferently.
144-184: Good coverage of falsy value edge cases.The test cases for empty strings, whitespace, "0", and "false" provide comprehensive coverage for the falsy value handling bug fix. The assertions correctly verify that values are stored exactly as provided without being converted to "NULL".
tests/test_security_sql_injection.py (3)
53-92: Excellent security test case coverage.The test cases cover important SQL injection attack vectors including single/double quotes, backslashes, classic injection patterns (
'; DROP TABLE), and Unicode with special characters. This provides good coverage for the escape_string fixes.
308-357: Comprehensive security workflow test provides excellent end-to-end coverage.The
_test_comprehensive_security_workflowmethod effectively combines all operations (ADD, UPDATE, QUERY, GET) with a complex payload containing multiple attack vectors. This provides good regression protection for the SQL injection fix.
275-306: Good coverage for special character IDs.Testing GET operations with IDs containing quotes, backslashes, and SQL injection payloads ensures the escaping logic handles all parameter positions correctly, not just document/metadata fields.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tests/test_empty_value_handling.py (2)
10-10: Unused import.
uuidis imported but never used in this file.-import uuid
338-338: Consider addingstrict=Truetozip()for defensive validation.While both lists are guaranteed to have 4 elements here, using
strict=Truewould catch any future mismatches.- for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas)): + for i, (expected_doc, expected_meta) in enumerate(zip(updated_docs, updated_metas, strict=True)):tests/test_security_sql_injection.py (3)
116-136: Unused parametersadminandtemp_dir.These parameters are passed but never used within the method. Consider removing them or documenting why they're reserved for future use.
- def run_security_tests(self, client, admin, temp_dir=None): + def run_security_tests(self, client): """Run all security tests for a given client"""Note: If you remove these parameters, update the callers at lines 390, 431, and 466.
382-386: Bareexceptclauses may mask unexpected errors.While ignoring "database already exists" errors is intentional, bare
except:catches all exceptions including unexpected ones. Consider catching a specific exception type or at least logging.# Create database try: admin.create_database(SEEKDB_DATABASE) - except: - pass # Database might already exist + except Exception: + pass # Database might already exist (ignore duplicate errors)Apply similar changes to lines 396, 411, and 446.
359-364: Consider extracting shared test utilities.The
_cleanup_collectionmethod is duplicated acrosstest_offical_case.py,test_empty_value_handling.py, and this file. Consider extracting to a sharedconftest.pyfixture or test utility module.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tests/test_empty_value_handling.py(1 hunks)tests/test_offical_case.py(2 hunks)tests/test_security_sql_injection.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_security_sql_injection.py (2)
tests/test_empty_value_handling.py (1)
_cleanup_collection(45-50)tests/test_offical_case.py (1)
_cleanup_collection(103-107)
🪛 Ruff (0.14.8)
tests/test_security_sql_injection.py
111-111: Consider moving this statement to an else block
(TRY300)
113-113: Do not catch blind exception: Exception
(BLE001)
116-116: Unused method argument: admin
(ARG002)
116-116: Unused method argument: temp_dir
(ARG002)
272-272: Do not catch blind exception: Exception
(BLE001)
305-305: Do not catch blind exception: Exception
(BLE001)
363-363: Do not catch blind exception: Exception
(BLE001)
371-371: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
384-384: Do not use bare except
(E722)
384-385: try-except-pass detected, consider logging the exception
(S110)
396-396: Do not use bare except
(E722)
396-397: try-except-pass detected, consider logging the exception
(S110)
411-411: Do not use bare except
(E722)
411-412: try-except-pass detected, consider logging the exception
(S110)
427-427: Do not catch blind exception: Exception
(BLE001)
446-446: Do not use bare except
(E722)
446-447: try-except-pass detected, consider logging the exception
(S110)
462-462: Do not catch blind exception: Exception
(BLE001)
tests/test_empty_value_handling.py
49-49: Do not catch blind exception: Exception
(BLE001)
64-64: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
91-91: Do not catch blind exception: Exception
(BLE001)
116-116: Do not catch blind exception: Exception
(BLE001)
165-165: Use explicit conversion flag
Replace with conversion flag
(RUF010)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
179-179: Use explicit conversion flag
Replace with conversion flag
(RUF010)
184-184: Use explicit conversion flag
Replace with conversion flag
(RUF010)
200-200: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Use explicit conversion flag
Replace with conversion flag
(RUF010)
219-219: Use explicit conversion flag
Replace with conversion flag
(RUF010)
276-276: Use explicit conversion flag
Replace with conversion flag
(RUF010)
290-290: Use explicit conversion flag
Replace with conversion flag
(RUF010)
290-290: Use explicit conversion flag
Replace with conversion flag
(RUF010)
292-292: Use explicit conversion flag
Replace with conversion flag
(RUF010)
297-297: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
338-338: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
342-342: Use explicit conversion flag
Replace with conversion flag
(RUF010)
342-342: Use explicit conversion flag
Replace with conversion flag
(RUF010)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (7)
tests/test_offical_case.py (2)
140-144: LGTM! Assertion now correctly validates the key presence.Removing the default value
1fromget("test", 1)ensures the test fails when the key is missing rather than silently passing.
164-168: LGTM! Consistent fix applied to OceanBase connection test.Same improvement as the server test - the assertion now properly validates the response.
tests/test_empty_value_handling.py (2)
87-92: Connection assertion is now correct.The fix suggested in past reviews has been applied - using
get("test")without a default properly validates the key exists.
42-59: Well-structured test class with proper cleanup.The helper methods for collection management and the consistent use of try/finally for cleanup are good practices.
tests/test_security_sql_injection.py (3)
53-92: Comprehensive security test cases.The test vectors cover a good range of SQL injection patterns including quote escaping, backslashes, classic injection attempts, and unicode characters.
423-428: Connection assertion is now correct.The suggested fix from past reviews has been applied - the assertion properly validates the key exists and has value 1.
308-357: Good comprehensive workflow test.The
_test_comprehensive_security_workflowmethod effectively combines all operations with security payloads and validates data integrity at each step.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/test_collection_get.py (2)
557-578: Good addition: scalar field backward compatibility.These tests ensure that the new JSON_OVERLAPS implementation works correctly with scalar metadata fields, not just arrays. This is important for backward compatibility with existing code.
Optional: Remove extraneous f-string prefixes
Static analysis notes that lines 558 and 569 use f-strings without placeholders:
- print(f"✅ Testing get with scalar $in operator") + print("✅ Testing get with scalar $in operator") - print(f"✅ Testing get with scalar $nin operator") + print("✅ Testing get with scalar $nin operator")
900-921: Consistent scalar field testing for OceanBase.The scalar
$inand$nintests provide the same backward compatibility validation for OceanBase as the embedded tests. This ensures consistent behavior across all client types.Optional: Remove extraneous f-string prefixes
Lines 901 and 912 use f-strings without placeholders:
- print(f"✅ Testing get with scalar $in operator") + print("✅ Testing get with scalar $in operator") - print(f"✅ Testing get with scalar $nin operator") + print("✅ Testing get with scalar $nin operator")
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/pyseekdb/client/filters.py(2 hunks)tests/test_collection_get.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_collection_get.py (6)
src/pyseekdb/client/collection.py (5)
client(78-80)name(63-65)embedding_function(88-90)add(103-149)get(364-428)src/pyseekdb/client/__init__.py (1)
Client(84-201)src/pyseekdb/client/admin_client.py (1)
get_or_create_collection(214-233)src/pyseekdb/client/client_seekdb_embedded.py (1)
_execute(107-123)src/pyseekdb/client/base_connection.py (1)
_execute(31-33)src/pyseekdb/client/client_seekdb_server.py (1)
_execute(98-111)
🪛 Ruff (0.14.8)
tests/test_collection_get.py
141-141: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
176-176: Do not catch blind exception: Exception
(BLE001)
197-197: Do not catch blind exception: Exception
(BLE001)
226-226: Do not catch blind exception: Exception
(BLE001)
247-247: Do not catch blind exception: Exception
(BLE001)
276-276: Do not catch blind exception: Exception
(BLE001)
285-285: Unused noqa directive (non-enabled: F401)
Remove unused noqa directive
(RUF100)
326-326: Do not catch blind exception: Exception
(BLE001)
347-347: Do not catch blind exception: Exception
(BLE001)
382-382: Do not catch blind exception: Exception
(BLE001)
403-403: Do not catch blind exception: Exception
(BLE001)
438-438: Do not catch blind exception: Exception
(BLE001)
558-558: f-string without any placeholders
Remove extraneous f prefix
(F541)
569-569: f-string without any placeholders
Remove extraneous f prefix
(F541)
901-901: f-string without any placeholders
Remove extraneous f prefix
(F541)
912-912: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (other, ubuntu-latest)
🔇 Additional comments (3)
src/pyseekdb/client/filters.py (1)
142-154: LGTM! Consistent negation logic.The
$ninimplementation correctly mirrors the$inoperator with negation. The parameterized approach prevents SQL injection, and empty list handling is appropriate.tests/test_collection_get.py (2)
135-278: Excellent test coverage for array overlap with $in operator.The three test variants (embedded, server, OceanBase) provide comprehensive coverage for the new JSON_OVERLAPS functionality. The test logic correctly validates that
tags=["ml","ai"]matches$in ["ml","python"]due to the "ml" overlap.Good practices:
- Connection validation for remote clients
- Cleanup in finally blocks
- Clear test documentation
279-440: LGTM! Comprehensive $nin exclusion logic validation.The
$nintests effectively validate the exclusion behavior by inserting two records—one with overlap and one without. Correctly expects only the non-overlapping record (id2) to be returned. The three client variants provide thorough coverage.
| elif op == "$in": | ||
| placeholders = ", ".join(["%s"] * len(op_value)) | ||
| clauses.append(f"JSON_EXTRACT({metadata_column}, '$.{key}') IN ({placeholders})") | ||
| params.extend(op_value) | ||
| # Parameter validation | ||
| if not isinstance(op_value, (list, tuple)): | ||
| op_value = [op_value] | ||
| if len(op_value) == 0: | ||
| continue | ||
|
|
||
| # Use single JSON_OVERLAPS (supports both scalar and array fields) | ||
| json_array = json.dumps(list(op_value)) | ||
| clauses.append( | ||
| f"JSON_OVERLAPS(JSON_EXTRACT({metadata_column}, '$.{key}'), CAST(%s AS JSON))" | ||
| ) | ||
| params.append(json_array) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for version requirements and JSON_OVERLAPS usage in the repository
echo "=== Searching for version requirements in documentation and config files ==="
find . -name "README*" -o -name "*.md" -o -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" 2>/dev/null | head -20 | xargs grep -l "version\|mysql\|oceanbase" 2>/dev/null | head -10
echo -e "\n=== Checking setup.py or pyproject.toml for version constraints ==="
find . -name "setup.py" -o -name "pyproject.toml" 2>/dev/null | xargs cat 2>/dev/null | grep -i "python\|mysql\|oceanbase" | head -20
echo -e "\n=== Searching for JSON_OVERLAPS usage in Python code ==="
rg -n "JSON_OVERLAPS" --type=py
echo -e "\n=== Searching for other JSON functions in filters.py context ==="
rg -n "JSON_EXTRACT|JSON_CONTAINS|JSON_OVERLAPS" --type=py -B2 -A2 | head -50Repository: oceanbase/pyseekdb
Length of output: 5492
🏁 Script executed:
#!/bin/bash
# Check for version requirements documentation and runtime version checks
echo "=== Checking README for version requirements ==="
head -100 README.md | grep -i "mysql\|oceanbase\|version\|requirement" || echo "No version info in README header"
echo -e "\n=== Checking for runtime version checks in client code ==="
rg -n "version|VERSION" src/pyseekdb/client/ --type=py | head -20
echo -e "\n=== Checking connection setup code for version validation ==="
rg -n "execute|connect|version" src/pyseekdb/client/client.py --type=py -A2 -B2 | head -40Repository: oceanbase/pyseekdb
Length of output: 1016
🏁 Script executed:
#!/bin/bash
# Find the actual client code and check for database connection patterns
echo "=== Finding client-related Python files ==="
find . -name "*.py" -path "*/client/*" | head -20
echo -e "\n=== Checking for connection/execute patterns in filters.py ==="
cat src/pyseekdb/client/filters.py | head -50
echo -e "\n=== Looking for version checks or configuration documentation ==="
find . -name "*.md" -o -name "*.txt" | xargs grep -l "8.0.17\|MySQL.*version\|minimum.*version" 2>/dev/nullRepository: oceanbase/pyseekdb
Length of output: 2156
The parameterized JSON_OVERLAPS implementation is correct, but consider documenting MySQL version requirements.
JSON_OVERLAPS was introduced in MySQL 8.0.17, and your use of parameterized queries with CAST(%s AS JSON) correctly prevents SQL injection. The implementation is validated by comprehensive tests across embedded, server, and OceanBase deployment modes.
However, no explicit MySQL version constraints are documented in the project configuration. For users deploying against standalone MySQL instances (rather than embedded SeekDB), consider adding a note in documentation or code comments that MySQL 8.0.17+ is required for this metadata filter to function.
🤖 Prompt for AI Agents
In src/pyseekdb/client/filters.py around lines 128 to 140, the JSON_OVERLAPS
usage requires MySQL 8.0.17+ but no version requirement is documented; add a
brief inline comment above this block (or update project docs/config) stating
that JSON_OVERLAPS requires MySQL 8.0.17 or later and that deployments using
standalone MySQL must meet this minimum version, ensuring users are aware of the
compatibility constraint.
There was a problem hiding this comment.
@longbingljw, I see you've tagged @hnwyllmm for input on the MySQL version requirement concern. I'll wait for their feedback on whether documenting the MySQL 8.0.17+ requirement for JSON_OVERLAPS would be valuable for the project.
|
Please don't do too many things in on pull request, it's difficult to revert and review. |
Make sense to me.I will create a new pull request |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/pyseekdb/client/client_base.py (2)
2164-2169: CRITICAL: Double escaping in $and conditions.Similar to the $contains case, lines 2165 and 2169 apply SQL escaping to query strings before JSON encoding, causing double escaping. The query strings should be added directly to the dictionary without SQL escaping.
🔎 Proposed fix
if contains_queries: - # Combine multiple $contains with AND (escape each query) - escaped_queries = [escape_string(q) for q in contains_queries] return _with_boost({ "query_string": { "fields": ["document"], - "query": " ".join(escaped_queries) + "query": " ".join(contains_queries) } })
2182-2187: CRITICAL: Double escaping in $or conditions.Similar to the $contains and $and cases, lines 2183 and 2187 apply SQL escaping to query strings before JSON encoding, causing double escaping. The query strings should be added directly to the dictionary without SQL escaping.
🔎 Proposed fix
if contains_queries: - # Combine multiple $contains with OR (escape each query) - escaped_queries = [escape_string(q) for q in contains_queries] return _with_boost({ "query_string": { "fields": ["document"], - "query": " OR ".join(escaped_queries) + "query": " OR ".join(contains_queries) } })
🧹 Nitpick comments (1)
src/pyseekdb/client/client_base.py (1)
2295-2302: Missing JSON_OVERLAPS implementation for JSON array fields.The PR objectives mention fixing issue #83 (handling
$inoperator for JSON array values), which proposed usingJSON_OVERLAPSto test overlap between JSON arrays. However, the current implementation at lines 2295-2302 still usestermsqueries, which won't correctly match when the metadata field contains a JSON array like["ml", "ai"].Note: Based on PR comments, the reviewer (hnwyllmm) requested this fix be moved to a separate PR. This comment is for tracking purposes to ensure the issue is addressed separately.
For reference, issue #83 describes the problem:
# Query: {"tags": {"$in": ["ml", "python"]}} # Metadata: {"tags": ["ml", "ai"]} # Current behavior: No match (comparing JSON array to scalar) # Expected: Should match (array overlap detection)The proposed solution would use
JSON_OVERLAPS(JSON_EXTRACT(metadata, '$.tags'), JSON_ARRAY('ml','python'))for array membership queries.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/pyseekdb/client/client_base.py(10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/pyseekdb/client/client_base.py (1)
src/pyseekdb/client/meta_info.py (1)
CollectionFieldNames(4-10)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: test (server, ubuntu-latest)
- GitHub Check: test (embedded, ubuntu-latest)
- GitHub Check: test (oceanbase, ubuntu-latest)
🔇 Additional comments (1)
src/pyseekdb/client/client_base.py (1)
1086-1090: LGTM: Empty string handling correctly fixed.The code now uses
is not Nonechecks instead of truthiness checks, which correctly distinguishes betweenNone(→ SQL NULL) and empty strings (→ escaped as''). This properly addresses the concerns from previous reviews.Also applies to: 1107-1109
Summary
Use parameterized query and pymysql will automatically escape the parameters
related issue #62
fix #75
Solution Description
Summary by CodeRabbit
Release Notes
Documentation
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.