Support fulltext index parser configuration#68
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds support for fulltext index parser configuration to pyseekdb, enabling users to customize text parsing behavior when creating collections. It introduces a new Configuration wrapper class and FulltextParserConfig to specify different parser types (ik, space, ngram, ngram2, beng) with optional parameters, while maintaining backward compatibility with the existing HNSWConfiguration-only API.
Key Changes:
- Introduced
ConfigurationandFulltextParserConfigclasses for flexible fulltext parser configuration - Added SQL generation logic to support PARSER and PARSER_PROPERTIES clauses in CREATE TABLE statements
- Refactored internal server method from
executeto_executeto indicate it's an internal API - Added comprehensive test coverage for fulltext parser configuration scenarios
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 23 comments.
Show a summary per file
| File | Description |
|---|---|
src/pyseekdb/client/configuration.py |
New file defining Configuration, FulltextParserConfig, and HNSWConfiguration classes with validation |
src/pyseekdb/client/client_base.py |
Added fulltext config extraction and SQL generation functions; updated create_collection to support Configuration wrapper |
src/pyseekdb/client/__init__.py |
Exported new Configuration and FulltextParserConfig classes |
src/pyseekdb/__init__.py |
Updated exports to include Configuration and FulltextParserConfig |
src/pyseekdb/client/embedding_function.py |
Added dimension_of helper function to extract dimension from embedding functions |
src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py |
New SentenceTransformerEmbeddingFunction utility class for sentence-transformers integration |
tests/test_fulltext_parser_config.py |
Comprehensive tests for fulltext parser configuration including all parser types and backward compatibility |
tests/test_configuration.py |
Unit tests for Configuration, HNSWConfiguration, and FulltextParserConfig classes |
tests/test_offical_case.py |
Updated to use _execute internal method instead of execute |
tests/test_default_embedding_function.py |
Updated to use _execute internal method instead of execute |
tests/test_collection_query.py |
Updated to use _execute internal method instead of execute |
tests/test_collection_hybrid_search_builder_integration.py |
Updated to use execute; renamed test methods from test_seekdb_server* to test_server_* |
tests/test_collection_hybrid_search.py |
Updated to use execute; renamed test methods from test_seekdb_server* to test_server_* |
tests/test_collection_get.py |
Updated to use _execute internal method instead of execute |
tests/test_collection_embedding_function.py |
Updated to use _execute internal method instead of execute |
tests/test_collection_dml.py |
Updated to use _execute internal method instead of execute |
tests/test_client_creation.py |
Updated to use _execute; added test for Configuration with fulltext parser |
README.md |
Updated documentation with examples of Configuration usage and fulltext parser options |
pyproject.toml |
Added Tuna PyPI mirror as primary source |
.github/workflows/ci.yml |
Removed tuna source in CI; updated OceanBase version; added test output parsing logic |
Comments suppressed due to low confidence (1)
src/pyseekdb/client/configuration.py:4
- Import of 'get_default_embedding_function' is not used.
Import of 'dimension_of' is not used.
from .embedding_function import get_default_embedding_function, dimension_of
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
In this example from pyseekdb import ( client = Client(host="127.0.0.1", port=2881, database="test") Explicitly use IK parser (default for Chinese)config = Configuration( the magic number 384 is not suitable |
|
Parser is not a proper name, analyzer is better in NLP and search engine. refer to https://www.qianwen.com/share?shareId=12e9fdd7-da05-4ed8-ab95-c979b1984238 |
Summary
close #56
Support fulltext index parser configuration
Fulltext Index Parser Configuration Feature
Overview
This feature allows users to configure different fulltext parsers when creating collections in pyseekdb, enabling better text search capabilities for various languages and use cases.
What It Does
The patch adds support for configuring fulltext index parsers when creating collections. Previously, collections were created with a hardcoded IK parser for fulltext indexing. This feature enables:
Configurationwrapper class that combines both vector index (HNSW) and fulltext index configuration in a single objectHNSWConfigurationdirectly continues to work without modificationHow It Works
Architecture Changes
New Configuration Module (
configuration.py):HNSWConfigurationfromclient_base.pyinto a dedicated moduleFulltextParserConfigdataclass for parser configurationConfigurationwrapper class that combines both configurationsDistanceMetricenum for type-safe distance metric constantsConfiguration Classes:
FulltextParserConfig: Configures the fulltext parserparser(str): Parser name ('ik', 'space', 'ngram', 'ngram2', 'beng', etc.)params(Optional[Dict]): Parser-specific parametersConfiguration: Wrapper class containing:hnsw(Optional[HNSWConfiguration]): Vector index configurationfulltext_config(Optional[FulltextParserConfig]): Fulltext parser configurationHNSWConfiguration: Moved to configuration module (unchanged functionality)SQL Generation:
_get_fulltext_index_sql(): Generates the SQL clause for FULLTEXT INDEX based on parser configurationWITH PARSER ikWITH PARSER ngram PARSER_PROPERTIES=(size=2)_get_vector_index_sql(): Extracted vector index SQL generation into a separate functionCollection Creation Flow:
Key Implementation Details
Backward Compatibility: The
create_collection()method accepts bothConfigurationandHNSWConfiguration:HNSWConfigurationis provided: Uses default IK parserConfigurationis provided: Uses specified parser or defaults to IKNoneis provided: Calculates dimension from embedding function, uses default IK parserDefault Behavior: When no fulltext configuration is provided, the system defaults to IK parser (maintaining backward compatibility)
Parameter Validation:
FulltextParserConfigvalidates that parser names and parameters don't contain quotes to prevent SQL injectionCompatibility
Backward Compatibility
✅ Fully Backward Compatible: All existing code continues to work without modification.
Existing code patterns that still work:
Migration Path
To take advantage of the new fulltext parser configuration, migrate to the
Configurationwrapper:Before:
After (Recommended):
API Changes
New Exports:
Configuration: Wrapper class for collection configurationFulltextParserConfig: Fulltext parser configuration classMoved Exports:
HNSWConfiguration: Now exported fromconfigurationmodule (still available from main package)Removed Exports:
DEFAULT_VECTOR_DIMENSION: Moved to internal moduleDEFAULT_DISTANCE_METRIC: Moved to internal moduleNew Feature Usage Examples
Example 1: Using IK Parser (Default, Chinese Text)
Example 2: Using Space Parser (English Text)
Example 3: Using Ngram Parser with Custom Parameters
Example 4: Using Ngram2 Parser
Example 5: Using Bengali Parser
Example 6: Configuration with Only HNSW (Uses Default Parser)
Example 7: Complete Hybrid Search Example
Available Parser Types
ikspacengramngram_token_size(int)ngram2bengFor more information about parsers and their parameters, refer to the OceanBase documentation on tokenizer options.
Best Practices
Use Configuration Wrapper: Even when only configuring HNSW, use the
Configurationwrapper for future extensibility:Choose Parser Based on Language:
ikparserspaceparserbengparserngramwith appropriate parametersTest Parser Performance: Different parsers may have different performance characteristics. Test with your specific data and query patterns.
Parameter Validation: The system validates parser names and parameters to prevent SQL injection. Ensure your parser names and parameter values don't contain quotes.
Technical Details
SQL Generation
The feature generates SQL clauses dynamically based on configuration:
Default (IK parser):
With parameters:
Internal Functions
_extract_hnsw_config(): Extracts HNSWConfiguration from Configuration or HNSWConfiguration_extract_fulltext_config(): Extracts FulltextParserConfig from Configuration_get_fulltext_index_sql(): Generates FULLTEXT INDEX SQL clause_get_vector_index_sql(): Generates VECTOR INDEX SQL clauseSummary
This feature enhances pyseekdb by providing flexible fulltext parser configuration while maintaining full backward compatibility. Users can now:
The implementation is clean, well-structured, and follows Python best practices with proper type hints, validation, and error handling.