Support fulltext index parser configuration by hnwyllmm · Pull Request #68 · oceanbase/pyseekdb

hnwyllmm · 2025-12-11T03:21:56Z

Summary

close #56

Support fulltext index parser configuration

Fulltext Index Parser Configuration Feature

Overview

This feature allows users to configure different fulltext parsers when creating collections in pyseekdb, enabling better text search capabilities for various languages and use cases.

What It Does

The patch adds support for configuring fulltext index parsers when creating collections. Previously, collections were created with a hardcoded IK parser for fulltext indexing. This feature enables:

- Configurable Fulltext Parsers: Choose from multiple parser types (IK, space, ngram, ngram2, beng) based on your language and use case
- Parser-Specific Parameters: Configure custom parameters for parsers that support them (e.g., ngram token size)
- Unified Configuration API: A new Configuration wrapper class that combines both vector index (HNSW) and fulltext index configuration in a single object
- Backward Compatibility: Existing code using HNSWConfiguration directly continues to work without modification

How It Works

Architecture Changes

New Configuration Module (configuration.py):
- Extracted HNSWConfiguration from client_base.py into a dedicated module
- Introduced FulltextParserConfig dataclass for parser configuration
- Introduced Configuration wrapper class that combines both configurations
- Added DistanceMetric enum for type-safe distance metric constants
Configuration Classes:
- FulltextParserConfig: Configures the fulltext parser
  - parser (str): Parser name ('ik', 'space', 'ngram', 'ngram2', 'beng', etc.)
  - params (Optional[Dict]): Parser-specific parameters
- Configuration: Wrapper class containing:
  - hnsw (Optional[HNSWConfiguration]): Vector index configuration
  - fulltext_config (Optional[FulltextParserConfig]): Fulltext parser configuration
- HNSWConfiguration: Moved to configuration module (unchanged functionality)
SQL Generation:
- _get_fulltext_index_sql(): Generates the SQL clause for FULLTEXT INDEX based on parser configuration
  - Default: WITH PARSER ik
  - With parameters: WITH PARSER ngram PARSER_PROPERTIES=(ngram_token_size=2)
- _get_vector_index_sql(): Extracted vector index SQL generation into a separate function

Collection Creation Flow:

create_collection()
├── Extract HNSW config from Configuration or HNSWConfiguration
├── Extract fulltext config from Configuration (or use default)
├── Generate fulltext index SQL clause
├── Generate vector index SQL clause
└── Execute CREATE TABLE with both index configurations
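
The flow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's actual code: the dataclasses are simplified stand-ins for the pyseekdb classes, `resolve_configs` is a hypothetical name, and `fallback_dimension=384` is only a placeholder for the value the real client derives from the embedding function.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple, Union


# Simplified stand-ins for the classes described above (not the pyseekdb source).
@dataclass
class HNSWConfiguration:
    dimension: int
    distance: str = "cosine"


@dataclass
class FulltextParserConfig:
    parser: str = "ik"
    params: Optional[Dict[str, object]] = None


@dataclass
class Configuration:
    hnsw: Optional[HNSWConfiguration] = None
    fulltext_config: Optional[FulltextParserConfig] = None


def resolve_configs(
    configuration: Union[Configuration, HNSWConfiguration, None],
    fallback_dimension: int = 384,  # placeholder; the real client asks the embedding function
) -> Tuple[HNSWConfiguration, FulltextParserConfig]:
    """Hypothetical helper: normalize the accepted inputs into (hnsw, fulltext)."""
    default_fulltext = FulltextParserConfig(parser="ik")
    if configuration is None:
        return HNSWConfiguration(dimension=fallback_dimension), default_fulltext
    if isinstance(configuration, HNSWConfiguration):
        # Legacy call style: vector settings only, fulltext falls back to IK.
        return configuration, default_fulltext
    hnsw = configuration.hnsw or HNSWConfiguration(dimension=fallback_dimension)
    return hnsw, configuration.fulltext_config or default_fulltext
```

Normalizing all three input shapes in one place means the SQL-generation steps only ever see a concrete pair of configuration objects.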

Key Implementation Details

Backward Compatibility: The create_collection() method accepts both Configuration and HNSWConfiguration:
- If HNSWConfiguration is provided: Uses default IK parser
- If Configuration is provided: Uses specified parser or defaults to IK
- If None is provided: Calculates dimension from embedding function, uses default IK parser
Default Behavior: When no fulltext configuration is provided, the system defaults to IK parser (maintaining backward compatibility)
Parameter Validation: FulltextParserConfig validates that parser names and parameters don't contain quotes to prevent SQL injection
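
The quote check described above might look roughly like the following. This is a sketch only: the class is a simplified stand-in, `_reject_quotes` is a hypothetical helper name, and the exact set of rejected characters and the exception type in pyseekdb are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumption: only single and double quotes are rejected, as the description states.
_FORBIDDEN = ("'", '"')


@dataclass
class FulltextParserConfig:
    parser: str = "ik"
    params: Optional[Dict[str, object]] = None

    def __post_init__(self) -> None:
        # Validate eagerly so a bad value fails before any SQL is built.
        self._reject_quotes("parser name", self.parser)
        for key, value in (self.params or {}).items():
            self._reject_quotes("parameter name", str(key))
            self._reject_quotes("parameter value", str(value))

    @staticmethod
    def _reject_quotes(label: str, text: str) -> None:
        if any(ch in text for ch in _FORBIDDEN):
            raise ValueError(f"{label} must not contain quote characters: {text!r}")
```

Because parser names and properties are interpolated into DDL rather than bound as query parameters, rejecting quotes at construction time is the point where malformed input is cheapest to catch.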

Compatibility

Backward Compatibility

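✅ Backward compatible for existing create_collection() usage: the call patterns below continue to work without modification. Code that imports DEFAULT_VECTOR_DIMENSION or DEFAULT_DISTANCE_METRIC from the main package is affected (see Removed Exports under API Changes).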

Existing code patterns that still work:

# Pattern 1: No configuration (uses defaults)
collection = client.create_collection(name="my_collection")

# Pattern 2: HNSWConfiguration only (uses default IK parser)
config = HNSWConfiguration(dimension=384, distance='cosine')
collection = client.create_collection(
    name="my_collection",
    configuration=config
)

# Pattern 3: HNSWConfiguration with embedding function
ef = DefaultEmbeddingFunction()
config = HNSWConfiguration(dimension=384, distance='cosine')
collection = client.create_collection(
    name="my_collection",
    configuration=config,
    embedding_function=ef
)

Migration Path

To take advantage of the new fulltext parser configuration, migrate to the Configuration wrapper:

Before:

config = HNSWConfiguration(dimension=384, distance='cosine')
collection = client.create_collection(
    name="my_collection",
    configuration=config
)

After (Recommended):

from pyseekdb import Configuration, HNSWConfiguration, FulltextParserConfig

config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='ik')  # Explicit, or omit for default
)
collection = client.create_collection(
    name="my_collection",
    configuration=config
)

API Changes

New Exports:

Configuration: Wrapper class for collection configuration
FulltextParserConfig: Fulltext parser configuration class

Moved Exports:

HNSWConfiguration: Now exported from configuration module (still available from main package)

Removed Exports:

DEFAULT_VECTOR_DIMENSION: Moved to internal module
DEFAULT_DISTANCE_METRIC: Moved to internal module

New Feature Usage Examples

Example 1: Using IK Parser (Default, Chinese Text)

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Explicitly use IK parser (default for Chinese)
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='ik')
)

collection = client.create_collection(
    name="chinese_collection",
    configuration=config,
    embedding_function=ef
)

# Add Chinese documents
collection.add(
    documents=["这是中文文档", "另一个中文文档"],
    metadatas=[{"lang": "zh"}, {"lang": "zh"}],
    ids=["doc1", "doc2"]
)

# Search with fulltext
results = collection.hybrid_search(
    query=[{"where_document": {"$contains": "中文"}}],
    knn=[{"query_texts": "文档", "n_results": 5}],
    n_results=10
)

Example 2: Using Space Parser (English Text)

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Use space parser for English text
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='space')
)

collection = client.create_collection(
    name="english_collection",
    configuration=config,
    embedding_function=ef
)

# Add English documents
collection.add(
    documents=[
        "Machine learning is a subset of artificial intelligence",
        "Natural language processing enables computers to understand text"
    ],
    metadatas=[{"lang": "en"}, {"lang": "en"}],
    ids=["doc1", "doc2"]
)

# Search with fulltext
results = collection.hybrid_search(
    query=[{"where_document": {"$contains": "machine learning"}}],
    knn=[{"query_texts": "AI technology", "n_results": 5}],
    n_results=10
)

Example 3: Using Ngram Parser with Custom Parameters

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Use ngram parser with custom token size
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(
        parser='ngram',
        params={'ngram_token_size': 3}  # 3-gram tokenizer
    )
)

collection = client.create_collection(
    name="ngram_collection",
    configuration=config,
    embedding_function=ef
)

# Add documents
collection.add(
    documents=["ABCDEFG", "HIJKLMN"],
    metadatas=[{"type": "sequence"}, {"type": "sequence"}],
    ids=["seq1", "seq2"]
)

# Search
results = collection.hybrid_search(
    query=[{"where_document": {"$contains": "ABC"}}],
    knn=[{"query_texts": "sequence", "n_results": 5}],
    n_results=10
)

Example 4: Using Ngram2 Parser

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig
)

client = Client(host="127.0.0.1", port=2881, database="test")

# Use ngram2 parser (2-gram tokenizer)
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='ngram2')
)

collection = client.create_collection(
    name="ngram2_collection",
    configuration=config
)

# Add documents and search...

Example 5: Using Bengali Parser

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Use Bengali parser
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='beng')
)

collection = client.create_collection(
    name="bengali_collection",
    configuration=config,
    embedding_function=ef
)

# Add Bengali documents
collection.add(
    documents=["এটি একটি বাংলা নথি", "আরেকটি বাংলা নথি"],
    metadatas=[{"lang": "bn"}, {"lang": "bn"}],
    ids=["doc1", "doc2"]
)

Example 6: Configuration with Only HNSW (Uses Default Parser)

from pyseekdb import Client, Configuration, HNSWConfiguration

client = Client(host="127.0.0.1", port=2881, database="test")

# When only HNSW config is provided, defaults to IK parser
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine')
    # fulltext_config defaults to FulltextParserConfig(parser='ik')
)

collection = client.create_collection(
    name="my_collection",
    configuration=config
)

Example 7: Complete Hybrid Search Example

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Create collection with space parser for English
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='space')
)

collection = client.create_collection(
    name="hybrid_collection",
    configuration=config,
    embedding_function=ef
)

# Add documents
collection.add(
    documents=[
        "Python is a programming language",
        "Machine learning uses algorithms",
        "Data science combines statistics and computing"
    ],
    metadatas=[
        {"topic": "programming"},
        {"topic": "AI"},
        {"topic": "data"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Hybrid search combining fulltext and vector search
results = collection.hybrid_search(
    query=[
        {
            "where_document": {"$contains": "programming"},
            "boost": 1.5
        }
    ],
    knn=[
        {
            "query_texts": "coding languages",
            "n_results": 5,
            "boost": 1.0
        }
    ],
    rank={"rrf": {}},  # Reciprocal Rank Fusion
    n_results=10
)

print(f"Found {len(results['ids'][0])} results")
for i, (doc_id, doc, metadata) in enumerate(zip(
    results['ids'][0],
    results['documents'][0],
    results['metadatas'][0]
)):
    print(f"{i+1}. ID: {doc_id}")
    print(f"   Document: {doc}")
    print(f"   Metadata: {metadata}")

Available Parser Types

Parser	Description	Use Case	Parameters
`ik`	IK parser for Chinese text segmentation	Chinese documents	None
`space`	Space-separated tokenizer	English and other space-separated languages	None
`ngram`	N-gram tokenizer	Flexible tokenization, sequence matching	`ngram_token_size` (int)
`ngram2`	2-gram tokenizer	Fixed 2-gram tokenization	None
`beng`	Bengali text parser	Bengali documents	None

For more information about parsers and their parameters, refer to the OceanBase documentation on tokenizer options.
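
To make the difference between the `space` and `ngram` rows concrete, the following pure-Python sketch imitates the two tokenization strategies. It only illustrates the general idea: the function names are hypothetical, and the actual OceanBase tokenizers may treat case, punctuation, and token boundaries differently.

```python
from typing import List


def space_tokens(text: str) -> List[str]:
    """Split on whitespace, roughly what a space-separated tokenizer does."""
    return text.split()


def ngram_tokens(text: str, size: int = 2) -> List[str]:
    """Return every contiguous substring of length `size`."""
    if size < 1:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(len(text) - size + 1)]


print(space_tokens("Machine learning uses algorithms"))
# ['Machine', 'learning', 'uses', 'algorithms']
print(ngram_tokens("ABCDEFG", size=3))
# ['ABC', 'BCD', 'CDE', 'DEF', 'EFG']
```

With 3-grams, a search for "ABC" in Example 3 can match because "ABC" is itself one of the indexed tokens, whereas a whitespace tokenizer would index "ABCDEFG" as a single token.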

Best Practices

Use Configuration Wrapper: Even when only configuring HNSW, use the Configuration wrapper for future extensibility:
```
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine')
)
```
Choose Parser Based on Language:
- Chinese: Use ik parser
- English/Space-separated: Use space parser
- Bengali: Use beng parser
- Custom tokenization: Use ngram with appropriate parameters
Test Parser Performance: Different parsers may have different performance characteristics. Test with your specific data and query patterns.
Parameter Validation: The system validates parser names and parameters to prevent SQL injection. Ensure your parser names and parameter values don't contain quotes.
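
The language guidance above can be captured in a small lookup. This is a hypothetical helper, not part of pyseekdb; the language codes and the ngram fallback with a token size of 2 are assumptions chosen for illustration.

```python
from typing import Dict, Optional, Tuple

ParserChoice = Tuple[str, Optional[Dict[str, int]]]

# Language code -> (parser name, optional params), following the guidance above.
_PARSER_BY_LANGUAGE: Dict[str, ParserChoice] = {
    "zh": ("ik", None),
    "en": ("space", None),
    "bn": ("beng", None),
}


def parser_for_language(lang: str) -> ParserChoice:
    """Pick a parser for a language code; fall back to ngram (assumed default)."""
    return _PARSER_BY_LANGUAGE.get(lang.lower(), ("ngram", {"ngram_token_size": 2}))
```

The returned pair can then be passed on as `FulltextParserConfig(parser=name, params=params)` when building the Configuration.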

Technical Details

SQL Generation

The feature generates SQL clauses dynamically based on configuration:

Default (IK parser):

FULLTEXT INDEX idx_fts(document) WITH PARSER ik

With parameters:

FULLTEXT INDEX idx_fts(document) WITH PARSER ngram PARSER_PROPERTIES=(ngram_token_size=3)
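
A function producing the two clauses above could be sketched as follows. This is not the body of `_get_fulltext_index_sql()`; `fulltext_index_clause` is a hypothetical name, and joining multiple properties with a comma is an assumption, since the description only shows a single property.

```python
from typing import Dict, Optional


def fulltext_index_clause(
    parser: str = "ik",
    params: Optional[Dict[str, object]] = None,
    index_name: str = "idx_fts",
    column: str = "document",
) -> str:
    """Render the FULLTEXT INDEX clause for a parser and its optional properties.

    Inputs are assumed to be validated already (no quotes), as described
    under Parameter Validation.
    """
    clause = f"FULLTEXT INDEX {index_name}({column}) WITH PARSER {parser}"
    if params:
        # Assumption: multiple properties are comma-separated.
        properties = ", ".join(f"{key}={value}" for key, value in params.items())
        clause += f" PARSER_PROPERTIES=({properties})"
    return clause


print(fulltext_index_clause())
# FULLTEXT INDEX idx_fts(document) WITH PARSER ik
print(fulltext_index_clause("ngram", {"ngram_token_size": 3}))
# FULLTEXT INDEX idx_fts(document) WITH PARSER ngram PARSER_PROPERTIES=(ngram_token_size=3)
```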

Internal Functions

_extract_hnsw_config(): Extracts HNSWConfiguration from Configuration or HNSWConfiguration
_extract_fulltext_config(): Extracts FulltextParserConfig from Configuration
_get_fulltext_index_sql(): Generates FULLTEXT INDEX SQL clause
_get_vector_index_sql(): Generates VECTOR INDEX SQL clause

Summary

This feature enhances pyseekdb by providing flexible fulltext parser configuration while maintaining full backward compatibility. Users can now:

Choose appropriate parsers for different languages
Configure parser-specific parameters
Use a unified Configuration API for both vector and fulltext indexing
Continue using existing code without modification

The implementation is clean, well-structured, and follows Python best practices with proper type hints, validation, and error handling.

Copilot

Pull request overview

This PR adds support for fulltext index parser configuration to pyseekdb, enabling users to customize text parsing behavior when creating collections. It introduces a new Configuration wrapper class and FulltextParserConfig to specify different parser types (ik, space, ngram, ngram2, beng) with optional parameters, while maintaining backward compatibility with the existing HNSWConfiguration-only API.

Key Changes:

Introduced Configuration and FulltextParserConfig classes for flexible fulltext parser configuration
Added SQL generation logic to support PARSER and PARSER_PROPERTIES clauses in CREATE TABLE statements
Refactored internal server method from execute to _execute to indicate it's an internal API
Added comprehensive test coverage for fulltext parser configuration scenarios

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 23 comments.

File	Description
`src/pyseekdb/client/configuration.py`	New file defining Configuration, FulltextParserConfig, and HNSWConfiguration classes with validation
`src/pyseekdb/client/client_base.py`	Added fulltext config extraction and SQL generation functions; updated create_collection to support Configuration wrapper
`src/pyseekdb/client/__init__.py`	Exported new Configuration and FulltextParserConfig classes
`src/pyseekdb/__init__.py`	Updated exports to include Configuration and FulltextParserConfig
`src/pyseekdb/client/embedding_function.py`	Added dimension_of helper function to extract dimension from embedding functions
`src/pyseekdb/utils/embedding_functions/sentence_transformer_embedding_function.py`	New SentenceTransformerEmbeddingFunction utility class for sentence-transformers integration
`tests/test_fulltext_parser_config.py`	Comprehensive tests for fulltext parser configuration including all parser types and backward compatibility
`tests/test_configuration.py`	Unit tests for Configuration, HNSWConfiguration, and FulltextParserConfig classes
`tests/test_offical_case.py`	Updated to use _execute internal method instead of execute
`tests/test_default_embedding_function.py`	Updated to use _execute internal method instead of execute
`tests/test_collection_query.py`	Updated to use _execute internal method instead of execute
`tests/test_collection_hybrid_search_builder_integration.py`	Updated to use _execute; renamed test methods from test_seekdb_server_* to test_server_*
`tests/test_collection_hybrid_search.py`	Updated to use _execute; renamed test methods from test_seekdb_server_* to test_server_*
`tests/test_collection_get.py`	Updated to use _execute internal method instead of execute
`tests/test_collection_embedding_function.py`	Updated to use _execute internal method instead of execute
`tests/test_collection_dml.py`	Updated to use _execute internal method instead of execute
`tests/test_client_creation.py`	Updated to use _execute; added test for Configuration with fulltext parser
`README.md`	Updated documentation with examples of Configuration usage and fulltext parser options
`pyproject.toml`	Added Tuna PyPI mirror as primary source
`.github/workflows/ci.yml`	Removed tuna source in CI; updated OceanBase version; added test output parsing logic

Comments suppressed due to low confidence (1)

src/pyseekdb/client/configuration.py:4

Import of 'get_default_embedding_function' is not used.
Import of 'dimension_of' is not used.

from .embedding_function import get_default_embedding_function, dimension_of

liuhao6741 · 2025-12-12T10:08:52Z

In this example

from pyseekdb import (
    Client,
    Configuration,
    HNSWConfiguration,
    FulltextParserConfig,
    DefaultEmbeddingFunction
)

client = Client(host="127.0.0.1", port=2881, database="test")
ef = DefaultEmbeddingFunction()

# Explicitly use IK parser (default for Chinese)
config = Configuration(
    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),
    fulltext_config=FulltextParserConfig(parser='ik')
)

the magic number 384 is not suitable

liuhao6741 · 2025-12-12T10:47:15Z

"Parser" is not a suitable name; "analyzer" is the more common term in NLP and search engines. Refer to https://www.qianwen.com/share?shareId=12e9fdd7-da05-4ed8-ab95-c979b1984238

hnwyllmm added 9 commits December 9, 2025 14:49

- sentence-transformers (f8cce74)
- configuration (0e73002)
- merge from develop (bcebc5d)
- support fulltext index config (2950110)
- support configuration (d3a8b95)
- fix create vector index error (6ccf7a4)
- test (7497531)
- use test_server prefix instead of test_seekdb_server (de03864)
- use 4.5.0 oceanbase image (6017864)

hnwyllmm requested a review from Copilot December 12, 2025 00:58

Copilot started reviewing on behalf of hnwyllmm December 12, 2025 00:59

Copilot AI reviewed Dec 12, 2025

hnwyllmm added 4 commits December 12, 2025 09:37

- test dimention_of (add3206)
- test dimention_of (98a8447)
- remove tuna source (cbbed21)
- remove tuna (24f4e4d)

hnwyllmm merged commit 1ec851c into oceanbase:develop Dec 12, 2025
5 checks passed

hnwyllmm deleted the embedding-function branch December 12, 2025 03:54

hnwyllmm mentioned this pull request Dec 12, 2025

[Feature]: fulltext index configuration #56

Closed

hnwyllmm mentioned this pull request Jan 20, 2026

Parser is not a proper name, analyzer is better in NLP and search engine. #122

Closed

hnwyllmm merged 13 commits into oceanbase:develop from hnwyllmm:embedding-function
