Description
Summary
Enhance ArcadeDB's full-text search capabilities with configurable analyzers, Lucene query syntax support, new SQL functions, and relevance scoring.
Features
1. Configurable Lucene Analyzers
Support custom analyzers per index via SQL METADATA:
CREATE INDEX ON Article (content) FULL_TEXT
METADATA {"analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"}
Options:
- analyzer - Default analyzer class
- index_analyzer / query_analyzer - Separate analyzers for indexing vs querying
- <field>_analyzer - Per-field analyzer overrides
- allowLeadingWildcard - Enable leading wildcard queries
- defaultOperator - Default boolean operator (AND/OR)
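The options above can be read as a lookup with fallbacks. The following is an illustrative Python sketch, not ArcadeDB code: the precedence order (per-field override, then index_analyzer or query_analyzer, then analyzer, then a default), the function name `resolve_analyzer`, and the StandardAnalyzer default are all assumptions made for the example.

```python
# Assumed fallback when the metadata names no analyzer at all.
DEFAULT_ANALYZER = "org.apache.lucene.analysis.standard.StandardAnalyzer"


def resolve_analyzer(metadata, field, mode):
    """Pick an analyzer class name for `field`; mode is "index" or "query".

    Assumed precedence: <field>_analyzer, then index_analyzer/query_analyzer,
    then analyzer, then DEFAULT_ANALYZER.
    """
    if mode not in ("index", "query"):
        raise ValueError("mode must be 'index' or 'query'")
    for key in (f"{field}_analyzer", f"{mode}_analyzer", "analyzer"):
        if key in metadata:
            return metadata[key]
    return DEFAULT_ANALYZER
```

Under these assumptions, a metadata map with `title_analyzer` set would use that class for `title` and fall back to the shared settings for every other field.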
2. Multi-Property Full-Text Indexes
Index multiple STRING fields in a single full-text index:
CREATE INDEX ON Article (title, body) FULL_TEXT
Supports both unqualified searches and field-specific queries (title:java).
3. SEARCH_INDEX() SQL Function
Search full-text indexes using Lucene query syntax:
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', '+java +programming -python')
Supports:
- Boolean operators: + (required), - (excluded)
- Phrase queries: "exact phrase"
- Wildcards: java*, te?t
- Field-specific: title:java
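To make the operator semantics concrete, here is a small Python sketch that evaluates a subset of this syntax against a single text. It is an approximation for illustration only: it splits on whitespace instead of using an analyzer, ignores phrases and field prefixes, and the function name `matches` is hypothetical. Real parsing would be done by Lucene's QueryParser.

```python
from fnmatch import fnmatchcase


def matches(query, text):
    """Evaluate a tiny subset of the query syntax against one text.

    Handles +term (required), -term (excluded), bare terms (optional),
    and the * and ? wildcards. Phrases and field:term are not handled.
    """
    tokens = text.lower().split()

    def has(pattern):
        # True if any token in the text matches the (possibly wildcard) term.
        return any(fnmatchcase(token, pattern) for token in tokens)

    required, excluded, optional = [], [], []
    for clause in query.lower().split():
        if clause.startswith("+"):
            required.append(clause[1:])
        elif clause.startswith("-"):
            excluded.append(clause[1:])
        else:
            optional.append(clause)

    if any(has(term) for term in excluded):
        return False
    if required:
        return all(has(term) for term in required)
    # With no required clause, at least one optional term must match.
    return any(has(term) for term in optional)
```

For example, `matches("+java +programming -python", "Java programming basics")` is true, while the same query against a text that also mentions Python is false.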
4. SEARCH_FIELDS() SQL Function
Auto-discovers the appropriate full-text index by field names:
SELECT * FROM Article
WHERE SEARCH_FIELDS(['title', 'content'], 'database')
5. Relevance Scoring ($score)
Expose search relevance scores in query results:
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC
Score = number of matching search terms per document.
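The scoring rule stated above can be sketched in a few lines of Python. This is a hedged illustration: it assumes the score counts distinct query terms (not repeated occurrences) and uses whitespace tokenization in place of an analyzer. The names `term_match_score` and `rank` are hypothetical.

```python
def term_match_score(query_terms, document_text):
    """Count how many distinct query terms occur in the document."""
    doc_tokens = set(document_text.lower().split())
    distinct_terms = {term.lower() for term in query_terms}
    return sum(1 for term in distinct_terms if term in doc_tokens)


def rank(query_terms, docs):
    """Return (title, score) pairs for matching docs, highest score first."""
    scored = [(title, term_match_score(query_terms, text))
              for title, text in docs.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)
```

With this reading, a document containing both "java" and "programming" scores 2 and sorts ahead of one containing only "java", which mirrors the `ORDER BY $score DESC` query.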
6. SEARCH_INDEX_MORE() - Similar Document Search
Find documents similar to one or more source documents using the More Like This (MLT) algorithm:
SELECT title, $similarity FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3, #10:4])
ORDER BY $similarity DESC
Features:
- Uses TF-IDF term analysis to find similar content
- Supports multiple source documents (combines their terms)
- Returns normalized similarity score ($similarity: 0.0 to 1.0)
- Configurable with 9 parameters (minTermFreq, maxQueryTerms, etc.)
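The term-selection step behind these features can be sketched as follows. This Python example is a simplified illustration, not the Lucene MoreLikeThis implementation: it assumes documents are already tokenized, uses one common idf form, and mirrors only three of the parameters (minTermFreq, minDocFreq, maxQueryTerms). The function name and the default values are assumptions.

```python
import math
from collections import Counter


def select_query_terms(source_docs, corpus, min_term_freq=2,
                       min_doc_freq=5, max_query_terms=25):
    """Pick the highest TF-IDF terms from the combined source documents.

    source_docs and corpus are lists of token lists. Terms from all source
    documents are pooled, which is how multiple sources are combined here.
    """
    tf = Counter(tok for doc in source_docs for tok in doc)   # pooled term frequency
    df = Counter(tok for doc in corpus for tok in set(doc))   # document frequency
    n_docs = len(corpus)

    scored = []
    for term, freq in tf.items():
        if freq < min_term_freq or df[term] < min_doc_freq:
            continue  # too rare in the sources or in the corpus
        idf = math.log((n_docs + 1) / (df[term] + 1)) + 1.0
        scored.append((term, freq * idf))

    # Best score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in scored[:max_query_terms]]
```

The selected terms would then form the query that is run against the index; raising `min_term_freq` or `min_doc_freq` narrows the query to more characteristic terms.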
Use cases:
- Content recommendations ("related articles")
- Duplicate detection
- Exploratory search
- Content clustering
7. SEARCH_FIELDS_MORE() - Similar Document Search by Fields
Auto-discovers the full-text index by field names for similarity search:
SELECT title, $similarity FROM Article
WHERE SEARCH_FIELDS_MORE(['title', 'body'], [#10:3])
ORDER BY $similarity DESC
Configuration parameters:
SELECT title, $similarity FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3, #10:4], {
'minTermFreq': 2,
'minDocFreq': 5,
'maxQueryTerms': 50,
'excludeSource': false
})
ORDER BY $similarity DESC
8. Similarity Scoring ($similarity)
Normalized similarity scores (0.0 to 1.0) for More Like This queries:
SELECT title, $score, $similarity FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3])
ORDER BY $similarity DESC
- $similarity - Normalized score (1.0 = most similar)
- $score - Raw Lucene score (accumulated term matches)
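One way the relationship between $score and $similarity could work is division by the highest raw score, so the best match gets 1.0. The Python sketch below shows that reading; the normalization method is an assumption consistent with "1.0 = most similar", not a confirmed implementation detail, and `normalize_scores` is a hypothetical name.

```python
def normalize_scores(raw_scores):
    """Scale raw scores into the 0.0 to 1.0 range by dividing by the maximum.

    raw_scores maps a record identifier to its raw score.
    """
    top = max(raw_scores.values(), default=0.0)
    if top <= 0:
        # No positive scores: nothing is similar, so every entry maps to 0.0.
        return {key: 0.0 for key in raw_scores}
    return {key: score / top for key, score in raw_scores.items()}
```

Under this assumption, raw scores of 4.0, 2.0 and 1.0 would become 1.0, 0.5 and 0.25.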
Use Cases
- Search applications - Build search functionality with relevance ranking
- Multilingual content - Use language-specific analyzers (English, French, German, etc.)
- Advanced queries - Boolean logic, phrase matching, wildcards
- Cross-field search - Search across title + body with a single index
- Content recommendations - "Readers who liked this also enjoyed..."
- Related content - Automatically find similar articles, documents, or products
- Duplicate detection - Identify near-duplicate content
Implementation Notes
- Uses Apache Lucene 10.x QueryParser and MoreLikeThis
- Backward compatible with existing CONTAINSTEXT operator
- Thread-safe query execution
- Results cached per query execution for efficiency
- More Like This uses TF-IDF algorithm for similarity ranking