Full-Text Index Engine Improvements #3220

@robfrank

Summary

Enhance ArcadeDB's full-text search capabilities with configurable analyzers, Lucene query syntax support, new SQL functions, and relevance scoring.

Features

1. Configurable Lucene Analyzers

Support custom analyzers per index via SQL METADATA:

CREATE INDEX ON Article (content) FULL_TEXT 
METADATA {"analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"}

Options:

  • analyzer - Default analyzer class
  • index_analyzer / query_analyzer - Separate analyzers for indexing vs querying
  • <field>_analyzer - Per-field analyzer overrides
  • allowLeadingWildcard - Enable leading wildcard queries
  • defaultOperator - Default boolean operator (AND/OR)
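As a rough illustration of how these options fit together, the sketch below composes the CREATE INDEX statement from a dictionary of metadata. The helper `build_fulltext_index_sql` is hypothetical and not part of ArcadeDB. The option names come from the list above, while the value types used for `allowLeadingWildcard` and `defaultOperator` are assumptions.

```python
import json

# Hypothetical helper: only builds the SQL text, it does not talk to a database.
def build_fulltext_index_sql(type_name, properties, metadata=None):
    """Compose a CREATE INDEX ... FULL_TEXT statement as a string."""
    sql = f"CREATE INDEX ON {type_name} ({', '.join(properties)}) FULL_TEXT"
    if metadata:
        # json.dumps keeps the insertion order of the dictionary keys.
        sql += " METADATA " + json.dumps(metadata)
    return sql

# Reproduces the statement shown above.
basic = build_fulltext_index_sql(
    "Article",
    ["content"],
    {"analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"},
)

# Assumed value types: a JSON boolean and a string.
tuned = build_fulltext_index_sql(
    "Article",
    ["content"],
    {"allowLeadingWildcard": True, "defaultOperator": "AND"},
)

print(basic)
print(tuned)
```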

2. Multi-Property Full-Text Indexes

Index multiple STRING fields in a single full-text index:

CREATE INDEX ON Article (title, body) FULL_TEXT

Supports both unqualified searches and field-specific queries (title:java).

3. SEARCH_INDEX() SQL Function

Search full-text indexes using Lucene query syntax:

SELECT title, $score FROM Article 
WHERE SEARCH_INDEX('Article[content]', '+java +programming -python')

Supports:

  • Boolean operators: + (required), - (excluded)
  • Phrase queries: "exact phrase"
  • Wildcards: java*, te?t
  • Field-specific: title:java
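To make the clause types above concrete, here is a minimal Python sketch that splits a query string into clauses and labels each one. It is not Lucene's QueryParser and ignores most of its grammar (grouping, escaping, boosts, ranges). The names `Clause` and `parse_clauses` are made up for illustration, and treating unprefixed clauses as "optional" assumes a default operator of OR.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Optional +/- prefix, optional "field:" prefix, then a quoted phrase or a bare token.
_CLAUSE = re.compile(r'([+-]?)(?:(\w+):)?("[^"]*"|\S+)')

@dataclass
class Clause:
    occur: str            # "required", "excluded", or "optional"
    field: Optional[str]  # None for unqualified clauses
    text: str
    kind: str             # "phrase", "wildcard", or "term"

def parse_clauses(query: str) -> List[Clause]:
    clauses = []
    for sign, field, text in _CLAUSE.findall(query):
        occur = {"+": "required", "-": "excluded"}.get(sign, "optional")
        if text.startswith('"'):
            kind, text = "phrase", text.strip('"')
        elif "*" in text or "?" in text:
            kind = "wildcard"
        else:
            kind = "term"
        clauses.append(Clause(occur, field or None, text, kind))
    return clauses

for clause in parse_clauses('+java +programming -python title:java "exact phrase" te?t'):
    print(clause)
```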

4. SEARCH_FIELDS() SQL Function

Auto-discovers the appropriate full-text index by field names:

SELECT * FROM Article 
WHERE SEARCH_FIELDS(['title', 'content'], 'database')
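The issue does not spell out the discovery rule, so the following Python sketch only shows one plausible reading: pick the full-text index whose property set equals the requested field names, regardless of order. The function `find_index_for_fields` and the index catalog are hypothetical; the actual matching logic may differ (for example, it might accept an index that covers a superset of the fields).

```python
from typing import Dict, List, Optional

def find_index_for_fields(indexes: Dict[str, List[str]], fields: List[str]) -> Optional[str]:
    """Return the name of the first index whose properties match the fields exactly."""
    wanted = set(fields)
    for name, properties in indexes.items():
        # Assumption: order of the field names does not matter.
        if set(properties) == wanted:
            return name
    return None

# Hypothetical catalog, using the Type[prop,...] naming seen in the examples.
catalog = {
    "Article[content]": ["content"],
    "Article[title,content]": ["title", "content"],
}

print(find_index_for_fields(catalog, ["title", "content"]))
print(find_index_for_fields(catalog, ["summary"]))
```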

5. Relevance Scoring ($score)

Expose search relevance scores in query results:

SELECT title, $score FROM Article 
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC

The score is the number of search terms that match in each document.
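A small Python sketch of this scoring rule, assuming that "matching search terms" means distinct query terms found in the document and that text is tokenized by lowercasing and splitting on whitespace (the real index would use the configured analyzer). The function name and sample data are illustrative only.

```python
def match_count_score(query: str, text: str) -> int:
    """Count how many distinct query terms occur in the document text."""
    query_terms = set(query.lower().split())
    doc_terms = set(text.lower().split())
    return len(query_terms & doc_terms)

articles = {
    "Java programming basics": "java programming for beginners",
    "Java release notes": "what is new in java",
    "Gardening tips": "how to grow tomatoes",
}

# Score every article, keep the matches, and order by score descending.
scored = [(title, match_count_score("java programming", body)) for title, body in articles.items()]
hits = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)

for title, score in hits:
    print(title, score)
```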

6. SEARCH_INDEX_MORE() - Similar Document Search

Find documents similar to one or more source documents using the More Like This (MLT) algorithm:

SELECT title, $similarity FROM Article 
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3, #10:4])
ORDER BY $similarity DESC

Features:

  • Uses TF-IDF term analysis to find similar content
  • Supports multiple source documents (combines their terms)
  • Returns normalized similarity score ($similarity: 0.0 to 1.0)
  • Configurable with 9 parameters (minTermFreq, maxQueryTerms, etc.)

Use cases:

  • Content recommendations ("related articles")
  • Duplicate detection
  • Exploratory search
  • Content clustering

7. SEARCH_FIELDS_MORE() - Similar Document Search by Fields

Auto-discovers the full-text index by field names for similarity search:

SELECT title, $similarity FROM Article 
WHERE SEARCH_FIELDS_MORE(['title', 'body'], [#10:3])
ORDER BY $similarity DESC

Configuration parameters are passed as an optional third argument, shown here with SEARCH_INDEX_MORE():

SELECT title, $similarity FROM Article 
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3, #10:4], {
  'minTermFreq': 2,
  'minDocFreq': 5,
  'maxQueryTerms': 50,
  'excludeSource': false
})
ORDER BY $similarity DESC
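The sketch below illustrates, in plain Python, how the three term-selection parameters typically interact in a More Like This style query: terms from all source documents are combined, filtered by `minTermFreq` and `minDocFreq`, ranked by a TF-IDF weight, and cut off at `maxQueryTerms`. It is a simplification, not Lucene's MoreLikeThis code: the IDF formula, the whitespace tokenizer, and the function name `select_query_terms` are assumptions, and `excludeSource` is not modeled.

```python
import math
from collections import Counter
from typing import List

def select_query_terms(source_docs: List[str], corpus: List[str], *,
                       min_term_freq: int, min_doc_freq: int,
                       max_query_terms: int) -> List[str]:
    # Combine term frequencies across all source documents.
    term_freq = Counter()
    for doc in source_docs:
        term_freq.update(doc.lower().split())

    # Document frequency: how many corpus documents contain each term.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq[term]
        if df == 0 or tf < min_term_freq or df < min_doc_freq:
            continue  # too rare in the sources or in the corpus
        idf = math.log(len(corpus) / df) + 1.0  # assumed IDF variant
        scored.append((tf * idf, term))

    # Highest weight first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]

corpus = [
    "graph database query engine",
    "graph database storage engine",
    "document database query language",
    "cooking pasta at home",
]
sources = ["graph database graph query", "graph database engine"]

print(select_query_terms(sources, corpus, min_term_freq=2, min_doc_freq=2, max_query_terms=50))
```

Raising `minTermFreq` or `minDocFreq` drops more terms, while lowering `maxQueryTerms` keeps only the highest-weighted ones.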

8. Similarity Scoring ($similarity)

Normalized similarity scores (0.0 to 1.0) for More Like This queries:

SELECT title, $score, $similarity FROM Article 
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:3])
ORDER BY $similarity DESC

  • $similarity - Normalized score (1.0 = most similar)
  • $score - Raw Lucene score (accumulated term matches)
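One simple way to obtain a 0.0 to 1.0 range where the best match gets 1.0 is to divide each raw score by the highest raw score in the result set. The Python sketch below shows that approach. Whether the implementation normalizes exactly this way is an assumption, and the record IDs and raw scores are invented for the example.

```python
from typing import Dict

def normalize_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw scores so that the highest one becomes 1.0."""
    if not raw_scores:
        return {}
    top = max(raw_scores.values())
    if top <= 0:
        # Nothing matched meaningfully; avoid dividing by zero.
        return {doc: 0.0 for doc in raw_scores}
    return {doc: score / top for doc, score in raw_scores.items()}

# Hypothetical raw $score values keyed by record ID.
raw = {"#10:7": 8.0, "#10:9": 4.0, "#10:12": 2.0}
similarity = normalize_scores(raw)

for rid, value in sorted(similarity.items(), key=lambda item: item[1], reverse=True):
    print(rid, value)
```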

Use Cases

  • Search applications - Build search functionality with relevance ranking
  • Multilingual content - Use language-specific analyzers (English, French, German, etc.)
  • Advanced queries - Boolean logic, phrase matching, wildcards
  • Cross-field search - Search across title + body with a single index
  • Content recommendations - "Readers who liked this also enjoyed..."
  • Related content - Automatically find similar articles, documents, or products
  • Duplicate detection - Identify near-duplicate content

Implementation Notes

  • Uses Apache Lucene 10.x QueryParser and MoreLikeThis
  • Backward compatible with existing CONTAINSTEXT operator
  • Thread-safe query execution
  • Results cached per query execution for efficiency
  • More Like This uses the TF-IDF algorithm for similarity ranking

Labels

enhancement (New feature or request)
