Skip to content

FULL_TEXT index on LIKE queries shows no performance improvement or regression #2703

@tae898

Description

@tae898

Summary

FULL_TEXT indexes on text fields do not improve performance for LIKE '%term%' queries, and in some cases cause performance regression compared to sequential scans.

Environment

  • ArcadeDB Version: 25.9.1
  • Dataset: MovieLens (ml-latest: 62,423 movies, 33M+ ratings)
  • Index Type: FULL_TEXT on Movie.genres field
  • Query Pattern: SELECT count(*) FROM Movie WHERE genres LIKE '%Action%'

Expected Behavior

FULL_TEXT indexes should improve or at least not degrade performance for text search queries.

Actual Behavior

FULL_TEXT index causes 10.8% performance regression on COUNT queries with LIKE:

Performance Results

Query: Count ALL Action movies (LIKE, no LIMIT)

  • Without index: 58.7ms ± 8.4ms
  • With FULL_TEXT index: 65.1ms ± 18.0ms
  • Result: 0.9x speedup (10.8% SLOWER)

Query: Find movies by genre (LIKE with LIMIT 10)

  • Without index: 1.0ms ± 1.0ms
  • With FULL_TEXT index: 0.8ms ± 0.3ms
  • Result: 1.3x speedup (marginal improvement)

Comparison with Other Indexes

For comparison, LSM_TREE indexes on the same dataset show massive improvements:

  • Rating(userId): 14,836x speedup
  • Rating(movieId): 107.9x speedup
  • Movie(movieId): 58.1x speedup

Test Code

-- Create FULL_TEXT index
CREATE INDEX ON Movie (genres) FULL_TEXT

-- Query that shows no improvement
SELECT count(*) as count FROM Movie WHERE genres LIKE '%Action%'

Dataset: Genres field contains pipe-delimited values like "Action|Comedy|Sci-Fi"

Analysis

The FULL_TEXT index appears to add overhead without providing optimization for LIKE '%term%' pattern matching:

  1. Tokenization mismatch: FULL_TEXT tokenizes "Action|Comedy" into separate terms
  2. Pattern matching: LIKE '%Action%' is substring search, not token search
  3. Index overhead: Query planner may attempt to use index inefficiently

Questions

  1. Should LIKE queries automatically use FULL_TEXT indexes when available?
  2. Is there alternative syntax (e.g., SEARCH function) that properly utilizes FULL_TEXT indexes?
  3. Should the query planner avoid using FULL_TEXT indexes for LIKE patterns?

Reproduction

Full reproduction code available in Python bindings example:
https://github.com/humemai/arcadedb-embedded-python/blob/main/bindings/python/examples/04_csv_import_documents.py

The example demonstrates:

  • CSV import with automatic schema inference
  • Index creation after bulk import
  • Performance comparison with/without indexes
  • Result validation

Request

Please clarify:

  1. Expected behavior of FULL_TEXT indexes with LIKE queries
  2. Recommended syntax for full-text search on tokenized fields
  3. Whether this is expected behavior or a query optimizer issue

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions