
Add multi-vector late interaction (ColBERT-style) index support #3970

@lvca

Description

Summary

Add a new index type for multi-vector (ColBERT/ColPali-style) retrieval alongside the existing LSM_VECTOR (HNSW) index. This unlocks citation-grade RAG for agentic use cases (primary consumer: ArcadeBrain).

Motivation

The current LSM_VECTOR index stores one vector per document, which averages token-level signal and loses precision on:

  • multi-concept queries
  • rare terms / proper nouns
  • long documents (books)
  • citation-grade retrieval for LLM agents

ColBERT-style late interaction keeps one vector per token and computes MaxSim at query time. Published benchmarks (BEIR) show recall@10 improvements from ~60-70% (dense) to ~85-95% (late interaction) on hard queries. This is the dominant pattern in 2026 for high-precision RAG.
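
For reference, the MaxSim scoring described above can be expressed in a few lines. This is a plain-Java sketch (class and method names are illustrative, not ArcadeDB API): for each query token vector, take the maximum dot-product similarity over all document token vectors, then sum the per-token maxima.

```java
// Sketch of ColBERT-style MaxSim scoring (illustrative, not ArcadeDB API).
public final class MaxSim {

  // Dot product of two equal-length vectors.
  static float dot(final float[] a, final float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++)
      s += a[i] * b[i];
    return s;
  }

  // MaxSim(Q, D) = sum over q in Q of max over d in D of sim(q, d).
  static float maxSim(final float[][] queryTokens, final float[][] docTokens) {
    float score = 0f;
    for (final float[] q : queryTokens) {
      float best = Float.NEGATIVE_INFINITY;
      for (final float[] d : docTokens)
        best = Math.max(best, dot(q, d)); // best-matching doc token for this query token
      score += best;
    }
    return score;
  }
}
```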

Competitors (Vespa, Qdrant, Weaviate, Milvus) already support this natively. ArcadeDB is the only graph+vector DB that does not.

Scope (Phase 1: denormalized approach)

Each token of a document is indexed as its own HNSW entry carrying a back-pointer RID to the parent document. At query time, each query token is expanded against the HNSW graph, candidates are aggregated by parent RID, and MaxSim is applied per candidate (the existing vector.multiScore(..., 'MAX') function can be reused for the fusion step).

Schema / Metadata

  • Add multiVector: boolean and parentRidProperty: string fields to LSMVectorIndexMetadata
  • Extend TypeLSMVectorIndexBuilder with .withMultiVector(true).withParentProperty("docRid")
  • Parse new multiVector flag in SQL CREATE INDEX ... METADATA {...} path
  • Validate: multi-vector index requires a non-indexed LINK property for parent RID
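
A hypothetical DDL shape for the new flags, assuming the metadata keys land as proposed above (exact CREATE INDEX syntax to be confirmed against the existing LSM_VECTOR path):

```sql
-- Hypothetical: multiVector / parentRidProperty as proposed in this issue
CREATE INDEX ON DocToken (embedding) LSM_VECTOR
  METADATA { "multiVector": true, "parentRidProperty": "docRid" }
```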

Storage / Insert path

  • Add LSMVectorIndex.putMulti(RID parentRid, float[][] tokens) that loops and inserts N entries, each tagged with parentRid
  • Extend VectorLocationIndex entry with parentRid field
  • Hook into record-update path: on parent update, delete old token set, insert new
  • Hook into record-delete path: cascade-delete all tokens where parentRid = deletedRid
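
A minimal in-memory model of this insert/update/delete bookkeeping (illustrative only; `DenormalizedTokenStore` is not the proposed `LSMVectorIndex` API): one entry per token, all back-pointing to the parent, so update and delete can cascade by parent RID.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model of the denormalized layout (not ArcadeDB's real storage).
public final class DenormalizedTokenStore {
  record Entry(String parentRid, float[] vector) {}

  private final List<Entry> entries = new ArrayList<>();

  // putMulti: one index entry per token, each tagged with the parent RID.
  void putMulti(final String parentRid, final float[][] tokens) {
    for (final float[] t : tokens)
      entries.add(new Entry(parentRid, t));
  }

  // Record-update hook: drop the stale token set, then re-insert the new one.
  void updateParent(final String parentRid, final float[][] newTokens) {
    deleteParent(parentRid);
    putMulti(parentRid, newTokens);
  }

  // Record-delete hook: cascade-delete every entry whose parentRid matches.
  void deleteParent(final String parentRid) {
    entries.removeIf(e -> e.parentRid().equals(parentRid));
  }

  int size() {
    return entries.size();
  }
}
```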

Query path

  • New SQL function SQLFunctionVectorMultiNeighbors in function/sql/vector/ - signature vector.multiNeighbors('Type[prop]', float[][], k, {efSearch, candidateMultiplier})
  • Algorithm: for each query token -> HNSW search top-k' (default k'=k*10); collect candidate parent RIDs; compute MaxSim per candidate; sort, return top-k
  • Reuse existing vector.multiScore(..., 'MAX') aggregation
  • Register alias vectorMultiNeighbors for naming consistency
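
The algorithm above can be sketched end-to-end with a brute-force scan standing in for the HNSW search (illustrative names only; `Token`, `candidates`, and `topK` are not ArcadeDB API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the proposed multiNeighbors flow; a brute-force scan stands in
// for the per-token HNSW search.
public final class MultiNeighborsSketch {
  record Token(String parentRid, float[] vec) {}

  static float dot(final float[] a, final float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++)
      s += a[i] * b[i];
    return s;
  }

  // Steps 1-2: per query token, take the top-k' nearest token entries and
  // collect their parent RIDs as candidates.
  static Set<String> candidates(final List<Token> index, final float[][] query, final int kPrime) {
    final Set<String> out = new HashSet<>();
    for (final float[] q : query)
      index.stream()
          .sorted((x, y) -> Float.compare(dot(q, y.vec()), dot(q, x.vec())))
          .limit(kPrime)
          .forEach(t -> out.add(t.parentRid()));
    return out;
  }

  // Steps 3-4: exact MaxSim per candidate parent, then sort and keep top-k.
  static List<String> topK(final List<Token> index, final float[][] query,
                           final int k, final int candidateMultiplier) {
    final Map<String, Float> scores = new HashMap<>();
    for (final String rid : candidates(index, query, k * candidateMultiplier)) {
      float score = 0f;
      for (final float[] q : query) {
        float best = Float.NEGATIVE_INFINITY;
        for (final Token t : index)
          if (t.parentRid().equals(rid))
            best = Math.max(best, dot(q, t.vec()));
        score += best;
      }
      scores.put(rid, score);
    }
    return scores.entrySet().stream()
        .sorted(Map.Entry.<String, Float>comparingByValue().reversed())
        .limit(k)
        .map(Map.Entry::getKey)
        .toList();
  }
}
```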

Cypher integration

  • Add procedure db.index.vector.queryMultiNodes(indexName, k, float[][]) in query/opencypher/procedures/db/

Parameter binding

  • Verify float[][] / nested JSON array binds correctly through PostCommandHandler / PostQueryHandler
  • Test both SQL literal [[0.1, 0.2],[0.3, 0.4]] and HTTP JSON body param
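
A hedged sketch of the HTTP JSON body for the parameterized case, assuming ArcadeDB's usual {language, command, params} request shape (the :tokens parameter name is illustrative):

```json
{
  "language": "sql",
  "command": "SELECT vector.multiNeighbors('Doc[embedding]', :tokens, 5)",
  "params": { "tokens": [[0.1, 0.2], [0.3, 0.4]] }
}
```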

Tests (TDD)

  • LSMMultiVectorIndexTest - 10 docs x 5 tokens x 64 dims, assert top-k matches brute-force MaxSim
  • Regression: cascade delete, update parent vectors
  • HTTP-level test in server/ module with nested array param
  • Cypher test for db.index.vector.queryMultiNodes
  • 1k-docs end-to-end test tagged @Tag("slow")

Out of Scope (future phases)

  • Phase 2: Contiguous binary serialization for 2D float arrays (ARRAY_OF_FLOATS_2D type). The current ARRAY_OF_FLOATS encoding wraps each IEEE754 float in a VarInt (~5 bytes/float worst case); a dense fixed-width format stores exactly 4 bytes/float and removes that per-element overhead.
  • Phase 3: Native LSM_MULTIVECTOR index type with PLAID centroid pre-filter, product quantization for multi-vector.
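
For illustration, a dense 2D float layout of the kind Phase 2 describes could be as simple as a rows/cols header followed by raw 4-byte floats. This is a sketch, not a proposed on-disk format:

```java
import java.nio.ByteBuffer;

// Sketch of a dense 2D float encoding: 8-byte header (rows, cols), then
// raw IEEE754 floats at exactly 4 bytes each. Assumes a rectangular matrix.
public final class DenseFloat2D {

  static byte[] serialize(final float[][] m) {
    final int rows = m.length;
    final int cols = rows == 0 ? 0 : m[0].length;
    final ByteBuffer buf = ByteBuffer.allocate(8 + rows * cols * 4);
    buf.putInt(rows).putInt(cols);
    for (final float[] row : m)
      for (final float v : row)
        buf.putFloat(v);
    return buf.array();
  }

  static float[][] deserialize(final byte[] bytes) {
    final ByteBuffer buf = ByteBuffer.wrap(bytes);
    final int rows = buf.getInt();
    final int cols = buf.getInt();
    final float[][] m = new float[rows][cols];
    for (int r = 0; r < rows; r++)
      for (int c = 0; c < cols; c++)
        m[r][c] = buf.getFloat();
    return m;
  }
}
```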

Compatibility

  • Additive only - new index type, existing LSM_VECTOR untouched.
  • No new dependencies (JVector 4.0 already handles it).
  • Zero breaking changes.

Estimated effort

~2 weeks for one engineer.
