Summary
Add a new index type for multi-vector (ColBERT/ColPali-style) retrieval alongside the existing LSM_VECTOR (HNSW) index. This unlocks citation-grade RAG for agentic use cases (primary consumer: ArcadeBrain).
Motivation
The current LSM_VECTOR index stores one vector per document, which averages token-level signal and loses precision on:
- multi-concept queries
- rare terms / proper nouns
- long documents (books)
- citation-grade retrieval for LLM agents
ColBERT-style late interaction keeps one vector per token and computes MaxSim at query time. Published benchmarks (BEIR) show recall@10 improvements from ~60-70% (dense) to ~85-95% (late interaction) on hard queries. This is the dominant pattern in 2026 for high-precision RAG.
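For reference, the MaxSim score mentioned above is usually written as follows. The symbols are chosen here for illustration: q_1..q_m are the query token vectors, d_1..d_n are the token vectors of one document, and sim is typically a dot product over normalized embeddings.

```latex
\[
  \mathrm{MaxSim}(Q, D) \;=\; \sum_{i=1}^{m} \; \max_{1 \le j \le n} \operatorname{sim}(q_i, d_j)
\]
```

Each query token contributes only its best-matching document token, which is why a rare term or proper noun can still dominate the score instead of being averaged away.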
Competitors (Vespa, Qdrant, Weaviate, Milvus) already support this natively. ArcadeDB is the only graph+vector DB that does not.
Scope (Phase 1: denormalized approach)
Each token of a document is indexed as its own HNSW entry carrying a back-pointer RID to the parent document. At query time, each query token is searched against the HNSW graph, the resulting candidates are grouped by parent RID, and MaxSim is applied per parent (the existing vector.multiScore(..., 'MAX') function can be reused for the fusion step).
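The flow described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the proposed implementation: the class MaxSimSketch, the TokenEntry record, and the nearest() helper are hypothetical names, parent RIDs are modeled as strings, and an exhaustive scan stands in for the per-token HNSW search.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-token entries with a parent back-pointer, fused by MaxSim. */
class MaxSimSketch {

  /** One denormalized index entry: a token vector plus the id of its parent document. */
  record TokenEntry(String parentRid, float[] vector) {}

  /** Dot product, used here as the similarity function. */
  static float dot(float[] a, float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  /**
   * Stand-in for the per-token HNSW lookup (an assumption of this sketch):
   * an exhaustive scan returning the perTokenK entries most similar to one query token.
   */
  static List<TokenEntry> nearest(List<TokenEntry> index, float[] queryToken, int perTokenK) {
    List<TokenEntry> sorted = new ArrayList<>(index);
    sorted.sort((x, y) -> Float.compare(dot(y.vector(), queryToken), dot(x.vector(), queryToken)));
    return sorted.subList(0, Math.min(perTokenK, sorted.size()));
  }

  /** Returns parentRid -> sum over query tokens of the best similarity found for that parent. */
  static Map<String, Float> score(List<TokenEntry> index, float[][] queryTokens, int perTokenK) {
    Map<String, Float> totals = new HashMap<>();
    for (float[] q : queryTokens) {
      // Best similarity per parent among this query token's candidates (the "Max" in MaxSim).
      Map<String, Float> best = new HashMap<>();
      for (TokenEntry e : nearest(index, q, perTokenK))
        best.merge(e.parentRid(), dot(e.vector(), q), Float::max);
      // Add this token's per-parent maxima to each parent's running score.
      best.forEach((rid, s) -> totals.merge(rid, s, Float::sum));
    }
    return totals;
  }
}
```

In this sketch a parent that does not appear among a query token's candidates contributes nothing for that token, so the result approximates exact MaxSim; a larger per-token candidate count narrows that gap at the cost of more work per query.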
Schema / Metadata
- Add multiVector: boolean and parentRidProperty: string fields to LSMVectorIndexMetadata
- TypeLSMVectorIndexBuilder with .withMultiVector(true).withParentProperty("docRid")
- multiVector flag in SQL CREATE INDEX ... METADATA {...} path
- LINK property for parent RID
Storage / Insert path
- LSMVectorIndex.putMulti(RID parentRid, float[][] tokens) that loops and inserts N entries, each tagged with parentRid
- VectorLocationIndex entry with parentRid field
- parentRid = deletedRid
Query path
- SQLFunctionVectorMultiNeighbors in function/sql/vector/ - signature vector.multiNeighbors('Type[prop]', float[][], k, {efSearch, candidateMultiplier})
- vector.multiScore(..., 'MAX') aggregation
- vectorMultiNeighbors for naming consistency
Cypher integration
- db.index.vector.queryMultiNodes(indexName, k, float[][]) in query/opencypher/procedures/db/
Parameter binding
- float[][] / nested JSON array binds correctly through PostCommandHandler / PostQueryHandler
- [[0.1, 0.2],[0.3, 0.4]] and HTTP JSON body param
Tests (TDD)
- LSMMultiVectorIndexTest - 10 docs x 5 tokens x 64 dims, assert top-k matches brute-force MaxSim
- server/ module with nested array param
- db.index.vector.queryMultiNodes
- @Tag("slow")
Out of Scope (future phases)
- Phase 2: Contiguous binary serialization for 2D float arrays (ARRAY_OF_FLOATS_2D type). Current ARRAY_OF_FLOATS uses VarInt-wrapped IEEE754 (~5 bytes/float worst case) - a dense format halves storage.
- Phase 3: Native LSM_MULTIVECTOR index type with PLAID centroid pre-filter, product quantization for multi-vector.
Compatibility
- Additive only - new index type, existing LSM_VECTOR untouched.
- No new dependencies (JVector 4.0 already handles it).
- Zero breaking changes.
Estimated effort
~2 weeks for one engineer.