
Serialize and persist bm25 indexes #942

Merged
bplatz merged 41 commits into main from feature/bm25-search-continued
Mar 24, 2025


@zonotope
Contributor

This patch adds functions to serialize and persist any virtual graph indexes as part of the normal indexing process and to reload any persisted indexes when a db is loaded from storage.

This PR is currently a draft because the index isn't persisted when the minimum novelty threshold isn't reached and the graph indexing process never kicks off. I'm still working out the best way to detect, at commit time, that the bm25 index has changed and needs persisting, and how to wait for the bm25 indexing process to complete before persisting the index without holding up the transactor. I wanted to publish the persistence code now for feedback anyway.
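For readers following along, the write-then-reload round trip described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this patch: `storage`, `write-index`, `read-index`, the address string, and the shape of the example index are all hypothetical, and EDN is used here only because it ships with Clojure.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical in-memory stand-in for a storage backend.
(def storage (atom {}))

(defn write-index
  "Serializes `index` to an EDN string and stores it under `address`.
  Returns the address so a caller could record it alongside the db."
  [address index]
  (swap! storage assoc address (pr-str index))
  address)

(defn read-index
  "Reads the EDN string stored at `address` back into Clojure data.
  Returns nil when nothing has been stored at that address."
  [address]
  (some-> (get @storage address) edn/read-string))

;; Hypothetical index shape, for illustration only.
(def example-index
  {:terms      {"graph" {:doc-freq 2}
                "index" {:doc-freq 1}}
   :avg-length 4.5})
```

Writing `example-index` with `write-index` and reading the returned address with `read-index` should give back a value equal to the original, which is the property a persisted index needs when a db is reloaded from storage.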

@zonotope zonotope requested a review from bplatz November 26, 2024 19:22
Contributor

@bplatz bplatz left a comment


Awesome!

Are you thinking we should write out new VG indexes with every commit, or try to rebuild the index from the last FlakeIndex 't' and just hook into that process?

(loop [[[vg-alias vg] & r] vg-map
       address-map {}]
  (if vg-alias
    (let [address (<? (vg/write-vg index-catalog vg))]

I'm guessing you are light on optimizations for now intentionally - but maybe at least worth a TODO that this ideally would be parallelized.
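One way the parallelization suggested here could look, as a rough sketch only: every write is started before any result is awaited, so the writes overlap instead of running one at a time as in the loop above. The snippet uses plain JVM futures rather than the channel-based `<?` style in the patch, and `write-vg` below is a hypothetical stand-in, not the real `vg/write-vg`.

```clojure
(defn write-vg
  "Hypothetical stand-in for a virtual graph writer: pretends to persist
  `vg` and returns a made-up storage address."
  [vg]
  (Thread/sleep 10) ; simulate storage latency
  (str "address-for-" (:alias vg)))

(defn write-all-vgs
  "Starts one write per virtual graph, then collects the results into a
  map of alias -> address."
  [vg-map]
  (let [;; mapv is eager, so every future is running before the first deref
        pending (mapv (fn [[vg-alias vg]]
                        [vg-alias (future (write-vg vg))])
                      vg-map)]
    (into {}
          (map (fn [[vg-alias fut]] [vg-alias @fut]))
          pending)))
```

With this shape the total wait is roughly the slowest single write rather than the sum of all of them. A channel-based version would follow the same pattern: kick off all writes first, then take from each result channel.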

(loop [[[vg-alias address] & r] vg-address-map
       vg-map {}]
  (if vg-alias
    (let [vg (<? (vg/read-vg index-catalog address))]

Not sure of best strategy, but ideally we delay loads until the index is used.
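A minimal sketch of what delaying loads could look like, assuming each address is wrapped in a Clojure `delay` that is only realized on first use. Nothing here is from the patch: `read-vg` is a hypothetical stand-in for the real reader, and `load-count` exists only to make the laziness observable.

```clojure
;; Counts calls to the stand-in reader so the laziness is observable.
(def load-count (atom 0))

(defn read-vg
  "Hypothetical stand-in for a virtual graph reader."
  [address]
  (swap! load-count inc)
  {:address address})

(defn lazy-vg-map
  "Builds a map of alias -> delay. Nothing is read from storage until a
  delay is dereferenced."
  [vg-address-map]
  (into {}
        (map (fn [[vg-alias address]]
               [vg-alias (delay (read-vg address))]))
        vg-address-map))

(defn get-vg
  "Forces the load for `vg-alias` on first use and returns the cached
  value afterwards. Returns nil for an unknown alias."
  [vg-map vg-alias]
  (some-> (get vg-map vg-alias) deref))
```

Because a `delay` caches its result, repeated lookups of the same alias trigger only one read. An asynchronous reader would need a different wrapper, since dereferencing a `delay` blocks the calling thread.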


#?(:clj (set! *warn-on-reflection* true))

(defprotocol GraphSerializer

Seems too broad, maybe FlakeIndexSerializer?

@bplatz bplatz mentioned this pull request Jan 27, 2025
…eature/bm25-search-continued

# Conflicts:
#	src/clj/fluree/db/virtual_graph/bm25/index.clj
#	src/clj/fluree/db/virtual_graph/parse.cljc
Base automatically changed from feature/bm25-search to main January 28, 2025 11:48
@bplatz bplatz mentioned this pull request Mar 19, 2025
@bplatz bplatz merged commit 765337e into main Mar 24, 2025
6 checks passed
@bplatz bplatz deleted the feature/bm25-search-continued branch March 24, 2025 17:00