Description
- Context: Long-running MSMARCO builds (10M/20M) log “Building JVector graph index …” and then stay silent for hours. There is no progress visibility, so it is hard to tell whether the process is GC-thrashing or still building. The current code only logs start/end and relies on `graphCallback`, but call sites pass `null`.
- Expected: Periodic progress logs during graph build and persistence (e.g., nodes added / total, inserts in progress, persistence chunks). Optionally, throttle the logs (every few seconds) so they stay GC-friendly.
- Actual: A single INFO line at the start of the graph build, then nothing until completion. No visibility into progress or stalls.
- Relevant code: LSMVectorIndex.java
  - Graph build and monitor thread: LSMVectorIndex.java
  - Persistence progress hook: LSMVectorIndex.java
  - Build entry point currently called with `graphCallback=null`: LSMVectorIndex.java
- Proposal:
  - Add a small `GraphBuildCallback` impl that logs progress at INFO (throttled, e.g., every 5–10s) for phases “validating”, “building”, “persisting”.
  - Pass this callback into `build(...)` instead of `null` so the existing monitor thread emits progress.
  - (Optional) Log chunk persistence at INFO with bytes written (already available via `ChunkCommitCallback`) and keep DEBUG/FINE for noisy details.
  - Consider exposing a config to disable/slow progress logging if needed.
- Benefit: Operators can see forward movement (or lack thereof) during multi-hour graph builds, aiding debugging of stalls/GC thrash and long indexing runs.
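
To make the proposal concrete, here is a minimal sketch of the throttling logic. It does not use the real `GraphBuildCallback` signature, which is not shown in this issue: `BuildProgressListener`, `ThrottledProgressLogger`, and the `onProgress(phase, processed, total)` shape are hypothetical stand-ins, and the actual impl would adapt to whatever methods `GraphBuildCallback` declares. The clock and the output sink are injected so the throttling can be tested without sleeping.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.LongSupplier;
import java.util.logging.Logger;

// Hypothetical stand-in for the project's graph build callback; the real signature may differ.
interface BuildProgressListener {
  void onProgress(String phase, long processed, long total);
}

// Emits a line whenever the phase changes, otherwise at most one line per interval.
final class ThrottledProgressLogger implements BuildProgressListener {
  private final long intervalNanos;
  private final LongSupplier nanoClock;   // injected so tests can control time
  private final Consumer<String> sink;    // where formatted lines go
  private String lastPhase;               // null until the first update
  private long lastLogNanos;

  ThrottledProgressLogger(long intervalNanos, LongSupplier nanoClock, Consumer<String> sink) {
    this.intervalNanos = intervalNanos;
    this.nanoClock = nanoClock;
    this.sink = sink;
  }

  // Convenience factory: monotonic-clock throttling, output at INFO via java.util.logging.
  static ThrottledProgressLogger every(long seconds, Logger log) {
    return new ThrottledProgressLogger(
        TimeUnit.SECONDS.toNanos(seconds), System::nanoTime, msg -> log.info(msg));
  }

  // synchronized because a monitor thread and a build thread might both report progress.
  @Override
  public synchronized void onProgress(String phase, long processed, long total) {
    final long now = nanoClock.getAsLong();
    final boolean phaseChanged = !phase.equals(lastPhase);
    if (!phaseChanged && now - lastLogNanos < intervalNanos)
      return; // throttled: same phase and the interval has not elapsed yet
    lastPhase = phase;
    lastLogNanos = now;
    final double pct = total > 0 ? 100.0 * processed / total : 0.0;
    sink.accept(String.format(Locale.ROOT,
        "Graph build [%s]: %d/%d (%.1f%%)", phase, processed, total, pct));
  }
}
```

A call site could then create one instance per build, for example `ThrottledProgressLogger.every(5, Logger.getLogger("LSMVectorIndex"))`, and pass an adapter around it where `null` is passed today. Logging on every phase change regardless of the interval means short phases such as “validating” still appear in the log.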