Add progress logging for JVector graph build in LSMVectorIndex #3163

@tae898

Description

  • Context: Long-running MSMARCO builds (10M/20M) log “Building JVector graph index …” and then stay silent for hours. With no progress visibility, it is hard to tell whether the process is GC-thrashing or still building. The current code logs only the start and end of the build; it supports a graphCallback, but call sites pass null.

  • Expected: Periodic progress logs during graph build and persistence (e.g., nodes added / total, inserts in progress, persistence chunks). Optionally throttle the logs (e.g., one line every few seconds) so that logging itself adds little allocation or GC pressure.

  • Actual: Single INFO line at start of graph build, then nothing until completion. No visibility into progress or stalls.

  • Relevant code: LSMVectorIndex.java

    • Graph build and monitor thread: LSMVectorIndex.java
    • Persistence progress hook: LSMVectorIndex.java
    • Build entry point currently called with graphCallback=null: LSMVectorIndex.java
  • Proposal:

    1. Add a small GraphBuildCallback impl that logs progress at INFO (throttled, e.g., every 5–10s) for phases “validating”, “building”, “persisting”.
    2. Pass this callback into build(...) instead of null so the existing monitor thread emits progress.
    3. (Optional) Log chunk persistence at INFO with bytes written (already available via ChunkCommitCallback) and keep DEBUG/FINE for noisy details.
    4. Consider exposing a config to disable/slow progress logging if needed.
  • Benefit: Operators can see forward movement (or lack thereof) during multi-hour graph builds, aiding debugging of stalls/GC thrash and long indexing runs.
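
As a rough illustration of proposal items 1 and 2, a throttled progress logger could look like the sketch below. It makes several assumptions. The `ProgressCallback` interface and its `onProgress(phase, processed, total)` signature are hypothetical stand-ins, since the actual `GraphBuildCallback` signature in LSMVectorIndex.java is not quoted in this issue. `ThrottledProgressLogger`, the injected clock, and the message format are likewise illustrative. The logger emits a line when the phase changes, when the throttle interval has elapsed, or once when a phase completes.

```java
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.LongSupplier;

/** Hypothetical callback shape; the real GraphBuildCallback may differ. */
interface ProgressCallback {
  void onProgress(String phase, long processed, long total);
}

/** Emits at most one line per interval, plus one per phase change and per phase completion. */
final class ThrottledProgressLogger implements ProgressCallback {
  private final long intervalNanos;
  private final LongSupplier nanoClock;   // e.g. System::nanoTime; injectable for tests
  private final Consumer<String> sink;    // e.g. a lambda forwarding to an INFO logger
  private String lastPhase;
  private long lastEmitNanos;
  private boolean completionLogged;

  ThrottledProgressLogger(long intervalMillis, LongSupplier nanoClock, Consumer<String> sink) {
    this.intervalNanos = intervalMillis * 1_000_000L;
    this.nanoClock = nanoClock;
    this.sink = sink;
  }

  @Override
  public synchronized void onProgress(String phase, long processed, long total) {
    long now = nanoClock.getAsLong();
    boolean phaseChanged = !phase.equals(lastPhase);
    if (phaseChanged) {
      completionLogged = false;
    }
    // Log completion of a phase once, even if the interval has not elapsed.
    boolean finished = total > 0 && processed >= total && !completionLogged;
    boolean due = now - lastEmitNanos >= intervalNanos;
    if (!phaseChanged && !finished && !due) {
      return; // throttled: no string formatting, no allocation
    }
    lastPhase = phase;
    lastEmitNanos = now;
    if (total > 0 && processed >= total) {
      completionLogged = true;
    }
    double pct = total > 0 ? 100.0 * processed / total : 0.0;
    sink.accept(String.format(Locale.ROOT,
        "Graph build [%s]: %d/%d (%.1f%%)", phase, processed, total, pct));
  }
}
```

In a real build, the instance would be created with something like `new ThrottledProgressLogger(5000, System::nanoTime, msg -> logger.info(msg))` and passed where call sites currently pass null. The interval could come from the config switch suggested in item 4. Checking the throttle before formatting keeps the skipped calls cheap, which matters when the callback fires once per inserted node.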