Skip to content

Honor storeVectorsInGraph when writing vecgraph#3145

Merged
lvca merged 1 commit intoArcadeData:mainfrom
humemai:fix/vecgraph-store-flag
Jan 15, 2026
Merged

Honor storeVectorsInGraph when writing vecgraph#3145
lvca merged 1 commit intoArcadeData:mainfrom
humemai:fix/vecgraph-store-flag

Conversation

@tae898
Copy link
Contributor

@tae898 tae898 commented Jan 15, 2026

Title: Honor storeVectorsInGraph=false to avoid inline vector duplication

Summary

Why it matters

  • With inline vectors on, vecgraph balloons from ~54 MB to ~3.9 GB and total DB from ~6.6 GB to ~10.5 GB, with no recall gain.
  • Post-reopen search is much slower when inline vectors are on (24.1 s vs 2.68 s off).
  • RSS and ingest/build are slightly worse with inline vectors; lsmvecidx (~999 MB) and doc bucket (~5.6 GB) stay unchanged.
  • Net: store_vectors_in_graph=on provides no benefit here and hurts performance; keep it off by default on local SSD.

Benchmark (MSMARCO-1M, INT8, add_hierarchy=false, max_connections=12, beam_width=64)

store_vectors_in_graph ingest_s ingest_rss_mb warmup_s search_s search_rss_mb recall@50_before_close warmup_after_reopen_s search_after_reopen_s search_after_reopen_rss_mb recall@50_after_reopen peak_rss_mb db_size_mb total_duration
False 70.191 8806.59 805.646 6.108 9.062 0.911 1.516 2.68 33.797 0.911 9431.61 6734.19 14m 52.298s
True 70.775 8816.24 854.573 5.857 13.117 0.9086 1.449 24.112 10.586 0.9085 9455.94 10633.9 16m 2.495s

Disk listings

store_vectors_in_graph=off (vecgraph ~54 MB; lsmvecidx ~999 MB; bucket ~5.6 GB)

.../VectorData_0_2647080377041612_vecgraph.5.262144.v0.vecgraph   54M
.../VectorData_0_2647080377041612.4.262144.v0.lsmvecidx           999M
.../VectorData_0.1.65536.v0.bucket                                5.6G

store_vectors_in_graph=on (vecgraph ~3.9 GB; lsmvecidx ~999 MB; bucket ~5.6 GB)

.../VectorData_0_2647080838931807_vecgraph.5.262144.v0.vecgraph   3.9G
.../VectorData_0_2647080838931807.4.262144.v0.lsmvecidx           999M
.../VectorData_0.1.65536.v0.bucket                                5.6G

Notes

  • Inline vectors did not help recall or memory; they increased disk and slowed post-reopen search.
  • Default should remain storeVectorsInGraph=false; enable only for slow/remote doc storage or single-file portability.

Testing

  • Benchmark run: MSMARCO-1M INT8 as above.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @tae898, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the vecgraph writer failed to honor the storeVectorsInGraph=false configuration, leading to unintended vector duplication within the graph file. The fix introduces conditional logic to prevent vectors from being inlined when disabled, significantly optimizing disk usage and improving search performance. By emitting a zero-dimension inline feature in topology-only mode, the change ensures the writer's requirements are met without incurring unnecessary storage overhead.

Highlights

  • Correctly Honor storeVectorsInGraph=false: The vecgraph writer now correctly skips inlining vectors when storeVectorsInGraph is false, preventing unnecessary vector duplication and reducing file size.
  • Significant Performance and Disk Usage Improvement: This change drastically reduces vecgraph file size (e.g., from ~3.9 GB to ~54 MB) and improves post-reopen search times (e.g., from 24.1s to 2.68s) by avoiding inline vector storage when not desired.
  • Satisfying Writer Requirements with Zero-Dimension Vectors: When storeVectorsInGraph is false, a zero-dimension inline feature is emitted. This ensures the graph writer functions correctly without storing any actual vector payload, maintaining topology-only mode.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify
Copy link
Contributor

mergify bot commented Jan 15, 2026

🧪 CI Insights

Here's what we observed from your CI run for fd91a0d.

🟢 All jobs passed!

But CI Insights is watching 👀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly implements the logic to honor storeVectorsInGraph=false by writing a topology-only graph with zero-dimension vectors, which resolves a significant performance and storage issue. The change is well-implemented. I've suggested a minor refactoring to reduce some code duplication, which could improve future maintainability.

Comment on lines +184 to 208
if (storeVectors) {
final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension);
try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
.with(new InlineVectors(dimension)).build()) {
// Write header with startOffset 0 (graph data starts at beginning of file)
indexWriter.writeHeader(graph.getView(), 0L);

// Write actual vectors inline
indexWriter.write(Map.of(FeatureId.INLINE_VECTORS,
(IntFunction<Feature.State>) ordinal -> {
final VectorFloat<?> vector = vectors.getVector(ordinal);
return new InlineVectors.State(vector != null ? vector : emptyVector);
} else {
// Original behavior: empty vectors (fetched from documents on-demand)
return new InlineVectors.State(emptyVector);
}
}));
}));
}
} else {
// Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload)
final InlineVectors zeroInline = new InlineVectors(0);
final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0);
try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
.with(zeroInline).build()) {
indexWriter.writeHeader(graph.getView(), 0L);
indexWriter.write(Map.of(FeatureId.INLINE_VECTORS,
(IntFunction<Feature.State>) ordinal -> new InlineVectors.State(emptyVector)));
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The if/else block introduces some code duplication, particularly with the try-with-resources statement for OnDiskSequentialGraphIndexWriter and the calls to writeHeader and write. This could be refactored to reduce duplication and improve maintainability.

You can define the InlineVectors and the IntFunction<Feature.State> provider based on the storeVectors flag, and then have a single try-with-resources block that uses these variables. This would make the code more DRY (Don't Repeat Yourself).

      final InlineVectors inlineVectors;
      final IntFunction<Feature.State> stateProvider;

      if (storeVectors) {
        inlineVectors = new InlineVectors(dimension);
        final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension);
        stateProvider = ordinal -> {
          final VectorFloat<?> vector = vectors.getVector(ordinal);
          return new InlineVectors.State(vector != null ? vector : emptyVector);
        };
      } else {
        // Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload)
        inlineVectors = new InlineVectors(0);
        final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0);
        stateProvider = ordinal -> new InlineVectors.State(emptyVector);
      }

      try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
          .with(inlineVectors).build()) {
        indexWriter.writeHeader(graph.getView(), 0L);
        indexWriter.write(Map.of(FeatureId.INLINE_VECTORS, stateProvider));
      }

@lvca lvca self-requested a review January 15, 2026 14:39
@lvca lvca added the bug label Jan 15, 2026
@lvca lvca added this to the 26.1.1 milestone Jan 15, 2026
@lvca
Copy link
Contributor

lvca commented Jan 15, 2026

Thanks for this PR, seems to work! 🎉 I'm going to change it a little bit the code later and add a test case to avoid regressions.

@lvca lvca merged commit da5e70d into ArcadeData:main Jan 15, 2026
11 of 13 checks passed
@tae898 tae898 deleted the fix/vecgraph-store-flag branch January 16, 2026 14:57
robfrank pushed a commit that referenced this pull request Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

storeVectorsInGraph=false ignored: vecgraph still inlines vectors

2 participants