Honor storeVectorsInGraph when writing vecgraph#3145
Conversation
Summary of ChangesHello @tae898, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical issue where the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🧪 CI InsightsHere's what we observed from your CI run for fd91a0d. 🟢 All jobs passed!But CI Insights is watching 👀 |
There was a problem hiding this comment.
Code Review
This pull request correctly implements the logic to honor storeVectorsInGraph=false by writing a topology-only graph with zero-dimension vectors, which resolves a significant performance and storage issue. The change is well-implemented. I've suggested a minor refactoring to reduce some code duplication, which could improve future maintainability.
| if (storeVectors) { | ||
| final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension); | ||
| try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer) | ||
| .with(new InlineVectors(dimension)).build()) { | ||
| // Write header with startOffset 0 (graph data starts at beginning of file) | ||
| indexWriter.writeHeader(graph.getView(), 0L); | ||
|
|
||
| // Write actual vectors inline | ||
| indexWriter.write(Map.of(FeatureId.INLINE_VECTORS, | ||
| (IntFunction<Feature.State>) ordinal -> { | ||
| final VectorFloat<?> vector = vectors.getVector(ordinal); | ||
| return new InlineVectors.State(vector != null ? vector : emptyVector); | ||
| } else { | ||
| // Original behavior: empty vectors (fetched from documents on-demand) | ||
| return new InlineVectors.State(emptyVector); | ||
| } | ||
| })); | ||
| })); | ||
| } | ||
| } else { | ||
| // Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload) | ||
| final InlineVectors zeroInline = new InlineVectors(0); | ||
| final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0); | ||
| try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer) | ||
| .with(zeroInline).build()) { | ||
| indexWriter.writeHeader(graph.getView(), 0L); | ||
| indexWriter.write(Map.of(FeatureId.INLINE_VECTORS, | ||
| (IntFunction<Feature.State>) ordinal -> new InlineVectors.State(emptyVector))); | ||
| } | ||
| } |
There was a problem hiding this comment.
The if/else block introduces some code duplication, particularly with the try-with-resources statement for OnDiskSequentialGraphIndexWriter and the calls to writeHeader and write. This could be refactored to reduce duplication and improve maintainability.
You can define the InlineVectors and the IntFunction<Feature.State> provider based on the storeVectors flag, and then have a single try-with-resources block that uses these variables. This would make the code more DRY (Don't Repeat Yourself).
final InlineVectors inlineVectors;
final IntFunction<Feature.State> stateProvider;
if (storeVectors) {
inlineVectors = new InlineVectors(dimension);
final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension);
stateProvider = ordinal -> {
final VectorFloat<?> vector = vectors.getVector(ordinal);
return new InlineVectors.State(vector != null ? vector : emptyVector);
};
} else {
// Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload)
inlineVectors = new InlineVectors(0);
final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0);
stateProvider = ordinal -> new InlineVectors.State(emptyVector);
}
try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
.with(inlineVectors).build()) {
indexWriter.writeHeader(graph.getView(), 0L);
indexWriter.write(Map.of(FeatureId.INLINE_VECTORS, stateProvider));
}|
Thanks for this PR, seems to work! 🎉 I'm going to change it a little bit the code later and add a test case to avoid regressions. |
(cherry picked from commit da5e70d)
Title: Honor
storeVectorsInGraph=falseto avoid inline vector duplicationSummary
storeVectorsInGraph=falseignored: vecgraph still inlines vectors #3142storeVectorsInGraph=false, keeping the graph topology-only.Why it matters
store_vectors_in_graph=onprovides no benefit here and hurts performance; keep it off by default on local SSD.Benchmark (MSMARCO-1M, INT8, add_hierarchy=false, max_connections=12, beam_width=64)
Disk listings
store_vectors_in_graph=off(vecgraph ~54 MB; lsmvecidx ~999 MB; bucket ~5.6 GB)store_vectors_in_graph=on(vecgraph ~3.9 GB; lsmvecidx ~999 MB; bucket ~5.6 GB)Notes
storeVectorsInGraph=false; enable only for slow/remote doc storage or single-file portability.Testing