Honor storeVectorsInGraph when writing vecgraph by tae898 · Pull Request #3145 · ArcadeData/arcadedb

tae898 · 2026-01-15T11:49:17Z

Title: Honor storeVectorsInGraph=false to avoid inline vector duplication

Summary

Fixes storeVectorsInGraph=false ignored: vecgraph still inlines vectors #3142
Fixes vecgraph writer to skip inlining vectors when storeVectorsInGraph=false, keeping the graph topology-only.
Keeps writer happy in topology-only mode by emitting a zero-dimension inline feature (no payload).

Why it matters

With inline vectors on, vecgraph balloons from ~54 MB to ~3.9 GB and total DB from ~6.6 GB to ~10.5 GB, with no recall gain.
Post-reopen search is much slower when inline vectors are on (24.1 s vs 2.68 s off).
RSS and ingest/build are slightly worse with inline vectors; lsmvecidx (~999 MB) and doc bucket (~5.6 GB) stay unchanged.
Net: store_vectors_in_graph=on provides no benefit here and hurts performance; keep it off by default on local SSD.

Benchmark (MSMARCO-1M, INT8, add_hierarchy=false, max_connections=12, beam_width=64)

store_vectors_in_graph	ingest_s	ingest_rss_mb	warmup_s	search_s	search_rss_mb	recall@50_before_close	warmup_after_reopen_s	search_after_reopen_s	search_after_reopen_rss_mb	recall@50_after_reopen	peak_rss_mb	db_size_mb	total_duration
False	70.191	8806.59	805.646	6.108	9.062	0.911	1.516	2.68	33.797	0.911	9431.61	6734.19	14m 52.298s
True	70.775	8816.24	854.573	5.857	13.117	0.9086	1.449	24.112	10.586	0.9085	9455.94	10633.9	16m 2.495s

Disk listings

store_vectors_in_graph=off (vecgraph ~54 MB; lsmvecidx ~999 MB; bucket ~5.6 GB)

.../VectorData_0_2647080377041612_vecgraph.5.262144.v0.vecgraph   54M
.../VectorData_0_2647080377041612.4.262144.v0.lsmvecidx           999M
.../VectorData_0.1.65536.v0.bucket                                5.6G

store_vectors_in_graph=on (vecgraph ~3.9 GB; lsmvecidx ~999 MB; bucket ~5.6 GB)

.../VectorData_0_2647080838931807_vecgraph.5.262144.v0.vecgraph   3.9G
.../VectorData_0_2647080838931807.4.262144.v0.lsmvecidx           999M
.../VectorData_0.1.65536.v0.bucket                                5.6G

Notes

Inline vectors did not help recall or memory; they increased disk and slowed post-reopen search.
Default should remain storeVectorsInGraph=false; enable only for slow/remote doc storage or single-file portability.

Testing

Benchmark run: MSMARCO-1M INT8 as above.

gemini-code-assist · 2026-01-15T11:49:34Z

Summary of Changes

Hello @tae898, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where the vecgraph writer failed to honor the storeVectorsInGraph=false configuration, leading to unintended vector duplication within the graph file. The fix introduces conditional logic to prevent vectors from being inlined when disabled, significantly optimizing disk usage and improving search performance. By emitting a zero-dimension inline feature in topology-only mode, the change ensures the writer's requirements are met without incurring unnecessary storage overhead.

Highlights

Correctly Honor storeVectorsInGraph=false: The vecgraph writer now correctly skips inlining vectors when storeVectorsInGraph is false, preventing unnecessary vector duplication and reducing file size.
Significant Performance and Disk Usage Improvement: This change drastically reduces vecgraph file size (e.g., from ~3.9 GB to ~54 MB) and improves post-reopen search times (e.g., from 24.1s to 2.68s) by avoiding inline vector storage when not desired.
Satisfying Writer Requirements with Zero-Dimension Vectors: When storeVectorsInGraph is false, a zero-dimension inline feature is emitted. This ensures the graph writer functions correctly without storing any actual vector payload, maintaining topology-only mode.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

mergify · 2026-01-15T11:50:36Z

🧪 CI Insights

Here's what we observed from your CI run for fd91a0d.

🟢 All jobs passed!

But CI Insights is watching 👀

gemini-code-assist

Code Review

This pull request correctly implements the logic to honor storeVectorsInGraph=false by writing a topology-only graph with zero-dimension vectors, which resolves a significant performance and storage issue. The change is well-implemented. I've suggested a minor refactoring to reduce some code duplication, which could improve future maintainability.

gemini-code-assist · 2026-01-15T11:51:01Z

engine/src/main/java/com/arcadedb/index/vector/LSMVectorIndexGraphFile.java

+      if (storeVectors) {
+        final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension);
+        try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
+            .with(new InlineVectors(dimension)).build()) {
+          // Write header with startOffset 0 (graph data starts at beginning of file)
+          indexWriter.writeHeader(graph.getView(), 0L);
+
+          // Write actual vectors inline
+          indexWriter.write(Map.of(FeatureId.INLINE_VECTORS,
+              (IntFunction<Feature.State>) ordinal -> {
                final VectorFloat<?> vector = vectors.getVector(ordinal);
                return new InlineVectors.State(vector != null ? vector : emptyVector);
-              } else {
-                // Original behavior: empty vectors (fetched from documents on-demand)
-                return new InlineVectors.State(emptyVector);
-              }
-            }));
+              }));
+        }
+      } else {
+        // Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload)
+        final InlineVectors zeroInline = new InlineVectors(0);
+        final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0);
+        try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer)
+            .with(zeroInline).build()) {
+          indexWriter.writeHeader(graph.getView(), 0L);
+          indexWriter.write(Map.of(FeatureId.INLINE_VECTORS,
+              (IntFunction<Feature.State>) ordinal -> new InlineVectors.State(emptyVector)));
+        }
      }


The if/else block introduces some code duplication, particularly with the try-with-resources statement for OnDiskSequentialGraphIndexWriter and the calls to writeHeader and write. This could be refactored to reduce duplication and improve maintainability.

You can define the InlineVectors and the IntFunction<Feature.State> provider based on the storeVectors flag, and then have a single try-with-resources block that uses these variables. This would make the code more DRY (Don't Repeat Yourself).

final InlineVectors inlineVectors; final IntFunction<Feature.State> stateProvider; if (storeVectors) { inlineVectors = new InlineVectors(dimension); final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(dimension); stateProvider = ordinal -> { final VectorFloat<?> vector = vectors.getVector(ordinal); return new InlineVectors.State(vector != null ? vector : emptyVector); }; } else { // Topology-only: satisfy writer requirement with zero-dimension inline vectors (no payload) inlineVectors = new InlineVectors(0); final VectorFloat<?> emptyVector = JVectorUtils.createVectorFloat(0); stateProvider = ordinal -> new InlineVectors.State(emptyVector); } try (final OnDiskSequentialGraphIndexWriter indexWriter = new OnDiskSequentialGraphIndexWriter.Builder(graph, writer) .with(inlineVectors).build()) { indexWriter.writeHeader(graph.getView(), 0L); indexWriter.write(Map.of(FeatureId.INLINE_VECTORS, stateProvider)); }

lvca · 2026-01-15T14:44:39Z

Thanks for this PR, seems to work! 🎉 I'm going to change it a little bit the code later and add a test case to avoid regressions.

(cherry picked from commit da5e70d)

Honor storeVectorsInGraph when writing vecgraph

fd91a0d

gemini-code-assist bot reviewed Jan 15, 2026

View reviewed changes

lvca self-requested a review January 15, 2026 14:39

lvca added the bug label Jan 15, 2026

lvca added this to the 26.1.1 milestone Jan 15, 2026

lvca merged commit da5e70d into ArcadeData:main Jan 15, 2026
11 of 13 checks passed

lvca assigned tae898 Jan 15, 2026

tae898 deleted the fix/vecgraph-store-flag branch January 16, 2026 14:57

robfrank pushed a commit that referenced this pull request Feb 11, 2026

Honor storeVectorsInGraph when writing vecgraph (#3145)

2a5a02d

(cherry picked from commit da5e70d)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Honor storeVectorsInGraph when writing vecgraph#3145

Honor storeVectorsInGraph when writing vecgraph#3145
lvca merged 1 commit intoArcadeData:mainfrom
humemai:fix/vecgraph-store-flag

tae898 commented Jan 15, 2026

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Uh oh!

mergify bot commented Jan 15, 2026 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

lvca commented Jan 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

tae898 commented Jan 15, 2026

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

mergify bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🧪 CI Insights

🟢 All jobs passed!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

lvca commented Jan 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mergify bot commented Jan 15, 2026 •

edited

Loading