Skip to content

LSM Vector updates and fixes #2831

Merged
robfrank merged 10 commits into main from feat/2529-update-tests-to-lsm-vector on Nov 28, 2025

Conversation

@robfrank
Collaborator

This pull request introduces several improvements and fixes to the handling of vector indexes in the codebase, particularly for LSMVectorIndex and its compaction process. It also enhances array handling in serialization and comparison utilities, and improves test coverage for vector index import and query operations. The most important changes are grouped below.

Vector Index Compaction and Loading Improvements

  • Added support for loading compacted vector index files by distinguishing them from regular index files in LSMVectorIndex.PaginatedComponentFactoryHandlerUnique, ensuring correct instantiation of LSMVectorIndexCompacted when needed.
  • Improved page reading in LSMVectorIndexCompactor by switching from direct ByteBuffer access to page methods (readInt, readByte, etc.), which correctly handle page header offsets and reduce corruption risk. Also added validation for pointer bounds and entry offsets. [1] [2] [3] [4]
  • Updated compaction logic to avoid unnecessary page version tracking when writing new pages during compaction, reflecting that these pages are freshly created and not subject to WAL versioning. [1] [2]
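
To illustrate why page-level accessors are safer than raw ByteBuffer access, here is a minimal sketch. It is not the project's page API: `SketchPage`, `HEADER_SIZE`, and the 8-byte header length are assumptions made up for this example. The point is only that content-relative accessors apply the header offset and bounds checks in one place, while a raw absolute read at the same offset lands inside the header.

```java
import java.nio.ByteBuffer;

/** Hypothetical stand-in for a database page: a fixed header followed by content. */
final class SketchPage {
  // Assumed header length, chosen only for illustration.
  static final int HEADER_SIZE = 8;

  private final ByteBuffer buffer;

  SketchPage(final int pageSize) {
    this.buffer = ByteBuffer.allocate(pageSize);
  }

  int contentSize() {
    return buffer.capacity() - HEADER_SIZE;
  }

  // Content-relative write: the header offset is added here, not by the caller.
  void writeInt(final int contentOffset, final int value) {
    checkBounds(contentOffset, Integer.BYTES);
    buffer.putInt(HEADER_SIZE + contentOffset, value);
  }

  // Content-relative read, validated before touching the buffer.
  int readInt(final int contentOffset) {
    checkBounds(contentOffset, Integer.BYTES);
    return buffer.getInt(HEADER_SIZE + contentOffset);
  }

  byte readByte(final int contentOffset) {
    checkBounds(contentOffset, Byte.BYTES);
    return buffer.get(HEADER_SIZE + contentOffset);
  }

  // Rejects negative offsets and reads that would run past the content area.
  private void checkBounds(final int contentOffset, final int length) {
    if (contentOffset < 0 || contentOffset > contentSize() - length)
      throw new IllegalArgumentException(
          "Offset " + contentOffset + " (+" + length + " bytes) is outside the page content");
  }

  /** Raw view, exposed only to show the mistake that direct access invites. */
  ByteBuffer rawBuffer() {
    return buffer;
  }
}
```

With this layout, `page.readInt(0)` returns the first content value, whereas `page.rawBuffer().getInt(0)` silently reads header bytes instead.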

Array Handling and Serialization Enhancements

  • Added a robust arrayToList utility method to both BinaryComparator and JsonSerializer, allowing conversion of primitive and object arrays to lists for serialization and comparison, and replaced previous usages of Arrays.asList and List.of with this method. [1] [2] [3]
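
As a rough sketch of what such a conversion can look like (not the code from this pull request; the class name `ArraySketch` and the exact signature are assumptions), reflection via `java.lang.reflect.Array` handles primitive and object arrays uniformly. The motivation is that `Arrays.asList(new float[] {...})` and `List.of(new float[] {...})` both produce a one-element list holding the array itself, and `List.of` additionally rejects null elements.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.List;

final class ArraySketch {
  /** Converts any array, primitive or object, to a List; primitive elements are boxed. */
  static List<Object> arrayToList(final Object array) {
    if (array == null || !array.getClass().isArray())
      throw new IllegalArgumentException("Expected an array but got: " + array);

    final int length = Array.getLength(array);
    final List<Object> result = new ArrayList<>(length);
    for (int i = 0; i < length; i++)
      result.add(Array.get(array, i)); // Array.get boxes primitive values
    return result;
  }
}
```

For example, `ArraySketch.arrayToList(new float[] {1.0f, 2.5f})` yields a two-element list of `Float` values, while `Arrays.asList` on the same argument yields a list of size one.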

Schema and Index Registration

  • Registered the compacted vector index file extension with the schema's component factory, ensuring that compacted indexes are recognized and loaded properly by the database schema. [1] [2]
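
The registration idea can be sketched as a lookup from file extension to a handler that builds the right component. Everything here is hypothetical: `ComponentRegistrySketch`, the `Component` interface, and the extensions `vecidx` and `vecidxc` are placeholders for illustration, not the names used by the schema code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ComponentRegistrySketch {
  /** Placeholder for whatever the schema instantiates from a file. */
  interface Component {
    String describe();
  }

  private final Map<String, Function<String, Component>> handlers = new HashMap<>();

  // Associates a file extension with the handler that knows how to load it.
  void register(final String extension, final Function<String, Component> handler) {
    handlers.put(extension, handler);
  }

  // Picks the handler by extension; an unregistered extension fails loudly.
  Component load(final String fileName) {
    final int dot = fileName.lastIndexOf('.');
    final String extension = dot < 0 ? "" : fileName.substring(dot + 1);
    final Function<String, Component> handler = handlers.get(extension);
    if (handler == null)
      throw new IllegalArgumentException("No component registered for extension '" + extension + "'");
    return handler.apply(fileName);
  }
}
```

If only the regular extension were registered, loading a compacted file would fail at the lookup, which is the kind of gap the registration above closes.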

Integration Test Improvements

  • Enhanced the GloVe vector importer integration test by adding a new test to verify vector neighbor queries and index creation, and refactored the import command to use multi-line string syntax for clarity. [1] [2] [3]

Miscellaneous

  • Added a debug print statement to the vector index creation logic in the text embeddings importer to aid troubleshooting during index creation.

robfrank requested a review from lvca on November 26, 2025 14:37
@mergify
Contributor

mergify bot commented Nov 26, 2025

🧪 CI Insights

Here's what we observed from your CI run for edf3896.

🟢 All jobs passed!

But CI Insights is watching 👀

@gemini-code-assist
Contributor

Summary of Changes

Hello @robfrank, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on refining the handling of vector indexes, particularly for compacted versions, to improve their loading, compaction, and overall robustness. It also introduces a more reliable method for converting arrays to lists for serialization and comparison across different components, and enhances integration tests for vector import and querying.

Highlights

  • Support for Compacted Vector Indexes: The system can now correctly load and distinguish compacted vector index files (LSMVectorIndexCompacted) from regular index files, ensuring proper instantiation during database operations.
  • Improved Vector Index Compaction Logic: Page reading during compaction in LSMVectorIndexCompactor has been enhanced by using page-specific methods instead of direct ByteBuffer access, which correctly handles page headers and reduces corruption risk. Validation for pointer bounds and entry offsets was also added.
  • Optimized Page Versioning During Compaction: Compaction logic was updated to avoid unnecessary page version tracking for newly created pages, as these pages are fresh and not subject to Write-Ahead Log (WAL) versioning.
  • Robust Array-to-List Conversion Utility: A new arrayToList utility method was introduced in BinaryComparator and JsonSerializer to safely convert both primitive and object arrays into lists for consistent serialization and comparison, replacing previous Arrays.asList and List.of usages.
  • Schema Registration for Compacted Indexes: The file extension for compacted vector indexes has been registered with the database schema's component factory, ensuring they are properly recognized and loaded.
  • Enhanced Vector Importer Integration Tests: The GloVe vector importer integration test now includes a new test case to verify vector neighbor queries and index creation, and the import command syntax was refactored for better readability using multi-line strings.
  • Debugging Aid for Vector Index Creation: A debug print statement was added to the text embeddings importer to assist in troubleshooting issues during vector index creation.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several valuable improvements to vector index handling, particularly for LSMVectorIndex compaction and serialization. The refactoring in LSMVectorIndexCompactor to use BasePage methods and add validation is a significant enhancement for robustness and maintainability. The new arrayToList utility is a good addition for handling primitive arrays, though it is duplicated and should be refactored into a shared utility class. While the test coverage is improved, there are some issues with test dependencies, hardcoded paths, and a critical bug in one of the test files that need to be addressed.

Comment on lines +85 to +87
// db.drop();
// TestHelper.checkActiveDatabases();
// FileUtils.deleteRecursively(new File(databasePath));

high

The cleanup logic in the finally block is commented out. This is likely to allow the new query() test to use the database created by importDocuments(), creating a dependency between tests. This is not a good practice as tests should be independent. A better approach would be to use JUnit 5's @BeforeAll or @BeforeEach to set up the database for each test or for the whole class. Also, leaving cleanup code commented out can lead to resource leaks and leftover files after test execution. Please refactor the tests to be independent and ensure proper cleanup.
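
One way to picture the independence the reviewer asks for, sketched in plain Java so it carries no test-framework dependency: each test body gets its own fresh directory and cleanup always runs, even when the body throws. The helper names (`IsolatedDbSketch`, `withFreshDatabaseDir`) and the `vector-it-` prefix are invented for this example; in a JUnit 5 test class the same setup and teardown would typically live in `@BeforeEach` and `@AfterEach` methods.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class IsolatedDbSketch {
  /** Runs one test body against a fresh directory and always removes it afterwards. */
  static void withFreshDatabaseDir(final Consumer<Path> testBody) {
    final Path dir;
    try {
      dir = Files.createTempDirectory("vector-it-");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    try {
      testBody.accept(dir);
    } finally {
      deleteRecursively(dir); // cleanup is never skipped or commented out
    }
  }

  // Deletes children before parents by visiting paths in reverse order.
  static void deleteRecursively(final Path root) {
    try (Stream<Path> paths = Files.walk(root)) {
      paths.sorted(Comparator.reverseOrder()).forEach(path -> {
        try {
          Files.delete(path);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because no test relies on files left behind by another, the import test and the query test could run in any order, at the cost of repeating the import in each one.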

robfrank changed the title from "wip" to "LSM Vector updates and fixes" on Nov 26, 2025
@codacy-production

codacy-production bot commented Nov 26, 2025

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation | Diff coverage
-0.06% | 17.60%

Coverage variation details
Commit | Coverable lines | Covered lines | Coverage
Common ancestor commit (5d2cb19) | 75387 | 47275 | 62.71%
Head commit (edf3896) | 75491 (+104) | 47294 (+19) | 62.65% (-0.06%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Scope | Coverable lines | Covered lines | Diff coverage
Pull request (#2831) | 125 | 22 | 17.60%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%
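
The two definitions can be checked against the figures reported in this summary with a few lines of arithmetic. This is only a worked example; `CoverageSketch` and its method names are made up here and are not part of Codacy.

```java
import java.util.Locale;

final class CoverageSketch {
  // Coverage as a percentage: covered lines divided by coverable lines, times 100.
  static double percent(final long covered, final long coverable) {
    return 100.0 * covered / coverable;
  }

  // Formats with two decimals and a fixed locale so the decimal separator is always '.'.
  static String format(final double value) {
    return String.format(Locale.ROOT, "%.2f%%", value);
  }

  public static void main(final String[] args) {
    final double ancestor = percent(47275, 75387); // common ancestor commit
    final double head = percent(47294, 75491);     // head commit
    System.out.println("ancestor  = " + format(ancestor));
    System.out.println("head      = " + format(head));
    System.out.println("variation = " + format(head - ancestor));
    System.out.println("diff      = " + format(percent(22, 125)));
  }
}
```

The computed values round to 62.71%, 62.65%, -0.06%, and 17.60%, matching the reported numbers.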


robfrank force-pushed the feat/2529-update-tests-to-lsm-vector branch 2 times, most recently from 4ff1931 to c0b34ff, on November 27, 2025 09:22
robfrank force-pushed the feat/2529-update-tests-to-lsm-vector branch from 5f3e261 to edf3896 on November 28, 2025 19:23
robfrank merged commit 18ec4af into main on Nov 28, 2025
9 of 10 checks passed
robfrank deleted the feat/2529-update-tests-to-lsm-vector branch on November 28, 2025 19:26
robfrank added a commit that referenced this pull request Feb 11, 2026
(cherry picked from commit 18ec4af)