Skip to content

Cherry pick2#9

Open
CalvinNeo wants to merge 11 commits intofeature/vector-indexfrom
cherry-pick2
Open

Cherry pick2#9
CalvinNeo wants to merge 11 commits intofeature/vector-indexfrom
cherry-pick2

Conversation

@CalvinNeo
Copy link
Owner

@CalvinNeo CalvinNeo commented Aug 13, 2024

What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • New Features

    • Non-intrusive key-presence check added to LRU cache.
    • In-memory vector-index cache with background cleanup and entity-count sizing.
    • Foreground S3 download + wait-for-ready support for file cache.
    • New per-thread performance contexts and metrics for vector search and file-cache operations.
  • Bug Fixes

    • Improved error handling and logging around file caching and downloads.
  • Refactor

    • Vector index API reworked into builder/viewer model and updated metadata representation.
  • Tests

    • Expanded tests for vector indexing, S3 caching, eviction and concurrent download scenarios.

✏️ Tip: You can customize this high-level summary in your review settings.

breezewish and others added 11 commits August 12, 2024 15:54
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
@coderabbitai
Copy link

coderabbitai bot commented Aug 13, 2024

Walkthrough

Adds a builder/viewer vector-index architecture, integrates vector-index-based DMFile read paths, introduces an in-memory VectorIndexCache with background cleaning, adds S3 foreground-download and wait semantics to FileCache, updates TiDB vector-index types, and several supporting API/signature adjustments across delta-merge and server components.

Changes

Cohort / File(s) Change Summary
Core vector index API
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h, .../VectorIndex.cpp, .../VectorIndex_fwd.h
Replace monolithic VectorIndex with Builder/Viewer abstraction: add VectorIndexBuilder and VectorIndexViewer public types, change forward declarations and factory/view APIs.
HNSW implementation
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h, .../Index.cpp
Rework HNSW into VectorIndexHNSWBuilder and VectorIndexHNSWViewer; runtime validation, search/get APIs moved to viewer, save to builder.
In-memory cache & perf
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h, .../VectorIndexCache.cpp, .../VectorSearchPerf.h, .../VectorSearchPerf.cpp
New VectorIndexCache class (LRU-based) with cleaner thread, files_to_check tracking; add per-thread vector-search perf context.
DeltaMerge file read path (new)
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h, .../DMFileWithVectorIndexBlockInputStream.cpp, .../VectorColumnFromIndexReader.h, .../VectorColumnFromIndexReader.cpp
Add DMFileWithVectorIndexBlockInputStream class and update vector-from-index reader to accept viewer keys; implement load/search/read flows using vector index results and pack filtering.
DMFile writer/meta updates
dbms/src/Storages/DeltaMerge/File/DMFileWriter.h, .../DMFileWriter.cpp, .../ColumnStat.h, .../dtpb/dmfile.proto
Switch vector-index types to VectorIndexDefinition/file props, write vector index to separate index files (not merged file), update proto (VectorIndexFileProps) and column stat type.
DMContext / DeltaMerge ABI changes
dbms/src/Storages/DeltaMerge/DMContext.h, dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h
DMContext::createUnique gains optional ScanContextPtr; ColumnDefine now holds VectorIndexDefinitionPtr.
Vector index serialization/viewing
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp, .../VectorIndex.h (detailed)
Add builder.create, viewer.view entry points; move isSupportedType to builder and adapt APIs to new proto/types.
Vector index test & utilities
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp
Update tests to use VectorIndexDefinitionPtr, protobuf distance enums, add S3-backed index tests and helpers (large test changes).
Vector index cache usage in Context/Server
dbms/src/Interpreters/Context.h, .../Context.cpp, dbms/src/Server/Server.cpp
Change setVectorIndexCache(size_t) semantics from bytes to entity count; adjust reset semantics to smart-pointer resets.
S3 FileCache foreground/wait
dbms/src/Storages/S3/FileCache.h, .../FileCache.cpp, .../FileCachePerf.h, .../FileCachePerf.cpp
Add getOrWait, downloadFileForLocalRead, fgDownload, FileSegment::waitForNotEmpty with condition_variable; integrate failpoint and perf counters and thread-local perf context.
Fail points and LRU helper
dbms/src/Common/FailPoint.cpp, dbms/src/Common/LRUCache.h
Add FailPoints::file_cache_fg_download_fail; add non-mutating LRUCache::contains(const Key&).
ScanContext metric
dbms/src/Storages/DeltaMerge/ScanContext.h
Add std::atomic<uint64_t> total_vector_idx_load_from_s3 and update (de)serialization/merge paths.
Remote serializer visibility
dbms/src/Storages/DeltaMerge/Remote/Serializer.h
serializeSegment made public with expanded signature.
TiDB schema / VectorIndexDefinition
dbms/src/TiDB/Schema/VectorIndex.h, dbms/src/TiDB/Schema/TiDB.h, dbms/src/TiDB/Schema/TiDB.cpp
Replace VectorIndexInfo with immutable VectorIndexDefinition (uses tipb enums), add MAX_VECTOR_DIMENSION, update JSON (de)serialization to protobuf parse/name helpers.

Sequence Diagram(s)

sequenceDiagram
    participant Reader as DMFileWithVectorIndexBlockInputStream
    participant VecCache as VectorIndexCache
    participant FileCache as S3 FileCache
    participant Viewer as VectorIndexViewer
    participant DMFile as DMFileReader

    Reader->>VecCache: getOrSet(index_file_path, load viewer)
    alt viewer in cache
        VecCache-->>Reader: return cached Viewer
    else not in cache
        VecCache->>FileCache: request index file (fgDownload via getOrWait)
        FileCache->>FileCache: download from S3 (foreground) / write local file
        FileCache-->>VecCache: local file path ready
        VecCache->>Viewer: load/view from local path
        Viewer-->>VecCache: store Viewer
        VecCache-->>Reader: return Viewer
    end
    Reader->>Viewer: search(queryInfo, valid_rows)
    Viewer-->>Reader: sorted rowid results
    Reader->>DMFile: read packs using pack filter (pack ids from results)
    DMFile-->>Reader: vector + other columns per pack
    Reader->>Reader: assemble output Block and optional Filter
    Reader-->>Client: return Block / filter
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is entirely empty—all required template sections are unfilled, including problem statement, commit message, test checklist, side effects documentation, and implementation details. Complete all required sections: describe the problem, provide commit message with implementation details, check at least one test type, document side effects, and explain impact on users or compatibility.
Docstring Coverage ⚠️ Warning Docstring coverage is 7.96% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Cherry pick2' is vague and non-descriptive, providing no meaningful information about the actual changes in this substantial pull request involving vector index refactoring. Replace with a descriptive title that summarizes the main change, such as 'Refactor vector index APIs to use builder/viewer pattern' or 'Implement VectorIndexBuilder and VectorIndexViewer abstractions'.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch cherry-pick2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@CalvinNeo
Copy link
Owner Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Aug 13, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range, codebase verification and nitpick comments (9)
dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)

44-44: Ensure Serialization and Deserialization of std::optional<dtpb::VectorIndexFileProps>

The toProto and mergeFromProto methods in ColumnStat.h do not currently handle the std::optional<dtpb::VectorIndexFileProps>. Ensure that these methods are updated to serialize and deserialize this field correctly to prevent potential issues.

  • File: dbms/src/Storages/DeltaMerge/File/ColumnStat.h
  • Methods: toProto, mergeFromProto
Analysis chain

Verify the impact on serialization and deserialization.

The change to std::optional<dtpb::VectorIndexFileProps> may affect how vector_index is serialized and deserialized. Ensure that all related methods handle the new type correctly.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the impact of the type change on serialization and deserialization.

# Test: Search for methods related to serialization and deserialization of `ColumnStat`.
rg --type cpp -A 3 $'toProto|mergeFromProto|serializeToBuffer|parseFromBuffer'

Length of output: 8784


Script:

#!/bin/bash
# Extract the implementation of `toProto` and `mergeFromProto` methods from `ColumnStat.h`
ast-grep --lang cpp --pattern $'dtpb::ColumnStat toProto() const { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h
ast-grep --lang cpp --pattern $'void mergeFromProto(const dtpb::ColumnStat & proto) { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h

Length of output: 262


Script:

#!/bin/bash
# Use ripgrep to search for the definitions of `toProto` and `mergeFromProto` methods in `ColumnStat.h`
rg --type cpp 'dtpb::ColumnStat toProto\(\) const' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10
rg --type cpp 'void mergeFromProto\(const dtpb::ColumnStat & proto\)' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10

Length of output: 1164

dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)

104-108: Update documentation for addStreams method.

The addStreams method signature has been updated to use TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that any related documentation or comments are updated to reflect this change.

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)

83-103: Consider Logging Additional Information in Destructor.

The destructor logs the completion of reading a DMFile with vector index. Consider logging additional context, such as the number of rows processed or any errors encountered, to enhance traceability.

-  "Finished read DMFile with vector index for column dmf_{}/{}(id={}), "
+  "Finished read DMFile with vector index for column dmf_{}/{}(id={}), processed_rows={}, "

106-123: Clarify Filtering Logic in read Method.

The read method contains logic for filtering based on a FilterPtr. Ensure that the comments and logic are clear, especially regarding the implications of return_filter being false.

-  // If return_filter == false, we must filter by ourselves.
+  // If return_filter == false, apply filtering within this method.

170-215: Consider Caching Results in readByIndexReader.

The method readByIndexReader iterates over packs and reads vector columns. Consider caching results if this method is called frequently with the same parameters to improve performance.


258-271: Add Logging for Load Status in load Method.

Consider adding logging to indicate when the load method is invoked and completed, which can be useful for debugging and performance monitoring.

+  LOG_DEBUG(log, "Loading vector index and search results");

477-485: Clarify the Purpose of getPackIdFromBlock.

The method getPackIdFromBlock retrieves the pack ID from a block. Ensure that the purpose and usage of this method are well-documented, especially regarding assumptions about block alignment.

-  // The start offset of a block is ensured to be aligned with the pack.
+  // Assumes the start offset of a block is aligned with the pack.
dbms/src/Storages/S3/FileCache.cpp (2)

68-107: Use named constants for timeout values.

The method uses magic numbers (30 and 300 seconds) for timeouts. Consider defining named constants for these values to improve readability and maintainability.

constexpr auto WAIT_TIMEOUT = std::chrono::seconds(30);
constexpr auto MAX_WAIT_TIMEOUT = std::chrono::seconds(300);

160-176: Enhance logging for file download attempts.

Consider adding more detailed logging to trace the execution flow, especially when a file is removed from the cache due to inconsistencies.

LOG_DEBUG(log, "Attempting to download file for local read: {}", s3_fname.toFullKey());
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 32d911b and ff13cae.

Files selected for processing (36)
  • contrib/tipb (1 hunks)
  • dbms/src/Common/FailPoint.cpp (1 hunks)
  • dbms/src/Common/LRUCache.h (1 hunks)
  • dbms/src/Interpreters/Context.cpp (3 hunks)
  • dbms/src/Interpreters/Context.h (1 hunks)
  • dbms/src/Server/Server.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/DMContext.h (2 hunks)
  • dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (5 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (5 hunks)
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (4 hunks)
  • dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (3 hunks)
  • dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1 hunks)
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1 hunks)
  • dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2 hunks)
  • dbms/src/Storages/DeltaMerge/ScanContext.h (6 hunks)
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (26 hunks)
  • dbms/src/Storages/S3/FileCache.cpp (8 hunks)
  • dbms/src/Storages/S3/FileCache.h (6 hunks)
  • dbms/src/Storages/S3/FileCachePerf.cpp (1 hunks)
  • dbms/src/Storages/S3/FileCachePerf.h (1 hunks)
  • dbms/src/TiDB/Schema/TiDB.cpp (3 hunks)
  • dbms/src/TiDB/Schema/TiDB.h (1 hunks)
  • dbms/src/TiDB/Schema/VectorIndex.h (1 hunks)
Files skipped from review due to trivial changes (1)
  • contrib/tipb
Additional comments not posted (76)
dbms/src/Storages/S3/FileCachePerf.cpp (1)

1-22: LGTM!

The file sets up a thread-local performance context for file caching. The setup is straightforward and aligns with typical practices for performance tracking.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1)

1-22: LGTM!

The file sets up a thread-local performance context for vector search operations. The setup is straightforward and aligns with typical practices for performance tracking.

dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1)

Line range hint 1-26:
Simplification acknowledged and approved.

The removal of the vector member and its constructor simplifies the RSIndex structure, focusing on the essential components. This change enhances clarity and reduces unnecessary complexity.

dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)

24-28: LGTM! Modular design enhances clarity and maintainability.

The introduction of VectorIndexBuilder and VectorIndexViewer classes, along with their smart pointer types, suggests a more modular approach. This can improve code clarity and maintainability by separating concerns between building and viewing vector indices.

dbms/src/Storages/S3/FileCachePerf.h (1)

1-37: LGTM! Effective use of thread-local storage for performance tracking.

The FileCachePerfContext structure provides a clear way to track performance metrics related to file caching. Using thread_local for file_cache ensures that performance data is managed per thread, which is suitable for concurrent environments.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)

1-37: LGTM! Efficient thread-local performance tracking for vector search.

The VectorSearchPerfContext structure is well-designed for tracking performance metrics in vector search operations. The use of thread_local for vector_search ensures that performance data is isolated per thread, which is advantageous in concurrent processing.

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)

27-39: Review VectorIndexHNSWBuilder class changes.

The class VectorIndexHNSWBuilder has been introduced, replacing USearchIndexWithSerialization. The constructor now takes a VectorIndexDefinitionPtr, which suggests a design shift towards using definitions for initialization. Ensure that the new design aligns with the overall architecture and that the addBlock and save methods are correctly implemented.

Verification successful

VectorIndexHNSWBuilder Class Implementation Verified

The VectorIndexHNSWBuilder class and its methods addBlock and save are implemented correctly, with appropriate type checks and error handling. The integration with the rest of the codebase is consistent with the new design. No issues found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexHNSWBuilder` class.

# Test: Search for the class definition and usage. Expect: Consistent usage with new constructor.
rg --type cpp -A 5 $'class VectorIndexHNSWBuilder'

Length of output: 652


Script:

#!/bin/bash
# Description: Find all usages of the `VectorIndexHNSWBuilder` class to verify its integration and usage.

# Test: Search for all instances where `VectorIndexHNSWBuilder` is used in the codebase.
rg --type cpp 'VectorIndexHNSWBuilder'

Length of output: 867


Script:

#!/bin/bash
# Description: Verify the implementation of the `addBlock` and `save` methods in `Index.cpp`.

# Test: Retrieve the implementation details of `addBlock` and `save` methods.
rg --type cpp -A 10 'void VectorIndexHNSWBuilder::addBlock' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp
rg --type cpp -A 10 'void VectorIndexHNSWBuilder::save' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp

Length of output: 1248


41-55: Review VectorIndexHNSWViewer class introduction.

The VectorIndexHNSWViewer class extends VectorIndexViewer and introduces a static view method. The search method has been updated to remove the SearchStatistics parameter, focusing on query information and valid rows. Ensure that this change does not impact the functionality negatively and aligns with the intended design.

Verification successful

Removal of SearchStatistics Parameter Verified

The removal of the SearchStatistics parameter from the search method in the VectorIndexHNSWViewer class does not appear to impact functionality negatively. There are no references to SearchStatistics in the codebase, indicating it was not crucial to the method's operation. This change aligns with the intended design.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexHNSWViewer` class.

# Test: Search for the class definition and usage. Expect: Proper integration with the new search method.
rg --type cpp -A 5 $'class VectorIndexHNSWViewer'

Length of output: 650


Script:

#!/bin/bash
# Description: Search for the usage of the `search` method in the `VectorIndexHNSWViewer` class.

# Test: Find all occurrences of `search` method calls. Expect: Understand how the method is used and ensure the change is safe.
rg --type cpp 'search\(' -A 3

Length of output: 22355


Script:

#!/bin/bash
# Description: Search for any references to `SearchStatistics` in relation to the `search` method.

# Test: Find all occurrences of `SearchStatistics` to understand its previous usage and impact.
rg --type cpp 'SearchStatistics'

Length of output: 32


Script:

#!/bin/bash
# Description: Search for comments or documentation mentioning `SearchStatistics`.

# Test: Find any comments or documentation that reference `SearchStatistics`.
rg --type cpp 'SearchStatistics' -g '*.h' -g '*.cpp' -A 3 -B 3

Length of output: 62

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)

30-77: Review VectorIndexCache class implementation.

The VectorIndexCache class uses an LRUCache for caching VectorIndexViewer objects. It includes mechanisms for cleaning outdated cache entries and managing shutdown operations. Ensure that the use of mutexes and condition variables is correct and that the cleaning logic is efficient and does not introduce race conditions.

Verification successful

Integration and Usage of VectorIndexCache Verified

The VectorIndexCache class is well-integrated within the codebase, with clear methods for managing its lifecycle. It is used in various components, including the server setup and context management, and its functionality is covered by unit tests.

  • Server.cpp: Sets the vector index cache in the global context.
  • Context.cpp: Provides methods to set, get, and drop the vector index cache.
  • Test files confirm the cache's functionality is tested.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexCache` class.

# Test: Search for the class definition and usage. Expect: Proper integration with cache management logic.
rg --type cpp -A 5 $'class VectorIndexCache'

Length of output: 1190


Script:

#!/bin/bash
# Description: Search for the usage of `VectorIndexCache` to understand its integration and usage.

# Test: Find all occurrences of `VectorIndexCache` to verify its integration.
rg --type cpp 'VectorIndexCache'

Length of output: 3577

dbms/src/TiDB/Schema/VectorIndex.h (2)

27-35: Review VectorIndexDefinition struct introduction.

The VectorIndexDefinition struct replaces previous enums and structures, encapsulating index properties. Ensure that the struct is comprehensive and correctly represents the necessary properties. The TODO comment suggests potential future additions, which should be tracked.

Verification successful

The VectorIndexDefinition struct is consistently integrated.

The VectorIndexDefinition struct is used across various components and is tested, indicating proper integration. The struct's usage in different contexts, including formatters for output, suggests it is well-incorporated into the codebase. No issues found with its current implementation.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation and usage of the `VectorIndexDefinition` struct.

# Test: Search for the struct definition and usage. Expect: Consistent usage with new properties.
rg --type cpp -A 5 $'struct VectorIndexDefinition'

Length of output: 497


Script:

#!/bin/bash
# Description: Search for usage of `VectorIndexDefinition` in the codebase to verify its integration.

# Test: Find all occurrences of `VectorIndexDefinition` usage.
rg --type cpp 'VectorIndexDefinition'

Length of output: 3281


Line range hint 48-73:
Review fmt::formatter for VectorIndexDefinition.

The fmt::formatter specialization for VectorIndexDefinition has been updated. Ensure that the formatting logic correctly reflects the new struct and that it integrates well with the existing logging or output mechanisms.

dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)

Line range hint 33-41:
Refactor to VectorIndexBuilder: Ensure type support logic is correct.

The isSupportedType method has been moved to VectorIndexBuilder. This refactoring aligns with the new class responsibilities. Verify that the logic for checking supported types remains correct and comprehensive.


59-73: Refactor to VectorIndexViewer: Ensure view logic is correct.

The view method replaces load and now uses VectorIndexFileProps. This refactoring is intended to improve clarity and modularity. Verify that the logic for viewing vector index files is correctly implemented and that all relevant cases are handled.


42-56: Refactor to VectorIndexBuilder: Check method signature and logic.

The create method now uses VectorIndexDefinitionPtr instead of VectorIndexInfo. This change enhances encapsulation and type safety. Ensure that all usages of create are updated to match the new signature and that the switch-case logic correctly handles all necessary cases.

Verification successful

Verification Successful: create Method Signature Updated Correctly

The usage of VectorIndexBuilder::create has been successfully updated to the new signature using VectorIndexDefinitionPtr. All identified usages are consistent with this change.

  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.h: The do_vector_index parameter is correctly typed as TiDB::VectorIndexDefinitionPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `create` method are updated to the new signature.

# Test: Search for the `create` method usage. Expect: Only occurrences of the new signature.
rg --type cpp -A 5 $'VectorIndexBuilder::create'

Length of output: 1216


Script:

#!/bin/bash
# Description: Check the type of `do_vector_index` in `DMFileWriter.h` to verify it matches the new signature.
rg --type cpp 'do_vector_index' dbms/src/Storages/DeltaMerge/File/DMFileWriter.h -A 5 -B 5

Length of output: 1700

dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)

31-55: Introduction of VectorIndexBuilder: Review class structure and methods.

The VectorIndexBuilder class is introduced to handle the construction of vector indices. The methods create, isSupportedType, addBlock, and save are aligned with its responsibilities. Ensure that the class design supports efficient and correct index building.


58-90: Introduction of VectorIndexViewer: Review class structure and methods.

The VectorIndexViewer class is introduced to handle viewing vector index files. The methods view, search, and get are aligned with its responsibilities. Ensure that the class design supports efficient and correct index viewing.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (3)

47-47: Update to VectorIndexViewerPtr: Ensure index interaction is correct.

The index member variable now uses VectorIndexViewerPtr, reflecting the new class structure. Verify that all interactions with the index are correctly updated to use the viewer.


56-56: Update calcResultsByPack method signature: Ensure efficient result handling.

The method now accepts a reference to a vector of VectorIndexViewer::Key, improving efficiency by avoiding unnecessary copies. Ensure that all calls to this method are updated to match the new signature.


65-71: Update constructor: Ensure alignment with new indexing strategy.

The constructor now takes VectorIndexViewerPtr and a reference to a vector of VectorIndexViewer::Key. This change aligns with the new indexing strategy. Ensure that all instances of this class are constructed using the updated parameters.

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)

93-98: Destructor implementation looks good.

The destructor correctly handles shutting down the cleaner thread and joining it.


86-91: Verify thread management.

Ensure that the cleaner thread is properly managed and terminated to prevent resource leaks.

Verification successful

Thread management is correctly implemented in VectorIndexCache.

The destructor of VectorIndexCache ensures proper termination of the cleaner thread by setting a shutdown flag, notifying the condition variable, and joining the thread. This prevents resource leaks and ensures clean shutdown.

  • Destructor: VectorIndexCache::~VectorIndexCache() sets is_shutting_down = true, notifies shutdown_cv, and calls cleaner_thread.join().
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify proper thread management in the constructor of `VectorIndexCache`.

# Test: Check for proper thread management and termination.
rg --type cpp 'VectorIndexCache' -A 5

Length of output: 16167


63-83: Ensure proper shutdown handling.

The loop correctly handles shutdown signals, but ensure that any resources are properly released if exceptions occur.

Verification successful

Proper shutdown handling and thread management confirmed.

The cleanOutdatedLoop method and its associated thread are correctly managed. The destructor properly sets the shutdown flag, notifies the condition variable, and joins the thread, ensuring a clean and safe shutdown process. No further action is needed.

  • Destructor: VectorIndexCache::~VectorIndexCache() properly joins the cleaner_thread.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify proper shutdown handling in the `cleanOutdatedLoop`.

# Test: Check for any resources that might not be released upon exceptions.
rg --type cpp 'cleanOutdatedLoop' -A 5

Length of output: 1429


Script:

#!/bin/bash
# Description: Check if `cleaner_thread` is properly managed in the `VectorIndexCache` class.

# Test: Look for thread management (join or detach) in the destructor or elsewhere in the class.
rg --type cpp 'cleaner_thread' -A 10

Length of output: 1717

dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)

90-94: Addition of VectorIndexFileProps looks good.

The new message is well-defined and should enhance the handling of vector index properties.


65-67: Verify backward compatibility.

The change from ColumnVectorIndexInfo to VectorIndexFileProps may affect backward compatibility. Ensure that the reserved field is correctly handled in the codebase.

Verification successful

Backward compatibility maintained.

The change from ColumnVectorIndexInfo to VectorIndexFileProps does not impact backward compatibility, as there are no remaining references to ColumnVectorIndexInfo in the codebase. The use of the reserved field 101 ensures that the field number is not reused.

  • The reserved field 101 is correctly implemented in dmfile.proto.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify backward compatibility for changes in `ColumnStat`.

# Test: Check for usage of the reserved field and ensure compatibility.
rg --type proto 'reserved 101' -A 5

Length of output: 67


Script:

#!/bin/bash
# Description: Verify backward compatibility for changes in `ColumnStat`.

# Test: Check for usage of the reserved field and ensure compatibility.
rg 'reserved 101' --glob '*.proto' -A 5

Length of output: 540


Script:

#!/bin/bash
# Description: Check for references to `ColumnVectorIndexInfo` in the codebase to ensure backward compatibility.

# Test: Search for `ColumnVectorIndexInfo` in the codebase.
rg 'ColumnVectorIndexInfo'

Length of output: 26

dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)

96-96: Verify impact of access modifier change for serializeColumnFileSet.

Changing the access modifier from public to private may affect accessibility. Ensure that this change is intentional and does not introduce any issues.

Verification successful

Access modifier change for serializeColumnFileSet is safe.

The method serializeColumnFileSet is only used within the Serializer.cpp file and does not have external dependencies. Changing its access modifier to private improves encapsulation without affecting other parts of the codebase.

  • No external usage of serializeColumnFileSet found outside Serializer.cpp.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the impact of changing the access modifier for `serializeColumnFileSet`.

# Test: Check for any unintended consequences of making `serializeColumnFileSet` private.
rg --type cpp 'serializeColumnFileSet' -A 5

Length of output: 4711


96-96: Verify impact of access modifier change for serializeSegment.

Changing the access modifier from private to public may affect encapsulation. Ensure that this change is intentional and does not introduce any issues.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)

119-119: Type consistency maintained.

The update to use VectorIndexViewer::Key for rowid ensures consistency with the new type usage across the file.


37-47: Consider the performance impact of the runtime check.

The addition of a runtime check to ensure that sorted_results is sorted improves robustness. However, this could introduce a performance overhead in critical paths. Consider evaluating the necessity of this check in production environments or providing a build configuration to disable it.

dbms/src/Storages/DeltaMerge/DMContext.h (1)

136-137: Verify the usage of the new scan_context parameter.

The addition of the scan_context parameter enhances flexibility, but ensure that its usage is correctly handled throughout the codebase.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)

75-75: Use builder pattern for vector index creation.

The vector_index is now created using VectorIndexBuilder::create, which enhances flexibility. Ensure that the builder pattern is correctly implemented and utilized.


101-101: Update member variable type.

The vector_index member variable type has been updated to VectorIndexBuilderPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Update member variable type verified.

The vector_index member variable type has been updated to VectorIndexBuilderPtr in the relevant parts of the codebase. No inconsistencies were found in the application of this change.

  • The vector_index member variable in DMFileWriter.h has been correctly updated to VectorIndexBuilderPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vector_index` member variable are updated to use `VectorIndexBuilderPtr`.

# Test: Search for `vector_index` usage. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'vector_index'

Length of output: 30888


57-57: Update constructor parameter type.

The constructor now uses TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that all instances of this class are updated to pass the correct type.


163-163: Update method parameter type.

The addStreams method now uses TiDB::VectorIndexDefinitionPtr instead of TiDB::VectorIndexInfoPtr. Ensure that all calls to this method are updated to pass the correct type.

Verification successful

All addStreams method calls use the correct parameter type.

The addStreams method now correctly uses TiDB::VectorIndexDefinitionPtr for the do_vector_index parameter, and all relevant calls to this method have been updated accordingly. The verification confirms that the type change has been properly implemented.

  • DMFileWriter.cpp: The call to addStreams uses cd.vector_index, which is correctly defined as TiDB::VectorIndexDefinitionPtr.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all calls to `addStreams` use the correct type for `do_vector_index`.

# Test: Search for `addStreams` calls. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'addStreams'

Length of output: 1531


Script:

#!/bin/bash
# Use ast-grep to find all calls to the addStreams method and check the type of the do_vector_index parameter.
ast-grep --lang cpp --pattern 'addStreams($_, $_, $_, $_)' 

Length of output: 60


Script:

#!/bin/bash
# Use rg to find all calls to the addStreams method and include context lines to check parameter types.
rg --type cpp 'addStreams' -A 3 -B 2

Length of output: 1678


Script:

#!/bin/bash
# Search for the definition of `cd` to confirm the type of `cd.vector_index`.
rg --type cpp 'cd\s*=' -A 3 -B 2

Length of output: 11390

dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (2)

95-95: Update member variable type.

The vector_index member variable type has been updated to TiDB::VectorIndexDefinitionPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Verification complete: Member variable type updated consistently.

The vector_index member variable type change to TiDB::VectorIndexDefinitionPtr in ColumnDefine has been consistently applied across the codebase. No discrepancies found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vector_index` member variable in `ColumnDefine` are updated to use `TiDB::VectorIndexDefinitionPtr`.

# Test: Search for `vector_index` usage in `ColumnDefine`. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'ColumnDefine'

Length of output: 324094


102-102: Update constructor parameter type.

The constructor now uses TiDB::VectorIndexDefinitionPtr for vector_index_. Ensure that all instances of this struct are updated to pass the correct type.

Verification successful

No issues found with ColumnDefine constructor usage. The search did not find any instances of ColumnDefine being instantiated with the vector_index_ parameter, indicating that the codebase does not directly use this parameter in constructor calls.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all instantiations of the `ColumnDefine` struct use the correct type for `vector_index_`.

# Test: Search for `ColumnDefine` instantiations. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'ColumnDefine'

Length of output: 324094


Script:

#!/bin/bash
# Search for all constructor calls for `ColumnDefine` that include the `vector_index_` parameter.
ast-grep --lang cpp --pattern 'ColumnDefine($_, $_, $_, $_, $_)' --json

Length of output: 76

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (7)

96-96: Override destructor.

The destructor now explicitly uses the override keyword, which improves clarity and ensures that the base class has a virtual destructor.


108-108: Override method signature.

The read method now explicitly uses the override keyword, which improves clarity and ensures that the base class has a virtual method.


144-144: Declare method without implementation.

The readByIndexReader method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.


148-148: Declare method without implementation.

The readByFollowingOtherColumns method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.


182-182: Update member variable type.

The vec_index member variable type has been updated to VectorIndexViewerPtr. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Member variable type update verified.

The vec_index member variable type has been correctly updated to VectorIndexViewerPtr in all relevant parts of the codebase. No further action is needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `vec_index` member variable are updated to use `VectorIndexViewerPtr`.

# Test: Search for `vec_index` usage. Expect: Only occurrences with the new type.
rg --type cpp -A 5 $'vec_index'

Length of output: 15367


186-186: Update member variable name and purpose.

The valid_rows_after_search variable has been replaced with sorted_results, indicating a change in how results are managed. Ensure that this change is reflected in all relevant parts of the codebase.

Verification successful

Change from valid_rows_after_search to sorted_results is correctly reflected.

The sorted_results variable is used consistently across the codebase, and there are no remaining references to valid_rows_after_search. The transition appears to be complete and correctly implemented.

  • Files checked:
    • DMFileWithVectorIndexBlockInputStream.h
    • VectorColumnFromIndexReader.h
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of the `sorted_results` member variable are updated accordingly.

# Test: Search for `sorted_results` usage. Expect: Only occurrences with the new name and purpose.
rg --type cpp -A 5 $'sorted_results'

Length of output: 11304


113-113: Declare method without implementation.

The readImpl method is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.

Verification successful

Implementation Found for readImpl Method

The readImpl method declared in DMFileWithVectorIndexBlockInputStream.h is implemented in the corresponding source file DMFileWithVectorIndexBlockInputStream.cpp. This addresses the concern regarding the method's implementation. No further action is required.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `readImpl` is provided in the corresponding source file.

# Test: Search for `readImpl` implementation. Expect: Implementation in the corresponding source file.
rg --type cpp -A 5 $'readImpl'

Length of output: 88199

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5)

38-54: LGTM!

The getUSearchMetricKind function correctly maps distance metrics and handles unsupported metrics with a runtime check.


58-64: Constructor looks good!

The VectorIndexHNSWBuilder constructor correctly initializes the index with the appropriate metric and checks the index kind.


Line range hint 67-106: Robust method implementation!

The addBlock method effectively handles memory allocation and data validity checks, ensuring robust data addition to the index.


110-113: Save method is well-implemented!

The save method effectively handles errors using runtime checks, ensuring the index is saved correctly.


Line range hint 134-198: Comprehensive search method!

The search method includes detailed checks and exception handling, ensuring robust search functionality. Note the TODO comment regarding efSearch support.

dbms/src/Common/LRUCache.h (1)

74-80: Thread-safe contains function!

The contains function correctly checks for the presence of a key using a lock guard, ensuring thread safety without altering the LRU order.

dbms/src/Storages/S3/FileCache.h (4)

228-233: Efficient caching strategy!

The downloadFileForLocalRead method effectively ensures that files in use are not evicted prematurely, enhancing caching strategy.


249-253: Robust retrieval strategy!

The getOrWait method provides a robust mechanism to ensure file availability, handling waiting and foreground downloading effectively.


256-256: Foreground download method looks good!

The fgDownload method is well-implemented to handle file downloads in the foreground, ensuring correct processing.


75-76: Verify waitForNotEmpty implementation.

Ensure that the waitForNotEmpty method correctly blocks until the segment is populated, maintaining thread safety.

Verification successful

waitForNotEmpty Method Implementation Verified

The waitForNotEmpty method in FileSegment is correctly implemented to block until the segment is populated, using mutexes and condition variables to ensure thread safety. The method includes a timeout to prevent indefinite blocking.

  • File: dbms/src/Storages/S3/FileCache.cpp
  • Implementation Details: Uses a mutex (mtx) and a condition variable (cv_ready) to wait for the status to change from Empty.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `waitForNotEmpty` method in `FileSegment`.

# Test: Check if `waitForNotEmpty` is implemented and used correctly.
ast-grep --lang cpp --pattern $'Status FileSegment::waitForNotEmpty() {
  $$$
}'

Length of output: 81


Script:

#!/bin/bash
# Description: Search for any method named `waitForNotEmpty` in the entire codebase to locate its implementation or usage.

# Search for `waitForNotEmpty` method across all C++ files.
rg 'waitForNotEmpty' --type cpp -A 5

Length of output: 2284

dbms/src/TiDB/Schema/TiDB.h (1)

209-209: Verify the impact of changing vector_index type.

The type of vector_index in the ColumnInfo struct has been changed from VectorIndexInfoPtr to VectorIndexDefinitionPtr. This change may affect how vector indices are used throughout the codebase. Ensure that all usages of vector_index are updated to accommodate this new type.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)

64-64: Ensure VectorIndexBuilder::isSupportedType is correctly implemented.

The change from VectorIndex::isSupportedType to VectorIndexBuilder::isSupportedType suggests a restructuring of vector index handling. Verify that the new method is implemented correctly and covers all necessary type checks.

Verification successful

Verification of VectorIndexBuilder::isSupportedType Implementation

The VectorIndexBuilder::isSupportedType method is correctly implemented. It checks for nullable types and supports DataTypeFloat32. This implementation aligns with its intended use in DMFileWriter.cpp to ensure only supported types are processed for vector indexing.

  • File: dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp
  • Method: bool VectorIndexBuilder::isSupportedType(const IDataType & type)
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of `VectorIndexBuilder::isSupportedType`.

# Test: Search for the implementation of `isSupportedType`. Expect: Correct type checks.
ast-grep --lang cpp --pattern $'class VectorIndexBuilder {
  $$$
  static bool isSupportedType($_) {
    $$$
  }
  $$$  
}'

Length of output: 469


Script:

#!/bin/bash
# Search for the definition and usage of `isSupportedType` in the codebase
rg 'isSupportedType' --context 5 --type cpp

# Search for the definition of the `VectorIndexBuilder` class
rg 'class VectorIndexBuilder' --context 10 --type cpp

Length of output: 5909

dbms/src/Storages/DeltaMerge/ScanContext.h (2)

93-93: Initialize total_vector_idx_load_from_s3 correctly.

The new atomic variable total_vector_idx_load_from_s3 is introduced to track vector index load times from S3. Ensure that it is correctly initialized and used throughout the class.


150-150: Verify integration of total_vector_idx_load_from_s3 in serialization and merging.

The new variable total_vector_idx_load_from_s3 is integrated into serialization, deserialization, and merging logic. Verify that these changes are correctly implemented and that the variable is accurately tracked and aggregated.

Also applies to: 202-202, 262-262, 314-314

Verification successful

Integration of total_vector_idx_load_from_s3 is correctly implemented.

The variable total_vector_idx_load_from_s3 is properly integrated into serialization and merging logic. It is serialized and deserialized using tiflash_scan_context_pb, and its value is aggregated during merging. Additionally, its behavior is verified through multiple tests. No issues found with its integration.

  • Serialization and Deserialization: Properly handled via tiflash_scan_context_pb.
  • Merging Logic: Correctly aggregated with other contexts.
  • Testing: Adequately tested in gtest_dm_vector_index.cpp.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the integration of `total_vector_idx_load_from_s3` in serialization, deserialization, and merging.

# Test: Search for `total_vector_idx_load_from_s3` usage in serialization and merging. Expect: Correct integration.
rg --type cpp --word-regexp 'total_vector_idx_load_from_s3'

Length of output: 2866

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)

125-168: Optimize Filter Assignment in readImpl Method.

The filter assignment in readImpl could be optimized by using a more efficient algorithm or data structure, especially if sorted_results is large.

-  auto it = std::lower_bound(sorted_results.begin(), sorted_results.end(), start_offset);
-  while (it != sorted_results.end())
-  {
-      auto rowid = *it;
-      if (rowid >= max_rowid_exclusive)
-          break;
-      filter[rowid - start_offset] = 1;
-      ++it;
-  }
+  for (auto rowid : sorted_results)
+  {
+      if (rowid >= start_offset && rowid < max_rowid_exclusive)
+          filter[rowid - start_offset] = 1;
+  }

Likely invalid or redundant comment.


34-81: Ensure Proper Error Handling for Constructor Preconditions.

The constructor of DMFileWithVectorIndexBlockInputStream uses RUNTIME_CHECK to validate preconditions. Ensure that these checks are appropriate and that exceptions are handled correctly if they fail.


217-255: Ensure Consistency in readByFollowingOtherColumns.

The method readByFollowingOtherColumns assembles a block using header_layout. Ensure consistency in the layout and verify that all necessary columns are present.

Verification successful

Consistency in readByFollowingOtherColumns is ensured.

The constructor of DMFileWithVectorIndexBlockInputStream includes runtime checks to verify that header_layout contains all necessary columns. These checks ensure that the layout used in readByFollowingOtherColumns is consistent. No further action is needed.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistency in the layout of `readByFollowingOtherColumns`.
# Test: Ensure that all necessary columns are present in the layout.

# Search for the usage of `header_layout` in the method.
rg --type cpp --context 5 'header_layout' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'

Length of output: 1937


273-384: Verify S3 File Cache Configuration in loadVectorIndex.

The method loadVectorIndex relies on S3 file cache. Ensure that the configuration is correct and that potential errors are handled gracefully.

Verification successful

S3 File Cache Configuration and Error Handling Verified

The loadVectorIndex method correctly verifies the presence of the S3 file cache and handles potential errors by retrying downloads and throwing exceptions if necessary. The configuration and error handling are properly implemented.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify S3 file cache configuration in `loadVectorIndex`.
# Test: Ensure that the configuration is correct and errors are handled.

# Search for S3 file cache usage in the method.
rg --type cpp --context 5 'S3' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'

Length of output: 1961


386-475: Ensure Correctness of Vector Search Logic in loadVectorSearchResult.

The method loadVectorSearchResult performs a vector search and updates the pack filter. Verify the correctness of the search logic and the impact on the pack filter.

dbms/src/Common/FailPoint.cpp (1)

109-109: Verify Integration of New Fail Point file_cache_fg_download_fail.

Ensure that the new fail point file_cache_fg_download_fail is correctly integrated and used in the codebase. Verify that it serves its intended purpose without unintended side effects.

Verification successful

Fail Point file_cache_fg_download_fail is Correctly Integrated

The fail point file_cache_fg_download_fail is properly declared, used, and tested within the codebase, indicating correct integration. No issues were found regarding its implementation or usage.

  • Declaration: dbms/src/Storages/S3/FileCache.cpp
  • Usage: Triggered in FileCache.cpp and tested in gtest_dm_vector_index.cpp
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify integration of new fail point `file_cache_fg_download_fail`.
# Test: Ensure the fail point is correctly integrated and used.

# Search for the usage of `file_cache_fg_download_fail` in the codebase.
rg --type cpp 'file_cache_fg_download_fail'

Length of output: 758

dbms/src/Interpreters/Context.h (1)

395-395: Verify Impact of Parameter Name Change in setVectorIndexCache.

The parameter name change from cache_size_in_bytes to cache_entities may impact how the method is used. Ensure that all calls to this method are updated accordingly and that the new name accurately reflects its purpose.

Verification successful

Parameter Name Change Verified: setVectorIndexCache

The parameter name change from cache_size_in_bytes to cache_entities is correctly implemented. The calls to this method have been updated, and the new name accurately reflects its purpose, as evidenced by the context and comments in the code.

  • gtest_dm_vector_index.cpp: Uses a literal value for cache entities.
  • Server.cpp: Uses a configuration value for cache entities with appropriate comments explaining the change.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify impact of parameter name change in `setVectorIndexCache`.
# Test: Ensure all calls to this method are updated and the new name reflects its purpose.

# Search for calls to `setVectorIndexCache` in the codebase.
rg --type cpp 'setVectorIndexCache'

Length of output: 420


Script:

#!/bin/bash
# Description: Gather context around calls to `setVectorIndexCache` to verify parameter name change.
# Get 5 lines of context around each call to understand usage and purpose.

# Get context around the call in `gtest_dm_vector_index.cpp`.
rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp

# Get context around the call in `Server.cpp`.
rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Server/Server.cpp

Length of output: 1392

dbms/src/Storages/S3/FileCache.cpp (1)

676-700: Foreground download logic is well-implemented.

The method effectively manages foreground downloads with appropriate logging and exception handling. The use of fail points for testing is commendable.

dbms/src/TiDB/Schema/TiDB.cpp (2)

411-417: Improved type safety and error handling for vector index.

The use of tipb::VectorIndexKind and tipb::VectorDistanceMetric, along with runtime checks, enhances type safety and robustness.


475-499: Enhanced clarity and robustness in vector index parsing.

The explicit parsing and structured initialization improve clarity and error handling in vector index management.

dbms/src/Interpreters/Context.cpp (4)

57-57: Include Directive Update Approved.

The change from VectorIndex.h to VectorIndexCache.h aligns with modifications in cache handling and improves clarity.


1388-1388: Simplified Cache Reset Approved.

The direct call to reset() on the shared pointer enhances clarity and conciseness.


1410-1410: Simplified Cache Reset Approved.

The direct call to reset() on the shared pointer enhances clarity and conciseness.


1391-1397: Method Signature Change Approved.

The parameter change from cache_size_in_bytes to cache_entities aligns with the new focus on entity count. Ensure all usages of this method are updated accordingly.

Verification successful

Method Signature Change Verified.

All usages of setVectorIndexCache have been updated to reflect the new parameter cache_entities. The changes are consistent across the codebase.

  • dbms/src/Server/Server.cpp: Usage with vec_index_cache_entities.
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Usage with 1000.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of `setVectorIndexCache` to ensure they match the new parameter `cache_entities`.

# Test: Search for the method usage. Expect: Only occurrences of the new parameter name.
rg --type cpp -A 3 $'setVectorIndexCache'

Length of output: 1274

dbms/src/Server/Server.cpp (1)

1442-1445: Verify the impact of changing vector index cache configuration.

The change from a byte size to an entity count for the vector index cache could have significant implications on memory management and performance. Ensure that this change is well-documented and tested to confirm it meets the desired performance and memory usage goals.

dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (5)

93-96: Addition of cdVec() method is appropriate.

The cdVec() method provides a useful utility for defining vector columns, enhancing code reusability and readability.


884-887: Update to VectorIndexDefinition is appropriate.

The change from VectorIndexInfo to VectorIndexDefinition enhances type safety and aligns with the refactoring efforts. Ensure that all usages of VectorIndexDefinition are correctly updated throughout the codebase.


1427-1907: Comprehensive test cases for S3 and file cache interactions.

The test cases cover critical scenarios, including cache eviction and S3 failures, ensuring robustness. Consider verifying the test coverage to ensure all edge cases are addressed.

Verification successful

Comprehensive test cases for S3 and file cache interactions confirmed.

The test cases in VectorIndexSegmentOnS3Test cover various scenarios, ensuring robust handling of S3 and file cache interactions. The coverage appears comprehensive.

  • FileCacheNotEnabled
  • ReadWithoutIndex
  • ReadFromIndex
  • FileCacheEvict
  • FileCacheEvictAndVectorCacheDrop
  • FileCacheDeleted
  • FileCacheDeletedAndVectorCacheDrop
  • ConcurrentDownloadFromS3
  • S3Failure
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify test coverage for S3 and file cache interactions.

# Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage.
rg --type cpp --type c++ --type c 'TEST_F\(VectorIndexSegmentOnS3Test'

Length of output: 100


Script:

#!/bin/bash
# Description: Verify test coverage for S3 and file cache interactions.

# Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage.
rg --type cpp 'TEST_F\(VectorIndexSegmentOnS3Test'

Length of output: 1111


1207-1290: S3 interaction and file cache management setup is well-implemented.

The SetUp and TearDown methods ensure proper configuration and cleanup of S3 and file cache environments, which are essential for the tests. Verify that these configurations are consistent with the overall system requirements.

Verification successful

S3 and file cache management setup is consistent across the codebase.

The configurations in gtest_dm_vector_index.cpp align with the widespread use of S3::ClientFactory::instance() and FileCache::initialize() in other parts of the codebase, indicating a consistent approach. This setup is well-implemented and matches the overall system requirements.

  • Instances found in files like FlashService.cpp, S3LockService.cpp, Server.cpp, and others confirm the consistency.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify S3 configuration and file cache management in the codebase.

# Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase.
rg --type cpp --type c++ --type c 'S3::ClientFactory|FileCache::initialize'

Length of output: 105


Script:

#!/bin/bash
# Description: Verify S3 configuration and file cache management in the codebase.

# Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase.
rg --type cpp --type c 'S3::ClientFactory|FileCache::initialize'

Length of output: 9671


205-208: Update to VectorIndexDefinition is appropriate.

The change from VectorIndexInfo to VectorIndexDefinition enhances type safety and aligns with the refactoring efforts. Ensure that all usages of VectorIndexDefinition are correctly updated throughout the codebase.

Verification successful

VectorIndexDefinition is consistently used across the codebase.

The transition from VectorIndexInfo to VectorIndexDefinition is well-implemented throughout the relevant files. This change enhances type safety and aligns with the refactoring objectives.

  • dbms/src/TiDB/Schema/VectorIndex.h: Definition and usage of VectorIndexDefinition.
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Multiple instances of VectorIndexDefinition usage in tests.
  • Other related files in DeltaMerge and TiDB modules.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all usages of `VectorIndexDefinition` in the codebase.

# Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase.
rg --type cpp --type c++ --type c 'VectorIndexDefinition'

Length of output: 85


Script:

#!/bin/bash
# Description: Verify all usages of `VectorIndexDefinition` in the codebase.

# Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase.
rg --type cpp 'VectorIndexDefinition'

Length of output: 3281

@JaySon-Huang
Copy link

@coderabbitai full review

@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

✅ Actions performed

Full review triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @dbms/src/Interpreters/Context.cpp:
- Line 1388: The three cache dropters are inconsistent: dropMarkCache calls
shared->mark_cache->reset() (clearing contents but keeping the shared_ptr),
while dropMinMaxIndexCache and dropVectorIndexCache call
shared->minmax_index_cache.reset() and shared->vector_index_cache.reset()
(destroying the shared_ptr), causing getMarkCache to return an empty object but
the others to return nullptr; make them consistent by choosing one semantic and
applying it to all three—either change dropMarkCache to reset the pointer
(shared->mark_cache.reset()) to match the other two, or change
dropMinMaxIndexCache and dropVectorIndexCache to clear contents via ->reset() to
match dropMarkCache; update the corresponding getMarkCache, getMinMaxIndexCache,
and getVectorIndexCache callers/assumptions if needed.

In @dbms/src/Storages/DeltaMerge/Index/VectorIndex.h:
- Around line 58-59: Fix the typo in the class/file comment for VectorIndex:
replace "may nor may not read the whole content of the file into memory" with
"may or may not read the whole content of the file into memory" so the comment
for the VectorIndex view is grammatically correct and clear.

In @dbms/src/Storages/DeltaMerge/ScanContext.h:
- Around line 28-31: The header opens a Clang diagnostic push/ignore block (the
__clang__ / #pragma clang diagnostic push and ignored
"-Wdeprecated-declarations" in ScanContext.h) but never restores it, causing
suppression to leak to including TUs; add a matching clang diagnostic pop (i.e.,
the corresponding #pragma clang diagnostic pop) right before the file/namespace
is closed (or remove the push/ignore pair if unused) so the suppression is
properly scoped and does not affect other translation units.

In @dbms/src/Storages/S3/FileCache.cpp:
- Around line 236-297: Add a clear header comment to FileCache::getOrWait
explaining its error behavior: unlike FileCache::get which returns nullptr on
reservation failure, getOrWait will throw an Exception when it cannot reserve
space (and is intended to be used by downloadFileForLocalRead which handles that
exception). Mention the rationale (caller is expected to retry or handle via
try/catch), the specific exception type (ErrorCodes::S3_ERROR) and that callers
of getOrWait must be prepared for exceptions rather than nullptr, and reference
FileCache::get for the alternative semantics.
🧹 Nitpick comments (8)
dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)

87-95: Public serializeSegment signature expansion is likely ABI-breaking; consider an options struct to stop signature bloat.

Making this public and adding multiple parameters (segment_epoch, segment_range, read_ranges, MemTrackerWrapper &) will ripple across the codebase and can break ABI for downstream linkage. If this API is expected to evolve further (seems likely with vector-index plumbing), consider grouping the new fields into a single params/options type to reduce churn and make call sites self-documenting.

Also please ensure the header declaration exactly matches the out-of-line definition and all call sites were updated.


96-100: Good: serializeColumnFileSet kept private; consider forward-declaring MemTrackerWrapper to reduce header coupling.

The explicit private: and the helper extraction look reasonable. Since MemTrackerWrapper appears in multiple declarations in this header, if it’s only used by reference, a forward declaration (and moving the include to the .cpp) can reduce rebuild impact—if nothing else in this header truly needs Flash/Mpp/TrackedMppDataPacket.h.

dbms/src/Storages/S3/FileCache.cpp (2)

341-341: Consider documenting the lock type change.

The change from lock_guard to unique_lock in the remove() method is subtle but likely necessary for consistency with other methods. However, it's not immediately clear why this change was needed as part of this PR.

Do the other methods that were also changed to use unique_lock require early unlock or additional lock operations? If not, consider adding a brief comment explaining when unique_lock vs lock_guard should be used.


68-107: Extract the hardcoded timeout values into configurable parameters.

The 300-second timeout for waiting on file segments is hardcoded (line 95), as is the 30-second per-iteration timeout (line 83). While the comment "Snapshot time is 300s" references the snapshot interval concept, the actual timeout value lacks configuration. This contrasts with other parts of the S3 codebase—S3GCManager uses configurable mark_delete_timeout_seconds and PocoHTTPClient uses ConnectionTimeouts—and with FileCache's own pattern of configurable parameters like dt_filecache_max_downloading_count_scale and dt_filecache_min_age_seconds.

Consider making both timeouts configurable through Settings or StorageRemoteCacheConfig to support varying operational requirements without code changes.

dbms/src/Common/LRUCache.h (1)

74-80: Consider lock type consistency and const correctness.

The new non-const contains() method has two minor style inconsistencies:

  1. Lock type variance: This method uses std::lock_guard while other methods (e.g., get(), set()) use std::scoped_lock. Although functionally equivalent for single-mutex locking, consistency aids maintainability.

  2. Const correctness: Since this method doesn't mutate LRU order or internal state, it could be marked const to match the existing const overload at Line 195-199, eliminating the need for two separate implementations.

♻️ Proposed refinement

If a non-const overload is genuinely required (e.g., for template resolution), at minimum align the lock type:

-    bool contains(const Key & key)
+    bool contains(const Key & key) const
 {
-        std::lock_guard cache_lock(mutex);
+        std::scoped_lock cache_lock(mutex);
         return cells.contains(key);
 }

Otherwise, remove this overload and rely solely on the existing const version.

dbms/src/Server/Server.cpp (1)

1442-1445: LGTM! Cache unit changed from bytes to entity count.

The refactor from vec_index_cache_size (bytes) to vec_index_cache_entities (count) with a default of 1000 is reasonable. The updated comment correctly notes that mmap delegates memory management to the OS. The conditional check prevents setting the cache when the count is zero.

Optional: Consider upper bound validation.

While the zero-check prevents misconfiguration, there's no upper bound validation. An extremely large value (e.g., millions) could cause issues. Consider adding a sanity check or logging a warning for unusually large values.

♻️ Optional: Add upper bound validation
 size_t vec_index_cache_entities = config().getUInt64("vec_index_cache_entities", 1000);
+if (vec_index_cache_entities > 100000)
+{
+    LOG_WARNING(log, "vec_index_cache_entities is set to a very large value: {}, consider reducing it", vec_index_cache_entities);
+}
 if (vec_index_cache_entities)
     global_context->setVectorIndexCache(vec_index_cache_entities);
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)

66-76: Consider using std::shared_lock for read-only cache lookups.

The getOrSet method acquires an exclusive lock (std::scoped_lock) just to insert into files_to_check. If the file path is already in the set, a shared lock would suffice for the lookup. However, since std::unordered_set::insert is a modifying operation, the current approach is correct for the insert case.

One potential optimization: check if the key exists first with a shared lock, then upgrade to exclusive only if needed. This could reduce contention under high read loads.

♻️ Optional optimization for reduced contention
     template <typename LoadFunc>
     Cache::MappedPtr getOrSet(const Cache::Key & file_path, LoadFunc && load)
     {
         {
-            std::scoped_lock lock(mu);
-            files_to_check.insert(file_path);
+            // Try shared lock first to check existence
+            {
+                std::shared_lock read_lock(mu);
+                if (files_to_check.count(file_path) > 0)
+                    goto cache_lookup;
+            }
+            // Upgrade to exclusive lock for insertion
+            {
+                std::scoped_lock write_lock(mu);
+                files_to_check.insert(file_path);
+            }
         }
-
+cache_lookup:
         auto result = cache.getOrSet(file_path, load);
         return result.first;
     }
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1)

305-329: Consider making the retry count configurable.

The hardcoded retry count of 3 for S3 downloads works but could be made configurable for different deployment scenarios. This is a minor improvement that could be deferred.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d911b and ff13cae.

📒 Files selected for processing (36)
  • contrib/tipb
  • dbms/src/Common/FailPoint.cpp
  • dbms/src/Common/LRUCache.h
  • dbms/src/Interpreters/Context.cpp
  • dbms/src/Interpreters/Context.h
  • dbms/src/Server/Server.cpp
  • dbms/src/Storages/DeltaMerge/DMContext.h
  • dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h
  • dbms/src/Storages/DeltaMerge/File/ColumnStat.h
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp
  • dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp
  • dbms/src/Storages/DeltaMerge/File/DMFileWriter.h
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp
  • dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h
  • dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto
  • dbms/src/Storages/DeltaMerge/Index/RSIndex.h
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex.h
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp
  • dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h
  • dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp
  • dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h
  • dbms/src/Storages/DeltaMerge/Remote/Serializer.h
  • dbms/src/Storages/DeltaMerge/ScanContext.h
  • dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp
  • dbms/src/Storages/S3/FileCache.cpp
  • dbms/src/Storages/S3/FileCache.h
  • dbms/src/Storages/S3/FileCachePerf.cpp
  • dbms/src/Storages/S3/FileCachePerf.h
  • dbms/src/TiDB/Schema/TiDB.cpp
  • dbms/src/TiDB/Schema/TiDB.h
  • dbms/src/TiDB/Schema/VectorIndex.h
💤 Files with no reviewable changes (1)
  • dbms/src/Storages/DeltaMerge/Index/RSIndex.h
🧰 Additional context used
🧬 Code graph analysis (18)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (2)
dbms/src/Storages/DeltaMerge/ScanContext.h (1)
  • DB (33-355)
dbms/src/Storages/S3/FileCachePerf.h (1)
  • DB (24-37)
dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (3)
dbms/src/TiDB/Schema/VectorIndex.h (1)
  • TiDB (24-45)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1)
  • VectorIndexBuilder (32-56)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
  • create (42-57)
  • create (42-42)
dbms/src/Storages/DeltaMerge/DMContext.h (1)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (2)
  • scan_context (1393-1406)
  • scan_context (1393-1393)
dbms/src/Storages/S3/FileCachePerf.cpp (1)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (1)
  • file_cache (1341-1349)
dbms/src/Interpreters/Context.h (1)
dbms/src/Interpreters/Context.cpp (2)
  • setVectorIndexCache (1391-1398)
  • setVectorIndexCache (1391-1391)
dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)
  • VectorIndexBuilder (32-56)
  • VectorIndexViewer (60-90)
dbms/src/TiDB/Schema/VectorIndex.h (1)
dbms/src/TiDB/Schema/TiDB.h (1)
  • TiDB (47-289)
dbms/src/Storages/S3/FileCachePerf.h (1)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)
  • DB (24-37)
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (2)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1)
  • VectorIndexViewer (60-90)
dbms/src/Common/ColumnsHashing.h (1)
  • Key (529-550)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (2)
  • view (116-132)
  • view (116-116)
dbms/src/Storages/S3/FileCache.cpp (2)
dbms/src/Storages/S3/tests/gtest_filecache.cpp (16)
  • file_cache (185-199)
  • file_cache (185-185)
  • file_cache (245-245)
  • file_cache (269-269)
  • file_cache (286-286)
  • file_cache (308-308)
  • file_cache (344-344)
  • file_cache (446-446)
  • file_cache (465-465)
  • file_cache (484-484)
  • file_cache (503-503)
  • file_cache (522-522)
  • file_cache (541-541)
  • file_cache (560-560)
  • file_cache (579-579)
  • file_cache (598-598)
dbms/src/Common/FailPoint.cpp (4)
  • lock (208-215)
  • lock (210-210)
  • lock (217-221)
  • lock (219-219)
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)
  • VectorIndexCache (30-77)
dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
  • isSupportedType (33-40)
  • isSupportedType (33-33)
dbms/src/Storages/S3/FileCache.h (1)
dbms/src/Storages/S3/FileCache.cpp (10)
  • waitForNotEmpty (68-107)
  • waitForNotEmpty (68-68)
  • downloadFileForLocalRead (160-176)
  • downloadFileForLocalRead (160-162)
  • getOrWait (236-297)
  • getOrWait (236-236)
  • bgDownload (664-674)
  • bgDownload (664-664)
  • fgDownload (676-700)
  • fgDownload (676-676)
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (4)
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (1)
  • DMFileWithVectorIndexBlockInputStream (53-81)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (4)
  • get (202-206)
  • get (202-202)
  • view (116-132)
  • view (116-116)
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)
  • read (79-137)
  • read (79-79)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
  • view (59-74)
  • view (59-59)
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (1)
dbms/src/Columns/ColumnArray.h (2)
  • create (53-56)
  • create (58-58)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (2)
dbms/src/Storages/DeltaMerge/Filter/RSOperator.cpp (2)
  • wrapWithANNQueryInfo (90-93)
  • wrapWithANNQueryInfo (90-90)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (2)
  • get (202-206)
  • get (202-202)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (3)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1)
  • VectorIndexHNSWBuilder (27-39)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
  • view (59-74)
  • view (59-59)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/usearch_index_dense.h (2)
  • metric (822-841)
  • metric (1977-1981)
🪛 Clang (14.0.6)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h

[error] 24-24: variable 'DB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 24-24: variable name 'DB' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

dbms/src/Storages/S3/FileCachePerf.cpp

[error] 20-20: variable 'file_cache' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)

dbms/src/TiDB/Schema/VectorIndex.h

[error] 24-24: variable 'TiDB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h

[error] 27-27: variable 'DB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 27-27: variable name 'DB' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

dbms/src/Storages/S3/FileCachePerf.h

[error] 24-24: variable 'DB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 24-24: variable name 'DB' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp

[error] 20-20: variable 'vector_search' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)

dbms/src/Storages/S3/FileCache.cpp

[error] 59-59: do not declare C-style arrays, use std::array<> instead

(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp

[error] 49-49: do not declare C-style arrays, use std::array<> instead

(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)


[error] 50-50: do not declare C-style arrays, use std::array<> instead

(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)


[error] 503-503: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 570-570: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1202-1202: constructor does not initialize these fields: page_id_allocator, storage_path_pool, storage_pool, table_columns, settings, wn_segment

(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)


[error] 1207-1207: method 'SetUp' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 1207-1207: invalid case style for method 'SetUp'

(readability-identifier-naming,-warnings-as-errors)


[error] 1209-1209: variable 'force_use_dmfile_format_v3' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1249-1249: variable 'file_cache_config' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1276-1276: method 'TearDown' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 1276-1276: invalid case style for method 'TearDown'

(readability-identifier-naming,-warnings-as-errors)


[error] 1278-1278: variable 'force_use_dmfile_format_v3' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1288-1288: variable 's3_client' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1295-1295: 4 adjacent parameters of 'createComputeNodeStream' of similar type ('const int &') are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 1295-1295: parameter 'write_node_segment' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1296-1296: parameter 'columns_to_read' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1297-1297: parameter 'filter' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1298-1298: parameter 'read_scan_context' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1343-1343: variable 'file_cache' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1351-1351: method 'prepareWriteNodeStable' can be made static

(readability-convert-member-functions-to-static,-warnings-as-errors)


[error] 1354-1354: variable 'block' is not initialized

(cppcoreguidelines-init-variables,-warnings-as-errors)


[error] 1369-1369: parameter 'ref_vec' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1370-1370: 2 adjacent parameters of 'computeNodeANNQuery' of similar type are easily swapped by mistake

(bugprone-easily-swappable-parameters,-warnings-as-errors)


[error] 1370-1370: parameter 'top_k' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1371-1371: parameter 'read_scan_context' is unused

(misc-unused-parameters,-warnings-as-errors)


[error] 1408-1408: redundant access specifier has the same accessibility as the previous access specifier

(readability-redundant-access-specifiers,-warnings-as-errors)


[error] 1410-1410: member variable 'page_id_allocator' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1411-1411: member variable 'storage_path_pool' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1412-1412: member variable 'storage_pool' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1413-1413: member variable 'table_columns' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1414-1414: member variable 'settings' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1416-1416: member variable 'ns_id' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1419-1419: member variable 'wn_segment' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1421-1421: member variable 'orig_mode' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1424-1424: member variable 'dummy_mem_tracker' has protected visibility

(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)


[error] 1427-1427: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1452-1452: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1478-1478: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1526-1526: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1600-1600: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1679-1679: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1746-1746: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1818-1818: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)


[error] 1884-1884: all parameters should be named in a function

(readability-named-parameter,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp

[error] 38-38: parameter 'd' is unused

(misc-unused-parameters,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/Index/VectorIndex.h

[error] 28-28: variable 'DB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 28-28: variable name 'DB' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h

[error] 21-21: variable 'DB' is non-const and globally accessible, consider making it const

(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)


[error] 21-21: variable name 'DB' is too short, expected at least 3 characters

(readability-identifier-length,-warnings-as-errors)

🪛 Cppcheck (2.19.0)
dbms/src/Storages/S3/FileCache.cpp

[error] 169-169: syntax error

(syntaxError)

🔇 Additional comments (80)
contrib/tipb (1)

1-1: Subproject commit update is valid and properly integrated.

The contrib/tipb submodule pointer update to 56b0212a1fa1c92ccf3b9ed7945c716346dad299 exists and is reachable. Vector-index related code in the repository correctly includes and uses the tipb protobuf types (tipb::VectorIndexKind, tipb::VectorDistanceMetric) without apparent issues. The integration appears sound.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (2)

1-14: LGTM: Standard license header.

The Apache 2.0 license header is correctly formatted with the appropriate copyright year.


15-22: LGTM: Correct thread_local performance context implementation.

The implementation correctly defines the per-thread performance context following the established pattern from FileCachePerfContext. The thread_local variable is properly declared as extern in the header and defined here with default initialization.

Note: The static analysis warning about "non-const global variable" is a false positive and can be safely ignored. The variable must be mutable for performance tracking, and thread_local ensures thread-safety.

dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (3)

1-17: LGTM: Standard header structure.

The license header, include guard, and necessary includes are correctly formatted.


19-22: LGTM: Good defensive programming.

The macro safeguard to undefine Poco's thread_local macro is appropriate defensive programming, ensuring compatibility with the C++11 thread_local keyword.


24-37: LGTM: Well-designed performance context following established patterns.

The VectorSearchPerfContext struct correctly follows the pattern established by FileCachePerfContext in the codebase. The design is appropriate:

  • Public members are suitable for a simple performance tracking struct
  • Member names (visited_nodes, discarded_nodes) align with metrics defined in ScanContext.h
  • The reset() method using aggregate initialization is idiomatic C++
  • The extern thread_local declaration is the correct pattern for thread_local variables shared across translation units

Note: The static analysis warnings about "variable 'DB' is non-const" and "variable name 'DB' is too short" are false positives—DB is a namespace, not a variable, and follows the established naming convention in this codebase.

dbms/src/Storages/DeltaMerge/ScanContext.h (4)

93-93: LGTM: Metric declaration is consistent.

The new total_vector_idx_load_from_s3 metric follows the established naming convention and is properly placed alongside related vector index metrics.


150-150: LGTM: Deserialization is correct.

The deserialization of total_vector_idx_load_from_s3 from the protobuf field follows the same pattern as other vector index metrics.


202-202: LGTM: Serialization is correct.

The serialization of total_vector_idx_load_from_s3 to the protobuf field follows the same pattern as other vector index metrics.


262-262: LGTM: Merge logic is correct.

Both merge methods correctly accumulate total_vector_idx_load_from_s3 using the += operator, consistent with the pattern used for other vector index metrics.

Also applies to: 314-314

dbms/src/Common/FailPoint.cpp (1)

109-109: LGTM! Fail point added correctly.

The new fail point file_cache_fg_download_fail is properly added to the APPLY_FOR_FAILPOINTS macro, enabling fault injection for foreground S3 downloads.

dbms/src/Storages/S3/FileCachePerf.cpp (1)

17-22: LGTM! Per-thread performance context correctly defined.

The thread_local variable enables per-thread tracking of foreground download metrics without synchronization overhead.

dbms/src/Storages/S3/FileCachePerf.h (1)

19-37: LGTM! Performance context header well-structured.

The header properly defines a thread_local performance context for tracking foreground download metrics, following the established pattern from VectorSearchPerfContext.

dbms/src/Storages/S3/FileCache.h (5)

29-29: LGTM! Required include for synchronization.

The <condition_variable> header is necessary for the newly added cv_ready member used to coordinate file segment state transitions.


75-75: LGTM! Synchronization primitives properly integrated.

The waitForNotEmpty() method and cv_ready condition variable work together to enable foreground callers to wait until a file segment is downloaded. The notification in setStatus() correctly wakes waiting threads when the status changes from Empty.

Also applies to: 87-88, 134-134


228-233: LGTM! Method provides local-read guarantees.

The downloadFileForLocalRead() method ensures that callers receive a FileSegmentPtr guard, preventing the cached file from being evicted while in use. This is essential for local reads that require stable file access.


249-253: LGTM! Foreground download semantics well-documented.

The getOrWait() method provides blocking semantics for foreground downloads with appropriate documentation about potential nullptr returns requiring caller retry.


256-256: LGTM! Foreground download method added.

The fgDownload() method complements the existing bgDownload() for background downloads, providing synchronous download semantics.

dbms/src/Storages/S3/FileCache.cpp (3)

17-17: LGTM! Required headers and namespace declarations added.

The includes for FailPoint, SyncPoint, and FileCachePerf support the new foreground download functionality with fault injection and performance tracking.

Also applies to: 20-20, 30-30, 57-60


160-176: LGTM! Consistency check prevents stale cache entries.

The downloadFileForLocalRead() implementation properly handles the edge case where cache metadata indicates a file is Complete but the actual file was manually removed from disk. The forced removal ensures cache consistency.


676-700: LGTM! Foreground download implementation is well-structured.

The fgDownload() implementation correctly:

  • Uses SYNC_FOR for test synchronization
  • Integrates fail point for testing
  • Handles exceptions and marks failures appropriately
  • Increments performance metrics
  • Cleans up failed downloads by removing them from cache

The error handling ensures that failed foreground downloads don't leave orphaned cache entries.

dbms/src/TiDB/Schema/TiDB.h (1)

209-209: LGTM! Type updated consistently with refactor.

The change from VectorIndexInfoPtr to VectorIndexDefinitionPtr aligns with the broader refactor introducing the builder/viewer architecture for vector indices. The TODO comment appropriately flags that this will eventually move to the table level.

dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1)

95-108: LGTM! Type signature updated consistently.

The member and constructor parameter types are correctly updated from VectorIndexInfoPtr to VectorIndexDefinitionPtr, maintaining consistency with the broader refactor. The initialization logic remains unchanged.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)

57-57: LGTM! Parameter type updated consistently.

The Stream constructor parameter type correctly changed from VectorIndexInfoPtr to TiDB::VectorIndexDefinitionPtr, aligning with the broader refactor.


75-75: LGTM! Factory method updated to builder pattern.

The change from VectorIndex::create to VectorIndexBuilder::create correctly adopts the new builder/viewer architecture. Based on the relevant code snippets, VectorIndexBuilder::create validates the definition and returns the appropriate builder implementation.


101-101: LGTM! Member type updated to builder.

The member type correctly changed from VectorIndexPtr to VectorIndexBuilderPtr, consistent with the Stream constructor initialization at Line 75.


163-163: LGTM! Method signature updated consistently.

The addStreams parameter type correctly changed to TiDB::VectorIndexDefinitionPtr, matching the Stream constructor signature at Line 57.

dbms/src/Interpreters/Context.h (1)

395-395: LGTM! Parameter semantics updated from bytes to entity count.

The signature change from cache_size_in_bytes to cache_entities correctly reflects the shift to entity-based caching for the VectorIndexCache.

dbms/src/Storages/DeltaMerge/DMContext.h (1)

136-150: LGTM! Scan context properly propagated.

The addition of the scan_context parameter with a default value of nullptr maintains backward compatibility while enabling scan context propagation through the DM read paths.

dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)

44-44: LGTM! Protobuf type updated consistently.

The change from ColumnVectorIndexInfo to VectorIndexFileProps aligns with the protobuf schema changes and is properly handled in serialization/deserialization methods.

dbms/src/Interpreters/Context.cpp (2)

57-57: LGTM! Include updated to VectorIndexCache.

The include change aligns with the new VectorIndexCache architecture.


1391-1397: LGTM! VectorIndexCache construction updated with entity-based parameter.

The parameter semantic change from cache_size_in_bytes to cache_entities is correctly implemented and aligns with the updated API.

dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)

65-67: LGTM! Protobuf field migration handled correctly.

The old field 101 is properly reserved, and the new vector_index field uses field number 102 with the updated VectorIndexFileProps type. This ensures backward compatibility while migrating to the new schema.


84-94: LGTM! Well-documented protobuf message.

The VectorIndexFileProps message is clearly documented, distinguishing its read-oriented purpose (providing information to read out the index) from VectorIndexDefinition's construction-oriented purpose. The field choices are appropriate for storing basic index metadata.

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (2)

30-77: LGTM on the overall cache design.

The VectorIndexCache class provides a clean abstraction for caching VectorIndexViewer instances with background cleanup. The use of LRUCache with a separate tracking set for periodic validation is a reasonable approach. The thread-safety primitives (shared_mutex, atomic, condition_variable) are appropriately chosen.


43-45: The shutdown mechanism is correctly implemented.

The destructor properly notifies shutdown_cv.notify_all() immediately after setting is_shutting_down = true. Additionally, the shutdown_cv.wait_for() call in the cleanup loop includes a predicate that checks the atomic flag, and has a 1-minute timeout to prevent indefinite blocking. No changes are needed.

Likely an incorrect or invalid review comment.

dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)

24-28: LGTM on the forward declaration updates.

The forward declarations correctly reflect the new builder/viewer abstraction pattern. The separation of VectorIndexBuilder and VectorIndexViewer with their respective shared_ptr typedefs is clean and aligns with the broader refactor.

dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (3)

350-355: LGTM on vector index metadata population.

The metadata fields (index_kind, distance_metric, dimensions) are correctly populated from the vector index definition using protobuf enum name lookups. This ensures proper serialization for later restoration when reading.


104-127: LGTM on the addStreams signature update.

The parameter change from VectorIndexInfoPtr to VectorIndexDefinitionPtr aligns with the broader type refactor. The null-safe propagation of do_vector_index through the substream callback is correctly handled.


340-356: No changes needed. The save() method already includes error handling via RUNTIME_CHECK_MSG, which throws an exception if the save operation fails. This prevents getSize() from being called on a non-existent file. Additionally, save() returns void, making the suggestion to use its return value infeasible.

Likely an incorrect or invalid review comment.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (3)

43-48: Good defensive check, but consider adding a release-mode fast-fail for corrupted data.

The sortedness check is appropriately guarded by #ifndef NDEBUG to avoid performance overhead in release builds. However, if sorted data is a critical invariant, consider whether corrupted/unsorted input in production could cause subtle bugs (e.g., missing results, infinite loops).

The current code would fail at line 74 (results_it == results_it_max) if results are out-of-order, providing some protection.


36-77: LGTM on the calcResultsByPack refactor.

The change from rvalue reference to const reference for sorted_results is appropriate since the caller now owns the sorted data. The logic correctly accumulates per-pack offsets and validates that all results are consumed.


119-119: LGTM on the type update.

Changing from VectorIndex::Key to VectorIndexViewer::Key is consistent with the viewer abstraction and the type alias remains UInt32.

dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)

42-57: LGTM on the VectorIndexBuilder::create factory.

The dimension validation is correct, and the factory properly delegates to VectorIndexHNSWBuilder for the HNSW index kind. The error message includes the enum name for easier debugging.


59-74: LGTM on the VectorIndexViewer::view factory.

The implementation correctly:

  1. Validates dimension bounds
  2. Parses the index kind from the protobuf string representation
  3. Delegates to the appropriate HNSW viewer implementation
  4. Provides a meaningful error message for unsupported index types

33-40: LGTM on isSupportedType implementation.

The type check correctly handles both nullable and non-nullable array types, verifying that the inner type is Float32.

dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (2)

63-72: LGTM on the constructor signature update.

The change from move semantics to const reference for sorted_results_ aligns with the implementation that processes the data immediately in calcResultsByPack. The VectorIndexViewerPtr parameter correctly uses the new viewer abstraction.


47-47: LGTM on the member type change.

The change from VectorIndexPtr to VectorIndexViewerPtr correctly reflects that this reader only needs view capabilities (search and get operations), not builder functionality.

dbms/src/TiDB/Schema/VectorIndex.h (3)

27-36: LGTM on the VectorIndexDefinition struct.

The struct cleanly encapsulates vector index metadata using protobuf types directly. The default initialization to INVALID_* values is a good defensive pattern that helps catch uninitialized usage.

The TODO comment on line 34-35 appropriately notes that additional fields (like efConstruct) may be added later.


47-74: LGTM on the fmt formatters.

The formatters provide clean string representations using protobuf's _Name functions. The null-safe handling in the pointer formatter (outputting <no_idx>) is appropriate.


38-43: LGTM on the immutable pointer pattern and dimension constant.

Using std::shared_ptr<const VectorIndexDefinition> enforces immutability at the type level, which is appropriate for data constructed from TiDB table definitions.

The MAX_VECTOR_DIMENSION constant of 16,383 correctly reflects TiDB's maximum supported vector dimensions.

dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)

40-55: Consider potential TOCTOU race between cache.contains() and cache.remove().

Between checking cache.contains(file_path) at line 40 and calling cache.remove(file_path) at line 53, another thread could potentially add the same entry back into the cache via getOrSet(). If the LRU cache's remove() is safe to call on a freshly-added entry, this is benign. However, if this could cause issues, consider holding a lock across both operations.

Since this is a cleanup routine that runs periodically and the worst case is re-downloading an index file, the current approach is likely acceptable for simplicity.


63-84: LGTM: Background cleanup loop with proper shutdown handling.

The cleanup loop correctly:

  • Uses condition variable with timeout for efficient waiting.
  • Checks shutdown flag after waking to exit cleanly.
  • Wraps cleanup in try-catch to prevent thread termination on exceptions.

86-98: LGTM: Constructor and destructor properly manage background thread lifecycle.

The constructor launches the cleaner thread, and the destructor correctly signals shutdown via the atomic flag, notifies the condition variable, and joins the thread.

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (2)

17-22: LGTM: Header refactoring with forward declarations.

Good practice to use forward declarations and move implementations to the cpp file. This reduces compilation dependencies and improves build times.


182-187: LGTM: Member type updates align with the new Builder/Viewer architecture.

The type changes from VectorIndexPtr to VectorIndexViewerPtr and from BitmapFilter to std::vector<UInt32> sorted_results correctly reflect the new vector index design where viewing/searching is separated from building.

dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)

34-81: LGTM: Constructor with comprehensive validation.

The constructor properly validates:

  • ann_query_info is not null
  • Column ID matches between vec_cd and ann_query_info
  • All reader columns exist in header_layout
  • Vector column is not duplicated in reader.read_columns
  • Header layout has expected column count

The start_offset_to_pack_id map initialization is efficient and correct.


106-123: LGTM: Filter application logic is correct.

The read() method correctly handles both cases:

  • When return_filter == true: returns the filter to caller for late materialization
  • When return_filter == false: applies the filter inline

156-164: Consider using std::lower_bound return value more efficiently.

The loop is correct, but you could potentially use std::upper_bound to find the end of the range and iterate between bounds for slightly cleaner code. However, the current implementation is clear and efficient.


386-410: LGTM: Vector search result processing with proper deduplication.

The implementation correctly:

  • Validates that valid_rows covers the dmfile rows
  • Sorts results for efficient lookup in readImpl
  • Removes duplicates (defensive, as noted in comment)
  • Creates VectorColumnFromIndexReader with sorted results
  • Updates scan_context with performance metrics

425-450: LGTM: Pack filter optimization is well-implemented.

The loop efficiently maps search results to packs and sets unused packs to RSResult::None. The final RUNTIME_CHECK ensures all results were processed, which is a good safety check.

dbms/src/TiDB/Schema/TiDB.cpp (2)

409-420: LGTM: JSON serialization with proper validation.

The serialization correctly validates that the vector index kind and distance metric are not invalid before serializing, preventing malformed data from being persisted.


474-500: LGTM: Robust JSON deserialization with comprehensive validation.

The deserialization correctly:

  • Parses string values to protobuf enums using _Parse helpers
  • Validates parsed values are not INVALID_* sentinel values
  • Validates dimension is within bounds (0, MAX_VECTOR_DIMENSION]
  • Constructs an immutable VectorIndexDefinition wrapped in shared_ptr<const>

Good defensive programming with RUNTIME_CHECK_MSG for actionable error messages.

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)

24-39: LGTM: Builder class with clear interface.

The VectorIndexHNSWBuilder class correctly encapsulates the HNSW index building logic with:

  • Constructor taking vector index definition
  • addBlock for adding vectors with delete mark handling
  • save for persisting to disk
  • Private added_rows counter for tracking keys (including nulls/deletes)

41-56: LGTM: Viewer class with memory-mapped index access.

The VectorIndexHNSWViewer class provides read-only access to persisted HNSW indices with:

  • Static view factory method for loading from file props and path
  • search method with row filter support
  • get method for retrieving individual vectors

The static analysis hints about namespace DB are false positives - this is a namespace declaration, not a variable.

dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (4)

38-56: LGTM: Metric mapping with proper validation.

The getUSearchMetricKind function correctly maps TiDB's distance metrics to usearch's internal types. The L1 metric is explicitly noted as unsupported with a clear error message.

The static analysis hint about unused parameter d is a false positive - the parameter is clearly used in the switch statement.


67-108: LGTM: Block addition with proper null/delete handling.

The addBlock method correctly:

  • Handles both nullable and non-nullable array columns
  • Skips rows marked as deleted
  • Skips NULL values
  • Validates dimension consistency
  • Uses ceil2 for efficient capacity reservation

156-176: Good defensive exception handling in search predicate.

The predicate lambda correctly catches exceptions and sets a flag to propagate the error after search completes. This is necessary because usearch's search may run on different threads where exceptions cannot be directly propagated.

The has_exception_in_search atomic flag and post-search exception throw pattern is a reasonable approach.


193-197: Good defensive post-filtering.

As noted in the comment, usearch doesn't always apply the predicate to all results. The explicit post-filtering ensures correctness regardless of usearch's internal behavior.

dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (7)

92-98: LGTM: Utility method for vector column definition.

The cdVec() helper method provides a clean way to create a vector column definition without the vector index attached, which is appropriate for read-path testing where the index definition is typically read from file metadata.


203-209: LGTM: Test updated to use new VectorIndexDefinition API.

The test correctly uses the new TiDB::VectorIndexDefinition with tipb::VectorIndexKind::HNSW and tipb::VectorDistanceMetric::L2 instead of the old enum types.


463-472: LGTM: Error message expectation updated for new metric naming.

The expected error message correctly reflects the updated distance metric naming convention (COSINE instead of lowercase Cosine).


1202-1274: LGTM: Comprehensive S3 test fixture setup.

The VectorIndexSegmentOnS3Test class provides a well-structured test fixture for testing vector index functionality with S3 storage:

  • Enables S3 configuration and creates test bucket
  • Sets up remote data store, page storage, and file cache
  • Creates a segment with vector index column
  • Properly tears down in reverse order

The static analysis hints about uninitialized members are false positives - SetUp() initializes all required members.


1478-1524: LGTM: Test verifies S3 download and caching behavior.

This test validates the critical path of:

  1. First query downloads from S3 (total_vector_idx_load_from_s3 = 1)
  2. Second query uses memory cache (total_vector_idx_load_from_cache = 1)

Good verification of the caching layer working correctly.


1818-1882: LGTM: Concurrent download test with sync points.

This test validates that concurrent ANN queries for the same vector index:

  1. First query performs the S3 download (fg_download_from_s3 = 1)
  2. Second query waits for the first download (fg_wait_download_from_s3 = 1)

This prevents duplicate downloads and ensures efficient resource usage.


1884-1907: LGTM: S3 failure test with proper exception handling.

The test verifies that S3 download failures are properly propagated as exceptions, preventing silent failures and ensuring the application can handle network/storage issues gracefully.

dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (4)

17-26: LGTM!

Includes are appropriate for the new builder/viewer architecture. The added dmfile.pb.h include properly supports the dtpb::VectorIndexFileProps type used in VectorIndexViewer.


28-29: Static analysis false positive - ignore.

The clang-tidy warnings about DB being a "non-const globally accessible variable" and having a "too short name" are false positives. Line 28 uses valid C++17 nested namespace syntax (namespace DB::DM), not a variable declaration.


31-56: LGTM!

The VectorIndexBuilder class provides a clean abstract interface:

  • Proper polymorphic base with virtual destructor
  • Clear pure virtual methods for addBlock and save
  • Factory pattern via create static method
  • Key type alias properly documents its purpose as row offset in DMFile

60-90: LGTM!

The VectorIndexViewer class provides a well-designed interface:

  • Reuses Key type from VectorIndexBuilder ensuring consistency
  • RowFilter alias with clear documentation of semantics (true = valid row)
  • Factory view method properly takes protobuf-based file properties
  • search and get methods provide complete read API for vector index queries

auto lock = getLock();
if (shared->minmax_index_cache)
shared->minmax_index_cache->reset();
shared->minmax_index_cache.reset();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the specific lines mentioned in Context.cpp
# to understand the cache dropping behavior
cd dbms/src/Interpreters && wc -l Context.cpp

Repository: CalvinNeo/tiflash

Length of output: 77


🏁 Script executed:

# Read the relevant sections of Context.cpp
head -n 1420 dbms/src/Interpreters/Context.cpp | tail -n +1350

Repository: CalvinNeo/tiflash

Length of output: 1682


🏁 Script executed:

# Search for dropMinMaxIndexCache, dropVectorIndexCache, and dropMarkCache definitions
rg -n 'drop(MinMaxIndexCache|VectorIndexCache|MarkCache)' dbms/src/Interpreters/Context.cpp -A 5

Repository: CalvinNeo/tiflash

Length of output: 596


🏁 Script executed:

# Find type definitions for these caches
rg -n 'minmax_index_cache|vector_index_cache|mark_cache' dbms/src/Interpreters/Context.h -B 2 -A 2

Repository: CalvinNeo/tiflash

Length of output: 43


🏁 Script executed:

# Search for type aliases in Context.h
rg -n 'MarkCachePtr|MinMaxIndexCachePtr|VectorIndexCachePtr' dbms/src/Interpreters/Context.h -B 2 -A 2

Repository: CalvinNeo/tiflash

Length of output: 43


🏁 Script executed:

# Look for the actual class definitions
rg -n 'class.*MarkCache|class.*MinMaxIndexCache|class.*VectorIndexCache' --type=cpp --type=h

Repository: CalvinNeo/tiflash

Length of output: 691


🏁 Script executed:

# Check the using declarations for these cache types
rg -n 'using.*MarkCache|using.*MinMaxIndexCache|using.*VectorIndexCache' --type=h

Repository: CalvinNeo/tiflash

Length of output: 377


🏁 Script executed:

# Search in the shared context structure definition
rg -n 'struct SharedData|class SharedData' dbms/src/Interpreters/Context.cpp -A 100 | head -150

Repository: CalvinNeo/tiflash

Length of output: 43


🏁 Script executed:

# Search for LRUCache reset() method definition
rg -n 'reset\(\)' dbms/src/Common/LRUCache.h -B 3 -A 5

Repository: CalvinNeo/tiflash

Length of output: 745


🏁 Script executed:

# Look at the MarkCache reset method specifically
rg -n 'class MarkCache' dbms/src/Storages/MarkCache.h -A 50 | grep -E 'reset|class MarkCache'

Repository: CalvinNeo/tiflash

Length of output: 168


🏁 Script executed:

# Check usages of dropMarkCache, dropMinMaxIndexCache, dropVectorIndexCache across codebase
rg -n 'dropMarkCache|dropMinMaxIndexCache|dropVectorIndexCache' --type=cpp -B 2 -A 2

Repository: CalvinNeo/tiflash

Length of output: 3754


Fix inconsistent cache dropping behavior across mark_cache, minmax_index_cache, and vector_index_cache.

The code exhibits inconsistent cache destruction semantics:

  • dropMarkCache() (line 1364) calls shared->mark_cache->reset(), which clears cache contents while keeping the shared_ptr alive.
  • dropMinMaxIndexCache() (line 1388) and dropVectorIndexCache() (line 1410) call .reset() on the shared_ptr itself, destroying the pointer entirely.

This means getMarkCache() returns an empty cache object after dropping, while getMinMaxIndexCache() and getVectorIndexCache() return nullptr. Standardize this behavior across all three cache types—either all should clear contents via ->reset() or all should destroy pointers via .reset().

🤖 Prompt for AI Agents
In @dbms/src/Interpreters/Context.cpp at line 1388, The three cache dropters are
inconsistent: dropMarkCache calls shared->mark_cache->reset() (clearing contents
but keeping the shared_ptr), while dropMinMaxIndexCache and dropVectorIndexCache
call shared->minmax_index_cache.reset() and shared->vector_index_cache.reset()
(destroying the shared_ptr), causing getMarkCache to return an empty object but
the others to return nullptr; make them consistent by choosing one semantic and
applying it to all three—either change dropMarkCache to reset the pointer
(shared->mark_cache.reset()) to match the other two, or change
dropMinMaxIndexCache and dropVectorIndexCache to clear contents via ->reset() to
match dropMarkCache; update the corresponding getMarkCache, getMinMaxIndexCache,
and getVectorIndexCache callers/assumptions if needed.

Comment on lines +58 to +59
/// Views a VectorIndex file.
/// It may nor may not read the whole content of the file into memory.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo in comment.

"may nor may not" should be "may or may not".

Proposed fix
 /// Views a VectorIndex file.
-/// It may nor may not read the whole content of the file into memory.
+/// It may or may not read the whole content of the file into memory.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/// Views a VectorIndex file.
/// It may nor may not read the whole content of the file into memory.
/// Views a VectorIndex file.
/// It may or may not read the whole content of the file into memory.
🤖 Prompt for AI Agents
In @dbms/src/Storages/DeltaMerge/Index/VectorIndex.h around lines 58 - 59, Fix
the typo in the class/file comment for VectorIndex: replace "may nor may not
read the whole content of the file into memory" with "may or may not read the
whole content of the file into memory" so the comment for the VectorIndex view
is grammatically correct and clear.

Comment on lines +28 to +31
#ifdef __clang__
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
#endif
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Missing #pragma clang diagnostic pop directive.

The #pragma clang diagnostic push at line 29 suppresses deprecated-declarations warnings, but there's no corresponding pop directive before the end of the file (line 355). This causes the warning suppression to leak into any translation unit that includes this header, potentially hiding legitimate warnings in unrelated code.

🔧 Add the missing pop directive

Add the following at the end of the file, just before the namespace closing bracket:

 } // namespace DB::DM
+
+#ifdef __clang__
+#pragma clang diagnostic pop
+#endif

Alternatively, if the pragmas are unnecessary (i.e., no deprecated APIs are actually used in this header), consider removing them entirely.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @dbms/src/Storages/DeltaMerge/ScanContext.h around lines 28 - 31, The header
opens a Clang diagnostic push/ignore block (the __clang__ / #pragma clang
diagnostic push and ignored "-Wdeprecated-declarations" in ScanContext.h) but
never restores it, causing suppression to leak to including TUs; add a matching
clang diagnostic pop (i.e., the corresponding #pragma clang diagnostic pop)
right before the file/namespace is closed (or remove the push/ignore pair if
unused) so the suppression is properly scoped and does not affect other
translation units.

Comment on lines +236 to +297
FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize)
{
auto s3_key = s3_fname.toFullKey();
auto file_type = getFileType(s3_key);
auto & table = tables[static_cast<UInt64>(file_type)];

std::unique_lock lock(mtx);

auto f = table.get(s3_key);
if (f != nullptr)
{
lock.unlock();
f->setLastAccessTime(std::chrono::system_clock::now());
auto status = f->waitForNotEmpty();
if (status == FileSegment::Status::Complete)
{
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment();
return f;
}
else
{
// On-going download failed, let the caller retry.
return nullptr;
}
}

GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment();

auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type);
if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true))
{
// Space not enough.
GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment();
LOG_DEBUG(
log,
"s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache",
s3_key,
cache_capacity,
cache_used,
estimzted_size);

// Just throw, no need to let the caller retry.
throw Exception( //
ErrorCodes::S3_ERROR,
"Cannot reserve {} space for object {}",
estimzted_size,
s3_key);
return nullptr;
}

auto file_seg
= std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type);
table.set(s3_key, file_seg);
lock.unlock();

++PerfContext::file_cache.fg_download_from_s3;
fgDownload(s3_key, file_seg);
if (!file_seg || !file_seg->isReadyToRead())
throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key);

return file_seg;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

ast-grep --pattern 'getOrWait($$$)'

Repository: CalvinNeo/tiflash

Length of output: 150


🏁 Script executed:

# Also find the get() method to understand the comparison
rg -B 3 -A 20 'FileSegmentPtr FileCache::get\('

Repository: CalvinNeo/tiflash

Length of output: 1622


🏁 Script executed:

# Find the FileCache class definition and both methods
fd 'FileCache' --type f

Repository: CalvinNeo/tiflash

Length of output: 204


🏁 Script executed:

# Look at the context around the getOrWait call at line 164
sed -n '150,180p' dbms/src/Storages/S3/FileCache.cpp

Repository: CalvinNeo/tiflash

Length of output: 1268


🏁 Script executed:

# Get the complete get() method to compare
sed -n '214,234p' dbms/src/Storages/S3/FileCache.cpp

Repository: CalvinNeo/tiflash

Length of output: 739


🏁 Script executed:

# Check for all occurrences of getOrWait in the entire codebase
rg 'getOrWait' --type cpp --type h

Repository: CalvinNeo/tiflash

Length of output: 357


Asymmetry in error handling between get() and getOrWait() is intentional but could be better documented.

The getOrWait() method throws an exception when space cannot be reserved (lines 278-282), while get() returns nullptr in the same scenario. This design choice is intentional—downloadFileForLocalRead() is the sole caller of getOrWait() and it's wrapped in a try-catch block prepared to handle exceptions. However, the asymmetry warrants explicit documentation in the method's contract or header comments explaining why callers should expect exceptions from getOrWait() but not from get().

🤖 Prompt for AI Agents
In @dbms/src/Storages/S3/FileCache.cpp around lines 236 - 297, Add a clear
header comment to FileCache::getOrWait explaining its error behavior: unlike
FileCache::get which returns nullptr on reservation failure, getOrWait will
throw an Exception when it cannot reserve space (and is intended to be used by
downloadFileForLocalRead which handles that exception). Mention the rationale
(caller is expected to retry or handle via try/catch), the specific exception
type (ErrorCodes::S3_ERROR) and that callers of getOrWait must be prepared for
exceptions rather than nullptr, and reference FileCache::get for the alternative
semantics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants