Conversation
Signed-off-by: Wish <breezewish@outlook.com>
…pingcap#166) Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Wish <breezewish@outlook.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
Signed-off-by: Lloyd-Pottiger <yan1579196623@gmail.com>
WalkthroughAdds a builder/viewer vector-index architecture, integrates vector-index-based DMFile read paths, introduces an in-memory VectorIndexCache with background cleaning, adds S3 foreground-download and wait semantics to FileCache, updates TiDB vector-index types, and several supporting API/signature adjustments across delta-merge and server components. Changes
Sequence Diagram(s)sequenceDiagram
participant Reader as DMFileWithVectorIndexBlockInputStream
participant VecCache as VectorIndexCache
participant FileCache as S3 FileCache
participant Viewer as VectorIndexViewer
participant DMFile as DMFileReader
Reader->>VecCache: getOrSet(index_file_path, load viewer)
alt viewer in cache
VecCache-->>Reader: return cached Viewer
else not in cache
VecCache->>FileCache: request index file (fgDownload via getOrWait)
FileCache->>FileCache: download from S3 (foreground) / write local file
FileCache-->>VecCache: local file path ready
VecCache->>Viewer: load/view from local path
Viewer-->>VecCache: store Viewer
VecCache-->>Reader: return Viewer
end
Reader->>Viewer: search(queryInfo, valid_rows)
Viewer-->>Reader: sorted rowid results
Reader->>DMFile: read packs using pack filter (pack ids from results)
DMFile-->>Reader: vector + other columns per pack
Reader->>Reader: assemble output Block and optional Filter
Reader-->>Client: return Block / filter
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
Outside diff range, codebase verification and nitpick comments (9)
dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)
44-44: Ensure Serialization and Deserialization ofstd::optional<dtpb::VectorIndexFileProps>The
toProtoandmergeFromProtomethods inColumnStat.hdo not currently handle thestd::optional<dtpb::VectorIndexFileProps>. Ensure that these methods are updated to serialize and deserialize this field correctly to prevent potential issues.
- File:
dbms/src/Storages/DeltaMerge/File/ColumnStat.h- Methods:
toProto,mergeFromProtoAnalysis chain
Verify the impact on serialization and deserialization.
The change to
std::optional<dtpb::VectorIndexFileProps>may affect howvector_indexis serialized and deserialized. Ensure that all related methods handle the new type correctly.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the impact of the type change on serialization and deserialization. # Test: Search for methods related to serialization and deserialization of `ColumnStat`. rg --type cpp -A 3 $'toProto|mergeFromProto|serializeToBuffer|parseFromBuffer'Length of output: 8784
Script:
#!/bin/bash # Extract the implementation of `toProto` and `mergeFromProto` methods from `ColumnStat.h` ast-grep --lang cpp --pattern $'dtpb::ColumnStat toProto() const { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.h ast-grep --lang cpp --pattern $'void mergeFromProto(const dtpb::ColumnStat & proto) { $$$ }' dbms/src/Storages/DeltaMerge/File/ColumnStat.hLength of output: 262
Script:
#!/bin/bash # Use ripgrep to search for the definitions of `toProto` and `mergeFromProto` methods in `ColumnStat.h` rg --type cpp 'dtpb::ColumnStat toProto\(\) const' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10 rg --type cpp 'void mergeFromProto\(const dtpb::ColumnStat & proto\)' dbms/src/Storages/DeltaMerge/File/ColumnStat.h -A 10Length of output: 1164
dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
104-108: Update documentation foraddStreamsmethod.The
addStreamsmethod signature has been updated to useTiDB::VectorIndexDefinitionPtrinstead ofTiDB::VectorIndexInfoPtr. Ensure that any related documentation or comments are updated to reflect this change.dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)
83-103: Consider Logging Additional Information in Destructor.The destructor logs the completion of reading a DMFile with vector index. Consider logging additional context, such as the number of rows processed or any errors encountered, to enhance traceability.
- "Finished read DMFile with vector index for column dmf_{}/{}(id={}), " + "Finished read DMFile with vector index for column dmf_{}/{}(id={}), processed_rows={}, "
106-123: Clarify Filtering Logic inreadMethod.The
readmethod contains logic for filtering based on aFilterPtr. Ensure that the comments and logic are clear, especially regarding the implications ofreturn_filterbeingfalse.- // If return_filter == false, we must filter by ourselves. + // If return_filter == false, apply filtering within this method.
170-215: Consider Caching Results inreadByIndexReader.The method
readByIndexReaderiterates over packs and reads vector columns. Consider caching results if this method is called frequently with the same parameters to improve performance.
258-271: Add Logging for Load Status inloadMethod.Consider adding logging to indicate when the
loadmethod is invoked and completed, which can be useful for debugging and performance monitoring.+ LOG_DEBUG(log, "Loading vector index and search results");
477-485: Clarify the Purpose ofgetPackIdFromBlock.The method
getPackIdFromBlockretrieves the pack ID from a block. Ensure that the purpose and usage of this method are well-documented, especially regarding assumptions about block alignment.- // The start offset of a block is ensured to be aligned with the pack. + // Assumes the start offset of a block is aligned with the pack.dbms/src/Storages/S3/FileCache.cpp (2)
68-107: Use named constants for timeout values.The method uses magic numbers (30 and 300 seconds) for timeouts. Consider defining named constants for these values to improve readability and maintainability.
constexpr auto WAIT_TIMEOUT = std::chrono::seconds(30); constexpr auto MAX_WAIT_TIMEOUT = std::chrono::seconds(300);
160-176: Enhance logging for file download attempts.Consider adding more detailed logging to trace the execution flow, especially when a file is removed from the cache due to inconsistencies.
LOG_DEBUG(log, "Attempting to download file for local read: {}", s3_fname.toFullKey());
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (36)
- contrib/tipb (1 hunks)
- dbms/src/Common/FailPoint.cpp (1 hunks)
- dbms/src/Common/LRUCache.h (1 hunks)
- dbms/src/Interpreters/Context.cpp (3 hunks)
- dbms/src/Interpreters/Context.h (1 hunks)
- dbms/src/Server/Server.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/DMContext.h (2 hunks)
- dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1 hunks)
- dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (5 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (4 hunks)
- dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4 hunks)
- dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (5 hunks)
- dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (4 hunks)
- dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (3 hunks)
- dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1 hunks)
- dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1 hunks)
- dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2 hunks)
- dbms/src/Storages/DeltaMerge/ScanContext.h (6 hunks)
- dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (26 hunks)
- dbms/src/Storages/S3/FileCache.cpp (8 hunks)
- dbms/src/Storages/S3/FileCache.h (6 hunks)
- dbms/src/Storages/S3/FileCachePerf.cpp (1 hunks)
- dbms/src/Storages/S3/FileCachePerf.h (1 hunks)
- dbms/src/TiDB/Schema/TiDB.cpp (3 hunks)
- dbms/src/TiDB/Schema/TiDB.h (1 hunks)
- dbms/src/TiDB/Schema/VectorIndex.h (1 hunks)
Files skipped from review due to trivial changes (1)
- contrib/tipb
Additional comments not posted (76)
dbms/src/Storages/S3/FileCachePerf.cpp (1)
1-22: LGTM!The file sets up a thread-local performance context for file caching. The setup is straightforward and aligns with typical practices for performance tracking.
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (1)
1-22: LGTM!The file sets up a thread-local performance context for vector search operations. The setup is straightforward and aligns with typical practices for performance tracking.
dbms/src/Storages/DeltaMerge/Index/RSIndex.h (1)
Line range hint
1-26:
Simplification acknowledged and approved.The removal of the
vectormember and its constructor simplifies theRSIndexstructure, focusing on the essential components. This change enhances clarity and reduces unnecessary complexity.dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)
24-28: LGTM! Modular design enhances clarity and maintainability.The introduction of
VectorIndexBuilderandVectorIndexViewerclasses, along with their smart pointer types, suggests a more modular approach. This can improve code clarity and maintainability by separating concerns between building and viewing vector indices.dbms/src/Storages/S3/FileCachePerf.h (1)
1-37: LGTM! Effective use of thread-local storage for performance tracking.The
FileCachePerfContextstructure provides a clear way to track performance metrics related to file caching. Usingthread_localforfile_cacheensures that performance data is managed per thread, which is suitable for concurrent environments.dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)
1-37: LGTM! Efficient thread-local performance tracking for vector search.The
VectorSearchPerfContextstructure is well-designed for tracking performance metrics in vector search operations. The use ofthread_localforvector_searchensures that performance data is isolated per thread, which is advantageous in concurrent processing.dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)
27-39: ReviewVectorIndexHNSWBuilderclass changes.The class
VectorIndexHNSWBuilderhas been introduced, replacingUSearchIndexWithSerialization. The constructor now takes aVectorIndexDefinitionPtr, which suggests a design shift towards using definitions for initialization. Ensure that the new design aligns with the overall architecture and that theaddBlockandsavemethods are correctly implemented.Verification successful
VectorIndexHNSWBuilder Class Implementation Verified
The
VectorIndexHNSWBuilderclass and its methodsaddBlockandsaveare implemented correctly, with appropriate type checks and error handling. The integration with the rest of the codebase is consistent with the new design. No issues found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexHNSWBuilder` class. # Test: Search for the class definition and usage. Expect: Consistent usage with new constructor. rg --type cpp -A 5 $'class VectorIndexHNSWBuilder'Length of output: 652
Script:
#!/bin/bash # Description: Find all usages of the `VectorIndexHNSWBuilder` class to verify its integration and usage. # Test: Search for all instances where `VectorIndexHNSWBuilder` is used in the codebase. rg --type cpp 'VectorIndexHNSWBuilder'Length of output: 867
Script:
#!/bin/bash # Description: Verify the implementation of the `addBlock` and `save` methods in `Index.cpp`. # Test: Retrieve the implementation details of `addBlock` and `save` methods. rg --type cpp -A 10 'void VectorIndexHNSWBuilder::addBlock' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp rg --type cpp -A 10 'void VectorIndexHNSWBuilder::save' dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cppLength of output: 1248
41-55: ReviewVectorIndexHNSWViewerclass introduction.The
VectorIndexHNSWViewerclass extendsVectorIndexViewerand introduces a staticviewmethod. Thesearchmethod has been updated to remove theSearchStatisticsparameter, focusing on query information and valid rows. Ensure that this change does not impact the functionality negatively and aligns with the intended design.Verification successful
Removal of
SearchStatisticsParameter VerifiedThe removal of the
SearchStatisticsparameter from thesearchmethod in theVectorIndexHNSWViewerclass does not appear to impact functionality negatively. There are no references toSearchStatisticsin the codebase, indicating it was not crucial to the method's operation. This change aligns with the intended design.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexHNSWViewer` class. # Test: Search for the class definition and usage. Expect: Proper integration with the new search method. rg --type cpp -A 5 $'class VectorIndexHNSWViewer'Length of output: 650
Script:
#!/bin/bash # Description: Search for the usage of the `search` method in the `VectorIndexHNSWViewer` class. # Test: Find all occurrences of `search` method calls. Expect: Understand how the method is used and ensure the change is safe. rg --type cpp 'search\(' -A 3Length of output: 22355
Script:
#!/bin/bash # Description: Search for any references to `SearchStatistics` in relation to the `search` method. # Test: Find all occurrences of `SearchStatistics` to understand its previous usage and impact. rg --type cpp 'SearchStatistics'Length of output: 32
Script:
#!/bin/bash # Description: Search for comments or documentation mentioning `SearchStatistics`. # Test: Find any comments or documentation that reference `SearchStatistics`. rg --type cpp 'SearchStatistics' -g '*.h' -g '*.cpp' -A 3 -B 3Length of output: 62
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)
30-77: ReviewVectorIndexCacheclass implementation.The
VectorIndexCacheclass uses an LRUCache for cachingVectorIndexViewerobjects. It includes mechanisms for cleaning outdated cache entries and managing shutdown operations. Ensure that the use of mutexes and condition variables is correct and that the cleaning logic is efficient and does not introduce race conditions.Verification successful
Integration and Usage of
VectorIndexCacheVerifiedThe
VectorIndexCacheclass is well-integrated within the codebase, with clear methods for managing its lifecycle. It is used in various components, including the server setup and context management, and its functionality is covered by unit tests.
Server.cpp: Sets the vector index cache in the global context.Context.cpp: Provides methods to set, get, and drop the vector index cache.- Test files confirm the cache's functionality is tested.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexCache` class. # Test: Search for the class definition and usage. Expect: Proper integration with cache management logic. rg --type cpp -A 5 $'class VectorIndexCache'Length of output: 1190
Script:
#!/bin/bash # Description: Search for the usage of `VectorIndexCache` to understand its integration and usage. # Test: Find all occurrences of `VectorIndexCache` to verify its integration. rg --type cpp 'VectorIndexCache'Length of output: 3577
dbms/src/TiDB/Schema/VectorIndex.h (2)
27-35: ReviewVectorIndexDefinitionstruct introduction.The
VectorIndexDefinitionstruct replaces previous enums and structures, encapsulating index properties. Ensure that the struct is comprehensive and correctly represents the necessary properties. The TODO comment suggests potential future additions, which should be tracked.Verification successful
The
VectorIndexDefinitionstruct is consistently integrated.The
VectorIndexDefinitionstruct is used across various components and is tested, indicating proper integration. The struct's usage in different contexts, including formatters for output, suggests it is well-incorporated into the codebase. No issues found with its current implementation.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation and usage of the `VectorIndexDefinition` struct. # Test: Search for the struct definition and usage. Expect: Consistent usage with new properties. rg --type cpp -A 5 $'struct VectorIndexDefinition'Length of output: 497
Script:
#!/bin/bash # Description: Search for usage of `VectorIndexDefinition` in the codebase to verify its integration. # Test: Find all occurrences of `VectorIndexDefinition` usage. rg --type cpp 'VectorIndexDefinition'Length of output: 3281
Line range hint
48-73:
Reviewfmt::formatterforVectorIndexDefinition.The
fmt::formatterspecialization forVectorIndexDefinitionhas been updated. Ensure that the formatting logic correctly reflects the new struct and that it integrates well with the existing logging or output mechanisms.dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)
Line range hint
33-41:
Refactor toVectorIndexBuilder: Ensure type support logic is correct.The
isSupportedTypemethod has been moved toVectorIndexBuilder. This refactoring aligns with the new class responsibilities. Verify that the logic for checking supported types remains correct and comprehensive.
59-73: Refactor toVectorIndexViewer: Ensure view logic is correct.The
viewmethod replacesloadand now usesVectorIndexFileProps. This refactoring is intended to improve clarity and modularity. Verify that the logic for viewing vector index files is correctly implemented and that all relevant cases are handled.
42-56: Refactor toVectorIndexBuilder: Check method signature and logic.The
createmethod now usesVectorIndexDefinitionPtrinstead ofVectorIndexInfo. This change enhances encapsulation and type safety. Ensure that all usages ofcreateare updated to match the new signature and that the switch-case logic correctly handles all necessary cases.Verification successful
Verification Successful:
createMethod Signature Updated CorrectlyThe usage of
VectorIndexBuilder::createhas been successfully updated to the new signature usingVectorIndexDefinitionPtr. All identified usages are consistent with this change.
dbms/src/Storages/DeltaMerge/File/DMFileWriter.h: Thedo_vector_indexparameter is correctly typed asTiDB::VectorIndexDefinitionPtr.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `create` method are updated to the new signature. # Test: Search for the `create` method usage. Expect: Only occurrences of the new signature. rg --type cpp -A 5 $'VectorIndexBuilder::create'Length of output: 1216
Script:
#!/bin/bash # Description: Check the type of `do_vector_index` in `DMFileWriter.h` to verify it matches the new signature. rg --type cpp 'do_vector_index' dbms/src/Storages/DeltaMerge/File/DMFileWriter.h -A 5 -B 5Length of output: 1700
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)
31-55: Introduction ofVectorIndexBuilder: Review class structure and methods.The
VectorIndexBuilderclass is introduced to handle the construction of vector indices. The methodscreate,isSupportedType,addBlock, andsaveare aligned with its responsibilities. Ensure that the class design supports efficient and correct index building.
58-90: Introduction ofVectorIndexViewer: Review class structure and methods.The
VectorIndexViewerclass is introduced to handle viewing vector index files. The methodsview,search, andgetare aligned with its responsibilities. Ensure that the class design supports efficient and correct index viewing.dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (3)
47-47: Update toVectorIndexViewerPtr: Ensure index interaction is correct.The
indexmember variable now usesVectorIndexViewerPtr, reflecting the new class structure. Verify that all interactions with the index are correctly updated to use the viewer.
56-56: UpdatecalcResultsByPackmethod signature: Ensure efficient result handling.The method now accepts a reference to a vector of
VectorIndexViewer::Key, improving efficiency by avoiding unnecessary copies. Ensure that all calls to this method are updated to match the new signature.
65-71: Update constructor: Ensure alignment with new indexing strategy.The constructor now takes
VectorIndexViewerPtrand a reference to a vector ofVectorIndexViewer::Key. This change aligns with the new indexing strategy. Ensure that all instances of this class are constructed using the updated parameters.dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)
93-98: Destructor implementation looks good.The destructor correctly handles shutting down the cleaner thread and joining it.
86-91: Verify thread management.Ensure that the cleaner thread is properly managed and terminated to prevent resource leaks.
Verification successful
Thread management is correctly implemented in
VectorIndexCache.The destructor of
VectorIndexCacheensures proper termination of the cleaner thread by setting a shutdown flag, notifying the condition variable, and joining the thread. This prevents resource leaks and ensures clean shutdown.
- Destructor:
VectorIndexCache::~VectorIndexCache()setsis_shutting_down = true, notifiesshutdown_cv, and callscleaner_thread.join().Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify proper thread management in the constructor of `VectorIndexCache`. # Test: Check for proper thread management and termination. rg --type cpp 'VectorIndexCache' -A 5Length of output: 16167
63-83: Ensure proper shutdown handling.The loop correctly handles shutdown signals, but ensure that any resources are properly released if exceptions occur.
Verification successful
Proper shutdown handling and thread management confirmed.
The
cleanOutdatedLoopmethod and its associated thread are correctly managed. The destructor properly sets the shutdown flag, notifies the condition variable, and joins the thread, ensuring a clean and safe shutdown process. No further action is needed.
- Destructor:
VectorIndexCache::~VectorIndexCache()properly joins thecleaner_thread.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify proper shutdown handling in the `cleanOutdatedLoop`. # Test: Check for any resources that might not be released upon exceptions. rg --type cpp 'cleanOutdatedLoop' -A 5Length of output: 1429
Script:
#!/bin/bash # Description: Check if `cleaner_thread` is properly managed in the `VectorIndexCache` class. # Test: Look for thread management (join or detach) in the destructor or elsewhere in the class. rg --type cpp 'cleaner_thread' -A 10Length of output: 1717
dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)
90-94: Addition ofVectorIndexFilePropslooks good.The new message is well-defined and should enhance the handling of vector index properties.
65-67: Verify backward compatibility.The change from
ColumnVectorIndexInfotoVectorIndexFilePropsmay affect backward compatibility. Ensure that the reserved field is correctly handled in the codebase.Verification successful
Backward compatibility maintained.
The change from
ColumnVectorIndexInfotoVectorIndexFilePropsdoes not impact backward compatibility, as there are no remaining references toColumnVectorIndexInfoin the codebase. The use of the reserved field 101 ensures that the field number is not reused.
- The reserved field 101 is correctly implemented in
dmfile.proto.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify backward compatibility for changes in `ColumnStat`. # Test: Check for usage of the reserved field and ensure compatibility. rg --type proto 'reserved 101' -A 5Length of output: 67
Script:
#!/bin/bash # Description: Verify backward compatibility for changes in `ColumnStat`. # Test: Check for usage of the reserved field and ensure compatibility. rg 'reserved 101' --glob '*.proto' -A 5Length of output: 540
Script:
#!/bin/bash # Description: Check for references to `ColumnVectorIndexInfo` in the codebase to ensure backward compatibility. # Test: Search for `ColumnVectorIndexInfo` in the codebase. rg 'ColumnVectorIndexInfo'Length of output: 26
dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)
96-96: Verify impact of access modifier change forserializeColumnFileSet.Changing the access modifier from public to private may affect accessibility. Ensure that this change is intentional and does not introduce any issues.
Verification successful
Access modifier change for
serializeColumnFileSetis safe.The method
serializeColumnFileSetis only used within theSerializer.cppfile and does not have external dependencies. Changing its access modifier to private improves encapsulation without affecting other parts of the codebase.
- No external usage of
serializeColumnFileSetfound outsideSerializer.cpp.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the impact of changing the access modifier for `serializeColumnFileSet`. # Test: Check for any unintended consequences of making `serializeColumnFileSet` private. rg --type cpp 'serializeColumnFileSet' -A 5Length of output: 4711
96-96: Verify impact of access modifier change forserializeSegment.Changing the access modifier from private to public may affect encapsulation. Ensure that this change is intentional and does not introduce any issues.
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)
119-119: Type consistency maintained.The update to use
VectorIndexViewer::Keyforrowidensures consistency with the new type usage across the file.
37-47: Consider the performance impact of the runtime check.The addition of a runtime check to ensure that
sorted_resultsis sorted improves robustness. However, this could introduce a performance overhead in critical paths. Consider evaluating the necessity of this check in production environments or providing a build configuration to disable it.dbms/src/Storages/DeltaMerge/DMContext.h (1)
136-137: Verify the usage of the newscan_contextparameter.The addition of the
scan_contextparameter enhances flexibility, but ensure that its usage is correctly handled throughout the codebase.dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)
75-75: Use builder pattern for vector index creation.The
vector_indexis now created usingVectorIndexBuilder::create, which enhances flexibility. Ensure that the builder pattern is correctly implemented and utilized.
101-101: Update member variable type.The
vector_indexmember variable type has been updated toVectorIndexBuilderPtr. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Update member variable type verified.
The
vector_indexmember variable type has been updated toVectorIndexBuilderPtrin the relevant parts of the codebase. No inconsistencies were found in the application of this change.
- The
vector_indexmember variable inDMFileWriter.hhas been correctly updated toVectorIndexBuilderPtr.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vector_index` member variable are updated to use `VectorIndexBuilderPtr`. # Test: Search for `vector_index` usage. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'vector_index'Length of output: 30888
57-57: Update constructor parameter type.The constructor now uses
TiDB::VectorIndexDefinitionPtrinstead ofTiDB::VectorIndexInfoPtr. Ensure that all instances of this class are updated to pass the correct type.
163-163: Update method parameter type.The
addStreamsmethod now usesTiDB::VectorIndexDefinitionPtrinstead ofTiDB::VectorIndexInfoPtr. Ensure that all calls to this method are updated to pass the correct type.Verification successful
All
addStreamsmethod calls use the correct parameter type.The
addStreamsmethod now correctly usesTiDB::VectorIndexDefinitionPtrfor thedo_vector_indexparameter, and all relevant calls to this method have been updated accordingly. The verification confirms that the type change has been properly implemented.
DMFileWriter.cpp: The call toaddStreamsusescd.vector_index, which is correctly defined asTiDB::VectorIndexDefinitionPtr.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all calls to `addStreams` use the correct type for `do_vector_index`. # Test: Search for `addStreams` calls. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'addStreams'Length of output: 1531
Script:
#!/bin/bash # Use ast-grep to find all calls to the addStreams method and check the type of the do_vector_index parameter. ast-grep --lang cpp --pattern 'addStreams($_, $_, $_, $_)'Length of output: 60
Script:
#!/bin/bash # Use rg to find all calls to the addStreams method and include context lines to check parameter types. rg --type cpp 'addStreams' -A 3 -B 2Length of output: 1678
Script:
#!/bin/bash # Search for the definition of `cd` to confirm the type of `cd.vector_index`. rg --type cpp 'cd\s*=' -A 3 -B 2Length of output: 11390
dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (2)
95-95: Update member variable type.The
vector_indexmember variable type has been updated toTiDB::VectorIndexDefinitionPtr. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Verification complete: Member variable type updated consistently.
The
vector_indexmember variable type change toTiDB::VectorIndexDefinitionPtrinColumnDefinehas been consistently applied across the codebase. No discrepancies found.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vector_index` member variable in `ColumnDefine` are updated to use `TiDB::VectorIndexDefinitionPtr`. # Test: Search for `vector_index` usage in `ColumnDefine`. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'ColumnDefine'Length of output: 324094
102-102: Update constructor parameter type.The constructor now uses
TiDB::VectorIndexDefinitionPtrforvector_index_. Ensure that all instances of this struct are updated to pass the correct type.Verification successful
No issues found with
ColumnDefineconstructor usage. The search did not find any instances ofColumnDefinebeing instantiated with thevector_index_parameter, indicating that the codebase does not directly use this parameter in constructor calls.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all instantiations of the `ColumnDefine` struct use the correct type for `vector_index_`. # Test: Search for `ColumnDefine` instantiations. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'ColumnDefine'Length of output: 324094
Script:
#!/bin/bash # Search for all constructor calls for `ColumnDefine` that include the `vector_index_` parameter. ast-grep --lang cpp --pattern 'ColumnDefine($_, $_, $_, $_, $_)' --jsonLength of output: 76
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (7)
96-96: Override destructor.The destructor now explicitly uses the
overridekeyword, which improves clarity and ensures that the base class has a virtual destructor.
108-108: Override method signature.The
readmethod now explicitly uses theoverridekeyword, which improves clarity and ensures that the base class has a virtual method.
144-144: Declare method without implementation.The
readByIndexReadermethod is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.
148-148: Declare method without implementation.The
readByFollowingOtherColumnsmethod is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.
182-182: Update member variable type.The
vec_indexmember variable type has been updated toVectorIndexViewerPtr. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Member variable type update verified.
The
vec_indexmember variable type has been correctly updated toVectorIndexViewerPtrin all relevant parts of the codebase. No further action is needed.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `vec_index` member variable are updated to use `VectorIndexViewerPtr`. # Test: Search for `vec_index` usage. Expect: Only occurrences with the new type. rg --type cpp -A 5 $'vec_index'Length of output: 15367
186-186: Update member variable name and purpose.The
valid_rows_after_searchvariable has been replaced withsorted_results, indicating a change in how results are managed. Ensure that this change is reflected in all relevant parts of the codebase.Verification successful
Change from
valid_rows_after_searchtosorted_resultsis correctly reflected.The
sorted_resultsvariable is used consistently across the codebase, and there are no remaining references tovalid_rows_after_search. The transition appears to be complete and correctly implemented.
- Files checked:
DMFileWithVectorIndexBlockInputStream.hVectorColumnFromIndexReader.hScripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of the `sorted_results` member variable are updated accordingly. # Test: Search for `sorted_results` usage. Expect: Only occurrences with the new name and purpose. rg --type cpp -A 5 $'sorted_results'Length of output: 11304
113-113: Declare method without implementation.The
readImplmethod is now declared without implementation in the header file, suggesting a move towards deferred implementation. Ensure that the implementation is provided in the corresponding source file.Verification successful
Implementation Found for
readImplMethodThe
readImplmethod declared inDMFileWithVectorIndexBlockInputStream.his implemented in the corresponding source fileDMFileWithVectorIndexBlockInputStream.cpp. This addresses the concern regarding the method's implementation. No further action is required.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `readImpl` is provided in the corresponding source file. # Test: Search for `readImpl` implementation. Expect: Implementation in the corresponding source file. rg --type cpp -A 5 $'readImpl'Length of output: 88199
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (5)
38-54: LGTM!The
getUSearchMetricKindfunction correctly maps distance metrics and handles unsupported metrics with a runtime check.
58-64: Constructor looks good!The
VectorIndexHNSWBuilderconstructor correctly initializes the index with the appropriate metric and checks the index kind.
Line range hint
67-106: Robust method implementation!The
addBlockmethod effectively handles memory allocation and data validity checks, ensuring robust data addition to the index.
110-113: Save method is well-implemented!The
savemethod effectively handles errors using runtime checks, ensuring the index is saved correctly.
Line range hint
134-198: Comprehensive search method!The
searchmethod includes detailed checks and exception handling, ensuring robust search functionality. Note the TODO comment regardingefSearchsupport.dbms/src/Common/LRUCache.h (1)
74-80: Thread-safecontainsfunction!The
containsfunction correctly checks for the presence of a key using a lock guard, ensuring thread safety without altering the LRU order.dbms/src/Storages/S3/FileCache.h (4)
228-233: Efficient caching strategy!The
downloadFileForLocalReadmethod effectively ensures that files in use are not evicted prematurely, enhancing caching strategy.
249-253: Robust retrieval strategy!The
getOrWaitmethod provides a robust mechanism to ensure file availability, handling waiting and foreground downloading effectively.
256-256: Foreground download method looks good!The
fgDownloadmethod is well-implemented to handle file downloads in the foreground, ensuring correct processing.
75-76: VerifywaitForNotEmptyimplementation.Ensure that the
waitForNotEmptymethod correctly blocks until the segment is populated, maintaining thread safety.Verification successful
waitForNotEmptyMethod Implementation VerifiedThe
waitForNotEmptymethod inFileSegmentis correctly implemented to block until the segment is populated, using mutexes and condition variables to ensure thread safety. The method includes a timeout to prevent indefinite blocking.
- File:
dbms/src/Storages/S3/FileCache.cpp- Implementation Details: Uses a mutex (
mtx) and a condition variable (cv_ready) to wait for the status to change fromEmpty.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `waitForNotEmpty` method in `FileSegment`. # Test: Check if `waitForNotEmpty` is implemented and used correctly. ast-grep --lang cpp --pattern $'Status FileSegment::waitForNotEmpty() { $$$ }'Length of output: 81
Script:
#!/bin/bash # Description: Search for any method named `waitForNotEmpty` in the entire codebase to locate its implementation or usage. # Search for `waitForNotEmpty` method across all C++ files. rg 'waitForNotEmpty' --type cpp -A 5Length of output: 2284
dbms/src/TiDB/Schema/TiDB.h (1)
209-209: Verify the impact of changingvector_indextype.The type of
vector_indexin theColumnInfostruct has been changed fromVectorIndexInfoPtrtoVectorIndexDefinitionPtr. This change may affect how vector indices are used throughout the codebase. Ensure that all usages ofvector_indexare updated to accommodate this new type.dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
64-64: EnsureVectorIndexBuilder::isSupportedTypeis correctly implemented.The change from
VectorIndex::isSupportedTypetoVectorIndexBuilder::isSupportedTypesuggests a restructuring of vector index handling. Verify that the new method is implemented correctly and covers all necessary type checks.Verification successful
Verification of
VectorIndexBuilder::isSupportedTypeImplementationThe
VectorIndexBuilder::isSupportedTypemethod is correctly implemented. It checks for nullable types and supportsDataTypeFloat32. This implementation aligns with its intended use inDMFileWriter.cppto ensure only supported types are processed for vector indexing.
- File:
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp- Method:
bool VectorIndexBuilder::isSupportedType(const IDataType & type)Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the implementation of `VectorIndexBuilder::isSupportedType`. # Test: Search for the implementation of `isSupportedType`. Expect: Correct type checks. ast-grep --lang cpp --pattern $'class VectorIndexBuilder { $$$ static bool isSupportedType($_) { $$$ } $$$ }'Length of output: 469
Script:
#!/bin/bash # Search for the definition and usage of `isSupportedType` in the codebase rg 'isSupportedType' --context 5 --type cpp # Search for the definition of the `VectorIndexBuilder` class rg 'class VectorIndexBuilder' --context 10 --type cppLength of output: 5909
dbms/src/Storages/DeltaMerge/ScanContext.h (2)
93-93: Initializetotal_vector_idx_load_from_s3correctly.The new atomic variable
total_vector_idx_load_from_s3is introduced to track vector index load times from S3. Ensure that it is correctly initialized and used throughout the class.
150-150: Verify integration oftotal_vector_idx_load_from_s3in serialization and merging.The new variable
total_vector_idx_load_from_s3is integrated into serialization, deserialization, and merging logic. Verify that these changes are correctly implemented and that the variable is accurately tracked and aggregated.Also applies to: 202-202, 262-262, 314-314
Verification successful
Integration of
total_vector_idx_load_from_s3is correctly implemented.The variable
total_vector_idx_load_from_s3is properly integrated into serialization and merging logic. It is serialized and deserialized usingtiflash_scan_context_pb, and its value is aggregated during merging. Additionally, its behavior is verified through multiple tests. No issues found with its integration.
- Serialization and Deserialization: Properly handled via
tiflash_scan_context_pb.- Merging Logic: Correctly aggregated with other contexts.
- Testing: Adequately tested in
gtest_dm_vector_index.cpp.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the integration of `total_vector_idx_load_from_s3` in serialization, deserialization, and merging. # Test: Search for `total_vector_idx_load_from_s3` usage in serialization and merging. Expect: Correct integration. rg --type cpp --word-regexp 'total_vector_idx_load_from_s3'Length of output: 2866
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)
125-168: Optimize Filter Assignment inreadImplMethod.The filter assignment in
readImplcould be optimized by using a more efficient algorithm or data structure, especially ifsorted_resultsis large.- auto it = std::lower_bound(sorted_results.begin(), sorted_results.end(), start_offset); - while (it != sorted_results.end()) - { - auto rowid = *it; - if (rowid >= max_rowid_exclusive) - break; - filter[rowid - start_offset] = 1; - ++it; - } + for (auto rowid : sorted_results) + { + if (rowid >= start_offset && rowid < max_rowid_exclusive) + filter[rowid - start_offset] = 1; + }Likely invalid or redundant comment.
34-81: Ensure Proper Error Handling for Constructor Preconditions.The constructor of
DMFileWithVectorIndexBlockInputStreamusesRUNTIME_CHECKto validate preconditions. Ensure that these checks are appropriate and that exceptions are handled correctly if they fail.
217-255: Ensure Consistency inreadByFollowingOtherColumns.The method
readByFollowingOtherColumnsassembles a block usingheader_layout. Ensure consistency in the layout and verify that all necessary columns are present.Verification successful
Consistency in
readByFollowingOtherColumnsis ensured.The constructor of
DMFileWithVectorIndexBlockInputStreamincludes runtime checks to verify thatheader_layoutcontains all necessary columns. These checks ensure that the layout used inreadByFollowingOtherColumnsis consistent. No further action is needed.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify consistency in the layout of `readByFollowingOtherColumns`. # Test: Ensure that all necessary columns are present in the layout. # Search for the usage of `header_layout` in the method. rg --type cpp --context 5 'header_layout' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'Length of output: 1937
273-384: Verify S3 File Cache Configuration inloadVectorIndex.The method
loadVectorIndexrelies on S3 file cache. Ensure that the configuration is correct and that potential errors are handled gracefully.Verification successful
S3 File Cache Configuration and Error Handling Verified
The
loadVectorIndexmethod correctly verifies the presence of the S3 file cache and handles potential errors by retrying downloads and throwing exceptions if necessary. The configuration and error handling are properly implemented.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify S3 file cache configuration in `loadVectorIndex`. # Test: Ensure that the configuration is correct and errors are handled. # Search for S3 file cache usage in the method. rg --type cpp --context 5 'S3' -- 'dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp'Length of output: 1961
386-475: Ensure Correctness of Vector Search Logic inloadVectorSearchResult.The method
loadVectorSearchResultperforms a vector search and updates the pack filter. Verify the correctness of the search logic and the impact on the pack filter.dbms/src/Common/FailPoint.cpp (1)
109-109: Verify Integration of New Fail Pointfile_cache_fg_download_fail.Ensure that the new fail point
file_cache_fg_download_failis correctly integrated and used in the codebase. Verify that it serves its intended purpose without unintended side effects.Verification successful
Fail Point
file_cache_fg_download_failis Correctly IntegratedThe fail point
file_cache_fg_download_failis properly declared, used, and tested within the codebase, indicating correct integration. No issues were found regarding its implementation or usage.
- Declaration:
dbms/src/Storages/S3/FileCache.cpp- Usage: Triggered in
FileCache.cppand tested ingtest_dm_vector_index.cppScripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify integration of new fail point `file_cache_fg_download_fail`. # Test: Ensure the fail point is correctly integrated and used. # Search for the usage of `file_cache_fg_download_fail` in the codebase. rg --type cpp 'file_cache_fg_download_fail'Length of output: 758
dbms/src/Interpreters/Context.h (1)
395-395: Verify Impact of Parameter Name Change insetVectorIndexCache.The parameter name change from
cache_size_in_bytestocache_entitiesmay impact how the method is used. Ensure that all calls to this method are updated accordingly and that the new name accurately reflects its purpose.Verification successful
Parameter Name Change Verified:
setVectorIndexCacheThe parameter name change from
cache_size_in_bytestocache_entitiesis correctly implemented. The calls to this method have been updated, and the new name accurately reflects its purpose, as evidenced by the context and comments in the code.
gtest_dm_vector_index.cpp: Uses a literal value for cache entities.Server.cpp: Uses a configuration value for cache entities with appropriate comments explaining the change.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify impact of parameter name change in `setVectorIndexCache`. # Test: Ensure all calls to this method are updated and the new name reflects its purpose. # Search for calls to `setVectorIndexCache` in the codebase. rg --type cpp 'setVectorIndexCache'Length of output: 420
Script:
#!/bin/bash # Description: Gather context around calls to `setVectorIndexCache` to verify parameter name change. # Get 5 lines of context around each call to understand usage and purpose. # Get context around the call in `gtest_dm_vector_index.cpp`. rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp # Get context around the call in `Server.cpp`. rg --type cpp 'setVectorIndexCache' -A 5 -B 5 dbms/src/Server/Server.cppLength of output: 1392
dbms/src/Storages/S3/FileCache.cpp (1)
676-700: Foreground download logic is well-implemented.The method effectively manages foreground downloads with appropriate logging and exception handling. The use of fail points for testing is commendable.
dbms/src/TiDB/Schema/TiDB.cpp (2)
411-417: Improved type safety and error handling for vector index.The use of
tipb::VectorIndexKindandtipb::VectorDistanceMetric, along with runtime checks, enhances type safety and robustness.
475-499: Enhanced clarity and robustness in vector index parsing.The explicit parsing and structured initialization improve clarity and error handling in vector index management.
dbms/src/Interpreters/Context.cpp (4)
57-57: Include Directive Update Approved.The change from
VectorIndex.htoVectorIndexCache.haligns with modifications in cache handling and improves clarity.
1388-1388: Simplified Cache Reset Approved.The direct call to
reset()on the shared pointer enhances clarity and conciseness.
1410-1410: Simplified Cache Reset Approved.The direct call to
reset()on the shared pointer enhances clarity and conciseness.
1391-1397: Method Signature Change Approved.The parameter change from
cache_size_in_bytestocache_entitiesaligns with the new focus on entity count. Ensure all usages of this method are updated accordingly.Verification successful
Method Signature Change Verified.
All usages of
setVectorIndexCachehave been updated to reflect the new parametercache_entities. The changes are consistent across the codebase.
dbms/src/Server/Server.cpp: Usage withvec_index_cache_entities.dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Usage with1000.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of `setVectorIndexCache` to ensure they match the new parameter `cache_entities`. # Test: Search for the method usage. Expect: Only occurrences of the new parameter name. rg --type cpp -A 3 $'setVectorIndexCache'Length of output: 1274
dbms/src/Server/Server.cpp (1)
1442-1445: Verify the impact of changing vector index cache configuration.The change from a byte size to an entity count for the vector index cache could have significant implications on memory management and performance. Ensure that this change is well-documented and tested to confirm it meets the desired performance and memory usage goals.
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (5)
93-96: Addition ofcdVec()method is appropriate.The
cdVec()method provides a useful utility for defining vector columns, enhancing code reusability and readability.
884-887: Update toVectorIndexDefinitionis appropriate.The change from
VectorIndexInfotoVectorIndexDefinitionenhances type safety and aligns with the refactoring efforts. Ensure that all usages ofVectorIndexDefinitionare correctly updated throughout the codebase.
1427-1907: Comprehensive test cases for S3 and file cache interactions.The test cases cover critical scenarios, including cache eviction and S3 failures, ensuring robustness. Consider verifying the test coverage to ensure all edge cases are addressed.
Verification successful
Comprehensive test cases for S3 and file cache interactions confirmed.
The test cases in
VectorIndexSegmentOnS3Testcover various scenarios, ensuring robust handling of S3 and file cache interactions. The coverage appears comprehensive.
FileCacheNotEnabledReadWithoutIndexReadFromIndexFileCacheEvictFileCacheEvictAndVectorCacheDropFileCacheDeletedFileCacheDeletedAndVectorCacheDropConcurrentDownloadFromS3S3FailureScripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify test coverage for S3 and file cache interactions. # Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage. rg --type cpp --type c++ --type c 'TEST_F\(VectorIndexSegmentOnS3Test'Length of output: 100
Script:
#!/bin/bash # Description: Verify test coverage for S3 and file cache interactions. # Test: Search for test cases related to S3 and file cache. Expect: Comprehensive coverage. rg --type cpp 'TEST_F\(VectorIndexSegmentOnS3Test'Length of output: 1111
1207-1290: S3 interaction and file cache management setup is well-implemented.The
SetUpandTearDownmethods ensure proper configuration and cleanup of S3 and file cache environments, which are essential for the tests. Verify that these configurations are consistent with the overall system requirements.Verification successful
S3 and file cache management setup is consistent across the codebase.
The configurations in
gtest_dm_vector_index.cppalign with the widespread use ofS3::ClientFactory::instance()andFileCache::initialize()in other parts of the codebase, indicating a consistent approach. This setup is well-implemented and matches the overall system requirements.
- Instances found in files like
FlashService.cpp,S3LockService.cpp,Server.cpp, and others confirm the consistency.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify S3 configuration and file cache management in the codebase. # Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase. rg --type cpp --type c++ --type c 'S3::ClientFactory|FileCache::initialize'Length of output: 105
Script:
#!/bin/bash # Description: Verify S3 configuration and file cache management in the codebase. # Test: Search for S3 configuration and file cache management. Expect: Consistent configuration across the codebase. rg --type cpp --type c 'S3::ClientFactory|FileCache::initialize'Length of output: 9671
205-208: Update toVectorIndexDefinitionis appropriate.The change from
VectorIndexInfotoVectorIndexDefinitionenhances type safety and aligns with the refactoring efforts. Ensure that all usages ofVectorIndexDefinitionare correctly updated throughout the codebase.Verification successful
VectorIndexDefinitionis consistently used across the codebase.The transition from
VectorIndexInfotoVectorIndexDefinitionis well-implemented throughout the relevant files. This change enhances type safety and aligns with the refactoring objectives.
dbms/src/TiDB/Schema/VectorIndex.h: Definition and usage ofVectorIndexDefinition.dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp: Multiple instances ofVectorIndexDefinitionusage in tests.- Other related files in
DeltaMergeandTiDBmodules.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify all usages of `VectorIndexDefinition` in the codebase. # Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase. rg --type cpp --type c++ --type c 'VectorIndexDefinition'Length of output: 85
Script:
#!/bin/bash # Description: Verify all usages of `VectorIndexDefinition` in the codebase. # Test: Search for `VectorIndexDefinition` usage. Expect: Consistent usage across the codebase. rg --type cpp 'VectorIndexDefinition'Length of output: 3281
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @dbms/src/Interpreters/Context.cpp:
- Line 1388: The three cache dropters are inconsistent: dropMarkCache calls
shared->mark_cache->reset() (clearing contents but keeping the shared_ptr),
while dropMinMaxIndexCache and dropVectorIndexCache call
shared->minmax_index_cache.reset() and shared->vector_index_cache.reset()
(destroying the shared_ptr), causing getMarkCache to return an empty object but
the others to return nullptr; make them consistent by choosing one semantic and
applying it to all three—either change dropMarkCache to reset the pointer
(shared->mark_cache.reset()) to match the other two, or change
dropMinMaxIndexCache and dropVectorIndexCache to clear contents via ->reset() to
match dropMarkCache; update the corresponding getMarkCache, getMinMaxIndexCache,
and getVectorIndexCache callers/assumptions if needed.
In @dbms/src/Storages/DeltaMerge/Index/VectorIndex.h:
- Around line 58-59: Fix the typo in the class/file comment for VectorIndex:
replace "may nor may not read the whole content of the file into memory" with
"may or may not read the whole content of the file into memory" so the comment
for the VectorIndex view is grammatically correct and clear.
In @dbms/src/Storages/DeltaMerge/ScanContext.h:
- Around line 28-31: The header opens a Clang diagnostic push/ignore block (the
__clang__ / #pragma clang diagnostic push and ignored
"-Wdeprecated-declarations" in ScanContext.h) but never restores it, causing
suppression to leak to including TUs; add a matching clang diagnostic pop (i.e.,
the corresponding #pragma clang diagnostic pop) right before the file/namespace
is closed (or remove the push/ignore pair if unused) so the suppression is
properly scoped and does not affect other translation units.
In @dbms/src/Storages/S3/FileCache.cpp:
- Around line 236-297: Add a clear header comment to FileCache::getOrWait
explaining its error behavior: unlike FileCache::get which returns nullptr on
reservation failure, getOrWait will throw an Exception when it cannot reserve
space (and is intended to be used by downloadFileForLocalRead which handles that
exception). Mention the rationale (caller is expected to retry or handle via
try/catch), the specific exception type (ErrorCodes::S3_ERROR) and that callers
of getOrWait must be prepared for exceptions rather than nullptr, and reference
FileCache::get for the alternative semantics.
🧹 Nitpick comments (8)
dbms/src/Storages/DeltaMerge/Remote/Serializer.h (2)
87-95: PublicserializeSegmentsignature expansion is likely ABI-breaking; consider an options struct to stop signature bloat.Making this
publicand adding multiple parameters (segment_epoch,segment_range,read_ranges,MemTrackerWrapper &) will ripple across the codebase and can break ABI for downstream linkage. If this API is expected to evolve further (seems likely with vector-index plumbing), consider grouping the new fields into a single params/options type to reduce churn and make call sites self-documenting.Also please ensure the header declaration exactly matches the out-of-line definition and all call sites were updated.
96-100: Good:serializeColumnFileSetkept private; consider forward-declaringMemTrackerWrapperto reduce header coupling.The explicit
private:and the helper extraction look reasonable. SinceMemTrackerWrapperappears in multiple declarations in this header, if it’s only used by reference, a forward declaration (and moving the include to the.cpp) can reduce rebuild impact—if nothing else in this header truly needsFlash/Mpp/TrackedMppDataPacket.h.dbms/src/Storages/S3/FileCache.cpp (2)
341-341: Consider documenting the lock type change.The change from
lock_guardtounique_lockin theremove()method is subtle but likely necessary for consistency with other methods. However, it's not immediately clear why this change was needed as part of this PR.Do the other methods that were also changed to use
unique_lockrequire early unlock or additional lock operations? If not, consider adding a brief comment explaining whenunique_lockvslock_guardshould be used.
68-107: Extract the hardcoded timeout values into configurable parameters.The 300-second timeout for waiting on file segments is hardcoded (line 95), as is the 30-second per-iteration timeout (line 83). While the comment "Snapshot time is 300s" references the snapshot interval concept, the actual timeout value lacks configuration. This contrasts with other parts of the S3 codebase—S3GCManager uses configurable
mark_delete_timeout_secondsand PocoHTTPClient usesConnectionTimeouts—and with FileCache's own pattern of configurable parameters likedt_filecache_max_downloading_count_scaleanddt_filecache_min_age_seconds.Consider making both timeouts configurable through Settings or StorageRemoteCacheConfig to support varying operational requirements without code changes.
dbms/src/Common/LRUCache.h (1)
74-80: Consider lock type consistency and const correctness.The new non-const
contains()method has two minor style inconsistencies:
Lock type variance: This method uses
std::lock_guardwhile other methods (e.g.,get(),set()) usestd::scoped_lock. Although functionally equivalent for single-mutex locking, consistency aids maintainability.Const correctness: Since this method doesn't mutate LRU order or internal state, it could be marked
constto match the existing const overload at Line 195-199, eliminating the need for two separate implementations.♻️ Proposed refinement
If a non-const overload is genuinely required (e.g., for template resolution), at minimum align the lock type:
- bool contains(const Key & key) + bool contains(const Key & key) const { - std::lock_guard cache_lock(mutex); + std::scoped_lock cache_lock(mutex); return cells.contains(key); }Otherwise, remove this overload and rely solely on the existing const version.
dbms/src/Server/Server.cpp (1)
1442-1445: LGTM! Cache unit changed from bytes to entity count.The refactor from
vec_index_cache_size(bytes) tovec_index_cache_entities(count) with a default of 1000 is reasonable. The updated comment correctly notes that mmap delegates memory management to the OS. The conditional check prevents setting the cache when the count is zero.Optional: Consider upper bound validation.
While the zero-check prevents misconfiguration, there's no upper bound validation. An extremely large value (e.g., millions) could cause issues. Consider adding a sanity check or logging a warning for unusually large values.
♻️ Optional: Add upper bound validation
size_t vec_index_cache_entities = config().getUInt64("vec_index_cache_entities", 1000); +if (vec_index_cache_entities > 100000) +{ + LOG_WARNING(log, "vec_index_cache_entities is set to a very large value: {}, consider reducing it", vec_index_cache_entities); +} if (vec_index_cache_entities) global_context->setVectorIndexCache(vec_index_cache_entities);dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)
66-76: Consider usingstd::shared_lockfor read-only cache lookups.The
getOrSetmethod acquires an exclusive lock (std::scoped_lock) just to insert intofiles_to_check. If the file path is already in the set, a shared lock would suffice for the lookup. However, sincestd::unordered_set::insertis a modifying operation, the current approach is correct for the insert case.One potential optimization: check if the key exists first with a shared lock, then upgrade to exclusive only if needed. This could reduce contention under high read loads.
♻️ Optional optimization for reduced contention
template <typename LoadFunc> Cache::MappedPtr getOrSet(const Cache::Key & file_path, LoadFunc && load) { { - std::scoped_lock lock(mu); - files_to_check.insert(file_path); + // Try shared lock first to check existence + { + std::shared_lock read_lock(mu); + if (files_to_check.count(file_path) > 0) + goto cache_lookup; + } + // Upgrade to exclusive lock for insertion + { + std::scoped_lock write_lock(mu); + files_to_check.insert(file_path); + } } - +cache_lookup: auto result = cache.getOrSet(file_path, load); return result.first; }dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (1)
305-329: Consider making the retry count configurable.The hardcoded retry count of 3 for S3 downloads works but could be made configurable for different deployment scenarios. This is a minor improvement that could be deferred.
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (36)
contrib/tipbdbms/src/Common/FailPoint.cppdbms/src/Common/LRUCache.hdbms/src/Interpreters/Context.cppdbms/src/Interpreters/Context.hdbms/src/Server/Server.cppdbms/src/Storages/DeltaMerge/DMContext.hdbms/src/Storages/DeltaMerge/DeltaMergeDefines.hdbms/src/Storages/DeltaMerge/File/ColumnStat.hdbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cppdbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.hdbms/src/Storages/DeltaMerge/File/DMFileWriter.cppdbms/src/Storages/DeltaMerge/File/DMFileWriter.hdbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cppdbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.hdbms/src/Storages/DeltaMerge/File/dtpb/dmfile.protodbms/src/Storages/DeltaMerge/Index/RSIndex.hdbms/src/Storages/DeltaMerge/Index/VectorIndex.cppdbms/src/Storages/DeltaMerge/Index/VectorIndex.hdbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cppdbms/src/Storages/DeltaMerge/Index/VectorIndexCache.hdbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cppdbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.hdbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.hdbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cppdbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.hdbms/src/Storages/DeltaMerge/Remote/Serializer.hdbms/src/Storages/DeltaMerge/ScanContext.hdbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cppdbms/src/Storages/S3/FileCache.cppdbms/src/Storages/S3/FileCache.hdbms/src/Storages/S3/FileCachePerf.cppdbms/src/Storages/S3/FileCachePerf.hdbms/src/TiDB/Schema/TiDB.cppdbms/src/TiDB/Schema/TiDB.hdbms/src/TiDB/Schema/VectorIndex.h
💤 Files with no reviewable changes (1)
- dbms/src/Storages/DeltaMerge/Index/RSIndex.h
🧰 Additional context used
🧬 Code graph analysis (18)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (2)
dbms/src/Storages/DeltaMerge/ScanContext.h (1)
DB(33-355)dbms/src/Storages/S3/FileCachePerf.h (1)
DB(24-37)
dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (3)
dbms/src/TiDB/Schema/VectorIndex.h (1)
TiDB(24-45)dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1)
VectorIndexBuilder(32-56)dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
create(42-57)create(42-42)
dbms/src/Storages/DeltaMerge/DMContext.h (1)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (2)
scan_context(1393-1406)scan_context(1393-1393)
dbms/src/Storages/S3/FileCachePerf.cpp (1)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (1)
file_cache(1341-1349)
dbms/src/Interpreters/Context.h (1)
dbms/src/Interpreters/Context.cpp (2)
setVectorIndexCache(1391-1398)setVectorIndexCache(1391-1391)
dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (2)
VectorIndexBuilder(32-56)VectorIndexViewer(60-90)
dbms/src/TiDB/Schema/VectorIndex.h (1)
dbms/src/TiDB/Schema/TiDB.h (1)
TiDB(47-289)
dbms/src/Storages/S3/FileCachePerf.h (1)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (1)
DB(24-37)
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (2)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (1)
VectorIndexViewer(60-90)dbms/src/Common/ColumnsHashing.h (1)
Key(529-550)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (2)
view(116-132)view(116-116)
dbms/src/Storages/S3/FileCache.cpp (2)
dbms/src/Storages/S3/tests/gtest_filecache.cpp (16)
file_cache(185-199)file_cache(185-185)file_cache(245-245)file_cache(269-269)file_cache(286-286)file_cache(308-308)file_cache(344-344)file_cache(446-446)file_cache(465-465)file_cache(484-484)file_cache(503-503)file_cache(522-522)file_cache(541-541)file_cache(560-560)file_cache(579-579)file_cache(598-598)dbms/src/Common/FailPoint.cpp (4)
lock(208-215)lock(210-210)lock(217-221)lock(219-219)
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (1)
VectorIndexCache(30-77)
dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (1)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
isSupportedType(33-40)isSupportedType(33-33)
dbms/src/Storages/S3/FileCache.h (1)
dbms/src/Storages/S3/FileCache.cpp (10)
waitForNotEmpty(68-107)waitForNotEmpty(68-68)downloadFileForLocalRead(160-176)downloadFileForLocalRead(160-162)getOrWait(236-297)getOrWait(236-236)bgDownload(664-674)bgDownload(664-664)fgDownload(676-700)fgDownload(676-676)
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (4)
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (1)
DMFileWithVectorIndexBlockInputStream(53-81)dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (4)
get(202-206)get(202-202)view(116-132)view(116-116)dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (2)
read(79-137)read(79-79)dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
view(59-74)view(59-59)
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (1)
dbms/src/Columns/ColumnArray.h (2)
create(53-56)create(58-58)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (2)
dbms/src/Storages/DeltaMerge/Filter/RSOperator.cpp (2)
wrapWithANNQueryInfo(90-93)wrapWithANNQueryInfo(90-90)dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (2)
get(202-206)get(202-202)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (3)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (1)
VectorIndexHNSWBuilder(27-39)dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (2)
view(59-74)view(59-59)dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/usearch_index_dense.h (2)
metric(822-841)metric(1977-1981)
🪛 Clang (14.0.6)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h
[error] 24-24: variable 'DB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 24-24: variable name 'DB' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
dbms/src/Storages/S3/FileCachePerf.cpp
[error] 20-20: variable 'file_cache' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
dbms/src/TiDB/Schema/VectorIndex.h
[error] 24-24: variable 'TiDB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h
[error] 27-27: variable 'DB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 27-27: variable name 'DB' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
dbms/src/Storages/S3/FileCachePerf.h
[error] 24-24: variable 'DB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 24-24: variable name 'DB' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp
[error] 20-20: variable 'vector_search' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
dbms/src/Storages/S3/FileCache.cpp
[error] 59-59: do not declare C-style arrays, use std::array<> instead
(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp
[error] 49-49: do not declare C-style arrays, use std::array<> instead
(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)
[error] 50-50: do not declare C-style arrays, use std::array<> instead
(cppcoreguidelines-avoid-c-arrays,-warnings-as-errors)
[error] 503-503: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 570-570: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1202-1202: constructor does not initialize these fields: page_id_allocator, storage_path_pool, storage_pool, table_columns, settings, wn_segment
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 1207-1207: method 'SetUp' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 1207-1207: invalid case style for method 'SetUp'
(readability-identifier-naming,-warnings-as-errors)
[error] 1209-1209: variable 'force_use_dmfile_format_v3' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1249-1249: variable 'file_cache_config' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1276-1276: method 'TearDown' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 1276-1276: invalid case style for method 'TearDown'
(readability-identifier-naming,-warnings-as-errors)
[error] 1278-1278: variable 'force_use_dmfile_format_v3' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1288-1288: variable 's3_client' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1295-1295: 4 adjacent parameters of 'createComputeNodeStream' of similar type ('const int &') are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 1295-1295: parameter 'write_node_segment' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1296-1296: parameter 'columns_to_read' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1297-1297: parameter 'filter' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1298-1298: parameter 'read_scan_context' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1343-1343: variable 'file_cache' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1351-1351: method 'prepareWriteNodeStable' can be made static
(readability-convert-member-functions-to-static,-warnings-as-errors)
[error] 1354-1354: variable 'block' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 1369-1369: parameter 'ref_vec' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1370-1370: 2 adjacent parameters of 'computeNodeANNQuery' of similar type are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 1370-1370: parameter 'top_k' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1371-1371: parameter 'read_scan_context' is unused
(misc-unused-parameters,-warnings-as-errors)
[error] 1408-1408: redundant access specifier has the same accessibility as the previous access specifier
(readability-redundant-access-specifiers,-warnings-as-errors)
[error] 1410-1410: member variable 'page_id_allocator' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1411-1411: member variable 'storage_path_pool' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1412-1412: member variable 'storage_pool' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1413-1413: member variable 'table_columns' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1414-1414: member variable 'settings' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1416-1416: member variable 'ns_id' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1419-1419: member variable 'wn_segment' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1421-1421: member variable 'orig_mode' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1424-1424: member variable 'dummy_mem_tracker' has protected visibility
(cppcoreguidelines-non-private-member-variables-in-classes,-warnings-as-errors)
[error] 1427-1427: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1452-1452: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1478-1478: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1526-1526: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1600-1600: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1679-1679: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1746-1746: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1818-1818: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
[error] 1884-1884: all parameters should be named in a function
(readability-named-parameter,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp
[error] 38-38: parameter 'd' is unused
(misc-unused-parameters,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h
[error] 28-28: variable 'DB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 28-28: variable name 'DB' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h
[error] 21-21: variable 'DB' is non-const and globally accessible, consider making it const
(cppcoreguidelines-avoid-non-const-global-variables,-warnings-as-errors)
[error] 21-21: variable name 'DB' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
🪛 Cppcheck (2.19.0)
dbms/src/Storages/S3/FileCache.cpp
[error] 169-169: syntax error
(syntaxError)
🔇 Additional comments (80)
contrib/tipb (1)
1-1: Subproject commit update is valid and properly integrated.The
contrib/tipbsubmodule pointer update to56b0212a1fa1c92ccf3b9ed7945c716346dad299exists and is reachable. Vector-index related code in the repository correctly includes and uses the tipb protobuf types (tipb::VectorIndexKind,tipb::VectorDistanceMetric) without apparent issues. The integration appears sound.dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.cpp (2)
1-14: LGTM: Standard license header.The Apache 2.0 license header is correctly formatted with the appropriate copyright year.
15-22: LGTM: Correct thread_local performance context implementation.The implementation correctly defines the per-thread performance context following the established pattern from
FileCachePerfContext. The thread_local variable is properly declared asexternin the header and defined here with default initialization.Note: The static analysis warning about "non-const global variable" is a false positive and can be safely ignored. The variable must be mutable for performance tracking, and
thread_localensures thread-safety.dbms/src/Storages/DeltaMerge/Index/VectorSearchPerf.h (3)
1-17: LGTM: Standard header structure.The license header, include guard, and necessary includes are correctly formatted.
19-22: LGTM: Good defensive programming.The macro safeguard to undefine Poco's
thread_localmacro is appropriate defensive programming, ensuring compatibility with the C++11thread_localkeyword.
24-37: LGTM: Well-designed performance context following established patterns.The
VectorSearchPerfContextstruct correctly follows the pattern established byFileCachePerfContextin the codebase. The design is appropriate:
- Public members are suitable for a simple performance tracking struct
- Member names (
visited_nodes,discarded_nodes) align with metrics defined inScanContext.h- The
reset()method using aggregate initialization is idiomatic C++- The
extern thread_localdeclaration is the correct pattern for thread_local variables shared across translation unitsNote: The static analysis warnings about "variable 'DB' is non-const" and "variable name 'DB' is too short" are false positives—
DBis a namespace, not a variable, and follows the established naming convention in this codebase.dbms/src/Storages/DeltaMerge/ScanContext.h (4)
93-93: LGTM: Metric declaration is consistent.The new
total_vector_idx_load_from_s3metric follows the established naming convention and is properly placed alongside related vector index metrics.
150-150: LGTM: Deserialization is correct.The deserialization of
total_vector_idx_load_from_s3from the protobuf field follows the same pattern as other vector index metrics.
202-202: LGTM: Serialization is correct.The serialization of
total_vector_idx_load_from_s3to the protobuf field follows the same pattern as other vector index metrics.
262-262: LGTM: Merge logic is correct.Both merge methods correctly accumulate
total_vector_idx_load_from_s3using the+=operator, consistent with the pattern used for other vector index metrics.Also applies to: 314-314
dbms/src/Common/FailPoint.cpp (1)
109-109: LGTM! Fail point added correctly.The new fail point
file_cache_fg_download_failis properly added to theAPPLY_FOR_FAILPOINTSmacro, enabling fault injection for foreground S3 downloads.dbms/src/Storages/S3/FileCachePerf.cpp (1)
17-22: LGTM! Per-thread performance context correctly defined.The thread_local variable enables per-thread tracking of foreground download metrics without synchronization overhead.
dbms/src/Storages/S3/FileCachePerf.h (1)
19-37: LGTM! Performance context header well-structured.The header properly defines a thread_local performance context for tracking foreground download metrics, following the established pattern from
VectorSearchPerfContext.dbms/src/Storages/S3/FileCache.h (5)
29-29: LGTM! Required include for synchronization.The
<condition_variable>header is necessary for the newly addedcv_readymember used to coordinate file segment state transitions.
75-75: LGTM! Synchronization primitives properly integrated.The
waitForNotEmpty()method andcv_readycondition variable work together to enable foreground callers to wait until a file segment is downloaded. The notification insetStatus()correctly wakes waiting threads when the status changes fromEmpty.Also applies to: 87-88, 134-134
228-233: LGTM! Method provides local-read guarantees.The
downloadFileForLocalRead()method ensures that callers receive aFileSegmentPtrguard, preventing the cached file from being evicted while in use. This is essential for local reads that require stable file access.
249-253: LGTM! Foreground download semantics well-documented.The
getOrWait()method provides blocking semantics for foreground downloads with appropriate documentation about potentialnullptrreturns requiring caller retry.
256-256: LGTM! Foreground download method added.The
fgDownload()method complements the existingbgDownload()for background downloads, providing synchronous download semantics.dbms/src/Storages/S3/FileCache.cpp (3)
17-17: LGTM! Required headers and namespace declarations added.The includes for FailPoint, SyncPoint, and FileCachePerf support the new foreground download functionality with fault injection and performance tracking.
Also applies to: 20-20, 30-30, 57-60
160-176: LGTM! Consistency check prevents stale cache entries.The
downloadFileForLocalRead()implementation properly handles the edge case where cache metadata indicates a file is Complete but the actual file was manually removed from disk. The forced removal ensures cache consistency.
676-700: LGTM! Foreground download implementation is well-structured.The
fgDownload()implementation correctly:
- Uses
SYNC_FORfor test synchronization- Integrates fail point for testing
- Handles exceptions and marks failures appropriately
- Increments performance metrics
- Cleans up failed downloads by removing them from cache
The error handling ensures that failed foreground downloads don't leave orphaned cache entries.
dbms/src/TiDB/Schema/TiDB.h (1)
209-209: LGTM! Type updated consistently with refactor.The change from
VectorIndexInfoPtrtoVectorIndexDefinitionPtraligns with the broader refactor introducing the builder/viewer architecture for vector indices. The TODO comment appropriately flags that this will eventually move to the table level.dbms/src/Storages/DeltaMerge/DeltaMergeDefines.h (1)
95-108: LGTM! Type signature updated consistently.The member and constructor parameter types are correctly updated from
VectorIndexInfoPtrtoVectorIndexDefinitionPtr, maintaining consistency with the broader refactor. The initialization logic remains unchanged.dbms/src/Storages/DeltaMerge/File/DMFileWriter.h (4)
57-57: LGTM! Parameter type updated consistently.The Stream constructor parameter type correctly changed from
VectorIndexInfoPtrtoTiDB::VectorIndexDefinitionPtr, aligning with the broader refactor.
75-75: LGTM! Factory method updated to builder pattern.The change from
VectorIndex::createtoVectorIndexBuilder::createcorrectly adopts the new builder/viewer architecture. Based on the relevant code snippets,VectorIndexBuilder::createvalidates the definition and returns the appropriate builder implementation.
101-101: LGTM! Member type updated to builder.The member type correctly changed from
VectorIndexPtrtoVectorIndexBuilderPtr, consistent with the Stream constructor initialization at Line 75.
163-163: LGTM! Method signature updated consistently.The
addStreamsparameter type correctly changed toTiDB::VectorIndexDefinitionPtr, matching the Stream constructor signature at Line 57.dbms/src/Interpreters/Context.h (1)
395-395: LGTM! Parameter semantics updated from bytes to entity count.The signature change from
cache_size_in_bytestocache_entitiescorrectly reflects the shift to entity-based caching for the VectorIndexCache.dbms/src/Storages/DeltaMerge/DMContext.h (1)
136-150: LGTM! Scan context properly propagated.The addition of the
scan_contextparameter with a default value ofnullptrmaintains backward compatibility while enabling scan context propagation through the DM read paths.dbms/src/Storages/DeltaMerge/File/ColumnStat.h (1)
44-44: LGTM! Protobuf type updated consistently.The change from
ColumnVectorIndexInfotoVectorIndexFilePropsaligns with the protobuf schema changes and is properly handled in serialization/deserialization methods.dbms/src/Interpreters/Context.cpp (2)
57-57: LGTM! Include updated to VectorIndexCache.The include change aligns with the new VectorIndexCache architecture.
1391-1397: LGTM! VectorIndexCache construction updated with entity-based parameter.The parameter semantic change from
cache_size_in_bytestocache_entitiesis correctly implemented and aligns with the updated API.dbms/src/Storages/DeltaMerge/File/dtpb/dmfile.proto (2)
65-67: LGTM! Protobuf field migration handled correctly.The old field 101 is properly reserved, and the new
vector_indexfield uses field number 102 with the updatedVectorIndexFilePropstype. This ensures backward compatibility while migrating to the new schema.
84-94: LGTM! Well-documented protobuf message.The
VectorIndexFilePropsmessage is clearly documented, distinguishing its read-oriented purpose (providing information to read out the index) fromVectorIndexDefinition's construction-oriented purpose. The field choices are appropriate for storing basic index metadata.dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.h (2)
30-77: LGTM on the overall cache design.The
VectorIndexCacheclass provides a clean abstraction for cachingVectorIndexViewerinstances with background cleanup. The use ofLRUCachewith a separate tracking set for periodic validation is a reasonable approach. The thread-safety primitives (shared_mutex, atomic, condition_variable) are appropriately chosen.
43-45: The shutdown mechanism is correctly implemented.The destructor properly notifies
shutdown_cv.notify_all()immediately after settingis_shutting_down = true. Additionally, theshutdown_cv.wait_for()call in the cleanup loop includes a predicate that checks the atomic flag, and has a 1-minute timeout to prevent indefinite blocking. No changes are needed.Likely an incorrect or invalid review comment.
dbms/src/Storages/DeltaMerge/Index/VectorIndex_fwd.h (1)
24-28: LGTM on the forward declaration updates.The forward declarations correctly reflect the new builder/viewer abstraction pattern. The separation of
VectorIndexBuilderandVectorIndexViewerwith their respectiveshared_ptrtypedefs is clean and aligns with the broader refactor.dbms/src/Storages/DeltaMerge/File/DMFileWriter.cpp (3)
350-355: LGTM on vector index metadata population.The metadata fields (
index_kind,distance_metric,dimensions) are correctly populated from the vector index definition using protobuf enum name lookups. This ensures proper serialization for later restoration when reading.
104-127: LGTM on theaddStreamssignature update.The parameter change from
VectorIndexInfoPtrtoVectorIndexDefinitionPtraligns with the broader type refactor. The null-safe propagation ofdo_vector_indexthrough the substream callback is correctly handled.
340-356: No changes needed. Thesave()method already includes error handling viaRUNTIME_CHECK_MSG, which throws an exception if the save operation fails. This preventsgetSize()from being called on a non-existent file. Additionally,save()returnsvoid, making the suggestion to use its return value infeasible.Likely an incorrect or invalid review comment.
dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.cpp (3)
43-48: Good defensive check, but consider adding a release-mode fast-fail for corrupted data.The sortedness check is appropriately guarded by
#ifndef NDEBUGto avoid performance overhead in release builds. However, if sorted data is a critical invariant, consider whether corrupted/unsorted input in production could cause subtle bugs (e.g., missing results, infinite loops).The current code would fail at line 74 (
results_it == results_it_max) if results are out-of-order, providing some protection.
36-77: LGTM on thecalcResultsByPackrefactor.The change from rvalue reference to const reference for
sorted_resultsis appropriate since the caller now owns the sorted data. The logic correctly accumulates per-pack offsets and validates that all results are consumed.
119-119: LGTM on the type update.Changing from
VectorIndex::KeytoVectorIndexViewer::Keyis consistent with the viewer abstraction and the type alias remainsUInt32.dbms/src/Storages/DeltaMerge/Index/VectorIndex.cpp (3)
42-57: LGTM on theVectorIndexBuilder::createfactory.The dimension validation is correct, and the factory properly delegates to
VectorIndexHNSWBuilderfor the HNSW index kind. The error message includes the enum name for easier debugging.
59-74: LGTM on theVectorIndexViewer::viewfactory.The implementation correctly:
- Validates dimension bounds
- Parses the index kind from the protobuf string representation
- Delegates to the appropriate HNSW viewer implementation
- Provides a meaningful error message for unsupported index types
33-40: LGTM onisSupportedTypeimplementation.The type check correctly handles both nullable and non-nullable array types, verifying that the inner type is
Float32.dbms/src/Storages/DeltaMerge/File/VectorColumnFromIndexReader.h (2)
63-72: LGTM on the constructor signature update.The change from move semantics to const reference for
sorted_results_aligns with the implementation that processes the data immediately incalcResultsByPack. TheVectorIndexViewerPtrparameter correctly uses the new viewer abstraction.
47-47: LGTM on the member type change.The change from
VectorIndexPtrtoVectorIndexViewerPtrcorrectly reflects that this reader only needs view capabilities (search and get operations), not builder functionality.dbms/src/TiDB/Schema/VectorIndex.h (3)
27-36: LGTM on theVectorIndexDefinitionstruct.The struct cleanly encapsulates vector index metadata using protobuf types directly. The default initialization to
INVALID_*values is a good defensive pattern that helps catch uninitialized usage.The TODO comment on line 34-35 appropriately notes that additional fields (like
efConstruct) may be added later.
47-74: LGTM on the fmt formatters.The formatters provide clean string representations using protobuf's
_Namefunctions. The null-safe handling in the pointer formatter (outputting<no_idx>) is appropriate.
38-43: LGTM on the immutable pointer pattern and dimension constant.Using
std::shared_ptr<const VectorIndexDefinition>enforces immutability at the type level, which is appropriate for data constructed from TiDB table definitions.The MAX_VECTOR_DIMENSION constant of 16,383 correctly reflects TiDB's maximum supported vector dimensions.
dbms/src/Storages/DeltaMerge/Index/VectorIndexCache.cpp (3)
40-55: Consider potential TOCTOU race betweencache.contains()andcache.remove().Between checking
cache.contains(file_path)at line 40 and callingcache.remove(file_path)at line 53, another thread could potentially add the same entry back into the cache viagetOrSet(). If the LRU cache'sremove()is safe to call on a freshly-added entry, this is benign. However, if this could cause issues, consider holding a lock across both operations.Since this is a cleanup routine that runs periodically and the worst case is re-downloading an index file, the current approach is likely acceptable for simplicity.
63-84: LGTM: Background cleanup loop with proper shutdown handling.The cleanup loop correctly:
- Uses condition variable with timeout for efficient waiting.
- Checks shutdown flag after waking to exit cleanly.
- Wraps cleanup in try-catch to prevent thread termination on exceptions.
86-98: LGTM: Constructor and destructor properly manage background thread lifecycle.The constructor launches the cleaner thread, and the destructor correctly signals shutdown via the atomic flag, notifies the condition variable, and joins the thread.
dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.h (2)
17-22: LGTM: Header refactoring with forward declarations.Good practice to use forward declarations and move implementations to the cpp file. This reduces compilation dependencies and improves build times.
182-187: LGTM: Member type updates align with the new Builder/Viewer architecture.The type changes from
VectorIndexPtrtoVectorIndexViewerPtrand fromBitmapFiltertostd::vector<UInt32> sorted_resultscorrectly reflect the new vector index design where viewing/searching is separated from building.dbms/src/Storages/DeltaMerge/File/DMFileWithVectorIndexBlockInputStream.cpp (5)
34-81: LGTM: Constructor with comprehensive validation.The constructor properly validates:
- ann_query_info is not null
- Column ID matches between vec_cd and ann_query_info
- All reader columns exist in header_layout
- Vector column is not duplicated in reader.read_columns
- Header layout has expected column count
The start_offset_to_pack_id map initialization is efficient and correct.
106-123: LGTM: Filter application logic is correct.The
read()method correctly handles both cases:
- When
return_filter == true: returns the filter to caller for late materialization- When
return_filter == false: applies the filter inline
156-164: Consider usingstd::lower_boundreturn value more efficiently.The loop is correct, but you could potentially use
std::upper_boundto find the end of the range and iterate between bounds for slightly cleaner code. However, the current implementation is clear and efficient.
386-410: LGTM: Vector search result processing with proper deduplication.The implementation correctly:
- Validates that valid_rows covers the dmfile rows
- Sorts results for efficient lookup in readImpl
- Removes duplicates (defensive, as noted in comment)
- Creates VectorColumnFromIndexReader with sorted results
- Updates scan_context with performance metrics
425-450: LGTM: Pack filter optimization is well-implemented.The loop efficiently maps search results to packs and sets unused packs to
RSResult::None. The final RUNTIME_CHECK ensures all results were processed, which is a good safety check.dbms/src/TiDB/Schema/TiDB.cpp (2)
409-420: LGTM: JSON serialization with proper validation.The serialization correctly validates that the vector index kind and distance metric are not invalid before serializing, preventing malformed data from being persisted.
474-500: LGTM: Robust JSON deserialization with comprehensive validation.The deserialization correctly:
- Parses string values to protobuf enums using
_Parsehelpers- Validates parsed values are not INVALID_* sentinel values
- Validates dimension is within bounds (0, MAX_VECTOR_DIMENSION]
- Constructs an immutable
VectorIndexDefinitionwrapped inshared_ptr<const>Good defensive programming with RUNTIME_CHECK_MSG for actionable error messages.
dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.h (2)
24-39: LGTM: Builder class with clear interface.The
VectorIndexHNSWBuilderclass correctly encapsulates the HNSW index building logic with:
- Constructor taking vector index definition
addBlockfor adding vectors with delete mark handlingsavefor persisting to disk- Private
added_rowscounter for tracking keys (including nulls/deletes)
41-56: LGTM: Viewer class with memory-mapped index access.The
VectorIndexHNSWViewerclass provides read-only access to persisted HNSW indices with:
- Static
viewfactory method for loading from file props and pathsearchmethod with row filter supportgetmethod for retrieving individual vectorsThe static analysis hints about namespace
DBare false positives - this is a namespace declaration, not a variable.dbms/src/Storages/DeltaMerge/Index/VectorIndexHNSW/Index.cpp (4)
38-56: LGTM: Metric mapping with proper validation.The
getUSearchMetricKindfunction correctly maps TiDB's distance metrics to usearch's internal types. The L1 metric is explicitly noted as unsupported with a clear error message.The static analysis hint about unused parameter
dis a false positive - the parameter is clearly used in the switch statement.
67-108: LGTM: Block addition with proper null/delete handling.The
addBlockmethod correctly:
- Handles both nullable and non-nullable array columns
- Skips rows marked as deleted
- Skips NULL values
- Validates dimension consistency
- Uses ceil2 for efficient capacity reservation
156-176: Good defensive exception handling in search predicate.The predicate lambda correctly catches exceptions and sets a flag to propagate the error after search completes. This is necessary because usearch's search may run on different threads where exceptions cannot be directly propagated.
The
has_exception_in_searchatomic flag and post-search exception throw pattern is a reasonable approach.
193-197: Good defensive post-filtering.As noted in the comment, usearch doesn't always apply the predicate to all results. The explicit post-filtering ensures correctness regardless of usearch's internal behavior.
dbms/src/Storages/DeltaMerge/tests/gtest_dm_vector_index.cpp (7)
92-98: LGTM: Utility method for vector column definition.The
cdVec()helper method provides a clean way to create a vector column definition without the vector index attached, which is appropriate for read-path testing where the index definition is typically read from file metadata.
203-209: LGTM: Test updated to use new VectorIndexDefinition API.The test correctly uses the new
TiDB::VectorIndexDefinitionwithtipb::VectorIndexKind::HNSWandtipb::VectorDistanceMetric::L2instead of the old enum types.
463-472: LGTM: Error message expectation updated for new metric naming.The expected error message correctly reflects the updated distance metric naming convention (
COSINEinstead of lowercaseCosine).
1202-1274: LGTM: Comprehensive S3 test fixture setup.The
VectorIndexSegmentOnS3Testclass provides a well-structured test fixture for testing vector index functionality with S3 storage:
- Enables S3 configuration and creates test bucket
- Sets up remote data store, page storage, and file cache
- Creates a segment with vector index column
- Properly tears down in reverse order
The static analysis hints about uninitialized members are false positives -
SetUp()initializes all required members.
1478-1524: LGTM: Test verifies S3 download and caching behavior.This test validates the critical path of:
- First query downloads from S3 (total_vector_idx_load_from_s3 = 1)
- Second query uses memory cache (total_vector_idx_load_from_cache = 1)
Good verification of the caching layer working correctly.
1818-1882: LGTM: Concurrent download test with sync points.This test validates that concurrent ANN queries for the same vector index:
- First query performs the S3 download (fg_download_from_s3 = 1)
- Second query waits for the first download (fg_wait_download_from_s3 = 1)
This prevents duplicate downloads and ensures efficient resource usage.
1884-1907: LGTM: S3 failure test with proper exception handling.The test verifies that S3 download failures are properly propagated as exceptions, preventing silent failures and ensuring the application can handle network/storage issues gracefully.
dbms/src/Storages/DeltaMerge/Index/VectorIndex.h (4)
17-26: LGTM!Includes are appropriate for the new builder/viewer architecture. The added
dmfile.pb.hinclude properly supports thedtpb::VectorIndexFilePropstype used inVectorIndexViewer.
28-29: Static analysis false positive - ignore.The clang-tidy warnings about
DBbeing a "non-const globally accessible variable" and having a "too short name" are false positives. Line 28 uses valid C++17 nested namespace syntax (namespace DB::DM), not a variable declaration.
31-56: LGTM!The
VectorIndexBuilderclass provides a clean abstract interface:
- Proper polymorphic base with virtual destructor
- Clear pure virtual methods for
addBlockandsave- Factory pattern via
createstatic methodKeytype alias properly documents its purpose as row offset in DMFile
60-90: LGTM!The
VectorIndexViewerclass provides a well-designed interface:
- Reuses
Keytype fromVectorIndexBuilderensuring consistencyRowFilteralias with clear documentation of semantics (true = valid row)- Factory
viewmethod properly takes protobuf-based file propertiessearchandgetmethods provide complete read API for vector index queries
| auto lock = getLock(); | ||
| if (shared->minmax_index_cache) | ||
| shared->minmax_index_cache->reset(); | ||
| shared->minmax_index_cache.reset(); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the specific lines mentioned in Context.cpp
# to understand the cache dropping behavior
cd dbms/src/Interpreters && wc -l Context.cppRepository: CalvinNeo/tiflash
Length of output: 77
🏁 Script executed:
# Read the relevant sections of Context.cpp
head -n 1420 dbms/src/Interpreters/Context.cpp | tail -n +1350Repository: CalvinNeo/tiflash
Length of output: 1682
🏁 Script executed:
# Search for dropMinMaxIndexCache, dropVectorIndexCache, and dropMarkCache definitions
rg -n 'drop(MinMaxIndexCache|VectorIndexCache|MarkCache)' dbms/src/Interpreters/Context.cpp -A 5Repository: CalvinNeo/tiflash
Length of output: 596
🏁 Script executed:
# Find type definitions for these caches
rg -n 'minmax_index_cache|vector_index_cache|mark_cache' dbms/src/Interpreters/Context.h -B 2 -A 2Repository: CalvinNeo/tiflash
Length of output: 43
🏁 Script executed:
# Search for type aliases in Context.h
rg -n 'MarkCachePtr|MinMaxIndexCachePtr|VectorIndexCachePtr' dbms/src/Interpreters/Context.h -B 2 -A 2Repository: CalvinNeo/tiflash
Length of output: 43
🏁 Script executed:
# Look for the actual class definitions
rg -n 'class.*MarkCache|class.*MinMaxIndexCache|class.*VectorIndexCache' --type=cpp --type=hRepository: CalvinNeo/tiflash
Length of output: 691
🏁 Script executed:
# Check the using declarations for these cache types
rg -n 'using.*MarkCache|using.*MinMaxIndexCache|using.*VectorIndexCache' --type=hRepository: CalvinNeo/tiflash
Length of output: 377
🏁 Script executed:
# Search in the shared context structure definition
rg -n 'struct SharedData|class SharedData' dbms/src/Interpreters/Context.cpp -A 100 | head -150Repository: CalvinNeo/tiflash
Length of output: 43
🏁 Script executed:
# Search for LRUCache reset() method definition
rg -n 'reset\(\)' dbms/src/Common/LRUCache.h -B 3 -A 5Repository: CalvinNeo/tiflash
Length of output: 745
🏁 Script executed:
# Look at the MarkCache reset method specifically
rg -n 'class MarkCache' dbms/src/Storages/MarkCache.h -A 50 | grep -E 'reset|class MarkCache'Repository: CalvinNeo/tiflash
Length of output: 168
🏁 Script executed:
# Check usages of dropMarkCache, dropMinMaxIndexCache, dropVectorIndexCache across codebase
rg -n 'dropMarkCache|dropMinMaxIndexCache|dropVectorIndexCache' --type=cpp -B 2 -A 2Repository: CalvinNeo/tiflash
Length of output: 3754
Fix inconsistent cache dropping behavior across mark_cache, minmax_index_cache, and vector_index_cache.
The code exhibits inconsistent cache destruction semantics:
dropMarkCache()(line 1364) callsshared->mark_cache->reset(), which clears cache contents while keeping the shared_ptr alive.dropMinMaxIndexCache()(line 1388) anddropVectorIndexCache()(line 1410) call.reset()on the shared_ptr itself, destroying the pointer entirely.
This means getMarkCache() returns an empty cache object after dropping, while getMinMaxIndexCache() and getVectorIndexCache() return nullptr. Standardize this behavior across all three cache types—either all should clear contents via ->reset() or all should destroy pointers via .reset().
🤖 Prompt for AI Agents
In @dbms/src/Interpreters/Context.cpp at line 1388, The three cache dropters are
inconsistent: dropMarkCache calls shared->mark_cache->reset() (clearing contents
but keeping the shared_ptr), while dropMinMaxIndexCache and dropVectorIndexCache
call shared->minmax_index_cache.reset() and shared->vector_index_cache.reset()
(destroying the shared_ptr), causing getMarkCache to return an empty object but
the others to return nullptr; make them consistent by choosing one semantic and
applying it to all three—either change dropMarkCache to reset the pointer
(shared->mark_cache.reset()) to match the other two, or change
dropMinMaxIndexCache and dropVectorIndexCache to clear contents via ->reset() to
match dropMarkCache; update the corresponding getMarkCache, getMinMaxIndexCache,
and getVectorIndexCache callers/assumptions if needed.
| /// Views a VectorIndex file. | ||
| /// It may nor may not read the whole content of the file into memory. |
There was a problem hiding this comment.
Typo in comment.
"may nor may not" should be "may or may not".
Proposed fix
/// Views a VectorIndex file.
-/// It may nor may not read the whole content of the file into memory.
+/// It may or may not read the whole content of the file into memory.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /// Views a VectorIndex file. | |
| /// It may nor may not read the whole content of the file into memory. | |
| /// Views a VectorIndex file. | |
| /// It may or may not read the whole content of the file into memory. |
🤖 Prompt for AI Agents
In @dbms/src/Storages/DeltaMerge/Index/VectorIndex.h around lines 58 - 59, Fix
the typo in the class/file comment for VectorIndex: replace "may nor may not
read the whole content of the file into memory" with "may or may not read the
whole content of the file into memory" so the comment for the VectorIndex view
is grammatically correct and clear.
| #ifdef __clang__ | ||
| #pragma clang diagnostic push | ||
| #pragma clang diagnostic ignored "-Wdeprecated-declarations" | ||
| #endif |
There was a problem hiding this comment.
Critical: Missing #pragma clang diagnostic pop directive.
The #pragma clang diagnostic push at line 29 suppresses deprecated-declarations warnings, but there's no corresponding pop directive before the end of the file (line 355). This causes the warning suppression to leak into any translation unit that includes this header, potentially hiding legitimate warnings in unrelated code.
🔧 Add the missing pop directive
Add the following at the end of the file, just before the namespace closing bracket:
} // namespace DB::DM
+
+#ifdef __clang__
+#pragma clang diagnostic pop
+#endifAlternatively, if the pragmas are unnecessary (i.e., no deprecated APIs are actually used in this header), consider removing them entirely.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In @dbms/src/Storages/DeltaMerge/ScanContext.h around lines 28 - 31, The header
opens a Clang diagnostic push/ignore block (the __clang__ / #pragma clang
diagnostic push and ignored "-Wdeprecated-declarations" in ScanContext.h) but
never restores it, causing suppression to leak to including TUs; add a matching
clang diagnostic pop (i.e., the corresponding #pragma clang diagnostic pop)
right before the file/namespace is closed (or remove the push/ignore pair if
unused) so the suppression is properly scoped and does not affect other
translation units.
| FileSegmentPtr FileCache::getOrWait(const S3::S3FilenameView & s3_fname, const std::optional<UInt64> & filesize) | ||
| { | ||
| auto s3_key = s3_fname.toFullKey(); | ||
| auto file_type = getFileType(s3_key); | ||
| auto & table = tables[static_cast<UInt64>(file_type)]; | ||
|
|
||
| std::unique_lock lock(mtx); | ||
|
|
||
| auto f = table.get(s3_key); | ||
| if (f != nullptr) | ||
| { | ||
| lock.unlock(); | ||
| f->setLastAccessTime(std::chrono::system_clock::now()); | ||
| auto status = f->waitForNotEmpty(); | ||
| if (status == FileSegment::Status::Complete) | ||
| { | ||
| GET_METRIC(tiflash_storage_remote_cache, type_dtfile_hit).Increment(); | ||
| return f; | ||
| } | ||
| else | ||
| { | ||
| // On-going download failed, let the caller retry. | ||
| return nullptr; | ||
| } | ||
| } | ||
|
|
||
| GET_METRIC(tiflash_storage_remote_cache, type_dtfile_miss).Increment(); | ||
|
|
||
| auto estimzted_size = filesize ? *filesize : getEstimatedSizeOfFileType(file_type); | ||
| if (!reserveSpaceImpl(file_type, estimzted_size, /*try_evict*/ true)) | ||
| { | ||
| // Space not enough. | ||
| GET_METRIC(tiflash_storage_remote_cache, type_dtfile_full).Increment(); | ||
| LOG_DEBUG( | ||
| log, | ||
| "s3_key={} space not enough(capacity={} used={} estimzted_size={}), skip cache", | ||
| s3_key, | ||
| cache_capacity, | ||
| cache_used, | ||
| estimzted_size); | ||
|
|
||
| // Just throw, no need to let the caller retry. | ||
| throw Exception( // | ||
| ErrorCodes::S3_ERROR, | ||
| "Cannot reserve {} space for object {}", | ||
| estimzted_size, | ||
| s3_key); | ||
| return nullptr; | ||
| } | ||
|
|
||
| auto file_seg | ||
| = std::make_shared<FileSegment>(toLocalFilename(s3_key), FileSegment::Status::Empty, estimzted_size, file_type); | ||
| table.set(s3_key, file_seg); | ||
| lock.unlock(); | ||
|
|
||
| ++PerfContext::file_cache.fg_download_from_s3; | ||
| fgDownload(s3_key, file_seg); | ||
| if (!file_seg || !file_seg->isReadyToRead()) | ||
| throw Exception(ErrorCodes::S3_ERROR, "Download object {} failed", s3_key); | ||
|
|
||
| return file_seg; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
ast-grep --pattern 'getOrWait($$$)'Repository: CalvinNeo/tiflash
Length of output: 150
🏁 Script executed:
# Also find the get() method to understand the comparison
rg -B 3 -A 20 'FileSegmentPtr FileCache::get\('Repository: CalvinNeo/tiflash
Length of output: 1622
🏁 Script executed:
# Find the FileCache class definition and both methods
fd 'FileCache' --type fRepository: CalvinNeo/tiflash
Length of output: 204
🏁 Script executed:
# Look at the context around the getOrWait call at line 164
sed -n '150,180p' dbms/src/Storages/S3/FileCache.cppRepository: CalvinNeo/tiflash
Length of output: 1268
🏁 Script executed:
# Get the complete get() method to compare
sed -n '214,234p' dbms/src/Storages/S3/FileCache.cppRepository: CalvinNeo/tiflash
Length of output: 739
🏁 Script executed:
# Check for all occurrences of getOrWait in the entire codebase
rg 'getOrWait' --type cpp --type hRepository: CalvinNeo/tiflash
Length of output: 357
Asymmetry in error handling between get() and getOrWait() is intentional but could be better documented.
The getOrWait() method throws an exception when space cannot be reserved (lines 278-282), while get() returns nullptr in the same scenario. This design choice is intentional—downloadFileForLocalRead() is the sole caller of getOrWait() and it's wrapped in a try-catch block prepared to handle exceptions. However, the asymmetry warrants explicit documentation in the method's contract or header comments explaining why callers should expect exceptions from getOrWait() but not from get().
🤖 Prompt for AI Agents
In @dbms/src/Storages/S3/FileCache.cpp around lines 236 - 297, Add a clear
header comment to FileCache::getOrWait explaining its error behavior: unlike
FileCache::get which returns nullptr on reservation failure, getOrWait will
throw an Exception when it cannot reserve space (and is intended to be used by
downloadFileForLocalRead which handles that exception). Mention the rationale
(caller is expected to retry or handle via try/catch), the specific exception
type (ErrorCodes::S3_ERROR) and that callers of getOrWait must be prepared for
exceptions rather than nullptr, and reference FileCache::get for the alternative
semantics.
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
✏️ Tip: You can customize this high-level summary in your review settings.