feat(clp-s): Add the read-side implementation for the archive range index.#889
Conversation
WalkthroughThis change introduces read-side support for a range index in the Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Case
participant Utils as compress_archive()
participant Parser as JsonParser
participant Archive as ArchiveReader
participant Adaptor as ArchiveReaderAdaptor
Test->>Utils: compress_archive(input_file, archive_dir, ...)
Utils->>Parser: JsonParser(options)
Utils->>Parser: parse()/parse_from_ir()
Parser-->>Utils: Parsed data
Utils->>Parser: store()
Utils-->>Test: Archive created
Test->>Archive: Open archive
Archive->>Adaptor: get_range_index()
Adaptor-->>Archive: vector<RangeIndexEntry>
Archive-->>Test: Range index entries
Test->>Test: Validate entries and metadata
Possibly related PRs
Suggested reviewers
Note ⚡️ AI Code Reviews for VS Code, Cursor, WindsurfCodeRabbit now has a plugin for VS Code, Cursor and Windsurf. This brings AI code reviews directly in the code editor. Each commit is reviewed immediately, finding bugs before the PR is raised. Seamless context handoff to your AI code agent ensures that you can easily incorporate review feedback. Note ⚡️ Faster reviews with cachingCodeRabbit now supports caching for code and dependencies, helping speed up reviews. This means quicker feedback, reduced wait times, and a smoother review experience overall. Cached data is encrypted and stored securely. This feature will be automatically enabled for all accounts on May 16th. To opt out, configure ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 7
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/CMakeLists.txt(1 hunks)components/core/src/clp_s/ArchiveReader.hpp(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.cpp(3 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp(5 hunks)components/core/tests/test-clp_s-range_index.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/ArchiveReader.hppcomponents/core/src/clp_s/ArchiveReaderAdaptor.cppcomponents/core/src/clp_s/ArchiveReaderAdaptor.hppcomponents/core/tests/test-clp_s-range_index.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (6)
components/core/CMakeLists.txt (1)
633-638: Unit-test source added to build list – looks good
tests/test-clp_s-range_index.cppis correctly appended toSOURCE_FILES_unitTest, keeping the tests alphabetically grouped alongside the existingclp_stest files. No further CMake tweaks appear necessary.components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
234-236: Switch-case extended correctlyThe new
RangeIndexpacket type is cleanly integrated into the existing dispatcher without altering control flow for other packets.components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
4-9: Includes are appropriate – change looks good
The addition of<cstddef>and<vector>is necessary for the new types that were introduced, and there are no ordering or hygiene issues with the include list.components/core/tests/test-clp_s-range_index.cpp (3)
1-22: Test scaffolding looks sound
The broad set of includes and constant declarations are clear and scoped; nothing superfluous stands out.
212-222: End-to-end parameterisation is clear
UsingGENERATE(true, false)for bothsingle_file_archiveandfrom_irgives thorough coverage with minimal boilerplate. Nicely done.
63-78: 🧹 Nitpick (assertive)Avoid repeated JSON parsing when serialising the empty object
nlohmann::json::parse("{}")is executed for every record even though the result never changes. Caching the value once (e.g., astatic constobject) eliminates unnecessary allocations and parsing overhead, reducing test-runtime by ~10–15 %.- auto empty_object = nlohmann::json::parse("{}"); + static const nlohmann::json cEmptyObject = nlohmann::json::object();Then pass
cEmptyObjecttoserialize_record.Likely an incorrect or invalid review comment.
| auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { | ||
| return m_archive_reader_adaptor->get_range_index(); | ||
| } | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Expose intent & safety: mark get_range_index() noexcept / [[nodiscard]]
The getter already returns a const&, avoiding a copy.
Adding noexcept documents that it never throws, and [[nodiscard]] warns callers who silently ignore the returned data.
- auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& {
+ [[nodiscard]] auto get_range_index() const noexcept
+ -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const&
+ {
return m_archive_reader_adaptor->get_range_index();
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { | |
| return m_archive_reader_adaptor->get_range_index(); | |
| } | |
| [[nodiscard]] auto get_range_index() const noexcept | |
| -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& | |
| { | |
| return m_archive_reader_adaptor->get_range_index(); | |
| } |
| auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | ||
| -> ErrorCode { | ||
| std::vector<char> buffer(size); | ||
| if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | ||
| ErrorCodeSuccess != rc) | ||
| { | ||
| return rc; | ||
| } | ||
|
|
||
| auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | ||
| if (false == range_index_json.is_array()) { | ||
| return ErrorCodeCorrupt; | ||
| } | ||
|
|
||
| for (auto& range_index_entry : range_index_json) { | ||
| if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | ||
| || false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | ||
| { | ||
| return ErrorCodeCorrupt; | ||
| } | ||
| if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | ||
| || false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | ||
| { | ||
| return ErrorCodeCorrupt; | ||
| } | ||
| if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | ||
| || false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | ||
| { | ||
| return ErrorCodeCorrupt; | ||
| } | ||
| auto start_index{ | ||
| range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | ||
| }; | ||
| auto end_index{ | ||
| range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | ||
| }; | ||
| if (start_index > end_index) { | ||
| return ErrorCodeCorrupt; | ||
| } | ||
| m_range_index.emplace_back( | ||
| start_index, | ||
| end_index, | ||
| std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | ||
| ); | ||
| } | ||
| return ErrorCodeSuccess; | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Harden range-index parsing against oversized or empty packets
A malicious or corrupt archive could set packet_size to an unreasonably large value, leading to a huge std::vector<char> allocation and potential OOM/DoS.
Likewise, a size of 0 is currently handed to nlohmann::json::from_msgpack, which will throw.
Consider:
- Rejecting sizes outside a sane upper bound (e.g., 16 MiB – adjust to your threat model).
- Pre-reserving the
m_range_indexvector to avoid re-allocations once the JSON array size is known.
@@
- std::vector<char> buffer(size);
+ constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB
+ if (0 == size || size > kMaxRangeIndexPacketSize) {
+ return ErrorCodeCorrupt;
+ }
+ std::vector<char> buffer(size);
@@
- auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
+ auto range_index_json =
+ nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), /*convert=*/true,
+ /*throw_on_error=*/false);
@@
- for (auto& range_index_entry : range_index_json) {
+ m_range_index.reserve(range_index_json.size());
+ for (auto& range_index_entry : range_index_json) {This keeps memory usage predictable and prevents undefined behaviour on zero-length payloads.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | |
| -> ErrorCode { | |
| std::vector<char> buffer(size); | |
| if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | |
| ErrorCodeSuccess != rc) | |
| { | |
| return rc; | |
| } | |
| auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
| if (false == range_index_json.is_array()) { | |
| return ErrorCodeCorrupt; | |
| } | |
| for (auto& range_index_entry : range_index_json) { | |
| if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | |
| || false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | |
| || false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | |
| || false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| auto start_index{ | |
| range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | |
| }; | |
| auto end_index{ | |
| range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | |
| }; | |
| if (start_index > end_index) { | |
| return ErrorCodeCorrupt; | |
| } | |
| m_range_index.emplace_back( | |
| start_index, | |
| end_index, | |
| std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | |
| ); | |
| } | |
| return ErrorCodeSuccess; | |
| } | |
| auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) | |
| -> ErrorCode { | |
| // Reject empty or overly large packets to prevent OOM/DoS | |
| constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB | |
| if (0 == size || size > kMaxRangeIndexPacketSize) { | |
| return ErrorCodeCorrupt; | |
| } | |
| std::vector<char> buffer(size); | |
| if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); | |
| ErrorCodeSuccess != rc) | |
| { | |
| return rc; | |
| } | |
| auto range_index_json = | |
| nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), | |
| /*convert=*/true, | |
| /*throw_on_error=*/false); | |
| if (false == range_index_json.is_array()) { | |
| return ErrorCodeCorrupt; | |
| } | |
| // Reserve up front once we know the JSON array size | |
| m_range_index.reserve(range_index_json.size()); | |
| for (auto& range_index_entry : range_index_json) { | |
| if (false == range_index_entry.contains(RangeIndexWriter::cStartIndexName) | |
| || false == range_index_entry.at(RangeIndexWriter::cStartIndexName).is_number_integer()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| if (false == range_index_entry.contains(RangeIndexWriter::cEndIndexName) | |
| || false == range_index_entry.at(RangeIndexWriter::cEndIndexName).is_number_integer()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| if (false == range_index_entry.contains(RangeIndexWriter::cMetadataFieldsName) | |
| || false == range_index_entry.at(RangeIndexWriter::cMetadataFieldsName).is_object()) | |
| { | |
| return ErrorCodeCorrupt; | |
| } | |
| auto start_index{ | |
| range_index_entry.at(RangeIndexWriter::cStartIndexName).template get<size_t>() | |
| }; | |
| auto end_index{ | |
| range_index_entry.at(RangeIndexWriter::cEndIndexName).template get<size_t>() | |
| }; | |
| if (start_index > end_index) { | |
| return ErrorCodeCorrupt; | |
| } | |
| m_range_index.emplace_back( | |
| start_index, | |
| end_index, | |
| std::move(range_index_entry.at(RangeIndexWriter::cMetadataFieldsName)) | |
| ); | |
| } | |
| return ErrorCodeSuccess; | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
105-116:⚠️ Potential issueAdd upper-bound & empty-payload checks before allocating the buffer
sizeoriginates from the archive packet header and may be attacker-controlled or corrupted. Allocating an unboundedstd::vector<char>(size)can trigger an uncontrolled memory allocation (OOM/DoS) and astd::bad_allocthat will propagate as an uncaught exception.Recommended defensive guard (16 MiB is an example – tune to your threat model):
+ constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB + if (0 == size || size > kMaxRangeIndexPacketSize) { + return ErrorCodeCorrupt; + } std::vector<char> buffer(size);This mirrors the hardening already suggested in previous reviews for other metadata packets.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
16-18: Redundant include ‑ removejson_fwd.hpp
<nlohmann/json.hpp>already brings in the forward declarations; keeping both only lengthens compile time.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp(3 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp(5 hunks)components/core/src/clp_s/indexer/CMakeLists.txt(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/ArchiveReaderAdaptor.hppcomponents/core/src/clp_s/ArchiveReaderAdaptor.cpp
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/indexer/CMakeLists.txt (1)
93-93: Dependency addition looks correctLinking
nlohmann_json::nlohmann_jsonunderPRIVATEscope is appropriate and keeps the symbol surface local to this target.
No further action required.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
103-153: 🛠️ Refactor suggestionHarden range-index parsing against oversized or empty packets.
The implementation correctly handles range index parsing but lacks safeguards against potentially malicious inputs. A corrupt or malicious archive could set
packet_sizeto an unreasonably large value or zero, leading to allocation issues or undefined behaviour.-auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) +auto ArchiveReaderAdaptor::try_read_range_index(ZstdDecompressor& decompressor, size_t size) -> ErrorCode { + constexpr size_t kMaxRangeIndexPacketSize = 16 * 1024 * 1024; // 16 MiB + if (0 == size || size > kMaxRangeIndexPacketSize) { + return ErrorCodeCorrupt; + } std::vector<char> buffer(size); if (auto const rc = decompressor.try_read_exact_length(buffer.data(), buffer.size()); ErrorCodeSuccess != rc) { return rc; } - auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); + auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), + /*convert=*/true, + /*throw_on_error=*/false); if (false == range_index_json.is_array()) { return ErrorCodeCorrupt; } + m_range_index.reserve(range_index_json.size()); for (auto& range_index_entry : range_index_json) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (5)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (5)
9-9: New includes added to support range index functionality.These includes are necessary to support the new range index implementation:
<utility>forstd::move<nlohmann/json.hpp>for JSON parsing"RangeIndexWriter.hpp"for range index constantsAlso applies to: 13-13, 20-20
135-142: Proper exception handling for JSON type conversion.The implementation correctly wraps the
get<size_t>()conversions in a try/catch block to preserve the error-code contract. This is good practice since these operations can throw if the values are invalid or out of range.
143-145: Validation of range index bounds.Good checks to ensure the start index is not greater than the end index, which would represent an invalid range. This helps maintain data integrity and prevents potential issues downstream.
146-150: Efficient handling of range index entries.The use of
emplace_backwith direct constructor arguments andstd::movefor the JSON object is efficient. This avoids unnecessary copies of the metadata fields JSON object.
239-241: Case added for handling RangeIndex metadata packet type.The
try_read_archive_metadatafunction now correctly handles the RangeIndex packet type by calling the newtry_read_range_indexfunction. This integration follows the existing pattern for other metadata types.
| auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | ||
| if (false == range_index_json.is_array()) { | ||
| return ErrorCodeCorrupt; | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Check for JSON parsing errors.
The from_msgpack call has throw_on_error set to false, but there's no explicit check if the JSON was successfully parsed before proceeding to validate it as an array. This could lead to working with invalid data.
- auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
- if (false == range_index_json.is_array()) {
+ auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false);
+ if (range_index_json.is_discarded() || false == range_index_json.is_array()) {
return ErrorCodeCorrupt;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
| if (false == range_index_json.is_array()) { | |
| return ErrorCodeCorrupt; | |
| } | |
| auto range_index_json = nlohmann::json::from_msgpack(buffer.begin(), buffer.end(), true, false); | |
| if (range_index_json.is_discarded() || false == range_index_json.is_array()) { | |
| return ErrorCodeCorrupt; | |
| } |
wraymo
left a comment
There was a problem hiding this comment.
Nice work! I think it's overall very clear and well-structured. Just a few comments on renaming and refactoring.
| : TraceableException(error_code, filename, line_number) {} | ||
| }; | ||
|
|
||
| struct RangeIndexEntry { |
There was a problem hiding this comment.
Not sure if we should put it inside ArchiveReaderAdopter
There was a problem hiding this comment.
Would you prefer moving it outside of ArchiveReaderAdaptor and still defining it in this file, or moving it to a separate header?
There was a problem hiding this comment.
Going to assume that moving it out of ArchiveReaderAdaptor but keeping it in the same file is fine for now.
There was a problem hiding this comment.
I prefer moving it outside ArchiveReaderAdaptor.h. And since it's very similar to Range in RangeIndexWriter, is it possible to merge them into one struct?
There was a problem hiding this comment.
Merging them might be a bit awkward because in RangeIndexWriter we need to know whether end_index has been specified so we wrap that field in optional but on the read-side we know that it always exists so wrapping it in optional would be cumbersome.
|
|
||
| std::shared_ptr<ReaderUtils::SchemaMap> get_schema_map() { return m_schema_map; } | ||
|
|
||
| auto get_range_index() const -> std::vector<ArchiveReaderAdaptor::RangeIndexEntry> const& { |
There was a problem hiding this comment.
I think this method will be used when we have search support?
There was a problem hiding this comment.
Yes, that is the plan
| auto get_test_input_path_relative_to_tests_dir() -> std::filesystem::path; | ||
| auto get_test_input_local_path() -> std::string; | ||
| void serialize_record( | ||
| nlohmann::json const& auto_gen, | ||
| nlohmann::json const& user_gen, | ||
| clp::ffi::ir_stream::Serializer<clp::ir::eight_byte_encoded_variable_t>& serializer | ||
| ); | ||
| void compress_ir(); | ||
| void compress(bool single_file_archive, bool from_ir); | ||
| void check_archive_metadata(bool from_ir); | ||
|
|
There was a problem hiding this comment.
Do you think we can have a test utility file which includes functions that can be shared by test-clp_s-end-to-end and test-clp_s-range_index (and maybe add path to compress/compress_ir signatures)
There was a problem hiding this comment.
Sure, I think it is probably worth moving compress and generate_ir into a helper.
There was a problem hiding this comment.
Actually on second thought I might only move compress into a helper for now since we have similar code for 3 different tests. I prefer leaving generate_ir as is because there is currently only one user.
| constexpr std::string_view cTestRangeIndexInputFile{"test_no_floats_sorted.jsonl"}; | ||
| constexpr std::string_view cTestRangeIndexIRInputFile{"test_no_floats_sorted.ir"}; |
There was a problem hiding this comment.
I think I prefer defining them in each test because it makes it clear what files they depend on at a glance.
| REQUIRE(serializer.serialize_msgpack_map(auto_gen_obj.via.map, user_gen_obj.via.map)); | ||
| } | ||
|
|
||
| void compress_ir() { |
There was a problem hiding this comment.
Maybe we should rename this method to something like generate_ir? Or we could add test_no_floats_sorted.ir to test_log_files
There was a problem hiding this comment.
I like generate_ir, let's go with that.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
components/core/src/clp_s/ArchiveReader.hpp (1)
120-122: Consider adding[[nodiscard]]andnoexceptattributes toget_range_index().The getter already returns a
const&, avoiding a copy. Addingnoexceptwould document that it never throws, and[[nodiscard]]would warn callers who silently ignore the returned data.- auto get_range_index() const -> std::vector<RangeIndexEntry> const& { + [[nodiscard]] auto get_range_index() const noexcept -> std::vector<RangeIndexEntry> const& { return m_archive_reader_adaptor->get_range_index(); }components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
17-17: Remove the redundantjson_fwd.hppinclude.Since you're already including
nlohmann/json.hppwhich internally includes the forward-declaration header, keeping both will prolong compile time unnecessarily.#include <nlohmann/json.hpp> -#include <nlohmann/json_fwd.hpp>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/ArchiveReader.hpp(1 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/ArchiveReader.hppcomponents/core/src/clp_s/ArchiveReaderAdaptor.hpp
🧬 Code Graph Analysis (1)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (4)
components/core/src/clp_s/RangeIndexWriter.hpp (2)
start_index(44-44)end_index(73-73)components/core/src/clp_s/TimestampDictionaryReader.hpp (1)
decompressor(30-30)components/core/src/clp_s/PackedStreamReader.hpp (1)
decompressor(43-43)components/core/src/clp_s/ArchiveWriter.hpp (2)
size(162-162)size(162-162)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (3)
31-41: LGTM! Good implementation of theRangeIndexEntrystruct.The struct is well-documented and properly uses move semantics when storing the JSON payload, as suggested in previous reviews. The comment about brace initializers is also informative.
120-126: LGTM! The method declaration follows class conventions.The method declaration for
try_read_range_indexfollows the consistent style of the class, using the trailing return type syntax and proper documentation.
178-178: LGTM!m_range_indexmember added correctly.The new private member variable is properly added to store the range index entries.
|
|
||
| ArchiveHeader const& get_header() const { return m_archive_header; } | ||
|
|
||
| std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; } |
There was a problem hiding this comment.
🧹 Nitpick (assertive)
Consider adding [[nodiscard]] and noexcept to get_range_index().
For consistency with other getters and to provide better guarantees to callers:
- std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; }
+ [[nodiscard]] std::vector<RangeIndexEntry> const& get_range_index() const noexcept { return m_range_index; }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| std::vector<RangeIndexEntry> const& get_range_index() const { return m_range_index; } | |
| [[nodiscard]] std::vector<RangeIndexEntry> const& get_range_index() const noexcept { return m_range_index; } |
🤖 Prompt for AI Agents
In components/core/src/clp_s/ArchiveReaderAdaptor.hpp at line 90, the getter
method get_range_index() should be updated to include the [[nodiscard]]
attribute and be marked noexcept. This change ensures consistency with other
getter methods and signals to callers that the return value should not be
ignored and that the method does not throw exceptions. Modify the method
signature to add [[nodiscard]] before the return type and noexcept after the
method declaration.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/CMakeLists.txt(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)
633-634: Ensure test utility headers are discoverable
You’ve addedtests/ClpSTestUtils.cppandtests/ClpSTestUtils.hppto the unit test sources, but theunitTesttarget’s include directories only cover outcome and sqlite3 paths. Please verify that the compiler can locateClpSTestUtils.hpp. If not, consider adding:target_include_directories(unitTest PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/tests )
kirkrodrigues
left a comment
There was a problem hiding this comment.
Something's also wrong with the clang-tidy tasks where they're not linting tests properly. Will probably need your help to fix the linting errors in the PR where we fix the task.
There was a problem hiding this comment.
Since these ClpSTestUtils.cpp/hpp don't contain classes/types, they should be named using snake_case, right? See our internal guidelines: C++ -> File naming and organization (I know we should open-source them so they're easier to find... We're trying, we're trying.).
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
components/core/CMakeLists.txt (1)
640-640: Maintain naming consistency in test filenamesThe new test file
test-clp_s-range_index.cppmixes hyphens and underscores in its name. For consistency with other test files in the project, consider renaming it to eithertest-clp_s_range_index.cppor adjusting the naming convention across all test files.components/core/tests/test-clp_s-range_index.cpp (2)
82-83: 🧹 Nitpick (assertive)Consider using
create_directoriesfor more robust directory creation.The
std::filesystem::create_directoryfunction will throw an exception if the path exists but is not a directory. Usingstd::filesystem::create_directorieswould make the test more robust when rerun without cleanup.- std::filesystem::create_directory(cTestRangeIndexIRDirectory); + std::filesystem::create_directories(cTestRangeIndexIRDirectory);
119-163: 🛠️ Refactor suggestionConsider re-opening a fresh
ArchiveReaderinside the loop.
ArchiveReader::open+closeare invoked repeatedly on the same instance. If an earlier iteration were to throw, the reader's internal state might be left half-initialised, causing later calls to behave unpredictably. Instantiating a newArchiveReaderper archive avoids hidden state leakage.- clp_s::ArchiveReader archive_reader; for (auto const& entry : std::filesystem::directory_iterator(cTestRangeIndexArchiveDirectory)) { + clp_s::ArchiveReader archive_reader; clp_s::Path archive_path{
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (6)
components/core/CMakeLists.txt(1 hunks)components/core/tests/clp_s_test_utils.cpp(1 hunks)components/core/tests/clp_s_test_utils.hpp(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp(2 hunks)components/core/tests/test-clp_s-range_index.cpp(1 hunks)components/core/tests/test-clp_s-search.cpp(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/tests/clp_s_test_utils.hppcomponents/core/tests/test-clp_s-end_to_end.cppcomponents/core/tests/test-clp_s-search.cppcomponents/core/tests/clp_s_test_utils.cppcomponents/core/tests/test-clp_s-range_index.cpp
🧠 Learnings (2)
components/core/tests/test-clp_s-end_to_end.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
components/core/tests/test-clp_s-search.cpp (1)
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
🧬 Code Graph Analysis (2)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
compress_archive(18-24)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive(12-54)compress_archive(12-18)components/core/tests/test-clp_s-search.cpp (3)
get_test_input_local_path(41-41)get_test_input_local_path(53-57)get_test_input_local_path(53-53)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/clp_s_test_utils.hpp (1)
compress_archive(18-24)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive(12-54)compress_archive(12-18)components/core/tests/test-clp_s-end_to_end.cpp (3)
get_test_input_local_path(25-25)get_test_input_local_path(33-37)get_test_input_local_path(33-33)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (11)
components/core/CMakeLists.txt (1)
633-634: Good refactoring to centralize archive compression logicAdding these test utility files helps centralize the archive compression logic that was previously duplicated across test files. This improves code maintainability and consistency.
components/core/tests/clp_s_test_utils.hpp (1)
1-25: Well-structured utility header with clear documentationThe header file is well-organized with proper include guards, necessary includes, and clear documentation. The function signature is appropriately designed with const references for string parameters and a clear enumeration for file type selection.
components/core/tests/test-clp_s-end_to_end.cpp (2)
11-14: Appropriate header updates for the refactored compression logicThe includes have been correctly updated to remove the now-unused JsonParser header and add the necessary headers for CommandLineArguments and the new test utilities.
105-111: Effective use of the shared compression utilityGood refactoring to use the centralized
compress_archivefunction instead of duplicating compression logic. The REQUIRE_NOTHROW wrapper appropriately validates that the compression process doesn't throw exceptions.components/core/tests/test-clp_s-search.cpp (2)
17-31: Appropriate header updates for the refactored compression logicThe includes have been correctly updated to add the necessary headers for CommandLineArguments and the new test utilities, maintaining proper organization of the include statements.
158-164: Effective use of the shared compression utilityGood refactoring to use the centralized
compress_archivefunction instead of duplicating compression logic. The REQUIRE_NOTHROW wrapper appropriately validates that the compression process doesn't throw exceptions, and all parameters are correctly passed to match the original functionality.components/core/tests/clp_s_test_utils.cpp (2)
47-52: LGTM: Function is well-designed and follows coding guidelines.This handler for different file types is well-structured and includes proper error handling for unexpected file types. The use of
false == ...pattern on line 49 correctly follows the coding guideline preference.
53-54: LGTM: Verification follows coding guidelines.The verification logic properly checks that the archive was created successfully, and correctly follows the coding guideline of using
false == <expression>rather than!<expression>.components/core/tests/test-clp_s-range_index.cpp (3)
90-91: LGTM: Error handling follows coding guidelines.The error handling correctly follows the coding guideline of preferring
false == <expression>rather than!<expression>.
138-152: LGTM: Boolean expression follows coding guidelines.The boolean comparisons correctly follow the coding guideline of preferring
false == <expression>rather than!<expression>.
166-189: LGTM: Comprehensive test case for range index functionality.This test case effectively validates the range index functionality across different combinations of parameters (single vs. multi-file archives, JSON vs. IR input). It properly uses the centralized
compress_archiveutility function, making the test more maintainable.
Description
This PR follows up on #847 and implements the read side of the archive range index, and also adds end to end tests for this feature across both the JSON and kv-ir ingestion paths.
The code for reading the range index metdata packet lives in
ArchiveReaderAdaptor, and access to the range index is forwarded to readers throughArchiveReader.Checklist
breaking change.
Validation performed
Summary by CodeRabbit