feat(clp-s): Add the write-side implementation for the archive range index.#847
Conversation
WalkthroughThis update introduces a range index feature to the archive writing and reading process. The Changes
Sequence Diagram(s)sequenceDiagram
participant JsonParser
participant ArchiveWriter
participant RangeIndexWriter
participant ZstdCompressor
JsonParser->>ArchiveWriter: add_field_to_current_range(key, value)
ArchiveWriter->>RangeIndexWriter: add_value_to_range(key, value)
JsonParser->>ArchiveWriter: close_current_range()
ArchiveWriter->>RangeIndexWriter: close_range(end_index)
ArchiveWriter->>ArchiveWriter: write_archive_metadata()
ArchiveWriter->>RangeIndexWriter: write(ZstdCompressor)
RangeIndexWriter->>ZstdCompressor: write msgpack-encoded range index
Suggested reviewers
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms (9)
🔇 Additional comments (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (7)
components/core/src/clp_s/ArchiveWriter.cpp (2)
155-160: Prefer aconstexprfor constant packets
num_constant_packetsnever changes; turning it into aconstexprclarifies intent and enables compiler optimisation.- uint8_t const num_constant_packets{3U}; + constexpr uint8_t num_constant_packets{3U};
187-191: Avoid an unconditional write when the range index is empty
RangeIndexWriter::write()returns immediately when empty, but calling it anyway incurs a function call and makes the intent less obvious. A small guard keeps the code self-documenting.- // Write range index - if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) { + // Write range index (only if present) + if (false == m_range_index_writer.empty()) { + if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) { throw OperationFailed(rc, __FILENAME__, __LINE__); - } + } + }components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
100-106: Reading unknown packets can exhaust memory – stream instead of bulk-allocatingAllocating a
std::vector<char>(size)for every unknown packet may allocate many GiB for a malformed or adversarial archive. Consider discarding the data in fixed-size blocks to cap memory usage.- std::vector<char> buffer(size); - return decompressor.try_read_exact_length(buffer.data(), buffer.size()); + constexpr size_t cBufSize = 64 * 1024; + std::array<char, cBufSize> buf{}; + size_t bytes_remaining = size; + while (bytes_remaining > 0) { + size_t to_read = std::min(bytes_remaining, cBufSize); + auto rc = decompressor.try_read_exact_length(buf.data(), to_read); + if (ErrorCodeSuccess != rc) { + return rc; + } + bytes_remaining -= to_read; + } + return ErrorCodeSuccess;components/core/src/clp_s/RangeIndexWriter.cpp (1)
32-48: Invalidate the handle after a successful closeAfter finalising a range, later misuse of the same handle will currently return
ErrorCodeNotReady. Setting the handle tostd::numeric_limits<handle_t>::max()(or similar sentinel) immediately eliminates accidental re-use.range.end_index = end_index; + handle = std::numeric_limits<handle_t>::max();components/core/src/clp_s/ArchiveWriter.hpp (1)
171-174: Provide proper documentation instead of TODO placeholdersBoth
add_field_to_current_rangeandclose_current_rangeare part of the public API and will be consumed by callers outside this class. Detailed comments describing behaviour, failure modes, and thread-safety expectations are essential.components/core/src/clp_s/JsonParser.cpp (1)
512-557: Consider de-duplicating repeated initialisation logicThe lambda
initialize_fields_for_archive(and its twin inparse_from_ir) contains nearly identical code blocks. Extracting this into a private helper method would remove duplication, reduce maintenance overhead, and ensure both code paths evolve together.components/core/src/clp_s/RangeIndexWriter.hpp (1)
51-70: Markadd_value_to_rangeas[[nodiscard]]to surface ignored errorsThe function returns an
ErrorCodewhich can easily be ignored by callers. Adding[[nodiscard]]will prompt the compiler to warn about accidental drops:- [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value) + [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value)(only the attribute added).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
components/core/src/clp_s/ArchiveReaderAdaptor.cpp(2 hunks)components/core/src/clp_s/ArchiveReaderAdaptor.hpp(1 hunks)components/core/src/clp_s/ArchiveWriter.cpp(3 hunks)components/core/src/clp_s/ArchiveWriter.hpp(4 hunks)components/core/src/clp_s/CMakeLists.txt(1 hunks)components/core/src/clp_s/JsonParser.cpp(9 hunks)components/core/src/clp_s/RangeIndexWriter.cpp(1 hunks)components/core/src/clp_s/RangeIndexWriter.hpp(1 hunks)components/core/src/clp_s/SingleFileArchiveDefs.hpp(2 hunks)components/core/src/clp_s/archive_constants.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/SingleFileArchiveDefs.hppcomponents/core/src/clp_s/ArchiveReaderAdaptor.cppcomponents/core/src/clp_s/archive_constants.hppcomponents/core/src/clp_s/ArchiveReaderAdaptor.hppcomponents/core/src/clp_s/ArchiveWriter.cppcomponents/core/src/clp_s/RangeIndexWriter.cppcomponents/core/src/clp_s/ArchiveWriter.hppcomponents/core/src/clp_s/RangeIndexWriter.hppcomponents/core/src/clp_s/JsonParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (8)
components/core/src/clp_s/CMakeLists.txt (1)
116-117: LGTM! Added RangeIndexWriter source files to the build system.The new source files for the RangeIndexWriter implementation have been correctly added to the CLP_S_SOURCES variable in the build system.
components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)
4-6: LGTM! Necessary includes added.The addition of
<cstdint>and<vector>headers provides the necessary declarations for types used in this file.
14-14: LGTM! Archive version incremented appropriately.The patch version has been incremented from 0 to 1, which is appropriate for adding a new feature in a backward-compatible manner.
38-38: LGTM! New RangeIndex metadata packet type added.Added the RangeIndex packet type with value 3 to the ArchiveMetadataPacketType enum, which aligns with the PR objective of implementing range index functionality.
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)
93-99: LGTM! Added method to handle unknown metadata packets.The new method allows the reader to properly handle and skip unknown metadata packets, enabling forward compatibility when new packet types (like the range index) are added in future versions.
components/core/src/clp_s/archive_constants.hpp (1)
39-43: LGTM! Added range index metadata constants.The new namespace with constants for range index metadata field names provides a centralized place to define these keys, promoting consistency throughout the codebase. The constants align with the requirements described in the PR objectives.
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
174-186: Good defensive handling of future packet typesExplicitly consuming – rather than silently ignoring – unknown metadata packets maintains decompressor state and forward-compatibility. Well done!
components/core/src/clp_s/RangeIndexWriter.cpp (1)
18-30: Inclusive vs. exclusive range boundary is ambiguous
open_rangeallowsstart_indexto be equal to a previous range’send_indexbecause the overlap check uses<rather than<=.
Confirm whether ranges are [inclusive, exclusive) or [inclusive, inclusive]; if the latter, equality should be rejected.- || start_index < it->end_index.value()) + || start_index <= it->end_index.value()) // Reject touching ranges if both bounds are inclusive
wraymo
left a comment
There was a problem hiding this comment.
This is a great PR with a very clean design and implementation! I just have a few questions.
| } | ||
|
|
||
| auto | ||
| ArchiveReaderAdaptor::try_read_unknown_metadata_packet(ZstdDecompressor& decompressor, size_t size) |
There was a problem hiding this comment.
What is an unknown metadata packet?
There was a problem hiding this comment.
Any metadata packet that this version of the code isn't familiar with. Generally I'm trying to make it so that if you use this version of the code to read a later archive format with more metadata packet types the code will just skip the metadata packet and keep reading as normal.
For this PR specifically it also makes it so that I can split the read side code into a separate PR but let the archive still be readable.
| compressor.write_numeric_value(num_packets); | ||
| uint8_t num_optional_packets{0ULL}; | ||
| if (false == m_range_index_writer.empty()) { | ||
| ++num_optional_packets; |
There was a problem hiding this comment.
Do we have any other optional packets?
There was a problem hiding this comment.
By the spec I think the timestamp dictionary is optional, but we've always written it even when it is unused for the whole time we've had the metadata section.
Should I open an issue to refactor the timestamp dictionary writing code to actually write it only when it is used?
|
|
||
| size_t start_index{0ULL}; | ||
| std::optional<size_t> end_index{std::nullopt}; | ||
| std::map<std::string, nlohmann::json> fields; |
There was a problem hiding this comment.
Is that to support multiple data types and make serialization easier with the nlohmann library?
There was a problem hiding this comment.
Yep, that's the idea -- nlohmann makes it easy to serialize/derialize msgpack + makes it easy to support multiple types. Since the metadata section is relatively small I don't think that using nlohmann will cause any performance problems for us.
| log_event_idx_node_id | ||
| = add_metadata_field(constants::cLogEventIdxName, NodeType::Integer); | ||
| } | ||
| if (auto const rc = m_archive_writer->add_field_to_current_range( |
There was a problem hiding this comment.
Is it possible to deduplicate the code from line 985-1023 with code from line 517-555
There was a problem hiding this comment.
I think it's a bit hard with the way the code is currently written. In general I think we want to deduplicate a lot of the code between the kv-ir/json parse_X flows though. Will give it some thought.
There was a problem hiding this comment.
I think it might be a somewhat substantial refactor to merge the kvir/json parsing flow in a way that lets us share a lot of this code at the moment.
I think we could maybe do it by doing something like the IrUnitHandlerInterface concept in Deserializer where we define a class that holds onto state relevant to a specific parsing flow + defines handlers that get invoked in different parts of the parsing flow e.g. "parse the next record", "handle the start of a new archive", etc..
Then we could handle most stuff like populating the default metadata fields + adding the log_event_idx metadata field to each log record separately from the core parsing stuff.
Maybe I can put up a github issue to that effect?
| if (ErrorCodeSuccess != m_archive_writer->close_current_range()) { | ||
| m_archive_writer->close(); | ||
| return false; | ||
| } | ||
|
|
There was a problem hiding this comment.
Do you think it's better to close the range after checking the simdjson errors?
There was a problem hiding this comment.
Good point, will move it after.
| for (auto const& range : m_ranges) { | ||
| if (start_index < range.start_index || false == range.end_index.has_value() | ||
| || start_index < range.end_index.value()) | ||
| { | ||
| return std::errc::invalid_argument; | ||
| } | ||
| } |
There was a problem hiding this comment.
Do we want to print an error message in this case (and probably better error code)?
| return handle; | ||
| } | ||
|
|
||
| auto RangeIndexWriter::close_range(handle_t handle, size_t end_index) -> ErrorCode { |
There was a problem hiding this comment.
Not sure if handle always equals to m_ranges.size() -1. If that's the case, we may get rid of this variable.
There was a problem hiding this comment.
Yeah, I think it could potentially help catch some weird bugs, but probably not really necessary. I'll get rid of it.
|
For the range index schema, does the following make sense? |
I think it would make it a little bit more complicated to search against compared to the current design, particularly for queries that touch multiple metadata fields. |
There was a problem hiding this comment.
Actionable comments posted: 10
♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.cpp (1)
46-76:⚠️ Potential issuePossible 4 GiB overflow when writing the serialised blob
serialized_range_index.size()is cast touint32_t; blobs ≥ 4 294 967 296 bytes wrap to 0. Detect overflow and report an error:- writer.write_numeric_value<uint32_t>(serialized_range_index.size()); + if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) { + return ErrorCodeUnsupported; // Too large to encode safely + } + writer.write_numeric_value<uint32_t>( + static_cast<uint32_t>(serialized_range_index.size()));A future enhancement would avoid the temporary buffer and stream the MessagePack directly into the compressor, eliminating the RAM spike altogether.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/src/clp_s/ArchiveWriter.cpp(3 hunks)components/core/src/clp_s/ArchiveWriter.hpp(4 hunks)components/core/src/clp_s/JsonParser.cpp(9 hunks)components/core/src/clp_s/RangeIndexWriter.cpp(1 hunks)components/core/src/clp_s/RangeIndexWriter.hpp(1 hunks)
🧰 Additional context used
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp
[style] 18-18: The function 'open_range' is never used.
(unusedFunction)
[style] 31-31: The function 'close_range' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (1)
60-66: Avoid suppressing a failure when closing the last open range
close()callsclose_current_range()and then simply proceeds, regardless of the error code already returned.
Ifclose_current_range()fails (e.g., the end index is smaller than the start index), the exception is thrown and execution never reaches this block, so the logic is fine. But if future refactors convertclose_current_range()to return an error code instead of throwing inside, the surrounding code path must respect that.No immediate action needed, but keep an eye on error-propagation semantics if you change the exception policy.
components/core/src/clp_s/ArchiveWriter.hpp (1)
178-189:⚠️ Potential issueDon’t mark the range as open until the open succeeds
If
open_range()returns an error,m_range_openis still set totrue, leaving the writer in an inconsistent state and guaranteeing that subsequent calls will route toadd_value_to_range()without a valid range.- if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id); - ErrorCodeSuccess != rc) - { - return rc; - } - m_range_open = true; + if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id); + ErrorCodeSuccess != rc) + { + return rc; // keep flag unchanged + } + m_range_open = true; // safe: open succeededLikely an incorrect or invalid review comment.
components/core/src/clp_s/JsonParser.cpp (1)
510-556: Possible missingopen_range()call before populating range fields
initialize_fields_for_archive()immediately callsadd_field_to_current_range(...).
Ifclose_current_range()(line 659) actually closes the only open range and does not implicitly open the next one, the subsequentinitialize_fields_for_archive()in the next file-iteration will operate on a writer that has no open range, causingErrorCodeNotReady.Please confirm that
ArchiveWriter::close_current_range()guarantees a fresh range is open afterwards.
If it does not, prepend an explicitopen_range(start_idx)(or equivalent) before the first metadata insertion.
There was a problem hiding this comment.
Actionable comments posted: 7
♻️ Duplicate comments (4)
components/core/src/clp_s/ArchiveWriter.hpp (1)
289-290: Member ordering still wastes padding bytesPrevious review pointed out that placing the heavyweight
RangeIndexWriterfirst and the 1-byteboolafterwards shrinks the struct on most ABIs. The order has not been adjusted.components/core/src/clp_s/RangeIndexWriter.cpp (2)
70-76: 4 GiB overflow & silent truncation still present
serialized_range_index.size()is written as auint32_twithout a bounds check.
Archives whose range-index exceeds 4 294 967 295 bytes will be silently corrupted.This was raised before; please apply the suggested guard:
- writer.write_numeric_value<uint32_t>(serialized_range_index.size()); + if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) { + return ErrorCodeUnsupported; // too large to encode safely + } + writer.write_numeric_value<uint32_t>( + static_cast<uint32_t>(serialized_range_index.size()));
75-76: Lost diagnostic information makes debugging harderCatching
std::exceptionand discardinge.what()removes valuable context.
At minimum, log the message before returning.- } catch (std::exception const& e) { - return ErrorCodeFailure; + } catch (std::exception const& e) { + SPDLOG_ERROR("Failed to serialise range index: {}", e.what()); + return ErrorCodeFailure; }components/core/src/clp_s/JsonParser.cpp (1)
510-556: Initialisation block duplicated – extract a helperThe lambda
initialize_fields_for_archiveis 40 + lines long and appears almost verbatim in both the JSON and KV-IR paths.
Maintaining two copies will lead to drift.A small private helper such as
bool JsonParser::init_range_metadata( std::string_view filename, size_t& file_split_number, std::string const& archive_creator_id, std::optional<nlohmann::json> const& extra_user_md = std::nullopt);would remove the duplication and make future changes (e.g., adding another field) less error-prone.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/ArchiveWriter.hpp(4 hunks)components/core/src/clp_s/JsonParser.cpp(9 hunks)components/core/src/clp_s/RangeIndexWriter.cpp(1 hunks)components/core/src/clp_s/RangeIndexWriter.hpp(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
components/core/src/clp_s/RangeIndexWriter.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp
[style] 17-17: The function 'open_range' is never used.
(unusedFunction)
[style] 30-30: The function 'close_range' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/RangeIndexWriter.hpp (3)
59-72: The error handling looks goodThe template method properly checks for error conditions before proceeding:
- Verifies that there's at least one range
- Ensures the most recent range is not already closed
- Prevents duplicate keys in a range's fields
The error codes returned (ErrorCodeNotReady, ErrorCodeBadParam) clearly indicate the failure reason.
93-101: Good design of the Range structThe Range struct is well-designed with:
- An explicit constructor that requires a start index
- Default initialization of start_index (though redundant with the constructor)
- Use of std::optional for end_index to represent ranges that haven't been closed yet
- std::map for fields to ensure sorted keys and efficient lookups
The use of nlohmann::json for field values provides flexibility for storing different types of metadata.
21-36: Clear documentation of the range index structureThe class documentation clearly explains the purpose and structure of the range index, including the non-overlapping constraint and support for empty ranges. This will be valuable for maintainers working with this code in the future.
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (6)
components/core/src/clp_s/ArchiveWriter.hpp (3)
178-183: Preferstd::string_viewto avoid an extra copyThe key is never modified or stored past the call, so accepting
std::string_viewavoids an
allocation when the caller already holds a string literal or view.Same optimisation was recommended previously.
290-291: 🧹 Nitpick (assertive)Minor padding win: place the bool after the heavy writer
Re-ordering the members saves padding on most platforms and keeps the boolean next to the flag it
controls.- bool m_range_open{false}; - RangeIndexWriter m_range_index_writer; + RangeIndexWriter m_range_index_writer; + bool m_range_open{false};
178-190:⚠️ Potential issueGuard against half-opened ranges
add_field_to_current_rangeleavesm_range_open == truewhenadd_value_to_range()fails.
Callers cannot detect that the range is only partially initialised, and a subsequent attempt to
re-open/close will return confusingNotReadyerrors.- return m_range_index_writer.add_value_to_range(key, value); + auto const rc = m_range_index_writer.add_value_to_range(key, value); + if (ErrorCodeSuccess != rc) { + // roll back: the range is unusable + m_range_open = false; + } + return rc;This mirrors the defensive logic suggested for
close_current_range()in an earlier review.components/core/src/clp_s/RangeIndexWriter.cpp (2)
76-78: Surface exception detailsThe catch block converts every exception into
ErrorCodeFailure, discardinge.what(). Logging the
message (spdlog orSPDLOG_ERROR) would aid post-mortem debugging and aligns with prior feedback.
70-75:⚠️ Potential issuePossible 4 GiB overflow when serialising
serialized_range_index.size()is cast touint32_twithout bounds checking. Archives whose range
index exceeds 4 294 967 295 bytes will be silently truncated, corrupting the metadata.- writer.write_numeric_value<uint32_t>(serialized_range_index.size()); + if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) { + return ErrorCodeUnsupported; // too large to encode safely + } + writer.write_numeric_value<uint32_t>( + static_cast<uint32_t>(serialized_range_index.size()));components/core/src/clp_s/JsonParser.cpp (1)
512-557: Repeated metadata initialisation – consider extracting a helperThe lambda
initialize_fields_for_archivehere is virtually identical to the one in the
kv-ir branch (lines 981-1047). Duplicated 40-line blocks are hard to keep in sync.A small private helper such as
bool JsonParser::init_range_metadata( Path const& path, size_t file_split_number, std::string const& archive_creator_id, nlohmann::json const* extra_ir_metadata = nullptr);would eliminate duplication and make future fixes affect both paths.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/ArchiveWriter.hpp(4 hunks)components/core/src/clp_s/JsonParser.cpp(9 hunks)components/core/src/clp_s/RangeIndexWriter.cpp(1 hunks)components/core/src/clp_s/RangeIndexWriter.hpp(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
components/core/src/clp_s/RangeIndexWriter.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp
[style] 17-17: The function 'open_range' is never used.
(unusedFunction)
[style] 31-31: The function 'close_range' is never used.
(unusedFunction)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)
1-107: Overall implementation looks robustThe RangeIndexWriter class provides a well-documented and clean interface for building and serializing range index data. The error handling is comprehensive, checking for invalid operations like adding values to non-existent or closed ranges. The class effectively maintains the invariant that ranges cannot overlap.
Some implementation details will depend on the .cpp file, but the header establishes a solid foundation for the functionality described in the PR objectives.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
components/core/src/clp_s/ArchiveWriter.hpp (1)
171-190: 🧹 Nitpick (assertive)Consider using
std::string_viewfor more efficient key parameterThe
add_field_to_current_rangemethod currently accepts astd::string const¶meter, which forces callers to construct/own an actualstd::string. Usingstd::string_viewwould avoid unnecessary heap allocations when keys already exist elsewhere (common in parsers).- template <typename T> - [[nodiscard]] auto add_field_to_current_range(std::string const& key, T const& value) - -> ErrorCode { + template <typename T> + [[nodiscard]] auto add_field_to_current_range(std::string_view key, T const& value) + -> ErrorCode {components/core/src/clp_s/RangeIndexWriter.hpp (2)
40-42: 🧹 Nitpick (assertive)Consider grouping serialization constants in a struct
To better represent the structure of your serialized format and make the relationship between these constants more explicit, consider grouping them in a struct.
-static constexpr std::string_view cStartIndexName{"s"}; -static constexpr std::string_view cEndIndexName{"e"}; -static constexpr std::string_view cMetadataFieldsName{"f"}; +struct SerializationKeys { + static constexpr std::string_view start{"s"}; + static constexpr std::string_view end{"e"}; + static constexpr std::string_view fields{"f"}; +};
57-71: 🧹 Nitpick (assertive)Consider adding type constraints to the template method
The template method
add_value_to_rangeaccepts any type T without constraints, which could lead to compilation errors if types incompatible with nlohmann::json are used.If using C++20:
-template <typename T> +template <typename T> + requires nlohmann::detail::has_to_json<nlohmann::json, T>::value [[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {Or for C++17:
-template <typename T> +template <typename T, + typename = std::enable_if_t<nlohmann::detail::has_to_json<nlohmann::json, T>::value>> [[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/ArchiveWriter.hpp(4 hunks)components/core/src/clp_s/RangeIndexWriter.hpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/ArchiveWriter.hppcomponents/core/src/clp_s/RangeIndexWriter.hpp
🔇 Additional comments (4)
components/core/src/clp_s/ArchiveWriter.hpp (2)
196-202: Good implementation ofclose_current_rangeThe implementation correctly only resets
m_range_openupon successful closure, preserving the error state and allowing callers to retry after fixing issues.
290-291: Optimal member ordering for efficient memory layoutGood placement of the lightweight boolean flag next to the writer object to reduce padding in the struct.
components/core/src/clp_s/RangeIndexWriter.hpp (2)
67-69: Efficient use of try_emplaceGood use of
try_emplacewhich performs existence check and insertion in a single operation, rather than using two separate operations (find + emplace).
11-16:Details
❓ Verification inconclusive
Confirm if both JSON includes are necessary
While the comment explains why both headers are needed, typically including
json.hppalready pulls in all forward declarations, makingjson_fwd.hppredundant. This could slow down compilation.
🏁 Script executed:
#!/bin/bash # Check if removing json_fwd.hpp breaks compilation # First, let's check what each header actually includes echo "Examining json.hpp content:" head -n 20 $(fd -t f "json.hpp" | head -n 1) echo "Examining json_fwd.hpp content:" head -n 20 $(fd -t f "json_fwd.hpp" | head -n 1) # Check if json.hpp already includes json_fwd.hpp grep -A 2 "json_fwd.hpp" $(fd -t f "json.hpp" | head -n 1)Length of output: 291
Verify necessity of json_fwd.hpp when including json.hpp
I wasn’t able to locate the external nlohmann headers in the repo to confirm if
<nlohmann/json.hpp>already pulls in the forward declarations fromjson_fwd.hpp. Please check in your local environment whetherjson.hppincludesjson_fwd.hppand if so remove the redundant include to speed up compilation.• File: components/core/src/clp_s/RangeIndexWriter.hpp (lines 11–16)
#include <nlohmann/json.hpp> #include <nlohmann/json_fwd.hpp>Please verify and update accordingly.
kirkrodrigues
left a comment
There was a problem hiding this comment.
Reviewed CMakeLists.txt
|
Just making sure that the unit tests I have written so far in the read side PR pass when rebased on top of this branch, then I will merge. |
Seems to be passing, will merge. |
Description
This PR adds the write side of the archive range index as described in #737.
In the current implementation each split of a file has an associated range index entry that looks something like the following:
{"start_idx": N, "end_index": M, "properties": {...}}This records properties associated with that split of a file corresponding to the logical range of the archive [N, M).
The range index will take in user-defined metadata from kv-ir streams and also always set the following properties for each range:
_filename: The full path to the file corresponding to the range_file_split_number: Which split of a file corresponds to the range (0-indexed and incremented each time an archive is split while ingesting a file)_archive_creator_id: UUID associated with a specific invocation of compression -- allows disambiguating files with the same path compressed across different compression invocationsMost of the logic for writing and maintaining the range index is contained in
RangeIndexWriter.{cpp,hpp}, this class is wrapped inArchiveWriter.{cpp,hpp}, and properties are written in the main parsing loop(s) ofJsonParser.cpp.We bump the minor version of the archive format in this PR (new version is
0.3.1) because this does constitute a change to the archive format, but the change is non-breaking (0.3.0archives that do not have the archive range index metadata packet are valid0.3.1archives).Proper tests will be added in the read-side PR.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes
Chores