feat(clp-s): Add the write-side implementation for the archive range index. by gibber9809 · Pull Request #847 · y-scope/clp

gibber9809 · 2025-04-24T17:17:19Z

Description

This PR adds the write side of the archive range index as described in #737.

In the current implementation each split of a file has an associated range index entry that looks something like the following:
{"start_idx": N, "end_index": M, "properties": {...}}
This records properties associated with that split of a file corresponding to the logical range of the archive [N, M).

The range index will take in user-defined metadata from kv-ir streams and also always set the following properties for each range:
_filename: The full path to the file corresponding to the range
_file_split_number: Which split of a file corresponds to the range (0-indexed and incremented each time an archive is split while ingesting a file)
_archive_creator_id: UUID associated with a specific invocation of compression -- allows disambiguating files with the same path compressed across different compression invocations

Most of the logic for writing and maintaining the range index is contained in RangeIndexWriter.{cpp,hpp}, this class is wrapped in ArchiveWriter.{cpp,hpp}, and properties are written in the main parsing loop(s) of JsonParser.cpp.

We bump the minor version of the archive format in this PR (new version is 0.3.1) because this does constitute a change to the archive format, but the change is non-breaking (0.3.0 archives that do not have the archive range index metadata packet are valid 0.3.1 archives).

Proper tests will be added in the read-side PR.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Manually validated that range index metadata section can be deserialized and has expected contents both for common case and during archive splitting.

Summary by CodeRabbit

New Features
- Introduced range-based metadata tracking for archives, including file split numbering and unique archive creator IDs.
- Added support for associating custom metadata fields with archive ranges and enhanced propagation of user-defined metadata.
- Enhanced archive format with a new range index structure for detailed metadata serialization.
- Added new methods to manage archive ranges, including adding fields and closing ranges.
Bug Fixes
- Improved handling of unknown metadata packets to maintain decompressor state and archive integrity.
Chores
- Updated archive patch version and build configuration to include new range index components.

… index

…0.3.1

coderabbitai · 2025-04-24T17:17:26Z

Walkthrough

This update introduces a range index feature to the archive writing and reading process. The ArchiveWriter class is extended with new methods and members to manage range metadata, including the addition and closure of range fields, tracked via a new RangeIndexWriter class. The archive metadata structure is updated to include a new packet type for range indices, and the archive patch version is incremented. The JsonParser logic is revised to generate and propagate archive creator IDs, file split numbers, and user metadata into the range index. The reader is updated to explicitly handle and skip unknown metadata packets, ensuring decompressor state consistency. New constants and CMake entries are added to support these features.

Changes

File(s)	Change Summary
components/core/src/clp_s/ArchiveReaderAdaptor.cpp, components/core/src/clp_s/ArchiveReaderAdaptor.hpp	Added a private method to read and discard unknown metadata packets, updating the metadata reading logic to use this method for unknown packet types, ensuring the decompressor's state is maintained.
components/core/src/clp_s/ArchiveWriter.cpp	Enhanced `close()` to handle range index closure, updated `write_archive_metadata()` to dynamically include range index packets and write range index data with error handling.
components/core/src/clp_s/ArchiveWriter.hpp	Added `add_field_to_current_range` (templated) and `close_current_range` public methods; introduced `m_range_open` flag and `m_range_index_writer` member for range management; updated includes.
components/core/src/clp_s/CMakeLists.txt	Included `RangeIndexWriter.cpp` and `RangeIndexWriter.hpp` in the build sources.
components/core/src/clp_s/JsonParser.cpp	Added logic to generate archive creator IDs and file split numbers; replaced and expanded field initialization for archive ranges; propagated user metadata in IR parsing; improved error handling for metadata/range operations; ensured range closure at file end.
components/core/src/clp_s/RangeIndexWriter.cpp, components/core/src/clp_s/RangeIndexWriter.hpp	Introduced new `RangeIndexWriter` class for managing and serializing archive range indices, with methods for opening, updating, closing, and writing ranges, as well as error handling and validation.
components/core/src/clp_s/SingleFileArchiveDefs.hpp	Incremented archive patch version; added `RangeIndex` metadata packet type to the enum; included new headers.
components/core/src/clp_s/archive_constants.hpp	Added new constants for range index metadata keys in a nested `range_index` namespace.
components/core/CMakeLists.txt	Added `src/clp_s/RangeIndexWriter.cpp` and `src/clp_s/RangeIndexWriter.hpp` to the unit test source files list.

Sequence Diagram(s)

sequenceDiagram
    participant JsonParser
    participant ArchiveWriter
    participant RangeIndexWriter
    participant ZstdCompressor

    JsonParser->>ArchiveWriter: add_field_to_current_range(key, value)
    ArchiveWriter->>RangeIndexWriter: add_value_to_range(key, value)
    JsonParser->>ArchiveWriter: close_current_range()
    ArchiveWriter->>RangeIndexWriter: close_range(end_index)
    ArchiveWriter->>ArchiveWriter: write_archive_metadata()
    ArchiveWriter->>RangeIndexWriter: write(ZstdCompressor)
    RangeIndexWriter->>ZstdCompressor: write msgpack-encoded range index

Suggested reviewers

wraymo

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47d0e94 and 36c1fa0.

📒 Files selected for processing (1)

components/core/CMakeLists.txt (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (9)

GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: build-macos (macos-13, true)

🔇 Additional comments (1)

components/core/CMakeLists.txt (1)

320-321: Inclusion of RangeIndexWriter in clp_s unit tests

The new RangeIndexWriter.cpp and RangeIndexWriter.hpp files have been correctly added to the SOURCE_FILES_clp_s_unitTest list, ensuring the write-side archive range index functionality is covered by existing unit tests.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (7)

components/core/src/clp_s/ArchiveWriter.cpp (2)
155-160: Prefer a constexpr for constant packets

num_constant_packets never changes; turning it into a constexpr clarifies intent and enables compiler optimisation.
-    uint8_t const num_constant_packets{3U};
+    constexpr uint8_t num_constant_packets{3U};
187-191: Avoid an unconditional write when the range index is empty

RangeIndexWriter::write() returns immediately when empty, but calling it anyway incurs a function call and makes the intent less obvious. A small guard keeps the code self-documenting.
-    // Write range index
-    if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) {
+    // Write range index (only if present)
+    if (false == m_range_index_writer.empty()) {
+        if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) {
             throw OperationFailed(rc, __FILENAME__, __LINE__);
-    }
+        }
+    }
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)
100-106: Reading unknown packets can exhaust memory – stream instead of bulk-allocating

Allocating a std::vector<char>(size) for every unknown packet may allocate many GiB for a malformed or adversarial archive. Consider discarding the data in fixed-size blocks to cap memory usage.
-    std::vector<char> buffer(size);
-    return decompressor.try_read_exact_length(buffer.data(), buffer.size());
+    constexpr size_t cBufSize = 64 * 1024;
+    std::array<char, cBufSize> buf{};
+    size_t bytes_remaining = size;
+    while (bytes_remaining > 0) {
+        size_t to_read = std::min(bytes_remaining, cBufSize);
+        auto rc = decompressor.try_read_exact_length(buf.data(), to_read);
+        if (ErrorCodeSuccess != rc) {
+            return rc;
+        }
+        bytes_remaining -= to_read;
+    }
+    return ErrorCodeSuccess;
components/core/src/clp_s/RangeIndexWriter.cpp (1)
32-48: Invalidate the handle after a successful close

After finalising a range, later misuse of the same handle will currently return ErrorCodeNotReady. Setting the handle to std::numeric_limits<handle_t>::max() (or similar sentinel) immediately eliminates accidental re-use.
     range.end_index = end_index;
+    handle = std::numeric_limits<handle_t>::max();
components/core/src/clp_s/ArchiveWriter.hpp (1)

171-174: Provide proper documentation instead of TODO placeholders

Both add_field_to_current_range and close_current_range are part of the public API and will be consumed by callers outside this class. Detailed comments describing behaviour, failure modes, and thread-safety expectations are essential.

components/core/src/clp_s/JsonParser.cpp (1)

512-557: Consider de-duplicating repeated initialisation logic

The lambda initialize_fields_for_archive (and its twin in parse_from_ir) contains nearly identical code blocks. Extracting this into a private helper method would remove duplication, reduce maintenance overhead, and ensure both code paths evolve together.
components/core/src/clp_s/RangeIndexWriter.hpp (1)
51-70: Mark add_value_to_range as [[nodiscard]] to surface ignored errors

The function returns an ErrorCode which can easily be ignored by callers. Adding [[nodiscard]] will prompt the compiler to warn about accidental drops:
-    [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value)
+    [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value)
(only the attribute added).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6300fa5 and c308293.

📒 Files selected for processing (10)

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (2 hunks)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
components/core/src/clp_s/CMakeLists.txt (1 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
components/core/src/clp_s/SingleFileArchiveDefs.hpp (2 hunks)
components/core/src/clp_s/archive_constants.hpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/SingleFileArchiveDefs.hpp
components/core/src/clp_s/ArchiveReaderAdaptor.cpp
components/core/src/clp_s/archive_constants.hpp
components/core/src/clp_s/ArchiveReaderAdaptor.hpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/src/clp_s/RangeIndexWriter.cpp
components/core/src/clp_s/ArchiveWriter.hpp
components/core/src/clp_s/RangeIndexWriter.hpp
components/core/src/clp_s/JsonParser.cpp

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-15, false)
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-13, true)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: lint-check (macos-latest)

🔇 Additional comments (8)

components/core/src/clp_s/CMakeLists.txt (1)

116-117: LGTM! Added RangeIndexWriter source files to the build system.

The new source files for the RangeIndexWriter implementation have been correctly added to the CLP_S_SOURCES variable in the build system.

components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)

4-6: LGTM! Necessary includes added.

The addition of <cstdint> and <vector> headers provides the necessary declarations for types used in this file.

14-14: LGTM! Archive version incremented appropriately.

The patch version has been incremented from 0 to 1, which is appropriate for adding a new feature in a backward-compatible manner.

38-38: LGTM! New RangeIndex metadata packet type added.

Added the RangeIndex packet type with value 3 to the ArchiveMetadataPacketType enum, which aligns with the PR objective of implementing range index functionality.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

93-99: LGTM! Added method to handle unknown metadata packets.

The new method allows the reader to properly handle and skip unknown metadata packets, enabling forward compatibility when new packet types (like the range index) are added in future versions.

components/core/src/clp_s/archive_constants.hpp (1)

39-43: LGTM! Added range index metadata constants.

The new namespace with constants for range index metadata field names provides a centralized place to define these keys, promoting consistency throughout the codebase. The constants align with the requirements described in the PR objectives.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

174-186: Good defensive handling of future packet types

Explicitly consuming – rather than silently ignoring – unknown metadata packets maintains decompressor state and forward-compatibility. Well done!
components/core/src/clp_s/RangeIndexWriter.cpp (1)

18-30: Inclusive vs. exclusive range boundary is ambiguous

open_range allows start_index to be equal to a previous range’s end_index because the overlap check uses < rather than <=.
Confirm whether ranges are [inclusive, exclusive) or [inclusive, inclusive]; if the latter, equality should be rejected.
-        || start_index < it->end_index.value())
+        || start_index <= it->end_index.value())   // Reject touching ranges if both bounds are inclusive

wraymo

This is a great PR with a very clean design and implementation! I just have a few questions.

wraymo · 2025-05-01T17:29:03Z

 }

+auto
+ArchiveReaderAdaptor::try_read_unknown_metadata_packet(ZstdDecompressor& decompressor, size_t size)


What is an unknown metadata packet?

Any metadata packet that this version of the code isn't familiar with. Generally I'm trying to make it so that if you use this version of the code to read a later archive format with more metadata packet types the code will just skip the metadata packet and keep reading as normal.

For this PR specifically it also makes it so that I can split the read side code into a separate PR but let the archive still be readable.

wraymo · 2025-05-01T17:52:42Z

-    compressor.write_numeric_value(num_packets);
+    uint8_t num_optional_packets{0ULL};
+    if (false == m_range_index_writer.empty()) {
+        ++num_optional_packets;


Do we have any other optional packets?

By the spec I think the timestamp dictionary is optional, but we've always written it even when it is unused for the whole time we've had the metadata section.

Should I open an issue to refactor the timestamp dictionary writing code to actually write it only when it is used?

wraymo · 2025-05-01T18:09:24Z

+
+        size_t start_index{0ULL};
+        std::optional<size_t> end_index{std::nullopt};
+        std::map<std::string, nlohmann::json> fields;


Is that to support multiple data types and make serialization easier with the nlohmann library?

Yep, that's the idea -- nlohmann makes it easy to serialize/derialize msgpack + makes it easy to support multiple types. Since the metadata section is relatively small I don't think that using nlohmann will cause any performance problems for us.

wraymo · 2025-05-01T18:24:59Z

                log_event_idx_node_id
                        = add_metadata_field(constants::cLogEventIdxName, NodeType::Integer);
            }
+            if (auto const rc = m_archive_writer->add_field_to_current_range(


Is it possible to deduplicate the code from line 985-1023 with code from line 517-555

I think it's a bit hard with the way the code is currently written. In general I think we want to deduplicate a lot of the code between the kv-ir/json parse_X flows though. Will give it some thought.

I think it might be a somewhat substantial refactor to merge the kvir/json parsing flow in a way that lets us share a lot of this code at the moment.

I think we could maybe do it by doing something like the IrUnitHandlerInterface concept in Deserializer where we define a class that holds onto state relevant to a specific parsing flow + defines handlers that get invoked in different parts of the parsing flow e.g. "parse the next record", "handle the start of a new archive", etc..

Then we could handle most stuff like populating the default metadata fields + adding the log_event_idx metadata field to each log record separately from the core parsing stuff.

Maybe I can put up a github issue to that effect?

wraymo · 2025-05-01T18:37:09Z

+        if (ErrorCodeSuccess != m_archive_writer->close_current_range()) {
+            m_archive_writer->close();
+            return false;
+        }
+


Do you think it's better to close the range after checking the simdjson errors?

Good point, will move it after.

wraymo · 2025-05-01T18:40:39Z

+    for (auto const& range : m_ranges) {
+        if (start_index < range.start_index || false == range.end_index.has_value()
+            || start_index < range.end_index.value())
+        {
+            return std::errc::invalid_argument;
+        }
+    }


Do we want to print an error message in this case (and probably better error code)?

wraymo · 2025-05-01T18:43:23Z

+    return handle;
+}
+
+auto RangeIndexWriter::close_range(handle_t handle, size_t end_index) -> ErrorCode {


Not sure if handle always equals to m_ranges.size() -1. If that's the case, we may get rid of this variable.

Yeah, I think it could potentially help catch some weird bugs, but probably not really necessary. I'll get rid of it.

wraymo · 2025-05-01T18:52:45Z

For the range index schema, does the following make sense?

[{
  "tag_name": ...,
  "tag_type": ...,
  "ranges": [{
      "start": ...,
      "end": ...,
      "value": ...,
      }, ...
   ]
}, ...
]

gibber9809 · 2025-05-01T19:11:33Z

For the range index schema, does the following make sense?

[{
  "tag_name": ...,
  "tag_type": ...,
  "ranges": [{
      "start": ...,
      "end": ...,
      "value": ...,
      }, ...
   ]
}, ...
]

I think it would make it a little bit more complicated to search against compared to the current design, particularly for queries that touch multiple metadata fields.

coderabbitai

Actionable comments posted: 10

♻️ Duplicate comments (1)

components/core/src/clp_s/RangeIndexWriter.cpp (1)
46-76: ⚠️ Potential issue

Possible 4 GiB overflow when writing the serialised blob

serialized_range_index.size() is cast to uint32_t; blobs ≥ 4 294 967 296 bytes wrap to 0. Detect overflow and report an error:
-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // Too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+            static_cast<uint32_t>(serialized_range_index.size()));
A future enhancement would avoid the temporary buffer and stream the MessagePack directly into the compressor, eliminating the RAM spike altogether.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 011bd01 and c06163f.

📒 Files selected for processing (5)

components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)

🧰 Additional context used

🪛 Cppcheck (2.10-2)

components/core/src/clp_s/RangeIndexWriter.cpp

[style] 18-18: The function 'open_range' is never used.

(unusedFunction)

[style] 31-31: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-13, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: lint-check (macos-latest)
GitHub Check: lint-check (ubuntu-latest)

🔇 Additional comments (3)

components/core/src/clp_s/ArchiveWriter.cpp (1)

60-66: Avoid suppressing a failure when closing the last open range

close() calls close_current_range() and then simply proceeds, regardless of the error code already returned.
If close_current_range() fails (e.g., the end index is smaller than the start index), the exception is thrown and execution never reaches this block, so the logic is fine. But if future refactors convert close_current_range() to return an error code instead of throwing inside, the surrounding code path must respect that.

No immediate action needed, but keep an eye on error-propagation semantics if you change the exception policy.
components/core/src/clp_s/ArchiveWriter.hpp (1)

178-189: ⚠️ Potential issue

Don’t mark the range as open until the open succeeds

If open_range() returns an error, m_range_open is still set to true, leaving the writer in an inconsistent state and guaranteeing that subsequent calls will route to add_value_to_range() without a valid range.
-            if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id);
-                ErrorCodeSuccess != rc)
-            {
-                return rc;
-            }
-            m_range_open = true;
+            if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id);
+                ErrorCodeSuccess != rc)
+            {
+                return rc;                       // keep flag unchanged
+            }
+            m_range_open = true;                // safe: open succeeded
Likely an incorrect or invalid review comment.
components/core/src/clp_s/JsonParser.cpp (1)

510-556: Possible missing open_range() call before populating range fields

initialize_fields_for_archive() immediately calls add_field_to_current_range(...).
If close_current_range() (line 659) actually closes the only open range and does not implicitly open the next one, the subsequent initialize_fields_for_archive() in the next file-iteration will operate on a writer that has no open range, causing ErrorCodeNotReady.

Please confirm that ArchiveWriter::close_current_range() guarantees a fresh range is open afterwards.
If it does not, prepend an explicit open_range(start_idx) (or equivalent) before the first metadata insertion.

coderabbitai

Actionable comments posted: 7

♻️ Duplicate comments (4)

components/core/src/clp_s/ArchiveWriter.hpp (1)

289-290: Member ordering still wastes padding bytes

Previous review pointed out that placing the heavyweight RangeIndexWriter first and the 1-byte bool afterwards shrinks the struct on most ABIs. The order has not been adjusted.
components/core/src/clp_s/RangeIndexWriter.cpp (2)
70-76: 4 GiB overflow & silent truncation still present

serialized_range_index.size() is written as a uint32_t without a bounds check.
Archives whose range-index exceeds 4 294 967 295 bytes will be silently corrupted.

This was raised before; please apply the suggested guard:
-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+                static_cast<uint32_t>(serialized_range_index.size()));
75-76: Lost diagnostic information makes debugging harder

Catching std::exception and discarding e.what() removes valuable context.
At minimum, log the message before returning.
-    } catch (std::exception const& e) {
-        return ErrorCodeFailure;
+    } catch (std::exception const& e) {
+        SPDLOG_ERROR("Failed to serialise range index: {}", e.what());
+        return ErrorCodeFailure;
     }
components/core/src/clp_s/JsonParser.cpp (1)
510-556: Initialisation block duplicated – extract a helper

The lambda initialize_fields_for_archive is 40 + lines long and appears almost verbatim in both the JSON and KV-IR paths.
Maintaining two copies will lead to drift.

A small private helper such as
bool JsonParser::init_range_metadata(
        std::string_view filename,
        size_t& file_split_number,
        std::string const& archive_creator_id,
        std::optional<nlohmann::json> const& extra_user_md = std::nullopt);
would remove the duplication and make future changes (e.g., adding another field) less error-prone.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c06163f and c7fc681.

📒 Files selected for processing (4)

components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

components/core/src/clp_s/RangeIndexWriter.cpp (2)

Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.

Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).

🪛 Cppcheck (2.10-2)

components/core/src/clp_s/RangeIndexWriter.cpp

[style] 17-17: The function 'open_range' is never used.

(unusedFunction)

[style] 30-30: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: build-macos (macos-15, false)
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-15, true)
GitHub Check: build-macos (macos-13, true)
GitHub Check: lint-check (macos-latest)
GitHub Check: lint-check (ubuntu-latest)

🔇 Additional comments (3)

components/core/src/clp_s/RangeIndexWriter.hpp (3)

59-72: The error handling looks good

The template method properly checks for error conditions before proceeding:

Verifies that there's at least one range

Ensures the most recent range is not already closed

Prevents duplicate keys in a range's fields

The error codes returned (ErrorCodeNotReady, ErrorCodeBadParam) clearly indicate the failure reason.

93-101: Good design of the Range struct

The Range struct is well-designed with:

An explicit constructor that requires a start index

Default initialization of start_index (though redundant with the constructor)

Use of std::optional for end_index to represent ranges that haven't been closed yet

std::map for fields to ensure sorted keys and efficient lookups

The use of nlohmann::json for field values provides flexibility for storing different types of metadata.

21-36: Clear documentation of the range index structure

The class documentation clearly explains the purpose and structure of the range index, including the non-overlapping constraint and support for empty ranges. This will be valuable for maintainers working with this code in the future.

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (6)

components/core/src/clp_s/ArchiveWriter.hpp (3)
178-183: Prefer std::string_view to avoid an extra copy

The key is never modified or stored past the call, so accepting std::string_view avoids an
allocation when the caller already holds a string literal or view.

Same optimisation was recommended previously.

290-291: 🧹 Nitpick (assertive)

Minor padding win: place the bool after the heavy writer

Re-ordering the members saves padding on most platforms and keeps the boolean next to the flag it
controls.
-    bool m_range_open{false};
-    RangeIndexWriter m_range_index_writer;
+    RangeIndexWriter m_range_index_writer;
+    bool m_range_open{false};
178-190: ⚠️ Potential issue

Guard against half-opened ranges

add_field_to_current_range leaves m_range_open == true when add_value_to_range() fails.
Callers cannot detect that the range is only partially initialised, and a subsequent attempt to
re-open/close will return confusing NotReady errors.
-        return m_range_index_writer.add_value_to_range(key, value);
+        auto const rc = m_range_index_writer.add_value_to_range(key, value);
+        if (ErrorCodeSuccess != rc) {
+            // roll back: the range is unusable
+            m_range_open = false;
+        }
+        return rc;
This mirrors the defensive logic suggested for close_current_range() in an earlier review.
components/core/src/clp_s/RangeIndexWriter.cpp (2)
76-78: Surface exception details

The catch block converts every exception into ErrorCodeFailure, discarding e.what(). Logging the
message (spdlog or SPDLOG_ERROR) would aid post-mortem debugging and aligns with prior feedback.

70-75: ⚠️ Potential issue

Possible 4 GiB overflow when serialising

serialized_range_index.size() is cast to uint32_t without bounds checking. Archives whose range
index exceeds 4 294 967 295 bytes will be silently truncated, corrupting the metadata.
-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+                static_cast<uint32_t>(serialized_range_index.size()));
components/core/src/clp_s/JsonParser.cpp (1)
512-557: Repeated metadata initialisation – consider extracting a helper

The lambda initialize_fields_for_archive here is virtually identical to the one in the
kv-ir branch (lines 981-1047). Duplicated 40-line blocks are hard to keep in sync.

A small private helper such as
bool JsonParser::init_range_metadata(
        Path const& path,
        size_t file_split_number,
        std::string const& archive_creator_id,
        nlohmann::json const* extra_ir_metadata = nullptr);
would eliminate duplication and make future fixes affect both paths.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7fc681 and a8c3705.

📒 Files selected for processing (4)

components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
components/core/src/clp_s/JsonParser.cpp (9 hunks)
components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

components/core/src/clp_s/RangeIndexWriter.cpp (2)

Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.

Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).

🪛 Cppcheck (2.10-2)

components/core/src/clp_s/RangeIndexWriter.cpp

[style] 17-17: The function 'open_range' is never used.

(unusedFunction)

[style] 31-31: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)

GitHub Check: centos-stream-9-dynamic-linked-bins
GitHub Check: centos-stream-9-static-linked-bins
GitHub Check: ubuntu-jammy-dynamic-linked-bins
GitHub Check: ubuntu-jammy-lint
GitHub Check: ubuntu-jammy-static-linked-bins
GitHub Check: build-macos (macos-14, true)
GitHub Check: build-macos (macos-14, false)
GitHub Check: build-macos (macos-13, true)
GitHub Check: build-macos (macos-13, false)
GitHub Check: lint-check (ubuntu-latest)
GitHub Check: lint-check (macos-latest)

🔇 Additional comments (1)

components/core/src/clp_s/RangeIndexWriter.hpp (1)

1-107: Overall implementation looks robust

The RangeIndexWriter class provides a well-documented and clean interface for building and serializing range index data. The error handling is comprehensive, checking for invalid operations like adding values to non-existent or closed ranges. The class effectively maintains the invariant that ranges cannot overlap.

Some implementation details will depend on the .cpp file, but the header establishes a solid foundation for the functionality described in the PR objectives.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

components/core/src/clp_s/ArchiveWriter.hpp (1)
171-190: 🧹 Nitpick (assertive)

Consider using std::string_view for more efficient key parameter

The add_field_to_current_range method currently accepts a std::string const& parameter, which forces callers to construct/own an actual std::string. Using std::string_view would avoid unnecessary heap allocations when keys already exist elsewhere (common in parsers).
-    template <typename T>
-    [[nodiscard]] auto add_field_to_current_range(std::string const& key, T const& value)
-            -> ErrorCode {
+    template <typename T>
+    [[nodiscard]] auto add_field_to_current_range(std::string_view key, T const& value)
+            -> ErrorCode {
components/core/src/clp_s/RangeIndexWriter.hpp (2)
40-42: 🧹 Nitpick (assertive)

Consider grouping serialization constants in a struct

To better represent the structure of your serialized format and make the relationship between these constants more explicit, consider grouping them in a struct.
-static constexpr std::string_view cStartIndexName{"s"};
-static constexpr std::string_view cEndIndexName{"e"};
-static constexpr std::string_view cMetadataFieldsName{"f"};
+struct SerializationKeys {
+    static constexpr std::string_view start{"s"};
+    static constexpr std::string_view end{"e"};
+    static constexpr std::string_view fields{"f"};
+};
57-71: 🧹 Nitpick (assertive)

Consider adding type constraints to the template method

The template method add_value_to_range accepts any type T without constraints, which could lead to compilation errors if types incompatible with nlohmann::json are used.

If using C++20:
-template <typename T>
+template <typename T>
+    requires nlohmann::detail::has_to_json<nlohmann::json, T>::value
[[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {
Or for C++17:
-template <typename T>
+template <typename T,
+         typename = std::enable_if_t<nlohmann::detail::has_to_json<nlohmann::json, T>::value>>
[[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8c3705 and 47d0e94.

📒 Files selected for processing (2)

components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

components/core/src/clp_s/ArchiveWriter.hpp
components/core/src/clp_s/RangeIndexWriter.hpp

🔇 Additional comments (4)

components/core/src/clp_s/ArchiveWriter.hpp (2)

196-202: Good implementation of close_current_range

The implementation correctly only resets m_range_open upon successful closure, preserving the error state and allowing callers to retry after fixing issues.

290-291: Optimal member ordering for efficient memory layout

Good placement of the lightweight boolean flag next to the writer object to reduce padding in the struct.
components/core/src/clp_s/RangeIndexWriter.hpp (2)

67-69: Efficient use of try_emplace

Good use of try_emplace which performs existence check and insertion in a single operation, rather than using two separate operations (find + emplace).

11-16:
Details

❓ Verification inconclusive

Confirm if both JSON includes are necessary

While the comment explains why both headers are needed, typically including json.hpp already pulls in all forward declarations, making json_fwd.hpp redundant. This could slow down compilation.

🏁 Script executed:
#!/bin/bash
# Check if removing json_fwd.hpp breaks compilation

# First, let's check what each header actually includes
echo "Examining json.hpp content:"
head -n 20 $(fd -t f "json.hpp" | head -n 1)

echo "Examining json_fwd.hpp content:"
head -n 20 $(fd -t f "json_fwd.hpp" | head -n 1)

# Check if json.hpp already includes json_fwd.hpp
grep -A 2 "json_fwd.hpp" $(fd -t f "json.hpp" | head -n 1)
Length of output: 291
Verify necessity of json_fwd.hpp when including json.hpp

I wasn’t able to locate the external nlohmann headers in the repo to confirm if <nlohmann/json.hpp> already pulls in the forward declarations from json_fwd.hpp. Please check in your local environment whether json.hpp includes json_fwd.hpp and if so remove the redundant include to speed up compilation.

• File: components/core/src/clp_s/RangeIndexWriter.hpp (lines 11–16)
#include <nlohmann/json.hpp>
#include <nlohmann/json_fwd.hpp>
Please verify and update accordingly.

kirkrodrigues

Reviewed CMakeLists.txt

gibber9809 · 2025-05-07T21:16:04Z

Just making sure that the unit tests I have written so far in the read side PR pass when rebased on top of this branch, then I will merge.

gibber9809 · 2025-05-07T21:21:25Z

Just making sure that the unit tests I have written so far in the read side PR pass when rebased on top of this branch, then I will merge.

Seems to be passing, will merge.

gibber9809 added 8 commits April 24, 2025 15:50

Add RangeIndexWriter class to manage details of writing archive range…

eebbcd2

… index

Update CMakeLists.txt

ea8b137

Add RangeIndex metadata packet type

6d4cb44

Allow skipping unknown metadata packet types when reading an archive

c79ed57

Fix several clang-tidy errors

dce6b6f

Fix bugs in RangeIndexWriter

e7bb1a2

Record basic metadata during ingestion; increment archive version to …

139d302

…0.3.1

Fix clang-tidy issues

c308293

gibber9809 requested review from a team and wraymo as code owners April 24, 2025 17:17

coderabbitai Bot reviewed Apr 24, 2025

View reviewed changes

gibber9809 added 15 commits April 24, 2025 17:48

Add RangeIndexWriter to CMakeLists sources for unit test build

30f4750

Address rabbit comments

fba7785

Add missing docstrings

e868abe

Fix several clang-tidy warnings

166c770

Fix mac build issue

b9550be

Fix another clang-tidy error

0dd24c4

Merge remote-tracking branch 'upstream/main' into range-index

90415a9

Fix include paths after merging upstream

3f3064c

Attempt to fix clang-tidy errors related to nlohmann::json

200fd60

Add NOLINTNEXTLINE to work around clang-tidy issue

2b664fd

Merge remote-tracking branch 'upstream/main' into range-index

36ac56c

Replace string literals with constants

7ec7737

Fix compilation error

2e34156

Fix clang-tidy error

7cbb060

Merge remote-tracking branch 'upstream/main' into range-index

011bd01

wraymo reviewed May 1, 2025

View reviewed changes

Address review comments

c06163f

coderabbitai Bot reviewed May 2, 2025

View reviewed changes

Address rabbit review comments

c7fc681

gibber9809 requested a review from wraymo May 2, 2025 15:51

coderabbitai Bot reviewed May 2, 2025

View reviewed changes

Address more rabbit comments

a8c3705

coderabbitai Bot reviewed May 2, 2025

View reviewed changes

Address rabbit comments

47d0e94

coderabbitai Bot reviewed May 5, 2025

View reviewed changes

Comment thread components/core/src/clp_s/RangeIndexWriter.hpp

Comment thread components/core/src/clp_s/RangeIndexWriter.hpp

wraymo approved these changes May 7, 2025

View reviewed changes

gibber9809 added 2 commits May 7, 2025 10:20

Merge branch 'main' into range-index

36c1fa0

Merge branch 'main' into range-index

ca4a371

kirkrodrigues approved these changes May 7, 2025

View reviewed changes

gibber9809 changed the title ~~feat(clp-s): Add write-side implementation for archive range index.~~ feat(clp-s): Add the write-side implementation for the archive range index. May 7, 2025

gibber9809 merged commit 3424d2a into y-scope:main May 7, 2025

gibber9809 mentioned this pull request May 8, 2025

feat(clp-s): Add the read-side implementation for the archive range index. #889

Merged

3 tasks

coderabbitai Bot mentioned this pull request Jun 13, 2025

feat(clp-s): Expose archive range index metadata in printed archive statistics; Forward archive-level statistics to c++ writers. #1005

Merged

3 tasks

coderabbitai Bot mentioned this pull request Mar 5, 2026

feat(clp-s): Add ReaderInterface-based open path for ArchiveReader and ArchiveReaderAdaptor. #2070

Merged

4 tasks

Conversation

gibber9809 commented Apr 24, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Suggested reviewers

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wraymo left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wraymo commented May 1, 2025

Uh oh!

gibber9809 commented May 1, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

gibber9809 commented Apr 24, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)