Skip to content

feat(clp-s): Add the write-side implementation for the archive range index.#847

Merged
gibber9809 merged 29 commits into
y-scope:mainfrom
gibber9809:range-index
May 7, 2025
Merged

feat(clp-s): Add the write-side implementation for the archive range index.#847
gibber9809 merged 29 commits into
y-scope:mainfrom
gibber9809:range-index

Conversation

@gibber9809

@gibber9809 gibber9809 commented Apr 24, 2025

Copy link
Copy Markdown
Contributor

Description

This PR adds the write side of the archive range index as described in #737.

In the current implementation each split of a file has an associated range index entry that looks something like the following:
{"start_idx": N, "end_index": M, "properties": {...}}
This records properties associated with that split of a file corresponding to the logical range of the archive [N, M).

The range index will take in user-defined metadata from kv-ir streams and also always set the following properties for each range:
_filename: The full path to the file corresponding to the range
_file_split_number: Which split of a file corresponds to the range (0-indexed and incremented each time an archive is split while ingesting a file)
_archive_creator_id: UUID associated with a specific invocation of compression -- allows disambiguating files with the same path compressed across different compression invocations

Most of the logic for writing and maintaining the range index is contained in RangeIndexWriter.{cpp,hpp}, this class is wrapped in ArchiveWriter.{cpp,hpp}, and properties are written in the main parsing loop(s) of JsonParser.cpp.

We bump the minor version of the archive format in this PR (new version is 0.3.1) because this does constitute a change to the archive format, but the change is non-breaking (0.3.0 archives that do not have the archive range index metadata packet are valid 0.3.1 archives).

Proper tests will be added in the read-side PR.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Manually validated that range index metadata section can be deserialized and has expected contents both for common case and during archive splitting.

Summary by CodeRabbit

  • New Features

    • Introduced range-based metadata tracking for archives, including file split numbering and unique archive creator IDs.
    • Added support for associating custom metadata fields with archive ranges and enhanced propagation of user-defined metadata.
    • Enhanced archive format with a new range index structure for detailed metadata serialization.
    • Added new methods to manage archive ranges, including adding fields and closing ranges.
  • Bug Fixes

    • Improved handling of unknown metadata packets to maintain decompressor state and archive integrity.
  • Chores

    • Updated archive patch version and build configuration to include new range index components.

@gibber9809 gibber9809 requested review from a team and wraymo as code owners April 24, 2025 17:17
@coderabbitai

coderabbitai Bot commented Apr 24, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

This update introduces a range index feature to the archive writing and reading process. The ArchiveWriter class is extended with new methods and members to manage range metadata, including the addition and closure of range fields, tracked via a new RangeIndexWriter class. The archive metadata structure is updated to include a new packet type for range indices, and the archive patch version is incremented. The JsonParser logic is revised to generate and propagate archive creator IDs, file split numbers, and user metadata into the range index. The reader is updated to explicitly handle and skip unknown metadata packets, ensuring decompressor state consistency. New constants and CMake entries are added to support these features.

Changes

File(s) Change Summary
components/core/src/clp_s/ArchiveReaderAdaptor.cpp,
components/core/src/clp_s/ArchiveReaderAdaptor.hpp
Added a private method to read and discard unknown metadata packets, updating the metadata reading logic to use this method for unknown packet types, ensuring the decompressor's state is maintained.
components/core/src/clp_s/ArchiveWriter.cpp Enhanced close() to handle range index closure, updated write_archive_metadata() to dynamically include range index packets and write range index data with error handling.
components/core/src/clp_s/ArchiveWriter.hpp Added add_field_to_current_range (templated) and close_current_range public methods; introduced m_range_open flag and m_range_index_writer member for range management; updated includes.
components/core/src/clp_s/CMakeLists.txt Included RangeIndexWriter.cpp and RangeIndexWriter.hpp in the build sources.
components/core/src/clp_s/JsonParser.cpp Added logic to generate archive creator IDs and file split numbers; replaced and expanded field initialization for archive ranges; propagated user metadata in IR parsing; improved error handling for metadata/range operations; ensured range closure at file end.
components/core/src/clp_s/RangeIndexWriter.cpp,
components/core/src/clp_s/RangeIndexWriter.hpp
Introduced new RangeIndexWriter class for managing and serializing archive range indices, with methods for opening, updating, closing, and writing ranges, as well as error handling and validation.
components/core/src/clp_s/SingleFileArchiveDefs.hpp Incremented archive patch version; added RangeIndex metadata packet type to the enum; included new headers.
components/core/src/clp_s/archive_constants.hpp Added new constants for range index metadata keys in a nested range_index namespace.
components/core/CMakeLists.txt Added src/clp_s/RangeIndexWriter.cpp and src/clp_s/RangeIndexWriter.hpp to the unit test source files list.

Sequence Diagram(s)

sequenceDiagram
    participant JsonParser
    participant ArchiveWriter
    participant RangeIndexWriter
    participant ZstdCompressor

    JsonParser->>ArchiveWriter: add_field_to_current_range(key, value)
    ArchiveWriter->>RangeIndexWriter: add_value_to_range(key, value)
    JsonParser->>ArchiveWriter: close_current_range()
    ArchiveWriter->>RangeIndexWriter: close_range(end_index)
    ArchiveWriter->>ArchiveWriter: write_archive_metadata()
    ArchiveWriter->>RangeIndexWriter: write(ZstdCompressor)
    RangeIndexWriter->>ZstdCompressor: write msgpack-encoded range index
Loading

Suggested reviewers

  • wraymo

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47d0e94 and 36c1fa0.

📒 Files selected for processing (1)
  • components/core/CMakeLists.txt (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (1)
components/core/CMakeLists.txt (1)

320-321: Inclusion of RangeIndexWriter in clp_s unit tests

The new RangeIndexWriter.cpp and RangeIndexWriter.hpp files have been correctly added to the SOURCE_FILES_clp_s_unitTest list, ensuring the write-side archive range index functionality is covered by existing unit tests.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (7)
components/core/src/clp_s/ArchiveWriter.cpp (2)

155-160: Prefer a constexpr for constant packets

num_constant_packets never changes; turning it into a constexpr clarifies intent and enables compiler optimisation.

-    uint8_t const num_constant_packets{3U};
+    constexpr uint8_t num_constant_packets{3U};

187-191: Avoid an unconditional write when the range index is empty

RangeIndexWriter::write() returns immediately when empty, but calling it anyway incurs a function call and makes the intent less obvious. A small guard keeps the code self-documenting.

-    // Write range index
-    if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) {
+    // Write range index (only if present)
+    if (false == m_range_index_writer.empty()) {
+        if (auto rc = m_range_index_writer.write(compressor); ErrorCodeSuccess != rc) {
             throw OperationFailed(rc, __FILENAME__, __LINE__);
-    }
+        }
+    }
components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

100-106: Reading unknown packets can exhaust memory – stream instead of bulk-allocating

Allocating a std::vector<char>(size) for every unknown packet may allocate many GiB for a malformed or adversarial archive. Consider discarding the data in fixed-size blocks to cap memory usage.

-    std::vector<char> buffer(size);
-    return decompressor.try_read_exact_length(buffer.data(), buffer.size());
+    constexpr size_t cBufSize = 64 * 1024;
+    std::array<char, cBufSize> buf{};
+    size_t bytes_remaining = size;
+    while (bytes_remaining > 0) {
+        size_t to_read = std::min(bytes_remaining, cBufSize);
+        auto rc = decompressor.try_read_exact_length(buf.data(), to_read);
+        if (ErrorCodeSuccess != rc) {
+            return rc;
+        }
+        bytes_remaining -= to_read;
+    }
+    return ErrorCodeSuccess;
components/core/src/clp_s/RangeIndexWriter.cpp (1)

32-48: Invalidate the handle after a successful close

After finalising a range, later misuse of the same handle will currently return ErrorCodeNotReady. Setting the handle to std::numeric_limits<handle_t>::max() (or similar sentinel) immediately eliminates accidental re-use.

     range.end_index = end_index;
+    handle = std::numeric_limits<handle_t>::max();
components/core/src/clp_s/ArchiveWriter.hpp (1)

171-174: Provide proper documentation instead of TODO placeholders

Both add_field_to_current_range and close_current_range are part of the public API and will be consumed by callers outside this class. Detailed comments describing behaviour, failure modes, and thread-safety expectations are essential.

components/core/src/clp_s/JsonParser.cpp (1)

512-557: Consider de-duplicating repeated initialisation logic

The lambda initialize_fields_for_archive (and its twin in parse_from_ir) contains nearly identical code blocks. Extracting this into a private helper method would remove duplication, reduce maintenance overhead, and ensure both code paths evolve together.

components/core/src/clp_s/RangeIndexWriter.hpp (1)

51-70: Mark add_value_to_range as [[nodiscard]] to surface ignored errors

The function returns an ErrorCode which can easily be ignored by callers. Adding [[nodiscard]] will prompt the compiler to warn about accidental drops:

-    [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value)
+    [[nodiscard]] auto add_value_to_range(handle_t handle, std::string const& key, T const& value)

(only the attribute added).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6300fa5 and c308293.

📒 Files selected for processing (10)
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp (2 hunks)
  • components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1 hunks)
  • components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (9 hunks)
  • components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp (2 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/SingleFileArchiveDefs.hpp
  • components/core/src/clp_s/ArchiveReaderAdaptor.cpp
  • components/core/src/clp_s/archive_constants.hpp
  • components/core/src/clp_s/ArchiveReaderAdaptor.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/RangeIndexWriter.cpp
  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/src/clp_s/RangeIndexWriter.hpp
  • components/core/src/clp_s/JsonParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (8)
components/core/src/clp_s/CMakeLists.txt (1)

116-117: LGTM! Added RangeIndexWriter source files to the build system.

The new source files for the RangeIndexWriter implementation have been correctly added to the CLP_S_SOURCES variable in the build system.

components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)

4-6: LGTM! Necessary includes added.

The addition of <cstdint> and <vector> headers provides the necessary declarations for types used in this file.


14-14: LGTM! Archive version incremented appropriately.

The patch version has been incremented from 0 to 1, which is appropriate for adding a new feature in a backward-compatible manner.


38-38: LGTM! New RangeIndex metadata packet type added.

Added the RangeIndex packet type with value 3 to the ArchiveMetadataPacketType enum, which aligns with the PR objective of implementing range index functionality.

components/core/src/clp_s/ArchiveReaderAdaptor.hpp (1)

93-99: LGTM! Added method to handle unknown metadata packets.

The new method allows the reader to properly handle and skip unknown metadata packets, enabling forward compatibility when new packet types (like the range index) are added in future versions.

components/core/src/clp_s/archive_constants.hpp (1)

39-43: LGTM! Added range index metadata constants.

The new namespace with constants for range index metadata field names provides a centralized place to define these keys, promoting consistency throughout the codebase. The constants align with the requirements described in the PR objectives.

components/core/src/clp_s/ArchiveReaderAdaptor.cpp (1)

174-186: Good defensive handling of future packet types

Explicitly consuming – rather than silently ignoring – unknown metadata packets maintains decompressor state and forward-compatibility. Well done!

components/core/src/clp_s/RangeIndexWriter.cpp (1)

18-30: Inclusive vs. exclusive range boundary is ambiguous

open_range allows start_index to be equal to a previous range’s end_index because the overlap check uses < rather than <=.
Confirm whether ranges are [inclusive, exclusive) or [inclusive, inclusive]; if the latter, equality should be rejected.

-        || start_index < it->end_index.value())
+        || start_index <= it->end_index.value())   // Reject touching ranges if both bounds are inclusive

Comment thread components/core/src/clp_s/ArchiveWriter.cpp
Comment thread components/core/src/clp_s/RangeIndexWriter.cpp
Comment thread components/core/src/clp_s/ArchiveWriter.hpp Outdated
Comment thread components/core/src/clp_s/ArchiveWriter.hpp Outdated
Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp

@wraymo wraymo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great PR with a very clean design and implementation! I just have a few questions.

}

auto
ArchiveReaderAdaptor::try_read_unknown_metadata_packet(ZstdDecompressor& decompressor, size_t size)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is an unknown metadata packet?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any metadata packet that this version of the code isn't familiar with. Generally I'm trying to make it so that if you use this version of the code to read a later archive format with more metadata packet types the code will just skip the metadata packet and keep reading as normal.

For this PR specifically it also makes it so that I can split the read side code into a separate PR but let the archive still be readable.

compressor.write_numeric_value(num_packets);
uint8_t num_optional_packets{0ULL};
if (false == m_range_index_writer.empty()) {
++num_optional_packets;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any other optional packets?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the spec I think the timestamp dictionary is optional, but we've always written it even when it is unused for the whole time we've had the metadata section.

Should I open an issue to refactor the timestamp dictionary writing code to actually write it only when it is used?


size_t start_index{0ULL};
std::optional<size_t> end_index{std::nullopt};
std::map<std::string, nlohmann::json> fields;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that to support multiple data types and make serialization easier with the nlohmann library?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's the idea -- nlohmann makes it easy to serialize/derialize msgpack + makes it easy to support multiple types. Since the metadata section is relatively small I don't think that using nlohmann will cause any performance problems for us.

log_event_idx_node_id
= add_metadata_field(constants::cLogEventIdxName, NodeType::Integer);
}
if (auto const rc = m_archive_writer->add_field_to_current_range(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to deduplicate the code from line 985-1023 with code from line 517-555

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a bit hard with the way the code is currently written. In general I think we want to deduplicate a lot of the code between the kv-ir/json parse_X flows though. Will give it some thought.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a somewhat substantial refactor to merge the kvir/json parsing flow in a way that lets us share a lot of this code at the moment.

I think we could maybe do it by doing something like the IrUnitHandlerInterface concept in Deserializer where we define a class that holds onto state relevant to a specific parsing flow + defines handlers that get invoked in different parts of the parsing flow e.g. "parse the next record", "handle the start of a new archive", etc..

Then we could handle most stuff like populating the default metadata fields + adding the log_event_idx metadata field to each log record separately from the core parsing stuff.

Maybe I can put up a github issue to that effect?

Comment on lines +636 to +640
if (ErrorCodeSuccess != m_archive_writer->close_current_range()) {
m_archive_writer->close();
return false;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's better to close the range after checking the simdjson errors?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will move it after.

Comment on lines +21 to +27
for (auto const& range : m_ranges) {
if (start_index < range.start_index || false == range.end_index.has_value()
|| start_index < range.end_index.value())
{
return std::errc::invalid_argument;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to print an error message in this case (and probably better error code)?

return handle;
}

auto RangeIndexWriter::close_range(handle_t handle, size_t end_index) -> ErrorCode {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if handle always equals to m_ranges.size() -1. If that's the case, we may get rid of this variable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it could potentially help catch some weird bugs, but probably not really necessary. I'll get rid of it.

@wraymo

wraymo commented May 1, 2025

Copy link
Copy Markdown
Contributor

For the range index schema, does the following make sense?

[{
  "tag_name": ...,
  "tag_type": ...,
  "ranges": [{
      "start": ...,
      "end": ...,
      "value": ...,
      }, ...
   ]
}, ...
]

@gibber9809

Copy link
Copy Markdown
Contributor Author

For the range index schema, does the following make sense?

[{
  "tag_name": ...,
  "tag_type": ...,
  "ranges": [{
      "start": ...,
      "end": ...,
      "value": ...,
      }, ...
   ]
}, ...
]

I think it would make it a little bit more complicated to search against compared to the current design, particularly for queries that touch multiple metadata fields.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

♻️ Duplicate comments (1)
components/core/src/clp_s/RangeIndexWriter.cpp (1)

46-76: ⚠️ Potential issue

Possible 4 GiB overflow when writing the serialised blob

serialized_range_index.size() is cast to uint32_t; blobs ≥ 4 294 967 296 bytes wrap to 0. Detect overflow and report an error:

-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // Too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+            static_cast<uint32_t>(serialized_range_index.size()));

A future enhancement would avoid the temporary buffer and stream the MessagePack directly into the compressor, eliminating the RAM spike altogether.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 011bd01 and c06163f.

📒 Files selected for processing (5)
  • components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/JsonParser.cpp (9 hunks)
  • components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
🧰 Additional context used
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp

[style] 18-18: The function 'open_range' is never used.

(unusedFunction)


[style] 31-31: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (1)

60-66: Avoid suppressing a failure when closing the last open range

close() calls close_current_range() and then simply proceeds, regardless of the error code already returned.
If close_current_range() fails (e.g., the end index is smaller than the start index), the exception is thrown and execution never reaches this block, so the logic is fine. But if future refactors convert close_current_range() to return an error code instead of throwing inside, the surrounding code path must respect that.

No immediate action needed, but keep an eye on error-propagation semantics if you change the exception policy.

components/core/src/clp_s/ArchiveWriter.hpp (1)

178-189: ⚠️ Potential issue

Don’t mark the range as open until the open succeeds

If open_range() returns an error, m_range_open is still set to true, leaving the writer in an inconsistent state and guaranteeing that subsequent calls will route to add_value_to_range() without a valid range.

-            if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id);
-                ErrorCodeSuccess != rc)
-            {
-                return rc;
-            }
-            m_range_open = true;
+            if (auto const rc = m_range_index_writer.open_range(m_next_log_event_id);
+                ErrorCodeSuccess != rc)
+            {
+                return rc;                       // keep flag unchanged
+            }
+            m_range_open = true;                // safe: open succeeded

Likely an incorrect or invalid review comment.

components/core/src/clp_s/JsonParser.cpp (1)

510-556: Possible missing open_range() call before populating range fields

initialize_fields_for_archive() immediately calls add_field_to_current_range(...).
If close_current_range() (line 659) actually closes the only open range and does not implicitly open the next one, the subsequent initialize_fields_for_archive() in the next file-iteration will operate on a writer that has no open range, causing ErrorCodeNotReady.

Please confirm that ArchiveWriter::close_current_range() guarantees a fresh range is open afterwards.
If it does not, prepend an explicit open_range(start_idx) (or equivalent) before the first metadata insertion.

Comment thread components/core/src/clp_s/ArchiveWriter.cpp
Comment thread components/core/src/clp_s/ArchiveWriter.cpp
Comment thread components/core/src/clp_s/ArchiveWriter.hpp Outdated
Comment thread components/core/src/clp_s/ArchiveWriter.hpp Outdated
Comment thread components/core/src/clp_s/RangeIndexWriter.cpp
Comment thread components/core/src/clp_s/RangeIndexWriter.cpp Outdated
Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp Outdated
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
@gibber9809 gibber9809 requested a review from wraymo May 2, 2025 15:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (4)
components/core/src/clp_s/ArchiveWriter.hpp (1)

289-290: Member ordering still wastes padding bytes

Previous review pointed out that placing the heavyweight RangeIndexWriter first and the 1-byte bool afterwards shrinks the struct on most ABIs. The order has not been adjusted.

components/core/src/clp_s/RangeIndexWriter.cpp (2)

70-76: 4 GiB overflow & silent truncation still present

serialized_range_index.size() is written as a uint32_t without a bounds check.
Archives whose range-index exceeds 4 294 967 295 bytes will be silently corrupted.

This was raised before; please apply the suggested guard:

-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+                static_cast<uint32_t>(serialized_range_index.size()));

75-76: Lost diagnostic information makes debugging harder

Catching std::exception and discarding e.what() removes valuable context.
At minimum, log the message before returning.

-    } catch (std::exception const& e) {
-        return ErrorCodeFailure;
+    } catch (std::exception const& e) {
+        SPDLOG_ERROR("Failed to serialise range index: {}", e.what());
+        return ErrorCodeFailure;
     }
components/core/src/clp_s/JsonParser.cpp (1)

510-556: Initialisation block duplicated – extract a helper

The lambda initialize_fields_for_archive is 40 + lines long and appears almost verbatim in both the JSON and KV-IR paths.
Maintaining two copies will lead to drift.

A small private helper such as

bool JsonParser::init_range_metadata(
        std::string_view filename,
        size_t& file_split_number,
        std::string const& archive_creator_id,
        std::optional<nlohmann::json> const& extra_user_md = std::nullopt);

would remove the duplication and make future changes (e.g., adding another field) less error-prone.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c06163f and c7fc681.

📒 Files selected for processing (4)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/JsonParser.cpp (9 hunks)
  • components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
components/core/src/clp_s/RangeIndexWriter.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp

[style] 17-17: The function 'open_range' is never used.

(unusedFunction)


[style] 30-30: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: lint-check (macos-latest)
  • GitHub Check: lint-check (ubuntu-latest)
🔇 Additional comments (3)
components/core/src/clp_s/RangeIndexWriter.hpp (3)

59-72: The error handling looks good

The template method properly checks for error conditions before proceeding:

  1. Verifies that there's at least one range
  2. Ensures the most recent range is not already closed
  3. Prevents duplicate keys in a range's fields

The error codes returned (ErrorCodeNotReady, ErrorCodeBadParam) clearly indicate the failure reason.


93-101: Good design of the Range struct

The Range struct is well-designed with:

  • An explicit constructor that requires a start index
  • Default initialization of start_index (though redundant with the constructor)
  • Use of std::optional for end_index to represent ranges that haven't been closed yet
  • std::map for fields to ensure sorted keys and efficient lookups

The use of nlohmann::json for field values provides flexibility for storing different types of metadata.


21-36: Clear documentation of the range index structure

The class documentation clearly explains the purpose and structure of the range index, including the non-overlapping constraint and support for empty ranges. This will be valuable for maintainers working with this code in the future.

Comment thread components/core/src/clp_s/ArchiveWriter.hpp Outdated
Comment thread components/core/src/clp_s/ArchiveWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.cpp
Comment thread components/core/src/clp_s/JsonParser.cpp Outdated
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp Outdated
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (6)
components/core/src/clp_s/ArchiveWriter.hpp (3)

178-183: Prefer std::string_view to avoid an extra copy

The key is never modified or stored past the call, so accepting std::string_view avoids an
allocation when the caller already holds a string literal or view.

Same optimisation was recommended previously.


290-291: 🧹 Nitpick (assertive)

Minor padding win: place the bool after the heavy writer

Re-ordering the members saves padding on most platforms and keeps the boolean next to the flag it
controls.

-    bool m_range_open{false};
-    RangeIndexWriter m_range_index_writer;
+    RangeIndexWriter m_range_index_writer;
+    bool m_range_open{false};

178-190: ⚠️ Potential issue

Guard against half-opened ranges

add_field_to_current_range leaves m_range_open == true when add_value_to_range() fails.
Callers cannot detect that the range is only partially initialised, and a subsequent attempt to
re-open/close will return confusing NotReady errors.

-        return m_range_index_writer.add_value_to_range(key, value);
+        auto const rc = m_range_index_writer.add_value_to_range(key, value);
+        if (ErrorCodeSuccess != rc) {
+            // roll back: the range is unusable
+            m_range_open = false;
+        }
+        return rc;

This mirrors the defensive logic suggested for close_current_range() in an earlier review.

components/core/src/clp_s/RangeIndexWriter.cpp (2)

76-78: Surface exception details

The catch block converts every exception into ErrorCodeFailure, discarding e.what(). Logging the
message (spdlog or SPDLOG_ERROR) would aid post-mortem debugging and aligns with prior feedback.


70-75: ⚠️ Potential issue

Possible 4 GiB overflow when serialising

serialized_range_index.size() is cast to uint32_t without bounds checking. Archives whose range
index exceeds 4 294 967 295 bytes will be silently truncated, corrupting the metadata.

-        writer.write_numeric_value<uint32_t>(serialized_range_index.size());
+        if (serialized_range_index.size() > std::numeric_limits<uint32_t>::max()) {
+            return ErrorCodeUnsupported;   // too large to encode safely
+        }
+        writer.write_numeric_value<uint32_t>(
+                static_cast<uint32_t>(serialized_range_index.size()));
components/core/src/clp_s/JsonParser.cpp (1)

512-557: Repeated metadata initialisation – consider extracting a helper

The lambda initialize_fields_for_archive here is virtually identical to the one in the
kv-ir branch (lines 981-1047). Duplicated 40-line blocks are hard to keep in sync.

A small private helper such as

bool JsonParser::init_range_metadata(
        Path const& path,
        size_t file_split_number,
        std::string const& archive_creator_id,
        nlohmann::json const* extra_ir_metadata = nullptr);

would eliminate duplication and make future fixes affect both paths.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7fc681 and a8c3705.

📒 Files selected for processing (4)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/JsonParser.cpp (9 hunks)
  • components/core/src/clp_s/RangeIndexWriter.cpp (1 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
components/core/src/clp_s/RangeIndexWriter.cpp (2)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
Learnt from: gibber9809
PR: y-scope/clp#847
File: components/core/src/clp_s/RangeIndexWriter.cpp:19-29
Timestamp: 2025-05-02T15:43:36.648Z
Learning: In RangeIndexWriter, the open_range method intentionally uses a strict comparison (< rather than <=) when checking for overlaps with existing ranges. This design choice allows for successive size 0 ranges where start_index equals end_index, and enables ranges to be adjacent to each other (one ending where the next begins).
🪛 Cppcheck (2.10-2)
components/core/src/clp_s/RangeIndexWriter.cpp

[style] 17-17: The function 'open_range' is never used.

(unusedFunction)


[style] 31-31: The function 'close_range' is never used.

(unusedFunction)

⏰ Context from checks skipped due to timeout of 90000ms (11)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-13, true)
  • GitHub Check: build-macos (macos-13, false)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp_s/RangeIndexWriter.hpp (1)

1-107: Overall implementation looks robust

The RangeIndexWriter class provides a well-documented and clean interface for building and serializing range index data. The error handling is comprehensive, checking for invalid operations like adding values to non-existent or closed ranges. The class effectively maintains the invariant that ranges cannot overlap.

Some implementation details will depend on the .cpp file, but the header establishes a solid foundation for the functionality described in the PR objectives.

Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
components/core/src/clp_s/ArchiveWriter.hpp (1)

171-190: 🧹 Nitpick (assertive)

Consider using std::string_view for more efficient key parameter

The add_field_to_current_range method currently accepts a std::string const& parameter, which forces callers to construct/own an actual std::string. Using std::string_view would avoid unnecessary heap allocations when keys already exist elsewhere (common in parsers).

-    template <typename T>
-    [[nodiscard]] auto add_field_to_current_range(std::string const& key, T const& value)
-            -> ErrorCode {
+    template <typename T>
+    [[nodiscard]] auto add_field_to_current_range(std::string_view key, T const& value)
+            -> ErrorCode {
components/core/src/clp_s/RangeIndexWriter.hpp (2)

40-42: 🧹 Nitpick (assertive)

Consider grouping serialization constants in a struct

To better represent the structure of your serialized format and make the relationship between these constants more explicit, consider grouping them in a struct.

-static constexpr std::string_view cStartIndexName{"s"};
-static constexpr std::string_view cEndIndexName{"e"};
-static constexpr std::string_view cMetadataFieldsName{"f"};
+struct SerializationKeys {
+    static constexpr std::string_view start{"s"};
+    static constexpr std::string_view end{"e"};
+    static constexpr std::string_view fields{"f"};
+};

57-71: 🧹 Nitpick (assertive)

Consider adding type constraints to the template method

The template method add_value_to_range accepts any type T without constraints, which could lead to compilation errors if types incompatible with nlohmann::json are used.

If using C++20:

-template <typename T>
+template <typename T>
+    requires nlohmann::detail::has_to_json<nlohmann::json, T>::value
[[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {

Or for C++17:

-template <typename T>
+template <typename T,
+         typename = std::enable_if_t<nlohmann::detail::has_to_json<nlohmann::json, T>::value>>
[[nodiscard]] auto add_value_to_range(std::string const& key, T const& value) -> ErrorCode {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a8c3705 and 47d0e94.

📒 Files selected for processing (2)
  • components/core/src/clp_s/ArchiveWriter.hpp (4 hunks)
  • components/core/src/clp_s/RangeIndexWriter.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • components/core/src/clp_s/ArchiveWriter.hpp
  • components/core/src/clp_s/RangeIndexWriter.hpp
🔇 Additional comments (4)
components/core/src/clp_s/ArchiveWriter.hpp (2)

196-202: Good implementation of close_current_range

The implementation correctly only resets m_range_open upon successful closure, preserving the error state and allowing callers to retry after fixing issues.


290-291: Optimal member ordering for efficient memory layout

Good placement of the lightweight boolean flag next to the writer object to reduce padding in the struct.

components/core/src/clp_s/RangeIndexWriter.hpp (2)

67-69: Efficient use of try_emplace

Good use of try_emplace which performs existence check and insertion in a single operation, rather than using two separate operations (find + emplace).


11-16:

Details

❓ Verification inconclusive

Confirm if both JSON includes are necessary

While the comment explains why both headers are needed, typically including json.hpp already pulls in all forward declarations, making json_fwd.hpp redundant. This could slow down compilation.


🏁 Script executed:

#!/bin/bash
# Check if removing json_fwd.hpp breaks compilation

# First, let's check what each header actually includes
echo "Examining json.hpp content:"
head -n 20 $(fd -t f "json.hpp" | head -n 1)

echo "Examining json_fwd.hpp content:"
head -n 20 $(fd -t f "json_fwd.hpp" | head -n 1)

# Check if json.hpp already includes json_fwd.hpp
grep -A 2 "json_fwd.hpp" $(fd -t f "json.hpp" | head -n 1)

Length of output: 291


Verify necessity of json_fwd.hpp when including json.hpp

I wasn’t able to locate the external nlohmann headers in the repo to confirm if <nlohmann/json.hpp> already pulls in the forward declarations from json_fwd.hpp. Please check in your local environment whether json.hpp includes json_fwd.hpp and if so remove the redundant include to speed up compilation.

• File: components/core/src/clp_s/RangeIndexWriter.hpp (lines 11–16)

#include <nlohmann/json.hpp>
#include <nlohmann/json_fwd.hpp>

Please verify and update accordingly.

Comment thread components/core/src/clp_s/RangeIndexWriter.hpp
Comment thread components/core/src/clp_s/RangeIndexWriter.hpp

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed CMakeLists.txt

@gibber9809 gibber9809 changed the title feat(clp-s): Add write-side implementation for archive range index. feat(clp-s): Add the write-side implementation for the archive range index. May 7, 2025
@gibber9809

Copy link
Copy Markdown
Contributor Author

Just making sure that the unit tests I have written so far in the read side PR pass when rebased on top of this branch, then I will merge.

@gibber9809

Copy link
Copy Markdown
Contributor Author

Just making sure that the unit tests I have written so far in the read side PR pass when rebased on top of this branch, then I will merge.

Seems to be passing, will merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants