Skip to content

feat(clp-s): Optimize KV-IR ingestion by directly ingesting EncodedTextAst into clp-s archives.#1468

Merged
gibber9809 merged 8 commits into
y-scope:mainfrom
gibber9809:ingest-encoded-text-ast
Oct 28, 2025
Merged

feat(clp-s): Optimize KV-IR ingestion by directly ingesting EncodedTextAst into clp-s archives.#1468
gibber9809 merged 8 commits into
y-scope:mainfrom
gibber9809:ingest-encoded-text-ast

Conversation

@gibber9809

@gibber9809 gibber9809 commented Oct 22, 2025

Copy link
Copy Markdown
Contributor

Description

This PR changes the ingestion flow in clp-s so that we directly ingest the parsed representation of logtext (EncodedTextAst) from the KV-IR instead of marhsalling it into the original logtext and parsing it against. This seems to improve ingestion speed for KV-IR inputs by about 25% compared to the previous implementation.

These changes do somewhat decrease JSON ingestion speed. Across the open source datasets overall ingestion speed reduction ranged from 0.5%-4.5%. This isn't ideal, but should be fine for now. Granted, we should probably revisit JSON ingestion performance soon since several recent changes have slightly degraded ingestion speed.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Existing tests already cover KV-IR and JSON ingestion -- validated that these tests still pass
  • Validated that performance for KV-IR ingestion

Summary by CodeRabbit

  • Refactor
    • Consolidated handling of encoded-text messages across ingestion and archival paths.
  • New Features
    • Direct handling and storage of two encoded-text message variants without decoding.
    • Public message representation extended to include encoded-text types.
  • Bug Fixes
    • More accurate tracking and reporting of original uncompressed byte counts for archived messages.
  • Chores
    • Adjusted build linkage visibility for the archive writer component.

@gibber9809 gibber9809 requested review from a team and wraymo as code owners October 22, 2025 19:29
@coderabbitai

coderabbitai Bot commented Oct 22, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Refactors encoding APIs to accept and propagate ir::EncodedTextAst views instead of full LogEvent objects, updates call sites and parsers to emit/store encoded-text ASTs, extends parsed-message variant to include encoded-text AST types, and changes a CMake target link visibility to PUBLIC.

Changes

Cohort / File(s) Summary
EncodedVariableInterpreter header
components/core/src/clp/EncodedVariableInterpreter.hpp
Added #include "ir/EncodedTextAst.hpp". Changed an overload of encode_and_add_to_dictionary to accept ir::EncodedTextAst<EncodedVariableType> const& log_message (renamed param) and updated docs/impl to use the EncodedTextAst and emit raw byte count via out param.
EncodedVariableInterpreter callsite (Archive writer)
components/core/src/clp/streaming_archive/writer/Archive.cpp
Updated write_log_event_ir to call encode_and_add_to_dictionary with log_event.get_message() (the message/view) and pass an added original_num_bytes out parameter; subsequent logic uses that byte count.
ParsedMessage variant
components/core/src/clp_s/ParsedMessage.hpp
Added #include "../clp/ir/EncodedTextAst.hpp". Extended variable_t variant to include clp::ir::EightByteEncodedTextAst and clp::ir::FourByteEncodedTextAst between std::string and bool.
JSON parser: emit/store encoded views
components/core/src/clp_s/JsonParser.cpp
For NodeType::ClpString and NodeType::UnstructuredArray, non-timestamp paths now emit/store EightByteEncodedTextAst or FourByteEncodedTextAst views directly; timestamped paths decode once, ingest a timestamp entry, and record encoding id + timestamp. Unified Four/EightByte handling.
Column writer: accept encoded ASTs
components/core/src/clp_s/ColumnWriter.cpp
ClpStringColumnWriter::add_value now branches to handle clp::ir::EightByteEncodedTextAst and clp::ir::FourByteEncodedTextAst, delegating to encode_and_add_to_dictionary with an estimated_raw_size; preserves existing std::string path.
Build configuration visibility
components/core/src/clp_s/CMakeLists.txt
Moved clp_s::clp_dependencies from PRIVATE to PUBLIC in target_link_libraries for clp_s_archive_writer.

Sequence Diagram(s)

sequenceDiagram
    participant JsonParser as JsonParser
    participant ParsedMsg as ParsedMessage
    participant ColumnWriter as ColumnWriter
    participant ArchiveWriter as ArchiveWriter
    participant EncodedVarInterp as EncodedVariableInterpreter

    Note over JsonParser,ParsedMsg: Parser emits EncodedTextAst views when possible
    JsonParser->>ParsedMsg: produce Eight/FourByte EncodedTextAst
    ParsedMsg->>ColumnWriter: pass EncodedTextAst value
    ColumnWriter->>EncodedVarInterp: encode_and_add_to_dictionary(EncodedTextAst, ..., &original_num_bytes)
    EncodedVarInterp-->>ColumnWriter: encoded_vars + var_ids and original_num_bytes
    ColumnWriter->>ArchiveWriter: forward encoded_vars / var_ids / original_num_bytes
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Files warranting extra attention:
    • components/core/src/clp_s/JsonParser.cpp — timestamp vs non-timestamp branching and lifetimes of encoded-text views.
    • components/core/src/clp_s/ColumnWriter.cpp — variant matching and estimated_raw_size correctness.
    • components/core/src/clp/EncodedVariableInterpreter.hpp and its implementation — signature change, include, and out-parameter behavior.
    • components/core/src/clp/streaming_archive/writer/Archive.cpp — propagation and use of original_num_bytes.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "Optimize KV-IR ingestion by directly ingesting EncodedTextAst into clp-s archives" directly and accurately reflects the main objective of the changeset. The modifications across multiple files (EncodedVariableInterpreter.hpp, Archive.cpp, ColumnWriter.cpp, JsonParser.cpp, and ParsedMessage.hpp) consistently work toward this single primary goal: replacing the previous round-trip serialization/deserialization approach with direct EncodedTextAst ingestion, which improves KV-IR ingestion speed by approximately 25%. The title is concise, clear, and free of vague terminology, making it easy for team members scanning history to immediately understand the core change.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d64bed and 301cccb.

📒 Files selected for processing (3)
  • components/core/src/clp_s/ColumnWriter.cpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
  • components/core/src/clp_s/ParsedMessage.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/src/clp_s/ParsedMessage.hpp
  • components/core/src/clp_s/JsonParser.cpp
🧠 Learnings (3)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/ColumnWriter.cpp (2)
components/core/src/clp_s/ParsedMessage.hpp (4)
  • value (83-85)
  • value (83-83)
  • value (93-95)
  • value (93-93)
components/core/src/clp_s/ColumnWriter.hpp (10)
  • value (29-29)
  • value (59-59)
  • value (76-76)
  • value (94-94)
  • value (111-111)
  • value (131-131)
  • value (149-149)
  • value (176-176)
  • value (233-233)
  • value (251-251)
components/core/src/clp_s/JsonParser.cpp (3)
components/core/src/clp_s/ArchiveWriter.cpp (2)
  • matches_timestamp (290-299)
  • matches_timestamp (290-290)
components/core/src/clp_s/ParsedMessage.hpp (6)
  • node_id (48-50)
  • node_id (48-48)
  • node_id (58-60)
  • node_id (58-58)
  • node_id (68-70)
  • node_id (68-68)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_timestamp_key (57-57)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (5)
components/core/src/clp_s/ParsedMessage.hpp (2)

11-11: LGTM: Include directive is correct.

The include path for EncodedTextAst.hpp is properly specified and necessary for the new variant types.


23-24: No action required—variant reordering is safe.

The codebase exclusively uses type-based variant access (std::get<Type>(), std::holds_alternative<Type>()), not index-based access. Reordering types in the variable_t variant does not affect type-based operations. No numeric index-based access patterns or .index() comparisons were found in the codebase.

components/core/src/clp_s/ColumnWriter.cpp (2)

11-11: LGTM!

The include is necessary for the new EncodedTextAst types used in the refactored logic.


92-92: Incorporate estimated_raw_size into the return value or cease calculating it.

The estimated_raw_size variable is populated by encode_and_add_to_dictionary() on lines 109 and 118, but the value is never read afterward. Unlike Archive::write_log_event_ir() which captures and uses this output parameter when calling write_encoded_msg(), the value is discarded here. The [[maybe_unused]] marker suppresses compiler warnings but obscures this unused computation. Either:

  • Include estimated_raw_size in the return value calculation (compare line 126 with how Archive.cpp uses the value), or
  • Remove it from the function calls to avoid unnecessary computation.
components/core/src/clp_s/JsonParser.cpp (1)

1219-1237: Verify EncodedTextAst lifetime and confirm decode error handling.

The search results did not return sufficient information to conclusively determine how ParsedMessage stores or copies EncodedTextAst payloads, or when ArchiveWriter::append_message consumes them. Since the absence of evidence is not evidence of absence, please manually verify:

  1. ParsedMessage::add_value() – Check whether it deep-copies the EncodedTextAst views passed to it, or stores references/views that become invalid after ir_unit_handler.clear().
  2. Timing of consumption – Confirm that ArchiveWriter::append_message() consumes all message values synchronously before any buffers or units are cleared.

The unchecked decode_and_unparse().value() on the timestamp path (lines 1219–1237) remains a concrete issue: the code does not guard against decode errors. Add an explicit has_error() check before calling .value(), or use the proposed helper function to harden error handling.

⛔ Skipped due to learnings
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Comment thread components/core/src/clp_s/ColumnWriter.cpp
Comment thread components/core/src/clp_s/JsonParser.cpp

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall lgtm.
Shall we create an issue to track improving ParsedMessage::variable_t's efficiency?

}
} break;
case NodeType::ClpString: {
bool uses_eight_byte_encoding{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool uses_eight_byte_encoding{
bool const uses_eight_byte_encoding{

}
} break;
case NodeType::ClpString: {
bool uses_eight_byte_encoding{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about is_eight_byte_encoding?

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8613ba1 and 28f8c45.

📒 Files selected for processing (1)
  • components/core/src/clp_s/JsonParser.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonParser.cpp
🧠 Learnings (2)
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonParser.cpp (2)
components/core/src/clp_s/ArchiveWriter.cpp (2)
  • matches_timestamp (290-299)
  • matches_timestamp (290-290)
components/core/src/clp_s/ParsedMessage.hpp (6)
  • node_id (48-50)
  • node_id (48-48)
  • node_id (58-60)
  • node_id (58-58)
  • node_id (68-70)
  • node_id (68-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (2)
components/core/src/clp_s/JsonParser.cpp (2)

1239-1260: LGTM on timestamp ClpString handling.

The approach correctly decodes the AST (EightByte or FourByte) to obtain the string value needed for timestamp parsing, then follows the unified timestamp ingestion path via ingest_timestamp_entry. This is necessary since timestamp extraction requires the decoded string representation.


1219-1237: Ownership semantics verified—code is safe to merge.

The direct AST ingestion approach is sound. get_immutable_view() returns const references to AST objects within KeyValuePairLogEvent, which are then passed to add_value(). The emplace() call synchronously copies each AST object into m_current_parsed_message's variant, making the stored data independent of the original KeyValuePairLogEvent lifetime. This design achieves the PR's goal of avoiding unnecessary decode/re-encode cycles while maintaining proper ownership.

The variable naming is_eight_byte_encoding improves clarity, and the code properly follows the coding guideline (false == matches_timestamp).

Comment thread components/core/src/clp_s/JsonParser.cpp
@gibber9809

Copy link
Copy Markdown
Contributor Author

Overall lgtm. Shall we create an issue to track improving ParsedMessage::variable_t's efficiency?

I put up the issue here.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat(clp-s): Optimize KV-IR ingestion by directly ingesting `EncodedTextAst` into clp-s archives.

@gibber9809 gibber9809 changed the title feat(clp-s): Directly ingest EncodedTextAst fields from KV-IR streams. feat(clp-s): Optimize KV-IR ingestion by directly ingesting EncodedTextAst into clp-s archives. Oct 28, 2025
@gibber9809 gibber9809 merged commit 249d7b7 into y-scope:main Oct 28, 2025
29 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…extAst` into clp-s archives. (y-scope#1468)

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants