feat(clp-s): Optimize KV-IR ingestion by directly ingesting EncodedTextAst into clp-s archives.#1468
Conversation
WalkthroughRefactors encoding APIs to accept and propagate ir::EncodedTextAst views instead of full LogEvent objects, updates call sites and parsers to emit/store encoded-text ASTs, extends parsed-message variant to include encoded-text AST types, and changes a CMake target link visibility to PUBLIC. Changes
Sequence Diagram(s)sequenceDiagram
participant JsonParser as JsonParser
participant ParsedMsg as ParsedMessage
participant ColumnWriter as ColumnWriter
participant ArchiveWriter as ArchiveWriter
participant EncodedVarInterp as EncodedVariableInterpreter
Note over JsonParser,ParsedMsg: Parser emits EncodedTextAst views when possible
JsonParser->>ParsedMsg: produce Eight/FourByte EncodedTextAst
ParsedMsg->>ColumnWriter: pass EncodedTextAst value
ColumnWriter->>EncodedVarInterp: encode_and_add_to_dictionary(EncodedTextAst, ..., &original_num_bytes)
EncodedVarInterp-->>ColumnWriter: encoded_vars + var_ids and original_num_bytes
ColumnWriter->>ArchiveWriter: forward encoded_vars / var_ids / original_num_bytes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp_s/ColumnWriter.cpp(2 hunks)components/core/src/clp_s/JsonParser.cpp(2 hunks)components/core/src/clp_s/ParsedMessage.hpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/ColumnWriter.cppcomponents/core/src/clp_s/ParsedMessage.hppcomponents/core/src/clp_s/JsonParser.cpp
🧠 Learnings (3)
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/ColumnWriter.cpp (2)
components/core/src/clp_s/ParsedMessage.hpp (4)
value(83-85)value(83-83)value(93-95)value(93-93)components/core/src/clp_s/ColumnWriter.hpp (10)
value(29-29)value(59-59)value(76-76)value(94-94)value(111-111)value(131-131)value(149-149)value(176-176)value(233-233)value(251-251)
components/core/src/clp_s/JsonParser.cpp (3)
components/core/src/clp_s/ArchiveWriter.cpp (2)
matches_timestamp(290-299)matches_timestamp(290-290)components/core/src/clp_s/ParsedMessage.hpp (6)
node_id(48-50)node_id(48-48)node_id(58-60)node_id(58-58)node_id(68-70)node_id(68-68)components/core/src/clp_s/CommandLineArguments.hpp (1)
m_timestamp_key(57-57)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (5)
components/core/src/clp_s/ParsedMessage.hpp (2)
11-11: LGTM: Include directive is correct.The include path for
EncodedTextAst.hppis properly specified and necessary for the new variant types.
23-24: No action required—variant reordering is safe.The codebase exclusively uses type-based variant access (
std::get<Type>(),std::holds_alternative<Type>()), not index-based access. Reordering types in thevariable_tvariant does not affect type-based operations. No numeric index-based access patterns or.index()comparisons were found in the codebase.components/core/src/clp_s/ColumnWriter.cpp (2)
11-11: LGTM!The include is necessary for the new
EncodedTextAsttypes used in the refactored logic.
92-92: Incorporate estimated_raw_size into the return value or cease calculating it.The
estimated_raw_sizevariable is populated byencode_and_add_to_dictionary()on lines 109 and 118, but the value is never read afterward. UnlikeArchive::write_log_event_ir()which captures and uses this output parameter when callingwrite_encoded_msg(), the value is discarded here. The[[maybe_unused]]marker suppresses compiler warnings but obscures this unused computation. Either:
- Include
estimated_raw_sizein the return value calculation (compare line 126 with how Archive.cpp uses the value), or- Remove it from the function calls to avoid unnecessary computation.
components/core/src/clp_s/JsonParser.cpp (1)
1219-1237: Verify EncodedTextAst lifetime and confirm decode error handling.The search results did not return sufficient information to conclusively determine how ParsedMessage stores or copies EncodedTextAst payloads, or when ArchiveWriter::append_message consumes them. Since the absence of evidence is not evidence of absence, please manually verify:
- ParsedMessage::add_value() – Check whether it deep-copies the EncodedTextAst views passed to it, or stores references/views that become invalid after
ir_unit_handler.clear().- Timing of consumption – Confirm that ArchiveWriter::append_message() consumes all message values synchronously before any buffers or units are cleared.
The unchecked
decode_and_unparse().value()on the timestamp path (lines 1219–1237) remains a concrete issue: the code does not guard against decode errors. Add an explicithas_error()check before calling.value(), or use the proposed helper function to harden error handling.⛔ Skipped due to learnings
Learnt from: AVMatthews PR: y-scope/clp#543 File: components/core/src/clp_s/JsonParser.cpp:735-794 Timestamp: 2024-10-07T21:35:04.362Z Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.Learnt from: AVMatthews PR: y-scope/clp#543 File: components/core/src/clp_s/JsonParser.cpp:769-779 Timestamp: 2024-10-07T21:16:41.660Z Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.Learnt from: gibber9809 PR: y-scope/clp#630 File: components/core/src/clp_s/JsonParser.cpp:702-703 Timestamp: 2024-12-10T16:03:08.691Z Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Overall lgtm.
Shall we create an issue to track improving ParsedMessage::variable_t's efficiency?
| } | ||
| } break; | ||
| case NodeType::ClpString: { | ||
| bool uses_eight_byte_encoding{ |
There was a problem hiding this comment.
| bool uses_eight_byte_encoding{ | |
| bool const uses_eight_byte_encoding{ |
| } | ||
| } break; | ||
| case NodeType::ClpString: { | ||
| bool uses_eight_byte_encoding{ |
There was a problem hiding this comment.
How about is_eight_byte_encoding?
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/JsonParser.cpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonParser.cpp
🧠 Learnings (2)
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (1)
components/core/src/clp_s/JsonParser.cpp (2)
components/core/src/clp_s/ArchiveWriter.cpp (2)
matches_timestamp(290-299)matches_timestamp(290-290)components/core/src/clp_s/ParsedMessage.hpp (6)
node_id(48-50)node_id(48-48)node_id(58-60)node_id(58-58)node_id(68-70)node_id(68-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (2)
components/core/src/clp_s/JsonParser.cpp (2)
1239-1260: LGTM on timestamp ClpString handling.The approach correctly decodes the AST (EightByte or FourByte) to obtain the string value needed for timestamp parsing, then follows the unified timestamp ingestion path via
ingest_timestamp_entry. This is necessary since timestamp extraction requires the decoded string representation.
1219-1237: Ownership semantics verified—code is safe to merge.The direct AST ingestion approach is sound.
get_immutable_view()returns const references to AST objects withinKeyValuePairLogEvent, which are then passed toadd_value(). Theemplace()call synchronously copies each AST object intom_current_parsed_message's variant, making the stored data independent of the originalKeyValuePairLogEventlifetime. This design achieves the PR's goal of avoiding unnecessary decode/re-encode cycles while maintaining proper ownership.The variable naming
is_eight_byte_encodingimproves clarity, and the code properly follows the coding guideline (false == matches_timestamp).
I put up the issue here. |
LinZhihao-723
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat(clp-s): Optimize KV-IR ingestion by directly ingesting `EncodedTextAst` into clp-s archives.
EncodedTextAst fields from KV-IR streams.EncodedTextAst into clp-s archives.
…extAst` into clp-s archives. (y-scope#1468) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description
This PR changes the ingestion flow in clp-s so that we directly ingest the parsed representation of logtext (
EncodedTextAst) from the KV-IR instead of marhsalling it into the original logtext and parsing it against. This seems to improve ingestion speed for KV-IR inputs by about 25% compared to the previous implementation.These changes do somewhat decrease JSON ingestion speed. Across the open source datasets overall ingestion speed reduction ranged from 0.5%-4.5%. This isn't ideal, but should be fine for now. Granted, we should probably revisit JSON ingestion performance soon since several recent changes have slightly degraded ingestion speed.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit