feat(clp-s): Explicitly reject unstructured log inputs during compression.#1434
Conversation
WalkthroughAdds a new FileType LogText with detection and reader-path integration, prevents direct ingestion of unstructured LogText in JsonParser, and integrates simdjson into the build and clp_s_io linkage. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Detection as "InputConfig::peek_start_and_deduce_type"
participant Factory as "try_deduce_reader_type"
participant Parser as "JsonParser::ingest"
participant Archive as "ArchiveWriter"
User->>Detection: provide peek_buf
alt detected as LogText
Detection->>Factory: returns FileType::LogText
Factory->>Parser: select LogText reader path
Parser->>Archive: close()
Parser-->>User: returns false (error)
else detected as Json/KV-IR/other
Detection->>Factory: returns other FileType
Factory->>Parser: select appropriate reader
Parser-->>User: continue ingestion
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/cmake/Options/options.cmake(1 hunks)components/core/src/clp_s/CMakeLists.txt(1 hunks)components/core/src/clp_s/InputConfig.cpp(6 hunks)components/core/src/clp_s/InputConfig.hpp(1 hunks)components/core/src/clp_s/JsonParser.cpp(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/InputConfig.hppcomponents/core/src/clp_s/InputConfig.cpp
🧠 Learnings (3)
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: rust-checks (macos-15)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (8)
components/core/src/clp_s/InputConfig.hpp (1)
26-26: LGTM!The addition of
LogTextto theFileTypeenum is straightforward and correctly positioned betweenKeyValueIrandZstd.components/core/src/clp_s/CMakeLists.txt (1)
217-217: LGTM!The private linkage of
simdjson::simdjsonto theclp_s_iotarget is appropriate for the new LogText detection functionality that requires UTF-8 validation.components/core/cmake/Options/options.cmake (1)
273-273: LGTM!Adding
CLP_NEED_SIMDJSONto the dependency flags forCLP_BUILD_CLP_S_IOis consistent with the CMakeLists.txt changes and properly integrates simdjson into the build configuration.components/core/src/clp_s/JsonParser.cpp (1)
640-646: LGTM!The error handling for
FileType::LogTextis correct: it logs a clear error message, properly closes the archive writer, and returns false to reject unsupported unstructured log inputs. This aligns with the PR objective to explicitly reject logtext during compression.components/core/src/clp_s/InputConfig.cpp (4)
10-10: LGTM!The new includes (
<optional>,<simdjson.h>,utf8_utils.hpp) are correctly added to support the logtext detection functionality.Also applies to: 14-14, 26-26
140-146: LGTM!The function declaration and documentation are clear and consistent with the existing
could_be_zstd,could_be_kvir, andcould_be_jsonfunctions.
319-321: LGTM!The integration of
could_be_logtextinto the type detection flow is correct, with logtext detection performed after JSON detection and before returningFileType::Unknown.
368-368: LGTM!The
FileType::LogTextcase is correctly added to the switch statement, allowing detected logtext inputs to return early with the appropriate reader chain.
davidlion
left a comment
There was a problem hiding this comment.
One nit pick, but otherwise lgtm.
For the title I think we can drop the first part and just have what you wrote after the semicolon (since you can't reject/check for something without support anyway).
| auto could_be_logtext(char const* peek_buf, size_t peek_size) -> bool { | ||
| constexpr size_t cMaxUtf8CodepointBytes = 4ULL; | ||
| // Full 4-byte character followed by truncated 3-bytes of 4-byte character. | ||
| constexpr size_t cMaxRunWithoutFullUtf8Codepoint = 2 * cMaxUtf8CodepointBytes - 1ULL; | ||
|
|
||
| size_t cur_byte{ | ||
| peek_size < cMaxRunWithoutFullUtf8Codepoint | ||
| ? 0ULL | ||
| : (peek_size - cMaxRunWithoutFullUtf8Codepoint) | ||
| }; | ||
|
|
||
| std::optional<size_t> legal_last_byte_index{std::nullopt}; | ||
| auto mark_last_legal_character = [&](size_t remaining_bytes_in_char) { | ||
| auto const last_byte_in_char = cur_byte + remaining_bytes_in_char; | ||
| if (last_byte_in_char < peek_size) { | ||
| legal_last_byte_index = last_byte_in_char; | ||
| } | ||
| cur_byte = last_byte_in_char; | ||
| }; | ||
|
|
||
| // Find the last complete UTF-8 character if it exists. We don't bother validating continuation | ||
| // bytes, since they will be checked in the final validation step anyway. | ||
| for (; cur_byte < peek_size; ++cur_byte) { | ||
| uint8_t const c{static_cast<uint8_t>(peek_buf[cur_byte])}; | ||
| if ((clp::cFourByteUtf8CharHeaderMask & c) == clp::cFourByteUtf8CharHeader) { | ||
| mark_last_legal_character(3ULL); | ||
| } else if ((clp::cThreeByteUtf8CharHeaderMask & c) == clp::cThreeByteUtf8CharHeader) { | ||
| mark_last_legal_character(2ULL); | ||
| } else if ((clp::cTwoByteUtf8CharHeaderMask & c) == clp::cTwoByteUtf8CharHeader) { | ||
| mark_last_legal_character(1ULL); | ||
| } else if (clp::utf8_utils_internal::is_ascii_char(c)) { | ||
| mark_last_legal_character(0ULL); | ||
| } | ||
| } | ||
|
|
||
| // We were not able to find a complete UTF-8 character, so the input is definitely not UTF-8. | ||
| if (false == legal_last_byte_index.has_value()) { | ||
| return false; | ||
| } | ||
|
|
||
| return simdjson::validate_utf8(peek_buf, legal_last_byte_index.value() + 1ULL); |
There was a problem hiding this comment.
We should try to follow the google and linux guidelines on comments. Generally speaking, the only time to have comments inline of code should be to explain some confusing behaviour (usually from an external library/dependency).
For comments applying to only a few lines usually it implies the naming is too confusing to read or the comment is just repeating the code. If it is not possible to making the naming in the code readable on its own, then we can add an implementation comment above the method/function definition. This puts all of the logic in one place so a reader can understand the implementation without needing to scan everything.
In this specific case, I think merging these comments together above the definition is fine.
(Sorry, I know most of the code generally doesn't follow this yet, but we need to start somewhere.)
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/InputConfig.cpp (1)
297-305: 2‑byte UTF‑8 header compare is now correct.The constant/mask pair for 2‑byte headers is fixed here; resolves the earlier mismatch.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/InputConfig.cpp(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/InputConfig.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: rust-checks (ubuntu-24.04)
- GitHub Check: rust-checks (ubuntu-22.04)
🔇 Additional comments (4)
components/core/src/clp_s/InputConfig.cpp (4)
10-10: Build/linkage check for new deps.Ensure CI targets locate <simdjson.h> and link against simdjson::simdjson in all build configs; confirm utf8_utils.hpp is exported where needed.
Also applies to: 14-14, 26-26
140-147: API shape LGTM.Clear contract for logtext detection; name and comments are consistent with usage.
335-338: Detection order LGTM.Placing LogText after JSON preserves stricter JSON detection first.
384-385: Early return on LogText is appropriate.Treating LogText like Json/KV‑IR in the non‑decompressor chain aligns with the new ingestion policy.
* feat(webui): Add drawer to display guided query and errors. (y-scope#1421) * docs: Add Slack community invite badge to home page README. (y-scope#1418) * refactor(clp-package): Simplify StrEnum and Path serialization via Annotated serializers. (y-scope#1417) Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * build(clp-package): Adopt uv + hatchling as the build and packaging backend for Python components (resolves y-scope#1396); Upgrade dependencies for Python components. (y-scope#1405) * chore(docker): Add `--link` flags to COPY/ADD operations for improved build performance (fixes y-scope#1408). (y-scope#1411) * fix(ci): Correctly update and restore cache of `lint:check-cpp-lint-static-full`'s generated files (fixes y-scope#1419): (y-scope#1430) - Save cache entries using unique key per entry. - Restore latest entries using key prefix. - Avoid using outputs from optionally-run `restore` task. * fix(clp-rust-utils): Use AWS SDK default configuration with latest behavior version for S3 client. (y-scope#1445) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * refactor(clp-package): Remove unused `python-dotenv` dependency and related imports (fixes y-scope#1443). (y-scope#1444) * fix(webui): Submit queries that failed ANTLR validation to Presto. (y-scope#1450) * feat(clp-s): Explicitly reject unstructured log inputs during compression. (y-scope#1434) * feat(webui): Show query speed in native search status. (y-scope#1429) * fix(job-orchestration): Make `tag_ids` a required `list[int]` for compatibility with Spider compressor. (y-scope#1453) * feat(clp-mcp-server): Add log viewer links to query results for displaying in LLM output. (y-scope#1454) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * feat(ci): Add tasks for checking and updating Rust lock file (`Cargo.lock`); Add check to GH workflow. (y-scope#1448) * feat(webui): Trigger submit action when pressing Enter on Monaco single line editor. (y-scope#1459) * Add filters. * Update cargo lock. --------- Co-authored-by: davemarco <83603688+davemarco@users.noreply.github.com> Co-authored-by: Abigail Matthews <abigail.v.matthews@gmail.com> Co-authored-by: sitaowang1998 <sitaowang1998@outlook.com> Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com> Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com> Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com> Co-authored-by: Huangshi Tian <All-less@users.noreply.github.com>
* feat(webui): Add drawer to display guided query and errors. (y-scope#1421) * docs: Add Slack community invite badge to home page README. (y-scope#1418) * refactor(clp-package): Simplify StrEnum and Path serialization via Annotated serializers. (y-scope#1417) Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * build(clp-package): Adopt uv + hatchling as the build and packaging backend for Python components (resolves y-scope#1396); Upgrade dependencies for Python components. (y-scope#1405) * chore(docker): Add `--link` flags to COPY/ADD operations for improved build performance (fixes y-scope#1408). (y-scope#1411) * fix(ci): Correctly update and restore cache of `lint:check-cpp-lint-static-full`'s generated files (fixes y-scope#1419): (y-scope#1430) - Save cache entries using unique key per entry. - Restore latest entries using key prefix. - Avoid using outputs from optionally-run `restore` task. * fix(clp-rust-utils): Use AWS SDK default configuration with latest behavior version for S3 client. (y-scope#1445) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * refactor(clp-package): Remove unused `python-dotenv` dependency and related imports (fixes y-scope#1443). (y-scope#1444) * fix(webui): Submit queries that failed ANTLR validation to Presto. (y-scope#1450) * feat(clp-s): Explicitly reject unstructured log inputs during compression. (y-scope#1434) * feat(webui): Show query speed in native search status. (y-scope#1429) * fix(job-orchestration): Make `tag_ids` a required `list[int]` for compatibility with Spider compressor. (y-scope#1453) * feat(clp-mcp-server): Add log viewer links to query results for displaying in LLM output. (y-scope#1454) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * feat(ci): Add tasks for checking and updating Rust lock file (`Cargo.lock`); Add check to GH workflow. (y-scope#1448) * feat(webui): Trigger submit action when pressing Enter on Monaco single line editor. (y-scope#1459) * Add filters. * Update cargo lock. --------- Co-authored-by: davemarco <83603688+davemarco@users.noreply.github.com> Co-authored-by: Abigail Matthews <abigail.v.matthews@gmail.com> Co-authored-by: sitaowang1998 <sitaowang1998@outlook.com> Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com> Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com> Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com> Co-authored-by: Huangshi Tian <All-less@users.noreply.github.com>
* feat(webui): Add drawer to display guided query and errors. (y-scope#1421) * docs: Add Slack community invite badge to home page README. (y-scope#1418) * refactor(clp-package): Simplify StrEnum and Path serialization via Annotated serializers. (y-scope#1417) Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * build(clp-package): Adopt uv + hatchling as the build and packaging backend for Python components (resolves y-scope#1396); Upgrade dependencies for Python components. (y-scope#1405) * chore(docker): Add `--link` flags to COPY/ADD operations for improved build performance (fixes y-scope#1408). (y-scope#1411) * fix(ci): Correctly update and restore cache of `lint:check-cpp-lint-static-full`'s generated files (fixes y-scope#1419): (y-scope#1430) - Save cache entries using unique key per entry. - Restore latest entries using key prefix. - Avoid using outputs from optionally-run `restore` task. * fix(clp-rust-utils): Use AWS SDK default configuration with latest behavior version for S3 client. (y-scope#1445) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * refactor(clp-package): Remove unused `python-dotenv` dependency and related imports (fixes y-scope#1443). (y-scope#1444) * fix(webui): Submit queries that failed ANTLR validation to Presto. (y-scope#1450) * feat(clp-s): Explicitly reject unstructured log inputs during compression. (y-scope#1434) * feat(webui): Show query speed in native search status. (y-scope#1429) * fix(job-orchestration): Make `tag_ids` a required `list[int]` for compatibility with Spider compressor. (y-scope#1453) * feat(clp-mcp-server): Add log viewer links to query results for displaying in LLM output. (y-scope#1454) Co-authored-by: Junhao Liao <junhao.liao@yscope.com> * feat(ci): Add tasks for checking and updating Rust lock file (`Cargo.lock`); Add check to GH workflow. (y-scope#1448) * feat(webui): Trigger submit action when pressing Enter on Monaco single line editor. (y-scope#1459) * Add filters. * Update cargo lock. * Stupid fix --------- Co-authored-by: davemarco <83603688+davemarco@users.noreply.github.com> Co-authored-by: Abigail Matthews <abigail.v.matthews@gmail.com> Co-authored-by: sitaowang1998 <sitaowang1998@outlook.com> Co-authored-by: Junhao Liao <junhao@junhao.ca> Co-authored-by: Junhao Liao <junhao.liao@yscope.com> Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com> Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com> Co-authored-by: hoophalab <200652805+hoophalab@users.noreply.github.com> Co-authored-by: Huangshi Tian <All-less@users.noreply.github.com>
Description
This PR adds support for detecting that an input might be unstructured logs during ingestion, and uses that support to explicitly reject unstructured log inputs during compression. This feature will also be used by the upcoming utility that converts unstructured logs to a structured representation for ingestion into clp-s.
The heuristic for detecting logtext is straightforwards -- we simply check if the first peeked chunk of the input stream corresponds to valid UTF-8, while accounting for the fact that the end of the buffer might contain truncated UTF-8.
Since a UTF-8 codepoint is at most 4 bytes, we know that we must scan at most 7 bytes to find the last complete UTF-8 codepoint in our peeked input. This is because in the worst case the stream terminates with a 4-byte codepoint followed by a 4-byte codepoint with its last byte truncated. The approach, then, is to scan the last 7 bytes of the peeked buffer to find the last complete codepoint and then validate that the entire buffer through that last complete codepoint corresponds to valid UTF-8. If there is not a complete codepoint in the last 7 bytes, we also reject the input without having to validate the whole buffer.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes / Behaviour