feat(clp-s::timestamp_parser): Make control characters and lone double quotes illegal in timestamp patterns.#1723
Conversation
…start and end) in timestamp patterns.
…uences and fix bugs.
WalkthroughAdds JSON-style escape handling and stricter pattern validation to the timestamp parser, introduces three exported predicate helpers and dynamic pattern assembly, threads an Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used📓 Path-based instructions (1)**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}⚙️ CodeRabbit configuration file
Files:
🧠 Learnings (4)📚 Learning: 2024-10-07T21:16:41.660ZApplied to files:
📚 Learning: 2024-10-07T21:35:04.362ZApplied to files:
📚 Learning: 2025-07-17T16:08:23.185ZApplied to files:
📚 Learning: 2025-07-17T16:02:57.688ZApplied to files:
🧬 Code graph analysis (1)components/core/src/clp_s/timestamp_parser/TimestampParser.hpp (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| for (size_t pattern_idx{start_idx}; pattern_idx < end_idx; ++pattern_idx) { | ||
| auto const cur_format_specifier{pattern.at(pattern_idx)}; | ||
| if (is_unsupported_character(cur_format_specifier)) { | ||
| return ErrorCode{ErrorCodeEnum::InvalidTimestampPattern}; |
There was a problem hiding this comment.
How about we create two special error code for unsupported control characters and unsupported escape sequence?
| if ('\\' == timestamp[timestamp_idx]) { | ||
| size_t const field_length{is_json_literal ? 2ULL : 1ULL}; | ||
| if (timestamp_idx + field_length > timestamp.size()) { | ||
| return ErrorCode{ErrorCodeEnum::IncompatibleTimestampPattern}; | ||
| } | ||
|
|
||
| if (false == is_json_literal && '\\' == timestamp.at(timestamp_idx)) { | ||
| ++timestamp_idx; | ||
| continue; | ||
| } | ||
| if (timestamp.substr(timestamp_idx).starts_with(cJsonEscapedBackslash)) { | ||
| timestamp_idx += cJsonEscapedBackslash.size(); | ||
| continue; | ||
| } | ||
| return ErrorCode{ErrorCodeEnum::IncompatibleTimestampPattern}; |
There was a problem hiding this comment.
How about rewrite in this way to make it cleaner to understand:
auto const expected_backslash{is_json_literal ? cJsonEscapedBackslash : "\\"};
if (false == timestamp.substr(timestamp_idx).starts_with(expected_backslash)) {
return ErrorCode{ErrorCodeEnum::IncompatibleTimestampPattern};
}
timestamp_idx += expected_backslash.size();
break;start_with should handle the case where the string view is shorter than expected_backslash.
| std::string& generated_pattern, | ||
| bool is_json_literal |
There was a problem hiding this comment.
We should swap them since generated_pattern is an output param
| }; | ||
| for (auto const& illegal_timestamp_pattern_template : illegal_timestamp_pattern_templates) { | ||
| auto const result{TimestampPattern::create(illegal_timestamp_pattern_template)}; | ||
| REQUIRE(result.has_error()); |
There was a problem hiding this comment.
We should assert the returned error code if we plan to distinguish invalid/unsupported chars from a general error.
| * @param c | ||
| * @return Whether `c` is an unsupported character. | ||
| */ | ||
| [[nodiscard]] auto is_unsupported_character(char c) -> bool; |
There was a problem hiding this comment.
Nit: Shall we be explicit to jsut call it "control characters"?
There was a problem hiding this comment.
The thing is that it is control characters + ", so I wanted a more generic name.
| * - \P Unknown-precision epoch time -- resolves to \E, \L, \C, or \N based on a heuristic. | ||
| * - \O{...} One of several literal characters, described by content between {}. | ||
| * - \s Generic zero-padded second (00-60) -- resolves to \S or \J. | ||
| * |
There was a problem hiding this comment.
Shall we document our limitations on those unsupported chars in this docstring?
…n-control-chars-and-escape-sequences
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
components/core/src/clp_s/timestamp_parser/ErrorCode.cpp(1 hunks)components/core/src/clp_s/timestamp_parser/ErrorCode.hpp(1 hunks)components/core/src/clp_s/timestamp_parser/TimestampParser.cpp(14 hunks)components/core/src/clp_s/timestamp_parser/TimestampParser.hpp(5 hunks)components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cpp(11 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/timestamp_parser/ErrorCode.cppcomponents/core/src/clp_s/timestamp_parser/ErrorCode.hppcomponents/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cppcomponents/core/src/clp_s/timestamp_parser/TimestampParser.cppcomponents/core/src/clp_s/timestamp_parser/TimestampParser.hpp
🧠 Learnings (6)
📚 Learning: 2024-11-26T19:12:09.102Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/src/clp/error_handling/ErrorCode.hpp:131-134
Timestamp: 2024-11-26T19:12:09.102Z
Learning: In the error handling system (`components/core/src/clp/error_handling/ErrorCode.hpp`), the `ErrorCode` class is intentionally designed to wrap the `ErrorCodeEnum`, not to use the enum directly. This wrapper provides a general interface for adding customized error enums.
Applied to files:
components/core/src/clp_s/timestamp_parser/ErrorCode.cppcomponents/core/src/clp_s/timestamp_parser/ErrorCode.hpp
📚 Learning: 2024-11-26T19:10:05.051Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 486
File: components/core/src/clp/error_handling/ErrorCode.hpp:34-34
Timestamp: 2024-11-26T19:10:05.051Z
Learning: In the `ErrorCategory` template class in `ErrorCode.hpp`, the `name()` and `message(ErrorCodeEnum)` methods should not be declared as pure virtual functions because the class relies on template specializations, and making them pure virtual would make the class abstract, which is not intended.
Applied to files:
components/core/src/clp_s/timestamp_parser/ErrorCode.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cppcomponents/core/src/clp_s/timestamp_parser/TimestampParser.cppcomponents/core/src/clp_s/timestamp_parser/TimestampParser.hpp
📚 Learning: 2025-01-29T22:16:52.665Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
Applied to files:
components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cppcomponents/core/src/clp_s/timestamp_parser/TimestampParser.hpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build (macos-15)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: build-macos (macos-15, true)
🔇 Additional comments (14)
components/core/src/clp_s/timestamp_parser/ErrorCode.cpp (1)
28-31: LGTM!The new error code messages are clear and appropriately describe the error conditions. The integration with the existing switch statement is correct.
components/core/src/clp_s/timestamp_parser/ErrorCode.hpp (1)
13-15: LGTM!The new error codes
InvalidEscapeSequenceandInvalidCharacterare appropriately added to the enum, supporting the new validation logic for timestamp patterns.components/core/src/clp_s/timestamp_parser/test/test_TimestampParser.cpp (3)
146-179: LGTM! Comprehensive test coverage for illegal pattern validation.The test cases thoroughly cover:
- Control characters (
\x00,\x01,\x1f)- Double quotes in invalid positions
- All unsupported JSON escape sequences (
\b,\f,\n,\r,\t,\u,\uXXXX)The error code assertions are correctly implemented as requested in the past review.
181-198: LGTM! Well-structured tests for JSON-literal vs raw UTF-8 backslash handling.The test cases correctly verify:
is_json_literal=false: pattern\\matches input\(single backslash)is_json_literal=true: pattern\\matches input\\(escaped backslash)
656-686: LGTM! Good integration test for JSON escape sequence round-trip.The test verifies that timestamps with JSON-escaped backslashes can be parsed with
is_json_literal=trueand that marshalling produces the same output, ensuring bidirectional consistency.components/core/src/clp_s/timestamp_parser/TimestampParser.hpp (2)
202-211: LGTM!The documentation clearly specifies the allowed escape sequence (
\\) and illegal characters (control characters\x00-\x1Fand double quotes except as pattern delimiters).
232-237: LGTM!The new
is_json_literalparameter is consistently added to bothparse_timestampandsearch_known_timestamp_patterns, enabling proper handling of JSON-escaped content during parsing.Also applies to: 265-270
components/core/src/clp_s/timestamp_parser/TimestampParser.cpp (7)
121-122: LGTM!The
cJsonEscapedBackslashconstant is appropriately defined for representing a JSON-escaped backslash.
884-886: Verify signed char behavior for control character detection.The condition
'\x00' <= c && c <= '\x1f'may behave unexpectedly on platforms with signedchar. Ifchas a value >= 0x80 (extended ASCII),cwould be negative as a signed char, and'\x00' <= cwould be false, which is actually the desired behavior. However, for clarity and portability, consider casting tounsigned char.auto is_unsupported_character(char c) -> bool { - return '"' == c || ('\x00' <= c && c <= '\x1f'); + auto const uc = static_cast<unsigned char>(c); + return '"' == c || uc <= '\x1f'; }
913-941: LGTM!The pattern validation logic is well-structured:
- Correctly handles quoted patterns (allowing
"only as first/last character)- Validates characters before checking escape sequences
- Properly rejects unsupported JSON escape sequences while allowing
\\- Allows repeated use of
\O{}and\\escape sequences
759-764: LGTM!The marshal functions correctly:
- Output JSON-escaped backslashes (
\\) for the\\escape sequence- Return
InvalidEscapeSequencefor unrecognized escape sequences, providing more specific error informationAlso applies to: 858-863
1748-1757: LGTM!The backslash handling correctly differentiates between JSON literals and raw UTF-8:
- For JSON literals (
is_json_literal=true): expects\\in the timestamp- For raw UTF-8 (
is_json_literal=false): expects\in the timestampThe use of
starts_withproperly handles edge cases. The coding guideline (false == <expression>) is correctly followed.
1102-1105: LGTM!Using
std::string_view{m_pattern}.substr()instead ofm_pattern.substr()avoids potential temporary allocations and is more efficient for these read-only operations.Also applies to: 1117-1120, 1134-1137, 1149-1152
1838-1851: LGTM!The
is_json_literalparameter is correctly propagated toparse_timestamp, ensuring consistent JSON-literal handling throughout the timestamp pattern search.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Slightly polished the docstring otherwise we should be ready to go.
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
LinZhihao-723
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat(clp-s::timestamp_parser): Invalidate unsupported characters and escape sequences in timestamp patterns.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp (1)
29-30: Fix the missing closing backtick in the documentation.Line 30 is missing a closing backtick after
pattern.Apply this diff to fix the typo:
- * - ErrorCodeEnum::InvalidCharacter if `pattern contains an unsupported character. + * - ErrorCodeEnum::InvalidCharacter if `pattern` contains an unsupported character.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp
🧠 Learnings (3)
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp
📚 Learning: 2025-07-17T16:08:23.185Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:42-43
Timestamp: 2025-07-17T16:08:23.185Z
Learning: In log-surgeon, regex patterns like `(?:\d{1,2}| \d)` are not supported. Log-surgeon has its own regex implementation with different capabilities compared to standard regex engines. Patterns like `[ 0-9]{2}` that allow character classes with spaces are valid in log-surgeon schemas.
Applied to files:
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp
📚 Learning: 2025-07-17T16:02:57.688Z
Learnt from: SharafMohamed
Repo: y-scope/clp PR: 1033
File: components/core/config/schemas.txt:96-97
Timestamp: 2025-07-17T16:02:57.688Z
Learning: In log-surgeon, the `.` character is treated as a literal non-delimiter character rather than a regex wildcard that matches any character. This means patterns like `.*` in log-surgeon schemas match sequences of non-delimiter characters, not arbitrary characters like in standard regex.
Applied to files:
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp
🧬 Code graph analysis (1)
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp (1)
components/core/src/clp_s/timestamp_parser/TimestampParser.cpp (1)
nodiscard(1838-1851)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: build (macos-15)
- GitHub Check: build (ubuntu-24.04)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (4)
components/core/src/clp_s/timestamp_parser/TimestampParser.hpp (4)
232-237: LGTM! The new parameter is well-documented and appropriately positioned.The addition of the
is_json_literalparameter aligns with the PR objectives to support parsing both JSON literals and raw UTF-8 strings. The parameter is correctly documented and positioned before the output parameter.
230-230: LGTM! Error code documentation is complete.The addition of
ErrorCodeEnum::InvalidEscapeSequenceto the documented error codes properly describes the new validation behaviour.
240-240: LGTM! Documentation accurately describes JSON literal marshalling.The updated documentation clearly indicates that timestamps are marshalled as JSON literals, which aligns with the PR's goal of supporting JSON-literal handling.
265-270: LGTM! Signature update is consistent with the parsing function.The addition of the
is_json_literalparameter tosearch_known_timestamp_patternsis consistent with the corresponding change toparse_timestamp, and the implementation correctly propagates this flag through to the parsing logic.
How about: Trying to be a bit more precise since technically the unsupported escape sequences were already illegal, we're just being a bit more defensive about them. |
…e quotes illegal in timestamp patterns. (y-scope#1723) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
Description
This PR is the last in the series of PRs readying
clp_s::timestamp_parserfor use in clp-s (#1708, #1673, #1646, #1626, #1599, #1564, and #1385).This PR makes several characters and escape sequences illegal in timestamp patterns, as well as fixes parsing and marshalling behaviour for the
\\escape sequence, in order to put timestamp patterns in a state where:The banned characters and escape sequences are:
\x00-\x1F)"(unless it is both the first and last character in a timestamp pattern, then the start and end quote are allowed)\\(\uXXXX,\b,\f,\n,\r,\t)Section 2.4 of the timestamp format doc has been updated to reflect these changes.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.