feat(clp-s): Automatically infer the file types of inputs to compression.#1312
Conversation
…pe in an input stream.
…er when wrapping reader with zstddecompressor
… each input file.
WalkthroughRemoves CLI-configured input file type; adds runtime input-type deduction (JSON, KV‑IR, Zstd) via buffered reader chains. Refactors parser to ingest() with ingest_json/ingest_kvir, updates tests and docs, and adds BufferedReader/BufferReader and streaming-compression sources to builds. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User as User
participant CLPS as clp-s (main)
participant Parser as JsonParser
participant IC as InputConfig
participant BR as BufferedReader
participant DZ as Decompressor
Note over User,CLPS: Single entrypoint (no --file-type)
User->>CLPS: Run clp-s compress ...
CLPS->>Parser: ingest()
Parser->>IC: try_deduce_reader_type(reader)
IC->>BR: wrap base reader (buffer up to 64KB) and peek
alt Zstd detected
IC->>DZ: wrap with Decompressor and open
DZ-->>IC: decompressed stream
IC->>BR: re-wrap/re-peek inner data
end
alt JSON detected
IC-->>Parser: [reader_chain, FileType::Json]
Parser->>Parser: ingest_json(reader_chain.back(), path, creator_id)
else KV‑IR detected
IC-->>Parser: [reader_chain, FileType::KeyValueIr]
Parser->>Parser: ingest_kvir(reader_chain.back(), path, creator_id)
else Unknown
IC-->>Parser: [empty, FileType::Unknown]
Parser-->>CLPS: log error, fail
end
Parser-->>CLPS: success/failure
CLPS-->>User: exit
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
components/core/tests/test-clp_s-end_to_end.cpp (1)
108-116: Updated call site—OK. Add a zstd‑wrapped input case?Consider adding a variant that feeds a Zstd‑compressed JSONL into
compress_archiveto exercise nested detection.I can sketch a small helper that writes a zstd‑compressed copy of
test_no_floats_sorted.jsonlusing the in‑repo ZstdCompressor and then reuses this test.components/core/src/clp_s/JsonParser.cpp (3)
34-34: Remove duplicate include of JsonParser.hpp.It’s already included on Line 1.
Apply this diff:
-#include "JsonParser.hpp"
28-28: Unused include: zstd/Decompressor.hpp.Not referenced in this TU; InputConfig handles wrapping. Consider removing to tighten dependencies.
Apply this diff and ensure it still builds:
-#include "../clp/streaming_compression/zstd/Decompressor.hpp"
81-83: Optional: clear stored event to free memory between log events.clear() toggles a flag but retains the previous KeyValuePairLogEvent. Resetting the optional can reduce peak memory.
Apply this diff:
- void clear() { m_is_complete = false; } + void clear() { + m_deserialized_log_event.reset(); + m_is_complete = false; + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (16)
components/core/src/clp_s/CMakeLists.txt(1 hunks)components/core/src/clp_s/CommandLineArguments.cpp(0 hunks)components/core/src/clp_s/CommandLineArguments.hpp(0 hunks)components/core/src/clp_s/InputConfig.cpp(5 hunks)components/core/src/clp_s/InputConfig.hpp(3 hunks)components/core/src/clp_s/JsonParser.cpp(3 hunks)components/core/src/clp_s/JsonParser.hpp(2 hunks)components/core/src/clp_s/clp-s.cpp(1 hunks)components/core/src/clp_s/indexer/CMakeLists.txt(1 hunks)components/core/tests/clp_s_test_utils.cpp(2 hunks)components/core/tests/clp_s_test_utils.hpp(1 hunks)components/core/tests/test-clp_s-delta-encode-log-order.cpp(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp(1 hunks)components/core/tests/test-clp_s-range_index.cpp(1 hunks)components/core/tests/test-clp_s-search.cpp(1 hunks)docs/src/user-docs/core-clp-s.md(1 hunks)
💤 Files with no reviewable changes (2)
- components/core/src/clp_s/CommandLineArguments.cpp
- components/core/src/clp_s/CommandLineArguments.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/tests/test-clp_s-search.cppcomponents/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/InputConfig.hppcomponents/core/src/clp_s/clp-s.cppcomponents/core/tests/test-clp_s-delta-encode-log-order.cppcomponents/core/tests/test-clp_s-end_to_end.cppcomponents/core/tests/clp_s_test_utils.hppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/tests/test-clp_s-range_index.cppcomponents/core/src/clp_s/JsonParser.cpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-search.cppcomponents/core/tests/clp_s_test_utils.cppcomponents/core/tests/test-clp_s-delta-encode-log-order.cppcomponents/core/tests/test-clp_s-end_to_end.cppcomponents/core/tests/clp_s_test_utils.hppcomponents/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/tests/clp_s_test_utils.cpp
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Applied to files:
components/core/tests/clp_s_test_utils.cppcomponents/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/CMakeLists.txtcomponents/core/src/clp_s/indexer/CMakeLists.txtcomponents/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
Applied to files:
components/core/tests/test-clp_s-delta-encode-log-order.cppcomponents/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
🧬 Code graph analysis (4)
components/core/tests/clp_s_test_utils.cpp (1)
components/core/src/clp_s/clp-s.cpp (1)
parser(108-108)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(278-328)
components/core/src/clp_s/JsonParser.hpp (2)
components/core/src/clp_s/JsonParser.cpp (6)
nodiscard(49-54)nodiscard(56-61)nodiscard(63-69)nodiscard(71-74)nodiscard(76-79)nodiscard(83-83)components/core/src/clp_s/JsonFileIterator.hpp (3)
nodiscard(41-41)nodiscard(46-46)nodiscard(59-59)
components/core/src/clp_s/JsonParser.cpp (4)
components/core/src/clp_s/InputConfig.cpp (1)
try_deduce_reader_type(278-279)components/core/src/clp_s/InputConfig.hpp (1)
try_deduce_reader_type(138-139)components/core/src/clp_s/JsonParser.hpp (3)
reader(85-89)reader(99-103)path(225-225)components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
reader(59-61)reader(77-80)reader(141-142)reader(184-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
- GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (13)
components/core/tests/test-clp_s-search.cpp (1)
233-241: Updated call site—OK.Switching to auto‑deduction in tests is consistent with the new API.
components/core/src/clp_s/InputConfig.hpp (1)
25-28: Extending FileType: add/verify string mappers and persistence handling
- Switch coverage: FileType::Zstd and FileType::Unknown are already handled in components/core/src/clp_s/InputConfig.cpp and components/core/src/clp_s/JsonParser.cpp (search shows case labels and deduce/returns).
- Missing mapper: no to_string(FileType) or other string/enum mapper found in the components tree — add one if FileType is converted to/from text (logs, CLI, config).
- Persistence/on‑wire: confirm no on‑wire or persisted format relies on enum ordinals; if it does, either preserve ordinals or migrate to explicit serialization (strings/versioning).
components/core/src/clp_s/CMakeLists.txt (1)
10-13: BufferReader vs BufferedReader — distinct classes found; confirm indexer CMake & namingBoth exist and are used: components/core/src/clp/BufferReader.{hpp,cpp} and components/core/src/clp/BufferedReader.{hpp,cpp}; BufferReader also exists at components/core/src/glt/BufferReader.{hpp,cpp}. If both are intended keep them; otherwise rename to avoid cross-module confusion. Duplicate this verification in the indexer CMake to ensure the intended files are included.
components/core/tests/clp_s_test_utils.hpp (1)
22-28: Confirm all tests updated to the new compress_archive signature (5 args, no FileType).Verify every call in components/core/tests—especially:
components/core/tests/test-clp_s-search.cpp,
components/core/tests/test-clp_s-range_index.cpp,
components/core/tests/test-clp_s-delta-encode-log-order.cpp,
components/core/tests/test-clp_s-end_to_end.cpp
—and ensure each call uses (file_path, archive_directory, timestamp_key, single_file_archive, structurize_arrays) with no FileType argument.components/core/tests/test-clp_s-delta-encode-log-order.cpp (1)
69-75: API update in compress_archive call looks correctSwitched to the 5‑arg form; param order matches the updated signature.
components/core/src/clp_s/JsonParser.hpp (2)
68-68: Adding [[nodiscard]] to parse() is goodPrevents accidental ignore of parse result.
91-104: Ingestion helpers’ API looks reasonableSignatures/readability are fine; no blocking issues here.
components/core/tests/clp_s_test_utils.cpp (2)
19-20: Signature change propagated correctlyFifth argument is now structurize_arrays.
48-48: parse() assertion is appropriateREQUIRE(parser.parse()) aligns with the new API.
components/core/tests/test-clp_s-range_index.cpp (1)
223-230: API update in compress_archive call looks correctRemoved obsolete file-type arg; parameter order matches updated helper.
components/core/src/clp_s/InputConfig.cpp (1)
243-264: Detection order/logic is soundZstd → KV‑IR → JSON is reasonable; early returns are correct.
components/core/src/clp_s/JsonParser.cpp (2)
91-97: Constructor option wiring looks correct.New fields are plumbed cleanly into member initializers.
791-803: Right‑size logging for expected deserialization termination.Past guidance accepts certain IR deserialization “errors” as normal termination signals. Consider lowering to DEBUG or filtering on specific error codes to avoid alarming warnings in healthy flows.
Do these codes ever carry expected end‑of‑stream signals besides IrUnitType::EndOfStream? If yes, propose:
- SPDLOG_WARN( + SPDLOG_DEBUG( "Encountered error while deserializing kv-ir log event from stream \"{}\": "Please confirm the exact error codes emitted by Deserializer::deserialize_next_ir_unit on normal endings vs truncated streams.
| ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \ | ||
| https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick
Docs: call out auto type inference and nested Zstd explicitly.
Since --file-type is removed and detection supports nested Zstd, add a short note under “Compression” clarifying that inputs may be JSON or KV‑IR, optionally Zstd‑compressed, and that the tool auto‑detects and unwraps nested formats. Also add guidance on what happens when deduction fails and how users should proceed.
🤖 Prompt for AI Agents
In docs/src/user-docs/core-clp-s.md around lines 64 to 66, add a short
“Compression” note explaining that input archives may be JSON or KV‑IR and may
optionally be Zstd‑compressed (including nested Zstd layers), that the CLI
auto‑detects file format and will automatically unwrap nested Zstd compression,
and that if format deduction fails the tool will surface an error instructing
users to either provide a known format (e.g., via a --file-type flag or
equivalent) or to pre‑decompress the archive before running the command; keep
the note concise and place it immediately under the Compression subsection where
the example command appears.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/core/src/clp_s/JsonParser.cpp (1)
927-955: Return type mismatch with header for add_node_to_archive_and_translations.Header declares int32_t; definition returns int. This won’t compile.
-auto JsonParser::add_node_to_archive_and_translations( +auto JsonParser::add_node_to_archive_and_translations( uint32_t ir_node_id, clp::ffi::SchemaTree::Node const& ir_node_to_add, NodeType archive_node_type, int32_t parent_node_id, bool matches_timestamp -) -> int { +) -> int32_t {
♻️ Duplicate comments (9)
components/core/src/clp_s/InputConfig.cpp (4)
199-206: Harden Zstd magic check against signed-char pitfalls.Define the magic as unsigned bytes to avoid implementation-defined char literals and keep intent clear. You already included .
-auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool { - constexpr std::array<char, 4> cZstdMagicNumber = {'\x28', '\xB5', '\x2F', '\xFD'}; +auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool { + constexpr std::array<unsigned char, 4> cZstdMagicNumber = {0x28, 0xB5, 0x2F, 0xFD}; if (peek_size < cZstdMagicNumber.size()) { return false; } - - return 0 == std::memcmp(peek_buf, cZstdMagicNumber.data(), cZstdMagicNumber.size()); + return 0 + == std::memcmp( + peek_buf, + reinterpret_cast<char const*>(cZstdMagicNumber.data()), + cZstdMagicNumber.size() + ); }
227-240: Fix UB in std::isspace and recognise BOM/top‑level arrays.Cast to unsigned char for ctype, skip UTF‑8 BOM, and accept '[' as valid JSON start.
-auto could_be_json(char const* peek_buf, size_t peek_size) -> bool { - for (size_t i{0ULL}; i < peek_size; ++i) { - if (std::isspace(static_cast<int>(peek_buf[i]))) { - continue; - } - - if ('{' == peek_buf[i]) { - return true; - } else { - return false; - } - } - return false; -} +auto could_be_json(char const* peek_buf, size_t peek_size) -> bool { + size_t i{0ULL}; + if (peek_size >= 3 + && static_cast<unsigned char>(peek_buf[0]) == 0xEF + && static_cast<unsigned char>(peek_buf[1]) == 0xBB + && static_cast<unsigned char>(peek_buf[2]) == 0xBF) + { + i = 3; + } + for (; i < peek_size; ++i) { + unsigned char c = static_cast<unsigned char>(peek_buf[i]); + if (false == std::isspace(c)) { + return ('{' == static_cast<char>(c)) || ('[' == static_cast<char>(c)); + } + } + return false; +}
285-291: Don’t reject readers that can’t report position.Only bail if try_get_pos succeeds and position != 0. Network streams often don’t support positioning.
- size_t original_pos{}; - if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos) - || 0ULL != original_pos) - { - return {{}, FileType::Unknown}; - } + size_t original_pos{}; + auto const pos_rc = reader->try_get_pos(original_pos); + if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) { + return {{}, FileType::Unknown}; + }
292-301: Add debug for buffered refill failure to aid field diagnostics.A one‑line debug before returning Unknown helps users pinpoint I/O errors.
- auto const rc{buffered_reader->try_refill_buffer_if_empty()}; - if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { - return {{}, FileType::Unknown}; - } + auto const rc{buffered_reader->try_refill_buffer_if_empty()}; + if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { + SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc)); + return {{}, FileType::Unknown}; + }components/core/src/clp_s/InputConfig.hpp (1)
128-139: Tighten API docs for try_deduce_reader_type (preconditions/lifecycle).Clarify that the root reader should be valid and at offset 0 if seekable; describe returned chain order and caller responsibility to close in reverse.
-/** - * Tries to deduce the underlying file-type of the file opened by `reader`, and return a +/** + * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of * compression. - * @param reader + * Preconditions: + * - `reader` is valid; if it supports positioning, it should be at offset 0. + * Postconditions: + * - The returned vector contains the reader chain (outermost first). The last element is open + * and positioned at the start of the deduced content. Callers should close/teardown in reverse. + * @param reader * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is * open for reading content of the type described by the element in the pair. When the content type * cannot be deduced, we return an empty vector and `FileType::Unknown`. */components/core/src/clp_s/JsonParser.cpp (4)
499-516: Defensive check before nested_readers.back().Avoid UB if the vector is empty on an internal error.
- auto const [nested_readers, file_type] = try_deduce_reader_type(reader); - bool ingestion_successful{}; + auto const [nested_readers, file_type] = try_deduce_reader_type(reader); + if (nested_readers.empty()) { + SPDLOG_ERROR("Internal error: no reader layers returned for input {}", path.path); + std::ignore = m_archive_writer->close(); + return false; + } + bool ingestion_successful{};
549-596: Don’t skip filename/split/creator metadata when record order is disabled (JSON path).Gate only the per‑record log index field; always add per‑archive metadata.
- auto initialize_fields_for_archive = [&]() -> bool { - if (false == m_record_log_order) { - return true; - } - log_event_idx_node_id - = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + auto initialize_fields_for_archive = [&]() -> bool { + if (m_record_log_order) { + log_event_idx_node_id + = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cFilename}, path.path ); ErrorCodeSuccess != rc) {
611-617: Clarify error message: this rejects scalar values, not arrays.Change “non‑json‑object” → “scalar JSON value” to reduce confusion.
- SPDLOG_ERROR( - "Encountered non-json-object while trying to parse {} after parsing {} " + SPDLOG_ERROR( + "Encountered scalar JSON value while trying to parse {} after parsing {} " "bytes",
715-783: Don’t skip filename/split/creator metadata when record order is disabled (KV‑IR path).Same gating issue as JSON path.
- auto initialize_fields_for_archive = [&]() -> bool { - if (false == m_record_log_order) { - return true; - } - log_event_idx_node_id - = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + auto initialize_fields_for_archive = [&]() -> bool { + if (m_record_log_order) { + log_event_idx_node_id + = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cFilename}, path.path ); ErrorCodeSuccess != rc) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/src/clp_s/InputConfig.cpp(5 hunks)components/core/src/clp_s/InputConfig.hpp(3 hunks)components/core/src/clp_s/JsonParser.cpp(3 hunks)components/core/src/clp_s/JsonParser.hpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/InputConfig.hpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.
Applied to files:
components/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/InputConfig.hpp
🧬 Code graph analysis (3)
components/core/src/clp_s/JsonParser.cpp (4)
components/core/src/clp_s/InputConfig.cpp (1)
try_deduce_reader_type(277-278)components/core/src/clp_s/InputConfig.hpp (1)
try_deduce_reader_type(137-138)components/core/src/clp_s/JsonParser.hpp (3)
reader(84-88)reader(98-102)path(224-224)components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
reader(59-61)reader(77-80)reader(141-142)reader(184-188)
components/core/src/clp_s/JsonParser.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(277-327)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(277-327)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: package-image
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/src/clp_s/JsonParser.hpp (1)
64-69: Docstring fix and return type change look good.“successfully” fix and
auto parse() -> boolare consistent with the implementation.
| private: | ||
| /** | ||
| * Parses JSON input and ingests it into the current archive, splitting the archive if it grows | ||
| * beyond the target encoded size. | ||
| * @param reader | ||
| * @param path | ||
| * @param archive_creator_id | ||
| * @return Whether ingestion was successful or not. | ||
| */ | ||
| [[nodiscard]] auto ingest_json( | ||
| std::shared_ptr<clp::ReaderInterface> reader, | ||
| Path const& path, | ||
| std::string const& archive_creator_id | ||
| ) -> bool; | ||
|
|
||
| /** | ||
| * Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows | ||
| * beyond the target encoded size. | ||
| * @param reader | ||
| * @param path | ||
| * @param archive_creator_id | ||
| * @return Whether ingestion was successful or not. | ||
| */ | ||
| [[nodiscard]] auto ingest_kvir( | ||
| std::shared_ptr<clp::ReaderInterface> reader, | ||
| Path const& path, | ||
| std::string const& archive_creator_id | ||
| ) -> bool; | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
Clarify ingest_json/ingest_kvir contracts (preconditions/lifetime).
Document that the reader must be the last element returned by try_deduce_reader_type (already opened/positioned at start of the deduced content) and that callers own closing/teardown via JsonParser::store(). This avoids ambiguity about ownership and expected position.
/**
* Parses JSON input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
- * @param reader
+ * Preconditions:
+ * - `reader` is the final, ready-to-read layer returned by try_deduce_reader_type()
+ * (positioned at the start of JSON content).
+ * Ownership:
+ * - JsonParser manages archive lifecycle; callers invoke store() to flush/close.
+ * @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| private: | |
| /** | |
| * Parses JSON input and ingests it into the current archive, splitting the archive if it grows | |
| * beyond the target encoded size. | |
| * @param reader | |
| * @param path | |
| * @param archive_creator_id | |
| * @return Whether ingestion was successful or not. | |
| */ | |
| [[nodiscard]] auto ingest_json( | |
| std::shared_ptr<clp::ReaderInterface> reader, | |
| Path const& path, | |
| std::string const& archive_creator_id | |
| ) -> bool; | |
| /** | |
| * Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows | |
| * beyond the target encoded size. | |
| * @param reader | |
| * @param path | |
| * @param archive_creator_id | |
| * @return Whether ingestion was successful or not. | |
| */ | |
| [[nodiscard]] auto ingest_kvir( | |
| std::shared_ptr<clp::ReaderInterface> reader, | |
| Path const& path, | |
| std::string const& archive_creator_id | |
| ) -> bool; | |
| private: | |
| /** | |
| * Parses JSON input and ingests it into the current archive, splitting the archive if it grows | |
| * beyond the target encoded size. | |
| * Preconditions: | |
| * - `reader` is the final, ready-to-read layer returned by try_deduce_reader_type() | |
| * (positioned at the start of JSON content). | |
| * Ownership: | |
| * - JsonParser manages archive lifecycle; callers invoke store() to flush/close. | |
| * @param reader | |
| * @param path | |
| * @param archive_creator_id | |
| * @return Whether ingestion was successful or not. | |
| */ | |
| [[nodiscard]] auto ingest_json( | |
| std::shared_ptr<clp::ReaderInterface> reader, | |
| Path const& path, | |
| std::string const& archive_creator_id | |
| ) -> bool; | |
| /** | |
| * Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows | |
| * beyond the target encoded size. | |
| * @param reader | |
| * @param path | |
| * @param archive_creator_id | |
| * @return Whether ingestion was successful or not. | |
| */ | |
| [[nodiscard]] auto ingest_kvir( | |
| std::shared_ptr<clp::ReaderInterface> reader, | |
| Path const& path, | |
| std::string const& archive_creator_id | |
| ) -> bool; |
🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.hpp around lines 75 to 103, the
ingest_json and ingest_kvir method contracts are ambiguous about the reader
lifetime and position; update their comments to state that the reader parameter
must be the last element returned by try_deduce_reader_type, must already be
opened and positioned at the start of the deduced content, and that callers
retain ownership and are responsible for closing/teardown (e.g., via
JsonParser::store()) after these methods return; make the wording concise and
identical for both methods so callers clearly understand preconditions and
ownership.
wraymo
left a comment
There was a problem hiding this comment.
Great PR! Just had a few small comments.
| } | ||
|
|
||
| auto could_be_json(char const* peek_buf, size_t peek_size) -> bool { | ||
| for (size_t i{0ULL}; i < peek_size; ++i) { |
There was a problem hiding this comment.
What if we have more spaces than peek_size and do you think the UTF‑8 BOM comment from coderabbit make sense?
There was a problem hiding this comment.
JSON doesn't allow BOM, and the current code should already reject BOM, so that part is fine. I'll double check about the negative characters though.
There was a problem hiding this comment.
If we have more spaces than peek size I'm personally fine treating the input as not JSON -- for reference the peek size should 64KB unless the input is smaller, so the input has to be really malformed for this to not work.
| continue; | ||
| } | ||
|
|
||
| if ('{' == peek_buf[i]) { |
There was a problem hiding this comment.
How do we handle arrays or integers?
There was a problem hiding this comment.
Currently we end up treating them as "Unknown" content type, and the user will see an error log like "could not deduce file-type for file xyz" before we exit. Since we don't actually support parsing JSON types besides object at the root right now I think this is fine -- we can change this check later if we start supporting other types.
|
|
||
| /** | ||
| * Checks if an input contains Zstd data, based on the first few bytes of data from the input. | ||
| * @return Whether the input could be Zstd. |
There was a problem hiding this comment.
Should we add parameter descriptions for these three methods?
| ingestion_successful = ingest_json(nested_readers.back(), path, archive_creator_id); | ||
| break; | ||
| case FileType::KeyValueIr: | ||
| ingestion_successful = ingest_kvir(nested_readers.back(), path, archive_creator_id); |
There was a problem hiding this comment.
Since we only need the last one, is it required to return a vector?
There was a problem hiding this comment.
Unfortunately yes -- Since zstd::Decompressor only takes a reference to a reader instead of a shared_ptr if we return a decompressor as the last reader alone (as we would do for kv-ir inputs) the previous readers would get deallocated. Returning a vector helps avoid this case. It can also be useful for example if we need to close multiple readers in the hierarchy before we exit (which I should be doing for the decompressors, but forgot about, so I'll add that to the PR).
| * @return whether the IR Stream was parsed successfully | ||
| */ | ||
| [[nodiscard]] auto parse_from_ir() -> bool; | ||
| [[nodiscard]] auto parse() -> bool; |
There was a problem hiding this comment.
Not sure if we should also update this method name.
There was a problem hiding this comment.
Sure, I can rename parse to ingest
| [[nodiscard]] auto try_deduce_reader_type(std::shared_ptr<clp::ReaderInterface> reader) | ||
| -> std::pair<std::vector<std::shared_ptr<clp::ReaderInterface>>, FileType> { | ||
| constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB | ||
| constexpr size_t cMaxNestedFormatDepth = 5; |
There was a problem hiding this comment.
It's a somewhat arbitrary choice -- I'm just trying to avoid cases where someone maliciously or accidentally gives us a file that we interpret as being wrapped in many layers of compression and wasting compute/memory. I figure 5 is more than enough for real cases, especially for now.
Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (5)
components/core/src/clp_s/InputConfig.hpp (1)
128-139: Header docs: clarify preconditions, ownership, and return semanticsDocs still omit critical preconditions and lifetime/ownership details; align with the implementation and previous feedback.
Apply:
/** - * Tries to deduce the underlying file-type of the file opened by `reader`, and returns a + * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of * compression. - * @param reader - * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is - * open for reading content of the type described by the element in the pair. When the content type - * cannot be deduced, we return an empty vector and `FileType::Unknown`. + * Preconditions: + * - `reader` must be valid and (when position is known) positioned at offset 0. + * Postconditions: + * - The returned vector contains the reader chain (outermost first). The last element is ready + * to read the deduced content type. Callers tear down/close in reverse order. + * @param reader Root reader to inspect and unwrap + * @return A vector of all created `clp::ReaderInterface`s (outermost first) and the deduced + * FileType. When the content type cannot be deduced, returns an empty vector and + * `FileType::Unknown`. */components/core/src/clp_s/InputConfig.cpp (4)
199-206: Use unsigned bytes for Zstd magic to avoid signed‑char pitfallsDefine the magic as unsigned and compare via memcmp against unsigned data.
-auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool { - constexpr std::array<char, 4> cZstdMagicNumber = {'\x28', '\xB5', '\x2F', '\xFD'}; +auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool { + constexpr std::array<unsigned char, 4> cZstdMagicNumber = {0x28, 0xB5, 0x2F, 0xFD}; if (peek_size < cZstdMagicNumber.size()) { return false; } - - return 0 == std::memcmp(peek_buf, cZstdMagicNumber.data(), cZstdMagicNumber.size()); + return 0 + == std::memcmp( + peek_buf, + reinterpret_cast<char const*>(cZstdMagicNumber.data()), + cZstdMagicNumber.size() + ); }
281-286: Don’t reject readers that don’t support positioningOnly bail when try_get_pos succeeds AND reports a non‑zero position; otherwise proceed.
- size_t original_pos{}; - if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos) - || 0ULL != original_pos) - { - return {{}, FileType::Unknown}; - } + size_t original_pos{}; + auto const pos_rc = reader->try_get_pos(original_pos); + if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) { + return {{}, FileType::Unknown}; + }#!/bin/bash # Verify behaviour of try_get_pos across ReaderInterface implementations. rg -nP --type=cpp -C3 '\bclass\s+\w+.*?ReaderInterface|\btry_get_pos\s*\('
293-296: Add debug log on buffer refill failureA small log greatly helps explaining Unknown results during inference.
- if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { - return {{}, FileType::Unknown}; - } + if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { + SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc)); + return {{}, FileType::Unknown}; + }
227-236: Fix UB in std::isspace and recognise top-level JSON arrays
- Cast peek_buf[i] to unsigned char before calling std::isspace to avoid UB for signed char.
- Treat '[' as a valid top-level JSON start or document/gate it if arrays aren't supported.
-auto could_be_json(char const* peek_buf, size_t peek_size) -> bool { - for (size_t i{0ULL}; i < peek_size; ++i) { - if (std::isspace(static_cast<int>(peek_buf[i]))) { - continue; - } - - return '{' == peek_buf[i]; - } - return false; -} +auto could_be_json(char const* peek_buf, size_t peek_size) -> bool { + for (size_t i{0ULL}; i < peek_size; ++i) { + unsigned char c = static_cast<unsigned char>(peek_buf[i]); + if (false == std::isspace(c)) { + char const ch = static_cast<char>(c); + return ('{' == ch) || ('[' == ch); + } + } + return false; +}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp_s/InputConfig.cpp(5 hunks)components/core/src/clp_s/InputConfig.hpp(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/InputConfig.hpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.
Applied to files:
components/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/InputConfig.hpp
🧬 Code graph analysis (1)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(273-323)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (2)
components/core/src/clp_s/InputConfig.cpp (2)
3-7: Includes: correct and scopedAdded headers are appropriate for memcmp, detection helpers, and zstd decompression.
Also applies to: 16-17, 22-22
104-129: TU‑local helpers: good visibility controlForward declarations inside an anonymous namespace keep helpers internal. LGTM.
| constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB | ||
| constexpr size_t cMaxNestedFormatDepth = 5; |
There was a problem hiding this comment.
🧹 Nitpick
Document nesting cap rationale
Add a short comment to justify 5 and signal DOS protection intent.
- constexpr size_t cMaxNestedFormatDepth = 5;
+ // Defensive cap to prevent excessive/untrusted nesting (compression bombs, etc.).
+ // 5 levels should cover realistic cases (e.g., zstd->json, zstd->kvir->zstd->json).
+ constexpr size_t cMaxNestedFormatDepth = 5;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB | |
| constexpr size_t cMaxNestedFormatDepth = 5; | |
| constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB | |
| // Defensive cap to prevent excessive/untrusted nesting (compression bombs, etc.). | |
| // 5 levels should cover realistic cases (e.g., zstd->json, zstd->kvir->zstd->json). | |
| constexpr size_t cMaxNestedFormatDepth = 5; |
🤖 Prompt for AI Agents
In components/core/src/clp_s/InputConfig.cpp around lines 275 to 276, the
constant cMaxNestedFormatDepth is set to 5 without explanation; add a short
inline comment explaining the rationale (limits nesting depth to prevent
excessive recursion/stack use and mitigate DoS from deeply nested input), e.g.,
mention safety/DoS protection and trade-off between flexibility and resource
limits so future maintainers understand why 5 was chosen.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (12)
components/core/src/clp_s/indexer/CMakeLists.txt (1)
7-10: BufferReader vs BufferedReader: keep both; looks correct.Both classes exist and serve different purposes; including both here is intentional.
components/core/src/clp_s/CMakeLists.txt (1)
10-13: BufferReader and BufferedReader additions are appropriate.Required for peek/rewrap flow; additions align with the new deduction path.
components/core/src/clp_s/clp-s.cpp (1)
109-111: Enhance parse failure log with actionable context.Log how many input paths were attempted so users can triage quickly.
- if (false == parser.ingest()) { - SPDLOG_ERROR("Encountered error while parsing input."); + if (false == parser.ingest()) { + SPDLOG_ERROR( + "Encountered error while parsing {} input path(s).", + option.input_paths.size() + ); return false; }components/core/src/clp_s/JsonParser.hpp (1)
75-103: Clarify ingest_json/ingest_kvir reader contract (preconditions/ownership).Document that reader must be the last layer from try_deduce_reader_type(), positioned at start-of-content; and lifecycle/close is handled by store()/close_nested_readers().
- * Parses JSON input and ingests it into the current archive, splitting the archive if it grows + * Parses JSON input and ingests it into the current archive, splitting the archive if it grows * beyond the target encoded size. - * @param reader + * Preconditions: + * - `reader` is the final layer returned by try_deduce_reader_type(), positioned at the start + * of JSON content. + * Ownership/Lifetime: + * - JsonParser manages archive lifecycle; callers finalize via store(). Reader teardown is + * performed by close_nested_readers(). + * @param readerApply the same wording to ingest_kvir().
components/core/src/clp_s/InputConfig.hpp (1)
128-139: Fix docs and state pre/postconditions for try_deduce_reader_type.Clarify reader position requirement and close order; align @param with the actual argument.
-/** - * Tries to deduce the underlying file-type of the file opened by `reader`, and returns a +/** + * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of * compression. - * @param reader + * Preconditions: + * - `reader` should be valid and preferably positioned at offset 0. + * Postconditions: + * - The returned vector contains the reader chain (outermost first); the last element is ready + * to read the deduced content. Callers close layers in reverse order via close_nested_readers(). + * @param reader Root reader to inspect and unwrap * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is * open for reading content of the type described by the element in the pair. When the content type * cannot be deduced, we return an empty vector and `FileType::Unknown`. */components/core/src/clp_s/InputConfig.cpp (3)
282-284: Document nesting cap rationale.Add a brief comment to justify the 5‑level limit (DoS guard).
- constexpr size_t cMaxNestedFormatDepth = 5; + // Defensive cap to avoid excessive/untrusted nesting (e.g., compression bombs). + // Five levels should cover realistic cases. + constexpr size_t cMaxNestedFormatDepth = 5;
288-293: Don’t hard‑fail when try_get_pos is unsupported.Some readers (e.g., network) may not support positioning; only bail if the call succeeds and position != 0.
- size_t original_pos{}; - if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos) - || 0ULL != original_pos) - { - return {{}, FileType::Unknown}; - } + size_t original_pos{}; + auto const pos_rc = reader->try_get_pos(original_pos); + if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) { + return {{}, FileType::Unknown}; + }
300-303: Optional: log buffer refill failure.A DEBUG log helps diagnose Unknown classifications due to I/O issues.
- if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { - return {{}, FileType::Unknown}; - } + if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) { + SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc)); + return {{}, FileType::Unknown}; + }components/core/src/clp_s/JsonParser.cpp (4)
499-516: Guard against empty nested_readers before using back().Avoid UB if deduction returns a known type with an empty vector.
- auto const [nested_readers, file_type] = try_deduce_reader_type(reader); - bool ingestion_successful{}; + auto const [nested_readers, file_type] = try_deduce_reader_type(reader); + if (true == nested_readers.empty()) { + SPDLOG_ERROR("Internal error: no reader layers returned for input {}", path.path); + std::ignore = m_archive_writer->close(); + return false; + } + bool ingestion_successful{};
549-595: Do not gate per‑archive metadata on record_log_order (JSON path).Filename, split number, and archive_creator_id should always be recorded; only log_event_idx should be gated.
- auto initialize_fields_for_archive = [&]() -> bool { - if (false == m_record_log_order) { - return true; - } - log_event_idx_node_id - = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + auto initialize_fields_for_archive = [&]() -> bool { + if (m_record_log_order) { + log_event_idx_node_id + = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cFilename}, path.path ); ErrorCodeSuccess != rc) { SPDLOG_ERROR( "Failed to add metadata field \"{}\" ({})", constants::range_index::cFilename, static_cast<int64_t>(rc) ); return false; } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cFileSplitNumber}, file_split_number ); ErrorCodeSuccess != rc) { SPDLOG_ERROR( "Failed to add metadata field \"{}\" ({})", constants::range_index::cFileSplitNumber, static_cast<int64_t>(rc) ); return false; } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cArchiveCreatorId}, archive_creator_id ); ErrorCodeSuccess != rc) { SPDLOG_ERROR( "Failed to add metadata field \"{}\" ({})", constants::range_index::cArchiveCreatorId, static_cast<int64_t>(rc) ); return false; } return true; }; @@ - if (m_record_log_order) { - if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) { - SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc)); - return false; - } - } + if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) { + SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc)); + return false; + }Also applies to: 687-693
611-618: Clarify error text (JSON scalar).Users may see arrays; this message specifically rejects scalars.
- SPDLOG_ERROR( - "Encountered non-json-object while trying to parse {} after parsing {} " + SPDLOG_ERROR( + "Encountered scalar JSON value while trying to parse {} after parsing {} " "bytes",
716-781: Do not gate per‑archive metadata on record_log_order (KV‑IR path).Mirror the JSON fix: always write filename/split/creator, gate only log_event_idx.
- auto initialize_fields_for_archive = [&]() -> bool { - if (false == m_record_log_order) { - return true; - } - log_event_idx_node_id - = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + auto initialize_fields_for_archive = [&]() -> bool { + if (m_record_log_order) { + log_event_idx_node_id + = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger); + } if (auto const rc = m_archive_writer->add_field_to_current_range( std::string{constants::range_index::cFilename}, path.path ); ErrorCodeSuccess != rc) { @@ - if (m_record_log_order) { - if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) { - SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc)); - return false; - } - } + if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) { + SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc)); + return false; + }Also applies to: 860-865
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (8)
components/core/src/clp_s/CMakeLists.txt(2 hunks)components/core/src/clp_s/InputConfig.cpp(5 hunks)components/core/src/clp_s/InputConfig.hpp(3 hunks)components/core/src/clp_s/JsonParser.cpp(4 hunks)components/core/src/clp_s/JsonParser.hpp(2 hunks)components/core/src/clp_s/clp-s.cpp(1 hunks)components/core/src/clp_s/indexer/CMakeLists.txt(2 hunks)components/core/tests/clp_s_test_utils.cpp(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>rather than!<expression>.
Files:
components/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/InputConfig.hppcomponents/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/JsonParser.cpp
🧠 Learnings (20)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.
Applied to files:
components/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/clp-s.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.
Applied to files:
components/core/src/clp_s/InputConfig.hppcomponents/core/src/clp_s/indexer/CMakeLists.txtcomponents/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Applied to files:
components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/CMakeLists.txt
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/tests/clp_s_test_utils.cppcomponents/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/indexer/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.
Applied to files:
components/core/tests/clp_s_test_utils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.
Applied to files:
components/core/src/clp_s/JsonParser.hppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.
Applied to files:
components/core/src/clp_s/indexer/CMakeLists.txtcomponents/core/src/clp_s/CMakeLists.txtcomponents/core/src/clp_s/InputConfig.cppcomponents/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.
Applied to files:
components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.
Applied to files:
components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (4)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(280-330)
components/core/src/clp_s/JsonParser.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
nodiscard(280-330)
components/core/src/clp_s/InputConfig.cpp (3)
components/core/src/clp_s/JsonParser.hpp (2)
reader(84-88)reader(98-102)components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
reader(59-61)reader(77-80)reader(141-142)reader(184-188)components/core/src/clp_s/ArchiveReaderAdaptor.hpp (6)
decompressor(100-100)decompressor(109-109)decompressor(118-118)decompressor(126-126)decompressor(134-134)decompressor(166-166)
components/core/src/clp_s/JsonParser.cpp (3)
components/core/src/clp_s/InputConfig.cpp (3)
try_deduce_reader_type(280-281)close_nested_readers(332-341)close_nested_readers(332-332)components/core/src/clp_s/JsonParser.hpp (3)
reader(84-88)reader(98-102)path(224-224)components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
reader(59-61)reader(77-80)reader(141-142)reader(184-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: package-image
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (13)
components/core/src/clp_s/CMakeLists.txt (1)
95-96: Ambiguous header name: Decompressor.hpp — verify and use qualified includescomponents/core/src/clp_s/CMakeLists.txt adds ../clp/streaming_compression/Decompressor.hpp (lines 95–96) while the target also provides clp_s/Decompressor.hpp (line 186). Unqualified #include "Decompressor.hpp" may be ambiguous.
- Action: Search for unqualified includes and either replace with fully qualified paths (e.g. #include "clp/streaming_compression/Decompressor.hpp" or #include "clp_s/Decompressor.hpp") or constrain the target's include directories to avoid overlap.
- Suggested check: run rg -nP '^\s*#\sinclude\s[<"]Decompressor.hpp[">]' components -g '!/build/' and fix any matches.
components/core/src/clp_s/indexer/CMakeLists.txt (1)
63-66: No ODR/name collisions — different namespaces/symbolsVerified: core zstd Decompressor is clp::streaming_compression::zstd::Decompressor (components/core/src/clp/streaming_compression/zstd/Decompressor.{hpp,cpp}) and clp-s defines clp_s::ZstdDecompressor (components/core/src/clp_s/ZstdDecompressor.{hpp,cpp}); they are different fully‑qualified types so no link‑time ODR collision. clp_s also constructs clp::streaming_compression::zstd::Decompressor in InputConfig.cpp.
components/core/tests/clp_s_test_utils.cpp (1)
14-20: Signature change LGTM.Parameter removal of file_type and addition of structurize_arrays are reflected correctly.
components/core/src/clp_s/JsonParser.hpp (1)
64-68: Public entrypoint rename LGTM.ingest() API and doc read clearly.
components/core/src/clp_s/InputConfig.hpp (2)
8-8: Include LGTM.Needed for std::pair in the new API.
25-28: FileType extensions LGTM.Json/KeyValueIr/Zstd/Unknown enum layout is clear.
components/core/src/clp_s/InputConfig.cpp (5)
105-135: Internal helpers LGTM.TU-local prototypes keep the surface area minimal.
206-213: Zstd magic check is fine.memcmp-based check is simple and correct.
234-243: JSON detection behaviour is deliberate.Only ‘{’ at the root is accepted; matches current ingestion support.
245-266: Type peek helper LGTM.Clear precedence order: Zstd → KV‑IR → JSON → Unknown.
332-341: close_nested_readers LGTM.Reverse‑order close of decompressors avoids leaks.
components/core/src/clp_s/JsonParser.cpp (2)
90-97: Constructor wiring LGTM.New options (record_log_order, input_paths, network_auth) are propagated correctly.
828-839: get_pos is supported on NetworkReader — no change requiredclp::ReaderInterface defines get_pos() (calls try_get_pos) and NetworkReader implements try_get_pos (components/core/src/clp/ReaderInterface.cpp:117; components/core/src/clp/NetworkReader.cpp:157).
| /** | ||
| * Closes all readers in a vector of nested readers, starting from the last reader. | ||
| * @param readers | ||
| */ | ||
| void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers); | ||
| } // namespace clp_s |
There was a problem hiding this comment.
🧹 Nitpick
Note behaviour of close_nested_readers in docs.
Mention that only Decompressor layers require/receive explicit close; other readers rely on RAII.
-/**
- * Closes all readers in a vector of nested readers, starting from the last reader.
+/**
+ * Closes all readers in a vector of nested readers, starting from the last reader.
+ * Currently only Decompressor layers perform an explicit close; other readers rely on RAII.
* @param readers
*/📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /** | |
| * Closes all readers in a vector of nested readers, starting from the last reader. | |
| * @param readers | |
| */ | |
| void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers); | |
| } // namespace clp_s | |
| /** | |
| * Closes all readers in a vector of nested readers, starting from the last reader. | |
| * Currently only Decompressor layers perform an explicit close; other readers rely on RAII. | |
| * @param readers | |
| */ | |
| void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers); | |
| } // namespace clp_s |
🤖 Prompt for AI Agents
In components/core/src/clp_s/InputConfig.hpp around lines 140 to 145, update the
doc comment for close_nested_readers to clearly state its behavior: explain that
it iterates readers from last to first and only calls explicit close on
Decompressor-type layers (other reader types rely on RAII/destructors and will
not be explicitly closed), and mention that the function accepts a const vector
of shared_ptr<clp::ReaderInterface> and performs no ownership transfer; keep the
brief @param description and clarify that callers should only expect
Decompressor resources to be finalized by this function.
| REQUIRE(parser.ingest()); | ||
| REQUIRE_NOTHROW(archive_stats = parser.store()); |
There was a problem hiding this comment.
🧹 Nitpick
Good switch to ingest(); consider asserting non-empty stats.
Current checks are fine; optionally assert archive_stats.size() > 0 to catch no-op ingests.
- REQUIRE_NOTHROW(archive_stats = parser.store());
+ REQUIRE_NOTHROW(archive_stats = parser.store());
+ REQUIRE_FALSE(archive_stats.empty());📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| REQUIRE(parser.ingest()); | |
| REQUIRE_NOTHROW(archive_stats = parser.store()); | |
| REQUIRE(parser.ingest()); | |
| REQUIRE_NOTHROW(archive_stats = parser.store()); | |
| REQUIRE_FALSE(archive_stats.empty()); |
🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.cpp around lines 48-49, after calling
REQUIRE_NOTHROW(archive_stats = parser.store()); add an assertion that the
returned archive_stats is non-empty to catch no-op ingests; update the test to
assert archive_stats.size() > 0 (or use REQUIRE_FALSE(archive_stats.empty()))
immediately after the store call so the test fails when ingestion produced no
stats.
| = std::dynamic_pointer_cast<clp::streaming_compression::Decompressor>(*rit); | ||
| nullptr != decompressor) | ||
| { | ||
| decompressor->close(); |
There was a problem hiding this comment.
Do we need to explicitly close other readers?
There was a problem hiding this comment.
All of the ones we currently support don't even have a close method. Instead they do cleanup in the destructor conditional on being open.
kirkrodrigues
left a comment
There was a problem hiding this comment.
For the PR title, how about:
feat(clp-s): Automatically infer the file types of inputs to compression.
kirkrodrigues
left a comment
There was a problem hiding this comment.
Deferring to @wraymo's review.
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
docs/src/user-docs/core-clp-s.md(2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (1)
docs/src/user-docs/core-clp-s.md (1)
16-18: Docs: explicitly call out auto‑detection and nested Zstd; tighten NDJSON wording.Great bullets. Please:
- Explicitly state that clp‑s auto‑detects JSON vs KV‑IR and automatically unwraps nested Zstd layers.
- Note the failure mode and what users should do.
- Optional wording polish: “newline‑delimited JSON (NDJSON)”.
Apply below diff to add a concise note and tweak wording:
- * a new-line-delimited JSON (ndjson) log file (may be Zstd-compressed); + * a newline-delimited JSON (NDJSON) log file (may be Zstd-compressed); * a KV-IR file (may be Zstd-compressed); + +:::{note} +clp‑s automatically detects the input type (NDJSON or KV‑IR). If the input is Zstd‑compressed, +including nested Zstd layers, clp‑s will transparently decompress and continue detection. +If deduction fails, clp‑s will exit with an error; pre‑decompress the input (e.g., `unzstd`) or +ensure it is valid NDJSON or KV‑IR before retrying. +:::
| ```shell | ||
| AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \ | ||
| ./clp-s c --single-file-archive --file-type kv-ir --auth s3 /mnt/data/archives \ | ||
| ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \ | ||
| https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp | ||
| ``` |
There was a problem hiding this comment.
🧹 Nitpick
🧩 Analysis chain
Example update looks good; add JSONL.zst example and verify S3 filename extension.
Removing --file-type here is correct. Consider adding a JSONL.zst S3 example to demonstrate auto‑detection of compression. Also, please confirm that using .clp as an input filename for a KV‑IR object is intentional and won’t be confused with CLP single‑file archives.
Suggested addition right after this example:
AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
+
+**Compress an NDJSON log on S3 that is Zstd‑compressed (auto‑detected):**
+
+```shell
+AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
+ ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
+ https://my-bucket.s3.us-east-2.amazonaws.com/app-logs.jsonl.zst
+```To verify the filename/extension guidance across docs, you can grep for inputs that end in .clp outside decompression/search sections and confirm intent.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo
echo "1) Search for '.clp' occurrences (excluding common build/vendor dirs):"
rg -n --hidden -S '\.clp\b' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "2) Search for 'kv-ir' mentions:"
rg -n --hidden -S 'kv-?ir' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "3) Show docs/src/user-docs/core-clp-s.md (first 200 lines) if present:"
if [ -f docs/src/user-docs/core-clp-s.md ]; then
sed -n '1,200p' docs/src/user-docs/core-clp-s.md
else
echo "docs/src/user-docs/core-clp-s.md not found"
fi
echo
echo "4) Search for 'single-file-archive' mentions:"
rg -n --hidden -S 'single-file-archive' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "5) Search for 'jsonl.zst' or 'jsonl.*zst' mentions:"
rg -n --hidden -S 'jsonl(\.zst|.*zst)' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || trueLength of output: 6248
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo
echo "1) Search for '.clp' occurrences (including ignored files):"
rg -uu -n --hidden '\.clp\b' || true
echo
echo "2) Search for 'kv-ir' occurrences:"
rg -uu -n --hidden 'kv-?ir' || true
echo
echo "3) Search for 'single-file-archive' mentions:"
rg -uu -n --hidden 'single-file-archive' || true
echo
echo "4) Search for 'jsonl.zst' mentions:"
rg -uu -n --hidden 'jsonl\.zst' || true
echo
echo "5) Show docs/src/user-docs/core-clp-s.md (first 200 lines) with line numbers:"
if [ -f docs/src/user-docs/core-clp-s.md ]; then
nl -ba -w3 -s': ' docs/src/user-docs/core-clp-s.md | sed -n '1,200p'
else
echo "docs/src/user-docs/core-clp-s.md not found"
fiLength of output: 4626
Add JSONL.zst example; clarify KV‑IR filename extension
Add the JSONL.zst S3 example and clarify the KV‑IR filename in docs. The codebase defines the canonical IR file extension as ".clp.zst" (components/core/src/clp/ir/constants.hpp:7) but also treats a ".clp" stream suffix specially (components/core/src/clp/clp/FileCompressor.cpp — removes ".clp" suffix around line ~405). Update docs/src/user-docs/core-clp-s.md (example at ~lines 63–64) to use ".clp.zst" or add an explicit note that ".clp" is accepted to avoid confusion.
Suggested addition (unchanged):
AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
+
+**Compress an NDJSON log on S3 that is Zstd‑compressed (auto‑detected):**
+
+```shell
+AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
+ ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
+ https://my-bucket.s3.us-east-2.amazonaws.com/app-logs.jsonl.zst
+```🤖 Prompt for AI Agents
In docs/src/user-docs/core-clp-s.md around lines 61 to 65, the S3 example and
text are ambiguous about canonical KV‑IR filename extensions; update the example
to include a JSONL.zst S3 example (e.g. app-logs.jsonl.zst) and clarify that the
canonical IR extension is ".clp.zst" (per
components/core/src/clp/ir/constants.hpp) while noting that a ".clp" stream
suffix is also accepted/treated specially (per FileCompressor.cpp) so readers
aren’t confused; add the additional example line and a brief parenthetical note
about the accepted ".clp" suffix.
…ion. (y-scope#1312) Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
Description
This PR changes the compression flow of clp-s to automatically infer the file-types of files passed to compression. We do this by iteratively:
This allows us to get rid of the
--file-typeargument, as well as add support for new input types such as JSON compressed inside a Zstd stream. This approach should also make it easy to add support for ingesting input compressed by other compression algorithms such as LZMA in the future.The bulk of these changes are split between:
try_deduce_reader_typefunction inInputConfig(and related utilities)JsonParsersuch that ingestion is always invoked by theparsed-- this comprises most of the diff, but really mostly involves moving around existing codeChecklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Refactor
Documentation
Tests
Chores