Skip to content

feat(clp-s): Automatically infer the file types of inputs to compression.#1312

Merged
gibber9809 merged 17 commits into
y-scope:mainfrom
gibber9809:infer-file-type
Sep 21, 2025
Merged

feat(clp-s): Automatically infer the file types of inputs to compression.#1312
gibber9809 merged 17 commits into
y-scope:mainfrom
gibber9809:infer-file-type

Conversation

@gibber9809

@gibber9809 gibber9809 commented Sep 18, 2025

Copy link
Copy Markdown
Contributor

Description

This PR changes the compression flow of clp-s to automatically infer the file-types of files passed to compression. We do this by iteratively:

  1. Wrapping the input stream in a buffered reader
  2. Scanning the first few buffered bytes to deduce the file type
  3. If the content type is Zstd we wrap the buffered reader in a zstd decompressor and go back to step 1
  4. Otherwise if the content is KV-IR or JSON we start parsing the input as the deduced content type

This allows us to get rid of the --file-type argument, as well as add support for new input types such as JSON compressed inside a Zstd stream. This approach should also make it easy to add support for ingesting input compressed by other compression algorithms such as LZMA in the future.

The bulk of these changes are split between:

  1. A new try_deduce_reader_type function in InputConfig (and related utilities)
  2. A refactor of JsonParser such that ingestion is always invoked by the parsed -- this comprises most of the diff, but really mostly involves moving around existing code

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Validated that all tests pass
  • Manually validated that zstd compressed jsonl files can now be ingested

Summary by CodeRabbit

  • New Features

    • Automatic input-type detection for JSON, KV‑IR and Zstd-compressed inputs; removed explicit CLI file-type option and its accessor.
  • Refactor

    • Unified ingestion flow with streaming readers, improved error handling, dynamic archive splitting and per-record metadata.
  • Documentation

    • Removed --file-type from docs and examples; documented Zstd-compressed inputs.
  • Tests

    • Test helpers and test cases updated to match automatic detection and simplified signatures.
  • Chores

    • Build updated to include streaming-compression and reader components.

@gibber9809 gibber9809 requested review from a team and wraymo as code owners September 18, 2025 20:38
@coderabbitai

coderabbitai Bot commented Sep 18, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Removes CLI-configured input file type; adds runtime input-type deduction (JSON, KV‑IR, Zstd) via buffered reader chains. Refactors parser to ingest() with ingest_json/ingest_kvir, updates tests and docs, and adds BufferedReader/BufferReader and streaming-compression sources to builds.

Changes

Cohort / File(s) Summary
Build integration for buffered readers
components/core/src/clp_s/CMakeLists.txt, components/core/src/clp_s/indexer/CMakeLists.txt
Added BufferedReader.*, BufferReader.* and streaming-compression headers/sources into clp_s and indexer source lists so buffered reader and decompressor code are compiled.
CLI option removal (file type)
components/core/src/clp_s/CommandLineArguments.cpp, components/core/src/clp_s/CommandLineArguments.hpp
Removed --file-type parsing, associated constants, storage, validation and the public getter.
Input type deduction and FileType expansion
components/core/src/clp_s/InputConfig.hpp, components/core/src/clp_s/InputConfig.cpp
Extended FileType to include Zstd and Unknown. Added try_deduce_reader_type(...) to build buffered reader chains (64 KB limit, depth 5), peek start bytes to detect Zstd/KV‑IR/JSON, unwrap Zstd with Decompressor, and return reader chain plus deduced type; added close_nested_readers.
Parser dispatch refactor
components/core/src/clp_s/JsonParser.hpp, components/core/src/clp_s/JsonParser.cpp
Removed input_file_type and parse_from_ir(). Replaced parse() with ingest() that calls try_deduce_reader_type() and dispatches to new ingest_json(...) or ingest_kvir(...); added streaming ingestion, archive-splitting, per-record metadata handling and refined error/logging flows.
Main entrypoint alignment
components/core/src/clp_s/clp-s.cpp
Dropped code referencing removed option and file-type branching; now calls parser.ingest() unconditionally and handles failure uniformly.
Test utilities and call sites
components/core/tests/clp_s_test_utils.hpp, components/core/tests/clp_s_test_utils.cpp
Removed clp_s::FileType parameter from compress_archive(...) signature/implementation; helper now uses parser.ingest().
Updated tests to new signatures/behaviour
components/core/tests/test-clp_s-*.cpp
Adjusted compress_archive(...) call sites across tests to remove final FileType argument and removed input-file-type toggles.
Documentation update
docs/src/user-docs/core-clp-s.md
Removed --file-type docs and examples; updated input-path description to note possible Zstd-compressed ndjson or KV‑IR inputs.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User as User
  participant CLPS as clp-s (main)
  participant Parser as JsonParser
  participant IC as InputConfig
  participant BR as BufferedReader
  participant DZ as Decompressor

  Note over User,CLPS: Single entrypoint (no --file-type)
  User->>CLPS: Run clp-s compress ...
  CLPS->>Parser: ingest()
  Parser->>IC: try_deduce_reader_type(reader)
  IC->>BR: wrap base reader (buffer up to 64KB) and peek
  alt Zstd detected
    IC->>DZ: wrap with Decompressor and open
    DZ-->>IC: decompressed stream
    IC->>BR: re-wrap/re-peek inner data
  end
  alt JSON detected
    IC-->>Parser: [reader_chain, FileType::Json]
    Parser->>Parser: ingest_json(reader_chain.back(), path, creator_id)
  else KV‑IR detected
    IC-->>Parser: [reader_chain, FileType::KeyValueIr]
    Parser->>Parser: ingest_kvir(reader_chain.back(), path, creator_id)
  else Unknown
    IC-->>Parser: [empty, FileType::Unknown]
    Parser-->>CLPS: log error, fail
  end
  Parser-->>CLPS: success/failure
  CLPS-->>User: exit
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately describes the primary change: automatic inference of input file types for clp-s compression. It matches the PR objectives and diffs (addition of try_deduce_reader_type, removal of the --file-type option, and JsonParser refactor to an ingest flow that detects Zstd/JSON/KV-IR). The phrasing follows conventional-commit style and is clear for teammates scanning the project history.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
components/core/tests/test-clp_s-end_to_end.cpp (1)

108-116: Updated call site—OK. Add a zstd‑wrapped input case?

Consider adding a variant that feeds a Zstd‑compressed JSONL into compress_archive to exercise nested detection.

I can sketch a small helper that writes a zstd‑compressed copy of test_no_floats_sorted.jsonl using the in‑repo ZstdCompressor and then reuses this test.

components/core/src/clp_s/JsonParser.cpp (3)

34-34: Remove duplicate include of JsonParser.hpp.

It’s already included on Line 1.

Apply this diff:

-#include "JsonParser.hpp"

28-28: Unused include: zstd/Decompressor.hpp.

Not referenced in this TU; InputConfig handles wrapping. Consider removing to tighten dependencies.

Apply this diff and ensure it still builds:

-#include "../clp/streaming_compression/zstd/Decompressor.hpp"

81-83: Optional: clear stored event to free memory between log events.

clear() toggles a flag but retains the previous KeyValuePairLogEvent. Resetting the optional can reduce peak memory.

Apply this diff:

-    void clear() { m_is_complete = false; }
+    void clear() {
+        m_deserialized_log_event.reset();
+        m_is_complete = false;
+    }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3351e14 and 064c43e.

📒 Files selected for processing (16)
  • components/core/src/clp_s/CMakeLists.txt (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (0 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (0 hunks)
  • components/core/src/clp_s/InputConfig.cpp (5 hunks)
  • components/core/src/clp_s/InputConfig.hpp (3 hunks)
  • components/core/src/clp_s/JsonParser.cpp (3 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/src/clp_s/indexer/CMakeLists.txt (1 hunks)
  • components/core/tests/clp_s_test_utils.cpp (2 hunks)
  • components/core/tests/clp_s_test_utils.hpp (1 hunks)
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp (1 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (1 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (1 hunks)
  • components/core/tests/test-clp_s-search.cpp (1 hunks)
  • docs/src/user-docs/core-clp-s.md (1 hunks)
💤 Files with no reviewable changes (2)
  • components/core/src/clp_s/CommandLineArguments.cpp
  • components/core/src/clp_s/CommandLineArguments.hpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/tests/test-clp_s-range_index.cpp
  • components/core/src/clp_s/JsonParser.cpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
  • components/core/src/clp_s/indexer/CMakeLists.txt
  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.

Applied to files:

  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
🧬 Code graph analysis (4)
components/core/tests/clp_s_test_utils.cpp (1)
components/core/src/clp_s/clp-s.cpp (1)
  • parser (108-108)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (278-328)
components/core/src/clp_s/JsonParser.hpp (2)
components/core/src/clp_s/JsonParser.cpp (6)
  • nodiscard (49-54)
  • nodiscard (56-61)
  • nodiscard (63-69)
  • nodiscard (71-74)
  • nodiscard (76-79)
  • nodiscard (83-83)
components/core/src/clp_s/JsonFileIterator.hpp (3)
  • nodiscard (41-41)
  • nodiscard (46-46)
  • nodiscard (59-59)
components/core/src/clp_s/JsonParser.cpp (4)
components/core/src/clp_s/InputConfig.cpp (1)
  • try_deduce_reader_type (278-279)
components/core/src/clp_s/InputConfig.hpp (1)
  • try_deduce_reader_type (138-139)
components/core/src/clp_s/JsonParser.hpp (3)
  • reader (85-89)
  • reader (99-103)
  • path (225-225)
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
  • reader (59-61)
  • reader (77-80)
  • reader (141-142)
  • reader (184-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: build-macos (macos-14, true)
🔇 Additional comments (13)
components/core/tests/test-clp_s-search.cpp (1)

233-241: Updated call site—OK.

Switching to auto‑deduction in tests is consistent with the new API.

components/core/src/clp_s/InputConfig.hpp (1)

25-28: Extending FileType: add/verify string mappers and persistence handling

  • Switch coverage: FileType::Zstd and FileType::Unknown are already handled in components/core/src/clp_s/InputConfig.cpp and components/core/src/clp_s/JsonParser.cpp (search shows case labels and deduce/returns).
  • Missing mapper: no to_string(FileType) or other string/enum mapper found in the components tree — add one if FileType is converted to/from text (logs, CLI, config).
  • Persistence/on‑wire: confirm no on‑wire or persisted format relies on enum ordinals; if it does, either preserve ordinals or migrate to explicit serialization (strings/versioning).
components/core/src/clp_s/CMakeLists.txt (1)

10-13: BufferReader vs BufferedReader — distinct classes found; confirm indexer CMake & naming

Both exist and are used: components/core/src/clp/BufferReader.{hpp,cpp} and components/core/src/clp/BufferedReader.{hpp,cpp}; BufferReader also exists at components/core/src/glt/BufferReader.{hpp,cpp}. If both are intended keep them; otherwise rename to avoid cross-module confusion. Duplicate this verification in the indexer CMake to ensure the intended files are included.

components/core/tests/clp_s_test_utils.hpp (1)

22-28: Confirm all tests updated to the new compress_archive signature (5 args, no FileType).

Verify every call in components/core/tests—especially:
components/core/tests/test-clp_s-search.cpp,
components/core/tests/test-clp_s-range_index.cpp,
components/core/tests/test-clp_s-delta-encode-log-order.cpp,
components/core/tests/test-clp_s-end_to_end.cpp
—and ensure each call uses (file_path, archive_directory, timestamp_key, single_file_archive, structurize_arrays) with no FileType argument.

components/core/tests/test-clp_s-delta-encode-log-order.cpp (1)

69-75: API update in compress_archive call looks correct

Switched to the 5‑arg form; param order matches the updated signature.

components/core/src/clp_s/JsonParser.hpp (2)

68-68: Adding [[nodiscard]] to parse() is good

Prevents accidental ignore of parse result.


91-104: Ingestion helpers’ API looks reasonable

Signatures/readability are fine; no blocking issues here.

components/core/tests/clp_s_test_utils.cpp (2)

19-20: Signature change propagated correctly

Fifth argument is now structurize_arrays.


48-48: parse() assertion is appropriate

REQUIRE(parser.parse()) aligns with the new API.

components/core/tests/test-clp_s-range_index.cpp (1)

223-230: API update in compress_archive call looks correct

Removed obsolete file-type arg; parameter order matches updated helper.

components/core/src/clp_s/InputConfig.cpp (1)

243-264: Detection order/logic is sound

Zstd → KV‑IR → JSON is reasonable; early returns are correct.

components/core/src/clp_s/JsonParser.cpp (2)

91-97: Constructor option wiring looks correct.

New fields are plumbed cleanly into member initializers.


791-803: Right‑size logging for expected deserialization termination.

Past guidance accepts certain IR deserialization “errors” as normal termination signals. Consider lowering to DEBUG or filtering on specific error codes to avoid alarming warnings in healthy flows.

Do these codes ever carry expected end‑of‑stream signals besides IrUnitType::EndOfStream? If yes, propose:

-        SPDLOG_WARN(
+        SPDLOG_DEBUG(
                 "Encountered error while deserializing kv-ir log event from stream \"{}\": "

Please confirm the exact error codes emitted by Deserializer::deserialize_next_ir_unit on normal endings vs truncated streams.

Comment thread components/core/src/clp_s/clp-s.cpp Outdated
Comment thread components/core/src/clp_s/indexer/CMakeLists.txt
Comment thread components/core/src/clp_s/InputConfig.cpp
Comment thread components/core/src/clp_s/InputConfig.cpp
Comment thread components/core/src/clp_s/InputConfig.cpp
Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/JsonParser.cpp
Comment thread components/core/src/clp_s/JsonParser.hpp Outdated
Comment thread components/core/src/clp_s/JsonParser.hpp
Comment on lines +64 to 66
./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Docs: call out auto type inference and nested Zstd explicitly.

Since --file-type is removed and detection supports nested Zstd, add a short note under “Compression” clarifying that inputs may be JSON or KV‑IR, optionally Zstd‑compressed, and that the tool auto‑detects and unwraps nested formats. Also add guidance on what happens when deduction fails and how users should proceed.

🤖 Prompt for AI Agents
In docs/src/user-docs/core-clp-s.md around lines 64 to 66, add a short
“Compression” note explaining that input archives may be JSON or KV‑IR and may
optionally be Zstd‑compressed (including nested Zstd layers), that the CLI
auto‑detects file format and will automatically unwrap nested Zstd compression,
and that if format deduction fails the tool will surface an error instructing
users to either provide a known format (e.g., via a --file-type flag or
equivalent) or to pre‑decompress the archive before running the command; keep
the note concise and place it immediately under the Compression subsection where
the example command appears.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/JsonParser.cpp (1)

927-955: Return type mismatch with header for add_node_to_archive_and_translations.

Header declares int32_t; definition returns int. This won’t compile.

-auto JsonParser::add_node_to_archive_and_translations(
+auto JsonParser::add_node_to_archive_and_translations(
         uint32_t ir_node_id,
         clp::ffi::SchemaTree::Node const& ir_node_to_add,
         NodeType archive_node_type,
         int32_t parent_node_id,
         bool matches_timestamp
-) -> int {
+) -> int32_t {
♻️ Duplicate comments (9)
components/core/src/clp_s/InputConfig.cpp (4)

199-206: Harden Zstd magic check against signed-char pitfalls.

Define the magic as unsigned bytes to avoid implementation-defined char literals and keep intent clear. You already included .

-auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool {
-    constexpr std::array<char, 4> cZstdMagicNumber = {'\x28', '\xB5', '\x2F', '\xFD'};
+auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool {
+    constexpr std::array<unsigned char, 4> cZstdMagicNumber = {0x28, 0xB5, 0x2F, 0xFD};
     if (peek_size < cZstdMagicNumber.size()) {
         return false;
     }
-
-    return 0 == std::memcmp(peek_buf, cZstdMagicNumber.data(), cZstdMagicNumber.size());
+    return 0
+           == std::memcmp(
+                   peek_buf,
+                   reinterpret_cast<char const*>(cZstdMagicNumber.data()),
+                   cZstdMagicNumber.size()
+           );
 }

227-240: Fix UB in std::isspace and recognise BOM/top‑level arrays.

Cast to unsigned char for ctype, skip UTF‑8 BOM, and accept '[' as valid JSON start.

-auto could_be_json(char const* peek_buf, size_t peek_size) -> bool {
-    for (size_t i{0ULL}; i < peek_size; ++i) {
-        if (std::isspace(static_cast<int>(peek_buf[i]))) {
-            continue;
-        }
-
-        if ('{' == peek_buf[i]) {
-            return true;
-        } else {
-            return false;
-        }
-    }
-    return false;
-}
+auto could_be_json(char const* peek_buf, size_t peek_size) -> bool {
+    size_t i{0ULL};
+    if (peek_size >= 3
+        && static_cast<unsigned char>(peek_buf[0]) == 0xEF
+        && static_cast<unsigned char>(peek_buf[1]) == 0xBB
+        && static_cast<unsigned char>(peek_buf[2]) == 0xBF)
+    {
+        i = 3;
+    }
+    for (; i < peek_size; ++i) {
+        unsigned char c = static_cast<unsigned char>(peek_buf[i]);
+        if (false == std::isspace(c)) {
+            return ('{' == static_cast<char>(c)) || ('[' == static_cast<char>(c));
+        }
+    }
+    return false;
+}

285-291: Don’t reject readers that can’t report position.

Only bail if try_get_pos succeeds and position != 0. Network streams often don’t support positioning.

-    size_t original_pos{};
-    if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos)
-        || 0ULL != original_pos)
-    {
-        return {{}, FileType::Unknown};
-    }
+    size_t original_pos{};
+    auto const pos_rc = reader->try_get_pos(original_pos);
+    if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) {
+        return {{}, FileType::Unknown};
+    }

292-301: Add debug for buffered refill failure to aid field diagnostics.

A one‑line debug before returning Unknown helps users pinpoint I/O errors.

-        auto const rc{buffered_reader->try_refill_buffer_if_empty()};
-        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
-            return {{}, FileType::Unknown};
-        }
+        auto const rc{buffered_reader->try_refill_buffer_if_empty()};
+        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
+            SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc));
+            return {{}, FileType::Unknown};
+        }
components/core/src/clp_s/InputConfig.hpp (1)

128-139: Tighten API docs for try_deduce_reader_type (preconditions/lifecycle).

Clarify that the root reader should be valid and at offset 0 if seekable; describe returned chain order and caller responsibility to close in reverse.

-/**
- * Tries to deduce the underlying file-type of the file opened by `reader`, and return a
+/**
+ * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a
  * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of
  * compression.
- * @param reader
+ * Preconditions:
+ *  - `reader` is valid; if it supports positioning, it should be at offset 0.
+ * Postconditions:
+ *  - The returned vector contains the reader chain (outermost first). The last element is open
+ *    and positioned at the start of the deduced content. Callers should close/teardown in reverse.
+ * @param reader
  * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is
  * open for reading content of the type described by the element in the pair. When the content type
  * cannot be deduced, we return an empty vector and `FileType::Unknown`.
  */
components/core/src/clp_s/JsonParser.cpp (4)

499-516: Defensive check before nested_readers.back().

Avoid UB if the vector is empty on an internal error.

-        auto const [nested_readers, file_type] = try_deduce_reader_type(reader);
-        bool ingestion_successful{};
+        auto const [nested_readers, file_type] = try_deduce_reader_type(reader);
+        if (nested_readers.empty()) {
+            SPDLOG_ERROR("Internal error: no reader layers returned for input {}", path.path);
+            std::ignore = m_archive_writer->close();
+            return false;
+        }
+        bool ingestion_successful{};

549-596: Don’t skip filename/split/creator metadata when record order is disabled (JSON path).

Gate only the per‑record log index field; always add per‑archive metadata.

-    auto initialize_fields_for_archive = [&]() -> bool {
-        if (false == m_record_log_order) {
-            return true;
-        }
-        log_event_idx_node_id
-                = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+    auto initialize_fields_for_archive = [&]() -> bool {
+        if (m_record_log_order) {
+            log_event_idx_node_id
+                    = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+        }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cFilename},
                     path.path
             );
             ErrorCodeSuccess != rc)
         {

611-617: Clarify error message: this rejects scalar values, not arrays.

Change “non‑json‑object” → “scalar JSON value” to reduce confusion.

-            SPDLOG_ERROR(
-                    "Encountered non-json-object while trying to parse {} after parsing {} "
+            SPDLOG_ERROR(
+                    "Encountered scalar JSON value while trying to parse {} after parsing {} "
                     "bytes",

715-783: Don’t skip filename/split/creator metadata when record order is disabled (KV‑IR path).

Same gating issue as JSON path.

-    auto initialize_fields_for_archive = [&]() -> bool {
-        if (false == m_record_log_order) {
-            return true;
-        }
-        log_event_idx_node_id
-                = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+    auto initialize_fields_for_archive = [&]() -> bool {
+        if (m_record_log_order) {
+            log_event_idx_node_id
+                    = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+        }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cFilename},
                     path.path
             );
             ErrorCodeSuccess != rc)
         {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 064c43e and b3d6425.

📒 Files selected for processing (4)
  • components/core/src/clp_s/InputConfig.cpp (5 hunks)
  • components/core/src/clp_s/InputConfig.hpp (3 hunks)
  • components/core/src/clp_s/JsonParser.cpp (3 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
🧠 Learnings (15)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/InputConfig.hpp
🧬 Code graph analysis (3)
components/core/src/clp_s/JsonParser.cpp (4)
components/core/src/clp_s/InputConfig.cpp (1)
  • try_deduce_reader_type (277-278)
components/core/src/clp_s/InputConfig.hpp (1)
  • try_deduce_reader_type (137-138)
components/core/src/clp_s/JsonParser.hpp (3)
  • reader (84-88)
  • reader (98-102)
  • path (224-224)
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
  • reader (59-61)
  • reader (77-80)
  • reader (141-142)
  • reader (184-188)
components/core/src/clp_s/JsonParser.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (277-327)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (277-327)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: package-image
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (1)
components/core/src/clp_s/JsonParser.hpp (1)

64-69: Docstring fix and return type change look good.

“successfully” fix and auto parse() -> bool are consistent with the implementation.

Comment on lines 75 to +103
private:
/**
* Parses JSON input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_json(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;

/**
* Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_kvir(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Clarify ingest_json/ingest_kvir contracts (preconditions/lifetime).

Document that the reader must be the last element returned by try_deduce_reader_type (already opened/positioned at start of the deduced content) and that callers own closing/teardown via JsonParser::store(). This avoids ambiguity about ownership and expected position.

  /**
   * Parses JSON input and ingests it into the current archive, splitting the archive if it grows
   * beyond the target encoded size.
- * @param reader
+ * Preconditions:
+ *  - `reader` is the final, ready-to-read layer returned by try_deduce_reader_type()
+ *    (positioned at the start of JSON content).
+ * Ownership:
+ *  - JsonParser manages archive lifecycle; callers invoke store() to flush/close.
+ * @param reader
   * @param path
   * @param archive_creator_id
   * @return Whether ingestion was successful or not.
   */
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private:
/**
* Parses JSON input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_json(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;
/**
* Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_kvir(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;
private:
/**
* Parses JSON input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* Preconditions:
* - `reader` is the final, ready-to-read layer returned by try_deduce_reader_type()
* (positioned at the start of JSON content).
* Ownership:
* - JsonParser manages archive lifecycle; callers invoke store() to flush/close.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_json(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;
/**
* Parses KV-IR input and ingests it into the current archive, splitting the archive if it grows
* beyond the target encoded size.
* @param reader
* @param path
* @param archive_creator_id
* @return Whether ingestion was successful or not.
*/
[[nodiscard]] auto ingest_kvir(
std::shared_ptr<clp::ReaderInterface> reader,
Path const& path,
std::string const& archive_creator_id
) -> bool;
🤖 Prompt for AI Agents
In components/core/src/clp_s/JsonParser.hpp around lines 75 to 103, the
ingest_json and ingest_kvir method contracts are ambiguous about the reader
lifetime and position; update their comments to state that the reader parameter
must be the last element returned by try_deduce_reader_type, must already be
opened and positioned at the start of the deduced content, and that callers
retain ownership and are responsible for closing/teardown (e.g., via
JsonParser::store()) after these methods return; make the wording concise and
identical for both methods so callers clearly understand preconditions and
ownership.

@wraymo wraymo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! Just had a few small comments.

}

auto could_be_json(char const* peek_buf, size_t peek_size) -> bool {
for (size_t i{0ULL}; i < peek_size; ++i) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have more spaces than peek_size and do you think the UTF‑8 BOM comment from coderabbit make sense?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON doesn't allow BOM, and the current code should already reject BOM, so that part is fine. I'll double check about the negative characters though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have more spaces than peek size I'm personally fine treating the input as not JSON -- for reference the peek size should 64KB unless the input is smaller, so the input has to be really malformed for this to not work.

Comment thread components/core/src/clp_s/InputConfig.cpp Outdated
continue;
}

if ('{' == peek_buf[i]) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we handle arrays or integers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently we end up treating them as "Unknown" content type, and the user will see an error log like "could not deduce file-type for file xyz" before we exit. Since we don't actually support parsing JSON types besides object at the root right now I think this is fine -- we can change this check later if we start supporting other types.


/**
* Checks if an input contains Zstd data, based on the first few bytes of data from the input.
* @return Whether the input could be Zstd.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add parameter descriptions for these three methods?

Comment thread components/core/src/clp_s/InputConfig.hpp Outdated
Comment on lines +503 to +506
ingestion_successful = ingest_json(nested_readers.back(), path, archive_creator_id);
break;
case FileType::KeyValueIr:
ingestion_successful = ingest_kvir(nested_readers.back(), path, archive_creator_id);

@wraymo wraymo Sep 19, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we only need the last one, is it required to return a vector?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately yes -- Since zstd::Decompressor only takes a reference to a reader instead of a shared_ptr if we return a decompressor as the last reader alone (as we would do for kv-ir inputs) the previous readers would get deallocated. Returning a vector helps avoid this case. It can also be useful for example if we need to close multiple readers in the hierarchy before we exit (which I should be doing for the decompressors, but forgot about, so I'll add that to the PR).

* @return whether the IR Stream was parsed successfully
*/
[[nodiscard]] auto parse_from_ir() -> bool;
[[nodiscard]] auto parse() -> bool;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should also update this method name.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I can rename parse to ingest

[[nodiscard]] auto try_deduce_reader_type(std::shared_ptr<clp::ReaderInterface> reader)
-> std::pair<std::vector<std::shared_ptr<clp::ReaderInterface>>, FileType> {
constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB
constexpr size_t cMaxNestedFormatDepth = 5;

@wraymo wraymo Sep 19, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 5?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a somewhat arbitrary choice -- I'm just trying to avoid cases where someone maliciously or accidentally gives us a file that we interpret as being wrapped in many layers of compression and wasting compute/memory. I figure 5 is more than enough for real cases, especially for now.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (5)
components/core/src/clp_s/InputConfig.hpp (1)

128-139: Header docs: clarify preconditions, ownership, and return semantics

Docs still omit critical preconditions and lifetime/ownership details; align with the implementation and previous feedback.

Apply:

 /**
- * Tries to deduce the underlying file-type of the file opened by `reader`, and returns a
+ * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a
  * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of
  * compression.
- * @param reader
- * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is
- * open for reading content of the type described by the element in the pair. When the content type
- * cannot be deduced, we return an empty vector and `FileType::Unknown`.
+ * Preconditions:
+ *  - `reader` must be valid and (when position is known) positioned at offset 0.
+ * Postconditions:
+ *  - The returned vector contains the reader chain (outermost first). The last element is ready
+ *    to read the deduced content type. Callers tear down/close in reverse order.
+ * @param reader Root reader to inspect and unwrap
+ * @return A vector of all created `clp::ReaderInterface`s (outermost first) and the deduced
+ *         FileType. When the content type cannot be deduced, returns an empty vector and
+ *         `FileType::Unknown`.
  */
components/core/src/clp_s/InputConfig.cpp (4)

199-206: Use unsigned bytes for Zstd magic to avoid signed‑char pitfalls

Define the magic as unsigned and compare via memcmp against unsigned data.

-auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool {
-    constexpr std::array<char, 4> cZstdMagicNumber = {'\x28', '\xB5', '\x2F', '\xFD'};
+auto could_be_zstd(char const* peek_buf, size_t peek_size) -> bool {
+    constexpr std::array<unsigned char, 4> cZstdMagicNumber = {0x28, 0xB5, 0x2F, 0xFD};
     if (peek_size < cZstdMagicNumber.size()) {
         return false;
     }
-
-    return 0 == std::memcmp(peek_buf, cZstdMagicNumber.data(), cZstdMagicNumber.size());
+    return 0
+           == std::memcmp(
+                   peek_buf,
+                   reinterpret_cast<char const*>(cZstdMagicNumber.data()),
+                   cZstdMagicNumber.size()
+           );
 }

281-286: Don’t reject readers that don’t support positioning

Only bail when try_get_pos succeeds AND reports a non‑zero position; otherwise proceed.

-    size_t original_pos{};
-    if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos)
-        || 0ULL != original_pos)
-    {
-        return {{}, FileType::Unknown};
-    }
+    size_t original_pos{};
+    auto const pos_rc = reader->try_get_pos(original_pos);
+    if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) {
+        return {{}, FileType::Unknown};
+    }
#!/bin/bash
# Verify behaviour of try_get_pos across ReaderInterface implementations.
rg -nP --type=cpp -C3 '\bclass\s+\w+.*?ReaderInterface|\btry_get_pos\s*\('

293-296: Add debug log on buffer refill failure

A small log greatly helps explaining Unknown results during inference.

-        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
-            return {{}, FileType::Unknown};
-        }
+        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
+            SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc));
+            return {{}, FileType::Unknown};
+        }

227-236: Fix UB in std::isspace and recognise top-level JSON arrays

  • Cast peek_buf[i] to unsigned char before calling std::isspace to avoid UB for signed char.
  • Treat '[' as a valid top-level JSON start or document/gate it if arrays aren't supported.
-auto could_be_json(char const* peek_buf, size_t peek_size) -> bool {
-    for (size_t i{0ULL}; i < peek_size; ++i) {
-        if (std::isspace(static_cast<int>(peek_buf[i]))) {
-            continue;
-        }
-
-        return '{' == peek_buf[i];
-    }
-    return false;
-}
+auto could_be_json(char const* peek_buf, size_t peek_size) -> bool {
+    for (size_t i{0ULL}; i < peek_size; ++i) {
+        unsigned char c = static_cast<unsigned char>(peek_buf[i]);
+        if (false == std::isspace(c)) {
+            char const ch = static_cast<char>(c);
+            return ('{' == ch) || ('[' == ch);
+        }
+    }
+    return false;
+}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3d6425 and dd3be42.

📒 Files selected for processing (2)
  • components/core/src/clp_s/InputConfig.cpp (5 hunks)
  • components/core/src/clp_s/InputConfig.hpp (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/InputConfig.hpp
🧬 Code graph analysis (1)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (273-323)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-15, false)
🔇 Additional comments (2)
components/core/src/clp_s/InputConfig.cpp (2)

3-7: Includes: correct and scoped

Added headers are appropriate for memcmp, detection helpers, and zstd decompression.

Also applies to: 16-17, 22-22


104-129: TU‑local helpers: good visibility control

Forward declarations inside an anonymous namespace keep helpers internal. LGTM.

Comment on lines +275 to +276
constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB
constexpr size_t cMaxNestedFormatDepth = 5;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Document nesting cap rationale

Add a short comment to justify 5 and signal DOS protection intent.

-    constexpr size_t cMaxNestedFormatDepth = 5;
+    // Defensive cap to prevent excessive/untrusted nesting (compression bombs, etc.).
+    // 5 levels should cover realistic cases (e.g., zstd->json, zstd->kvir->zstd->json).
+    constexpr size_t cMaxNestedFormatDepth = 5;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB
constexpr size_t cMaxNestedFormatDepth = 5;
constexpr size_t cFileReadBufferCapacity = 64 * 1024; // 64 KB
// Defensive cap to prevent excessive/untrusted nesting (compression bombs, etc.).
// 5 levels should cover realistic cases (e.g., zstd->json, zstd->kvir->zstd->json).
constexpr size_t cMaxNestedFormatDepth = 5;
🤖 Prompt for AI Agents
In components/core/src/clp_s/InputConfig.cpp around lines 275 to 276, the
constant cMaxNestedFormatDepth is set to 5 without explanation; add a short
inline comment explaining the rationale (limits nesting depth to prevent
excessive recursion/stack use and mitigate DoS from deeply nested input), e.g.,
mention safety/DoS protection and trade-off between flexibility and resource
limits so future maintainers understand why 5 was chosen.

@gibber9809 gibber9809 requested a review from wraymo September 19, 2025 14:19

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (12)
components/core/src/clp_s/indexer/CMakeLists.txt (1)

7-10: BufferReader vs BufferedReader: keep both; looks correct.

Both classes exist and serve different purposes; including both here is intentional.

components/core/src/clp_s/CMakeLists.txt (1)

10-13: BufferReader and BufferedReader additions are appropriate.

Required for peek/rewrap flow; additions align with the new deduction path.

components/core/src/clp_s/clp-s.cpp (1)

109-111: Enhance parse failure log with actionable context.

Log how many input paths were attempted so users can triage quickly.

-    if (false == parser.ingest()) {
-        SPDLOG_ERROR("Encountered error while parsing input.");
+    if (false == parser.ingest()) {
+        SPDLOG_ERROR(
+                "Encountered error while parsing {} input path(s).",
+                option.input_paths.size()
+        );
         return false;
     }
components/core/src/clp_s/JsonParser.hpp (1)

75-103: Clarify ingest_json/ingest_kvir reader contract (preconditions/ownership).

Document that reader must be the last layer from try_deduce_reader_type(), positioned at start-of-content; and lifecycle/close is handled by store()/close_nested_readers().

-     * Parses JSON input and ingests it into the current archive, splitting the archive if it grows
+     * Parses JSON input and ingests it into the current archive, splitting the archive if it grows
      * beyond the target encoded size.
-     * @param reader
+     * Preconditions:
+     *  - `reader` is the final layer returned by try_deduce_reader_type(), positioned at the start
+     *    of JSON content.
+     * Ownership/Lifetime:
+     *  - JsonParser manages archive lifecycle; callers finalize via store(). Reader teardown is
+     *    performed by close_nested_readers().
+     * @param reader

Apply the same wording to ingest_kvir().

components/core/src/clp_s/InputConfig.hpp (1)

128-139: Fix docs and state pre/postconditions for try_deduce_reader_type.

Clarify reader position requirement and close order; align @param with the actual argument.

-/**
- * Tries to deduce the underlying file-type of the file opened by `reader`, and returns a
+/**
+ * Tries to deduce the underlying file-type of the stream opened by `reader`, and returns a
  * (potentially new) reader for underlying JSON or KV-IR content by unwrapping layers of
  * compression.
- * @param reader
+ * Preconditions:
+ *  - `reader` should be valid and preferably positioned at offset 0.
+ * Postconditions:
+ *  - The returned vector contains the reader chain (outermost first); the last element is ready
+ *    to read the deduced content. Callers close layers in reverse order via close_nested_readers().
+ * @param reader Root reader to inspect and unwrap
  * @return A vector of all created `clp::ReaderInterface`s, where the last entry in the vector is
  * open for reading content of the type described by the element in the pair. When the content type
  * cannot be deduced, we return an empty vector and `FileType::Unknown`.
  */
components/core/src/clp_s/InputConfig.cpp (3)

282-284: Document nesting cap rationale.

Add a brief comment to justify the 5‑level limit (DoS guard).

-    constexpr size_t cMaxNestedFormatDepth = 5;
+    // Defensive cap to avoid excessive/untrusted nesting (e.g., compression bombs).
+    // Five levels should cover realistic cases.
+    constexpr size_t cMaxNestedFormatDepth = 5;

288-293: Don’t hard‑fail when try_get_pos is unsupported.

Some readers (e.g., network) may not support positioning; only bail if the call succeeds and position != 0.

-    size_t original_pos{};
-    if (clp::ErrorCode::ErrorCode_Success != reader->try_get_pos(original_pos)
-        || 0ULL != original_pos)
-    {
-        return {{}, FileType::Unknown};
-    }
+    size_t original_pos{};
+    auto const pos_rc = reader->try_get_pos(original_pos);
+    if (clp::ErrorCode::ErrorCode_Success == pos_rc && 0ULL != original_pos) {
+        return {{}, FileType::Unknown};
+    }

300-303: Optional: log buffer refill failure.

A DEBUG log helps diagnose Unknown classifications due to I/O issues.

-        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
-            return {{}, FileType::Unknown};
-        }
+        if (clp::ErrorCode::ErrorCode_Success != rc && clp::ErrorCode::ErrorCode_EndOfFile != rc) {
+            SPDLOG_DEBUG("BufferedReader refill failed with rc={}", static_cast<int>(rc));
+            return {{}, FileType::Unknown};
+        }
components/core/src/clp_s/JsonParser.cpp (4)

499-516: Guard against empty nested_readers before using back().

Avoid UB if deduction returns a known type with an empty vector.

-        auto const [nested_readers, file_type] = try_deduce_reader_type(reader);
-        bool ingestion_successful{};
+        auto const [nested_readers, file_type] = try_deduce_reader_type(reader);
+        if (true == nested_readers.empty()) {
+            SPDLOG_ERROR("Internal error: no reader layers returned for input {}", path.path);
+            std::ignore = m_archive_writer->close();
+            return false;
+        }
+        bool ingestion_successful{};

549-595: Do not gate per‑archive metadata on record_log_order (JSON path).

Filename, split number, and archive_creator_id should always be recorded; only log_event_idx should be gated.

-    auto initialize_fields_for_archive = [&]() -> bool {
-        if (false == m_record_log_order) {
-            return true;
-        }
-        log_event_idx_node_id
-                = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+    auto initialize_fields_for_archive = [&]() -> bool {
+        if (m_record_log_order) {
+            log_event_idx_node_id
+                    = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+        }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cFilename},
                     path.path
             );
             ErrorCodeSuccess != rc)
         {
             SPDLOG_ERROR(
                     "Failed to add metadata field \"{}\" ({})",
                     constants::range_index::cFilename,
                     static_cast<int64_t>(rc)
             );
             return false;
         }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cFileSplitNumber},
                     file_split_number
             );
             ErrorCodeSuccess != rc)
         {
             SPDLOG_ERROR(
                     "Failed to add metadata field \"{}\" ({})",
                     constants::range_index::cFileSplitNumber,
                     static_cast<int64_t>(rc)
             );
             return false;
         }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cArchiveCreatorId},
                     archive_creator_id
             );
             ErrorCodeSuccess != rc)
         {
             SPDLOG_ERROR(
                     "Failed to add metadata field \"{}\" ({})",
                     constants::range_index::cArchiveCreatorId,
                     static_cast<int64_t>(rc)
             );
             return false;
         }
         return true;
     };
@@
-    if (m_record_log_order) {
-        if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) {
-            SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc));
-            return false;
-        }
-    }
+    if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) {
+        SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc));
+        return false;
+    }

Also applies to: 687-693


611-618: Clarify error text (JSON scalar).

Users may see arrays; this message specifically rejects scalars.

-            SPDLOG_ERROR(
-                    "Encountered non-json-object while trying to parse {} after parsing {} "
+            SPDLOG_ERROR(
+                    "Encountered scalar JSON value while trying to parse {} after parsing {} "
                     "bytes",

716-781: Do not gate per‑archive metadata on record_log_order (KV‑IR path).

Mirror the JSON fix: always write filename/split/creator, gate only log_event_idx.

-    auto initialize_fields_for_archive = [&]() -> bool {
-        if (false == m_record_log_order) {
-            return true;
-        }
-        log_event_idx_node_id
-                = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+    auto initialize_fields_for_archive = [&]() -> bool {
+        if (m_record_log_order) {
+            log_event_idx_node_id
+                    = add_metadata_field(constants::cLogEventIdxName, NodeType::DeltaInteger);
+        }
         if (auto const rc = m_archive_writer->add_field_to_current_range(
                     std::string{constants::range_index::cFilename},
                     path.path
             );
             ErrorCodeSuccess != rc)
         {
@@
-    if (m_record_log_order) {
-        if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) {
-            SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc));
-            return false;
-        }
-    }
+    if (auto const rc = m_archive_writer->close_current_range(); ErrorCodeSuccess != rc) {
+        SPDLOG_ERROR("Failed to close metadata range: {}", static_cast<int64_t>(rc));
+        return false;
+    }

Also applies to: 860-865

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd3be42 and 6a39869.

📒 Files selected for processing (8)
  • components/core/src/clp_s/CMakeLists.txt (2 hunks)
  • components/core/src/clp_s/InputConfig.cpp (5 hunks)
  • components/core/src/clp_s/InputConfig.hpp (3 hunks)
  • components/core/src/clp_s/JsonParser.cpp (4 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/src/clp_s/indexer/CMakeLists.txt (2 hunks)
  • components/core/tests/clp_s_test_utils.cpp (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/JsonParser.cpp
🧠 Learnings (20)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-10-07T21:35:04.362Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:735-794
Timestamp: 2024-10-07T21:35:04.362Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir` method, encountering errors from `kv_log_event_result.error()` aside from `std::errc::no_message_available` and `std::errc::result_out_of_range` is anticipated behavior and does not require additional error handling or logging.

Applied to files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-24T14:46:00.664Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.hpp:158-158
Timestamp: 2024-10-24T14:46:00.664Z
Learning: In `components/core/src/clp/BufferedFileReader.hpp`, the `BufferedFileReader::try_read` function explicitly calls `BufferReader::try_read`, so documentation should reference `BufferReader::try_read`.

Applied to files:

  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/indexer/CMakeLists.txt
  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2025-01-30T19:26:33.869Z
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.

Applied to files:

  • components/core/src/clp_s/InputConfig.hpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/indexer/CMakeLists.txt
  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-12-05T16:32:21.507Z
Learnt from: haiqi96
PR: y-scope/clp#619
File: components/core/src/clp/clp/decompression.cpp:313-313
Timestamp: 2024-12-05T16:32:21.507Z
Learning: In the C++ `FileDecompressor` implementations at `components/core/src/clp/clp/FileDecompressor.cpp` and `components/core/src/glt/glt/FileDecompressor.cpp`, the `temp_output_path` variable and associated logic are used to handle multiple compressed files with the same name, and should be kept. This logic is separate from the temporary output directory code removed in PR #619 and is necessary for proper file handling.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:756-765
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `parse_from_ir()` function, reaching the end of log events in a given IR is not considered an error case. The errors `std::errc::no_message_available` and `std::errc::result_out_of_range` are expected signals to break the deserialization loop and proceed accordingly.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:629-733
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In the `JsonParser::parse_kv_log_event` method in `components/core/src/clp_s/JsonParser.cpp`, prefer passing exceptions to higher levels for more graceful handling, rather than handling them at that level.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-09-27T23:19:17.079Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:717-726
Timestamp: 2024-09-27T23:19:17.079Z
Learning: In the `JsonParser::parse_from_IR` function, when deserializing, it's intentional to break out of the loop without additional error handling when `kv_log_event_result.has_error()` returns true for errors other than `no_message_available` and `result_out_of_range`, as this signifies the end of the IR file.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonParser.hpp
📚 Learning: 2024-10-24T14:45:26.265Z
Learnt from: haiqi96
PR: y-scope/clp#523
File: components/core/src/clp/BufferedFileReader.cpp:96-106
Timestamp: 2024-10-24T14:45:26.265Z
Learning: In `components/core/src/clp/BufferedFileReader.cpp`, refactoring the nested error handling conditions may not apply due to the specific logic in the original code.

Applied to files:

  • components/core/src/clp_s/indexer/CMakeLists.txt
  • components/core/src/clp_s/CMakeLists.txt
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-10-07T20:10:08.254Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/clp-s.cpp:196-265
Timestamp: 2024-10-07T20:10:08.254Z
Learning: In `clp-s.cpp`, the `run_serializer` function interleaves serialization and writing of IR files, making it difficult to restructure it into separate functions.

Applied to files:

  • components/core/src/clp_s/InputConfig.cpp
📚 Learning: 2024-11-15T03:15:45.919Z
Learnt from: LinZhihao-723
PR: y-scope/clp#593
File: components/core/tests/test-NetworkReader.cpp:216-219
Timestamp: 2024-11-15T03:15:45.919Z
Learning: In the `network_reader_with_valid_http_header_kv_pairs` test case in `components/core/tests/test-NetworkReader.cpp`, additional error handling for JSON parsing failures is not necessary, as the current error message is considered sufficient.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
🧬 Code graph analysis (4)
components/core/src/clp_s/InputConfig.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (280-330)
components/core/src/clp_s/JsonParser.hpp (1)
components/core/src/clp_s/InputConfig.cpp (1)
  • nodiscard (280-330)
components/core/src/clp_s/InputConfig.cpp (3)
components/core/src/clp_s/JsonParser.hpp (2)
  • reader (84-88)
  • reader (98-102)
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
  • reader (59-61)
  • reader (77-80)
  • reader (141-142)
  • reader (184-188)
components/core/src/clp_s/ArchiveReaderAdaptor.hpp (6)
  • decompressor (100-100)
  • decompressor (109-109)
  • decompressor (118-118)
  • decompressor (126-126)
  • decompressor (134-134)
  • decompressor (166-166)
components/core/src/clp_s/JsonParser.cpp (3)
components/core/src/clp_s/InputConfig.cpp (3)
  • try_deduce_reader_type (280-281)
  • close_nested_readers (332-341)
  • close_nested_readers (332-332)
components/core/src/clp_s/JsonParser.hpp (3)
  • reader (84-88)
  • reader (98-102)
  • path (224-224)
components/core/src/clp/ffi/ir_stream/Deserializer.hpp (4)
  • reader (59-61)
  • reader (77-80)
  • reader (141-142)
  • reader (184-188)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: package-image
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (13)
components/core/src/clp_s/CMakeLists.txt (1)

95-96: Ambiguous header name: Decompressor.hpp — verify and use qualified includes

components/core/src/clp_s/CMakeLists.txt adds ../clp/streaming_compression/Decompressor.hpp (lines 95–96) while the target also provides clp_s/Decompressor.hpp (line 186). Unqualified #include "Decompressor.hpp" may be ambiguous.

  • Action: Search for unqualified includes and either replace with fully qualified paths (e.g. #include "clp/streaming_compression/Decompressor.hpp" or #include "clp_s/Decompressor.hpp") or constrain the target's include directories to avoid overlap.
  • Suggested check: run rg -nP '^\s*#\sinclude\s[<"]Decompressor.hpp[">]' components -g '!/build/' and fix any matches.
components/core/src/clp_s/indexer/CMakeLists.txt (1)

63-66: No ODR/name collisions — different namespaces/symbols

Verified: core zstd Decompressor is clp::streaming_compression::zstd::Decompressor (components/core/src/clp/streaming_compression/zstd/Decompressor.{hpp,cpp}) and clp-s defines clp_s::ZstdDecompressor (components/core/src/clp_s/ZstdDecompressor.{hpp,cpp}); they are different fully‑qualified types so no link‑time ODR collision. clp_s also constructs clp::streaming_compression::zstd::Decompressor in InputConfig.cpp.

components/core/tests/clp_s_test_utils.cpp (1)

14-20: Signature change LGTM.

Parameter removal of file_type and addition of structurize_arrays are reflected correctly.

components/core/src/clp_s/JsonParser.hpp (1)

64-68: Public entrypoint rename LGTM.

ingest() API and doc read clearly.

components/core/src/clp_s/InputConfig.hpp (2)

8-8: Include LGTM.

Needed for std::pair in the new API.


25-28: FileType extensions LGTM.

Json/KeyValueIr/Zstd/Unknown enum layout is clear.

components/core/src/clp_s/InputConfig.cpp (5)

105-135: Internal helpers LGTM.

TU-local prototypes keep the surface area minimal.


206-213: Zstd magic check is fine.

memcmp-based check is simple and correct.


234-243: JSON detection behaviour is deliberate.

Only ‘{’ at the root is accepted; matches current ingestion support.


245-266: Type peek helper LGTM.

Clear precedence order: Zstd → KV‑IR → JSON → Unknown.


332-341: close_nested_readers LGTM.

Reverse‑order close of decompressors avoids leaks.

components/core/src/clp_s/JsonParser.cpp (2)

90-97: Constructor wiring LGTM.

New options (record_log_order, input_paths, network_auth) are propagated correctly.


828-839: get_pos is supported on NetworkReader — no change required

clp::ReaderInterface defines get_pos() (calls try_get_pos) and NetworkReader implements try_get_pos (components/core/src/clp/ReaderInterface.cpp:117; components/core/src/clp/NetworkReader.cpp:157).

Comment on lines +140 to 145
/**
* Closes all readers in a vector of nested readers, starting from the last reader.
* @param readers
*/
void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers);
} // namespace clp_s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Note behaviour of close_nested_readers in docs.

Mention that only Decompressor layers require/receive explicit close; other readers rely on RAII.

-/**
- * Closes all readers in a vector of nested readers, starting from the last reader.
+/**
+ * Closes all readers in a vector of nested readers, starting from the last reader.
+ * Currently only Decompressor layers perform an explicit close; other readers rely on RAII.
  * @param readers
  */
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
/**
* Closes all readers in a vector of nested readers, starting from the last reader.
* @param readers
*/
void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers);
} // namespace clp_s
/**
* Closes all readers in a vector of nested readers, starting from the last reader.
* Currently only Decompressor layers perform an explicit close; other readers rely on RAII.
* @param readers
*/
void close_nested_readers(std::vector<std::shared_ptr<clp::ReaderInterface>> const& readers);
} // namespace clp_s
🤖 Prompt for AI Agents
In components/core/src/clp_s/InputConfig.hpp around lines 140 to 145, update the
doc comment for close_nested_readers to clearly state its behavior: explain that
it iterates readers from last to first and only calls explicit close on
Decompressor-type layers (other reader types rely on RAII/destructors and will
not be explicitly closed), and mention that the function accepts a const vector
of shared_ptr<clp::ReaderInterface> and performs no ownership transfer; keep the
brief @param description and clarify that callers should only expect
Decompressor resources to be finalized by this function.

Comment on lines +48 to 49
REQUIRE(parser.ingest());
REQUIRE_NOTHROW(archive_stats = parser.store());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Good switch to ingest(); consider asserting non-empty stats.

Current checks are fine; optionally assert archive_stats.size() > 0 to catch no-op ingests.

-    REQUIRE_NOTHROW(archive_stats = parser.store());
+    REQUIRE_NOTHROW(archive_stats = parser.store());
+    REQUIRE_FALSE(archive_stats.empty());
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
REQUIRE(parser.ingest());
REQUIRE_NOTHROW(archive_stats = parser.store());
REQUIRE(parser.ingest());
REQUIRE_NOTHROW(archive_stats = parser.store());
REQUIRE_FALSE(archive_stats.empty());
🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.cpp around lines 48-49, after calling
REQUIRE_NOTHROW(archive_stats = parser.store()); add an assertion that the
returned archive_stats is non-empty to catch no-op ingests; update the test to
assert archive_stats.size() > 0 (or use REQUIRE_FALSE(archive_stats.empty()))
immediately after the store call so the test fails when ingestion produced no
stats.

= std::dynamic_pointer_cast<clp::streaming_compression::Decompressor>(*rit);
nullptr != decompressor)
{
decompressor->close();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to explicitly close other readers?

@gibber9809 gibber9809 Sep 19, 2025

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the ones we currently support don't even have a close method. Instead they do cleanup in the destructor conditional on being open.

@gibber9809 gibber9809 requested a review from wraymo September 19, 2025 15:09
wraymo
wraymo previously approved these changes Sep 19, 2025
@gibber9809 gibber9809 linked an issue Sep 19, 2025 that may be closed by this pull request

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

feat(clp-s): Automatically infer the file types of inputs to compression.

Comment thread docs/src/user-docs/core-clp-s.md Outdated
@gibber9809 gibber9809 changed the title feat(clp-s): Automatically infer the file-types of inputs to compression. feat(clp-s): Automatically infer the file types of inputs to compression. Sep 21, 2025

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to @wraymo's review.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4166c5d and e158243.

📒 Files selected for processing (1)
  • docs/src/user-docs/core-clp-s.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: build-macos (macos-14, false)
🔇 Additional comments (1)
docs/src/user-docs/core-clp-s.md (1)

16-18: Docs: explicitly call out auto‑detection and nested Zstd; tighten NDJSON wording.

Great bullets. Please:

  • Explicitly state that clp‑s auto‑detects JSON vs KV‑IR and automatically unwraps nested Zstd layers.
  • Note the failure mode and what users should do.
  • Optional wording polish: “newline‑delimited JSON (NDJSON)”.

Apply below diff to add a concise note and tweak wording:

-  * a new-line-delimited JSON (ndjson) log file (may be Zstd-compressed);
+  * a newline-delimited JSON (NDJSON) log file (may be Zstd-compressed);
   * a KV-IR file (may be Zstd-compressed);
+  
+:::{note}
+clp‑s automatically detects the input type (NDJSON or KV‑IR). If the input is Zstd‑compressed,
+including nested Zstd layers, clp‑s will transparently decompress and continue detection.  
+If deduction fails, clp‑s will exit with an error; pre‑decompress the input (e.g., `unzstd`) or
+ensure it is valid NDJSON or KV‑IR before retrying.
+:::

Comment on lines 61 to 65
```shell
AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
./clp-s c --single-file-archive --file-type kv-ir --auth s3 /mnt/data/archives \
./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

🧩 Analysis chain

Example update looks good; add JSONL.zst example and verify S3 filename extension.

Removing --file-type here is correct. Consider adding a JSONL.zst S3 example to demonstrate auto‑detection of compression. Also, please confirm that using .clp as an input filename for a KV‑IR object is intentional and won’t be confused with CLP single‑file archives.

Suggested addition right after this example:

 AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
   ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
   https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
+
+**Compress an NDJSON log on S3 that is Zstd‑compressed (auto‑detected):**
+
+```shell
+AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
+  ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
+  https://my-bucket.s3.us-east-2.amazonaws.com/app-logs.jsonl.zst
+```

To verify the filename/extension guidance across docs, you can grep for inputs that end in .clp outside decompression/search sections and confirm intent.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo
echo "1) Search for '.clp' occurrences (excluding common build/vendor dirs):"
rg -n --hidden -S '\.clp\b' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "2) Search for 'kv-ir' mentions:"
rg -n --hidden -S 'kv-?ir' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "3) Show docs/src/user-docs/core-clp-s.md (first 200 lines) if present:"
if [ -f docs/src/user-docs/core-clp-s.md ]; then
  sed -n '1,200p' docs/src/user-docs/core-clp-s.md
else
  echo "docs/src/user-docs/core-clp-s.md not found"
fi
echo
echo "4) Search for 'single-file-archive' mentions:"
rg -n --hidden -S 'single-file-archive' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true
echo
echo "5) Search for 'jsonl.zst' or 'jsonl.*zst' mentions:"
rg -n --hidden -S 'jsonl(\.zst|.*zst)' -g '!*node_modules/**' -g '!*dist/**' -g '!*build/**' -g '!*venv/**' || true

Length of output: 6248


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo
echo "1) Search for '.clp' occurrences (including ignored files):"
rg -uu -n --hidden '\.clp\b' || true
echo
echo "2) Search for 'kv-ir' occurrences:"
rg -uu -n --hidden 'kv-?ir' || true
echo
echo "3) Search for 'single-file-archive' mentions:"
rg -uu -n --hidden 'single-file-archive' || true
echo
echo "4) Search for 'jsonl.zst' mentions:"
rg -uu -n --hidden 'jsonl\.zst' || true
echo
echo "5) Show docs/src/user-docs/core-clp-s.md (first 200 lines) with line numbers:"
if [ -f docs/src/user-docs/core-clp-s.md ]; then
  nl -ba -w3 -s': ' docs/src/user-docs/core-clp-s.md | sed -n '1,200p'
else
  echo "docs/src/user-docs/core-clp-s.md not found"
fi

Length of output: 4626


Add JSONL.zst example; clarify KV‑IR filename extension

Add the JSONL.zst S3 example and clarify the KV‑IR filename in docs. The codebase defines the canonical IR file extension as ".clp.zst" (components/core/src/clp/ir/constants.hpp:7) but also treats a ".clp" stream suffix specially (components/core/src/clp/clp/FileCompressor.cpp — removes ".clp" suffix around line ~405). Update docs/src/user-docs/core-clp-s.md (example at ~lines 63–64) to use ".clp.zst" or add an explicit note that ".clp" is accepted to avoid confusion.

Suggested addition (unchanged):

 AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
   ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
   https://my-bucket.s3.us-east-2.amazonaws.com/kv-ir-log.clp
+
+**Compress an NDJSON log on S3 that is Zstd‑compressed (auto‑detected):**
+
+```shell
+AWS_ACCESS_KEY_ID='...' AWS_SECRET_ACCESS_KEY='...' \
+  ./clp-s c --single-file-archive --auth s3 /mnt/data/archives \
+  https://my-bucket.s3.us-east-2.amazonaws.com/app-logs.jsonl.zst
+```
🤖 Prompt for AI Agents
In docs/src/user-docs/core-clp-s.md around lines 61 to 65, the S3 example and
text are ambiguous about canonical KV‑IR filename extensions; update the example
to include a JSONL.zst S3 example (e.g. app-logs.jsonl.zst) and clarify that the
canonical IR extension is ".clp.zst" (per
components/core/src/clp/ir/constants.hpp) while noting that a ".clp" stream
suffix is also accepted/treated specially (per FileCompressor.cpp) so readers
aren’t confused; add the additional example line and a brief parenthetical note
about the accepted ".clp" suffix.

@gibber9809 gibber9809 merged commit 834e2e2 into y-scope:main Sep 21, 2025
28 of 29 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…ion. (y-scope#1312)

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add support for CLP archive compression for zstd compressed files

3 participants