feat(clp-package): Add support for compressing unstructured text logs using clp-json. by gibber9809 · Pull Request #1505 · y-scope/clp

gibber9809 · 2025-10-26T21:20:16Z

Description

This PR adds support for ingesting unstructured text logs into clp-json by leveraging the new log-converter binary.

This feature is exposed via the new --unstructured flag in the compress.py script. When this flag is used, the --timestamp-key flag is ignored, the logs passed to the compression script are treated as unstructured, and the logs are converted by log-converter in the compression task before by compressed using the timestamp key "timestamp".

This PR also includes some docs updates to explain the feature.

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Manually validated that the clp-json package can successfully compress JSON logs in the usual way, and unstructured text logs using the --unstructured flag from both S3 and the local filesystem.
Validated that updated docs render correctly.

Summary by CodeRabbit

New Features
- Added --unstructured flag to support unstructured text logs across local and S3 flows; inputs are auto-converted to JSON with extracted timestamp (timestamp field set to "timestamp" for unstructured).
Bug Fixes
- Supplying --timestamp-key with --unstructured emits a compatibility warning and ignores the provided key.
- Improved conversion error reporting and centralized cleanup of temporary artifacts on failure or success.
Documentation
- Updated quick-start with examples and guidance for unstructured log compression.

coderabbitai · 2025-10-26T21:20:24Z

Walkthrough

Adds a new --unstructured CLI flag across compressors and adapters; propagates it into FsInputConfig/S3InputConfig and scheduler schemas; for CLP_S unstructured inputs runs a pre-compression log-converter producing converted inputs, forces timestamp_key="timestamp", switches compressor input to converted dir, and centralizes temporary-file cleanup and related warnings.

Changes

Cohort / File(s)	Change Summary
CLI entrypoints `components/clp-package-utils/clp_package_utils/scripts/compress.py`, `components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py`	Add `--unstructured` CLI argument; validate compatibility with `--timestamp-key` (ignore `timestamp-key` when unstructured and emit a warning); only warn about missing `--timestamp-key` when not unstructured; propagate flag into generated native compress commands.
Native CLI adapter `components/clp-package-utils/.../scripts/native/compress.py`	Thread parsed `unstructured` into CLP IO config construction; include `--unstructured` in generated native command when set; extend `FsInputConfig`/`S3InputConfig` creation to accept `unstructured: bool`.
Scheduler config schema `components/job-orchestration/job_orchestration/scheduler/job_config.py`	Add `unstructured: bool = False` field to `FsInputConfig` and `S3InputConfig` schema definitions.
Compression executor `components/job-orchestration/job_orchestration/executor/compress/compression_task.py`	For CLP_S with `unstructured=true`: add pre-compression log-converter producing a converted inputs directory, force `timestamp_key="timestamp"` for converted inputs, switch compressor input to converted dir, handle converter failures by cleaning temp files and returning an error, centralize temp-file cleanup, add helper to build converter command/env, and adjust credential handling. Minor signature and env-return changes for CLP_S helper.
Documentation `docs/src/user-docs/quick-start/clp-json.md`	Document `--unstructured` usage, add example converting text logs to JSON objects with `timestamp` and `message`, and note that timestamp key is `"timestamp"` when using `--unstructured`.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI
    participant Config
    participant CompressionTask
    participant LogConverter
    participant Compressor

    User->>CLI: compress --unstructured <input>
    CLI->>Config: Build FsInputConfig/S3InputConfig (unstructured=true)
    Config->>CompressionTask: Start task with unstructured flag

    alt CLP_S and unstructured
        CompressionTask->>LogConverter: Run log-converter on input
        alt conversion succeeds
            LogConverter-->>CompressionTask: converted_inputs_dir
            Note right of CompressionTask #DFF2E1: force timestamp_key="timestamp"
            CompressionTask->>Compressor: Compress converted_inputs_dir
        else conversion fails
            LogConverter-->>CompressionTask: error (stderr)
            CompressionTask->>CompressionTask: cleanup temp files
            CompressionTask-->>User: return failure with error details
        end
    else non-unstructured or FS input
        CompressionTask->>Compressor: Compress original inputs (logs list)
    end

    Compressor-->>CompressionTask: compression complete
    CompressionTask->>CompressionTask: cleanup temp files
    CompressionTask-->>User: return success output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Review compression_task.py: converter command/env helper, S3 credential handling for converter, conversion exit handling, timestamp_key override, and cleanup ordering.
Verify unstructured propagation CLI → native adapter → scheduler config and serialization (job_config.py).
Check native command assembly and warnings in compress.py/compress_from_s3.py.
Validate documentation example and timestamp-key language.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The pull request title "feat(clp-package): Add support for compressing unstructured text logs using clp-json." is directly and fully related to the main objective of the changeset. The PR implements precisely what the title describes: it adds a new --unstructured flag across multiple CLI entry points (compress.py, native/compress.py, compress_from_s3.py), updates configuration classes to support the unstructured parameter, implements a log-converter workflow for pre-compression transformation, and includes documentation explaining the feature. The title is specific, clear, and concise, accurately capturing the primary feature from the developer's perspective without unnecessary noise or vague terminology.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e82bdb and 814dd23.

📒 Files selected for processing (5)

components/clp-package-utils/clp_package_utils/scripts/compress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3 hunks)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4 hunks)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1 hunks)
docs/src/user-docs/quick-start/clp-json.md (2 hunks)

🧰 Additional context used

🧠 Learnings (3)

📚 Learning: 2025-09-04T13:51:02.311Z

Learnt from: quinntaylormitchell
PR: y-scope/clp#1283
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:206-210
Timestamp: 2025-09-04T13:51:02.311Z
Learning: In components/clp-package-utils/clp_package_utils/scripts/compress.py, an empty string ("") is a valid value for the --timestamp-key parameter, so validation should not treat empty strings as missing values.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/compress.py

📚 Learning: 2025-09-28T15:00:22.170Z

Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

🧬 Code graph analysis (2)

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

S3Config (418-422)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

StorageEngine (94-96)

components/job-orchestration/job_orchestration/scheduler/constants.py (1)

CompressionTaskStatus (40-45)

🪛 LanguageTool

docs/src/user-docs/quick-start/clp-json.md

[uncategorized] ~62-~62: Possible missing preposition found.
Context: ...ct-storage/index). ::: ### Compression unstructured text logs To compress some unstructure...

(AI_HYDRA_LEO_MISSING_OF)

🪛 markdownlint-cli2 (0.18.1)

docs/src/user-docs/quick-start/clp-json.md

6-6: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

🪛 Ruff (0.14.1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

409-409: Missing return type annotation for private function cleanup_temporary_files

Add return type annotation: None

(ANN202)

422-422: subprocess call: check for execution of untrusted input

(S603)

430-430: Logging statement uses f-string

(G004)

430-430: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (12)

docs/src/user-docs/quick-start/clp-json.md (1)

70-72: Clarify timestamp extraction behavior.

The statement "attempt to extract the timestamp" could be more precise. Consider documenting what happens when timestamp extraction fails or when no timestamp is found in the log message.

Do you want to add details about:

What happens if timestamp extraction fails?

Whether logs without timestamps are still processed?

The expected timestamp formats?

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

29-29: LGTM! Clean addition of the unstructured field.

The unstructured boolean field is correctly added to both FsInputConfig and S3InputConfig with a safe default value of False, ensuring backward compatibility.

Also applies to: 37-37

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

231-235: LGTM! Clear argument definition.

The --unstructured flag is well-defined with appropriate help text.

150-156: LGTM! Correct propagation of the unstructured flag.

The unstructured parameter is correctly propagated from parsed arguments to both FsInputConfig and S3InputConfig.

Also applies to: 166-174

components/clp-package-utils/clp_package_utils/scripts/compress.py (3)

166-170: LGTM! Consistent argument definition.

The --unstructured argument definition matches the native compress script.

94-95: LGTM! Correct conditional command generation.

The --unstructured flag is correctly appended to the compress command when set.

213-223: LGTM! Proper validation logic for flag interaction.

The validation correctly handles the interaction between --timestamp-key and --unstructured:

Warns when timestamp-key is missing and unstructured is not set (preserving existing behavior)

When both are provided, unstructured takes precedence with a clear warning

The logic respects the learning that empty strings are valid for timestamp-key.

Based on learnings

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (5)

5-5: LGTM! Required import for cleanup functionality.

The shutil module is correctly imported for removing the converted inputs directory.

302-307: LGTM! Correct conditional timestamp-key handling.

The logic correctly:

Forces timestamp_key to "timestamp" when unstructured mode is enabled

Falls back to the user-provided timestamp_key for structured inputs

Preserves existing behavior when timestamp_key is None

This aligns with the documentation stating that unstructured logs always use "timestamp" as the timestamp key.

392-407: LGTM! Clean conversion setup for unstructured inputs.

The conversion workflow is correctly structured:

Only activates for CLP_S with unstructured input

Creates a temporary directory for converted files

Builds the log-converter command with appropriate arguments

Passes converted directory to compression instead of using --files-from

The conditional logic ensures backward compatibility for structured inputs.

419-437: LGTM! Robust conversion error handling.

The conversion execution and error handling is well-structured:

Runs conversion subprocess with proper logging

Waits for completion and captures return code

On failure: performs cleanup, logs error, and returns structured failure output

Error message directs users to stderr log for debugging

The worker output structure matches the success case, ensuring consistent handling upstream.

537-537: LGTM! Proper cleanup after compression.

The cleanup function is correctly called after compression completes, ensuring temporary files are removed regardless of compression success or failure.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

docs/src/user-docs/quick-start/clp-json.md (2)
6-6: Remove trailing space.

There's a trailing space at the end of line 6 that should be removed for proper markdown formatting.

Apply this diff:
-If you're using a `clp-json` release, we only support ingesting unstructured text logs by 
+If you're using a `clp-json` release, we only support ingesting unstructured text logs by
62-62: Fix grammatical error in heading.

The heading is missing a verb or preposition. It should be "Compressing unstructured text logs" or "Compression of unstructured text logs".

Apply this diff:
-### Compression unstructured text logs
+### Compressing unstructured text logs
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
409-413: Add return type annotation and use explicit None checks.

Two issues flagged in previous reviews remain:

Missing return type annotation for the function

Using truthiness checks on Path objects, which are always truthy regardless of file existence

Apply this diff:
-    def cleanup_temporary_files():
-        if logs_list_path is not None:
+    def cleanup_temporary_files() -> None:
+        if logs_list_path is not None:
             logs_list_path.unlink()
-        if converted_inputs_path is not None:
+        if converted_inputs_path is not None:
             shutil.rmtree(converted_inputs_path)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 814dd23 and cf37dc8.

📒 Files selected for processing (2)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4 hunks)
docs/src/user-docs/quick-start/clp-json.md (2 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-09-28T15:00:22.170Z

Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

🧬 Code graph analysis (1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

StorageEngine (94-96)

components/job-orchestration/job_orchestration/scheduler/constants.py (1)

CompressionTaskStatus (40-45)

🪛 Ruff (0.14.1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

409-409: Missing return type annotation for private function cleanup_temporary_files

Add return type annotation: None

(ANN202)

422-422: subprocess call: check for execution of untrusted input

(S603)

430-430: Logging statement uses f-string

(G004)

430-430: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (ubuntu-24.04)
GitHub Check: lint-check (macos-15)

🔇 Additional comments (5)

docs/src/user-docs/quick-start/clp-json.md (1)

64-80: Clear documentation of the unstructured mode.

The documentation clearly explains the --unstructured flag, provides a concrete example of the log conversion, and notes the fixed timestamp key behaviour. This aligns well with the implementation changes.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4)

5-5: Appropriate import for cleanup operations.

The shutil import is correctly added to support removal of the temporary converted inputs directory.

302-307: Correct timestamp-key handling for unstructured mode.

The logic correctly forces the timestamp key to "timestamp" when processing unstructured logs, aligning with the documented behaviour. For structured inputs, it preserves the existing logic.

419-437: Robust conversion error handling.

The conversion execution properly checks the return code, performs cleanup on failure, and returns a structured error response with zeroed sizes. The early return pattern prevents unnecessary compression attempts when conversion fails.

537-537: Proper cleanup after compression.

The cleanup call is correctly placed after compression completes, ensuring temporary files are removed in both success and failure scenarios.

coderabbitai · 2025-10-26T21:49:54Z

+    conversion_cmd = None
+    converted_inputs_path = None
+    if StorageEngine.CLP_S == clp_storage_engine and clp_config.input.unstructured:
+        converted_inputs_path = tmp_dir / f"{instance_id_str}-converted-tmp"
+        converted_inputs_path.mkdir()
+        conversion_cmd = [
+            str(clp_home / "bin" / "log-converter"),
+            "--inputs-from",
+            str(logs_list_path),
+            "--output-dir",
+            str(converted_inputs_path),
+        ]
+        compression_cmd.append(str(converted_inputs_path))
+    else:
+        compression_cmd.append("--files-from")
+        compression_cmd.append(str(logs_list_path))


🧹 Nitpick | 🔵 Trivial

Conversion setup logic is correct.

The conversion workflow is properly gated to CLP_S with unstructured input, creates a uniquely named temporary directory, and correctly switches the compression input source based on whether conversion is needed.

For defensive coding, consider adding exist_ok=True to the mkdir call:

- converted_inputs_path.mkdir() + converted_inputs_path.mkdir(exist_ok=True)

While the unique instance_id_str makes collisions extremely unlikely, this would prevent unexpected failures in edge cases.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

conversion_cmd = None

converted_inputs_path = None

if StorageEngine.CLP_S == clp_storage_engine and clp_config.input.unstructured:

converted_inputs_path = tmp_dir / f"{instance_id_str}-converted-tmp"

converted_inputs_path.mkdir()

conversion_cmd = [

str(clp_home / "bin" / "log-converter"),

"--inputs-from",

str(logs_list_path),

"--output-dir",

str(converted_inputs_path),

]

compression_cmd.append(str(converted_inputs_path))

else:

compression_cmd.append("--files-from")

compression_cmd.append(str(logs_list_path))

conversion_cmd = None

converted_inputs_path = None

if StorageEngine.CLP_S == clp_storage_engine and clp_config.input.unstructured:

converted_inputs_path = tmp_dir / f"{instance_id_str}-converted-tmp"

converted_inputs_path.mkdir(exist_ok=True)

conversion_cmd = [

str(clp_home / "bin" / "log-converter"),

"--inputs-from",

str(logs_list_path),

"--output-dir",

str(converted_inputs_path),

]

compression_cmd.append(str(converted_inputs_path))

else:

compression_cmd.append("--files-from")

compression_cmd.append(str(logs_list_path))

🤖 Prompt for AI Agents

In components/job-orchestration/job_orchestration/executor/compress/compression_task.py around lines 392 to 407, the temporary directory creation uses converted_inputs_path.mkdir() and should be made defensive to avoid rare race/collision failures; change the call to converted_inputs_path.mkdir(exist_ok=True) (you may also add parents=True if nested directories are possible) so mkdir does not raise if the directory already exists.

LinZhihao-723

We've made changes to the compression script in #1476 and will merge it soon. compress-from-s3.py will also need to be updated, as we split the S3 compression from the original compress.py script.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf37dc8 and b80b5ea.

📒 Files selected for processing (4)

components/clp-package-utils/clp_package_utils/scripts/compress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4 hunks)
docs/src/user-docs/quick-start/clp-json.md (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (9)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

343-347: LGTM: CLI argument correctly defined.

The --unstructured flag is properly configured with action="store_true" and includes clear help text.

160-160: LGTM: Unstructured flag consistently propagated.

The unstructured parameter is correctly passed to all input config types (FsInputConfig and both S3InputConfig branches). The validation of the interaction between --timestamp-key and --unstructured is appropriately handled in the calling scripts before reaching this point.

Also applies to: 184-184, 200-200

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (3)

183-187: LGTM: CLI argument correctly defined.

The --unstructured flag matches the definition in other scripts, ensuring consistency across the codebase.

100-101: LGTM: Conditional command generation is correct.

The --unstructured flag is appropriately appended to the compress command only when set.

263-273: LGTM: Validation logic correctly handles flag interaction.

The implementation properly addresses the mutual exclusivity of --timestamp-key and --unstructured:

Warns when neither is provided (line 263)

Warns and nullifies timestamp_key when both are provided (lines 268-273)

This ensures clear user expectations and prevents conflicting configurations.

components/clp-package-utils/clp_package_utils/scripts/compress.py (3)

153-157: LGTM: CLI argument correctly defined.

The --unstructured flag is consistent with the definitions in other scripts.

97-98: LGTM: Conditional command generation is correct.

The --unstructured flag is appropriately appended to the compress command only when set.

208-218: LGTM: Validation logic correctly handles flag interaction.

The validation logic is identical to compress_from_s3.py and correctly handles the mutual exclusivity of --timestamp-key and --unstructured. The logic is appropriately scoped within the CLP_S storage engine check since timestamp handling is specific to that engine.

Note: This validation logic is duplicated between this file and compress_from_s3.py. The codebase already contains TODO comments acknowledging this duplication (lines 253-254 in compress_from_s3.py, lines 280-281 in this file, per the AI summary context).

docs/src/user-docs/quick-start/clp-json.md (1)

6-7: LGTM: Caution text updated correctly.

The revised caution text accurately reflects that clp-json supports ingesting unstructured text logs by converting them to JSON.

LinZhihao-723

Overall lgtm. Some small fixes + docs polishing

LinZhihao-723 · 2025-10-27T20:12:01Z


-    if InputType.S3 == clp_config.input.type:
-        compression_env_vars = dict(os.environ)
+    compression_env_vars = dict(os.environ)


Might be worth to update the return annotation since it doens't return None now.

LinZhihao-723 · 2025-10-27T20:17:38Z

-    compression_cmd.append("--files-from")
-    compression_cmd.append(str(logs_list_path))
+    conversion_cmd = None
+    converted_inputs_path = None


How about converted_inputs_dir?

LinZhihao-723 · 2025-10-27T20:32:01Z

    # Wait for compression to finish
    return_code = proc.wait()
+
+    cleanup_temporary_files()


Preferred to move inside else branch:

if 0 != return_code: logger.error(f"Failed to compress, return_code={str(return_code)}") else: compression_successful = True cleanup_temporary_files()

As if the compression failed, we might want to keep the converted inputs and the logs list file for debugging.

Sure, we can do that for now. It feels like something that's useful for us but not end users though.

LinZhihao-723 · 2025-10-27T20:55:51Z

+To compress some unstructured text logs, run:
+
+```bash
+sbin/compress.sh --unstructured <path1> [<path2> ...]
+```
+
+The `--unstructured` flag will make clp-json treat your logs as unstructured text, and convert them
+into JSON. The conversion process will attempt to extract the timestamp from the log message, and
+store the timestamp and message in separate fields.
+* E.g., if your log event looks like
+  `2024-01-01 00:01:02.345 ERROR...`, it will be converted to
+  `{"timestamp": "2024-01-01 00:01:02.345", "message": " ERROR..."}`


How about rewriting in the following way (I can't comment it through suggestion due to formatting issue):

clp-json supports compressing unstructured text logs by converting them into JSON format. To enable
this mode, run the compression script with the --unstructured flag:

sbin/compress.sh --unstructured <path1> [<path2> ...]

When --unstructured is specified, clp-json will parse the unstructured text and convert each log
event into a JSON object. During this conversion, it attempts to extract the timestamp from each log
message and store it as a separate key-value pair. The resulting JSON object includes:

"timestamp": The extracted timestamp, stored as its original string representation to preserve
the format.

"message": The original log message.

For example, a log event like:

2024-01-01 00:01:02.345 ERROR...

will be converted into:

{ "timestamp": "2024-01-01 00:01:02.345", "message": " ERROR..." }

Yeah, this way of writing it reads very clearly -- will update with very slightly edited wording from what you put here.

LinZhihao-723 · 2025-10-27T20:56:17Z

-If you're using a `clp-json` release, you can only compress and search JSON logs. This limitation
-will be addressed in a future version of CLP.
+If you're using a `clp-json` release, we only support ingesting unstructured text logs by
+converting them to JSON. This limitation will be addressed in a future version of CLP.


Do we still need this "caution" section? I think with the below rewriting, we can drop this section.

I could maybe downgrade it to a note? I think its a somewhat notable limitation, so I want to make the limitation clear right at the start of the doc.

Sure make sense. note should be good.

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
356-371: Update return description in docstring.

The function returns (CompressionTaskStatus, worker_output), not “whether compression was successful”.
-    :return: tuple -- (whether compression was successful, output messages)
+    :return: Tuple[CompressionTaskStatus, Dict[str, Any]] -- (status, worker_output)
343-371: Consider encapsulating flow in a class to manage state and cleanup.

A ClpCompressor (holding logger, env, paths) would reduce parameter passing and centralize cleanup via context management. Based on learnings.

♻️ Duplicate comments (3)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3)

423-437: Make temp dir creation defensive; confirm clp-s positional input path.

Use exist_ok=True to avoid rare collisions or leftovers after retries.
Please confirm clp-s expects the converted directory as a positional input (no flag). If it actually requires a flag, this will fail.

-        converted_inputs_path.mkdir()
+        converted_inputs_path.mkdir(exist_ok=True)

Would you confirm the CLI accepts clp-s c <archive_dir> <converted_inputs_dir> without --files-from?

438-443: Add return type and make cleanup idempotent.

Annotate return type and tolerate already-removed files/dirs.

-    def cleanup_temporary_files():
-        if logs_list_path is not None:
-            logs_list_path.unlink()
-        if converted_inputs_path is not None:
-            shutil.rmtree(converted_inputs_path)
+    def cleanup_temporary_files() -> None:
+        if logs_list_path is not None:
+            # Be tolerant if the file was already removed
+            try:
+                logs_list_path.unlink(missing_ok=True)
+            except TypeError:
+                if logs_list_path.exists():
+                    logs_list_path.unlink()
+        if converted_inputs_path is not None:
+            shutil.rmtree(converted_inputs_path, ignore_errors=True)

566-571: Only clean up on success; fix logging style.

Retain artefacts on failure for debugging, and avoid f-string logging.

-    cleanup_temporary_files()
-    if 0 != return_code:
-        logger.error(f"Failed to compress, return_code={str(return_code)}")
-    else:
-        compression_successful = True
+    if 0 != return_code:
+        logger.error("Failed to compress, return_code=%d", return_code)
+    else:
+        compression_successful = True
+        cleanup_temporary_files()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6316189 and 9b7ce01.

📒 Files selected for processing (1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-28T15:00:22.170Z

Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

🧬 Code graph analysis (1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (5)

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

InputType (11-13)

ClpIoConfig (56-58)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

get_credential_env_vars (61-99)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

StorageEngine (95-97)

integration-tests/tests/utils/utils.py (1)

unlink (54-75)

components/job-orchestration/job_orchestration/scheduler/constants.py (1)

CompressionTaskStatus (40-45)

🪛 Ruff (0.14.1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

438-438: Missing return type annotation for private function cleanup_temporary_files

Add return type annotation: None

(ANN202)

451-451: subprocess call: check for execution of untrusted input

(S603)

459-459: Logging statement uses f-string

(G004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (ubuntu-24.04)
GitHub Check: lint-check (macos-15)

🔇 Additional comments (3)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3)

291-296: S3 auth scoping for CLP_S structured inputs looks correct.

Credentials only added when input is S3 and not unstructured; converter gets creds instead. Good separation.

300-305: Correct to force timestamp_key="timestamp" for unstructured inputs.

Ensures consistent key post-conversion; preserves original behaviour for structured inputs.

451-454: Subprocess input trust — validate CLP_HOME provenance.

You’re using exec-list form (good) and not shell=True. Please confirm CLP_HOME is controlled by the worker and not user-influenced; otherwise restrict/validate it to a trusted path.

coderabbitai · 2025-10-27T21:14:02Z

+    if conversion_return_code != 0:
+        cleanup_temporary_files()
+        logger.error(
+            f"Failed to convert unstructured log text, return_code={conversion_return_code}"
+        )
+        worker_output = {
+            "total_uncompressed_size": 0,
+            "total_compressed_size": 0,
+            "error_message": f"Check logs in {stderr_log_path}",
+        }
+        return CompressionTaskStatus.FAILED, worker_output
+


⚠️ Potential issue | 🟠 Major

Close stderr file on early conversion failure; switch to parameterized logging.

Avoids fd leak and aligns with logging best practices.

- if conversion_return_code != 0: - cleanup_temporary_files() - logger.error( - f"Failed to convert unstructured log text, return_code={conversion_return_code}" - ) - worker_output = { - "total_uncompressed_size": 0, - "total_compressed_size": 0, - "error_message": f"Check logs in {stderr_log_path}", - } - return CompressionTaskStatus.FAILED, worker_output + if conversion_return_code != 0: + cleanup_temporary_files() + logger.error( + "Failed to convert unstructured log text, return_code=%d", + conversion_return_code, + ) + worker_output = { + "total_uncompressed_size": 0, + "total_compressed_size": 0, + "error_message": f"Check logs in {stderr_log_path}", + } + stderr_log_file.close() + return CompressionTaskStatus.FAILED, worker_output

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if conversion_return_code != 0:

cleanup_temporary_files()

logger.error(

f"Failed to convert unstructured log text, return_code={conversion_return_code}"

)

worker_output = {

"total_uncompressed_size": 0,

"total_compressed_size": 0,

"error_message": f"Check logs in {stderr_log_path}",

}

return CompressionTaskStatus.FAILED, worker_output

if conversion_return_code != 0:

cleanup_temporary_files()

logger.error(

"Failed to convert unstructured log text, return_code=%d",

conversion_return_code,

)

worker_output = {

"total_uncompressed_size": 0,

"total_compressed_size": 0,

"error_message": f"Check logs in {stderr_log_path}",

}

stderr_log_file.close()

return CompressionTaskStatus.FAILED, worker_output

🧰 Tools

🪛 Ruff (0.14.1)

459-459: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents

components/job-orchestration/job_orchestration/executor/compress/compression_task.py around lines 456-467: when conversion_return_code != 0 you must close the stderr file handle to avoid an fd leak and use parameterized logging instead of f-strings; before calling cleanup_temporary_files() or returning, call the stderr file's close method (e.g., stderr_file.close() or the actual variable name used where stderr was opened), and replace logger.error(f"Failed to convert unstructured log text, return_code={conversion_return_code}") with a parameterized call like logger.error("Failed to convert unstructured log text, return_code=%d", conversion_return_code) so logs are emitted safely.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
438-442: Add return type annotation and use explicit is not None checks.

The cleanup function has two issues already flagged in previous reviews:

Missing return type annotation (line 438)

Should use explicit is not None checks instead of truthiness checks (lines 439, 441)

Since pathlib.Path objects are always truthy, the current checks don't effectively distinguish between None and a Path object.

Based on static analysis

Apply this diff:
-    def cleanup_temporary_files():
-        if logs_list_path:
+    def cleanup_temporary_files() -> None:
+        if logs_list_path is not None:
             logs_list_path.unlink()
-        if converted_inputs_dir:
+        if converted_inputs_dir is not None:
             shutil.rmtree(converted_inputs_dir)
448-467: Use parameterized logging for the error message.

The stderr file close (line 466) correctly addresses the fd leak concern from previous reviews. However, the logging statement at line 459 still uses an f-string, which should be changed to parameterized logging for consistency and best practices.

Based on static analysis

Apply this diff:
         logger.error(
-            f"Failed to convert unstructured log text, return_code={conversion_return_code}"
+            "Failed to convert unstructured log text, return_code=%d",
+            conversion_return_code,
         )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b7ce01 and 354fecc.

📒 Files selected for processing (2)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (5 hunks)
docs/src/user-docs/quick-start/clp-json.md (2 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-09-28T15:00:22.170Z

Learnt from: LinZhihao-723
PR: y-scope/clp#1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

🧬 Code graph analysis (1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4)

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

InputType (11-13)

ClpIoConfig (56-58)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

get_credential_env_vars (61-99)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

StorageEngine (95-97)

components/job-orchestration/job_orchestration/scheduler/constants.py (1)

CompressionTaskStatus (40-45)

🪛 LanguageTool

docs/src/user-docs/quick-start/clp-json.md

[uncategorized] ~81-~81: Loose punctuation mark.
Context: ...g JSON object includes: * "timestamp": The extracted timestamp, stored as its ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~83-~83: Loose punctuation mark.
Context: ...ve the timestamp format. * "message": The original log message. For example,...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Ruff (0.14.1)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py

438-438: Missing return type annotation for private function cleanup_temporary_files

Add return type annotation: None

(ANN202)

451-451: subprocess call: check for execution of untrusted input

(S603)

459-459: Logging statement uses f-string

(G004)

568-568: Logging statement uses f-string

(G004)

568-568: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (ubuntu-24.04)
GitHub Check: lint-check (macos-15)

🔇 Additional comments (7)

docs/src/user-docs/quick-start/clp-json.md (1)

67-103: Documentation for unstructured logs feature is clear and well-structured.

The new section comprehensively documents the --unstructured flag, including usage syntax, behaviour, JSON transformation details, a concrete example, and the important note about the fixed timestamp key. The writing is clear and accessible to users.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (6)

5-5: LGTM!

The shutil import is necessary for the temporary directory cleanup functionality added later in the file.

270-270: LGTM!

The return type annotation correctly reflects that compression_env_vars is now always initialized (line 291), never None. This improves type safety and addresses the maintainer's feedback.

291-306: LGTM!

The credential and timestamp-key logic correctly handles the unstructured input case:

S3 credentials are excluded from compression environment for unstructured inputs (line 292) because the log-converter handles S3 access, while the compressor only reads the converted local files.

Timestamp-key is forced to "timestamp" for unstructured inputs (lines 300-303), matching the documented behavior.

Existing structured input behavior is preserved (lines 303-305).

310-340: LGTM!

The log-converter command builder correctly constructs the command and environment variables, including S3 authentication when needed. The implementation follows the same pattern as _make_clp_s_command_and_env.

The maintainer's suggestions for docstring improvements are reasonable refinements but not critical.

423-436: LGTM!

The conversion setup correctly gates the conversion workflow to CLP_S with unstructured input and appropriately switches between directory ingestion (line 433) and file-list ingestion (line 436).

Based on learnings

571-571: LGTM!

The cleanup is correctly placed in the success path only, preserving temporary files for debugging when compression fails. This aligns with the maintainer's feedback and provides a better debugging experience.

LinZhihao-723

Some formatting updates. Otherwise lgtm. I will approve afterwards.

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 354fecc and 1d089a6.

📒 Files selected for processing (1)

docs/src/user-docs/quick-start/clp-json.md (2 hunks)

🧰 Additional context used

🪛 LanguageTool

docs/src/user-docs/quick-start/clp-json.md

[uncategorized] ~81-~81: Loose punctuation mark.
Context: ...g JSON object includes: * "timestamp": The extracted timestamp, stored as its ...

(UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~83-~83: Loose punctuation mark.
Context: ...ve the timestamp format. * "message": The original log message. For example,...

(UNLIKELY_OPENING_PUNCTUATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (1)

docs/src/user-docs/quick-start/clp-json.md (1)

67-103: New section is well-structured and clearly explains the feature.

The documentation for unstructured text log compression is clear, practical, and includes a helpful example. The note about the forced "timestamp" key provides important context for users.

Note: Static analysis flags at lines 81 and 83 about "loose punctuation" are false positives—these are standard markdown backticks for inline code formatting (e.g., "timestamp"), which is correct.

LinZhihao-723

For the PR title, how about:

feat(clp-package): Add support for compressing unstructured text logs using clp-json.

… using clp-json. (y-scope#1505) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

gibber9809 added 4 commits October 26, 2025 00:11

Integrate log-converter into clp-json package.

2b7bcb4

Properly forward unstructured argument; fix bugs in compression task.

92caa80

Merge remote-tracking branch 'upstream/main' into clp-json-log-converter

70958f6

Adds docs for --unstructured flag.

814dd23

gibber9809 requested a review from a team as a code owner October 26, 2025 21:20

coderabbitai Bot reviewed Oct 26, 2025

View reviewed changes

Address rabbit comments

cf37dc8

coderabbitai Bot reviewed Oct 26, 2025

View reviewed changes

LinZhihao-723 reviewed Oct 26, 2025

View reviewed changes

gibber9809 added 2 commits October 27, 2025 14:31

Merge remote-tracking branch 'upstream/main' into clp-json-log-converter

6ae8305

Add --unstructured option to new compression script

b80b5ea

coderabbitai Bot reviewed Oct 27, 2025

View reviewed changes

Comment thread docs/src/user-docs/quick-start/clp-json.md Outdated

gibber9809 added 3 commits October 27, 2025 15:03

Accept rabbit comment for improving docs.

340b978

Fix s3 support for log-converter in compression task.

17a719f

Fix bug in compression task.

6316189

gibber9809 requested a review from LinZhihao-723 October 27, 2025 18:19

LinZhihao-723 requested changes Oct 27, 2025

View reviewed changes

LinZhihao-723 and others added 2 commits October 27, 2025 16:57

Merge branch 'main' into clp-json-log-converter

ab3492f

Apply suggestions from code review

9b7ce01

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

coderabbitai Bot reviewed Oct 27, 2025

View reviewed changes

gibber9809 added 2 commits October 27, 2025 21:19

Address review comments.

da10ec8

Address rabbit comment.

354fecc

gibber9809 requested a review from LinZhihao-723 October 27, 2025 21:23

coderabbitai Bot reviewed Oct 27, 2025

View reviewed changes

Comment thread docs/src/user-docs/quick-start/clp-json.md

LinZhihao-723 requested changes Oct 27, 2025

View reviewed changes

Comment thread docs/src/user-docs/quick-start/clp-json.md Outdated

Comment thread docs/src/user-docs/quick-start/clp-json.md Outdated

Comment thread docs/src/user-docs/quick-start/clp-json.md Outdated

gibber9809 and others added 2 commits October 27, 2025 17:32

Apply suggestions from code review

db74332

Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

Fix grammatical error spotted by rabbit.

5f0abda

gibber9809 requested a review from LinZhihao-723 October 27, 2025 21:34

Make section header size consistent

1d089a6

coderabbitai Bot reviewed Oct 27, 2025

View reviewed changes

Comment thread docs/src/user-docs/quick-start/clp-json.md

Make note about clp-json limitations more clear.

1d72254

LinZhihao-723 approved these changes Oct 27, 2025

View reviewed changes

gibber9809 changed the title ~~feat(clp-package): Add support for ingesting unstructured text logs in clp-json.~~ feat(clp-package): Add support for compressing unstructured text logs using clp-json. Oct 27, 2025

gibber9809 merged commit 8691e36 into y-scope:main Oct 27, 2025
20 checks passed

junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026

feat(clp-package): Add support for compressing unstructured text logs…

5fb4239

… using clp-json. (y-scope#1505) Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>

Conversation

gibber9809 commented Oct 26, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Validation performed

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Oct 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 26, 2025

Choose a reason for hiding this comment

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 27, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

gibber9809 commented Oct 26, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Oct 26, 2025 •

edited

Loading