Skip to content

feat(clp-package)!: Add compress-from-s3 Python and shell scripts for S3-object and S3-key-prefix compression; Restrict compress.py to accept local file paths only.#1476

Merged
LinZhihao-723 merged 14 commits into
y-scope:mainfrom
Eden-D-Zhang:s3-compression-script
Oct 26, 2025

Conversation

@Eden-D-Zhang

@Eden-D-Zhang Eden-D-Zhang commented Oct 23, 2025

Copy link
Copy Markdown
Contributor

Description

This PR adds a new compress-from-3.sh CLI script that enables compressing logs directly from S3 object storage. The script provides two modes of operation:

  • s3-object: Compress specific S3 objects by providing their full URLs. Supports multiple objects that must be in the same region and bucket.
  • s3-key-prefix: Compress all objects under a given S3 key prefix.

Changes:

  • Added compress-from-s3 script to handle S3 log ingestion separate from filesystem ingestion.
  • Modified existing compress script to ingest exclusively from the filesystem.
  • Modified native compress.py script to handle new S3 ingestion options.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Verified that when logs_input is configured to fs:
    • The compress-from-s3.sh script fails to run.
    • compress.sh ingests from the filesystem successfully.
  • Verified that when logs_input is configured to s3:
    • The compress.sh script fails to run.
    • compress-from-s3.sh s3-object successfully ingests one or more S3 URLs (that share a bucket and prefix).
    • compress-from-s3.sh s3-key-prefix successfully ingests all objects under the key prefix in an S3 URL.

Latest Validation

compress-from-s3.py:

From cli directly:

  • Single object URL → success
  • Multiple object URLs with a shared prefix → success
  • Multiple object URLs without a shared prefix → fail
  • Single prefix URL → success
  • Multiple prefix URLs → fail
  • Single prefix URL but it’s an object → success

From input file:

  • Single object URL → success
  • Multiple object URLs with a shared prefix → success
  • Multiple object URLs without a shared prefix → fail
  • Single prefix URL → success
  • Multiple prefix URLs → fail
  • Multiple prefix URLs → fail
  • Single prefix URL but it’s an object → success
  • Empty input file (for both object compression and key-prefix compression) → fail

Additional cases:

  • From both positional and input file → fail
  • Given both an input file and positional args → fail
  • Given both s3-object and s3-key-prefix → fail (treated as previous)

compress.py (for both clp and clp-s storage engine):

From cli directly:

  • Single path → success
  • Multiple path → success

From input file:

  • Single path → success
  • Multiple path → success
  • No path → fail

Additional cases:

  • From both positional and input file → fail

Summary by CodeRabbit

  • New Features

    • S3-based log compression with object and key-prefix modes and a new command to run it.
    • Lightweight wrapper script to invoke the S3 compression workflow.
    • New CLI option to explicitly choose FS or S3 input mode.
  • Refactor

    • Simplified filesystem compression flow with unified FS validation and mounts.
    • Centralized S3 handling into a dedicated S3 compression path.
  • Bug Fixes

    • Improved error handling, exit codes and more reliable cleanup of temporary files.

@Eden-D-Zhang Eden-D-Zhang requested a review from a team as a code owner October 23, 2025 04:34
@coderabbitai

coderabbitai Bot commented Oct 23, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Filesystem compressor was simplified to FS-only; a new S3 orchestrator script was added to handle containerized S3 workflows; the native compressor gained an --input-type option with S3 parsing and AWS-auth helpers; a shell entrypoint and two S3 constants were introduced.

Changes

Cohort / File(s) Summary
Filesystem compress script
components/clp-package-utils/clp_package_utils/scripts/compress.py
Removed InputType/S3 branching; _generate_logs_list(container_logs_list_path: pathlib.Path, parsed_args: argparse.Namespace) -> bool now filesystem-focused and returns bool; enforces StorageType.FS at runtime; always mounts input logs dir; injects --input-type fs into container command; adjusts error handling and cleanup.
S3 orchestrator script (new)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
New script implementing S3-only orchestration with subcommands s3-object and s3-key-prefix; provides URL-list generation, compression command construction (--input-type s3), validators, CLP config/DB creds loading, container config/mounts, container execution, exit-code logging, and temp-file cleanup; exposes helpers and main(argv).
Native compress module (multi-source)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
Added CLI --input-type (fs
Shell wrapper (new)
components/package-template/src/sbin/compress-from-s3.sh
New wrapper script that sets PYTHONPATH and runs python3 -m clp_package_utils.scripts.compress_from_s3 "$@".
Constants added
components/clp-package-utils/clp_package_utils/general.py
Added S3_KEY_PREFIX_COMPRESSION = "s3-key-prefix" and S3_OBJECT_COMPRESSION = "s3-object" constants.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant compress_from_s3 as compress_from_s3.py
    participant Container
    participant native_compress as native/compress.py

    User->>compress_from_s3: Run s3-object / s3-key-prefix subcommand
    compress_from_s3->>compress_from_s3: Validate args, load CLP config, ensure StorageType S3 & engine CLP_S
    compress_from_s3->>compress_from_s3: Generate URL list file (temp)
    compress_from_s3->>compress_from_s3: Build container config + command (--input-type s3)
    compress_from_s3->>Container: Start container with mounts + URL list
    Container->>native_compress: Invoke native/compress.py --input-type s3 --logs-list <path>
    native_compress->>native_compress: Parse/validate S3 URLs, get AWS auth, build S3InputConfig
    native_compress->>native_compress: Perform compression
    Container-->>compress_from_s3: Return exit code
    compress_from_s3->>compress_from_s3: Cleanup temp files
    compress_from_s3-->>User: Report result
Loading
sequenceDiagram
    participant User
    participant compress as compress.py
    participant Container
    participant native_compress as native/compress.py

    User->>compress: Run filesystem compression
    compress->>compress: Validate FS inputs, generate logs list (container path)
    compress->>compress: Build container config + command (--input-type fs)
    compress->>Container: Start container with mounts + logs-list
    Container->>native_compress: Invoke native/compress.py --input-type fs --logs-list <path>
    native_compress->>native_compress: Build FsInputConfig from provided paths
    native_compress->>native_compress: Perform compression
    Container-->>compress: Return exit code
    compress-->>User: Report result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review S3 URL parsing, deduplication and region/bucket/key_prefix logic in native/compress.py.
  • Verify AWS authentication extraction and error paths in native/compress.py.
  • Check argument validation, subcommand dispatch, and URL-list writing in compress_from_s3.py.
  • Confirm container command construction, mounts, --input-type consistency, and temp-file create/cleanup in compress.py and compress_from_s3.py.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title accurately captures the main objectives and changes in the changeset. It clearly identifies the two primary actions: adding new compress-from-s3 scripts (both Python and shell) with support for two compression modes (S3-object and S3-key-prefix), and restricting the existing compress.py to filesystem-only ingestion. The title directly maps to the key modifications reflected in the summaries—the creation of compress_from_s3.py and compress-from-s3.sh, the refactored compress.py with StorageType validation, and the updated native compress.py with new string-based input-type handling. The use of the breaking-change indicator (!) appropriately signals the architectural shift in responsibility between scripts, and the specific naming of compression modes demonstrates precision about functionality.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py (3)

33-33: Use module logger name.

Prefer name for logger hierarchy and filters.

-logger = logging.getLogger(__file__)
+logger = logging.getLogger(__name__)

171-173: Avoid bare except; capture the exception.

Catching all exceptions hides interruptions and makes debugging harder.

-    except:
-        logger.exception("Failed to load config.")
-        return -1
+    except Exception:
+        logger.exception("Failed to load config.")
+        return -1

240-251: Clean up temp files reliably (config + logs list).

logs list files accumulate; remove them in finally. Also prefer logger formatting for debug line.

-    proc = subprocess.run(cmd)
-    ret_code = proc.returncode
-    if 0 != ret_code:
-        logger.error("Compression failed.")
-        logger.debug(f"Docker command failed: {shlex.join(cmd)}")
-
-    # Remove generated files
-    generated_config_path_on_host.unlink()
-
-    return ret_code
+    try:
+        proc = subprocess.run(cmd)
+        ret_code = proc.returncode
+        if ret_code != 0:
+            logger.error("Compression failed.")
+            logger.debug("Docker command failed: %s", shlex.join(cmd))
+        return ret_code
+    finally:
+        try:
+            generated_config_path_on_host.unlink()
+        except Exception:
+            logger.debug("Failed to remove generated config file: %s", generated_config_path_on_host)
+        try:
+            container_logs_list_path.unlink()
+        except Exception:
+            logger.debug("Failed to remove logs list file: %s", container_logs_list_path)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

41-41: Use module logger name.

-logger = logging.getLogger(__file__)
+logger = logging.getLogger(__name__)

136-137: Log full traceback on job failure.

-        except Exception as ex:
-            logger.error(ex)
+        except Exception:
+            logger.exception("Compression job failed.")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 237ba42 and f0b6878.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (8 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (5 hunks)
  • components/package-template/src/sbin/compress-from-s3.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (141-143)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
  • StorageEngine (116-118)
  • StorageType (141-143)
  • validate_logs_dir (661-665)
  • get_clp_connection_params_and_type (220-241)
components/clp-package-utils/clp_package_utils/general.py (10)
  • dump_container_config (328-345)
  • generate_container_config (228-309)
  • generate_container_name (127-132)
  • generate_container_start_cmd (361-402)
  • get_clp_home (108-124)
  • get_container_config_filename (324-325)
  • JobType (58-62)
  • load_config_file (413-439)
  • validate_and_load_db_credentials_file (470-474)
  • validate_dataset_name (600-626)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
  • _generate_compress_cmd (69-101)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • AwsAuthentication (371-401)
  • CLPConfig (557-739)
  • StorageType (141-143)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (31-42)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (208-240)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress.py

49-49: Unnecessary mode argument

Remove mode argument

(UP015)


178-180: Logging statement uses f-string

(G004)

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

8-8: typing.List is deprecated, use list instead

(UP035)


32-32: Use __name__ with logging.getLogger()

Replace with __name__

(LOG002)


51-51: Unnecessary mode argument

Remove mode argument

(UP015)


200-200: Boolean positional value in function call

(FBT003)


208-210: Logging statement uses f-string

(G004)


220-220: Boolean positional value in function call

(FBT003)


222-222: Do not catch blind exception: Exception

(BLE001)


223-223: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


233-234: Logging statement uses f-string

(G004)


276-276: subprocess call: check for execution of untrusted input

(S603)


280-280: Logging statement uses f-string

(G004)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py

9-9: typing.List is deprecated, use list instead

(UP035)


9-9: typing.Tuple is deprecated, use tuple instead

(UP035)


162-162: Avoid specifying long messages outside the exception class

(TRY003)


184-184: Avoid specifying long messages outside the exception class

(TRY003)


196-196: Avoid specifying long messages outside the exception class

(TRY003)


207-207: Unnecessary mode argument

Remove mode argument

(UP015)


220-220: Use X | Y for type annotations

Convert to X | Y

(UP007)


236-236: Avoid specifying long messages outside the exception class

(TRY003)


244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


244-244: Avoid specifying long messages outside the exception class

(TRY003)


249-251: Avoid specifying long messages outside the exception class

(TRY003)


253-255: Avoid specifying long messages outside the exception class

(TRY003)


260-260: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Avoid specifying long messages outside the exception class

(TRY003)


286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


286-286: Avoid specifying long messages outside the exception class

(TRY003)


303-303: Avoid specifying long messages outside the exception class

(TRY003)


378-378: Logging statement uses f-string

(G004)


378-378: Redundant exception object included in logging.exception call

(TRY401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: package-image
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: antlr-code-committed (macos-15)

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment on lines +128 to +138
if parsed_args.inputs_from is None:
if len(parsed_args.inputs) == 0:
args_parser.error("No URL specified.")
if len(parsed_args.inputs) != 1:
args_parser.error(
f"s3-key-prefix accepts exactly one URL, got {len(parsed_args.inputs)}."
)

if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
args_parser.error("URL cannot be specified on the command line AND through a file.")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Validate s3-key-prefix file contains exactly one non-empty URL.

Fail early with a clear message when --inputs-from has 0 or >1 lines.

-    if parsed_args.inputs_from is None:
+    if parsed_args.inputs_from is None:
         if len(parsed_args.inputs) == 0:
             args_parser.error("No URL specified.")
         if len(parsed_args.inputs) != 1:
             args_parser.error(
                 f"s3-key-prefix accepts exactly one URL, got {len(parsed_args.inputs)}."
             )
 
-    if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
+    if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
         args_parser.error("URL cannot be specified on the command line AND through a file.")
+    if parsed_args.inputs_from is not None:
+        with open(parsed_args.inputs_from, "r") as f:
+            non_empty = [ln.strip() for ln in f if ln.strip()]
+        if len(non_empty) == 0:
+            args_parser.error("No URL specified in --inputs-from.")
+        if len(non_empty) != 1:
+            args_parser.error(f"s3-key-prefix accepts exactly one URL, got {len(non_empty)} in file.")
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 128 to 138, the current logic only checks parsed_args.inputs when
--inputs-from is set but does not validate the contents of the file; update the
code to read the file at parsed_args.inputs_from, strip out
blank/whitespace-only lines, count the remaining lines and raise
args_parser.error with a clear message if the count is 0 or not exactly 1 (e.g.,
"s3-key-prefix file must contain exactly one non-empty URL, got N"), and ensure
the single URL is used as the input; keep existing mutual-exclusion checks but
fail early with this explicit validation.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py
Comment on lines +244 to +245
raise ValueError(f"Failed to parse URL '{url}': {e}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Preserve exception context when re‑raising.

-    except ValueError as e:
-        raise ValueError(f"Failed to parse URL '{url}': {e}")
+    except ValueError as e:
+        raise ValueError(f"Failed to parse URL '{url}': {e}") from e
@@
-    except ValueError as e:
-        raise ValueError(f"Failed to parse S3 URL: {e}")
+    except ValueError as e:
+        raise ValueError(f"Failed to parse S3 URL: {e}") from e

Also applies to: 286-286

🧰 Tools
🪛 Ruff (0.14.1)

244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


244-244: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 244-245 and 286-286, the ValueError raised when URL parsing fails
currently discards the original exception context; update those raise statements
to re-raise using exception chaining (raise ValueError(f"...") from e) so the
original exception is preserved for debugging and tracebacks.

Comment on lines +268 to +272
key_prefix = os.path.commonprefix(keys)
if not key_prefix:
raise ValueError("No common prefix found among object keys")

return first_region, first_bucket, key_prefix, keys

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use path-segment common prefix for S3 keys (not character prefix).

os.path.commonprefix is character-based and can yield partial segments (e.g., ab/ vs a/). Compute common “directory” prefix by '/' segments.

-    else:
-        key_prefix = os.path.commonprefix(keys)
-        if not key_prefix:
-            raise ValueError("No common prefix found among object keys")
+    else:
+        # Compute common prefix by path segments to avoid partial-component prefixes
+        prefix_parts = keys[0].split("/")
+        for key in keys[1:]:
+            parts = key.split("/")
+            i = 0
+            while i < len(prefix_parts) and i < len(parts) and prefix_parts[i] == parts[i]:
+                i += 1
+            prefix_parts = prefix_parts[:i]
+            if not prefix_parts:
+                break
+        if not prefix_parts:
+            raise ValueError("No common prefix found among object keys")
+        key_prefix = "/".join(prefix_parts)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.1)

270-270: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 268 to 272, replace the character-based os.path.commonprefix usage
with a path-segment based computation: split each key on '/' into segments, find
the longest common sequence of full segments across all keys, rejoin those
segments with '/' (preserve a trailing '/' if the common prefix is a complete
segment), and use that as key_prefix; if no common segments exist raise the same
ValueError. Ensure the rest of the function still returns first_region,
first_bucket, key_prefix, keys.

Comment on lines +374 to +379
try:
logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve())
clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args)
except Exception as e:
logger.exception(f"Failed to process input: {e}")
return -1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Avoid f-string in logging.exception and keep message concise.

-    except Exception as e:
-        logger.exception(f"Failed to process input: {e}")
+    except Exception:
+        logger.exception("Failed to process input.")
         return -1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try:
logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve())
clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args)
except Exception as e:
logger.exception(f"Failed to process input: {e}")
return -1
try:
logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve())
clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args)
except Exception:
logger.exception("Failed to process input.")
return -1
🧰 Tools
🪛 Ruff (0.14.1)

378-378: Logging statement uses f-string

(G004)


378-378: Redundant exception object included in logging.exception call

(TRY401)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 374 to 379, the logger.exception call uses an f-string to
interpolate the exception into the message; change it to a concise static
message and let logger.exception include the traceback automatically. Replace
logger.exception(f"Failed to process input: {e}") with logger.exception("Failed
to process input") (or logger.exception("Failed to process input",
exc_info=True)) and keep the return -1 as-is.

Comment on lines +1 to +9
#!/usr/bin/env bash

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."

PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
python3 \
-m clp_package_utils.scripts.compress_from_s3 \
"$@"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Harden the wrapper: fail fast, preserve PYTHONPATH, and exec Python.

  • Add set -euo pipefail for predictable failure.
  • Preserve existing PYTHONPATH instead of overwriting.
  • Use exec so signals and exit code propagate cleanly to callers.
 #!/usr/bin/env bash
+
+set -euo pipefail
+
 script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
 package_root="$script_dir/.."
 
-PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
-    python3 \
-    -m clp_package_utils.scripts.compress_from_s3 \
-    "$@"
+site_pkgs="$(readlink -f "$package_root/lib/python3/site-packages")"
+PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs" \
+    exec python3 \
+    -m clp_package_utils.scripts.compress_from_s3 \
+    "$@"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#!/usr/bin/env bash
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
python3 \
-m clp_package_utils.scripts.compress_from_s3 \
"$@"
#!/usr/bin/env bash
set -euo pipefail
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
site_pkgs="$(readlink -f "$package_root/lib/python3/site-packages")"
PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs" \
exec python3 \
-m clp_package_utils.scripts.compress_from_s3 \
"$@"
🤖 Prompt for AI Agents
In components/package-template/src/sbin/compress-from-s3.sh around lines 1–9,
the wrapper should be hardened: add a strict failure mode by inserting set -euo
pipefail immediately after the shebang; compute and export PYTHONPATH so any
existing PYTHONPATH is preserved (if non-empty, prefix it with the existing
value and a colon, then append the resolved site-packages path); and replace the
direct python3 invocation with exec so the Python process inherits signals and
exit codes from the wrapper, forwarding all "$@" to the module.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

♻️ Duplicate comments (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

215-215: Add AWS config directory mount when present.

If archive_output or stream_output uses S3 storage with profile-based authentication, the container needs access to the AWS config directory. The generate_container_config function creates mounts.aws_config_dir when applicable, but it's not being added to necessary_mounts.

After line 215, add:

 necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]
+if mounts.aws_config_dir is not None:
+    necessary_mounts.append(mounts.aws_config_dir)

Based on learnings

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)

120-138: Validate s3-key-prefix file contains exactly one non-empty URL.

When --inputs-from is used, the current code doesn't validate the file contents. Users could provide a file with zero URLs or multiple URLs, which would only fail later during processing.

Add validation after line 137:

 if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
     args_parser.error("URL cannot be specified on the command line AND through a file.")
+
+if parsed_args.inputs_from is not None:
+    with open(parsed_args.inputs_from, "r") as f:
+        non_empty = [ln.strip() for ln in f if ln.strip()]
+    if len(non_empty) == 0:
+        args_parser.error("No URL specified in --inputs-from.")
+    if len(non_empty) != 1:
+        args_parser.error(f"s3-key-prefix accepts exactly one URL, got {len(non_empty)} in file.")

Based on learnings


251-270: Add AWS config directory mount for S3 authentication.

The generate_container_config function creates mounts.aws_config_dir when AWS config is present, but it's not added to necessary_mounts. Without this mount, profile-based S3 authentication will fail inside the container.

After line 251, add:

 necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir]
+if mounts.aws_config_dir is not None:
+    necessary_mounts.append(mounts.aws_config_dir)

Based on learnings

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)

373-378: Simplify logging.exception call on line 377.

The f-string and exception object in logger.exception() are redundant since the method automatically includes the traceback.

Apply this diff:

     except Exception:
-        logger.exception(f"Failed to process input.")
+        logger.exception("Failed to process input.")
         return -1

Based on learnings


217-271: Use path-segment-based common prefix for S3 keys.

Line 267 uses os.path.commonprefix, which is character-based and can produce partial path segments. For example, keys ["ab/file1", "a/file2"] would yield "a" instead of "" (no common directory prefix). This can lead to incorrect key_prefix values.

Replace lines 267-269 with segment-based logic:

-    else:
-        key_prefix = os.path.commonprefix(keys)
-        if not key_prefix:
-            raise ValueError("No common prefix found among object keys")
+    else:
+        # Compute common prefix by path segments to avoid partial-component prefixes
+        prefix_parts = keys[0].split("/")
+        for key in keys[1:]:
+            parts = key.split("/")
+            i = 0
+            while i < len(prefix_parts) and i < len(parts) and prefix_parts[i] == parts[i]:
+                i += 1
+            prefix_parts = prefix_parts[:i]
+            if not prefix_parts:
+                break
+        if not prefix_parts:
+            raise ValueError("No common prefix found among object keys")
+        key_prefix = "/".join(prefix_parts)

Based on learnings

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0b6878 and 878c574.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (9 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
  • StorageEngine (92-94)
  • StorageType (117-119)
  • validate_logs_dir (699-703)
  • get_clp_connection_params_and_type (198-219)
components/clp-package-utils/clp_package_utils/general.py (10)
  • dump_container_config (340-357)
  • generate_container_config (227-308)
  • generate_container_name (159-164)
  • generate_container_start_cmd (371-412)
  • get_clp_home (140-156)
  • get_container_config_filename (336-337)
  • JobType (97-101)
  • load_config_file (423-444)
  • validate_and_load_db_credentials_file (475-479)
  • validate_dataset_name (621-647)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • AwsAuthentication (383-413)
  • CLPConfig (595-797)
  • StorageType (117-119)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (31-42)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (208-240)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (117-119)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

8-8: typing.List is deprecated, use list instead

(UP035)


51-51: Unnecessary mode argument

Remove mode argument

(UP015)


200-200: Boolean positional value in function call

(FBT003)


221-221: Boolean positional value in function call

(FBT003)


223-223: Do not catch blind exception: Exception

(BLE001)


224-224: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


234-235: Logging statement uses f-string

(G004)


278-278: subprocess call: check for execution of untrusted input

(S603)


282-282: Logging statement uses f-string

(G004)


288-288: Do not catch blind exception: Exception

(BLE001)


294-294: Do not catch blind exception: Exception

(BLE001)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py

9-9: typing.List is deprecated, use list instead

(UP035)


9-9: typing.Tuple is deprecated, use tuple instead

(UP035)


161-161: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Avoid specifying long messages outside the exception class

(TRY003)


206-206: Unnecessary mode argument

Remove mode argument

(UP015)


219-219: Use X | Y for type annotations

Convert to X | Y

(UP007)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


243-243: Avoid specifying long messages outside the exception class

(TRY003)


248-250: Avoid specifying long messages outside the exception class

(TRY003)


252-254: Avoid specifying long messages outside the exception class

(TRY003)


259-259: Avoid specifying long messages outside the exception class

(TRY003)


269-269: Avoid specifying long messages outside the exception class

(TRY003)


285-285: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


285-285: Avoid specifying long messages outside the exception class

(TRY003)


302-302: Avoid specifying long messages outside the exception class

(TRY003)


377-377: f-string without any placeholders

Remove extraneous f prefix

(F541)


377-377: Logging statement uses f-string

(G004)

components/clp-package-utils/clp_package_utils/scripts/compress.py

49-49: Unnecessary mode argument

Remove mode argument

(UP015)


244-244: subprocess call: check for execution of untrusted input

(S603)


248-248: Logging statement uses f-string

(G004)


254-254: Do not catch blind exception: Exception

(BLE001)


260-260: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (8)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)

62-102: LGTM!

The command generation logic is correct and handles all the necessary flags appropriately.


205-237: LGTM!

The validation logic correctly enforces S3 storage type and CLP_S storage engine requirements, with appropriate error messages to guide users.

components/clp-package-utils/clp_package_utils/scripts/compress.py (3)

16-16: LGTM!

The StorageType import and logger initialization are correct.

Also applies to: 33-33


69-115: LGTM!

The addition of the --source fs flag and the filesystem input validation logic are correct.


152-152: LGTM!

The updates correctly enforce filesystem-only compression with clear error messages directing users to the appropriate script for S3 inputs.

Also applies to: 175-183, 205-206

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)

143-196: LGTM!

The function correctly routes between filesystem and S3 input modes, with appropriate validation and error handling for each path.


274-303: LGTM!

The helper functions correctly parse S3 URLs and retrieve AWS authentication configuration with appropriate error handling.


329-335: LGTM!

The --source argument correctly enables multi-mode operation with appropriate choices and a sensible default.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment on lines +35 to +60
def _generate_url_list(
subcommand: str,
container_url_list_path: pathlib.Path,
parsed_args: argparse.Namespace,
) -> None:
"""
Generates URL list file for native script.

:param subcommand: 's3-object' or 's3-key-prefix'
:param container_url_list_path: Path to write URL list
:param parsed_args: Parsed command-line arguments
"""
with open(container_url_list_path, "w") as url_list_file:
url_list_file.write(f"{subcommand}\n")

if parsed_args.inputs_from is not None:
with open(parsed_args.inputs_from, "r") as input_file:
for line in input_file:
stripped_url = line.strip()
if "" == stripped_url:
continue
url_list_file.write(f"{stripped_url}\n")
else:
for url in parsed_args.inputs:
url_list_file.write(f"{url}\n")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

LGTM! Optional: Remove unnecessary mode argument.

The function logic is correct. The mode argument "r" on line 51 is the default and can be omitted.

🧰 Tools
🪛 Ruff (0.14.1)

51-51: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 35 to 60, remove the unnecessary explicit mode argument "r" when
opening parsed_args.inputs_from; change the open call to use the default mode
(open(parsed_args.inputs_from)) so the behavior is identical but the code is
slightly cleaner.

Comment on lines +277 to +295
try:
proc = subprocess.run(cmd)
ret_code = proc.returncode
if ret_code != 0:
logger.error("Compression failed.")
logger.debug(f"Docker command failed: {shlex.join(cmd)}")

return ret_code
finally:
try:
generated_config_path_on_host.unlink()
except Exception:
logger.debug(
"Failed to remove generated config file: %s", generated_config_path_on_host
)
try:
container_url_list_path.unlink()
except Exception:
logger.debug("Failed to remove URL list file: %s", container_url_list_path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use logger formatting instead of f-string on line 282.

The cleanup logic is solid, but line 282 uses an f-string in logger.debug(), which should use lazy formatting.

Apply this diff:

         if ret_code != 0:
             logger.error("Compression failed.")
-            logger.debug(f"Docker command failed: {shlex.join(cmd)}")
+            logger.debug("Docker command failed: %s", shlex.join(cmd))
🧰 Tools
🪛 Ruff (0.14.1)

278-278: subprocess call: check for execution of untrusted input

(S603)


282-282: Logging statement uses f-string

(G004)


288-288: Do not catch blind exception: Exception

(BLE001)


294-294: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 277 to 295, the logger.debug call on line 282 uses an f-string
which prevents lazy formatting; replace the f-string with logger.debug("Docker
command failed: %s", shlex.join(cmd)) so the message uses logger's lazy
interpolation, leaving the surrounding return and cleanup logic unchanged.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment on lines +238 to +243
for url in urls:
try:
region_code, bucket, key = parse_s3_url(url)
parsed_urls.append((region_code, bucket, key))
except ValueError as e:
raise ValueError(f"Failed to parse URL '{url}': {e}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Preserve exception context when re-raising.

Lines 243 and 285 re-raise exceptions without chaining, which discards the original traceback.

Apply exception chaining:

         except ValueError as e:
-            raise ValueError(f"Failed to parse URL '{url}': {e}")
+            raise ValueError(f"Failed to parse URL '{url}': {e}") from e

Also at line 285:

     except ValueError as e:
-        raise ValueError(f"Failed to parse S3 URL: {e}")
+        raise ValueError(f"Failed to parse S3 URL: {e}") from e

Based on learnings

Also applies to: 282-285

🧰 Tools
🪛 Ruff (0.14.1)

243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


243-243: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 238-243 and 282-285, the code re-raises ValueError (and another
exception) without exception chaining, which discards the original traceback;
update both re-raises to use "raise <exception> from e" (e.g., raise
ValueError(f\"Failed to parse URL '{url}': {e}\") from e) so the original
exception context/traceback is preserved and propagate the original error as the
__cause__.

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed host compress.py.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
@y-scope y-scope deleted a comment from coderabbitai Bot Oct 24, 2025

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)

216-216: Mount AWS config directory for profile-based auth.

The necessary_mounts list should include mounts.aws_config_dir when present, so containers using S3 profile auth for archive/stream output can access AWS credentials.

This issue was previously flagged in past review comments. Please address the past review feedback.

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)

120-138: Validate s3-key-prefix file contains exactly one URL.

When --inputs-from is specified, the validation does not check the file contents to ensure it contains exactly one non-empty URL. The native script will fail late if the file contains 0 or >1 URLs.

This issue was previously flagged in past review comments. Please read and validate the file contents as suggested in the prior review.


251-251: Mount AWS config directory for S3 profile authentication.

The necessary_mounts list does not include mounts.aws_config_dir, which is required for profile-based AWS credentials to be accessible inside the container. S3 compression will fail if profile auth is configured.

This critical issue was previously flagged in past review comments. Please add the aws_config_dir mount as suggested in the prior review.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 878c574 and f085dc8.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (9 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (117-119)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • AwsAuthentication (383-413)
  • CLPConfig (595-797)
  • StorageType (117-119)
components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
  • S3InputConfig (31-42)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
  • parse_s3_url (208-240)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • StorageEngine (92-94)
  • StorageType (117-119)
  • get_clp_connection_params_and_type (198-219)
components/clp-package-utils/clp_package_utils/general.py (10)
  • dump_container_config (340-357)
  • generate_container_config (227-308)
  • generate_container_name (159-164)
  • generate_container_start_cmd (371-412)
  • get_clp_home (140-156)
  • get_container_config_filename (336-337)
  • JobType (97-101)
  • load_config_file (423-444)
  • validate_and_load_db_credentials_file (475-479)
  • validate_dataset_name (621-647)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress.py

57-57: Unnecessary mode argument

Remove mode argument

(UP015)


248-248: Logging statement uses f-string

(G004)


252-252: Do not catch blind exception: Exception

(BLE001)


257-257: Do not catch blind exception: Exception

(BLE001)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py

9-9: typing.List is deprecated, use list instead

(UP035)


9-9: typing.Tuple is deprecated, use tuple instead

(UP035)


161-161: Avoid specifying long messages outside the exception class

(TRY003)


183-183: Avoid specifying long messages outside the exception class

(TRY003)


195-195: Avoid specifying long messages outside the exception class

(TRY003)


206-206: Unnecessary mode argument

Remove mode argument

(UP015)


219-219: Use X | Y for type annotations

Convert to X | Y

(UP007)


235-235: Avoid specifying long messages outside the exception class

(TRY003)


243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


243-243: Avoid specifying long messages outside the exception class

(TRY003)


248-250: Avoid specifying long messages outside the exception class

(TRY003)


252-254: Avoid specifying long messages outside the exception class

(TRY003)


259-259: Avoid specifying long messages outside the exception class

(TRY003)


269-269: Avoid specifying long messages outside the exception class

(TRY003)


285-285: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


285-285: Avoid specifying long messages outside the exception class

(TRY003)


302-302: Avoid specifying long messages outside the exception class

(TRY003)


377-377: f-string without any placeholders

Remove extraneous f prefix

(F541)


377-377: Logging statement uses f-string

(G004)

components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

8-8: typing.List is deprecated, use list instead

(UP035)


51-51: Unnecessary mode argument

Remove mode argument

(UP015)


200-200: Boolean positional value in function call

(FBT003)


221-221: Boolean positional value in function call

(FBT003)


223-223: Do not catch blind exception: Exception

(BLE001)


224-224: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


234-235: Logging statement uses f-string

(G004)


278-278: subprocess call: check for execution of untrusted input

(S603)


282-282: Logging statement uses f-string

(G004)


288-288: Do not catch blind exception: Exception

(BLE001)


294-294: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image

Comment on lines +233 to +237
logger.error(
f"S3 compression requires storage engine {StorageEngine.CLP_S}, "
f"but configured engine is {storage_engine}."
)
return -1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use logger formatting instead of f-strings.

Lines 234-235 use f-strings in logger.error(), which prevents lazy interpolation.

Apply this diff:

     else:
         logger.error(
-            f"S3 compression requires storage engine {StorageEngine.CLP_S}, "
-            f"but configured engine is {storage_engine}."
+            "S3 compression requires storage engine `%s`, but configured engine is `%s`.",
+            StorageEngine.CLP_S,
+            storage_engine,
         )
         return -1
🧰 Tools
🪛 Ruff (0.14.1)

234-235: Logging statement uses f-string

(G004)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 233 to 237, the logger.error call uses f-strings which prevents
lazy interpolation; replace the f-string usage with logger-style formatting by
passing a single format string and the values as separate arguments (e.g.
logger.error("S3 compression requires storage engine %s, but configured engine
is %s.", StorageEngine.CLP_S, storage_engine)) and keep the subsequent return -1
unchanged.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed compress_form_s3.py.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed native compression.py script.

Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (7)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (5)

240-243: Use logger formatting instead of f-strings.

Prefer lazy formatting to avoid eager interpolation and align with logging best practices.

Apply this diff:

         logger.error(
-            f"S3 compression requires storage engine {StorageEngine.CLP_S}, but configured engine"
-            f" is {storage_engine}."
+            "S3 compression requires storage engine `%s`, but configured engine is `%s`.",
+            StorageEngine.CLP_S,
+            storage_engine,
         )

276-276: Add AWS config directory mount for S3 authentication.

Without the aws_config_dir mount, profile-based AWS credentials won't be available inside the container, causing S3 compression to fail with authentication errors.

Apply this diff:

-    necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir]
+    necessary_mounts = [
+        mounts.clp_home,
+        mounts.data_dir,
+        mounts.logs_dir,
+        mounts.aws_config_dir,  # optional; skipped if None
+    ]

308-308: Use logger formatting instead of f-string.

Prefer lazy formatting to avoid eager interpolation and align with logging best practices.

Apply this diff:

-            logger.debug(f"Docker command failed: {shlex.join(cmd)}")
+            logger.debug("Docker command failed: %s", shlex.join(cmd))

140-151: Validate s3-key-prefix file contains exactly one URL.

When --inputs-from is used with s3-key-prefix, the file contents should be validated to ensure exactly one non-empty URL is provided. Currently, validation only occurs when URLs are passed on the command line.

Apply this diff after line 150:

     if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
         args_parser.error("URL cannot be specified on the command line AND through a file.")
+
+    if parsed_args.inputs_from is not None:
+        with open(parsed_args.inputs_from) as f:
+            non_empty_lines = [line.strip() for line in f if line.strip()]
+        if len(non_empty_lines) == 0:
+            args_parser.error("No URL specified in --inputs-from.")
+        if len(non_empty_lines) != 1:
+            args_parser.error(
+                f"{S3_KEY_PREFIX_COMPRESSION} accepts exactly one URL, got {len(non_empty_lines)} in file."
+            )

59-59: Optional: Remove unnecessary mode argument.

The explicit mode "r" is the default for open() and can be omitted.

Apply this diff:

-        with open(parsed_args.inputs_from, "r") as input_file:
+        with open(parsed_args.inputs_from) as input_file:
components/clp-package-utils/clp_package_utils/scripts/compress.py (2)

254-254: Use logger formatting instead of f-string.

Prefer lazy formatting to avoid eager interpolation and align with logging best practices.

Apply this diff:

-        logger.debug(f"Docker command failed: {shlex.join(cmd)}")
+        logger.debug("Docker command failed: %s", shlex.join(cmd))

220-220: Add AWS config directory mount for S3 output destinations.

If archive_output or stream_output uses S3 with profile-based authentication, the container needs access to AWS credentials. Without the aws_config_dir mount, S3 writes will fail.

Apply this diff:

-    necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir]
+    necessary_mounts = [
+        mounts.clp_home,
+        mounts.data_dir,
+        mounts.logs_dir,
+        mounts.input_logs_dir,
+        mounts.aws_config_dir,  # optional; skipped if None
+    ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9466346 and 7b32f5b.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/general.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (9 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
  • StorageEngine (94-96)
  • StorageType (119-121)
  • get_clp_connection_params_and_type (200-221)
components/clp-package-utils/clp_package_utils/general.py (10)
  • dump_container_config (344-361)
  • generate_container_config (231-312)
  • generate_container_name (163-168)
  • generate_container_start_cmd (375-416)
  • get_clp_home (144-160)
  • get_container_config_filename (340-341)
  • JobType (101-105)
  • load_config_file (427-449)
  • validate_and_load_db_credentials_file (480-484)
  • validate_dataset_name (634-660)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
  • StorageType (119-121)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py

59-59: Unnecessary mode argument

Remove mode argument

(UP015)


221-221: Boolean positional value in function call

(FBT003)


241-242: Logging statement uses f-string

(G004)


250-250: Boolean positional value in function call

(FBT003)


252-252: Do not catch blind exception: Exception

(BLE001)


253-253: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


304-304: subprocess call: check for execution of untrusted input

(S603)


308-308: Logging statement uses f-string

(G004)

components/clp-package-utils/clp_package_utils/scripts/compress.py

59-59: Unnecessary mode argument

Remove mode argument

(UP015)


254-254: Logging statement uses f-string

(G004)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/general.py (1)

52-53: LGTM! Clean constant declarations.

The S3 compression mode constants are well-named and consistent with the naming conventions used elsewhere in the codebase.

Comment on lines +252 to +254
except Exception as e:
logger.error(e)
return -1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use logger.exception for better error context.

When logging an exception in an exception handler, use logger.exception() to automatically include the traceback.

Apply this diff:

     except Exception as e:
-        logger.error(e)
+        logger.exception("Failed to validate dataset name.")
         return -1
🧰 Tools
🪛 Ruff (0.14.1)

252-252: Do not catch blind exception: Exception

(BLE001)


253-253: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 252 to 254, the exception handler currently calls logger.error(e)
which doesn't include the traceback; replace that call with
logger.exception("Failed to compress from S3") (or similar descriptive message)
so the traceback is logged automatically, and keep the subsequent return -1
unchanged.

return len(parsed_args.paths) != 0

no_path_found = True
with open(host_logs_list_path, "r") as host_logs_list_file:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Optional: Remove unnecessary mode argument.

The explicit mode "r" is the default for open() and can be omitted for cleaner code.

Apply this diff:

-        with open(host_logs_list_path, "r") as host_logs_list_file:
+        with open(host_logs_list_path) as host_logs_list_file:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with open(host_logs_list_path, "r") as host_logs_list_file:
with open(host_logs_list_path) as host_logs_list_file:
🧰 Tools
🪛 Ruff (0.14.1)

59-59: Unnecessary mode argument

Remove mode argument

(UP015)

🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress.py around
line 59, the open() call explicitly passes the default mode "r"; remove the
unnecessary mode argument so the call becomes open(host_logs_list_path) to keep
the code cleaner and rely on the default read mode.

@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package)!: Add compress-from-s3.sh script for S3 object compression. feat(clp-package)!: Add compress-from-s3 Python and shell scripts for S3-object or S3-key-prefix compression; Restrict compress.py to accept local file paths only. Oct 26, 2025

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with the following validations performed (also updated PR description):

compress-from-s3.py:

From cli directly:

  • Single object URL → success
  • Multiple object URLs with a shared prefix → success
  • Multiple object URLs without a shared prefix → fail
  • Single prefix URL → success
  • Multiple prefix URLs → fail
  • Single prefix URL but it’s an object → success

From input file:

  • Single object URL → success
  • Multiple object URLs with a shared prefix → success
  • Multiple object URLs without a shared prefix → fail
  • Single prefix URL → success
  • Multiple prefix URLs → fail
  • Multiple prefix URLs → fail
  • Single prefix URL but it’s an object → success
  • Empty input file (for both object compression and key-prefix compression) → fail

Additional cases:

  • From both positional and input file → fail
  • Given both an input file and positional args → fail
  • Given both s3-object and s3-key-prefix → fail (treated as previous)

compress.py (for both clp and clp-s storage engine):

From cli directly:

  • Single path → success
  • Multiple path → success

From input file:

  • Single path → success
  • Multiple path → success
  • No path → fail

Additional cases:

  • From both positional and input file → fail

Changed PR title to:

feat(clp-package)!: Add `compress-from-s3` Python and shell scripts for S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only.

@LinZhihao-723 LinZhihao-723 changed the title feat(clp-package)!: Add compress-from-s3 Python and shell scripts for S3-object or S3-key-prefix compression; Restrict compress.py to accept local file paths only. feat(clp-package)!: Add compress-from-s3 Python and shell scripts for S3-object and S3-key-prefix compression; Restrict compress.py to accept local file paths only. Oct 26, 2025
@LinZhihao-723 LinZhihao-723 merged commit 2ac456e into y-scope:main Oct 26, 2025
22 checks passed
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…or S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only. (y-scope#1476)

Co-authored-by: LinZhihao-723 <zh.lin@mail.utoronto.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants