feat(clp-package)!: Add compress-from-s3 Python and shell scripts for S3-object and S3-key-prefix compression; Restrict compress.py to accept local file paths only.#1476
Conversation
…and ingestion of s3 object keys
WalkthroughFilesystem compressor was simplified to FS-only; a new S3 orchestrator script was added to handle containerized S3 workflows; the native compressor gained an --input-type option with S3 parsing and AWS-auth helpers; a shell entrypoint and two S3 constants were introduced. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant compress_from_s3 as compress_from_s3.py
participant Container
participant native_compress as native/compress.py
User->>compress_from_s3: Run s3-object / s3-key-prefix subcommand
compress_from_s3->>compress_from_s3: Validate args, load CLP config, ensure StorageType S3 & engine CLP_S
compress_from_s3->>compress_from_s3: Generate URL list file (temp)
compress_from_s3->>compress_from_s3: Build container config + command (--input-type s3)
compress_from_s3->>Container: Start container with mounts + URL list
Container->>native_compress: Invoke native/compress.py --input-type s3 --logs-list <path>
native_compress->>native_compress: Parse/validate S3 URLs, get AWS auth, build S3InputConfig
native_compress->>native_compress: Perform compression
Container-->>compress_from_s3: Return exit code
compress_from_s3->>compress_from_s3: Cleanup temp files
compress_from_s3-->>User: Report result
sequenceDiagram
participant User
participant compress as compress.py
participant Container
participant native_compress as native/compress.py
User->>compress: Run filesystem compression
compress->>compress: Validate FS inputs, generate logs list (container path)
compress->>compress: Build container config + command (--input-type fs)
compress->>Container: Start container with mounts + logs-list
Container->>native_compress: Invoke native/compress.py --input-type fs --logs-list <path>
native_compress->>native_compress: Build FsInputConfig from provided paths
native_compress->>native_compress: Perform compression
Container-->>compress: Return exit code
compress-->>User: Report result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py (3)
33-33: Use module logger name.Prefer name for logger hierarchy and filters.
-logger = logging.getLogger(__file__) +logger = logging.getLogger(__name__)
171-173: Avoid bare except; capture the exception.Catching all exceptions hides interruptions and makes debugging harder.
- except: - logger.exception("Failed to load config.") - return -1 + except Exception: + logger.exception("Failed to load config.") + return -1
240-251: Clean up temp files reliably (config + logs list).logs list files accumulate; remove them in finally. Also prefer logger formatting for debug line.
- proc = subprocess.run(cmd) - ret_code = proc.returncode - if 0 != ret_code: - logger.error("Compression failed.") - logger.debug(f"Docker command failed: {shlex.join(cmd)}") - - # Remove generated files - generated_config_path_on_host.unlink() - - return ret_code + try: + proc = subprocess.run(cmd) + ret_code = proc.returncode + if ret_code != 0: + logger.error("Compression failed.") + logger.debug("Docker command failed: %s", shlex.join(cmd)) + return ret_code + finally: + try: + generated_config_path_on_host.unlink() + except Exception: + logger.debug("Failed to remove generated config file: %s", generated_config_path_on_host) + try: + container_logs_list_path.unlink() + except Exception: + logger.debug("Failed to remove logs list file: %s", container_logs_list_path)components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
41-41: Use module logger name.-logger = logging.getLogger(__file__) +logger = logging.getLogger(__name__)
136-137: Log full traceback on job failure.- except Exception as ex: - logger.error(ex) + except Exception: + logger.exception("Compression job failed.")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/clp-package-utils/clp_package_utils/scripts/compress.py(8 hunks)components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(5 hunks)components/package-template/src/sbin/compress-from-s3.sh(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(141-143)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
StorageEngine(116-118)StorageType(141-143)validate_logs_dir(661-665)get_clp_connection_params_and_type(220-241)components/clp-package-utils/clp_package_utils/general.py (10)
dump_container_config(328-345)generate_container_config(228-309)generate_container_name(127-132)generate_container_start_cmd(361-402)get_clp_home(108-124)get_container_config_filename(324-325)JobType(58-62)load_config_file(413-439)validate_and_load_db_credentials_file(470-474)validate_dataset_name(600-626)components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
_generate_compress_cmd(69-101)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
AwsAuthentication(371-401)CLPConfig(557-739)StorageType(141-143)components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
S3InputConfig(31-42)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
parse_s3_url(208-240)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress.py
49-49: Unnecessary mode argument
Remove mode argument
(UP015)
178-180: Logging statement uses f-string
(G004)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
8-8: typing.List is deprecated, use list instead
(UP035)
32-32: Use __name__ with logging.getLogger()
Replace with __name__
(LOG002)
51-51: Unnecessary mode argument
Remove mode argument
(UP015)
200-200: Boolean positional value in function call
(FBT003)
208-210: Logging statement uses f-string
(G004)
220-220: Boolean positional value in function call
(FBT003)
222-222: Do not catch blind exception: Exception
(BLE001)
223-223: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
233-234: Logging statement uses f-string
(G004)
276-276: subprocess call: check for execution of untrusted input
(S603)
280-280: Logging statement uses f-string
(G004)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
9-9: typing.List is deprecated, use list instead
(UP035)
9-9: typing.Tuple is deprecated, use tuple instead
(UP035)
162-162: Avoid specifying long messages outside the exception class
(TRY003)
184-184: Avoid specifying long messages outside the exception class
(TRY003)
196-196: Avoid specifying long messages outside the exception class
(TRY003)
207-207: Unnecessary mode argument
Remove mode argument
(UP015)
220-220: Use X | Y for type annotations
Convert to X | Y
(UP007)
236-236: Avoid specifying long messages outside the exception class
(TRY003)
244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
244-244: Avoid specifying long messages outside the exception class
(TRY003)
249-251: Avoid specifying long messages outside the exception class
(TRY003)
253-255: Avoid specifying long messages outside the exception class
(TRY003)
260-260: Avoid specifying long messages outside the exception class
(TRY003)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
286-286: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
286-286: Avoid specifying long messages outside the exception class
(TRY003)
303-303: Avoid specifying long messages outside the exception class
(TRY003)
378-378: Logging statement uses f-string
(G004)
378-378: Redundant exception object included in logging.exception call
(TRY401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: package-image
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: antlr-code-committed (macos-15)
| if parsed_args.inputs_from is None: | ||
| if len(parsed_args.inputs) == 0: | ||
| args_parser.error("No URL specified.") | ||
| if len(parsed_args.inputs) != 1: | ||
| args_parser.error( | ||
| f"s3-key-prefix accepts exactly one URL, got {len(parsed_args.inputs)}." | ||
| ) | ||
|
|
||
| if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None: | ||
| args_parser.error("URL cannot be specified on the command line AND through a file.") | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Validate s3-key-prefix file contains exactly one non-empty URL.
Fail early with a clear message when --inputs-from has 0 or >1 lines.
- if parsed_args.inputs_from is None:
+ if parsed_args.inputs_from is None:
if len(parsed_args.inputs) == 0:
args_parser.error("No URL specified.")
if len(parsed_args.inputs) != 1:
args_parser.error(
f"s3-key-prefix accepts exactly one URL, got {len(parsed_args.inputs)}."
)
- if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
+ if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None:
args_parser.error("URL cannot be specified on the command line AND through a file.")
+ if parsed_args.inputs_from is not None:
+ with open(parsed_args.inputs_from, "r") as f:
+ non_empty = [ln.strip() for ln in f if ln.strip()]
+ if len(non_empty) == 0:
+ args_parser.error("No URL specified in --inputs-from.")
+ if len(non_empty) != 1:
+ args_parser.error(f"s3-key-prefix accepts exactly one URL, got {len(non_empty)} in file.")🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 128 to 138, the current logic only checks parsed_args.inputs when
--inputs-from is set but does not validate the contents of the file; update the
code to read the file at parsed_args.inputs_from, strip out
blank/whitespace-only lines, count the remaining lines and raise
args_parser.error with a clear message if the count is 0 or not exactly 1 (e.g.,
"s3-key-prefix file must contain exactly one non-empty URL, got N"), and ensure
the single URL is used as the input; keep existing mutual-exclusion checks but
fail early with this explicit validation.
| raise ValueError(f"Failed to parse URL '{url}': {e}") | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Preserve exception context when re‑raising.
- except ValueError as e:
- raise ValueError(f"Failed to parse URL '{url}': {e}")
+ except ValueError as e:
+ raise ValueError(f"Failed to parse URL '{url}': {e}") from e
@@
- except ValueError as e:
- raise ValueError(f"Failed to parse S3 URL: {e}")
+ except ValueError as e:
+ raise ValueError(f"Failed to parse S3 URL: {e}") from eAlso applies to: 286-286
🧰 Tools
🪛 Ruff (0.14.1)
244-244: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
244-244: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 244-245 and 286-286, the ValueError raised when URL parsing fails
currently discards the original exception context; update those raise statements
to re-raise using exception chaining (raise ValueError(f"...") from e) so the
original exception is preserved for debugging and tracebacks.
| key_prefix = os.path.commonprefix(keys) | ||
| if not key_prefix: | ||
| raise ValueError("No common prefix found among object keys") | ||
|
|
||
| return first_region, first_bucket, key_prefix, keys |
There was a problem hiding this comment.
Use path-segment common prefix for S3 keys (not character prefix).
os.path.commonprefix is character-based and can yield partial segments (e.g., ab/ vs a/). Compute common “directory” prefix by '/' segments.
- else:
- key_prefix = os.path.commonprefix(keys)
- if not key_prefix:
- raise ValueError("No common prefix found among object keys")
+ else:
+ # Compute common prefix by path segments to avoid partial-component prefixes
+ prefix_parts = keys[0].split("/")
+ for key in keys[1:]:
+ parts = key.split("/")
+ i = 0
+ while i < len(prefix_parts) and i < len(parts) and prefix_parts[i] == parts[i]:
+ i += 1
+ prefix_parts = prefix_parts[:i]
+ if not prefix_parts:
+ break
+ if not prefix_parts:
+ raise ValueError("No common prefix found among object keys")
+ key_prefix = "/".join(prefix_parts)Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.1)
270-270: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 268 to 272, replace the character-based os.path.commonprefix usage
with a path-segment based computation: split each key on '/' into segments, find
the longest common sequence of full segments across all keys, rejoin those
segments with '/' (preserve a trailing '/' if the common prefix is a complete
segment), and use that as key_prefix; if no common segments exist raise the same
ValueError. Ensure the rest of the function still returns first_region,
first_bucket, key_prefix, keys.
| try: | ||
| logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve()) | ||
| clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args) | ||
| except Exception as e: | ||
| logger.exception(f"Failed to process input: {e}") | ||
| return -1 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Avoid f-string in logging.exception and keep message concise.
- except Exception as e:
- logger.exception(f"Failed to process input: {e}")
+ except Exception:
+ logger.exception("Failed to process input.")
return -1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve()) | |
| clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args) | |
| except Exception as e: | |
| logger.exception(f"Failed to process input: {e}") | |
| return -1 | |
| try: | |
| logs_to_compress = _get_logs_to_compress(pathlib.Path(parsed_args.logs_list).resolve()) | |
| clp_input_config = _generate_clp_io_config(clp_config, logs_to_compress, parsed_args) | |
| except Exception: | |
| logger.exception("Failed to process input.") | |
| return -1 |
🧰 Tools
🪛 Ruff (0.14.1)
378-378: Logging statement uses f-string
(G004)
378-378: Redundant exception object included in logging.exception call
(TRY401)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 374 to 379, the logger.exception call uses an f-string to
interpolate the exception into the message; change it to a concise static
message and let logger.exception include the traceback automatically. Replace
logger.exception(f"Failed to process input: {e}") with logger.exception("Failed
to process input") (or logger.exception("Failed to process input",
exc_info=True)) and keep the return -1 as-is.
| #!/usr/bin/env bash | ||
|
|
||
| script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" | ||
| package_root="$script_dir/.." | ||
|
|
||
| PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \ | ||
| python3 \ | ||
| -m clp_package_utils.scripts.compress_from_s3 \ | ||
| "$@" |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Harden the wrapper: fail fast, preserve PYTHONPATH, and exec Python.
- Add set -euo pipefail for predictable failure.
- Preserve existing PYTHONPATH instead of overwriting.
- Use exec so signals and exit code propagate cleanly to callers.
#!/usr/bin/env bash
+
+set -euo pipefail
+
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
package_root="$script_dir/.."
-PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \
- python3 \
- -m clp_package_utils.scripts.compress_from_s3 \
- "$@"
+site_pkgs="$(readlink -f "$package_root/lib/python3/site-packages")"
+PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs" \
+ exec python3 \
+ -m clp_package_utils.scripts.compress_from_s3 \
+ "$@"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| #!/usr/bin/env bash | |
| script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" | |
| package_root="$script_dir/.." | |
| PYTHONPATH=$(readlink -f "$package_root/lib/python3/site-packages") \ | |
| python3 \ | |
| -m clp_package_utils.scripts.compress_from_s3 \ | |
| "$@" | |
| #!/usr/bin/env bash | |
| set -euo pipefail | |
| script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )" | |
| package_root="$script_dir/.." | |
| site_pkgs="$(readlink -f "$package_root/lib/python3/site-packages")" | |
| PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$site_pkgs" \ | |
| exec python3 \ | |
| -m clp_package_utils.scripts.compress_from_s3 \ | |
| "$@" |
🤖 Prompt for AI Agents
In components/package-template/src/sbin/compress-from-s3.sh around lines 1–9,
the wrapper should be hardened: add a strict failure mode by inserting set -euo
pipefail immediately after the shebang; compute and export PYTHONPATH so any
existing PYTHONPATH is preserved (if non-empty, prefix it with the existing
value and a colon, then append the resolved site-packages path); and replace the
direct python3 invocation with exec so the Python process inherits signals and
exit codes from the wrapper, forwarding all "$@" to the module.
There was a problem hiding this comment.
Actionable comments posted: 8
♻️ Duplicate comments (5)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
215-215: Add AWS config directory mount when present.If
archive_outputorstream_outputuses S3 storage with profile-based authentication, the container needs access to the AWS config directory. Thegenerate_container_configfunction createsmounts.aws_config_dirwhen applicable, but it's not being added tonecessary_mounts.After line 215, add:
necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir] +if mounts.aws_config_dir is not None: + necessary_mounts.append(mounts.aws_config_dir)Based on learnings
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
120-138: Validate s3-key-prefix file contains exactly one non-empty URL.When
--inputs-fromis used, the current code doesn't validate the file contents. Users could provide a file with zero URLs or multiple URLs, which would only fail later during processing.Add validation after line 137:
if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None: args_parser.error("URL cannot be specified on the command line AND through a file.") + +if parsed_args.inputs_from is not None: + with open(parsed_args.inputs_from, "r") as f: + non_empty = [ln.strip() for ln in f if ln.strip()] + if len(non_empty) == 0: + args_parser.error("No URL specified in --inputs-from.") + if len(non_empty) != 1: + args_parser.error(f"s3-key-prefix accepts exactly one URL, got {len(non_empty)} in file.")Based on learnings
251-270: Add AWS config directory mount for S3 authentication.The
generate_container_configfunction createsmounts.aws_config_dirwhen AWS config is present, but it's not added tonecessary_mounts. Without this mount, profile-based S3 authentication will fail inside the container.After line 251, add:
necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir] +if mounts.aws_config_dir is not None: + necessary_mounts.append(mounts.aws_config_dir)Based on learnings
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (2)
373-378: Simplify logging.exception call on line 377.The f-string and exception object in
logger.exception()are redundant since the method automatically includes the traceback.Apply this diff:
except Exception: - logger.exception(f"Failed to process input.") + logger.exception("Failed to process input.") return -1Based on learnings
217-271: Use path-segment-based common prefix for S3 keys.Line 267 uses
os.path.commonprefix, which is character-based and can produce partial path segments. For example, keys["ab/file1", "a/file2"]would yield"a"instead of""(no common directory prefix). This can lead to incorrect key_prefix values.Replace lines 267-269 with segment-based logic:
- else: - key_prefix = os.path.commonprefix(keys) - if not key_prefix: - raise ValueError("No common prefix found among object keys") + else: + # Compute common prefix by path segments to avoid partial-component prefixes + prefix_parts = keys[0].split("/") + for key in keys[1:]: + parts = key.split("/") + i = 0 + while i < len(prefix_parts) and i < len(parts) and prefix_parts[i] == parts[i]: + i += 1 + prefix_parts = prefix_parts[:i] + if not prefix_parts: + break + if not prefix_parts: + raise ValueError("No common prefix found among object keys") + key_prefix = "/".join(prefix_parts)Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py(9 hunks)components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (4)
StorageEngine(92-94)StorageType(117-119)validate_logs_dir(699-703)get_clp_connection_params_and_type(198-219)components/clp-package-utils/clp_package_utils/general.py (10)
dump_container_config(340-357)generate_container_config(227-308)generate_container_name(159-164)generate_container_start_cmd(371-412)get_clp_home(140-156)get_container_config_filename(336-337)JobType(97-101)load_config_file(423-444)validate_and_load_db_credentials_file(475-479)validate_dataset_name(621-647)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
AwsAuthentication(383-413)CLPConfig(595-797)StorageType(117-119)components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
S3InputConfig(31-42)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
parse_s3_url(208-240)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(117-119)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
8-8: typing.List is deprecated, use list instead
(UP035)
51-51: Unnecessary mode argument
Remove mode argument
(UP015)
200-200: Boolean positional value in function call
(FBT003)
221-221: Boolean positional value in function call
(FBT003)
223-223: Do not catch blind exception: Exception
(BLE001)
224-224: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
234-235: Logging statement uses f-string
(G004)
278-278: subprocess call: check for execution of untrusted input
(S603)
282-282: Logging statement uses f-string
(G004)
288-288: Do not catch blind exception: Exception
(BLE001)
294-294: Do not catch blind exception: Exception
(BLE001)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
9-9: typing.List is deprecated, use list instead
(UP035)
9-9: typing.Tuple is deprecated, use tuple instead
(UP035)
161-161: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Avoid specifying long messages outside the exception class
(TRY003)
206-206: Unnecessary mode argument
Remove mode argument
(UP015)
219-219: Use X | Y for type annotations
Convert to X | Y
(UP007)
235-235: Avoid specifying long messages outside the exception class
(TRY003)
243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
243-243: Avoid specifying long messages outside the exception class
(TRY003)
248-250: Avoid specifying long messages outside the exception class
(TRY003)
252-254: Avoid specifying long messages outside the exception class
(TRY003)
259-259: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
285-285: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
285-285: Avoid specifying long messages outside the exception class
(TRY003)
302-302: Avoid specifying long messages outside the exception class
(TRY003)
377-377: f-string without any placeholders
Remove extraneous f prefix
(F541)
377-377: Logging statement uses f-string
(G004)
components/clp-package-utils/clp_package_utils/scripts/compress.py
49-49: Unnecessary mode argument
Remove mode argument
(UP015)
244-244: subprocess call: check for execution of untrusted input
(S603)
248-248: Logging statement uses f-string
(G004)
254-254: Do not catch blind exception: Exception
(BLE001)
260-260: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: package-image
🔇 Additional comments (8)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
62-102: LGTM!The command generation logic is correct and handles all the necessary flags appropriately.
205-237: LGTM!The validation logic correctly enforces S3 storage type and CLP_S storage engine requirements, with appropriate error messages to guide users.
components/clp-package-utils/clp_package_utils/scripts/compress.py (3)
16-16: LGTM!The StorageType import and logger initialization are correct.
Also applies to: 33-33
69-115: LGTM!The addition of the
--source fsflag and the filesystem input validation logic are correct.
152-152: LGTM!The updates correctly enforce filesystem-only compression with clear error messages directing users to the appropriate script for S3 inputs.
Also applies to: 175-183, 205-206
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
143-196: LGTM!The function correctly routes between filesystem and S3 input modes, with appropriate validation and error handling for each path.
274-303: LGTM!The helper functions correctly parse S3 URLs and retrieve AWS authentication configuration with appropriate error handling.
329-335: LGTM!The
--sourceargument correctly enables multi-mode operation with appropriate choices and a sensible default.
| def _generate_url_list( | ||
| subcommand: str, | ||
| container_url_list_path: pathlib.Path, | ||
| parsed_args: argparse.Namespace, | ||
| ) -> None: | ||
| """ | ||
| Generates URL list file for native script. | ||
|
|
||
| :param subcommand: 's3-object' or 's3-key-prefix' | ||
| :param container_url_list_path: Path to write URL list | ||
| :param parsed_args: Parsed command-line arguments | ||
| """ | ||
| with open(container_url_list_path, "w") as url_list_file: | ||
| url_list_file.write(f"{subcommand}\n") | ||
|
|
||
| if parsed_args.inputs_from is not None: | ||
| with open(parsed_args.inputs_from, "r") as input_file: | ||
| for line in input_file: | ||
| stripped_url = line.strip() | ||
| if "" == stripped_url: | ||
| continue | ||
| url_list_file.write(f"{stripped_url}\n") | ||
| else: | ||
| for url in parsed_args.inputs: | ||
| url_list_file.write(f"{url}\n") | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
LGTM! Optional: Remove unnecessary mode argument.
The function logic is correct. The mode argument "r" on line 51 is the default and can be omitted.
🧰 Tools
🪛 Ruff (0.14.1)
51-51: Unnecessary mode argument
Remove mode argument
(UP015)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 35 to 60, remove the unnecessary explicit mode argument "r" when
opening parsed_args.inputs_from; change the open call to use the default mode
(open(parsed_args.inputs_from)) so the behavior is identical but the code is
slightly cleaner.
| try: | ||
| proc = subprocess.run(cmd) | ||
| ret_code = proc.returncode | ||
| if ret_code != 0: | ||
| logger.error("Compression failed.") | ||
| logger.debug(f"Docker command failed: {shlex.join(cmd)}") | ||
|
|
||
| return ret_code | ||
| finally: | ||
| try: | ||
| generated_config_path_on_host.unlink() | ||
| except Exception: | ||
| logger.debug( | ||
| "Failed to remove generated config file: %s", generated_config_path_on_host | ||
| ) | ||
| try: | ||
| container_url_list_path.unlink() | ||
| except Exception: | ||
| logger.debug("Failed to remove URL list file: %s", container_url_list_path) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Use logger formatting instead of f-string on line 282.
The cleanup logic is solid, but line 282 uses an f-string in logger.debug(), which should use lazy formatting.
Apply this diff:
if ret_code != 0:
logger.error("Compression failed.")
- logger.debug(f"Docker command failed: {shlex.join(cmd)}")
+ logger.debug("Docker command failed: %s", shlex.join(cmd))🧰 Tools
🪛 Ruff (0.14.1)
278-278: subprocess call: check for execution of untrusted input
(S603)
282-282: Logging statement uses f-string
(G004)
288-288: Do not catch blind exception: Exception
(BLE001)
294-294: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 277 to 295, the logger.debug call on line 282 uses an f-string
which prevents lazy formatting; replace the f-string with logger.debug("Docker
command failed: %s", shlex.join(cmd)) so the message uses logger's lazy
interpolation, leaving the surrounding return and cleanup logic unchanged.
| for url in urls: | ||
| try: | ||
| region_code, bucket, key = parse_s3_url(url) | ||
| parsed_urls.append((region_code, bucket, key)) | ||
| except ValueError as e: | ||
| raise ValueError(f"Failed to parse URL '{url}': {e}") |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Preserve exception context when re-raising.
Lines 243 and 285 re-raise exceptions without chaining, which discards the original traceback.
Apply exception chaining:
except ValueError as e:
- raise ValueError(f"Failed to parse URL '{url}': {e}")
+ raise ValueError(f"Failed to parse URL '{url}': {e}") from eAlso at line 285:
except ValueError as e:
- raise ValueError(f"Failed to parse S3 URL: {e}")
+ raise ValueError(f"Failed to parse S3 URL: {e}") from eBased on learnings
Also applies to: 282-285
🧰 Tools
🪛 Ruff (0.14.1)
243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
243-243: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/native/compress.py
around lines 238-243 and 282-285, the code re-raises ValueError (and another
exception) without exception chaining, which discards the original traceback;
update both re-raises to use "raise <exception> from e" (e.g., raise
ValueError(f\"Failed to parse URL '{url}': {e}\") from e) so the original
exception context/traceback is preserved and propagate the original error as the
__cause__.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Reviewed host compress.py.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
216-216: Mount AWS config directory for profile-based auth.The
necessary_mountslist should includemounts.aws_config_dirwhen present, so containers using S3 profile auth for archive/stream output can access AWS credentials.This issue was previously flagged in past review comments. Please address the past review feedback.
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
120-138: Validate s3-key-prefix file contains exactly one URL.When
--inputs-fromis specified, the validation does not check the file contents to ensure it contains exactly one non-empty URL. The native script will fail late if the file contains 0 or >1 URLs.This issue was previously flagged in past review comments. Please read and validate the file contents as suggested in the prior review.
251-251: Mount AWS config directory for S3 profile authentication.The
necessary_mountslist does not includemounts.aws_config_dir, which is required for profile-based AWS credentials to be accessible inside the container. S3 compression will fail if profile auth is configured.This critical issue was previously flagged in past review comments. Please add the aws_config_dir mount as suggested in the prior review.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py(9 hunks)components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(117-119)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
AwsAuthentication(383-413)CLPConfig(595-797)StorageType(117-119)components/job-orchestration/job_orchestration/scheduler/job_config.py (1)
S3InputConfig(31-42)components/clp-py-utils/clp_py_utils/s3_utils.py (1)
parse_s3_url(208-240)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
StorageEngine(92-94)StorageType(117-119)get_clp_connection_params_and_type(198-219)components/clp-package-utils/clp_package_utils/general.py (10)
dump_container_config(340-357)generate_container_config(227-308)generate_container_name(159-164)generate_container_start_cmd(371-412)get_clp_home(140-156)get_container_config_filename(336-337)JobType(97-101)load_config_file(423-444)validate_and_load_db_credentials_file(475-479)validate_dataset_name(621-647)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress.py
57-57: Unnecessary mode argument
Remove mode argument
(UP015)
248-248: Logging statement uses f-string
(G004)
252-252: Do not catch blind exception: Exception
(BLE001)
257-257: Do not catch blind exception: Exception
(BLE001)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py
9-9: typing.List is deprecated, use list instead
(UP035)
9-9: typing.Tuple is deprecated, use tuple instead
(UP035)
161-161: Avoid specifying long messages outside the exception class
(TRY003)
183-183: Avoid specifying long messages outside the exception class
(TRY003)
195-195: Avoid specifying long messages outside the exception class
(TRY003)
206-206: Unnecessary mode argument
Remove mode argument
(UP015)
219-219: Use X | Y for type annotations
Convert to X | Y
(UP007)
235-235: Avoid specifying long messages outside the exception class
(TRY003)
243-243: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
243-243: Avoid specifying long messages outside the exception class
(TRY003)
248-250: Avoid specifying long messages outside the exception class
(TRY003)
252-254: Avoid specifying long messages outside the exception class
(TRY003)
259-259: Avoid specifying long messages outside the exception class
(TRY003)
269-269: Avoid specifying long messages outside the exception class
(TRY003)
285-285: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
285-285: Avoid specifying long messages outside the exception class
(TRY003)
302-302: Avoid specifying long messages outside the exception class
(TRY003)
377-377: f-string without any placeholders
Remove extraneous f prefix
(F541)
377-377: Logging statement uses f-string
(G004)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
8-8: typing.List is deprecated, use list instead
(UP035)
51-51: Unnecessary mode argument
Remove mode argument
(UP015)
200-200: Boolean positional value in function call
(FBT003)
221-221: Boolean positional value in function call
(FBT003)
223-223: Do not catch blind exception: Exception
(BLE001)
224-224: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
234-235: Logging statement uses f-string
(G004)
278-278: subprocess call: check for execution of untrusted input
(S603)
282-282: Logging statement uses f-string
(G004)
288-288: Do not catch blind exception: Exception
(BLE001)
294-294: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: package-image
| logger.error( | ||
| f"S3 compression requires storage engine {StorageEngine.CLP_S}, " | ||
| f"but configured engine is {storage_engine}." | ||
| ) | ||
| return -1 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Use logger formatting instead of f-strings.
Lines 234-235 use f-strings in logger.error(), which prevents lazy interpolation.
Apply this diff:
else:
logger.error(
- f"S3 compression requires storage engine {StorageEngine.CLP_S}, "
- f"but configured engine is {storage_engine}."
+ "S3 compression requires storage engine `%s`, but configured engine is `%s`.",
+ StorageEngine.CLP_S,
+ storage_engine,
)
return -1🧰 Tools
🪛 Ruff (0.14.1)
234-235: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 233 to 237, the logger.error call uses f-strings which prevents
lazy interpolation; replace the f-string usage with logger-style formatting by
passing a single format string and the values as separate arguments (e.g.
logger.error("S3 compression requires storage engine %s, but configured engine
is %s.", StorageEngine.CLP_S, storage_engine)) and keep the subsequent return -1
unchanged.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Reviewed compress_form_s3.py.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Reviewed native compression.py script.
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (7)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (5)
240-243: Use logger formatting instead of f-strings.Prefer lazy formatting to avoid eager interpolation and align with logging best practices.
Apply this diff:
logger.error( - f"S3 compression requires storage engine {StorageEngine.CLP_S}, but configured engine" - f" is {storage_engine}." + "S3 compression requires storage engine `%s`, but configured engine is `%s`.", + StorageEngine.CLP_S, + storage_engine, )
276-276: Add AWS config directory mount for S3 authentication.Without the
aws_config_dirmount, profile-based AWS credentials won't be available inside the container, causing S3 compression to fail with authentication errors.Apply this diff:
- necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir] + necessary_mounts = [ + mounts.clp_home, + mounts.data_dir, + mounts.logs_dir, + mounts.aws_config_dir, # optional; skipped if None + ]
308-308: Use logger formatting instead of f-string.Prefer lazy formatting to avoid eager interpolation and align with logging best practices.
Apply this diff:
- logger.debug(f"Docker command failed: {shlex.join(cmd)}") + logger.debug("Docker command failed: %s", shlex.join(cmd))
140-151: Validate s3-key-prefix file contains exactly one URL.When
--inputs-fromis used withs3-key-prefix, the file contents should be validated to ensure exactly one non-empty URL is provided. Currently, validation only occurs when URLs are passed on the command line.Apply this diff after line 150:
if len(parsed_args.inputs) > 0 and parsed_args.inputs_from is not None: args_parser.error("URL cannot be specified on the command line AND through a file.") + + if parsed_args.inputs_from is not None: + with open(parsed_args.inputs_from) as f: + non_empty_lines = [line.strip() for line in f if line.strip()] + if len(non_empty_lines) == 0: + args_parser.error("No URL specified in --inputs-from.") + if len(non_empty_lines) != 1: + args_parser.error( + f"{S3_KEY_PREFIX_COMPRESSION} accepts exactly one URL, got {len(non_empty_lines)} in file." + )
59-59: Optional: Remove unnecessary mode argument.The explicit mode
"r"is the default foropen()and can be omitted.Apply this diff:
- with open(parsed_args.inputs_from, "r") as input_file: + with open(parsed_args.inputs_from) as input_file:components/clp-package-utils/clp_package_utils/scripts/compress.py (2)
254-254: Use logger formatting instead of f-string.Prefer lazy formatting to avoid eager interpolation and align with logging best practices.
Apply this diff:
- logger.debug(f"Docker command failed: {shlex.join(cmd)}") + logger.debug("Docker command failed: %s", shlex.join(cmd))
220-220: Add AWS config directory mount for S3 output destinations.If
archive_outputorstream_outputuses S3 with profile-based authentication, the container needs access to AWS credentials. Without theaws_config_dirmount, S3 writes will fail.Apply this diff:
- necessary_mounts = [mounts.clp_home, mounts.data_dir, mounts.logs_dir, mounts.input_logs_dir] + necessary_mounts = [ + mounts.clp_home, + mounts.data_dir, + mounts.logs_dir, + mounts.input_logs_dir, + mounts.aws_config_dir, # optional; skipped if None + ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
components/clp-package-utils/clp_package_utils/general.py(1 hunks)components/clp-package-utils/clp_package_utils/scripts/compress.py(9 hunks)components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py (2)
components/clp-py-utils/clp_py_utils/clp_config.py (3)
StorageEngine(94-96)StorageType(119-121)get_clp_connection_params_and_type(200-221)components/clp-package-utils/clp_package_utils/general.py (10)
dump_container_config(344-361)generate_container_config(231-312)generate_container_name(163-168)generate_container_start_cmd(375-416)get_clp_home(144-160)get_container_config_filename(340-341)JobType(101-105)load_config_file(427-449)validate_and_load_db_credentials_file(480-484)validate_dataset_name(634-660)
components/clp-package-utils/clp_package_utils/scripts/compress.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageType(119-121)
🪛 Ruff (0.14.1)
components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
59-59: Unnecessary mode argument
Remove mode argument
(UP015)
221-221: Boolean positional value in function call
(FBT003)
241-242: Logging statement uses f-string
(G004)
250-250: Boolean positional value in function call
(FBT003)
252-252: Do not catch blind exception: Exception
(BLE001)
253-253: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
304-304: subprocess call: check for execution of untrusted input
(S603)
308-308: Logging statement uses f-string
(G004)
components/clp-package-utils/clp_package_utils/scripts/compress.py
59-59: Unnecessary mode argument
Remove mode argument
(UP015)
254-254: Logging statement uses f-string
(G004)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: package-image
🔇 Additional comments (1)
components/clp-package-utils/clp_package_utils/general.py (1)
52-53: LGTM! Clean constant declarations.The S3 compression mode constants are well-named and consistent with the naming conventions used elsewhere in the codebase.
| except Exception as e: | ||
| logger.error(e) | ||
| return -1 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Use logger.exception for better error context.
When logging an exception in an exception handler, use logger.exception() to automatically include the traceback.
Apply this diff:
except Exception as e:
- logger.error(e)
+ logger.exception("Failed to validate dataset name.")
return -1🧰 Tools
🪛 Ruff (0.14.1)
252-252: Do not catch blind exception: Exception
(BLE001)
253-253: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
around lines 252 to 254, the exception handler currently calls logger.error(e)
which doesn't include the traceback; replace that call with
logger.exception("Failed to compress from S3") (or similar descriptive message)
so the traceback is logged automatically, and keep the subsequent return -1
unchanged.
| return len(parsed_args.paths) != 0 | ||
|
|
||
| no_path_found = True | ||
| with open(host_logs_list_path, "r") as host_logs_list_file: |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Optional: Remove unnecessary mode argument.
The explicit mode "r" is the default for open() and can be omitted for cleaner code.
Apply this diff:
- with open(host_logs_list_path, "r") as host_logs_list_file:
+ with open(host_logs_list_path) as host_logs_list_file:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with open(host_logs_list_path, "r") as host_logs_list_file: | |
| with open(host_logs_list_path) as host_logs_list_file: |
🧰 Tools
🪛 Ruff (0.14.1)
59-59: Unnecessary mode argument
Remove mode argument
(UP015)
🤖 Prompt for AI Agents
In components/clp-package-utils/clp_package_utils/scripts/compress.py around
line 59, the open() call explicitly passes the default mode "r"; remove the
unnecessary mode argument so the call becomes open(host_logs_list_path) to keep
the code cleaner and rely on the default read mode.
compress-from-s3 Python and shell scripts for S3-object or S3-key-prefix compression; Restrict compress.py to accept local file paths only.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Approved with the following validations performed (also updated PR description):
compress-from-s3.py:
From cli directly:
- Single object URL → success
- Multiple object URLs with a shared prefix → success
- Multiple object URLs without a shared prefix → fail
- Single prefix URL → success
- Multiple prefix URLs → fail
- Single prefix URL but it’s an object → success
From input file:
- Single object URL → success
- Multiple object URLs with a shared prefix → success
- Multiple object URLs without a shared prefix → fail
- Single prefix URL → success
- Multiple prefix URLs → fail
- Multiple prefix URLs → fail
- Single prefix URL but it’s an object → success
- Empty input file (for both object compression and key-prefix compression) → fail
Additional cases:
- From both positional and input file → fail
- Given both an input file and positional args → fail
- Given both
s3-objectands3-key-prefix→ fail (treated as previous)
compress.py (for both clp and clp-s storage engine):
From cli directly:
- Single path → success
- Multiple path → success
From input file:
- Single path → success
- Multiple path → success
- No path → fail
Additional cases:
- From both positional and input file → fail
Changed PR title to:
feat(clp-package)!: Add `compress-from-s3` Python and shell scripts for S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only.
compress-from-s3 Python and shell scripts for S3-object or S3-key-prefix compression; Restrict compress.py to accept local file paths only.compress-from-s3 Python and shell scripts for S3-object and S3-key-prefix compression; Restrict compress.py to accept local file paths only.
…or S3-object and S3-key-prefix compression; Restrict `compress.py` to accept local file paths only. (y-scope#1476) Co-authored-by: LinZhihao-723 <zh.lin@mail.utoronto.ca>
Description
This PR adds a new
compress-from-3.shCLI script that enables compressing logs directly from S3 object storage. The script provides two modes of operation:Changes:
compress-from-s3script to handle S3 log ingestion separate from filesystem ingestion.compressscript to ingest exclusively from the filesystem.nativecompress.pyscript to handle new S3 ingestion options.Checklist
breaking change.
Validation performed
logs_inputis configured tofs:compress-from-s3.shscript fails to run.compress.shingests from the filesystem successfully.logs_inputis configured tos3:compress.shscript fails to run.compress-from-s3.sh s3-objectsuccessfully ingests one or more S3 URLs (that share a bucket and prefix).compress-from-s3.sh s3-key-prefixsuccessfully ingests all objects under the key prefix in an S3 URL.Latest Validation
compress-from-s3.py:From cli directly:
From input file:
Additional cases:
s3-objectands3-key-prefix→ fail (treated as previous)compress.py(for bothclpandclp-sstorage engine):From cli directly:
From input file:
Additional cases:
Summary by CodeRabbit
New Features
Refactor
Bug Fixes