feat(clp-package): Add support for clp-json to ingest logs from S3.#651
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
…ss/fs_compression_task.py
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
kirkrodrigues
left a comment
There was a problem hiding this comment.
Minor comments.
For the PR title, how about:
feat(clp-package): Add support for clp-json to ingest logs from S3.
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
1-14: Organize imports by standard library, third-party, and local packages.Consider reorganizing the imports into three groups separated by a blank line:
- Standard library imports (
configparser,re,pathlib,typing)- Third-party imports (
boto3,botocore)- Local imports (
job_orchestration,clp_py_utils)import configparser import re from pathlib import Path from typing import List, Tuple + import boto3 from botocore.config import Config + from job_orchestration.scheduler.job_config import S3InputConfig from clp_py_utils.clp_config import S3Config from clp_py_utils.compression import FileMetadata
20-27: Enhance docstring with specific return value types.The docstring should specify the exact types being returned and clarify that the values are strings.
""" Parses the `aws_access_key_id` and `aws_secret_access_key` of `user` from the given credentials_file_path. :param credentials_file_path: :param user: - :return: A tuple of (aws_access_key_id, aws_secret_access_key) + :return: A tuple of (str, str) containing (aws_access_key_id, aws_secret_access_key) :raises: ValueError if the file doesn't exist, or doesn't contain valid aws credentials. """
74-75: Improve error message with URL format examples.The error message should include examples of valid URL formats to help users understand the expected format.
if match is None: - raise ValueError(f"Unsupported URL format: {s3_url}") + raise ValueError( + f"Unsupported URL format: {s3_url}. Expected formats:\n" + "- Host-style: https://<bucket>.s3.<region>.amazonaws.com/<key>\n" + "- Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>" + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
components/clp-package-utils/clp_package_utils/scripts/compress.py(4 hunks)components/clp-package-utils/clp_package_utils/scripts/native/compress.py(4 hunks)components/clp-py-utils/clp_py_utils/s3_utils.py(2 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(11 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py
- components/clp-package-utils/clp_package_utils/scripts/compress.py
- components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)
88-98: LGTM! Input validation is thorough.The function properly validates all required parameters before generating the URL.
101-137: LGTM! Pagination and metadata handling are well implemented.The function correctly handles pagination and object metadata retrieval, with appropriate filtering for directory-like paths.
Line range hint
140-176: LGTM! Error handling and retry configuration are well implemented.The function includes appropriate validation, retry configuration, and clear error messages.
…-scope#651) Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Description
This PR adds s3 ingestion support for clp-s package. Currently, we support compressing with one s3_url(with prefix matching) at a time. User can either specify the credentials through commandline, or via a credentials file. If no credentials are provided, the package will attempt to use the associated IAM if there is any.
The PR contains the following changes:
The design decisions involved in this PR is documented in this doc
Due to time limitation, we leave the following items as todo for future:
If users provide invalid S3 information, such as incorrect credentials or a non-existent bucket, the issue is only detected when the package attempts to schedule the job. Additionally, if the provided S3 credentials lack sufficient privileges (e.g., list_object is allowed, but get_object is not), the package will return unhelpful error messages to the web UI ("see error log.txt blabla"), making it confusing for user.
Ideally, invalid S3 configurations should be detected at package startup to provide immediate and actionable feedback.
Validation performed
Manually verified that both fs compression and s3 compression works from the commandline.
Validations done after code review:
Launched compression successfully with
Launched compression with the following configuration and confirmed that errors is properly reported:
Also sanity checked the error for writing to S3
Summary by CodeRabbit
Release Notes
New Features
Improvements
Infrastructure
Bug Fixes