Skip to content

feat(clp-package): Add support for clp-json to ingest logs from S3.#651

Merged
haiqi96 merged 79 commits into
y-scope:mainfrom
haiqi96:s3_scheduler
Jan 16, 2025
Merged

feat(clp-package): Add support for clp-json to ingest logs from S3.#651
haiqi96 merged 79 commits into
y-scope:mainfrom
haiqi96:s3_scheduler

Conversation

@haiqi96

@haiqi96 haiqi96 commented Jan 3, 2025

Copy link
Copy Markdown
Contributor

Description

This PR adds s3 ingestion support for clp-s package. Currently, we support compressing with one s3_url(with prefix matching) at a time. User can either specify the credentials through commandline, or via a credentials file. If no credentials are provided, the package will attempt to use the associated IAM if there is any.

The PR contains the following changes:

  1. Introduces S3InputConfig and FsInputConfig for different ingestion path. Updated compression execution script to handle different input.
  2. Updated compression scheduler to handle scheduling for s3 objects. Note to avoid excessive change, we let s3 objects reuse the existing metadata class for FilePath.
  3. Updated compress.sh interface to allow specifying if the input is from local filesystem (fs) or s3(s3). Simplified native/compress.sh scripts' interface to always accept a target_list.
  4. Update execution container to install necessary dependency for curl lib

The design decisions involved in this PR is documented in this doc

Due to time limitation, we leave the following items as todo for future:

If users provide invalid S3 information, such as incorrect credentials or a non-existent bucket, the issue is only detected when the package attempts to schedule the job. Additionally, if the provided S3 credentials lack sufficient privileges (e.g., list_object is allowed, but get_object is not), the package will return unhelpful error messages to the web UI ("see error log.txt blabla"), making it confusing for user.

Ideally, invalid S3 configurations should be detected at package startup to provide immediate and actionable feedback.

Validation performed

Manually verified that both fs compression and s3 compression works from the commandline.

Validations done after code review:
Launched compression successfully with

  1. FS input, with --path-list that contains 2 files.
  2. FS input, with PATH positional argument
  3. Valid S3 URL with aws_key and id specified through argument
  4. Valid S3 URL with aws_key and id specified through credential file

Launched compression with the following configuration and confirmed that errors is properly reported:

  1. Valid S3 URL with invalid credential file.
  2. Invalid S3 URL that doesn't resolve to any objects
  3. Invalid S3 URL with unexisting bucket
  4. S3 URL with bucket that can't be accessed.
  5. S3 URL with no permission.

Also sanity checked the error for writing to S3

  1. Setting the destination to a path without write permission. verified the error message makes sense.
  2. Setting the archive_output with invalid s3 credentials. verified the error message makes sense.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for compressing files from both filesystem and Amazon S3 sources.
    • Introduced new functions for generating log lists for filesystem and S3 inputs.
    • Enhanced configuration management with new input types for compression tasks.
  • Improvements

    • Refined input validation and error handling for compression processes.
    • Improved task routing and execution for compression jobs.
    • Streamlined command generation for S3 operations and enhanced metadata handling.
  • Infrastructure

    • Updated Docker base images with additional certificate packages.
  • Bug Fixes

    • Resolved issues with path resolution and input type handling.

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments.

For the PR title, how about:

feat(clp-package): Add support for clp-json to ingest logs from S3.

Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/native/compress.py Outdated
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py Outdated
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py Outdated
Comment thread components/clp-py-utils/clp_py_utils/s3_utils.py Outdated
Comment thread components/clp-package-utils/clp_package_utils/scripts/compress.py Outdated
Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

1-14: Organize imports by standard library, third-party, and local packages.

Consider reorganizing the imports into three groups separated by a blank line:

  1. Standard library imports (configparser, re, pathlib, typing)
  2. Third-party imports (boto3, botocore)
  3. Local imports (job_orchestration, clp_py_utils)
 import configparser
 import re
 from pathlib import Path
 from typing import List, Tuple
+
 import boto3
 from botocore.config import Config
+
 from job_orchestration.scheduler.job_config import S3InputConfig
 from clp_py_utils.clp_config import S3Config
 from clp_py_utils.compression import FileMetadata

20-27: Enhance docstring with specific return value types.

The docstring should specify the exact types being returned and clarify that the values are strings.

     """
     Parses the `aws_access_key_id` and `aws_secret_access_key` of `user` from the given
     credentials_file_path.
     :param credentials_file_path:
     :param user:
-    :return: A tuple of (aws_access_key_id, aws_secret_access_key)
+    :return: A tuple of (str, str) containing (aws_access_key_id, aws_secret_access_key)
     :raises: ValueError if the file doesn't exist, or doesn't contain valid aws credentials.
     """

74-75: Improve error message with URL format examples.

The error message should include examples of valid URL formats to help users understand the expected format.

     if match is None:
-        raise ValueError(f"Unsupported URL format: {s3_url}")
+        raise ValueError(
+            f"Unsupported URL format: {s3_url}. Expected formats:\n"
+            "- Host-style: https://<bucket>.s3.<region>.amazonaws.com/<key>\n"
+            "- Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>"
+        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 996692e and 351f239.

📒 Files selected for processing (4)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py (4 hunks)
  • components/clp-py-utils/clp_py_utils/s3_utils.py (2 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (11 hunks)
👮 Files not reviewed due to content moderation or server errors (3)
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/s3_utils.py (3)

88-98: LGTM! Input validation is thorough.

The function properly validates all required parameters before generating the URL.


101-137: LGTM! Pagination and metadata handling are well implemented.

The function correctly handles pagination and object metadata retrieval, with appropriate filtering for directory-like paths.


Line range hint 140-176: LGTM! Error handling and retry configuration are well implemented.

The function includes appropriate validation, retry configuration, and clear error messages.

@haiqi96 haiqi96 changed the title feat(clp-package): Add support for clp-s s3 ingestion feat(clp-package): Add support for clp-json to ingest logs from S3. Jan 16, 2025
@haiqi96 haiqi96 requested a review from kirkrodrigues January 16, 2025 17:43
@haiqi96 haiqi96 merged commit fae7fcc into y-scope:main Jan 16, 2025
@haiqi96 haiqi96 deleted the s3_scheduler branch June 16, 2025 20:05
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…-scope#651)

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants